Genetics is universally recognized as a core aspect of biological and scientific literacy. Beyond genetics' own role as a major unifying topic within the biological sciences, understanding genetics is essential for understanding other integral ideas such as evolution and development. An understanding of genetics also underlies public decision making about modern advances in the health sciences and biotechnology, as well as broader socio-scientific issues. Consequently, educators have attempted to measure student and teacher understanding of this domain. Using Rasch modeling, a superior but underutilized framework for instrument evaluation, this dissertation explored psychometric, cognitive, and demographic aspects of educational measurement in the field of genetics education in order to generate evidence-based examples of how instruments can be more carefully developed and robustly evaluated. The first study (Chapter 3) sought to expand the sources of evidence supporting validity and reliability inferences produced by a relatively new concept inventory (the Genetic Drift Inventory [GeDI]) designed for use in diagnosing undergraduate students' conceptual understanding of genetic drift. Concept Inventories (CIs) are commonly used tools for assessing student understanding of normative (scientific) and non-normative (naive) ideas, yet the body of empirical evidence supporting the inferences drawn from CI scores is often limited in scope and remains deeply rooted in Classical Test Theory (CTT) despite the availability of more robust Item Response Theory (IRT) and Rasch frameworks. Specifically, this study focused on: (1) GeDI instrument and item properties as revealed by Rasch modeling, (2) item order effects on response patterns, and (3) generalization to a new geographic sample. A sample of 336 advanced undergraduate biology majors completed one of four randomly assigned and equivalent versions of the GeDI that differed in the presentation order of the GeDI item suites.
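As background for the analyses summarized below: the dichotomous Rasch model estimates a single ability parameter per person and a single difficulty parameter per item, and predicts the probability of a correct response from their difference. The following minimal sketch (the function name and parameter values are illustrative, not drawn from the dissertation) shows the model's core relationship:

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model,
    where theta is person ability and b is item difficulty (both in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the model predicts a 50% chance of success.
p_matched = rasch_probability(0.0, 0.0)   # 0.5

# A more able person has a higher predicted chance on the same item.
p_able = rasch_probability(2.0, 0.0)
```

Because persons and items are placed on this common logit scale, an instrument is "well matched" to a sample when item difficulties span the range of person abilities, which is the sense in which the GeDI results below are interpreted.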
Rasch analysis indicated the GeDI was unidimensional, with good fit to the Rasch model. Items had high reliability and were well matched to the ability of the sample. Person reliability, however, was low. Rotating the GeDI's vignette-based item suites had no significant impact on overall scores, suggesting each vignette functioned independently. Scores from this new sample from the Northeast United States were comparable to those from other geographic regions and provide evidence in support of score generalizability. Suggestions for improvement include: (1) incorporation of additional items to differentiate high-ability persons and improve person reliability, and (2) re-examination of items with redundant or low difficulty levels. These results expand the range and quality of evidence in support of validity claims and illustrate changes that are likely to improve the quality of the GeDI (and other) evolution education instruments. The second study (Chapter 4) sought to determine how situational features impact inferences about participants' understanding of Mendelian genetics. Understanding how the situational features of assessment tasks impact reasoning is important for many educational pursuits, notably the selection of curricular examples to illustrate phenomena, the design of formative and summative assessment items, and the determination of whether instruction has fostered the development of abstract schemas divorced from particular instances. To test for context effects, an experimental research design was employed to measure differences in item difficulty among items varying only in situational features (e.g., plant, animal, human, fictitious) across five common genetics problem types. A multi-matrix test design was employed, and item packets were randomly distributed to a sample of undergraduate biology majors (n=444). Rasch analyses of participant scores produced good item fit, person reliability, and item reliability.
Surprisingly, no significant differences in performance occurred among the animal, plant, and human item contexts, or between the fictitious and "real" item contexts. Notably, incomplete dominance problems proved to be more difficult than dominant-recessive problems, and problems featuring homozygous parents were more difficult than those featuring heterozygous parents. Tests for differences in performance between genders, among ethnic groups, and by prior biology coursework revealed that none of these factors had a meaningful impact on performance or context effects. Thus some, but not all, types of genetics problem solving or item formats are impacted by situational features. Overall, substantial evidence was generated about how current knowledge in the field of genetics education is measured and how measurement in this domain may be improved. The studies included herein exemplify some ways in which new and existing instruments can be examined to amass robust evidence for the quality of inferences generated by an instrument. Only with rigorously evaluated instruments can the educational community be confident that inferences about student learning are accurate and that consequent decisions are evidence-based. [The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone: 1-800-521-0600. Web page: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.]