377 results for "Sireci, Stephen G."
Search Results
2. Exploring Relationships among Test Takers' Behaviors and Performance Using Response Process Data
- Author
-
Araneda, Sergio, Lee, Dukjae, Lewis, Jennifer, Sireci, Stephen G., Moon, Jung Aa, Lehman, Blair, Arslan, Burcu, and Keehner, Madeleine
- Abstract
Students exhibit many behaviors when responding to items on a computer-based test, but only some of these behaviors are relevant to estimating their proficiencies. In this study, we analyzed data from computer-based math achievement tests administered to elementary school students in grades 3 (ages 8-9) and 4 (ages 9-10). We investigated students' response process data, including the total amount of time they spent on an item, the amount of time they took to first respond to an item, the number of times they "visited" an item and the number of times they changed their responses to items, in order to explore whether these behaviors were related to overall proficiency and whether they differed across item formats and grades. The results indicated a non-linear relationship between the mean number of actions and proficiency, as well as some notable interactions between correctly answering an item, item format, response time, and response time latency. Implications for test construction and future analyses in this area are discussed.
- Published
- 2022
3. Embedded Accommodation and Accessibility Support Usage on a Computer-Based Statewide Achievement Test
- Author
-
Lee, Dukjae, Buzick, Heather, Sireci, Stephen G., Lee, Mina, and Laitusis, Cara
- Abstract
Although there has been substantial research on the effects of test accommodations on students' performance, there has been far less research on students' use of embedded accommodations and other accessibility supports at the item and whole test level in operational testing programs. Data on embedded accessibility supports from digital logs generated by computer-based assessment platforms are complex, and so decisions need to be made to make sense of the data with respect to appropriate and effective accommodation use. In this study, we explored different ways of defining students' use of accessibility supports and how to best summarize such use for accountability and other purposes. Examples of descriptive statistical indices and data visualizations are presented using mathematics and English language arts test data from a large statewide assessment. Such data are important for accommodations monitoring required by the United States Department of Education and for identifying schools and districts that may be over- or under-using these accommodations and supports.
- Published
- 2021
4. Targeted Linguistic Simplification of Science Test Items for English Learners
- Author
-
Noble, Tracy, Sireci, Stephen G., Wells, Craig S., Kachchaf, Rachel R., Rosebery, Ann S., and Wang, Yang Caroline
- Abstract
In this experimental study, 20 multiple-choice test items from the Massachusetts Grade 5 science test were linguistically simplified, and original and simplified test items were administered to 310 English learners (ELs) and 1,580 non-ELs in four Massachusetts school districts. This study tested the hypothesis that specific linguistic features of test items contributed to construct-irrelevant variance in science test scores of ELs. Simplifications targeted specific linguistic features, to identify those features with the largest impacts on ELs' test performance. Of all the linguistic simplifications used in this study, adding visual representations to answer choices had the largest positive effect on ELs' performance. These findings have significant implications for the design of multiple-choice test items that are fair and valid for ELs.
- Published
- 2020
- Full Text
- View/download PDF
5. Deriving Decisions from Disrupted Data
- Author
-
Sireci, Stephen G. and Suarez-Alvarez, Javier
- Abstract
The COVID-19 pandemic negatively affected the quality of data from educational testing programs. These data were previously used for many important purposes ranging from placing students in instructional programs to school accountability. In this article, we draw from the research design literature to point out the limitations inherent in "disrupted" educational testing data, suggest questions and criteria to be considered in evaluating the use of such data for decision making, and indicate how such data may be valid or invalid for specific purposes. Six criteria are proposed for evaluating the degree to which educational testing data are valid for specific decisions. These criteria suggest data from COVID-disrupted school years are not likely to be valid for accountability purposes, but may be valuable for making decisions at the individual student level. Thus, we encourage researchers and policy makers to focus on how decisions derived from such disrupted data affect children.
- Published
- 2022
- Full Text
- View/download PDF
6. Setting and Validating Multiple Standards on a Multistage-Adaptive Test
- Author
-
Lewis, Jennifer, Lim, Hwanggyu, Padellaro, Frank, Sireci, Stephen G., and Zenisky, April L.
- Abstract
Setting cut scores on multistage-adaptive tests (MSTs) is difficult, particularly when the test spans several grade levels and the selection of items from MST panels must reflect the operational test specifications. In this study, we describe, illustrate, and evaluate three methods for mapping panelists' Angoff ratings into cut scores on the scale underlying an MST. The results suggest the test characteristic function and item characteristic curve methods performed similarly, but the method based on dichotomizing panelists' ratings at a response probability of 0.67 was unacceptable. The study featured a rating booklet design that allowed us to systematically evaluate the validity of the Angoff ratings across test levels, contributing internal validity evidence for the cut scores; the cut scores were also evaluated using procedural and external validity evidence. The implications of the results for future standard setting studies and research in this area are discussed.
- Published
- 2022
- Full Text
- View/download PDF
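For readers unfamiliar with the test characteristic function approach named in the abstract above, a minimal sketch follows. The 3PL item parameters and Angoff ratings are hypothetical, and the code illustrates only the general idea of mapping a panel's summed ratings onto the theta scale through the test characteristic curve, not the study's operational MST procedure.

```python
# Sketch: mapping mean Angoff ratings onto an IRT theta scale via the test
# characteristic curve (TCC). Hypothetical 3PL item parameters and ratings.
import numpy as np
from scipy.optimize import brentq

# Hypothetical item parameters (a, b, c) for a 10-item test level.
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9, 1.3, 1.1, 0.7, 1.4, 1.0])
b = np.array([-1.0, -0.5, 0.0, 0.2, 0.5, 0.8, 1.0, 1.2, 1.5, 2.0])
c = np.array([0.20, 0.15, 0.25, 0.20, 0.20, 0.15, 0.25, 0.20, 0.15, 0.20])

def tcc(theta):
    """Expected raw score at theta: sum of 3PL item characteristic curves (D = 1.7)."""
    p = c + (1 - c) / (1 + np.exp(-1.7 * a * (theta - b)))
    return p.sum()

# Hypothetical mean Angoff ratings: judged probability that a minimally
# proficient examinee answers each item correctly.
angoff = np.array([0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35])
target_raw = angoff.sum()   # panel-implied expected raw score

# The cut score is the theta at which the TCC equals the panel's target score.
theta_cut = brentq(lambda t: tcc(t) - target_raw, -4.0, 4.0)
print(f"Panel target raw score: {target_raw:.2f}")
print(f"TCC-based theta cut score: {theta_cut:.3f}")
```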
7. Language Matters: Teacher and Parent Perceptions of Achievement Labels from Educational Tests
- Author
-
O'Donnell, Francis and Sireci, Stephen G.
- Abstract
Since the standards-based assessment practices required by the No Child Left Behind legislation, almost all students in the United States are "labeled" according to their performance on educational achievement tests. In spite of their widespread use in reporting test results, research on how achievement level labels are perceived by teachers, parents, and students is minimal. In this study, we surveyed teachers (N = 51) and parents (N = 50) regarding their perceptions of 73 achievement labels (e.g., "inadequate," "level 2," "proficient") used in statewide testing programs. These teachers and parents also sorted the labels according to their similarity. Using multidimensional scaling, we found labels used to denote the same level of performance (e.g., "basic" and "below proficient") were perceived to differ in important ways, including in their tone and how much achievement they convey. Additionally, some labels were perceived as more encouraging or clear than others. Teachers' and parents' perceptions were similar, with a few exceptions. The results have important implications for reporting results that encourage, rather than discourage, student learning.
- Published
- 2022
- Full Text
- View/download PDF
8. Linguistic Distance and Translation Differential Item Functioning on Trends in International Mathematics and Science Study Mathematics Assessment Items
- Author
-
Gökçe, Semirhan, Berberoglu, Giray, Wells, Craig S., and Sireci, Stephen G.
- Abstract
The 2015 Trends in International Mathematics and Science Study (TIMSS) involved 57 countries and 43 different languages to assess students' achievement in mathematics and science. The purpose of this study is to evaluate whether items and test scores are affected as the differences between language families and cultures increase. Using differential item functioning (DIF) procedures, we compared the consistency of students' performance across three combinations of languages and countries: (1) same language but different countries; (2) same countries but different languages; and (3) different languages and different countries. The analyses consisted of the detection of the number of DIF items for all paired comparisons within each condition, the direction of DIF, the magnitude of DIF, and the differences between test characteristic curves. As the countries were more distant with respect to cultures and language families, the presence of DIF increased. The magnitude of DIF was greatest when both language and country differed, and smallest when the languages were the same but the countries were different. Results suggest that when TIMSS results are compared across countries, the language- and country-specific differences, which could reflect cultural, curriculum, or other differences, should be considered.
- Published
- 2021
- Full Text
- View/download PDF
9. Linking TIMSS and NAEP Assessments to Evaluate International Trends in Achievement
- Author
-
Lim, Hwanggyu and Sireci, Stephen G.
- Abstract
The Trends in International Mathematics and Science Study (TIMSS) makes it possible to compare the performance of students in the US in Mathematics and Science to the performance of students in other countries. TIMSS uses four international benchmarks for describing student achievement: Low, Intermediate, High, and Advanced. In this study, we linked the eighth-grade Math TIMSS and NAEP scales using equipercentile equating to (a) help better interpret U.S. eighth-grade students' performance on TIMSS, and (b) investigate the progress of eighth-grade U.S. students over time relative to the progress of students in other countries. Results indicated that, relative to other countries, U.S. eighth-grade students improved with respect to the "At or Above Basic" NAEP achievement level, but that other countries saw larger improvements in the higher achievement-level categories. This finding may reflect the emphasis of No Child Left Behind on raising lower achievement to "proficient." However, with respect to "Advanced" mathematics achievement, eighth-grade U.S. students showed less improvement than students in other countries.
- Published
- 2017
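A minimal sketch of equipercentile linking, the method named in the entry above, is given below. The score distributions are simulated stand-ins, not the actual TIMSS or NAEP data, and the smoothing steps used in operational linking are omitted.

```python
# Sketch: equipercentile linking of scores on test X to the scale of test Y,
# using hypothetical score samples (not the actual TIMSS or NAEP data).
import numpy as np

rng = np.random.default_rng(0)
x_scores = rng.normal(500, 80, size=5000)   # hypothetical TIMSS-like scores
y_scores = rng.normal(280, 35, size=5000)   # hypothetical NAEP-like scores

def equipercentile_link(x, x_dist, y_dist):
    """Map score x to the y-scale score with the same percentile rank."""
    pr = np.mean(x_dist <= x)        # percentile rank of x in the X distribution
    return np.quantile(y_dist, pr)   # score with the same rank on the Y scale

for x in (420, 500, 580):
    print(f"X = {x} -> linked Y = {equipercentile_link(x, x_scores, y_scores):.1f}")
```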
10. NCME Presidential Address 2020: Valuing Educational Measurement
- Author
-
Sireci, Stephen G.
- Abstract
The community of educational measurement researchers and practitioners has made many positive contributions to education, but has also become complacent and lost the public trust. In this article, reasons for the lack of public trust in educational testing are described, and core values for educational measurement are proposed. Reasons for distrust of educational measurement include hypocritical practices that conflict with our professional standards, a biased and selected presentation of the history of testing, and inattention to social problems associated with educational measurement. The five core values proposed to help educational measurement serve education are: (1) everyone is capable of learning; (2) there are no differences in the capacity to learn across groups defined by race, ethnicity, or sex; (3) all educational tests are fallible to some degree; (4) educational tests can provide valuable information to improve student learning and certify competence; and (5) all uses of educational test scores must be sufficiently justified by validity evidence. The importance of these core values for improving the science and practice of educational measurement to benefit society is discussed.
- Published
- 2021
- Full Text
- View/download PDF
11. College Admission Tests and Social Responsibility
- Author
-
Koljatic, Mladen, Silva, Mónica, and Sireci, Stephen G.
- Abstract
In this article we address the mounting criticism and rejection of standardized tests used in the selection of students for college or university education. Admission tests are being increasingly demonized in many parts of the world and many colleges and universities are dropping tests for selection purposes, claiming the tests are detrimental to fair selection. The testing industry is at the center of this criticism and is accused of maintaining, and even facilitating, the social ills associated with admissions testing, much like iconic business corporations were accused of supporting unfair labor practices in the 1990s. The response of some business corporations to those criticisms was to embrace corporate social responsibility and increase transparency and accountability in their operations. Unfortunately, such acceptance of responsibility and increased transparency have not emerged in the testing industry. We believe the legitimacy of admission tests will continue to be challenged until the testing industry adopts a new way of conducting their business to regain the goodwill of relevant stakeholders in society that so far have been largely ignored.
- Published
- 2021
- Full Text
- View/download PDF
12. Evaluating Panelists' Understanding of Standard Setting Data
- Author
-
Baron, Patricia, Sireci, Stephen G., and Slater, Sharon C.
- Abstract
Since the No Child Left Behind Act (No Child Left Behind [NCLB], 2001) was enacted, the Bookmark method has been used in many state standard setting studies (Karantonis and Sireci; Zieky, Perie, and Livingston). The purpose of the current study is to evaluate the criticism that when panelists are presented with data during the Bookmark standard setting process, these data are often misunderstood. We collected survey responses from eight panels of teachers who worked on an alternate assessment standard setting workshop. We found that although many panelists understood these data, others misunderstood them. For example, when panelists reviewed panel judgment statistics, some extrapolated beyond what these data represent. Our results include themes describing the types of misconceptions we observed, and the need for training and evaluation related to understanding and use of data used in standard setting. We share some suggestions for consideration when implementing the Bookmark method.
- Published
- 2021
- Full Text
- View/download PDF
13. Evolving Educational Testing to Meet Students’ Needs: Design‐in‐Real‐Time Assessment.
- Author
-
Sireci, Stephen G., Suárez‐Álvarez, Javier, Zenisky, April L., and Oliveri, Maria Elena
- Subjects
-
TECHNOLOGY assessment, AUTOMATED teller machines, INDIVIDUAL needs, COMPUTER adaptive testing, ADAPTIVE testing
- Abstract
The goal in personalized assessment is to best fit the needs of each individual test taker, given the assessment purposes. Design-In-Real-Time (DIRTy) assessment reflects the progressive evolution in testing from a single test, to an adaptive test, to an adaptive assessment system. In this article, we lay the foundation for DIRTy assessment and illustrate how it meets the complex needs of each individual learner. The assessment framework incorporates culturally responsive assessment principles, thus making it innovative with respect to both technology and equity. Key aspects are (a) assessment building blocks called "assessment task modules" (ATMs) linked to multiple content standards and skill domains, (b) gathering information on test takers' characteristics and preferences and using this information to improve their testing experience, and (c) selecting, modifying, and compiling ATMs to create a personalized test that best meets the needs of the testing purpose and individual test taker.
- Published
- 2024
- Full Text
- View/download PDF
14. Measurement invariance across immigrant and nonimmigrant populations on PISA non-cognitive scales.
- Author
-
Casas, Maritza and Sireci, Stephen G.
- Subjects
-
CONFIRMATORY factor analysis, IMMIGRANT students, BULLYING, IMMIGRANTS, SCHOOL bullying
- Abstract
In this study, we take a critical look at the degree to which the measurement of bullying and sense of belonging at school is invariant across groups of students defined by immigrant status. Our study focuses on the invariance of these constructs as measured on a recent PISA administration and includes a discussion of two statistical methods for assessing measurement invariance: multiple-group confirmatory factor analysis (MGCFA) and alignment optimization. We discuss and illustrate how the alignment optimization method is optimal for handling data from large-scale international assessments like PISA. An acceptable degree of noninvariance was achieved for the two scales.
- Published
- 2024
- Full Text
- View/download PDF
15. Commentary: What Is Truly Foundational?
- Author
-
Crespo Cruz, Eduardo J., Immanuel, Aria, Keller, Lisa A., Ketan, McIntee, Kimberly, Serrano, Fernando José Mena, Sireci, Stephen G., Smith, Nate, Suárez‐Álvarez, Javier, Wells, Craig S., Woodland, Rebecca, and Zenisky, April L.
- Subjects
EDUCATIONAL tests & measurements, ARTIFICIAL intelligence, MACHINE learning, TASK forces, UNIVERSITY faculty
- Abstract
The Task Force on Foundational Competencies in Educational Measurement has produced a set of foundational competencies and invited comment on the document. The students and faculty at the University of Massachusetts Amherst provide their comments and critique of the proposed competencies. Both students and faculty agree that there needs to be more specificity regarding the purpose of the document, the nature of the data used to produce the document, and the definition of the relevant terms. Additionally, attention should be paid to the international context, and the role of artificial intelligence and machine learning. The authors acknowledge the contribution of the draft of the foundational competencies and look forward to more conversation regarding this topic.
- Published
- 2024
- Full Text
- View/download PDF
16. Student Assessment Opt Out and the Impact on Value-Added Measures of Teacher Quality
- Author
-
Marland, Joshua, Harrick, Matthew, and Sireci, Stephen G.
- Abstract
Student assessment nonparticipation (or opt out) has increased substantially in K-12 schools in states across the country. This increase in opt out has the potential to affect achievement and growth (or value-added) measures used for educator and institutional accountability. In this simulation study, we investigated the extent to which value-added measures of teacher quality are affected by varying degrees of opt out, as well as by various types of nonrandom opt out. Results show that the magnitude of opt out and the choice of classification scheme have a greater impact on value-added estimates than the type of opt-out pattern simulated in this study. Specifically, root mean square differences in value-added estimates increased as the magnitude of opt out increased. In addition, teacher effectiveness classification agreement decreased as opt-out magnitude increased. One type of opt out, in which the highest achieving students in the highest achieving classrooms opted out, had a larger impact on stability than the other types of opt out.
- Published
- 2020
- Full Text
- View/download PDF
17. Standardization and 'UNDERSTAND'ardization in Educational Assessment
- Author
-
Sireci, Stephen G.
- Abstract
Educational tests are standardized so that all examinees are tested on the same material, under the same testing conditions, and with the same scoring protocols. This uniformity is designed to provide a level "playing field" for all examinees so that the test is "the same" for everyone. Thus, standardization is designed to promote fairness in testing. In practice, the material tested, the conditions under which a test is administered, and the scoring processes, are often too rigid to provide the intended level playing field. For example, standardized testing conditions may interact with personal characteristics of examinees that affect test performance, but are not construct-relevant. Thus, more flexibility in standardization is needed to account for the diversity of experiences, talents, and handicaps of the incredibly heterogeneous populations of examinees we currently assess. Traditional standardization procedures grew out of experimental psychology and psychophysics laboratories where keeping all conditions constant was crucial. Today, accounting for and measuring what is not constant across examinees is crucial to valid construct interpretations. To meet this need I introduce the concept of "understandardization," which refers to ensuring sufficient flexibility in standardized testing conditions to yield the most accurate measurement of proficiency for "each" examinee.
- Published
- 2020
- Full Text
- View/download PDF
18. Evaluating Random and Systematic Error in Student Growth Percentiles
- Author
-
Wells, Craig S. and Sireci, Stephen G.
- Abstract
Student growth percentiles (SGPs) are currently used by several states and school districts to provide information about individual students as well as to evaluate teachers, schools, and school districts. For SGPs to be defensible for these purposes, they should be reliable. In this study, we examine the amount of systematic and random error in SGPs by simulating test scores for four grades and estimating SGPs using one, two, or three conditioning years. The results indicated that, although the amount of systematic error was small to moderate, the amount of random error was substantial, regardless of the number of conditioning years. For example, the standard error of the SGP estimates associated with an SGP value of 56 was 22.2 resulting in a 68% confidence interval that would range from 33.8 to 78.2 when using three conditioning years. The results are consistent with previous research and suggest SGP estimates are too imprecise to be reported for the purpose of understanding students' progress over time.
- Published
- 2020
- Full Text
- View/download PDF
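The interval reported in the abstract above follows directly from the quoted standard error; a tiny check using only the values given there:

```python
# Sketch: the 68% interval implied by the reported SGP estimate (56) and its
# standard error (22.2) under three conditioning years, as stated above.
sgp_hat, se = 56, 22.2
lower, upper = sgp_hat - se, sgp_hat + se   # +/- 1 SE is roughly a 68% interval
print(f"68% CI: [{lower:.1f}, {upper:.1f}]")  # [33.8, 78.2]
```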
19. A Review of Models for Computer-Based Testing. Research Report 2011-12
- Author
-
College Board, Luecht, Richard M., and Sireci, Stephen G.
- Abstract
Over the past four decades, there has been incremental growth in computer-based testing (CBT) as a viable alternative to paper-and-pencil testing. However, the transition to CBT is neither easy nor inexpensive. As Drasgow, Luecht, and Bennett (2006) noted, many design engineering, test development, operations/logistics, and psychometric changes are required to develop a successful operational program. Early research on CBT almost exclusively focused on theoretical issues such as improving measurement efficiency by achieving adequate levels of test score reliability using as few items as possible. However, it was soon evident that practical issues--such as ensuring content representation, making sure all examinees have sufficient time to complete the test, implementing new item types, and controlling the degree to which items were exposed to examinees--needed to be addressed, too. In the past few years, research on CBT has focused on developing models that achieve desired levels of measurement efficiency while simultaneously satisfying other important goals, such as minimizing item exposure and maintaining content validity. In addition, there has been a growing awareness among practitioners that basic CBT research using small samples or simulation studies needs to be vetted using cost-benefit analysis, as well as engineering design and implementation criteria to ensure that feasibility, scalability, and efficiency are evaluated in more concrete ways than by merely reporting a reduction of error variances for theoretical examinee scores (Luecht, 2005a, 2005b).
- Published
- 2011
20. Evolving Notions of Fairness in Testing in the United States
- Author
-
Sireci, Stephen G., primary and Randall, Jennifer, additional
- Published
- 2021
- Full Text
- View/download PDF
21. Evaluation of the National Assessment of Educational Progress. Study Reports
- Author
-
Department of Education (ED), Office of Planning, Evaluation and Policy Development, Buckendahl, Chad W., Davis, Susan L., Plake, Barbara S., Sireci, Stephen G., Hambleton, Ronald K., Zenisky, April L., and Wells, Craig S.
- Abstract
The "Evaluation of the National Assessment of Educational Progress: Study Reports" describes the special studies that comprised the design of the evaluation. In the Final Report, the authors presented a practical discussion of the evaluation studies to its primary, intended audience, namely policymakers. On this accompanying CD, readers will find additional evidence to support findings and recommendations in the six study reports. The study reports represent summaries of the data collection, analysis, and findings of the different lines of inquiry that comprised the evaluation design. Included in this volume are: (1) Audit Study Report (Barbara S. Plake, Chad W. Buckendahl, and Susan L. Davis); (2) Evaluation of the Standard Setting on the 2005 Grade 12 National Assessment of Educational Progress Mathematics Test (Stephen G. Sireci, Jeffrey Hauger, Christine Lewis, Craig Wells, April L. Zenisky, and Jill Delton); (3) How Do Other Countries Measure Up to the Mathematics Achievement Levels on the National Assessment of Educational Progress? (Ronald K. Hambleton, Stephen G. Sireci, and Zachary R. Smith); (4) A Study of the Utility of the National Assessment of Educational Progress (April L. Zenisky, Ronald K. Hambleton, and Stephen G. Sireci); (5) Evaluating Score Equity Across Selected States for the 2005 Grade 8 NAEP Math and Reading Assessments (Craig S. Wells, Su Baldwin, Ronald K. Hambleton, Stephen G. Sireci, Ana Karantonis, Stephen Jirka, Robert Keller and Lisa A. Keller); and (6) Methods for Evaluating the Alignment Between State Curriculum Frameworks and State Assessments: A Literature Review (Drey Martone, Stephen G. Sireci, and Jill Delton). Individual tables, figures, footnotes, references, and appendices.
- Published
- 2009
22. Promoting Valid Assessment of Students with Disabilities and English Learners
- Author
-
Sireci, Stephen G., Banda, Ella, Wells, Craig S., Elliott, Stephen N., editor, Kettler, Ryan J., editor, Beddow, Peter A., editor, and Kurz, Alexander, editor
- Published
- 2018
- Full Text
- View/download PDF
23. Item Response Theory-Based Methods for Estimating Classification Accuracy and Consistency
- Author
-
Diao, Hongyu and Sireci, Stephen G.
- Abstract
Whenever classification decisions are made on educational tests, such as pass/fail, or basic, proficient, or advanced, the consistency and accuracy of those decisions should be estimated and reported. Methods for estimating the reliability of classification decisions made on the basis of educational tests are well-established (e.g., Rudner, 2001; Rudner, 2005; Lee, 2010). However, they are not covered in most measurement textbooks and so they are not widely known. Moreover, few practitioners are aware of freely available software that can be used to implement current methods for evaluating decision consistency and decision accuracy that are appropriate for contemporary educational assessments. In this article, we describe current methods for estimating decision consistency and decision accuracy and provide descriptions of "freeware" software that can estimate these statistics. Similarities and differences across these software packages are discussed. We focus on methods based on item response theory, which are particularly well-suited to most 21st century assessments.
- Published
- 2018
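A minimal sketch of a Rudner-style classification accuracy and consistency computation, of the kind the entry above describes, appears below. The cut score, conditional standard error, and examinee distribution are illustrative assumptions, not values from the article or its software.

```python
# Sketch: Rudner-style classification accuracy and consistency for a single
# pass/fail cut score on the theta scale. Illustrative values only.
import numpy as np
from scipy.stats import norm

theta_cut = 0.5   # hypothetical cut score on theta
csem = 0.30       # hypothetical (constant) conditional standard error
thetas = np.random.default_rng(1).normal(0.0, 1.0, size=20000)  # examinee pool

# For each true theta, probability that the estimate lands above the cut,
# assuming theta_hat ~ N(theta, csem).
p_above = 1 - norm.cdf(theta_cut, loc=thetas, scale=csem)

# Accuracy: probability the estimate lands on the correct side of the cut.
accuracy = np.where(thetas >= theta_cut, p_above, 1 - p_above).mean()

# Consistency: probability two independent estimates agree on the category.
consistency = (p_above**2 + (1 - p_above)**2).mean()

print(f"Estimated classification accuracy:    {accuracy:.3f}")
print(f"Estimated classification consistency: {consistency:.3f}")
```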
24. Exploring the Factor Structure of a K-12 English Language Proficiency Assessment
- Author
-
Faulkner-Bond, Molly, Wolf, Mikyung Kim, Wells, Craig S., and Sireci, Stephen G.
- Abstract
In this study we investigated the internal factor structure of a large-scale K-12 assessment of English language proficiency (ELP) using samples of fourth- and eighth-grade English learners (ELs) in one state. While U.S. schools are mandated to measure students' ELP in four language domains (listening, reading, speaking, and writing), some ELP standards released recently have defined ELP on the basis of integrated modalities, such as receptive language or collaborative communication. To explore whether current assessments can empirically support new conceptualizations such as these, we compared seven models based on different hypothesized structures for language proficiency. For the Grade 8 students, we find support for a hierarchical factor model, with general language underlying the four domains. A model with the four domains offered the best fit for the Grade 4 sample but fell just shy of criteria for acceptable fit. Models that incorporate more specific higher-order modalities, such as literacy or productive language, functioned less well for the given data of Grades 4 and 8 samples, suggesting the current shift in ELP definition may require shifts in how ELP assessments are built and scored.
- Published
- 2018
- Full Text
- View/download PDF
25. Using Bilingual Students to Link and Evaluate Different Language Versions of an Exam
- Author
-
Ong, Saw Lan and Sireci, Stephen G.
- Abstract
Many researchers, as well as the International Test Commission's guidelines (Hambleton, 2005), caution against treating scores from different language versions of a test as equivalent without conducting empirical research to verify such equivalence. In this study, we evaluated the equivalence of English and Malay versions of a 9th-grade math test administered in Malaysia by conducting several statistical analyses. All analyses were conducted on data from a large sample of English-Malay bilingual students who took both versions of the exam. First, we conducted two equating analyses--one based on classical test theory and another based on item response theory (IRT). Then differential item functioning analyses (DIF) were performed to see if any items functioned differentially across their English and Malay versions. The DIF results flagged 7 items for statistically significant DIF, but only one had a non-negligible effect size. We then conducted another equating analysis dropping the DIF items. The equating results suggested an adjustment of 1 or 2 points, depending on the mathematics achievement levels. The results indicate that bilingual examinees can be useful for evaluating different language versions of a test and adjusting for differences in difficulty across test forms due to translation. (Contains 5 tables and 2 figures.)
- Published
- 2008
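As a rough illustration of the classical equating analysis mentioned above, the sketch below applies a linear (mean-sigma) adjustment of the kind that could produce differences of a point or two. The score vectors are hypothetical; the study's single-group bilingual design is only approximated here.

```python
# Sketch: classical linear (mean-sigma) equating from one language form onto
# the scale of another, using hypothetical raw-score distributions.
import numpy as np

rng = np.random.default_rng(4)
english = rng.normal(31.0, 7.0, size=600).round()   # hypothetical English-form scores
malay   = rng.normal(32.5, 7.5, size=600).round()   # hypothetical Malay-form scores

def linear_equate(x, x_form, y_form):
    """Place a raw score from form X onto the form-Y scale (linear equating)."""
    return y_form.std() / x_form.std() * (x - x_form.mean()) + y_form.mean()

for raw in (20, 30, 40):
    print(f"English raw {raw} -> Malay-scale {linear_equate(raw, english, malay):.1f}")
```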
26. A Summary of the Research on the Effects of Test Accommodations: 2005-2006. Technical Report 47
- Author
-
National Center on Educational Outcomes, Minneapolis, MN., Zenisky, April L., and Sireci, Stephen G.
- Abstract
The purpose of this report is to provide an update on the state of the research on testing accommodations, as well as to identify promising areas of research to further clarify and enhance understanding of current and emerging issues. The research described encompasses empirical studies of score comparability and validity studies, as well as investigations into accommodations use and perceptions of their effectiveness. Taken together, the current research explores many of the issues surrounding test accommodations practices in both breadth and depth. Insofar as reporting on the findings of current research studies is a primary goal of this analysis, a second goal is to also identify areas requiring continued investigation in the future. (Contains 37 tables and 1 figure.)
- Published
- 2007
27. High-Stakes Testing in the Warm Heart of Africa: The Challenges and Successes of the Malawi National Examinations Board
- Author
-
Chakwera, Elias, Khembo, Dafter, and Sireci, Stephen G.
- Abstract
In the United States, tests are held to high standards of quality. In developing countries such as Malawi, psychometricians must deal with these same high standards as well as several additional pressures such as widespread cheating, test administration difficulties due to challenging landscapes and poor resources, difficulties in reliably scoring performance assessments, and extreme scrutiny from political parties and the popular press. The purposes of this paper are to (a) familiarize the measurement community in the US about Malawi's assessment programs, (b) discuss some of the unique challenges inherent in such a program, (c) compare testing conditions and test administration formats between Malawi and the US, and (d) provide suggestions for improving large-scale testing in countries such as the US and Malawi. By learning how a small country instituted and supports its current testing programs, a broader perspective on resolving current measurement problems throughout the world will emerge. (Contains 4 tables and 3 notes.)
- Published
- 2004
28. Validity Issues in Accommodating NAEP Reading Tests
- Author
-
National Assessment Governing Board, Washington, DC. and Sireci, Stephen G.
- Abstract
The National Assessment of Educational Progress (NAEP) seeks to include all students in the United States in the sampling frame from which students are selected to participate in the assessment. However, some students with disabilities (SWD) are either unable to take NAEP tests under standard testing conditions or are unable to perform at their best under standard testing conditions. In many testing situations, accommodations to standard testing conditions are given to SWD to improve measurement of their knowledge, skills, and abilities. This practice is in the pursuit of more valid test score interpretation; however, it produces the ultimate psychometric oxymoron--an accommodated standardized test. In this paper, I review validity issues related to test accommodations and summarize some empirical studies in this area. The focus of the paper is on accommodations for reading tests because some types of accommodations on these tests are particularly controversial. The specific accommodations emphasized in this review are extended time and oral (read-aloud) accommodations. A review of professional standards, validity theory, and recent empirical research in this area suggests that extended time accommodations may be appropriate for reading tests, but read-aloud accommodations are likely to alter the construct measured. Suggestions for determining when to provide accommodations and how to report scores from accommodated test administrations are provided. (Contains 3 tables, 2 figures, and 2 footnotes.) [This paper is one of a set of research-oriented papers commissioned by National Assessment Governing Board (NAGB) to serve as background information for attendees of the NAGB Conference on Increasing the Participation of Students with Disabilities (SD) and limited English proficient (LEP) Students in NAEP. This paper was also published as: Center for Educational Assessment Research Report No. 515. Amherst, MA: School of Education, University of Massachusetts Amherst.]
- Published
- 2004
29. Anchor-Based Methods for Judgmentally Estimating Item Difficulty Parameters. LSAC Research Report Series.
- Author
-
Law School Admission Council, Newtown, PA., Hambleton, Ronald K., Sireci, Stephen G., Swaminathan, H., Xing, Dehui, and Rizavi, Saba
- Abstract
The purposes of this research study were to develop and field test anchor-based judgmental methods for enabling test specialists to estimate item difficulty statistics. The study consisted of three related field tests. In each, researchers worked with six Law School Admission Test (LSAT) test specialists and one or more of the LSAT subtests. The three field tests produced a number of conclusions. A considerable amount was learned about the process of extracting test specialists' estimates of item difficulty. The ratings took considerably longer to obtain than had been expected. Training, initial ratings, and discussion took a considerable amount of time. Test specialists felt they could be trained to estimate item difficulty accurately and, to some extent, they demonstrated this. Average error in the estimates of item difficulty varied from about 11% to 13%. The discussions were also popular with the panelists and almost always resulted in improved item difficulty estimates. By the end of the study, the two frameworks that developers thought they might provide to test specialists had merged into one. Test specialists seemed to benefit from the descriptions of items located at three levels of difficulty and from information about the item statistics of many items. Four appendixes describe tasks and contain the field test materials. (Contains 8 tables and 18 references.) (SLD)
- Published
- 2003
30. Small Sample Estimation in Dichotomous Item Response Models: Effect of Priors Based on Judgmental Information on the Accuracy of Item Parameter Estimates. LSAC Research Report Series.
- Author
-
Law School Admission Council, Newtown, PA., Swaminathan, Hariharan, Hambleton, Ronald K., Sireci, Stephen G., Xing, Dehui, and Rizavi, Saba M.
- Abstract
The primary objective of this study was to investigate how incorporating prior information improves estimation of item parameters in two small samples. The factors that were investigated were sample size and the type of prior information. To investigate the accuracy with which item parameters in the Law School Admission Test (LSAT) are estimated, the item parameter estimates were compared with known item parameter values. By randomly drawing small samples of varying sizes from the population of test takers, the relationship between sample size and the accuracy with which item parameters are estimated was studied. Data used were from the Reading Comprehension subtest of the LSAT. Results indicate that the incorporation of ratings of item difficulty provided by subject matter specialists/test developers produced estimates of item difficulty statistics that were more accurate than those obtained without using such information. The improvement was observed for all item response models, including the model used in the LSAT. (SLD)
- Published
- 2003
31. Evaluating the Structural Equivalence of Tests Used in International Comparisons of Educational Achievement.
- Author
-
Sireci, Stephen G. and Gonzalez, Eugenio J.
- Abstract
International comparative educational studies make use of test instruments originally developed in English by international panels of experts but ultimately administered in the language of instruction of the students. The comparability of the different language versions of these assessments is a critical issue in validating the comparative inferences drawn from the test results. This study analyzed data from the 1999 Third International Mathematics and Science Study (TIMSS) science assessment to evaluate the consistency of the structure of the item response data across different test versions. Individual differences multidimensional scaling analyses were used to evaluate data structure. The findings suggest that slight structural differences exist across countries, and that these differences are related to differences in item difficulty. The implications of these findings for better understanding of international comparisons of educational achievement, and for future research in this area, are discussed. (Contains 1 figure, 7 tables, and 26 references.) (Author/SLD)
- Published
- 2003
32. An Analysis of the Psychometric Properties of Dual Language Test Forms.
- Author
-
Massachusetts Univ., Amherst. School of Education., Sireci, Stephen G., and Khaliq, Shameem Nyla
- Abstract
Many students in the United States who are required to take educational tests are not fully proficient in English. To address this problem, a state-mandated testing program created dual language English-Spanish versions of some of their tests. In this study, the psychometric properties of the English and dual language versions of a fourth-grade mathematics test were explored. Analyses of the consistency of test structure across the two forms were conducted using structural equation modeling and multidimensional scaling. Analyses of differential item functioning (DIF) were conducted using Poly-SIBTEST. The results suggest slight structural differences across the two versions of the test. Part of this difference was attributed to overall proficiency differences across the two studied groups, and part was attributed to DIF. The implications of the findings for future research in this area are discussed. (Contains 4 figures, 11 tables, and 28 references.) (Author/SLD)
- Published
- 2002
33. Effects of Local Item Dependence on the Validity of IRT Item, Test, and Ability Statistics. MCAT Monograph.
- Author
-
Zenisky, April L., Hambleton, Ronald K., and Sireci, Stephen G.
- Abstract
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. In this study, methods for detecting local item dependence (LID) are reviewed, and the use of testlets to account for LID in context-dependent item sets is discussed. LID detection methods and testlet-based item calibrations are applied to data from a large-scale, high stakes admissions test, and the results are evaluated with respect to test score reliability and examinee proficiency estimation. Data were from two forms of the Medical College Admission Test (MCAT) for 8,494 and 8,026 examinees. Results suggest the presence of LID impacts estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding the calibration of context-dependent item sets using item response theory. (Contains 3 figures, 11 tables, and 30 references.) (Author/SLD)
- Published
- 2001
34. Timing Considerations in Test Development and Administration
- Author
-
Sireci, Stephen G., primary and Botha, Sandra M., additional
- Published
- 2020
- Full Text
- View/download PDF
35. Appraising the Dimensionality of the Medical College Admission Test. MCAT Monograph.
- Author
-
Meara, Kevin and Sireci, Stephen G.
- Abstract
To provide a better understanding of the structure of the Medical College Admission Test (MCAT) and to determine if there are structural differences across selected groups of MCAT examinees, several dimensionality analyses were conducted on data from recent administrations of the MCAT. The first set of analyses focused on the global structure of the MCAT, and the second set appraised the consistency of the structure of data across groups of testtakers that differed with respect to sex, repeater/nonrepeater status, orientation to the English language, and race/ethnicity. Data from two forms of the MCAT were used. Forms 15A and 15B were administered in 1994 to 16,520 examinees, and Forms 23A and 23B were administered in 1996 to 12,625 examinees. Results suggest that appraisals of the MCAT structure should be conducted at the parcel level rather than at the item level. Parcel-level results suggest that a dominant factor underlies the MCAT. This is probably a "general intelligence" factor. The results also suggest additional factors that represent the principal disciplines measured on the MCAT. After the general factor, the next structural layer of the MCAT separates test material measuring science from test material measuring verbal reasoning and writing skills. The next structural level depicts three factors: science, verbal reasoning, and writing skills. These three factors were supported by all analyses. Results also support the distinction between the science disciplines, and, in general, analyses support the current content structure of the MCAT reported in the test blueprint. From a statistical perspective, results suggest it might be possible to scale the biological and physical sciences along a single continuum. With respect to the consistency of the MCAT structure across selected groups of testtakers, results supported the hypothesis of structural invariance across groups. In general, multidimensional scaling analyses indicated that all dimensions were relevant for accounting for the variation in the data for each group. Some exceptions are discussed. An appendix contains depictions of the confirmatory factor analysis models and parceling schemes. (Contains 13 figures, 30 tables, and 20 references.) (SLD)
- Published
- 2000
36. Setting Standards on a Computerized-Adaptive Placement Examination. Laboratory of Psychometric and Evaluative Research Report No. 378.
- Author
-
Massachusetts Univ., Amherst. Laboratory of Psychometric and Evaluative Research., Sireci, Stephen G., Patelis, Thanos, Rizavi, Saba, Dillingham, Alan M., and Rodriguez, Georgette
- Abstract
Setting standards on educational tests is extremely challenging. The psychometric literature is replete with methods and guidelines for setting standards on educational tests; however, little attention has been paid to the process of setting standards on computerized adaptive tests (CATs). This lack of attention is unfortunate because CATs are becoming more widely used, and setting standards on these tests is typically more difficult than setting standards on nonadaptive (linear) tests. This paper discusses some of the issues to be addressed when setting standards on CATs, presents the results of a standard setting study conducted on a computerized adaptive placement test, and discusses the implications of the findings for future research and practice in this area. Thirteen mathematics experts participated in the standard-setting study using ACCUPLACER (College Board) scores. The results of the study suggest that standards can be set on CATs using subsets of items from a CAT item pool, and that methods designed to gather test-centered standard setting data more quickly than traditional methods show promise for setting standards on CATs. (Contains 2 figures, 5 tables, and 14 references.) (Author/SLD)
- Published
- 2000
37. Evaluating the Construct Equivalence of International Employee Opinion Surveys. Laboratory of Psychometric and Evaluative Research Report No. 379.
- Author
-
Massachusetts Univ., Amherst. Laboratory of Psychometric and Evaluative Research., Sireci, Stephen G., Harter, James, and Yang, Yongwei
- Abstract
Assessing people who operate in different languages necessitates the use of multiple language versions of an assessment. However, different language versions of an assessment are not necessarily equivalent. In this paper, the psychometric properties of different language versions on an international employee attitude survey are evaluated. This survey was administered to more than 50,000 employees of a large telecommunications company using both paper-and-pencil and Web administration formats. The structural equivalence of the survey was evaluated across language versions, cultural groups, and administration formats using multidimensional scaling. The statistical equivalence of English, French, and Spanish versions of the survey items was evaluated using analysis of covariance. The results indicate the structure of the survey is consistent across the groups studied, and that the different language versions of the items functioned similarly. The implications of the results for future research in this area are discussed. (Contains 3 figures, 4 tables, and 38 references.) (Author/SLD)
- Published
- 2000
38. Comparing Computerized and Human Scoring of Students' Essays.
- Author
-
Massachusetts Univ., Amherst. Laboratory of Psychometric and Evaluative Research., Sireci, Stephen G., and Rizavi, Saba
- Abstract
Although computer-based testing is becoming popular, many of these tests are limited to the use of selected-response item formats due to the difficulty in mechanically scoring constructed-response items. This limitation is unfortunate because many constructs, such as writing proficiency, can be measured more directly using items that require examinees to produce a response. Therefore, computerized scoring of essays and other constructed response items is an important area of research. This study compared computerized scoring of essays with the scores produced by two independent human graders. Data were essay scores for 931 students from 24 postsecondary institutions in Texas. Although high levels of computer-human congruence were observed, the human graders were more consistent with one another than the computer was with them. Statistical methods for evaluating computer-human congruence are presented. The case is made that the percentage agreement statistics that appear in the literature are insufficient for comparing the computerized and human scoring of constructed response items. In this study, scoring differences were most pronounced when researchers looked at the percentage of essays scored exactly the same, the percentage scored the same at specific score points, and the percentage of exact agreement corrected for chance. The implications for future research in this area are discussed. (Contains 11 tables, 2 figures, and 15 references.) (Author/SLD)
- Published
- 2000
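The agreement indices discussed in the entry above can be illustrated with a short sketch. The scores below are hypothetical, and the chance-corrected statistic shown is Cohen's kappa, one common choice; the study's exact indices may differ.

```python
# Sketch: exact agreement and chance-corrected agreement (Cohen's kappa) for
# hypothetical computer vs. human essay scores on a 1-4 scale.
import numpy as np

computer = np.array([3, 2, 4, 3, 1, 2, 3, 4, 2, 3, 1, 4, 3, 2, 2])
human    = np.array([3, 2, 3, 3, 1, 3, 3, 4, 2, 2, 1, 4, 3, 2, 3])

exact = np.mean(computer == human)   # proportion of essays scored exactly the same

# Cohen's kappa: agreement corrected for chance agreement.
cats = np.unique(np.concatenate([computer, human]))
p_chance = sum(np.mean(computer == k) * np.mean(human == k) for k in cats)
kappa = (exact - p_chance) / (1 - p_chance)

print(f"Exact agreement: {exact:.2%}")
print(f"Cohen's kappa:   {kappa:.3f}")
```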
39. Computer Attitudes and Opinions of Students with and without Learning Disabilities.
- Author
-
Brown-Chidsey, Rachel, Boscardin, Mary Lynn, and Sireci, Stephen G.
- Abstract
This study investigated the attitudes and opinions of 970 students with and without learning disabilities regarding the use of computers for school-related work. Using a quasi-experimental design with three non-equivalent groups, within and between subjects effects were studied using a survey instrument. The students in grades 5 through 12 at three school sites completed pre- and post-test surveys at the beginning and end of the school year. One site served as the experimental group, while the other two were control groups. The experimental condition consisted of the installation of a campus-wide computer network for use by all students at the experimental site. A 24-item scale measured participants' attitudes about the general use of computers in schools and the use of computers by students with special needs. The most significant variables related to students' attitudes and opinions were their past experiences using computers and their school affiliation. These data also showed there was no relationship between the installation of a campus-wide computer network and changes in students' attitudes and opinions about computer use in special education. There were no significant differences in attitudes toward computers between students with and without learning disabilities. (Contains 25 references.) (Author/CR)
- Published
- 1999
40. Evaluating Computer-Based Test Accommodations for English Learners
- Author
-
Roohr, Katrina Crotts and Sireci, Stephen G.
- Abstract
Test accommodations for English learners (ELs) are intended to reduce the language barrier and level the playing field, allowing ELs to better demonstrate their true proficiencies. Computer-based accommodations for ELs show promising results for leveling that field while also providing us with additional data to more closely investigate the validity and effectiveness of those accommodations. In this study, we evaluate differences across non-ELs and two EL groups in their decision to use either of two computer-based accommodations on high school history and math assessments. We also evaluate differences in response times across these groups. Results showed that ELs used accommodations more than non-ELs; however, many students did not use any accommodations, and use decreased as the assessment progressed. In addition, students had longer response time for items with accommodations in history but not mathematics. Recommendations for future research in accommodations for ELs are discussed.
- Published
- 2017
- Full Text
- View/download PDF
41. An Empirical Evaluation of Selected Multiple-Choice Item Writing Guidelines.
- Author
-
Sireci, Stephen G., Wiley, Andrew, and Keller, Lisa A.
- Abstract
Seven specific guidelines included in the taxonomy proposed by T. Haladyna and S. Downing (1998) for writing multiple-choice test items were evaluated. These specific guidelines are: (1) avoid the complex multiple-choice, K-type format; (2) state the stem in question format; (3) word the stem positively; (4) avoid the phrase "all of the above"; (5) avoid the phrase "none of the above"; (6) avoid specific determiners such as "always" or "never"; and (7) keep the length of options fairly consistent. These guidelines were evaluated by comparing statistical indices of item quality across items that do and do not violate one or more of these guidelines. The items and their statistics were taken from a recently administered, high-stakes, large-scale licensure examination, the Uniform Certified Public Accountant Examination. Only 1 of the 285 items evaluated violated the guidelines to avoid the phrases "none of the above" and "all of the above" and the determiners "always" and "never." The only guideline supported by the data was avoiding the K-type item, since K-type items on this test tended to be more difficult and to have lower discrimination statistics. Results do not support the "state the stem in question format" guideline. Relatively few items violated more than one guideline. (Contains 8 tables and 10 references.) (SLD)
- Published
- 1998
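The item statistics compared in studies like the one above are classical difficulty and discrimination indices; a brief sketch on simulated responses follows. The data generator and all values are hypothetical and are not taken from the licensure examination analyzed in the paper.

```python
# Sketch: classical item difficulty (proportion correct) and discrimination
# (corrected point-biserial) on a simulated 0/1 response matrix.
import numpy as np

rng = np.random.default_rng(2)
n_examinees, n_items = 200, 8
ability = rng.normal(size=n_examinees)              # hypothetical examinee abilities
item_b = np.linspace(-1.5, 1.5, n_items)            # hypothetical item difficulties
prob = 1 / (1 + np.exp(-(ability[:, None] - item_b[None, :])))   # 1PL-style model
responses = (rng.random((n_examinees, n_items)) < prob).astype(int)

difficulty = responses.mean(axis=0)                 # classical p-values

total = responses.sum(axis=1)
discrimination = np.empty(n_items)
for j in range(n_items):
    rest = total - responses[:, j]                  # total score excluding item j
    discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]

for j in range(n_items):
    print(f"Item {j + 1}: difficulty = {difficulty[j]:.2f}, "
          f"corrected point-biserial = {discrimination[j]:.2f}")
```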
42. Evaluating Construct Equivalence across Adapted Tests.
- Author
-
Sireci, Stephen G. and Bastari, B.
- Abstract
In many cross-cultural research studies, assessment instruments are translated or adapted for use in multiple languages. However, it cannot be assumed that different language versions of an assessment are equivalent across languages. A fundamental issue to be addressed is the comparability or equivalence of the construct measured by each language version of the assessment. This paper presents and critiques several methods for evaluating structural equivalence across different language versions of a test or questionnaire. Applications of these techniques to large-scale, cross-lingual tests are presented and discussed. Simulated data are also used to evaluate the methods. It is concluded that weighted multidimensional scaling and confirmatory factor analysis are effective for helping evaluate construct equivalence across groups. Qualifications for using these procedures to evaluate construct equivalence are provided. (Contains 2 figures, 6 tables, and 42 references.) (Author/SLD)
- Published
- 1998
43. Adapting Credentialing Examinations for International Uses.
- Author
-
Sireci, Stephen G., Fitzgerald, Cyndy, and Xing, Dehui
- Abstract
Adapting credentialing examinations for international uses involves translating tests for use in multiple languages. This paper explores methods for evaluating construct equivalence and item equivalence across different language versions of a test. These methods were applied to four different language versions (English, French, German, and Japanese) of a Microsoft certification examination with samples ranging from 1,329 to 2,000 examinees per test. Principal components analysis, multidimensional scaling, and confirmatory factor analysis of these data were conducted to evaluate construct equivalence. Detection of differential item functioning across languages was conducted using the standardized p-difference index. The results indicate that these procedures provide a great deal of information useful for evaluating test and item functioning across groups. Some differences in factor and dimension loadings across groups were noted, but a common, one-factor model fit the data well. Four items were flagged for differential item functioning across all groups. Suggestions for using these methods to evaluate translated tests are provided. (Contains 8 tables, 3 figures, and 13 references.) (Author/SLD)
- Published
- 1998
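The standardized p-difference index named in the abstract above has a simple form; the sketch below computes it for one hypothetical item, with made-up matched-group proportions and focal-group counts used as the standardization weights.

```python
# Sketch: standardized p-difference (STD P-DIF) for one studied item.
# Weights are focal-group counts at each matched total-score level
# (Dorans & Kulick-style standardization); all values are illustrative.
import numpy as np

focal_n = np.array([ 30,  60, 120,  90,  40])        # focal-group N at each level
p_focal = np.array([0.20, 0.35, 0.55, 0.70, 0.85])   # focal proportion correct
p_ref   = np.array([0.25, 0.42, 0.60, 0.74, 0.88])   # reference proportion correct

std_p_dif = np.sum(focal_n * (p_focal - p_ref)) / focal_n.sum()
# Negative values indicate the item is harder for the matched focal group.
print(f"STD P-DIF = {std_p_dif:+.3f}")
```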
44. Evaluating Content Validity Using Multidimensional Scaling.
- Author
-
Sireci, Stephen G.
- Abstract
Multidimensional scaling (MDS) is a versatile technique for understanding the structure of multivariate data. Recent studies have applied MDS to the problem of evaluating content validity. This paper describes the importance of evaluating test content and the logic of using MDS to analyze data gathered from subject matter experts employed in content validation studies. Some recent applications of the procedure are reviewed, and illustrations of the results are presented. Suggestions for gathering content validity data and using MDS to analyze them are presented. (Contains 3 exhibits, 7 figures, and 24 references.) (Author/SLD)
- Published
- 1998
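Entry 44 uses multidimensional scaling (MDS) on judgments from subject matter experts to evaluate content validity. As an illustration only, the sketch below scales hypothetical expert item-similarity ratings and checks whether items from the same blueprint content area land near one another; the data, labels, and two-dimensional solution are all assumptions.

import numpy as np
from sklearn.manifold import MDS

# Hypothetical input: each subject matter expert rates the similarity of
# every item pair on a 1 (very dissimilar) to 5 (very similar) scale.
# ratings has shape (n_experts, n_items, n_items).
rng = np.random.default_rng(2)
n_experts, n_items = 6, 12
ratings = rng.integers(1, 6, size=(n_experts, n_items, n_items)).astype(float)
ratings = (ratings + ratings.transpose(0, 2, 1)) / 2      # force symmetry

# Convert averaged similarities to dissimilarities.
mean_similarity = ratings.mean(axis=0)
dissimilarity = mean_similarity.max() - mean_similarity
np.fill_diagonal(dissimilarity, 0.0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

# Inspect whether items from the same content area (per the test
# blueprint) occupy the same region of the configuration.
blueprint = ["algebra"] * 6 + ["geometry"] * 6            # hypothetical labels
for label, (x, y) in zip(blueprint, coords):
    print(f"{label:10s}  x={x:6.2f}  y={y:6.2f}")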
45. Effect of Item Bundling on the Assessment of Test Dimensionality. Laboratory of Psychometric and Evaluative Research Report No. 328.
- Author
-
Massachusetts Univ., Amherst, School of Education; Egan, Karla L., Sireci, Stephen G., Swaminathan, Hariharan, and Sweeney, Kevin P.
- Abstract
The primary purpose of this study was to assess the effect of item bundling on multidimensional data. A second purpose was to compare three methods for assessing dimensionality. Eight multidimensional data sets consisting of 100 items and 1,000 examinees were simulated, varying in dimensionality, inter-dimensional correlation, and the number of items loading on each dimension. Analyses were also performed on two samples of examinees from the November 1996 administration of the Uniform Certified Public Accountant examination. The items from both data sets were grouped into bundles that varied in size and content. Principal components factor analysis, maximum likelihood factor analysis, and multidimensional scaling were used to analyze the item bundles as well as the items themselves. Results suggest that item bundling tends to obscure multidimensionality, whereas analyses performed on the items themselves tend to overestimate dimensionality. Multidimensional scaling also appeared better able than the other two techniques to recover the underlying dimensionality of the data. (Contains 13 tables and 17 references.) (Author/SLD) An illustrative sketch of the bundling comparison appears after this entry.
- Published
- 1998
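Entry 45 examines how grouping items into bundles changes what dimensionality analyses recover. The sketch below is a simplified illustration under assumed data: it simulates two correlated traits, forms content-heterogeneous bundles, and compares the leading eigenvalues of the item-level and bundle-level correlation matrices. It is not the report's simulation design, which also used maximum likelihood factor analysis and MDS.

import numpy as np

def leading_eigenvalues(data, k=4):
    """Largest eigenvalues of the correlation matrix (a rough dimensionality check)."""
    r = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    return np.sort(np.linalg.eigvalsh(r))[::-1][:k]

# Hypothetical two-dimensional data: items 0-49 load on trait 1,
# items 50-99 on a correlated trait 2.
rng = np.random.default_rng(3)
theta = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)
loadings = np.zeros((100, 2))
loadings[:50, 0], loadings[50:, 1] = 1.0, 1.0
scores = (theta @ loadings.T + rng.normal(size=(1000, 100)) > 0).astype(int)

# Form 10 bundles of 10 items each, mixing items from both traits
# (content-heterogeneous bundles), then compare eigenvalue patterns.
order = np.ravel(np.column_stack([np.arange(50), np.arange(50, 100)]))
bundles = scores[:, order].reshape(1000, 10, 10).sum(axis=2)

print("item-level eigenvalues:  ", leading_eigenvalues(scores).round(2))
print("bundle-level eigenvalues:", leading_eigenvalues(bundles).round(2))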
46. Using Cluster Analysis To Facilitate the Standard Setting Process.
- Author
-
Sireci, Stephen G., Robin, Frederic, and Patelis, Thanos
- Abstract
The most popular methods for setting passing scores and other standards on educational tests rely heavily on subjective judgment. This paper presents and evaluates a new procedure for setting and evaluating standards on tests based on cluster analysis of test data. The clustering procedure was applied to a statewide mathematics proficiency test administered to 818 seventh-grade students in a small urban/suburban school district. Content area subscores were derived from the test specifications to serve as clustering variables. Subsequent course grades in mathematics were used to validate the cluster solutions, and the stability of the solutions was evaluated using two random samples. The three-cluster (K-means) solution provided relatively homogeneous groupings of students that were consistent across the two samples and were congruent with school mathematics grades. Standards for "intervention," "proficient," and "excellent" levels of student performance were derived from these results. These standards were similar to those established by the local school district. The results suggest that cluster analytic techniques may be useful for helping set standards on educational tests, as well as for evaluating standards set by other methods. Suggestions for future research are provided. (Contains 2 figures, 7 tables, and 23 references.) (Author/SLD) An illustrative K-means sketch appears after this entry.
- Published
- 1997
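Entry 46 sets provisional performance standards by clustering students on content-area subscores. The sketch below is an illustrative K-means workflow on made-up subscores; the three hypothetical clusters are ordered by overall performance and labeled with the entry's "intervention," "proficient," and "excellent" categories. Cut-score derivation and validation against course grades, as in the paper, are not shown.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical subscores: rows are students, columns are content-area
# subscores derived from the test specifications.
rng = np.random.default_rng(4)
subscores = np.vstack([
    rng.normal(loc, 1.0, size=(270, 4))
    for loc in (-1.0, 0.0, 1.2)          # three latent performance groups
])

scaled = StandardScaler().fit_transform(subscores)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Order clusters from lowest to highest mean total subscore and attach
# provisional labels; cut scores could be set between adjacent clusters.
totals = subscores.sum(axis=1)
cluster_means = [totals[kmeans.labels_ == c].mean() for c in range(3)]
order = np.argsort(cluster_means)
labels = {order[0]: "intervention", order[1]: "proficient", order[2]: "excellent"}

for c in order:
    members = totals[kmeans.labels_ == c]
    print(f"{labels[c]:>12s}: n={members.size:3d}, "
          f"total subscore range {members.min():.1f} to {members.max():.1f}")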
47. Comparing Dual-Language Versions of an International Computerized-Adaptive Certification Exam.
- Author
-
Sireci, Stephen G., Foster, David F., and Robin, Frederic
- Abstract
Evaluating the comparability of a test administered in different languages is a difficult, if not impossible, task. Comparisons are problematic because observed differences in test performance between groups who take different language versions of a test could be due to a difference in difficulty between the tests, to cultural differences in test taking behavior, or to a difference in proficiency between the language groups. The international certification testing programs conducted by Novell, Inc. are exceptional examples of the complex psychometric demands inherent in multiple language assessment programs. Novell's international certification program includes tests administered in 12 languages. Many of these tests are computerized adaptive (CAT), complicating comparisons across tests and languages. This paper reports the results of a study comparing English and German language versions of a high-stakes Novell CAT certification exam. The two versions of the test were compared by analyses including separate and concurrent item response theory calibrations. Results with 1,668 English-language candidates and 922 German-language candidates indicate that the English and German CATs are highly similar, and that the tests appear to be unidimensional in both the English and German versions. It is also concluded that the German candidate sample was more proficient than the English sample, and that 2 of 15 items functioned differentially across the languages. The source of the differential item functioning was identified post hoc using bilingual subject matter experts. The comparability of the passing scores and other critical validity issues are discussed. (Contains 4 tables, 11 figures, and 20 references.) (Author/SLD) An illustrative sketch of comparing item characteristic curves across languages appears after this entry.
- Published
- 1997
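Entry 47 compares English and German versions of a CAT using separate and concurrent IRT calibrations. The sketch below shows one simple follow-up check, not the study's full analysis: given item parameters from separate calibrations already placed on a common scale, it computes the area between the two languages' item characteristic curves as a rough indicator of differential functioning. The parameter values are hypothetical; the three-parameter logistic formula with the 1.7 scaling constant is the standard one.

import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def area_between_iccs(params_lang1, params_lang2, lo=-4.0, hi=4.0, steps=801):
    """Approximate area between two ICCs over a theta grid
    (a simple indicator of how differently an item functions)."""
    theta = np.linspace(lo, hi, steps)
    p1 = icc_3pl(theta, *params_lang1)
    p2 = icc_3pl(theta, *params_lang2)
    return np.trapz(np.abs(p1 - p2), theta)

# Hypothetical (a, b, c) parameters for one item from separate English and
# German calibrations, assumed to be linked to a common scale.
english = (1.1, 0.2, 0.20)
german = (1.0, 0.7, 0.20)
print("area between ICCs:", round(area_between_iccs(english, german), 3))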
48. Challenges for Psychological Assessment: Retaining UN's Sustainable Development Goal 4.1.1a.
- Author
-
Greiff, Samuel, Iliescu, Dragos, Ketan, and Sireci, Stephen G.
- Published
- 2024
49. Evaluating Translation Equivalence: So What's the Big Dif?
- Author
-
Sireci, Stephen G. and Swaminathan, Hariharan
- Abstract
Procedures for evaluating differential item functioning (DIF) are commonly used to investigate the statistical equivalence of items that are translated from one language to another. However, the methodology developed for detecting DIF is designed to evaluate the functioning of the same items administered to two groups. In evaluating the differential functioning of dual language versions of a test item (translation DIF), the items being compared (i.e., an original item and its translated version) are not the same. Thus, studies of translation DIF may not fulfill the requirements of currently available DIF detection procedures. This paper discusses the complex issues involved in evaluating translation DIF. An important, but often overlooked, issue is the dimensionality of the construct measured across the two languages. It is concluded that the dimensionality issues must be addressed first in studies of translation DIF. The development of an adequate research design is another important issue in studies of translation DIF. The design must be able to control for extraneous language proficiency effects. Some suggestions and examples of such research designs are proposed. (Contains 29 references.) (Author/SLD)
- Published
- 1996
50. Technical Issues in Linking Assessments across Languages.
- Author
-
Sireci, Stephen G.
- Abstract
Test developers continue to struggle with the technical and logistical problems inherent in assessing achievement across different languages. Many testing programs offer separate language versions of a test to evaluate the achievement of examinees in different language groups. However, comparisons of individuals who took different language versions of a test are not valid unless the score scales for the different versions are linked or equated. This paper discusses the psychometric problems involved in cross-lingual assessment, reviews linking models that have been proposed to enhance score comparability, and provides suggestions for developing and evaluating a model for linking different language versions of a test. Attempts to link different language versions of a test onto a common scale are classified into three general research design categories: (1) separate monolingual group designs, usually linked through item response theory; (2) bilingual group designs; and (3) matched monolingual group designs. (Contains 4 figures and 47 references.) (Author/SLD)
- Published
- 1996