14 results for "Bejar, Isaac I"
Search Results
2. Building e-rater® Scoring Models Using Machine Learning Methods. Research Report. ETS RR-16-04
- Author
- Chen, Jing, Fife, James H., Bejar, Isaac I., and Rupp, André A.
- Abstract
The "e-rater"® automated scoring engine used at Educational Testing Service (ETS) scores the writing quality of essays. In the current practice, e-rater scores are generated via a multiple linear regression (MLR) model as a linear combination of various features evaluated for each essay and human scores as the outcome variable. This study evaluates alternative scoring models based on several additional machine learning algorithms, including support vector machines (SVM), random forests (RF), and "k"-nearest neighbor regression (k-NN). The results suggest that models based on the SVM algorithm outperform MLR models in predicting human scores. Specifically, SVM-based models yielded the highest agreement between human and e-rater scores. Furthermore, compared with MLR, SVM-based models improved the agreement between human and e-rater scores at the ends of the score scale. In addition, the high correlation between SVM-based e-rater scores with external measures such as examinee's scores on the other parts of the test provided some validity evidence for SVM-based e-rater scores. Future research is encouraged to explore the generalizability of these findings.
- Published
- 2016
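A minimal sketch of the MLR-versus-SVM comparison described above, assuming scikit-learn, a synthetic stand-in for the e-rater feature matrix, and a 1-6 human score scale; the features, scale, and data are illustrative, not taken from the report.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                               # stand-in for e-rater features
y = np.clip(np.round(X[:, :3].sum(axis=1) / 2 + 3.5), 1, 6)   # synthetic human scores, 1-6

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("MLR", LinearRegression()), ("SVM", SVR(kernel="rbf"))]:
    pred = np.clip(np.round(model.fit(X_tr, y_tr).predict(X_te)), 1, 6)
    # Quadratic weighted kappa, a common human-machine agreement statistic.
    qwk = cohen_kappa_score(y_te.astype(int), pred.astype(int), weights="quadratic")
    print(f"{name}: QWK = {qwk:.3f}")
```

Quadratic weighted kappa is used here because it penalizes large human-machine disagreements more heavily; the report's exact evaluation criteria may differ.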
3. Predictive Modeling of Rater Behavior: Implications for Quality Assurance in Essay Scoring
- Author
- Bejar, Isaac I., Li, Chen, and McCaffrey, Daniel
- Abstract
We evaluate the feasibility of developing predictive models of rater behavior, that is, "rater-specific" models for predicting the scores produced by a rater under operational conditions. In the present study, the dependent variable is the score assigned to essays by a rater, and the predictors are linguistic attributes of the essays used by the e-rater® engine. Specifically, for each rater, the linear regression of rater scores on the linguistic attributes is obtained from data for two consecutive time periods, and the regression from each period is cross-validated against data from the other period. Raters were characterized in terms of their level of predictability and the importance of the predictors. Results suggest that rater models capture stable individual differences among raters. To evaluate the feasibility of using rater models as a quality control mechanism, we examined the relationship between rater predictability and both inter-rater agreement and performance on prescored essays. Finally, we conducted a simulation in which raters score exclusively as a function of essay length at different points during the scoring day. We concluded that predictive rater models merit further investigation as a quality control mechanism for human scoring. (See the sketch following this entry.)
- Published
- 2020
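A sketch of the cross-period, rater-specific modeling design described above, on synthetic data; the number of features and essays, and the predictability summary, are assumptions rather than the study's specification.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rater_predictability(X1, y1, X2, y2):
    """Fit each period's regression and validate it on the other period."""
    r12 = np.corrcoef(LinearRegression().fit(X1, y1).predict(X2), y2)[0, 1]
    r21 = np.corrcoef(LinearRegression().fit(X2, y2).predict(X1), y1)[0, 1]
    return (r12 + r21) / 2  # higher = more stable, more predictable rater

# Synthetic essays: 8 linguistic features per essay, two time periods.
rng = np.random.default_rng(1)
X1, X2 = rng.normal(size=(300, 8)), rng.normal(size=(300, 8))
w = rng.normal(size=8)                       # this rater's implicit feature weights
y1 = X1 @ w + rng.normal(scale=0.5, size=300)
y2 = X2 @ w + rng.normal(scale=0.5, size=300)

print(f"cross-period predictability: {rater_predictability(X1, y1, X2, y2):.3f}")
```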
4. Length of Textual Response as a Construct-Irrelevant Response Strategy: The Case of Shell Language. Research Report. ETS RR-13-07
- Author
- Bejar, Isaac I., VanWinkle, Waverely, Madnani, Nitin, Lewis, William, and Steier, Michael
- Abstract
The paper applies a natural language computational tool to study a potential construct-irrelevant response strategy, namely the use of "shell language." Although the study is motivated by the impending increase in the volume of scoring of students' responses from assessments developed in response to the Race to the Top initiative, the data for the study were obtained from the GRE™ Analytical Writing measure. The functioning of the shell-detection computational tool was first evaluated by applying it to a corpus of over 200,000 issue and argument essays and by examining whether the shell language score agreed with the characterization of shell by two scoring experts. It was concluded that the computational tool worked well. The tool was then used to select essays for rescoring to determine whether the presence of shell language had had an effect on the operational scores they received. We found no evidence of such an effect. However, we did find a leniency effect in the operational scores: the essays that were rescored as part of this project received lower scores than their operational scores. The validity implications of these results are discussed. (See the sketch following this entry.)
- Published
- 2013
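The report's detection tool is not specified in enough detail to reproduce; the toy stand-in below scores an essay by the fraction of its sentences that match generic "shell" phrases. The phrase list is invented for illustration.

```python
import re

# Invented examples of formulaic, content-free "shell" phrasing.
SHELL_PATTERNS = [
    r"the author makes several assumptions",
    r"in today'?s society",
    r"this argument is flawed for the following reasons",
]

def shell_score(essay: str) -> float:
    """Fraction of sentences containing a shell phrase."""
    sentences = [s for s in re.split(r"[.!?]+\s*", essay.lower()) if s]
    if not sentences:
        return 0.0
    hits = sum(any(re.search(p, s) for p in SHELL_PATTERNS) for s in sentences)
    return hits / len(sentences)

print(shell_score("In today's society, tests matter. The author makes several assumptions."))
```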
5. Toward an Understanding of the Role of Speech Recognition in Nonnative Speech Assessment. TOEFL iBT Research Report. TOEFL iBT-02. ETS RR-07-02
- Author
- Zechner, Klaus, Bejar, Isaac I., and Hemat, Ramin
- Abstract
The increasing availability and performance of computer-based testing has prompted more research on the automatic assessment of language and speaking proficiency. In this investigation, we evaluated the feasibility of using an off-the-shelf speech-recognition system for scoring spoken responses to prompts from the 2002 LanguEdge field test. We first established the level of agreement between two trained scorers. We then adapted a speech engine to the language backgrounds and proficiency ranges of the speakers and developed a classification and regression tree (CART) for each of five prompts based on features computed from the output of the speech recognizer. In a validation on held-out data, we found that, although the features are not sufficiently comprehensive to score these responses adequately, collectively they appear to capture some aspects of speaking proficiency reliably. (See the sketch following this entry.)
- Published
- 2007
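A sketch of the pipeline described above: derive simple fluency features from recognizer output, then fit a per-prompt CART model against human scores. The features, data, and score scale are illustrative; the report's feature set is not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fluency_features(words, timestamps):
    """Words per second plus mean and max inter-word pause, from ASR output."""
    duration = timestamps[-1] - timestamps[0]
    rate = len(words) / duration if duration > 0 else 0.0
    pauses = np.diff(timestamps)
    return [rate, pauses.mean(), pauses.max()]

print(fluency_features(["this", "is", "a", "test"], np.array([0.0, 0.4, 0.7, 1.5])))

# CART per prompt: rows would come from fluency_features over real
# recognizer output; here they are synthetic stand-ins.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = np.clip(np.round(X[:, 0] + 3), 1, 5)        # synthetic human scores, 1-5

tree = DecisionTreeRegressor(max_depth=3).fit(X[:150], y[:150])
print("held-out r:", np.corrcoef(tree.predict(X[150:]), y[150:])[0, 1])
```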
6. Automated Tools for Subject Matter Expert Evaluation of Automated Scoring. Research Report. ETS RR-04-14
- Author
- Williamson, David M., Bejar, Isaac I., and Sax, Anne
- Abstract
As automated scoring of complex constructed-response examinations reaches operational status, the process of evaluating the quality of the resultant scores, particularly in contrast to scores of expert human graders, becomes as complex as the data themselves. Using a vignette from the Architectural Registration Examination (ARE), this paper explores the potential utility of classification and regression trees (CART) and Kohonen self-organizing maps (SOM) as tools to facilitate subject matter expert (SME) examination of the fine-grained (feature-level) quality of automated scores for complex data, with implications for the validity of the resultant scores. The paper explores both supervised and unsupervised learning techniques, the former represented by CART (Breiman, Friedman, Olshen, & Stone, 1984) and the latter by SOM (Kohonen, 1989). Three applications comprise this investigation. The first suggests that CART can facilitate efficient and economical identification of the specific elements of complex solutions that contribute to automated-human score discrepancies. The second builds on the first by exploring CART's value for efficiently and accurately automating case selection for human intervention to ensure score validity. The final application explores the potential for SOM to reduce the need for SMEs in evaluating automated scoring. While both the supervised and unsupervised methodologies examined were found to be promising tools for facilitating SME roles in maintaining and improving the quality of automated scoring, such applications remain unproven, and further studies are necessary to establish the reliability of these techniques. (See the sketch following this entry.)
- Published
- 2004
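A sketch of the first application under stated assumptions: grow a classification tree that predicts whether automated and human scores disagree, then print its rules so SMEs can inspect the splitting features. The feature names and the discrepancy flag are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = rng.normal(size=(326, 5))             # stand-in solution features
disagree = (X[:, 2] > 1.0).astype(int)    # synthetic human-machine discrepancy flag

tree = DecisionTreeClassifier(max_depth=2).fit(X, disagree)
# The printed rules point SMEs at the specific solution elements that
# drive score discrepancies.
print(export_text(tree, feature_names=[f"feat_{i}" for i in range(5)]))
```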
7. Kohonen Self-Organizing Maps in Validity Maintenance for Automated Scoring of Constructed Response.
- Author
- Williamson, David M. and Bejar, Isaac I.
- Abstract
As the automated scoring of constructed responses reaches operational status, monitoring the scoring process becomes a primary concern, particularly if automated scoring is intended to operate completely unassisted by humans. Using actual candidate solutions from the Architectural Registration Examination (n=326), this study uses Kohonen self-organizing maps (SOM) to build on previous research (D. Williamson, A. Hone, S. Miller, and I. Bejar, 1998) suggesting that classification trees are a useful means of validity maintenance. Classification trees can assist in identifying sources of disagreement between human and automated scoring, identify tendencies for human graders to overlook elementary or complex solutions, and provide significant efficiency in future case selection for human intervention. Because classification trees require a criterion value of score discrepancy between human and automated scores, Kohonen SOM offer an advantage: through neural networks, they can group similar solutions without requiring prior human grading. Results suggest that Kohonen SOM could be used to classify solutions prior to human grading and classification tree analyses, providing a 43% reduction in the human grading required. However, further analyses are needed to establish whether classification trees would produce similar results with a reduced sample based on Kohonen SOM classifications. (See the sketch following this entry.)
- Published
- 2000
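A minimal self-organizing map in plain NumPy, sketching how solutions could be grouped before any human grading; the grid size, learning schedule, and data are all illustrative, not the study's configuration.

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM: each cell's weight vector drifts toward nearby inputs."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)               # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5   # shrinking neighborhood
        for x in rng.permutation(data):
            # Best-matching unit, then neighborhood-weighted update.
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), grid)
            d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
    return weights

def assign(weights, x):
    """Map a solution's feature vector to its SOM cell."""
    return np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), weights.shape[:2])

data = np.random.default_rng(1).normal(size=(326, 6))  # stand-in solution features
w = train_som(data)
print(assign(w, data[0]))
```

Solutions landing in the same cell are similar, so a human need grade only a few representatives per cell, which is the source of the labor reduction the study reports.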
8. Classification Trees for Quality Control Processes in Automated Constructed Response Scoring.
- Author
- Williamson, David M., Hone, Anne S., Miller, Susan, and Bejar, Isaac I.
- Abstract
As the automated scoring of constructed responses reaches operational status, the issue of monitoring the scoring process becomes a primary concern, particularly when the goal is to have automated scoring operate completely unassisted by humans. Using a vignette from the Architectural Registration Examination and data for 326 cases with both human and computer scores available, this study reports on the usefulness of an approach based on classification trees (L. Breiman, J. Friedman, R. Olshen, and C. Stone, 1984) as a means of quality control. Five studies analyzed different aspects of the "training set" and cross-validated the results by applying the resulting classification trees to data that had not been used in developing the trees. The application of classification trees led to valuable insights with implications for operational quality control processes. Furthermore, classification tree methods were shown to select cases for future quality control processes accurately and efficiently, suggesting that future quality control selection procedures may be completely automated. However, further analyses are needed to establish whether classification trees can reliably identify the cases most likely to require adjustment without incurring the potentially costly error of overlooking other solutions that also require adjustment. (See the sketch following this entry.)
- Published
- 1998
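A sketch of the quality-control use under stated assumptions: a tree trained on one half of the scored cases is cross-validated on the other half, and held-out cases it predicts as discrepant are routed to human review. The data, threshold, and review rule are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(326, 5))                                  # stand-in case features
needs_review = (np.abs(X[:, 0] + X[:, 3]) > 1.5).astype(int)   # synthetic discrepancy flag

half = len(X) // 2
tree = DecisionTreeClassifier(max_depth=3).fit(X[:half], needs_review[:half])
flag = tree.predict(X[half:])

sent = flag.mean()                                             # workload routed to humans
missed = ((flag == 0) & (needs_review[half:] == 1)).mean()     # the costly error: overlooked cases
print(f"flagged for review: {sent:.1%}, missed cases: {missed:.1%}")
```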
9. A Validity-Based Approach to Quality Control and Assurance of Automated Scoring
- Author
- Bejar, Isaac I.
- Abstract
Automated scoring of constructed responses is already operational in several testing programmes. However, as the methodology matures and the demand for the utilisation of constructed responses increases, the volume of automated scoring is likely to increase at a fast pace. Quality assurance and control of the scoring process will likely be more prominent as a result. A validity-based approach to ensure the quality of scores based on automated means is proposed. First, an argument is made for aligning quality assurance/control with validity, rather than viewing them as separate processes. Second, to pre-emptively avoid quality and design defects that would erode the validity of scores, it is argued that quality must be "designed in" through the assessment design process by attending to the interdependencies among the different components of an assessment, including scoring. Third, key elements of the design of scoring engines, evidence extraction and evidence synthesis, are discussed as a further avenue for maintaining score quality.
- Published
- 2011
10. Automated Tools for Subject Matter Expert Evaluation of Automated Scoring
- Author
- Williamson, David M., Bejar, Isaac I., and Sax, Anne
- Abstract
As automated scoring of complex constructed-response examinations reaches operational status, the process of evaluating the quality of the resultant scores, particularly in contrast to scores of expert human graders, becomes as complex as the data themselves. Using a vignette from the Architectural Registration Examination (ARE), this article explores the potential utility of Classification and Regression Trees (CART) and Kohonen Self-Organizing Maps (SOM) as tools to facilitate subject matter expert (SME) examination of the fine-grained (feature-level) quality of automated scores for complex data, with implications for the validity of the resultant scores. This article explores both supervised and unsupervised learning techniques, the former represented by CART (Breiman, Friedman, Olshen, & Stone, 1984) and the latter by SOM (Kohonen, 1989). Three applications comprise this investigation. The first suggests that CART can facilitate efficient and economical identification of the specific elements of complex responses that contribute to automated and human score discrepancies. The second builds on the first by exploring CART for efficiently and accurately automating case selection for human intervention to ensure score validity. The final application explores the potential for SOM to reduce the need for SMEs in evaluating automated scoring. Although both the supervised and unsupervised methodologies examined were found to be promising tools for facilitating SME roles in maintaining and improving the quality of automated scoring, such applications remain unproven, and further studies are necessary to establish the reliability of these techniques.
- Published
- 2004
11. 'Mental Model' Comparison of Automated and Human Scoring.
- Author
- Williamson, David M., Bejar, Isaac I., and Hone, Anne S.
- Abstract
Contrasts "mental models" used by automated scoring for the simulation division of the computerized Architect Registration Examination with those used by experienced human graders for 3,613 candidate solutions. Discusses differences in the models used and the potential of automated scoring to enhance the validity evidence of scores. (SLD)
- Published
- 1999
12. An Information Comparison of Conventional and Adaptive Tests in the Measurement of Classroom Achievement. Research Report 77-7.
- Author
- Bejar, Isaac I. (Minnesota Univ., Minneapolis, Dept. of Psychology)
- Abstract
Information provided by typical and improved conventional classroom achievement tests was compared with information provided by an adaptive test covering the same subject matter. Both tests were administered to over 700 college students in a general biology course. Using the same scoring method, adaptive testing was found to yield substantially more precise estimates of achievement level than the classroom test throughout the entire range of achievement, even in the range where the improved conventional test was designed to be optimal. Adaptive testing also made it possible to reduce the length of the test. An analysis of the effects of expanding an adaptive test item pool indicated that improved precision of measurement could result from the addition of only slightly more discriminating items. A comparison of response pattern information values (observed information) with test information values (theoretical information) showed that the observed information consistently underestimated theoretical information, although the pattern of results from the two procedures was quite similar. It was concluded that the adaptive measurement of classroom achievement results in scores that are less likely to be confounded by errors of measurement and, therefore, are more likely to reflect the true level of achievement. (See the sketch following this entry.)
- Published
- 1977
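The information comparison above rests on IRT test information; below is a minimal sketch under the two-parameter logistic (2PL) model with invented item parameters. The report's actual model and items may differ.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def test_information(theta, a, b):
    """Sum of item informations: I_i(theta) = a_i^2 * P * (1 - P)."""
    p = p_2pl(theta[:, None], a[None, :], b[None, :])
    return (a[None, :] ** 2 * p * (1 - p)).sum(axis=1)

theta = np.linspace(-3, 3, 7)
a = np.array([1.0, 1.5, 0.8, 2.0])   # discriminations (illustrative)
b = np.array([-1.0, 0.0, 0.5, 1.0])  # difficulties (illustrative)
for t, i in zip(theta, test_information(theta, a, b)):
    print(f"theta={t:+.1f}  I(theta)={i:.2f}")
```

An adaptive test selects items whose information peaks near the examinee's current ability estimate, which is why it measures precisely across the whole ability range rather than only where a fixed form was targeted.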
13. A Preliminary Study of Raters for the Test of Spoken English.
- Author
- Bejar, Isaac I. (Educational Testing Service, Princeton, NJ)
- Abstract
The feasibility of reducing scoring costs for the Test of Spoken English (TSE) by using one rater rather than the current two was investigated. It was found that, because potential raters may apply different standards, it does not appear feasible to use a single rater as the sole determiner of speaking proficiency under the current system. Other alternatives were also examined. One approach was the development of a quality control index that would predict the extent of disagreement between two raters, but the index that was developed could not be validated. The best predictors of rater disagreement were the identities of the raters; their disagreements, however, resulted from the differing standards they used. Raters agreed substantially about the ordering of examinees but varied slightly in the severity of their ratings. (See the sketch following this entry.)
- Published
- 1985
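A sketch of the closing finding: two simulated raters who order examinees almost identically while one is slightly harsher. The scale, noise levels, and severity gap are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
true_skill = rng.normal(size=200)
rater_a = true_skill + rng.normal(scale=0.3, size=200)          # baseline rater
rater_b = true_skill + rng.normal(scale=0.3, size=200) - 0.25   # slightly harsher rater

r = np.corrcoef(rater_a, rater_b)[0, 1]        # agreement on ordering examinees
severity_gap = rater_a.mean() - rater_b.mean() # systematic difference in standards
print(f"ordering agreement r = {r:.2f}, severity gap = {severity_gap:.2f}")
```

High correlation with a nonzero mean shift is exactly the pattern that makes a single rater unsafe: the ordering is trustworthy, but any one rater's standard moves every score.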
14. Relational Databases in Assessment: An Application to Online Scoring.
- Author
- Whalen, Sean J. and Bejar, Isaac I.
- Abstract
Discusses the design of online scoring software for computer-based education that supports high psychometric and fairness standards. Describes a system from the point of view of graders and supervisors, the design of the underlying database, data integrity and confidentiality, and the advantages of database design and online grading. (See the sketch following this entry.)
- Published
- 1998
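A toy relational schema for online scoring, sketched with Python's built-in sqlite3 module; the tables, columns, and score range are invented for illustration and do not reproduce the paper's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rater    (rater_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE response (response_id INTEGER PRIMARY KEY, examinee_id INTEGER NOT NULL);
CREATE TABLE score (
    score_id    INTEGER PRIMARY KEY,
    response_id INTEGER NOT NULL REFERENCES response(response_id),
    rater_id    INTEGER NOT NULL REFERENCES rater(rater_id),
    score       INTEGER NOT NULL CHECK (score BETWEEN 1 AND 6),
    scored_at   TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (response_id, rater_id)   -- a rater scores a given response once
);
""")

# Supervisors can audit double-scored responses with a self-join:
conn.execute("""
SELECT s1.response_id, s1.score, s2.score
FROM score s1 JOIN score s2
  ON s1.response_id = s2.response_id AND s1.rater_id < s2.rater_id
""")
```

Integrity constraints of this kind (foreign keys, score-range checks, one score per rater per response) are one way a relational design supports the psychometric and fairness standards the abstract mentions.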