Scientific summary

In Flanders, Belgium, most international L2 students are required to pass a B2 Dutch language test before they can enroll at university. Students who spent the final two years of secondary school in Flanders are exempt from this requirement. The two main tests used in the international L2 admission policy are ITNA and STRT. Even though these tests are used for the same purpose, they differ substantially in operationalization. ITNA’s computer-based written component consists of selected-response tasks that rely on vocabulary, grammar, and receptive skills. STRT’s written component is task-based and integrated. The task types used in the oral sections of STRT and ITNA are very similar; both feature an argumentation and a presentation task. The scoring procedures differ, however: ITNA is scored in situ, whereas STRT is recorded and rated centrally. Both tests share five linguistic rating criteria in the oral component, but STRT also assigns substantial importance to content criteria, which ITNA does not.

This research project investigated the effectiveness of the Flemish university entrance policy for international L2 students from three perspectives: Constructs & levels, Selection & discrimination, and Prediction & gains. These three research angles correspond to three assumptions that must logically support any university entrance policy that relies on two or more high-stakes language tests for gatekeeping purposes.

First of all, universities must assume that their gatekeeping policy is useful. At a minimum, this implies differentiating at an appropriate language level. A level that is too low will admit people who are not ready for real-life linguistic demands, and a level that is too high will exclude people who might have managed, given the opportunity. Useful tests should have demonstrable content relevance: they should include relevant skills and representative task types.
Scores of high-stakes tests should not be based on construct-irrelevant tasks or criteria. Additionally, a useful gatekeeping policy must be as watertight as possible. Consequently, if one population is required to take an entrance test while another population is not, it must be empirically shown that all people who are exempt from taking the test are able to pass it. If this is not the case, there are gaps in the fence, defeating the purpose of a gate.

Secondly, if a policy considers two tests as equivalent measures of the same level of language ability, empirical data should confirm that both tests measure at the same level. If two tests are considered adequate measures of B2 language ability and have the same societal impact, both tests should operationalize comparable criteria in a comparable way. Naturally, since the B2 level is rather broad, there must be some leeway, but given the stakes involved in university entrance testing, the percentage of people passing one test but failing the other should be kept to a minimum.

The last assumption concerns the policy after the language test has been administered. University admission officers often assume that even if international L2 students enter university with a language level that is below the level of their Flemish peers, they will make the required language gains by virtue of attending class in a Dutch-medium context. The relatively scant research in this field does not fully support this assumption, but no research has been conducted on spoken language gains, or on gains in a Dutch-medium context.

Prior to this project, none of these assumptions had been verified by empirical research.

Constructs & levels

The first study compared the operationalization of STRT and ITNA to the real-life demands of academia, and investigated to what extent passing the test entailed preparedness for the linguistic demands of university.
The study combined the opinions and experiences of 24 university staff members and 31 international L2 students, 20 of whom were tracked longitudinally after taking both ITNA and STRT. The results revealed that the real-life language demands at Flemish universities sometimes deviate crucially from the test content, that L2 students who passed ITNA, or STRT, or both, were not ready for the receptive demands of academia, and that four of the seven students who had failed STRT or ITNA actually performed well at university. This study, in short, showed that neither of the B2 tests used in the Flemish university entrance policy discriminates between people who will manage the linguistic demands of academia and those who will not.

The second study checked the assumption that Flemish students meet the B2 level that is a mandatory entrance requirement for international L2 students. Since students who have graduated from a Flemish high school do not have to sit a language test, the implicit assumption behind the entrance policy is that all Flemish students have attained the B2 level in Dutch at the end of secondary education. If not all Flemish students attain this level as measured by one of the entrance tests, the entrance policy does not succeed in adequately maintaining a minimum level of B2 language ability among the student population. To examine this, 159 first-year Flemish L1 students sat two written STRT tasks during their first month of university education. All L1 performances were randomly assigned to trained raters and were double rated. Using nonparametric statistics and multi-faceted Rasch analysis, the L1 scores were compared against two groups of L2 candidates (L2 candidates who studied Dutch abroad, N = 629, and L2 candidates who studied Dutch in Flanders, N = 116). The results showed that L1 students outperformed L2 students overall, but that L2 students who had studied Dutch abroad achieved higher scores on content criteria.
Flemish students scored higher on formal criteria, notably grammar and vocabulary, and on the overall level, but were outperformed on content criteria. Importantly, this does not mean that all L1 students pass STRT: 11% of the L1 students did not attain the B2 level. The results of this study showed that not all students who are exempt from taking a language test would pass it.

Selection & discrimination

Two studies examined whether STRT and ITNA can be considered equivalent measures of B2 ability. The first study considered level and construct equivalence, and the second study focused specifically on the equivalence of corresponding CEFR-based criteria.

Relying on the scores of 118 participants who took STRT and ITNA within the same week, the first study showed that the overall correlation between STRT and ITNA scores was moderately high (r = .767**), as was the correlation for the written components (r = .694**). The agreement between the scores on the oral tests was much lower, however (τ = .387**). Additional analyses revealed further discrepancies, nuancing the conclusions that could be drawn from the overall correlation alone. First of all, the pass probability is significantly (p = .02) larger for STRT (.50) than for ITNA (.35). Second, linear regression and multi-faceted Rasch analyses showed important discrepancies in terms of constructs. ITNA’s vocabulary and grammar tasks are more difficult than any other written ITNA or STRT task, while STRT’s argumentative tasks are the easiest written tasks. Additionally, Rasch analysis (reliability = .88) found that ITNA’s spoken component is more difficult than STRT’s, again because of the relative difficulty of linguistic criteria in the former and the relative easiness of content criteria in the latter.

The second study considered the scores on the oral components of STRT and ITNA. These components are very similar in terms of task type and rating criteria.
Both tests include five criteria that are based on the same CEFR descriptors. The analyses (linear and multiple regression and multi-faceted Rasch) showed that ITNA and STRT interpreted the B2 level differently for every CEFR-based criterion. Weighted kappa coefficients were low for every corresponding criterion (κw ≤ .216), and corresponding criteria were never included within the same difficulty band of the multi-faceted Rasch analysis. In sum, this study did not find evidence supporting the hypothesis that STRT and ITNA operationalize corresponding CEFR-based criteria in comparable tasks in a comparable way. Furthermore, the study showed that using the same language proficiency scales as the basis for rating scale development may lead to superficial correspondences or perceived equivalence, but does not necessarily lead to greater comparability of shared criteria.

Prediction & gains

To date, little research has focused on how well L2 students cope linguistically in the target language use context in the months after the entrance test. Even fewer studies have taken a qualitative, longitudinal perspective on this. In this study, 20 international L2 students were tracked during their first academic year at a Flemish university. After eight months, they took two STRT tasks again. The results showed that the respondents had made no significant gains in terms of STRT score, or in terms of complexity, accuracy, or fluency measures. The only significant difference was a decrease in the number of words used in the oral presentation task. The interview data were analyzed for interactional and institutional variables that may help explain the absence of gains. The analyses showed that nearly all respondents experienced social and academic isolation, and reported a perceived lack of institutional support. An important reason why the respondents made such limited gains was likely their limited exposure to meaningful interaction with L1 speakers.
Conclusion

The results of the research project disproved the main claims that support the Flemish university entrance policy, and cast doubt on its effectiveness. The university admission policy is unlikely to guarantee a consistent minimum language level among students, since the tests used cannot be considered equivalent, and since one in ten people who are exempt from taking the test do not pass it. Moreover, at B2, the minimum performance level is below the real-life requirements, and the tasks used to elicit B2 performances are not always in line with actual language tasks. Lastly, international L2 students who do enroll after passing the entrance test make very few language gains, and do not easily gain access to the academic community. In other words, the tests do little to facilitate integration.

status: published