101. University of Amsterdam at the CLEF 2022 SimpleText Track
- Author
- Mostert, F., Sampatsing, A., Spronk, M., Rau, D., Kamps, J., Faggioli, G., Ferro, N., Hanbury, A., Potthast, M., and ILLC (FGw)
- Abstract
This paper reports on the University of Amsterdam’s participation in the CLEF 2022 SimpleText track. The overall goal of removing barriers that prevent the general public from accessing scientific literature is of great importance to help users make sense of a world of misinformation and shallow opinions. We perform preliminary studies within the track’s setup, analyzing text complexity when searching a large set of academic abstracts in the context of popular science topics emerging in the news, with a specific focus on the relation between the topical relevance and the text complexity of the retrieved information. Our main findings are the following. First, we analyzed a large corpus of scientific abstracts and confirmed that these are highly complex on average, but that the variation is large and many abstracts with accessible readability levels exist. Second, we ran retrieval experiments and found that standard search ignores readability, yet filtering on the desired reading level still retains competitive performance while avoiding the retrieval of relevant but incomprehensible results. Third, we ran complexity spotting experiments and found that straightforward lexical complexity or term frequency measures are strong indicators, but have to be combined with the importance of the concept in the broader context of the information request. Fourth, we ran a GPT-2 based text simplification model in a zero-shot way, which resulted in conservative rewriting of abstracts that significantly reduced their text complexity. More generally, our results demonstrate that text complexity is an essential aspect to consider for improving non-expert access to scientific information, and open up new routes to develop effective scientific information access technology tailored to the needs of the general public.
- Published
- 2022