246 results on '"Goggi, Sara"'
Search Results
2. Correction to: The LRE Map: what does it tell us about the last decade of our field?
- Author
-
Del Gratta, Riccardo, Goggi, Sara, Pardelli, Gabriella, and Calzolari, Nicoletta
- Published
- 2021
3. AGILe: The First Lemmatizer for Ancient Greek Inscriptions
- Author
-
de Graaf, Evelien, Stopponi, Silvia, Bos, Jasper, Peels-Matthey, Saskia, Nissim, Malvina, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, Computational Linguistics (CL), Theoretical and Empirical Linguistics (TEL), and Research Centre for Historical Studies (CHS)
- Subjects
ancient Greek, lemmatizer, digital classics
- Abstract
To facilitate corpus searches by classicists as well as to reduce data sparsity when training models, we focus on the automatic lemmatization of ancient Greek inscriptions, which have not received as much attention in this sense as literary text data has. We show that existing lemmatizers for ancient Greek, trained on literary data, are not performant on epigraphic data, due to major language differences between the two types of texts. We thus train the first inscription-specific lemmatizer achieving above 80% accuracy, and make both the models and the lemmatized data available to the community. We also provide a detailed error analysis of the peculiarities of inscriptions, which again highlights the importance of a lemmatizer dedicated to inscriptions.
- Published
- 2022
4. Proceedings of the Language Resources and Evaluation Conference
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, LS OZ Taal en spraaktechnologie, and ILS LLI
- Subjects
Artificial Intelligence, Language and Linguistics
- Published
- 2022
5. Evaluating Pre-training Objectives for Low-Resource Translation into Morphologically Rich Languages
- Author
-
Dhar, Prajit, Bisazza, Arianna, van Noord, Gertjan, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, and Computational Linguistics (CL)
- Abstract
The scarcity of parallel data is a major limitation for Neural Machine Translation (NMT) systems, in particular for translation into morphologically rich languages (MRLs). An important way to overcome the lack of parallel data is to leverage target monolingual data, which is typically more abundant and easier to collect. We evaluate a number of techniques to achieve this, ranging from back-translation to random token masking, on the challenging task of translating English into four typologically diverse MRLs, under low-resource settings. Additionally, we introduce Inflection Pre-Training (or PT-Inflect), a novel pre-training objective whereby the NMT system is pre-trained on the task of re-inflecting lemmatized target sentences before being trained on standard source-to-target language translation. We conduct our evaluation on four typologically diverse target MRLs, and find that PT-Inflect surpasses NMT systems trained only on parallel data. While PT-Inflect is outperformed by back-translation overall, combining the two techniques leads to gains in some of the evaluated language pairs.
- Published
- 2022
6. Introducing Frege to Fillmore: A FrameNet Dataset that Captures both Sense and Reference
- Author
-
Remijnse, Levi, Vossen, Piek, Fokkens, Antske, Titarsolej, Sam, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
lexicon, annotation tool, frame semantics, reference, events, SDG 4 - Quality Education
- Abstract
This article presents the first output of the Dutch FrameNet annotation tool, which facilitates both referential- and frame-annotations of language-independent corpora. On the referential level, the tool links in-text mentions to structured data, grounding the text in the real world. On the frame level, those same mentions are annotated with respect to their semantic sense. This way of annotating not only generates a rich linguistic dataset that is grounded in real-world event instances, but also guides the annotators in frame identification, resulting in high inter-annotator-agreement and consistent annotations across documents and at discourse level, exceeding traditional sentence level annotations of frame elements. Moreover, the annotation tool features a dynamic lexical lookup that increases the development of a cross-domain FrameNet lexicon.
- Published
- 2022
7. Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method
- Author
-
Verkijk, Stella, Vossen, Piek, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
SDG 17 - Partnerships for the Goals, Medical Text Data, Anonymization, Language Model
- Abstract
Neural Network (NN) architectures are used more and more to model large amounts of data, such as text data available online. Transformer-based NN architectures have been shown to be very useful for language modelling. Although many researchers study how such Language Models (LMs) work, not much attention has been paid to the privacy risks of training LMs on large amounts of data and publishing them online. This paper presents a new method for anonymizing a language model by presenting the way in which MedRoBERTa.nl, a Dutch language model for hospital notes, was anonymized. The two-step method involves i) automatic anonymization of the training data and ii) semi-automatic anonymization of the LM's vocabulary. Adopting the fill-mask task where the model predicts what tokens are most probable to appear in a certain context, it was tested how often the model will predict a name in a context where a name should be. It was shown that it predicts a name-like token 0.2% of the time. Any name-like token that was predicted was never the name originally presented in the training data. By explaining how an LM trained on highly private real-world medical data can be safely published with open access, we hope that more language resources will be published openly and responsibly so the community can profit from them.
- Published
- 2022
8. Story Trees: Representing Documents using Topological Persistence
- Author
-
Haghighatkhah, Pantea, Fokkens, Antske, Sommerauer, Pia, Speckmann, Bettina, Verbeek, Kevin, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
Semantic Vectors, SDG 4 - Quality Education, Document level discourse, Topical Data Analysis
- Abstract
Topological Data Analysis (TDA) focuses on the inherent shape of (spatial) data. As such, it may provide useful methods to explore spatial representations of linguistic data (embeddings) which have become central in NLP. In this paper we aim to introduce TDA to researchers in language technology. We use TDA to represent document structure as so-called story trees. Story trees are hierarchical representations created from semantic vector representations of sentences via persistent homology. They can be used to identify and clearly visualize prominent components of a story line. We showcase their potential by using story trees to create extractive summaries for news stories.
- Published
- 2022
9. Proceedings of the Language Resources and Evaluation Conference
- Author
-
LS OZ Taal en spraaktechnologie, ILS LLI, Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, and Piperidis, Stelios
- Published
- 2022
10. The Universal Anaphora Scorer
- Author
-
Sub Natural Language Processing, Natural Language Processing, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, Piperidis, Stelios, Yu, Juntao, Khosla, Sopan, Moosavi, Nafise Sadat, Paun, Silviu, Pradhan, Sameer, and Poesio, Massimo
- Published
- 2022
11. What a Creole Wants, What a Creole Needs
- Author
-
Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, Piperidis, Stelios, Lent, Heather, Ogueji, Kelechi, de Lhoneux, Miryam, Ahia, Orevaoghene, and Søgaard, Anders
- Abstract
In recent years, the natural language processing (NLP) community has given increased attention to the disparity of efforts directed towards high-resource languages over low-resource ones. Efforts to remedy this delta often begin with translations of existing English datasets into other languages. However, this approach ignores that different language communities have different needs. We consider a group of low-resource languages, Creole languages. Creoles are both largely absent from the NLP literature, and also often ignored by society at large due to stigma, despite these languages having sizable and vibrant communities. We demonstrate, through conversations with Creole experts and surveys of Creole-speaking communities, how the things needed from language technology can change dramatically from one language to another, even when the languages are considered to be very similar to each other, as with Creoles. We discuss the prominent themes arising from these conversations, and ultimately demonstrate that useful language technology cannot be built without involving the relevant community.
- Published
- 2022
12. The Index Thomisticus Treebank as Linked Data in the LiLa Knowledge Base
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Odijk, Jan, Piperidis, Stelios, Mambrini, Francesco, Passarotti, Marco, Moretti, Giovanni, Pellegrini, Matteo, Mambrini Francesco (ORCID:0000-0003-0834-7562), Passarotti Marco (ORCID:0000-0002-9806-7187), and Pellegrini Matteo (ORCID:0000-0003-4378-5824)
- Abstract
Although the Universal Dependencies initiative today allows for cross-linguistically consistent annotation of morphology and syntax in treebanks for several languages, syntactically annotated corpora are not yet interoperable with many lexical resources that describe properties of the words that occur therein. In order to cope with such limitation, we propose to adopt the principles of the Linguistic Linked Open Data community, to describe and publish dependency treebanks as LLOD. In particular, this paper illustrates the approach pursued in the LiLa Knowledge Base, which enables interoperability between corpora and lexical resources for Latin, to publish as Linguistic Linked Open Data the annotation layers of two versions of a Medieval Latin treebank (the Index Thomisticus Treebank).
- Published
- 2022
13. SuMe: A Dataset Towards Summarizing Biomedical Mechanisms
- Author
-
Bastan, Mohaddeseh, Shankar, N., Surdeanu, Mihai, Balasubramanian, Niranjan, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Odijk, Jan, and Piperidis, Stelios
- Subjects
Biomedical NLP, Summarization, Text Generation, Explanation Generation, Relation Extraction
- Abstract
Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce conclusion sentence generation as a pretraining task with 611k instances. We benchmark the performance of large bio-domain language models. We find that while the pretraining task helps improve performance, the best model produces acceptable mechanism outputs in only 32% of the instances, which shows the task presents significant challenges in biomedical language understanding and summarization.
- Published
- 2022
14. Visualising Italian Language Resources: a Snapshot
- Author
-
Del Gratta, Riccardo, primary, Frontini, Francesca, additional, Monachini, Monica, additional, Pardelli, Gabriella, additional, Russo, Irene, additional, Bartolini, Roberto, additional, Goggi, Sara, additional, Khan, Fahad, additional, Quochi, Valeria, additional, Soria, Claudia, additional, and Calzolari, Nicoletta, additional
- Published
- 2015
15. Must children be vaccinated or not? Annotating modal verbs in the vaccination debate
- Author
-
King, Liza, Morante, Roser, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
genetic structures, SDG 3 - Good Health and Well-being, Annotation, otorhinolaryngologic diseases, Modality, Vaccination debate
- Abstract
The study of modal verbs in the growing vaccination debate reveals important insights into perspectives on vaccination: must children be vaccinated or are parents allowed not to vaccinate? How strong are the recommendations by pro- and anti-vaccination supporters? We present experimental work on annotation of modal verbs and their senses in texts related to the vaccination debate, as well as the resulting corpus. The results from our pilot study suggest that the most frequent type of modality was epistemic - indicating that participants in the debate appear to be more concerned with the safety and efficacy of vaccines than with moral arguments. Those against vaccination appear to be more committed or convinced of their views than those in favor, as evidenced by the use of the modal must.
- Published
- 2020
16. Annotating perspectives on vaccination
- Author
-
Morante Vallejo, Roser, van Son, Chantal Michelle, Maks, E., Vossen, Piek, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Network Institute, and Language
- Subjects
Attribution, SDG 3 - Good Health and Well-being, Claims, Vaccination debate, Opinions, Perspectives
- Abstract
In this paper we present the Vaccination Corpus, a corpus of texts related to the online vaccination debate that has been annotated with three layers of information about perspectives: attribution, claims and opinions. Additionally, events related to the vaccination debate are also annotated. The corpus contains 294 documents from the Internet which reflect different views on vaccinations. It has been compiled to study the language of online debates, with the final goal of experimenting with methodologies to extract and contrast perspectives within the vaccination debate.
- Published
- 2020
17. Detecting negation cues and scopes in Spanish
- Author
-
Jiménez-Zafra, Salud María, Morante, Roser, Blanco, Eduardo, Martín-Valdivia, María Teresa, Alfonso Ureña-López, L., Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
Scope identification, SFU Review-NEG corpus, Negation cues detection, Negation processing, Spanish
- Abstract
In this work we address the processing of negation in Spanish. We first present a machine learning system that processes negation in Spanish. Specifically, we focus on two tasks: i) negation cue detection and ii) scope identification. The corpus used in the experimental framework is the SFU ReviewSP-NEG. The results for cue detection outperform state-of-the-art results, whereas for scope detection this is the first system that performs the task for Spanish. Moreover, we provide a qualitative error analysis aimed at understanding the limitations of the system and showing which negation cues and scopes are straightforward to predict automatically, and which ones are challenging.
- Published
- 2020
18. A shared task of a new, collaborative type to foster reproducibility: A first exercise in the area of language science and technology with REPROLANG2020
- Author
-
Branco, António, Calzolari, Nicoletta, Vossen, Piek, van Noord, Gertjan, van Uytvank, Dieter, Silva, João, Gomes, Luís, Moreira, André, Elbers, Willem, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Language, and Network Institute
- Subjects
Natural language processing, Reproduction, Replication, Computational linguistics, Language technology
- Abstract
In this paper, we introduce a new type of shared task - which is collaborative rather than competitive - designed to support and foster the reproduction of research results. We also describe the first event running such a novel challenge, present the results obtained, discuss the lessons learned and ponder on future undertakings.
- Published
- 2020
19. MAGPIE: A Large Corpus of Potentially Idiomatic Expressions
- Author
-
Haagsma, Hessel, Bos, Johan, Nissim, Malvina, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, and Piperidis, Stelios
- Abstract
Given the limited size of existing idiom corpora, we aim to enable progress in automatic idiom processing and linguistic analysis by creating the largest-to-date corpus of idioms for English. Using a fixed idiom list, automatic pre-extraction, and a strictly controlled crowdsourced annotation procedure, we show that it is feasible to build a high-quality corpus comprising more than 50K instances, an order of magnitude larger than previous resources. Crucial ingredients of crowdsourcing were the selection of crowdworkers, clear and comprehensive instructions, and an interface that breaks down the task into small, manageable steps. Analysis of the resulting corpus revealed strong effects of genre on idiom distribution, providing new evidence for existing theories on what influences idiom usage. The corpus also contains rich metadata, and is made publicly available.
- Published
- 2020
20. BLISS: An agent for collecting spoken dialogue data about health and well-being
- Author
-
van Waterschoot, Jelte, Hendrickx, Iris, Khan, Arif, Klabbers, Esther, de Korte, Marcel, Strik, Helmer, Cucchiarini, Catia, Theune, Mariët, Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, and Human Media Interaction
- Subjects
Spoken dialogue system, Spoken Dutch, Conversational, Healthcare, Natural language generation, Happiness model
- Abstract
An important objective in health-technology is the ability to gather information about people's well-being. Structured interviews can be used to obtain this information, but these are time-consuming and not scalable. Questionnaires provide an alternative way to extract such information, yet they typically lack depth. In this paper, we present our first prototype of the Behaviour-based Language-Interactive Speaking Systems (BLISS), an artificial intelligent agent which intends to automatically discover what makes people happy and healthy. The goal of BLISS is to understand the motivations behind people's happiness by conducting a personalized spoken dialogue based on a happiness model. We built our first prototype of the model to collect 55 spoken dialogues, in which the BLISS agent asked questions to users about their happiness and well-being. Apart from a description of the BLISS architecture, we also provide details about our dataset, which contains mentions of over 120 activities and 100 motivations and is made available for usage.
- Published
- 2020
21. The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, D'Souza, Jennifer, Hoppe, Anett, Brack, Arthur, Jaradeh, Mohamad Yaser, Auer, Sören, and Ewerth, Ralph
- Abstract
We introduce the STEM (Science, Technology, Engineering, and Medicine) Dataset for Scientific Entity Extraction, Classification, and Resolution, version 1.0 (STEM-ECR v1.0). The STEM-ECR v1.0 dataset has been developed to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion. It comprises abstracts in 10 STEM disciplines that were found to be the most prolific ones on a major publishing platform. We describe the creation of such a multidisciplinary corpus and highlight the obtained findings in terms of the following features: 1) a generic conceptual formalism for scientific entities in a multidisciplinary scientific context; 2) the feasibility of the domain-independent human annotation of scientific entities under such a generic formalism; 3) a performance benchmark obtainable for automatic extraction of multidisciplinary scientific entities using BERT-based neural models; 4) a delineated 3-step entity resolution procedure for human annotation of the scientific entities via encyclopedic entity linking and lexicographic word sense disambiguation; and 5) human evaluations of Babelfy-returned encyclopedic links and lexicographic senses for our entities. Our findings cumulatively indicate that human annotation and automatic learning of multidisciplinary scientific concepts as well as their semantic disambiguation in a setting as wide-ranging as STEM are reasonable.
- Published
- 2020
22. Model-based annotation of coreference
- Author
-
Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Aralikatte, Rahul, and Søgaard, Anders
- Abstract
Humans do not make inferences over texts, but over models of what texts are about. When annotators are asked to annotate coreferent spans of text, it is therefore a somewhat unnatural task. This paper presents an alternative in which we preprocess documents, linking entities to a knowledge base, and turn the coreference annotation task - in our case limited to pronouns - into an annotation task where annotators are asked to assign pronouns to entities. Model-based annotation is shown to lead to faster annotation and higher inter-annotator agreement, and we argue that it also opens up for an alternative approach to coreference resolution. We present two new coreference benchmark datasets, for English Wikipedia and English teacher-student dialogues, and evaluate state-of-the-art coreference resolvers on them.
- Published
- 2020
23. DaNE: A named entity resource for Danish
- Author
-
Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Hvingelby, Rasmus, Pauli, Amalie Brogaard, Barrett, Maria, Rosted, Christina, Lidegaard, Lasse Malm, and Søgaard, Anders
- Abstract
We present a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme: DaNE. It is the largest publicly available, Danish named entity gold annotation. We evaluate the quality of our annotations intrinsically by double annotating the entire treebank and extrinsically by comparing our annotations to a recently released named entity annotation of the validation and test sections of the Danish Universal Dependencies treebank. We benchmark the new resource by training and evaluating competitive architectures for supervised named entity recognition (NER), including FLAIR, monolingual (Danish) BERT and multilingual BERT. We explore cross-lingual transfer in multilingual BERT from five related languages in zero-shot and direct transfer setups, and we show that even with our modestly-sized training set, we improve Danish NER over a recent cross-lingual approach, as well as over zero-shot transfer from five related languages. Using multilingual BERT, we achieve higher performance by fine-tuning on both DaNE and a larger Bokmål (Norwegian) training set compared to only using DaNE. However, the highest performance is achieved by using a Danish BERT fine-tuned on DaNE. Our dataset enables improvements and applicability for Danish NER beyond cross-lingual methods. We employ a thorough error analysis of the predictions of the best models for seen and unseen entities, as well as their robustness on un-capitalized text. The annotated dataset and all the trained models are made publicly available.
- Published
- 2020
24. WikiBank: Using Wikidata to improve multilingual frame-semantic parsing
- Author
-
Calzolari, Nicoletta, Bechet, Frederic, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Sas, Cezar, Beloucif, Meriem, and Søgaard, Anders
- Abstract
Frame-semantic annotations exist for a tiny fraction of the world's languages. Wikidata, however, links knowledge base triples to texts in many languages, providing a common, distant supervision signal for semantic parsers. We present WIKIBANK, a multilingual resource of partial semantic structures that can be used to extend pre-existing resources rather than creating new man-made resources from scratch. We also integrate this form of supervision into an off-the-shelf frame-semantic parser and allow cross-lingual transfer. Using Google's SLING architecture, we show significant improvements on the English and Spanish CoNLL 2009 datasets, whether training on the full available datasets or small subsamples thereof.
- Published
- 2020
25. Odi et Amo. Creating, Evaluating and Extending Sentiment Lexicons for Latin
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Sprugnoli, Rachele, Passarotti, Marco Carlo, Corbetta, Daniela, Peverelli, Andrea, Sprugnoli Rachele (ORCID:0000-0001-6861-5595), and Passarotti Marco (ORCID:0000-0002-9806-7187)
- Abstract
Sentiment lexicons are essential for developing automatic sentiment analysis systems, but the resources currently available mostly cover modern languages. Lexicons for ancient languages are few and not evaluated with high-quality gold standards. However, the study of attitudes and emotions in ancient texts is a growing field of research which poses specific issues (e.g., lack of native speakers, limited amount of data, unusual textual genres for the sentiment analysis task, such as philosophical or documentary texts) and can have an impact on the work of scholars coming from several disciplines besides computational linguistics, e.g. historians and philologists. The work presented in this paper aims at providing the research community with a set of sentiment lexicons built by taking advantage of manually-curated resources belonging to the long tradition of Latin corpus and lexicon creation. Our interdisciplinary approach led us to release: i) two automatically generated sentiment lexicons; ii) a Gold Standard developed by two Latin language and culture experts; iii) a Silver Standard in which semantic and derivational relations are exploited so as to extend the list of lexical items of the Gold Standard. In addition, the evaluation procedure is described together with a first application of the lexicons to a Latin tragedy.
- Published
- 2020
26. A New Latin Treebank for Universal Dependencies: Charters between Ancient Latin and Romance Languages
- Author
-
Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Cecchini, Flavio Massimiliano, Korkiakangas, Timo, Passarotti, Marco, Cecchini Flavio Massimiliano (ORCID:0000-0001-9029-1822), and Passarotti Marco (ORCID:0000-0002-9806-7187)
- Abstract
The present work introduces a new Latin treebank that follows the Universal Dependencies (UD) annotation standard. The treebank is obtained from the automated conversion of the Late Latin Charter Treebank 2 (LLCT2), originally in the Prague Dependency Treebank (PDT) style. As this treebank consists of Early Medieval legal documents, its language variety differs considerably from both the Classical and Medieval learned varieties prevalent in the other currently available UD Latin treebanks. Consequently, besides significant phenomena from the perspective of diachronic linguistics, this treebank also poses several challenging technical issues for the current and future syntactic annotation of Latin in the UD framework. Some of the most relevant cases are discussed in depth, with comparisons between the original PDT and the resulting UD annotations. Additionally, an overview of the UD-style structure of the treebank is given, and some diachronic aspects of the transition from Latin to Romance languages are highlighted.
- Published
- 2020
27. The circumstantial event ontology (CEO) and ECB+/CEO: An ontology and corpus for implicit causal relations between events
- Author
-
Segers, Roxane, Caselli, Tommaso, Vossen, Piek, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Language, and Network Institute
- Subjects
Causality ,Annotated Corpora ,Semantic Role Labeling ,Ontology ,Text Mining ,Event Chaining ,Event Modeling
- Abstract
In this paper, we describe the Circumstantial Event Ontology (CEO), a newly developed ontology for calamity events that models semantic circumstantial relations between event classes, where we define circumstantial as inferred implicit causal relations. The circumstantial relations are inferred from the assertions of the event classes that involve a change to the same property of a participant. Our model captures that the change yielded by one event explains to people the occurrence of the next event when observed. We describe the meta model and the contents of the ontology, the creation of a manually annotated corpus for circumstantial relations based on ECB+, and the first results on the evaluation of the ontology.
- Published
- 2019
28. Discovering the language of wine reviews: A text mining account
- Author
-
Lefever, Els, Hendrickx, Iris, Croijmans, Ilja, Van Den Bosch, Antal, Majid, Asifa, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Leerstoel Smeets, and Social-cognitive and interpersonal determinants of behaviour
- Subjects
Linguistics and Language ,Wine vocabulary ,Terminology extraction ,Library and Information Sciences ,Classification ,Supervised learning ,Wine reviews ,Language and Linguistics ,Education
- Abstract
It is widely held that smells and flavors are impossible to put into words. In this paper we test this claim by seeking predictive patterns in wine reviews, which ostensibly aim to provide guides to perceptual content. Wine reviews have previously been critiqued as random and meaningless. We collected an English corpus of wine reviews with their structured metadata, and applied machine learning techniques to automatically predict the wine's color, grape variety, and country of origin. To train the three supervised classifiers, three different information sources were incorporated: lexical bag-of-words features, domain-specific terminology features, and semantic word embedding features. In addition, using regression analysis we investigated basic review properties, i.e., review length, average word length, and their relationship to the scalar values of price and review score. Our results show that wine experts do share a common vocabulary to describe wines and they use this in a consistent way, which makes it possible to automatically predict wine characteristics based on the review text alone. This means that odors and flavors may be more expressible in language than typically acknowledged.
- Published
- 2019
29. Studying Muslim stereotyping through microportrait extraction
- Author
-
Fokkens, Antske, Ruigrok, Nel, Beukeboom, Camiel, Gagestein, Sarah, Van Atteveldt, Wouter, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Language, Network Institute, Communication Science, and Communication Choices, Content and Consequences (CCCC)
- Subjects
Digital social science ,Stereotyping ,Text analysis ,SDG 10 - Reduced Inequalities
- Abstract
Research from communication science has shown that stereotypical ideas are often reflected in language use. Media coverage of different groups in society influences the perception people have about these groups and even increases distrust and polarization among different groups. Investigating the forms of (especially subtle) stereotyping can raise awareness among journalists and help prevent reinforcing oppositions between groups in society. Conducting large-scale, deep investigations to determine whether we are faced with stereotyping is time-consuming and costly. We propose to tackle these challenges by means of microportraits: an impression of a target group or individual conveyed in a single text. We introduce the first system implementation for Dutch and show that microportraits allow social scientists to explore various dimensions of stereotyping. We explore the possibilities provided by microportraits by investigating stereotyping of Muslims in the Dutch media. Our (preliminary) results show that microportraits provide more detailed insights into stereotyping compared to more basic models such as word clouds.
- Published
- 2019
30. Systems' agreements and disagreements in temporal processing: An extensive error analysis of the TempEval-3 task
- Author
-
Caselli, Tommaso, Morante, Roser, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, and Language
- Subjects
Temporal processing ,Error analysis ,Written corpora
- Abstract
In this article we review Temporal Processing systems that participated in the TempEval-3 task as a basis to develop our own system, which we also present and release. The system incorporates high-level lexical semantic features, obtaining the best scores for event detection (F1-Class 72.24) and the second-best result for temporal relation classification from raw text (F1 29.69) when evaluated on the TempEval-3 data. Additionally, we analyse the errors of all TempEval-3 systems for which the output is publicly available, with the purpose of identifying the weaknesses of current approaches. Although incorporating lexical semantic features increases the performance of our system, the error analysis shows that systems should incorporate inference mechanisms and world knowledge, as well as strategies to compensate for data skewness.
- Published
- 2019
31. Semantic Query Analysis from the Global Science Gateway
- Author
-
Goggi, Sara, Pardelli, Gabriella, Bartolini, Roberto, Monachini, Monica, Biagioni, Stefania, and Carlesi, Carlo
- Subjects
FOS: Computer and information sciences ,Global Science Gateway ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,Information gateways ,Semantic Query Analysis ,WorldWideScience Alliance ,Query Log ,Terminology ,Information Extraction ,Social Media
- Abstract
We focused on building a corpus of the query logs registered by the GreyGuide: Repository and Portal to Good Practices and Resources in Grey Literature and received by the WorldWideScience.org (The Global Science Gateway) portal. The aim is to retrieve information related to social media, which today represent a considerable source of data increasingly used for research purposes. This project includes eight months of query logs registered between July 2017 and February 2018, for a total of 445,827 queries. The analysis mainly concentrates on the semantics of the queries received from the portal clients: it is a process of information retrieval from a rich digital catalogue whose language is dynamic and evolving, and both follows and reflects the cultural changes of our modern society.
- Published
- 2019
32. Automating document discovery in the systematic review process: How to use chaff to extract wheat
- Author
-
Norman, Christopher, Leeflang, Mariska, Zweigenbaum, Pierre, Névéol, Aurélie, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Epidemiology and Data Science, APH - Methodology, and APH - Personalized Medicine
- Abstract
Systematic reviews in, e.g., empirical medicine address research questions by comprehensively examining the entire published literature. Conventionally, manual literature surveys decide inclusion in two steps, first based on abstracts and titles, then on full text, yet current methods to automate the process make no distinction between gold data from these two stages. In this work we compare the impact that different schemes for choosing positive and negative examples from the different screening stages have on the training of automated systems. We train a ranker using logistic regression and evaluate it on a new gold standard dataset for clinical NLP, and on an existing gold standard dataset for drug class efficacy. The classification and ranking achieve an average AUC of 0.803 and 0.768 when relying on gold standard decisions based on titles and abstracts of articles, and an AUC of 0.625 and 0.839 when relying on gold standard decisions based on full text. Our results suggest that it makes little difference which screening stage the gold standard decisions are drawn from, and that the decisions need not be based on the full text. The results further suggest that common off-the-shelf algorithms can reduce the amount of work required to retrieve relevant literature.
- Published
- 2019
33. Discovering the language of wine reviews: A text mining account
- Author
-
Leerstoel Smeets, Social-cognitive and interpersonal determinants of behaviour, Lefever, Els, Hendrickx, Iris, Croijmans, Ilja, Van Den Bosch, Antal, Majid, Asifa, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, and Tokunaga, Takenobu
- Published
- 2019
34. The AnnCor CHILDES Treebank
- Author
-
Odijk, Jan, Dimitriadis, Alexis, Klis, Martijn Van der, Koppen, Marjo Van, Otten, Meie, Veen, Remco van der, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Piperidis, Stelios, Tokunaga, Takenobu, LS OZ Taal en spraaktechnologie, LS Psycholinguistiek, LS Franse Taalkunde, LS BZ Variatielinguistiek vh Nederlands, and ILS LLI
- Subjects
treebank ,treebank querying ,CHILDES ,Dutch ,GrETEL ,Language and Linguistics ,Computer Science(all)
- Abstract
This paper (1) presents the first partially manually verified treebank for Dutch CHILDES corpora, the AnnCor CHILDES Treebank; (2) argues explicitly that it is useful to assign adult grammar syntactic structures to utterances of children who are still in the process of acquiring the language; (3) argues that human annotation and automatic checks on this annotation must go hand in hand; (4) argues that explicit annotation guidelines and conventions must be developed and adhered to, and emphasises consistency as an important desirable property of annotations. It also describes the tools used for annotation and automated checks on edited syntactic structures, as well as extensions to an existing treebank query application (GrETEL) and the multiple formats in which the resources will be made available.
- Published
- 2018
35. Discovering the Language of Wine Reviews: A Text Mining Account
- Author
-
Lefever, Els, Hendrickx, Iris, Croijmans, Ilja, Bosch, Antal van den, Majid, Asifa, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Leerstoel Smeets, Social-cognitive and interpersonal determinants of behaviour, LS Language, Communication & Computation, and ILS L&C
- Subjects
Linguistics and Language ,Wine vocabulary ,Terminology extraction ,Library and Information Sciences ,Classification ,Supervised learning ,Wine reviews ,Language and Linguistics ,Education
- Abstract
It is widely held that smells and flavors are impossible to put into words. In this paper we test this claim by seeking predictive patterns in wine reviews, which ostensibly aim to provide guides to perceptual content. Wine reviews have previously been critiqued as random and meaningless. We collected an English corpus of wine reviews with their structured metadata, and applied machine learning techniques to automatically predict the wine's color, grape variety, and country of origin. To train the three supervised classifiers, three different information sources were incorporated: lexical bag-of-words features, domain-specific terminology features, and semantic word embedding features. In addition, using regression analysis we investigated basic review properties, i.e., review length, average word length, and their relationship to the scalar values of price and review score. Our results show that wine experts do share a common vocabulary to describe wines and they use this in a consistent way, which makes it possible to automatically predict wine characteristics based on the review text alone. This means that odors and flavors may be more expressible in language than typically acknowledged.
- Published
- 2018
36. A Multilingual Wikified Data Set of Educational Material
- Author
-
Hendrickx, Iris, Takoulidou, Eirini, Naskos, Thanasis, Kermanidis, Katia Lida, Sosoni, Vilelmini, Vos, Hugo De, Stasimioti, Maria, Zaanen, Menno Van, Georgakopoulou, Panayota, Kordoni, Valia, Popovic, Maja, Egg, Markus, Bosch, Antal Van den, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
37. Translation Crowdsourcing: Creating a Multilingual Corpus of Online Educational Content
- Author
-
Sosoni, Vilelmini, Kermanidis, Katia Lida, Stasimioti, Maria, Naskos, Thanasis, Takoulidou, Eirini, Zaanen, Menno Van, Castilho, Sheila, Georgakopoulou, Panayota, Kordoni, Valia, Egg, Markus, Calzolari, Nicoletta (Conference chair), Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
38. Neural Models of Selectional Preferences for Implicit Semantic Role Labeling
- Author
-
Le, M.N., Fokkens, A.S., Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Network Institute, and Language
- Subjects
Selectional preferences ,Implicit semantic role labeling ,Neural network
- Abstract
Implicit Semantic Role Labeling is a challenging task: it requires high-level understanding of the text while annotated data is very limited. Due to the lack of training data, most research either resorts to simplistic machine learning methods or focuses on automatically acquiring training data. In this paper, we explore the possibilities of using more complex and expressive machine learning models trained on a large amount of explicit roles. In addition, we compare the impact of one-way and multi-way selectional preference, with the hypothesis that the added information in multi-way models is beneficial. Although our models surpass a baseline that uses prototypical vectors for SemEval-2010, we otherwise face mostly negative results. Selectional preference models perform worse than the baseline on ON5V, a dataset of five ambiguous and frequent verbs. They are also outperformed by the Naive Bayes model of Feizabadi and Pado (2015) on both datasets. We conclude that, even though multi-way selectional preference improves results for predicting explicit semantic roles compared to one-way selectional preference, it harms performance for implicit roles. We release our source code, including the reimplementation of two previously unavailable systems, to enable further experimentation.
- Published
- 2018
39. Towards an ISO Standard for the Annotation of Quantification
- Author
-
Bunt, Harry, Pustejovsky, J., Lee, K., Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, and Cognitive Science & AI
- Published
- 2018
40. Resource Interoperability for Sustainable Benchmarking: The Case of Events
- Author
-
van Son, C.M., Inel, O.A., Morante Vallejo, R., Aroyo, L.M., Vossen, P.T.J.M., Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Language, Network Institute, Business Web and Media, and Intelligent Information Systems
- Subjects
Events ,ComputingMethodologies_DOCUMENTANDTEXTPROCESSING ,Resource interoperability ,Annotation consistency
- Abstract
With the continuous growth of benchmark corpora, which often annotate the same documents, there is a range of opportunities to compare and combine similar and complementary annotations. However, these opportunities are hampered by a wide range of problems that are related to the lack of resource interoperability. In this paper, we illustrate these problems by assessing aspects of interoperability at the document-level across a set of 20 corpora annotated with (aspects of) events. The issues range from applying different document naming conventions, to mismatches in textual content and structural/conceptual differences among annotation schemes. We provide insight into the exact document intersections between the corpora by mapping their document identifiers and perform an empirical analysis of event annotations showing their compatibility and consistency in and across the corpora. This way, we aim to make the community more aware of the challenges and opportunities and to inspire working collaboratively towards interoperable resources.
- Published
- 2018
41. Don't Annotate, but Validate: a Data-to-Text Method for Capturing Event Data
- Author
-
Vossen, Piek, Ilievski, Filip, Postma, Marten, Segers, R.H., Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Language, and Network Institute
- Subjects
Text corpora ,SDG 16 - Peace, Justice and Strong Institutions ,Event coreference ,Structured data
- Abstract
In this paper, we present a new method to obtain large volumes of high-quality text corpora with event data for studying identity and reference relations. We report on the current methods to create event reference data by annotating texts and deriving the event data a posteriori. Our method starts from event registries in which event data is defined a priori. From this data, we extract so-called Microworlds of referential data with the Reference Texts that report on these events. This makes it possible to easily establish referential relations with high precision and at a large scale. In a pilot, we successfully obtained data from these resources with extreme ambiguity and variation, while maintaining the identity and reference relations and without having to annotate large quantities of texts word-by-word. The data from this pilot was annotated using an annotation tool created specifically in order to validate our method and to enrich the reference texts with event coreference annotations. This annotation process resulted in the Gun Violence Corpus, whose development process and outcome are described in this paper.
- Published
- 2018
42. The AnnCor CHILDES Treebank
- Author
-
LS OZ Taal en spraaktechnologie, LS Psycholinguistiek, LS Franse Taalkunde, LS BZ Variatielinguistiek vh Nederlands, UiL OTS LLI, Odijk, Jan, Dimitriadis, Alexis, Klis, Martijn Van der, Koppen, Marjo Van, Otten, Meie, Veen, Remco van der, Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Piperidis, Stelios, and Tokunaga, Takenobu
- Published
- 2018
43. Profiling Medical Journal Articles Using a Gene Ontology Semantic Tagger
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, El Haj, Mahmoud, Rayson, Paul Edward, Piao, Scott Songlin, and Knight, Jo
- Abstract
In many areas of academic publishing, there is an explosion of literature, and sub-division of fields into subfields, leading to stove-piping where sub-communities of expertise become disconnected from each other. This is especially true in the genetics literature over the last 10 years where researchers are no longer able to maintain knowledge of previously related areas. This paper extends several approaches based on natural language processing and corpus linguistics which allow us to examine corpora derived from bodies of genetics literature and will help to make comparisons and improve retrieval methods using domain knowledge via an existing gene ontology. We derived two open access medical journal corpora from PubMed related to psychiatric genetics and immune disorder genetics. We created a novel Gene Ontology Semantic Tagger (GOST) and lexicon to annotate the corpora and are then able to compare subsets of literature to understand the relative distributions of genetic terminology, thereby enabling researchers to make improved connections between them.
- Published
- 2018
44. Arabic Dialect Identification in the Context of Bivalency and Code-Switching
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, El Haj, Mahmoud, Rayson, Paul Edward, and Aboelezz, Mariam
- Abstract
In this paper we use a novel approach to Arabic dialect identification using language bivalency and written code-switching. Bivalency between languages or dialects is where a word or element is treated by language users as having a fundamentally similar semantic content in more than one language or dialect. Arabic dialect identification in writing is a difficult task even for humans because words are used interchangeably between dialects. The task of automatically identifying dialect is harder, and classifiers trained using only n-grams will perform poorly when tested on unseen data. Such approaches require significant amounts of annotated training data, which is costly and time-consuming to produce. Currently available Arabic dialect datasets do not exceed a few hundred thousand sentences; thus we need to extract features other than word and character n-grams. In our work we present experimental results from automatically identifying dialects from the four main Arabic dialect regions (Egypt, North Africa, Gulf and Levant) in addition to Standard Arabic. We extend previous work by incorporating additional grammatical and stylistic features and define a subtractive bivalency profiling approach to address issues of bivalent words across the examined Arabic dialects. The results show that our new method's classification accuracy can reach more than 76% and that it scores well (66%) when tested on completely unseen data.
- Published
- 2018
45. Towards A Welsh Semantic Annotation System
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, Piao, Scott Songlin, Rayson, Paul Edward, Knight, Dawn, and Watkins, Gareth
- Abstract
Automatic semantic annotation of natural language data is an important task in Natural Language Processing, and a variety of semantic taggers have been developed for this task, particularly for English. However, for many languages, particularly for low-resource languages, such tools are yet to be developed. In this paper, we report on the development of an automatic Welsh semantic annotation tool (named CySemTagger) in the CorCenCC Project, which will facilitate semantic-level analysis of Welsh language data on a large scale. Based on Lancaster’s USAS semantic tagger framework, this tool tags words in Welsh texts with semantic tags from a semantic classification scheme, and is designed to be compatible with multiple Welsh POS taggers and POS tagsets by mapping different tagsets into a core shared POS tagset that is used internally by CySemTagger. Our initial evaluation shows that the tagger can cover up to 91.78% of words in Welsh text. This tagger is under continuous development, and will provide a critical tool for Welsh language corpus and information processing at semantic level.
- Published
- 2018
46. Discovering the Language of Wine Reviews: A Text Mining Account
- Author
-
Leerstoel Smeets, Social-cognitive and interpersonal determinants of behaviour, LS Language, Communication & Computation, ILS L&C, Isahara, Hitoshi, Maegaard, Bente, Piperidis, Stelios, Cieri, Christopher, Declerck, Thierry, Hasida, Koiti, Mazo, Helene, Choukri, Khalid, Goggi, Sara, Mariani, Joseph, Moreno, Asuncion, Calzolari, Nicoletta, Odijk, Jan, Tokunaga, Takenobu, Lefever, Els, Hendrickx, Iris, Croijmans, Ilja, Bosch, Antal van den, and Majid, Asifa
- Published
- 2018
47. SMILE Swiss German Sign Language Dataset
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, Ebling, Sarah; https://orcid.org/0000-0001-6511-5085, Camgöz, Necati Cihan, Boyes Braem, Penny, Tissi, Katja, Sidler-Miserez, Sandra, Stoll, Stephanie, Hadfield, Simon, Haug, Tobias, Bowden, Richard, Tornay, Sandrine, Razavi, Marzieh, and Magimai-Doss, Mathew
- Published
- 2018
48. CLARIN: towards FAIR and responsible data science using language resources
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios, Tokunaga, Takenobu, de Jong, Franciska, de Smedt, Koenraad, Fiser, Darja, and van Uytvanck, Dieter
- Published
- 2018
49. Proceedings of Eleventh International Conference on Language Resources and Evaluation
- Author
-
Calzolari, Nicoletta, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Hasida, Koiti, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Helene, Moreno, Asunción, Odijk, Jan, Piperidis, Stelios, and Tokunaga, Takenobu
- Published
- 2018
50. A terminological 'journey' in the Grey Literature domain
- Author
-
Bartolini, Roberto (ILC-CNR), Pardelli, Gabriella (ILC-CNR), Goggi, Sara (ILC-CNR), Giannini, Silvia (ISTI-CNR), Biagioni, Stefania (ISTI-CNR), GreyNet, Grey Literature Network Service, and GL18, New York, NY (USA), 2016-11-28
- Subjects
05B - Information science, librarianship, ComputingMethodologies_SYMBOLICANDALGEBRAICMANIPULATION, ComputingMilieux_PERSONALCOMPUTING, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, GeneralLiterature_MISCELLANEOUS
- Abstract
Includes: Conference preprint, PowerPoint presentation, Abstract and Biographical notes XA International
- Published
- 2017