Author: "Biemann, Chris" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

1. CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models

Author: Wang, Xintong, Pan, Jingheng, Jiang, Longqin, Ding, Liang, Li, Xingshan, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Despite their impressive capabilities, large language models (LLMs) often lack interpretability and can generate toxic content. While using LLMs as foundation models and applying semantic steering methods are widely practiced, we believe that efficient methods should be based on a thorough understanding of LLM behavior. To this end, we propose using eye movement measures to interpret LLM behavior across layers. We find that LLMs exhibit patterns similar to human gaze across layers and different layers function differently. Inspired by these findings, we introduce a heuristic steering layer selection and apply it to layer intervention methods via fine-tuning and inference. Using language toxification and detoxification as test beds, we demonstrate that our proposed CogSteer methods achieve better results in terms of toxicity scores while efficiently saving 97% of the computational resources and 60% of the training time. Our model-agnostic approach can be adopted into various LLMs, contributing to their interpretability and promoting trustworthiness for safe deployment.
Published: 2024

2. Large Language Models Are Overparameterized Text Encoders

Author: K, Thennal D, Fischer, Tim, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning the last p% of layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L³Prune, a novel layer-pruning strategy based on the model's initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a -0.3 performance drop, and the small variant only suffers from a -5.1 decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks, and can be easily pruned.
Comment: 8 pages of content + 1 for limitations and ethical considerations, 14 pages in total including references and appendix, 5+1 figures
Published: 2024
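
The core idea in the abstract above, dropping the last p% of layers, can be illustrated with a minimal sketch. This is not the authors' three-line implementation: the function name `prune_last_layers`, the use of a plain Python list as a stand-in for a layer stack, and the choice to round the number of dropped layers down are all assumptions made for illustration.

```python
from typing import List, TypeVar

T = TypeVar("T")


def prune_last_layers(layers: List[T], percent: float) -> List[T]:
    """Return a copy of `layers` without its last `percent` percent of entries.

    Hypothetical helper for illustration only.
    """
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    # Assumption: the number of dropped layers is rounded down.
    n_drop = int(len(layers) * percent / 100)
    return layers[: len(layers) - n_drop]


# Stand-in for a 32-layer stack; a real model would hold layer modules here.
layers = [f"layer_{i}" for i in range(32)]
pruned = prune_last_layers(layers, 25)
print(len(pruned))  # 24
```

In the setting the abstract describes, the pruned model would then be finetuned with supervised contrastive training; that step is outside the scope of this sketch.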

3. Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management

Author: Yimam, Seid Muhie, Dementieva, Daryna, Fischer, Tim, Moskovskiy, Daniil, Rizwan, Naquee, Saha, Punyajoy, Roy, Sarthak, Semmann, Martin, Panchenko, Alexander, Biemann, Chris, and Mukherjee, Animesh
Subjects: Computer Science - Computation and Language, Computer Science - Social and Information Networks
Abstract: Despite regulations imposed by nations and social media platforms, such as recent EU regulations targeting digital violence, abusive content persists as a significant challenge. Existing approaches primarily rely on binary solutions, such as outright blocking or banning, yet fail to address the complex nature of abusive speech. In this work, we propose a more comprehensive approach called Demarcation, which scores abusive speech based on four aspects -- (i) severity scale; (ii) presence of a target; (iii) context scale; (iv) legal scale -- and suggests more options for action, such as detoxification, counterspeech generation, blocking, or, as a final measure, human intervention. Through a thorough analysis of abusive speech regulations across diverse jurisdictions, platforms, and research papers, we highlight the gap in preventive measures and advocate for tailored proactive steps to combat its multifaceted manifestations. Our work aims to inform future strategies for effectively addressing abusive speech online.
Published: 2024

4. Low-Resource Machine Translation through the Lens of Personalized Federated Learning

Author: Moskvoretskii, Viktor, Tupitsa, Nazarii, Biemann, Chris, Horváth, Samuel, Gorbunov, Eduard, and Nikishina, Irina
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We present a new approach based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task (Small Track #2) and the subset of Sami languages from the multilingual benchmark for Finno-Ugric languages. In addition to its effectiveness, MeritFed is also highly interpretable, as it can be applied to track the impact of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply with a few lines of code, and we provide scripts for reproducing the experiments at https://github.com/VityaVitalich/MeritFed.
Comment: 18 pages, 7 figures
Published: 2024

5. Dataset of Quotation Attribution in German News Articles

Author: Petersen-Frey, Fynn and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: Extracting who says what to whom is a crucial part in analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, creative-commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context, to whom and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.
Comment: To be published at LREC-COLING 2024
Published: 2024

6. Exploring Boundaries and Intensities in Offensive and Hate Speech: Unveiling the Complex Spectrum of Social Media Discourse

Author: Ayele, Abinew Ali, Jalew, Esubalew Alemneh, Ali, Adem Chanie, Yimam, Seid Muhie, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: The prevalence of digital media and evolving sociopolitical dynamics have significantly amplified the dissemination of hateful content. Existing studies mainly focus on classifying texts into binary categories, often overlooking the continuous spectrum of offensiveness and hatefulness inherent in the text. In this research, we present an extensive benchmark dataset for Amharic, comprising 8,258 tweets annotated for three distinct tasks: category classification, identification of hate targets, and rating offensiveness and hatefulness intensities. Our study highlights that a considerable majority of tweets belong to the less offensive and less hateful intensity levels, underscoring the need for early interventions by stakeholders. The prevalence of ethnic and political hatred targets, with significant overlaps in our dataset, emphasizes the complex relationships within Ethiopia's sociopolitical landscape. We build classification and regression models and investigate the efficacy of models in handling these tasks. Our results reveal that hate and offensive speech cannot be addressed by a simplistic binary classification, instead manifesting as variables across a continuous range of values. The Afro-XLMR-large model exhibits the best performance, achieving F1-scores of 75.30%, 70.59%, and 29.42% for the category, target, and regression tasks, respectively. The 80.22% correlation coefficient of the Afro-XLMR-large model indicates strong alignments.
Published: 2024

7. Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding

Author: Wang, Xintong, Pan, Jingheng, Ding, Liang, and Biemann, Chris
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive Decoding (ICD) method, a novel approach designed to reduce hallucinations during LVLM inference. Our method is inspired by our observation that what we call disturbance instructions significantly exacerbate hallucinations in multimodal fusion modules. ICD contrasts distributions from standard and instruction disturbance, thereby increasing alignment uncertainty and effectively subtracting hallucinated concepts from the original distribution. Through comprehensive experiments on discriminative benchmarks (POPE and MME) and a generative benchmark (LLaVa-Bench), we demonstrate that ICD significantly mitigates both object-level and attribute-level hallucinations. Moreover, our method not only addresses hallucinations but also significantly enhances the general perception and recognition capabilities of LVLMs.
Comment: Accepted to Findings of ACL 2024
Published: 2024
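
The idea of contrasting two next-token distributions, as described in the abstract above, can be sketched with toy numbers. The combination rule used here, (1 + alpha) * standard - alpha * disturbed on logits, is a common generic form of contrastive decoding and is an assumption; the abstract does not state the exact ICD formula. The function names, the three-word vocabulary, and all logit values are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrast_logits(standard: List[float], disturbed: List[float],
                    alpha: float = 1.0) -> List[float]:
    """Amplify the standard logits and subtract the disturbed ones.

    Assumed generic rule: (1 + alpha) * standard - alpha * disturbed.
    """
    return [(1 + alpha) * s - alpha * d for s, d in zip(standard, disturbed)]


# Toy example: "unicorn" plays the role of a hallucinated concept that the
# disturbed prompt boosts even further.
vocab = ["cat", "dog", "unicorn"]
standard = [2.0, 1.0, 2.2]   # logits under the standard instruction
disturbed = [1.0, 0.5, 3.0]  # logits under the disturbance instruction

contrasted = contrast_logits(standard, disturbed, alpha=1.0)
probs = softmax(contrasted)
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best])  # cat
```

With these toy values the standard logits alone would pick "unicorn", while the contrasted logits pick "cat", because the token most inflated by the disturbance is penalized most.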

8. On Zero-Shot Counterspeech Generation by LLMs

Author: Saha, Punyajoy, Agrawal, Aalok, Jana, Abhik, Biemann, Chris, and Mukherjee, Animesh
Subjects: Computer Science - Computation and Language
Abstract: With the emergence of numerous Large Language Models (LLMs), the usage of such models in various Natural Language Processing (NLP) applications is increasing extensively. Counterspeech generation is one such key task where efforts are made to develop generative models by fine-tuning LLMs with hatespeech-counterspeech pairs, but none of these attempts explores the intrinsic properties of large language models in zero-shot settings. In this work, we present a comprehensive analysis of the performances of four LLMs, namely GPT-2, DialoGPT, ChatGPT and FlanT5, in zero-shot settings for counterspeech generation, which is the first of its kind. For GPT-2 and DialoGPT, we further investigate the deviation in performance with respect to the sizes (small, medium, large) of the models. On the other hand, we propose three different prompting strategies for generating different types of counterspeech and analyse the impact of such strategies on the performance of the models. Our analysis shows that there is an improvement in generation quality for two datasets (17%); however, toxicity also increases (25%) with an increase in model size. Considering the type of model, GPT-2 and FlanT5 models are significantly better in terms of counterspeech quality but also have high toxicity as compared to DialoGPT. ChatGPT is much better at generating counterspeech than other models across all metrics. In terms of prompting, we find that our proposed strategies help in improving counterspeech generation across all the models.
Comment: 12 pages, 7 tables, accepted at LREC-COLING 2024
Published: 2024

9. SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages

Author: Ousidhoum, Nedjma, Muhammad, Shamsuddeen Hassan, Abdalla, Mohamed, Abdulmumin, Idris, Ahmad, Ibrahim Said, Ahuja, Sanchit, Aji, Alham Fikri, Araujo, Vladimir, Ayele, Abinew Ali, Baswani, Pavan, Beloucif, Meriem, Biemann, Chris, Bourhim, Sofia, De Kock, Christine, Dekebo, Genet Shanko, Hourrane, Oumaima, Kanumolu, Gopichand, Madasu, Lokesh, Rutunda, Samuel, Shrivastava, Manish, Solorio, Thamar, Surange, Nirmal, Tilaye, Hailegnaw Getaneh, Vishnubhotla, Krishnapriya, Winata, Genta, Yimam, Seid Muhie, and Mohammad, Saif M.
Subjects: Computer Science - Computation and Language
Abstract: Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
Comment: Accepted to the Findings of ACL 2024
Published: 2024

10. Probing Large Language Models from A Human Behavioral Perspective

Author: Wang, Xintong, Li, Xiaoyu, Li, Xingshan, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) have emerged as dominant foundational models in modern NLP. However, the understanding of their prediction processes and internal mechanisms, such as feed-forward networks (FFN) and multi-head self-attention (MHSA), remains largely unexplored. In this work, we probe LLMs from a human behavioral perspective, correlating values from LLMs with eye-tracking measures, which are widely recognized as meaningful indicators of human reading patterns. Our findings reveal that LLMs exhibit a prediction pattern similar to that of humans but distinct from that of Shallow Language Models (SLMs). Moreover, moving upward from the middle layers of an LLM, the correlation coefficients also increase in FFN and MHSA, indicating that the logits within FFN increasingly encapsulate word semantics suitable for predicting tokens from the vocabulary.
Comment: Accepted by LREC-COLING NeusymBridge 2024
Published: 2023

11. DBLPLink: An Entity Linker for the DBLP Scholarly Knowledge Graph

Author: Banerjee, Debayan, Arefa, Usbeck, Ricardo, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: In this work, we present a web application named DBLPLink, which performs entity linking over the DBLP scholarly knowledge graph. DBLPLink uses text-to-text pre-trained language models, such as T5, to produce entity label spans from an input text question. Entity candidates are fetched from a database based on the labels, and an entity re-ranker sorts them based on entity embeddings, such as TransE, DistMult and ComplEx. The results are displayed so that users may compare and contrast the results between T5-small, T5-base and the different KG embeddings used. The demo can be accessed at https://ltdemos.informatik.uni-hamburg.de/dblplink/.
Comment: Accepted at International Semantic Web Conference (ISWC) 2023 Posters & Demo Track
Published: 2023

12. The Role of Output Vocabulary in T2T LMs for SPARQL Semantic Parsing

Author: Banerjee, Debayan, Nair, Pranav Ajit, Usbeck, Ricardo, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: In this work, we analyse the role of output vocabulary for text-to-text (T2T) models on the task of SPARQL semantic parsing. We perform experiments within the context of knowledge graph question answering (KGQA), where the task is to convert questions in natural language to the SPARQL query language. We observe that the query vocabulary is distinct from human vocabulary. Language Models (LMs) are predominantly trained for human language tasks, and hence, if the query vocabulary is replaced with a vocabulary more attuned to the LM tokenizer, the performance of models may improve. We carry out carefully selected vocabulary substitutions on the queries and find absolute gains in the range of 17% on the GrailQA dataset.
Comment: Accepted as a short paper to ACL 2023 findings
Published: 2023

13. DBLP-QuAD: A Question Answering Dataset over the DBLP Scholarly Knowledge Graph

Author: Banerjee, Debayan, Awale, Sushil, Usbeck, Ricardo, and Biemann, Chris
Subjects: Computer Science - Digital Libraries, Computer Science - Computation and Language
Abstract: In this work we create a question answering dataset over the DBLP scholarly knowledge graph (KG). DBLP is an on-line reference for bibliographic information on major computer science publications that indexes over 4.4 million publications published by more than 2.2 million authors. Our dataset consists of 10,000 question answer pairs with the corresponding SPARQL queries which can be executed over the DBLP KG to fetch the correct answer. DBLP-QuAD is the largest scholarly question answering dataset.
Comment: 12 pages ceur-ws 1 column accepted at International Bibliometric Information Retrieval Workshop @ ECIR 2023
Published: 2023

14. GETT-QA: Graph Embedding based T2T Transformer for Knowledge Graph Question Answering

Author: Banerjee, Debayan, Nair, Pranav Ajit, Usbeck, Ricardo, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Databases, Computer Science - Information Retrieval
Abstract: In this work, we present an end-to-end Knowledge Graph Question Answering (KGQA) system named GETT-QA. GETT-QA uses T5, a popular text-to-text pre-trained language model. The model takes a question in natural language as input and produces a simpler form of the intended SPARQL query. In the simpler form, the model does not directly produce entity and relation IDs. Instead, it produces corresponding entity and relation labels. The labels are grounded to KG entity and relation IDs in a subsequent step. To further improve the results, we instruct the model to produce a truncated version of the KG embedding for each entity. The truncated KG embedding enables a finer search for disambiguation purposes. We find that T5 is able to learn the truncated KG embeddings without any change of loss function, improving KGQA performance. As a result, we report strong results for LC-QuAD 2.0 and SimpleQuestions-Wikidata datasets on end-to-end KGQA over Wikidata.
Comment: 16 pages single column format accepted at ESWC 2023 research track
Published: 2023

15. A System for Human-AI collaboration for Online Customer Support

Author: Banerjee, Debayan, Poser, Mathis, Wiethof, Christina, Subramanian, Varun Shankar, Paucar, Richard, Bittner, Eva A. C., and Biemann, Chris
Subjects: Computer Science - Artificial Intelligence
Abstract: AI-enabled chatbots have recently been put to use to answer customer service queries; however, users commonly report that bots lack a personal touch and are often unable to understand the real intent of the user's question. To this end, it is desirable to have human involvement in the customer servicing process. In this work, we present a system where a human support agent collaborates in real-time with an AI agent to satisfactorily answer customer queries. We describe the user interaction elements of the solution, along with the machine learning techniques involved in the AI agent.
Published: 2023

16. ARDIAS: AI-Enhanced Research Management, Discovery, and Advisory System

Author: Banerjee, Debayan, Yimam, Seid Muhie, Awale, Sushil, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: In this work, we present ARDIAS, a web-based application that aims to provide researchers with a full suite of discovery and collaboration tools. ARDIAS currently allows searching for authors and articles by name and gaining insights into the research topics of a particular researcher. With the aid of AI-based tools, ARDIAS aims to recommend potential collaborators and topics to researchers. In the near future, we aim to add tools that allow researchers to communicate with each other and start new projects.
Published: 2023

17. Modern Baselines for SPARQL Semantic Parsing

Author: Banerjee, Debayan, Nair, Pranav Ajit, Kaur, Jivat Neet, Usbeck, Ricardo, and Biemann, Chris
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: In this work, we focus on the task of generating SPARQL queries from natural language questions, which can then be executed on Knowledge Graphs (KGs). We assume that gold entities and relations have been provided, and the remaining task is to arrange them in the right order along with SPARQL vocabulary and input tokens to produce the correct SPARQL query. Pre-trained Language Models (PLMs) have not been explored in depth on this task so far, so we experiment with BART, T5 and PGNs (Pointer Generator Networks) with BERT embeddings, looking for new baselines in the PLM era for this task, on DBpedia and Wikidata KGs. We show that T5 requires special input tokenisation, but produces state-of-the-art performance on LC-QuAD 1.0 and LC-QuAD 2.0 datasets, and outperforms task-specific models from previous works. Moreover, the methods enable semantic parsing for questions where a part of the input needs to be copied to the output query, thus enabling a new paradigm in KG semantic parsing.
Comment: 5 pages, short paper, SIGIR 2022
Published: 2022

18. SCoT: Sense Clustering over Time: a tool for the analysis of lexical change

Author: Haase, Christian, Anwar, Saba, Yimam, Seid Muhie, Friedrich, Alexander, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: We present Sense Clustering over Time (SCoT), a novel network-based tool for analysing lexical change. SCoT represents the meanings of a word as clusters of similar words. It visualises their formation, change, and demise. There are two main approaches to the exploration of dynamic networks: the discrete one compares a series of clustered graphs from separate points in time. The continuous one analyses the changes of one dynamic network over a time-span. SCoT offers a new hybrid solution. First, it aggregates time-stamped documents into intervals and calculates one sense graph per discrete interval. Then, it merges the static graphs to a new type of dynamic semantic neighbourhood graph over time. The resulting sense clusters offer uniquely detailed insights into lexical change over continuous intervals with model transparency and provenance. SCoT has been successfully used in a European study on the changing meaning of 'crisis'.
Comment: Update of https://aclanthology.org/2021.eacl-demos.23/
Published: 2022

19. Language Models Explain Word Reading Times Better Than Empirical Predictability

Author: Hofmann, Markus J., Remus, Steffen, Biemann, Chris, Radach, Ralph, and Kuchinke, Lars
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Though there is a strong consensus that word length and frequency are the most important single-word features determining visual-orthographic access to the mental lexicon, there is less agreement as to how best to capture syntactic and semantic factors. The traditional approach in cognitive reading research assumes that word predictability from sentence context is best captured by cloze completion probability (CCP) derived from human performance data. We review recent research suggesting that probabilistic language models provide deeper explanations for syntactic and semantic effects than CCP. Then we compare CCP with three types of language models. (1) Symbolic n-gram models consolidate syntactic and semantic short-range relations by computing the probability of a word to occur, given two preceding words. (2) Topic models rely on subsymbolic representations to capture long-range semantic similarity by word co-occurrence counts in documents. (3) In recurrent neural networks (RNNs), the subsymbolic units are trained to predict the next word, given all preceding words in the sentences. To examine lexical retrieval, these models were used to predict single fixation durations and gaze durations to capture rapidly successful and standard lexical access, and total viewing time to capture late semantic integration. The linear item-level analyses showed greater correlations of all language models with all eye-movement measures than CCP. Then we examined non-linear relations between the different types of predictability and the reading times using generalized additive models. N-gram and RNN probabilities of the present word more consistently predicted reading performance compared with topic models or CCP.
Published: 2022
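
The n-gram predictability described in the abstract above, the probability of a word given the two preceding words, can be sketched as a trigram model. This is a minimal illustration, not the model used in the study: it uses unsmoothed maximum-likelihood counts (an assumption; practical n-gram models typically apply smoothing), and the function names and the toy sentence are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train_trigram(tokens: List[str]) -> Tuple[Counter, Counter]:
    """Count trigrams and how often each two-word context precedes a word."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    contexts: Counter = Counter()
    for (w1, w2, _), n in trigrams.items():
        contexts[(w1, w2)] += n
    return trigrams, contexts


def trigram_prob(word: str, w1: str, w2: str,
                 trigrams: Counter, contexts: Counter) -> float:
    """Unsmoothed estimate of P(word | w1, w2); 0.0 for unseen contexts."""
    seen = contexts[(w1, w2)]
    if seen == 0:
        return 0.0
    return trigrams[(w1, w2, word)] / seen


tokens = "the cat sat on the mat the cat ran".split()
trigrams, contexts = train_trigram(tokens)

# "the cat" occurs twice as a context, followed once by "sat" and once by "ran".
print(trigram_prob("sat", "the", "cat", trigrams, contexts))  # 0.5
```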

20. How Hateful are Movies? A Study and Prediction on Movie Subtitles

Author: von Boguszewski, Niklas, Moin, Sana, Bhowmick, Anirban, Yimam, Seid Muhie, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: In this research, we investigate techniques to detect hate speech in movies. We introduce a new dataset collected from the subtitles of six movies, where each utterance is annotated either as hate, offensive or normal. We apply transfer learning techniques of domain adaptation and fine-tuning on existing social media datasets, namely from Twitter and Fox News. We evaluate different representations, i.e., Bag of Words (BoW), Bi-directional Long short-term memory (Bi-LSTM), and Bidirectional Encoder Representations from Transformers (BERT) on 11k movie subtitles. The BERT model obtained the best macro-averaged F1-score of 77%. Hence, we show that transfer learning from the social media domain is efficacious in classifying hate and offensive speech in movies through subtitles.
Published: 2021

21. HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection

Author: Mathew, Binny, Saha, Punyajoy, Yimam, Seid Muhie, Biemann, Chris, Goyal, Pawan, and Mukherjee, Animesh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Social and Information Networks
Abstract: Hate speech is a challenging issue plaguing online social media. While better models for hate speech detection are continuously being developed, there is little research on the bias and interpretability aspects of hate speech. In this paper, we introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in our dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which their labelling decision (as hate, offensive or normal) is based. We utilize existing state-of-the-art models and observe that even models that perform very well in classification do not score high on explainability metrics like model plausibility and faithfulness. We also observe that models that utilize the human rationales for training perform better in reducing unintended bias towards target communities. We have made our code and dataset public at https://github.com/punyajoy/HateXplain.
Comment: 12 pages, 7 figures, 8 tables. Accepted at AAAI 2021
Published: 2020

22. Social Media Unrest Prediction during the COVID-19 Pandemic: Neural Implicit Motive Pattern Recognition as Psychometric Signs of Severe Crises

Author: Johannßen, Dirk and Biemann, Chris
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: The COVID-19 pandemic has caused international social tension and unrest. Besides the crisis itself, there are growing signs of rising conflict potential of societies around the world. Indicators of global mood changes are hard to detect and direct questionnaires suffer from social desirability biases. However, so-called implicit methods can reveal humans' intrinsic desires from e.g. social media texts. We present psychologically validated social unrest predictors and replicate scalable and automated predictions, setting a new state of the art on a recent German shared task dataset. We employ this model to investigate a change of language towards social unrest during the COVID-19 pandemic by comparing established psychological predictors on samples of tweets from spring 2019 with spring 2020. The results show a significant increase of the conflict-indicating psychometrics. With this work, we demonstrate the applicability of automated NLP-based approaches to quantitative psychological research.
Comment: 8 pages
Published: 2020

23. Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets

Author: Yimam, Seid Muhie, Ayele, Abinew Ali, Venkatesh, Gopalakrishnan, Gashaw, Ibrahim, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The availability of different pre-trained semantic models enabled the quick development of machine learning components for downstream applications. Despite the availability of abundant text data for low resource languages, only a few semantic models are publicly available. Publicly available pre-trained models are usually built as a multilingual version of semantic models that cannot fit well for each language due to context variations. In this work, we introduce different semantic models for Amharic. After we experiment with the existing pre-trained semantic models, we trained and fine-tuned nine new different models using a monolingual text corpus. The models are built using word2Vec embeddings, distributional thesaurus (DT), contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and investigate their impact. We find that newly trained models perform better than pre-trained multilingual models. Furthermore, models based on contextual embeddings from RoBERTa perform better than the word2Vec models.
Comment: 18 pages
Published: 2020

24. Individual corpora predict fast memory retrieval during reading

Author: Hofmann, Markus J., Müller, Lara, Rölke, Andre, Radach, Ralph, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: The corpus from which a predictive language model is trained can be considered the experience of a semantic system. We recorded everyday reading of two participants for two months on a tablet, generating individual corpus samples of 300/500K tokens. Then we trained word2vec models from individual corpora and a 70 million-sentence newspaper corpus to obtain individual and norm-based long-term memory structure. To test whether individual corpora can make better predictions for a cognitive task of long-term memory retrieval, we generated stimulus materials consisting of 134 sentences with uncorrelated individual and norm-based word probabilities. For the subsequent eye tracking study 1-2 months later, our regression analyses revealed that individual, but not norm-corpus-based word probabilities can account for first-fixation duration and first-pass gaze duration. Word length additionally affected gaze duration and total viewing duration. The results suggest that corpora representative for an individual's long-term memory structure can better explain reading performance than a norm corpus, and that recently acquired information is lexically accessed rapidly., Comment: Proceedings of the 6th workshop on Cognitive Aspects of the Lexicon (CogALex-VI), Barcelona, Spain, December 12, 2020; accepted manuscript; 11 pages, 2 figures, 4 Tables
Published: 2020

25. Neural Entity Linking: A Survey of Models Based on Deep Learning

Author: Sevgili, Ozge, Shelmanov, Artem, Arkhipov, Mikhail, Panchenko, Alexander, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: This survey presents a comprehensive description of recent neural entity linking (EL) systems developed since 2015 as a result of the "deep learning revolution" in natural language processing. Its goal is to systemize design features of neural entity linking systems and compare their performance to the remarkable classic methods on common benchmarks. This work distills a generic architecture of a neural EL system and discusses its components, such as candidate generation, mention-context encoding, and entity ranking, summarizing prominent methods for each of them. The vast variety of modifications of this general architecture are grouped by several common themes: joint entity mention detection and disambiguation, models for global linking, domain-independent techniques including zero-shot and distant supervision methods, and cross-lingual approaches. Since many neural models take advantage of entity and mention/context embeddings to represent their meaning, this work also overviews prominent entity embedding techniques. Finally, the survey touches on applications of entity linking, focusing on the recently emerged use-case of enhancing deep pre-trained masked language models based on the Transformer architecture., Comment: Published in Semantic Web journal
Published: 2020
Full Text: View/download PDF

26. Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization

Author: Milde, Benjamin and Biemann, Chris
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: The Sparsespeech model is an unsupervised acoustic model that can generate discrete pseudo-labels for untranscribed speech. We extend the Sparsespeech model to allow for sampling over a random discrete variable, yielding pseudo-posteriorgrams. The degree of sparsity in this posteriorgram can be fully controlled after the model has been trained. We use the Gumbel-Softmax trick to approximately sample from a discrete distribution in the neural network and this allows us to train the network efficiently with standard backpropagation. The new and improved model is trained and evaluated on the Libri-Light corpus, a benchmark for ASR with limited or no supervision. The model is trained on 600h and 6000h of English read speech. We evaluate the improved model using the ABX error measure and a semi-supervised setting with 10h of transcribed speech. We observe a relative improvement of up to 31.4% on ABX error rates across speakers on the test set with the improved Sparsespeech model on 600h of speech data and further improvements when we scale the model to 6000h.
Published: 2020

27. UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer Networks for Offensive Language Detection

Author: Wiedemann, Gregor, Yimam, Seid Muhie, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks. Typically, fine-tuning is performed on task-specific training datasets in a supervised manner. One can also fine-tune in an unsupervised manner beforehand by further pre-training the masked language modeling (MLM) task. Hereby, in-domain data for unsupervised MLM resembling the actual classification target dataset allows for domain adaptation of the model. In this paper, we compare current pre-trained transformer networks with and without MLM fine-tuning on their performance for offensive language detection. Our MLM fine-tuned RoBERTa-based classifier officially ranks 1st in the SemEval 2020 Shared Task 12 for the English language. Further experiments with the ALBERT model even surpass this result.
Published: 2020

28. Word Sense Disambiguation for 158 Languages using Word Embeddings Only

Author: Logacheva, Varvara, Teslenko, Denis, Shelmanov, Artem, Remus, Steffen, Ustalov, Dmitry, Kutuzov, Andrey, Artemova, Ekaterina, Biemann, Chris, Ponzetto, Simone Paolo, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised and/or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. Models and system are available online., Comment: 10 pages, 5 figures, 4 tables, accepted at LREC 2020
Published: 2020

29. Automatic Compilation of Resources for Academic Writing and Evaluating with Informal Word Identification and Paraphrasing System

Author: Yimam, Seid Muhie, Venkatesh, Gopalakrishnan, Lee, John Sie Yuen, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: We present the first approach to automatically building resources for academic writing. The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing. On top of existing academic resources, such as the Corpus of Contemporary American English (COCA) academic Word List, the New Academic Word List, and the Academic Collocation List, we also explore how to dynamically build such resources that would be used to automatically identify informal or non-academic words or phrases. The resources are compiled using different generic approaches that can be extended for different domains and languages. We describe the evaluation of resources with a system implementation. The system consists of an informal word identification (IWI), academic candidate paraphrase generation, and paraphrase ranking components. To generate candidates and rank them in context, we have used the PPDB and WordNet paraphrase resources. We use the Concepts in Context (CoInCO) "All-Words" lexical substitution dataset both for the informal word identification and paraphrase generation experiments. Our informal word identification component achieves an F-1 score of 82%, significantly outperforming a stratified classifier baseline. The main contribution of this work is a domain-independent methodology to build targeted resources for writing aids.
Published: 2020

30. Analysis of the Ethiopic Twitter Dataset for Abusive Speech in Amharic

Author: Yimam, Seid Muhie, Ayele, Abinew Ali, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Social and Information Networks
Abstract: In this paper, we present an analysis of the first Ethiopic Twitter Dataset for the Amharic language targeted for recognizing abusive speech. The dataset, which is written in Fidel script, has been collected since 2014. Since several languages can be written using the Fidel script, we have used the existing Amharic, Tigrinya and Ge'ez corpora to retain only the Amharic tweets. We have analyzed the tweets for abusive speech content with the following targets: analyze the distribution and tendency of abusive speech content over time, and compare the abusive speech content between a Twitter corpus and a general reference Amharic corpus.
Published: 2019

31. Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings

Author: Wiedemann, Gregor, Remus, Steffen, Chawla, Avi, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: Contextualized word embeddings (CWE) such as provided by ELMo (Peters et al., 2018), Flair NLP (Akbik et al., 2018), or BERT (Devlin et al., 2019) are a major recent innovation in NLP. CWEs provide semantic vector representations of words depending on their respective context. Their advantage over static word embeddings has been shown for a number of tasks, such as text classification, sequence tagging, or machine translation. Since vectors of the same word type can vary depending on the respective context, they implicitly provide a model for word sense disambiguation (WSD). We introduce a simple but effective approach to WSD using a nearest neighbor classification on CWEs. We compare the performance of different CWE models for the task and can report improvements above the current state of the art for two standard WSD benchmark datasets. We further show that the pre-trained BERT model is able to place polysemic words into distinct 'sense' regions of the embedding space, while ELMo and Flair NLP do not seem to possess this ability., Comment: 10 pages, 3 figures, 6 tables, Accepted for Konferenz zur Verarbeitung natürlicher Sprache / Conference on Natural Language Processing (KONVENS) 2019, Erlangen/Germany
Published: 2019

32. Making Fast Graph-based Algorithms with Graph Metric Embeddings

Author: Kutuzov, Andrey, Dorgham, Mohammad, Oliynyk, Oleksiy, Biemann, Chris, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: The computation of distance measures between nodes in graphs is inefficient and does not scale to large graphs. We explore dense vector representations as an effective way to approximate the same information: we introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks and show evaluations on the WordNet graph and two knowledge base graphs., Comment: In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'2019). Florence, Italy
Published: 2019

33. Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records

Author: Friedrich, Max, Köhn, Arne, Wiedemann, Gregor, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly., Comment: Accepted at ACL 2019; camera-ready version
Published: 2019

34. On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings

Author: Jana, Abhik, Puzyrev, Dmitry, Panchenko, Alexander, Goyal, Pawan, Biemann, Chris, and Mukherjee, Animesh
Subjects: Computer Science - Computation and Language
Abstract: The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus., Comment: Accepted in ACL 2019 [Long Paper]
Published: 2019

35. Every child should have parents: a taxonomy refinement algorithm based on hyperbolic term embeddings

Author: Aly, Rami, Acharya, Shantanu, Ossa, Alexander, Köhn, Arne, Biemann, Chris, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: We introduce the use of Poincaré embeddings to improve existing state-of-the-art approaches to domain-specific taxonomy induction from text as a signal for both relocating wrong hyponym terms within a (pre-induced) taxonomy as well as for attaching disconnected terms in a taxonomy. This method substantially improves previous state-of-the-art results on the SemEval-2016 Task 13 on taxonomy extraction. We demonstrate the superiority of Poincaré embeddings over distributional semantic representations, supporting the hypothesis that they can better capture hierarchical lexical-semantic relationships than embeddings in the Euclidean space., Comment: 7 pages (5 + 2 pages references), 2 Figures, 3 Tables, Accepted to the ACL 2019 conference. Will appear in its proceedings
Published: 2019

36. HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings

Author: Anwar, Saba, Ustalov, Dmitry, Arefyev, Nikolay, Ponzetto, Simone Paolo, Biemann, Chris, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (QasemiZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word and their context embeddings and role labeling by combining these embeddings with syntactical features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages., Comment: 5 pages, 3 tables, accepted at SemEval 2019
Published: 2019
Full Text: View/download PDF

37. Answering Comparative Questions: Better than Ten-Blue-Links?

Author: Schildwächter, Matthias, Bondarenko, Alexander, Zenker, Julian, Hagen, Matthias, Biemann, Chris, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: We present CAM (comparative argumentative machine), a novel open-domain IR system to argumentatively compare objects with respect to information extracted from the Common Crawl. In a user study, the participants obtained 15% more accurate answers using CAM compared to a "traditional" keyword-based search and were 20% faster in finding the answer to comparative questions., Comment: In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval (CHIIR '19), March 10-14, 2019, Glasgow, United Kingdom
Published: 2019
Full Text: View/download PDF

38. Transfer Learning from LDA to BiLSTM-CNN for Offensive Language Detection in Twitter

Author: Wiedemann, Gregor, Ruppert, Eugen, Jindal, Raghav, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: We investigate different strategies for automatic offensive language classification on German Twitter data. For this, we employ a sequentially combined BiLSTM-CNN neural network. Based on this model, three transfer learning tasks to improve the classification performance with background knowledge are tested. We compare 1. Supervised category transfer: social media data annotated with near-offensive language categories, 2. Weakly-supervised category transfer: tweets annotated with emojis they contain, 3. Unsupervised category transfer: tweets annotated with topic clusters obtained by Latent Dirichlet Allocation (LDA). Further, we investigate the effect of three different strategies to mitigate negative effects of 'catastrophic forgetting' during transfer learning. Our results indicate that transfer learning in general improves offensive language detection. Best results are achieved from pre-training our model on the unsupervised topic clustering of tweets in combination with thematic user cluster information., Comment: 10 pages, 1 figure
Published: 2018

39. microNER: A Micro-Service for German Named Entity Recognition based on BiLSTM-CRF

Author: Wiedemann, Gregor, Jindal, Raghav, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: For named entity recognition (NER), bidirectional recurrent neural networks became the state-of-the-art technology in recent years. Competing approaches vary with respect to pre-trained word embeddings as well as models for character embeddings to represent sequence information most effectively. For NER in German language texts, these model variations have not been studied extensively. We evaluate the performance of different word and character embeddings on two standard German datasets and with a special focus on out-of-vocabulary words. With F-Scores above 82% for the GermEval'14 dataset and above 85% for the CoNLL'03 dataset, we achieve (near) state-of-the-art performance for this task. We publish several pre-trained models wrapped into a micro-service based on Docker to allow for easy integration of German NER into other applications via a JSON API., Comment: 7 pages, 1 figure
Published: 2018

40. Unsupervised Sense-Aware Hypernymy Extraction

Author: Ustalov, Dmitry, Panchenko, Alexander, Biemann, Chris, and Ponzetto, Simone Paolo
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we show how unsupervised sense representations can be used to improve hypernymy extraction. We present a method for extracting disambiguated hypernymy relationships that propagates hypernyms to sets of synonyms (synsets), constructs embeddings for these sets, and establishes sense-aware relationships between matching synsets. Evaluation on two gold standard datasets for English and Russian shows that the method successfully recognizes hypernymy relationships that cannot be found with standard Hearst patterns and Wiktionary datasets for the respective languages., Comment: In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018). Vienna, Austria
Published: 2018
Full Text: View/download PDF

41. Categorizing Comparative Sentences

Author: Panchenko, Alexander, Bondarenko, Alexander, Franzek, Mirco, Hagen, Matthias, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: We tackle the tasks of automatically identifying comparative sentences and categorizing the intended preference (e.g., "Python has better NLP libraries than MATLAB" => (Python, better, MATLAB)). To this end, we manually annotate 7,199 sentences for 217 distinct target item pairs from several domains (27% of the sentences contain an oriented comparison in the sense of "better" or "worse"). A gradient boosting model based on pre-trained sentence embeddings reaches an F1 score of 85% in our experimental evaluation. The model can be used to extract comparative sentences for pro/con argumentation in comparative / argument search engines or debating technologies., Comment: In Proceedings of the 6th Workshop on Argument Mining (ArgMining'2019) August 1st, collocated with ACL 2019 in Florence, Italy
Published: 2018

42. A Multilingual Information Extraction Pipeline for Investigative Journalism

Author: Wiedemann, Gregor, Yimam, Seid Muhie, and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: We introduce an advanced information extraction pipeline to automatically process very large collections of unstructured textual data for the purpose of investigative journalism. The pipeline serves as a new input processor for the upcoming major release of our New/s/leak 2.0 software, which we develop in cooperation with a large German news organization. The use case is that journalists receive a large collection of files up to several Gigabytes containing unknown contents. Collections may originate either from official disclosures of documents, e.g. Freedom of Information Act requests, or unofficial data leaks. Our software prepares a visually-aided exploration of the collection to quickly learn about potential stories contained in the data. It is based on the automatic extraction of entities and their co-occurrence in documents. In contrast to comparable projects, we focus on the following three major requirements particularly serving the use case of investigative journalism in cross-border collaborations: 1) composition of multiple state-of-the-art NLP tools for entity extraction, 2) support of multi-lingual document sets up to 40 languages, 3) fast and easy-to-use extraction of full-text, metadata and entities from various file formats., Comment: EMNLP 2018 Demo. arXiv admin note: text overlap with arXiv:1807.05151
Published: 2018

43. Demonstrating PAR4SEM - A Semantic Writing Aid with Adaptive Paraphrasing

Author: Yimam, Seid Muhie and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present Par4Sem, a semantic writing aid tool based on adaptive paraphrasing. Unlike many annotation tools that are primarily used to collect training examples, Par4Sem is integrated into a real-world application, in this case a writing aid tool, in order to collect training examples from usage data. Par4Sem is a tool that supports an adaptive, iterative, and interactive process where the underlying machine learning models are updated for each iteration using new training examples from usage data. After motivating the use of ever-learning tools in NLP applications, we evaluate Par4Sem by adapting it to a text simplification task through mere usage., Comment: EMNLP Demo paper
Published: 2018

44. Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction

Author: Ustalov, Dmitry, Panchenko, Alexander, Biemann, Chris, and Ponzetto, Simone Paolo
Subjects: Computer Science - Computation and Language, 68T50, I.2.7
Abstract: We present a detailed theoretical and computational analysis of the Watset meta-algorithm for fuzzy graph clustering, which has been found to be widely applicable in a variety of domains. This algorithm creates an intermediate representation of the input graph that reflects the "ambiguity" of its nodes. Then, it uses hard clustering to discover clusters in this "disambiguated" intermediate graph. After outlining the approach and analyzing its computational complexity, we demonstrate that Watset shows competitive results in three applications: unsupervised synset induction from a synonymy graph, unsupervised semantic frame induction from dependency triples, and unsupervised semantic class induction from a distributional thesaurus. Our algorithm is generic and can be also applied to other networks of linguistic data., Comment: 58 pages, 17 figures, accepted at the Computational Linguistics journal
Published: 2018
Full Text: View/download PDF

45. Learning Graph Embeddings from WordNet-based Similarity Measures

Author: Kutuzov, Andrey, Dorgham, Mohammad, Oliynyk, Oleksiy, Biemann, Chris, and Panchenko, Alexander
Subjects: Computer Science - Computation and Language
Abstract: We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances., Comment: Accepted to StarSem 2019
Published: 2018

46. New/s/leak 2.0 - Multilingual Information Extraction and Visualization for Investigative Journalism

Author: Wiedemann, Gregor, Yimam, Seid Muhie, and Biemann, Chris
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Investigative journalism in recent years is confronted with two major challenges: 1) vast amounts of unstructured data originating from large text collections such as leaks or answers to Freedom of Information requests, and 2) multi-lingual data due to intensified global cooperation and communication in politics, business and civil society. Faced with these challenges, journalists are increasingly cooperating in international networks. To support such collaborations, we present the new version of new/s/leak 2.0, our open-source software for content-based searching of leaks. It includes three novel main features: 1) automatic language detection and language-dependent information extraction for 40 languages, 2) entity and keyword visualization for efficient exploration, and 3) decentral deployment for analysis of confidential data from various formats. We illustrate the new analysis capabilities with an exemplary case study., Comment: Social Informatics 2018
Published: 2018

47. Par4Sim -- Adaptive Paraphrasing for Text Simplification

Author: Yimam, Seid Muhie and Biemann, Chris
Subjects: Computer Science - Computation and Language
Abstract: Learning from a real-world data stream and continuously updating the model without explicit supervision is a new challenge for NLP applications with machine learning components. In this work, we have developed an adaptive learning system for text simplification, which improves the underlying learning-to-rank model from usage data, i.e. how users have employed the system for the task of simplification. Our experimental results show that, over a period of time, the performance of the embedded paraphrase ranking model increases steadily, improving from a score of 62.88% up to 75.70% based on the NDCG@10 evaluation metric. To our knowledge, this is the first study where an NLP component is adaptively improved through usage., Comment: COLING 2018 main conference
Published: 2018

48. Unsupervised Semantic Frame Induction using Triclustering

Author: Ustalov, Dmitry, Panchenko, Alexander, Kutuzov, Andrei, Biemann, Chris, and Ponzetto, Simone Paolo
Subjects: Computer Science - Computation and Language
Abstract: We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived dataset and performs on par with competitive methods on a verb class clustering task., Comment: 8 pages, 1 figure, 4 tables, accepted at ACL 2018
Published: 2018
Full Text: View/download PDF

49. BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes

Author: Santus, Enrico, Biemann, Chris, and Chersoni, Emmanuele
Subjects: Computer Science - Computation and Language
Abstract: This paper describes BomJi, a supervised system for capturing discriminative attributes in word pairs (e.g. yellow as discriminative for banana over watermelon). The system relies on an XGB classifier trained on carefully engineered graph-, pattern- and word embedding based features. It participated in the SemEval-2018 Task 10 on Capturing Discriminative Attributes, achieving an F1 score of 0.73 and ranking 2nd out of 26 participant systems., Comment: 3 tables, 4 pages, SemEval, NAACL, NLP, Task
Published: 2018

50. An Unsupervised Word Sense Disambiguation System for Under-Resourced Languages

Author: Ustalov, Dmitry, Teslenko, Denis, Panchenko, Alexander, Chernoskutov, Mikhail, Biemann, Chris, and Ponzetto, Simone Paolo
Subjects: Computer Science - Computation and Language
Abstract: In this paper, we present Watasense, an unsupervised system for word sense disambiguation. Given a sentence, the system chooses the most relevant sense of each input word with respect to the semantic similarity between the given sentence and the synset constituting the sense of the target word. Watasense has two modes of operation. The sparse mode uses the traditional vector space model to estimate the most similar word sense corresponding to its context. The dense mode, instead, uses synset embeddings to cope with the sparsity problem. We describe the architecture of the present system and also conduct its evaluation on three different lexical semantic resources for Russian. We found that the dense mode substantially outperforms the sparse one on all datasets according to the adjusted Rand index., Comment: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan
Published: 2018
