Author: "Šnajder, Jan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Šnajder, Jan"' showing total 405 results

1. Disentangling Latent Shifts of In-Context Learning Through Self-Training

Author: Jukić, Josip and Šnajder, Jan
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: In-context learning (ICL) has become essential in natural language processing, particularly with autoregressive large language models capable of learning from demonstrations provided within the prompt. However, ICL faces challenges with stability and long contexts, especially as the number of demonstrations grows, leading to poor generalization and inefficient inference. To address these issues, we introduce STICL (Self-Training ICL), an approach that disentangles the latent shifts of demonstrations from the latent shift of the query through self-training. STICL employs a teacher model to generate pseudo-labels and trains a student model using these labels, encoded in an adapter module. The student model exhibits weak-to-strong generalization, progressively refining its predictions over time. Our empirical results show that STICL improves generalization and stability, consistently outperforming traditional ICL methods and other disentangling strategies across both in-domain and out-of-domain data.
Published: 2024

2. Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?

Author: Majer, Laura and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: The increasing threat of disinformation calls for automating parts of the fact-checking pipeline. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs' predictive and calibration accuracy on five CD/CW datasets from diverse domains, each utilizing a different worthiness criterion. We investigate two key aspects: (1) how best to distill factuality and worthiness criteria into a prompt and (2) what amount of context to provide for each claim. To this end, we experiment with varying the level of prompt verbosity and the amount of contextual information provided to the model. Our results show that optimal prompt verbosity is domain-dependent, adding context does not improve performance, and confidence scores can be directly used to produce reliable check-worthiness rankings.
Comment: Accepted to WASSA at EMNLP 2024
Published: 2024

3. From Robustness to Improved Generalization and Calibration in Pre-trained Language Models

Author: Jukić, Josip and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Enhancing generalization and uncertainty quantification in pre-trained language models (PLMs) is crucial for their effectiveness and reliability. Building on machine learning research that established the importance of robustness for improving generalization, we investigate the role of representation smoothness, achieved via Jacobian and Hessian regularization, in enhancing PLM performance. Although such regularization methods have proven effective in computer vision, their application in natural language processing (NLP), where PLM inputs are derived from a discrete domain, poses unique challenges. We introduce a novel two-phase regularization approach, JacHess, which minimizes the norms of the Jacobian and Hessian matrices within PLM intermediate representations relative to their inputs. Our evaluation using the GLUE benchmark demonstrates that JacHess significantly improves in-domain generalization and calibration in PLMs, outperforming unregularized fine-tuning and other similar regularization methods.
Published: 2024

4. LLMs for Targeted Sentiment in News Headlines: Exploring the Descriptive-Prescriptive Dilemma

Author: Juroš, Jana, Majer, Laura, and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: News headlines often evoke sentiment by intentionally portraying entities in particular ways, making targeted sentiment analysis (TSA) of headlines a worthwhile but difficult task. Due to its subjectivity, creating TSA datasets can involve various annotation paradigms, from descriptive to prescriptive, either encouraging or limiting subjectivity. LLMs are a good fit for TSA due to their broad linguistic and world knowledge and in-context learning abilities, yet their performance depends on prompt design. In this paper, we compare the accuracy of state-of-the-art LLMs and fine-tuned encoder models for TSA of news headlines using descriptive and prescriptive datasets across several languages. Exploring the descriptive-prescriptive continuum, we analyze how performance is affected by prompt prescriptiveness, ranging from plain zero-shot to elaborate few-shot prompts. Finally, we evaluate the ability of LLMs to quantify uncertainty via calibration error and comparison to human label variation. We find that LLMs outperform fine-tuned encoders on descriptive datasets, while calibration and F1-score generally improve with increased prescriptiveness, yet the optimal level varies.
Comment: Presented at 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis (WASSA) at ACL 2024
Published: 2024

5. Are ELECTRA's Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity

Author: Rep, Ivan, Dukić, David, and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELECTRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The community tacitly stopped utilizing ELECTRA's sentence embeddings for semantic textual similarity (STS). We notice a significant drop in performance for the ELECTRA discriminator's last layer in comparison to prior layers. We explore this drop and propose a way to repair the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8 points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other tasks. Further, we discover the surprising efficacy of ELECTRA's generator model, which performs on par with BERT, using significantly fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain adaptive pre-training.
Comment: Accepted at EMNLP 2024 Findings
Published: 2024

6. Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only LLMs for Sequence Labeling

Author: Dukić, David and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs' poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs' performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs.
Comment: Accepted at ACL 2024 Findings
Published: 2024

7. Out-of-Distribution Detection by Leveraging Between-Layer Transformation Smoothness

Author: Jelenić, Fran, Jukić, Josip, Tutek, Martin, Puljiz, Mate, and Šnajder, Jan
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Effective out-of-distribution (OOD) detection is crucial for reliable machine learning models, yet most current methods are limited in practical use due to requirements like access to training data or intervention in training. We present a novel method for detecting OOD data in Transformers based on transformation smoothness between intermediate layers of a network (BLOOD), which is applicable to pre-trained models without access to training data. BLOOD utilizes the tendency of between-layer representation transformations of in-distribution (ID) data to be smoother than the corresponding transformations of OOD data, a property that we also demonstrate empirically. We evaluate BLOOD on several text classification tasks with Transformer networks and demonstrate that it outperforms methods with comparable resource requirements. Our analysis also suggests that when learning simpler tasks, OOD data transformations maintain their original sharpness, whereas sharpness increases with more complex tasks.
Comment: International Conference on Learning Representations: ICLR 2024
Published: 2023

8. Parameter-Efficient Language Model Tuning with Active Learning in Low-Resource Settings

Author: Jukić, Josip and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models (PLMs) have ignited a surge in demand for effective fine-tuning techniques, particularly in low-resource domains and languages. Active learning (AL), a set of algorithms designed to decrease labeling costs by minimizing label complexity, has shown promise in confronting the labeling bottleneck. In parallel, adapter modules designed for parameter-efficient fine-tuning (PEFT) have demonstrated notable potential in low-resource settings. However, the interplay between AL and adapter-based PEFT remains unexplored. We present an empirical study of PEFT behavior with AL in low-resource settings for text classification tasks. Our findings affirm the superiority of PEFT over full fine-tuning (FFT) in low-resource settings and demonstrate that this advantage persists in AL setups. We further examine the properties of PEFT and FFT through the lens of forgetting dynamics and instance-level representations, where we find that PEFT yields more stable representations of early and middle layers compared to FFT. Our research underscores the synergistic potential of AL and PEFT in low-resource settings, paving the way for advancements in efficient and effective fine-tuning.
Comment: Accepted at EMNLP 2023
Published: 2023

9. Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection

Author: Dukić, David, Gashteovski, Kiril, Glavaš, Goran, and Šnajder, Jan
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) – identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer in TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that OIE relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing performance drops, in particular when transferring from a high-resource source domain (Wikipedia) to a low(er)-resource target domain (news). Additionally, we combine this improved transfer with masked language modeling on the target domain, observing further TD transfer gains. Finally, we demonstrate that the gains are robust to the choice of the OIE system.
Comment: Accepted at EACL 2024 Findings
Published: 2023

10. Paragraph-level Citation Recommendation based on Topic Sentences as Queries

Author: Medić, Zoran and Šnajder, Jan
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Citation recommendation (CR) models may help authors find relevant articles at various stages of the paper writing process. Most research has dealt with either global CR, which produces general recommendations suitable for the initial writing stage, or local CR, which produces specific recommendations more fitting for the final writing stages. We propose the task of paragraph-level CR as a middle ground between the two approaches, where the paragraph's topic sentence is taken as input and recommendations for citing within the paragraph are produced at the output. We propose a model for this task, fine-tune it using the quadruplet loss on the dataset of ACL papers, and show improvements over the baselines.
Published: 2023

11. On Dataset Transferability in Active Learning for Transformers

Author: Jelenić, Fran, Jukić, Josip, Drobac, Nina, and Šnajder, Jan
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Active learning (AL) aims to reduce labeling costs by querying the examples most beneficial for model learning. While the effectiveness of AL for fine-tuning transformer-based pre-trained language models (PLMs) has been demonstrated, it is less clear to what extent the AL gains obtained with one model transfer to others. We consider the problem of transferability of actively acquired datasets in text classification and investigate whether AL gains persist when a dataset built using AL coupled with a specific PLM is used to train a different PLM. We link the AL dataset transferability to the similarity of instances queried by the different PLMs and show that AL methods with similar acquisition sequences produce highly transferable datasets regardless of the models used. Additionally, we show that the similarity of acquisition sequences is influenced more by the choice of the AL method than the choice of the model.
Comment: Findings of the Association for Computational Linguistics: ACL 2023
Published: 2023

12. Data Augmentation for Neural NLP

Author: Pluščec, Domagoj and Šnajder, Jan
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Data scarcity is a problem that occurs in languages and tasks where we do not have large amounts of labeled data but want to use state-of-the-art models. Such models are often deep learning models that require a significant amount of data to train. Acquiring data for various machine learning problems is accompanied by high labeling costs. Data augmentation is a low-cost approach for tackling data scarcity. This paper gives an overview of current state-of-the-art data augmentation methods used for natural language processing, with an emphasis on methods for neural and transformer-based models. Furthermore, it discusses the practical challenges of data augmentation, possible mitigations, and directions for future research.
Published: 2023

13. You Are What You Talk About: Inducing Evaluative Topics for Personality Analysis

Author: Jukić, Josip, Vukojević, Iva, and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Expressing attitude or stance toward entities and concepts is an integral part of human behavior and personality. Recently, evaluative language data has become more accessible with social media's rapid growth, enabling large-scale opinion analysis. However, surprisingly little research examines the relationship between personality and evaluative language. To bridge this gap, we introduce the notion of evaluative topics, obtained by applying topic models to pre-filtered evaluative text from social media. We then link evaluative topics to individual text authors to build their evaluative profiles. We apply evaluative profiling to Reddit comments labeled with personality scores and conduct an exploratory study on the relationship between evaluative topics and Big Five personality facets, aiming for a more interpretable, facet-level analysis. Finally, we validate our approach by observing correlations consistent with prior research in personality psychology.
Comment: Accepted at EMNLP 2022 (Findings), NLP+CSS
Published: 2023

14. Smooth Sailing: Improving Active Learning for Pre-trained Language Models with Representation Smoothness Analysis

Author: Jukić, Josip and Šnajder, Jan
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Developed to alleviate prohibitive labeling costs, active learning (AL) methods aim to reduce label complexity in supervised learning. While recent work has demonstrated the benefit of using AL in combination with large pre-trained language models (PLMs), it has often overlooked the practical challenges that hinder the effectiveness of AL. We address these challenges by leveraging representation smoothness analysis to ensure AL is feasible, that is, both effective and practicable. Firstly, we propose an early stopping technique that does not require a validation set – often unavailable in realistic AL conditions – and observe significant improvements over random sampling across multiple datasets and AL methods. Further, we find that task adaptation improves AL, whereas standard short fine-tuning in AL does not provide improvements over random sampling. Our work demonstrates the usefulness of representation smoothness analysis for AL and introduces an AL stopping criterion that reduces label complexity.
Comment: Accepted at Learning with Small Data 2023, Association for Computational Linguistics
Published: 2022

15. Easy to Decide, Hard to Agree: Reducing Disagreements Between Saliency Methods

Author: Jukić, Josip, Tutek, Martin, and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: A popular approach to unveiling the black box of neural NLP models is to leverage saliency methods, which assign scalar importance scores to each input component. A common practice for evaluating whether an interpretability method is faithful has been to use evaluation-by-agreement – if multiple methods agree on an explanation, its credibility increases. However, recent work has found that saliency methods exhibit weak rank correlations even when applied to the same model instance and advocated for the use of alternative diagnostic methods. In our work, we demonstrate that rank correlation is not a good fit for evaluating agreement and argue that Pearson-r is a better-suited alternative. We further show that regularization techniques that increase faithfulness of attention explanations also increase agreement between saliency methods. By connecting our findings to instance categories based on training dynamics, we show that the agreement of saliency method explanations is very low for easy-to-learn instances. Finally, we connect the improvement in agreement across instance categories to local representation space statistics of instances, paving the way for work on analyzing which intrinsic model properties improve their predisposition to interpretability methods.
Comment: Accepted to findings of ACL 2023
Published: 2022

16. ALANNO: An Active Learning Annotation System for Mortals

Author: Jukić, Josip, Jelenić, Fran, Bićanić, Miroslav, and Šnajder, Jan
Subjects: Computer Science - Machine Learning
Abstract: Supervised machine learning has become the cornerstone of today's data-driven society, increasing the need for labeled data. However, the process of acquiring labels is often expensive and tedious. One possible remedy is to use active learning (AL) – a special family of machine learning algorithms designed to reduce labeling costs. Although AL has been successful in practice, a number of practical challenges hinder its effectiveness and are often overlooked in existing AL annotation tools. To address these challenges, we developed ALANNO, an open-source annotation system for NLP tasks equipped with features to make AL effective in real-world annotation projects. ALANNO facilitates annotation management in a multi-annotator setup and supports a variety of AL methods and underlying models, which are easily configurable and extensible.
Comment: Accepted at EACL 2023
Published: 2022

17. Large-scale Evaluation of Transformer-based Article Encoders on the Task of Citation Recommendation

Author: Medić, Zoran and Šnajder, Jan
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Recently introduced transformer-based article encoders (TAEs) designed to produce similar vector representations for mutually related scientific articles have demonstrated strong performance on benchmark datasets for scientific article recommendation. However, the existing benchmark datasets are predominantly focused on single domains and, in some cases, contain easy negatives in small candidate pools. Evaluating representations on such benchmarks might obscure the realistic performance of TAEs in setups with thousands of articles in candidate pools. In this work, we evaluate TAEs on large benchmarks with more challenging candidate pools. We compare the performance of TAEs with a lexical retrieval baseline model BM25 on the task of citation recommendation, where the model produces a list of recommendations for citing in a given input article. We find out that BM25 is still very competitive with the state-of-the-art neural retrievers, a finding which is surprising given the strong performance of TAEs on small benchmarks. As a remedy for the limitations of the existing benchmarks, we propose a new benchmark dataset for evaluating scientific article representations: Multi-Domain Citation Recommendation dataset (MDCR), which covers different scientific fields and contains challenging candidate pools.
Comment: Accepted to the Third Workshop on Scholarly Document Processing @ COLING 2022
Published: 2022

18. A Topic Coverage Approach to Evaluation of Topic Models

Author: Korenčić, Damir, Ristov, Strahil, Repar, Jelena, and Šnajder, Jan
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language, Computer Science - Machine Learning, H.3.3, I.5.4, I.2.7
Abstract: Topic models are widely used unsupervised models capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models' performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The paper contributes a new supervised measure of coverage, and the first unsupervised measure of coverage. The supervised measure achieves topic matching accuracy close to human agreement. The unsupervised measure correlates highly with the supervised one (Spearman's ρ ≥ 0.95). Other contributions include insights into both topic models and different methods of model evaluation, and the datasets and code for facilitating future research on topic coverage.
Comment: Final version accepted for publication in IEEE Access (https://doi.org/10.1109/ACCESS.2021.3109425); Contributions unchanged, results augmented with the analysis of the models' precision and recall and the analysis of the measures' running time; Improved description of the contributions; Improved future work
Published: 2020

19. Staying True to Your Word: (How) Can Attention Become Explanation?

Author: Tutek, Martin and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: The attention mechanism has quickly become ubiquitous in NLP. In addition to improving performance of models, attention has been widely used as a glimpse into the inner workings of NLP models. The latter aspect has in the recent years become a common topic of discussion, most notably in work of Jain and Wallace, 2019; Wiegreffe and Pinter, 2019. With the shortcomings of using attention weights as a tool of transparency revealed, the attention mechanism has been stuck in a limbo without concrete proof when and whether it can be used as an explanation. In this paper, we provide an explanation as to why attention has seen rightful critique when used with recurrent networks in sequence classification tasks. We propose a remedy to these issues in the form of a word level objective and our findings give credibility for attention to provide faithful interpretations of recurrent models.
Published: 2020

20. PANDORA Talks: Personality and Demographics on Reddit

Author: Gjurković, Matej, Karan, Mladen, Vukojević, Iva, Bošnjak, Mihaela, and Šnajder, Jan
Subjects: Computer Science - Computation and Language, Computer Science - Computers and Society, Computer Science - Social and Information Networks
Abstract: Personality and demographics are important variables in social sciences, while in NLP they can aid in interpretability and removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.
Comment: Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, NAACL 2021, https://www.aclweb.org/anthology/2021.socialnlp-1.12
Published: 2020

21. Not Just Depressed: Bipolar Disorder Prediction on Reddit

Author: Sekulić, Ivan, Gjurković, Matej, and Šnajder, Jan
Subjects: Computer Science - Computation and Language, Computer Science - Social and Information Networks
Abstract: Bipolar disorder, an illness characterized by manic and depressive episodes, affects more than 60 million people worldwide. We present a preliminary study on bipolar disorder prediction from user-generated text on Reddit, which relies on users' self-reported labels. Our benchmark classifiers for bipolar disorder prediction outperform the baselines and reach accuracy and F1-scores of above 86%. Feature analysis shows interesting differences in language use between users with bipolar disorders and the control group, including differences in the use of emotion-expressive words.
Comment: WASSA at EMNLP 2018
Published: 2018

22. Iterative Recursive Attention Model for Interpretable Sequence Classification

Author: Tutek, Martin and Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Natural language processing has greatly benefited from the introduction of the attention mechanism. However, standard attention models are of limited interpretability for tasks that involve a series of inference steps. We describe an iterative recursive attention model, which constructs incremental representations of input data through reusing results of previously computed queries. We train our model on sentiment classification datasets and demonstrate its capacity to identify and combine different aspects of the input in an easily interpretable manner, while obtaining performance close to the state of the art.
Comment: 7 pages, 5 figures, Analyzing and interpreting neural networks for NLP Workshop at EMNLP 2018
Published: 2018

23. An empirical study of the design choices for local citation recommendation systems

Author: Medić, Zoran and Šnajder, Jan
Published: 2022

24. SIMPA: Statement-to-Item Matching Personality Assessment from text

Author: Gjurković, Matej, Vukojević, Iva, and Šnajder, Jan
Published: 2022

25. Closed-domain event extraction for hard news event monitoring: a systematic study.

Author: Dukić, David, Došilović, Filip Karlo, Pluščec, Domagoj, and Šnajder, Jan
Subjects: Language models, Natural language processing, Argument
Abstract: News event monitoring systems allow real-time monitoring of a large number of events reported in the news, including the urgent and critical events comprising the so-called hard news. These systems heavily rely on natural language processing (NLP) to perform automatic event extraction at scale. While state-of-the-art event extraction models are readily available, integrating them into a news event monitoring system is not as straightforward as it seems due to practical issues related to model selection, robustness, and scale. To address this gap, we present a study on the practical use of event extraction models for news event monitoring. Our study focuses on the key task of closed-domain main event extraction (CDMEE), which aims to determine the type of the story's main event and extract its arguments from the text. We evaluate a range of state-of-the-art NLP models for this task, including those based on pre-trained language models. Aiming at a more realistic evaluation than done in the literature, we introduce a new dataset manually labeled with event types and their arguments. Additionally, we assess the scalability of CDMEE models and analyze the trade-off between accuracy and inference speed. Our results give insights into the performance of state-of-the-art NLP models on the CDMEE task and provide recommendations for developing effective, robust, and scalable news event monitoring systems.
Published: 2024

26. Social Media Argumentation Mining: The Quest for Deliberateness in Raucousness

Author: Šnajder, Jan
Subjects: Computer Science - Computation and Language
Abstract: Argumentation mining from social media content has attracted increasing attention. The task is both challenging and rewarding. The informal nature of user-generated content makes the task dauntingly difficult. On the other hand, the insights that could be gained by a large-scale analysis of social media argumentation make it a very worthwhile task. In this position paper I discuss the motivation for social media argumentation mining, as well as the tasks and challenges involved.
Published: 2016

27. Word sense induction using leader-follower clustering of automatically generated lexical substitutes

Author: Akkasi, Abbas and Snajder, Jan
Published: 2021
Full Text: View/download PDF

28. Document-based topic coherence measures for news media text

Author: Korenčić, Damir, Ristov, Strahil, and Šnajder, Jan
Published: 2018
Full Text: View/download PDF

29. Detecting Non-covered Questions in Frequently Asked Questions Collections

Author: Karan, Mladen, Šnajder, Jan, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Frasincar, Flavius, editor, Ittoo, Ashwin, editor, Nguyen, Le Minh, editor, and Métais, Elisabeth, editor
Published: 2017
Full Text: View/download PDF

30. Paraphrase-focused learning to rank for domain-specific frequently asked questions retrieval

Author: Karan, Mladen and Šnajder, Jan
Published: 2018
Full Text: View/download PDF

31. FAQIR – A Frequently Asked Questions Retrieval Test Collection

Author: Karan, Mladen, Šnajder, Jan, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
Published: 2016
Full Text: View/download PDF

32. Evaluation of Manual Query Expansion Rules on a Domain Specific FAQ Collection

Author: Karan, Mladen, Šnajder, Jan, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Mothe, Josanne, editor, Savoy, Jacques, editor, Kamps, Jaap, editor, Pinel-Sauvagnat, Karen, editor, Jones, Gareth, editor, San Juan, Eric, editor, Capellato, Linda, editor, and Ferro, Nicola, editor
Published: 2015
Full Text: View/download PDF

33. Leveraging Open Information Extraction for Improving Few-Shot Trigger Detection Domain Transfer

Author: Dukić, David, Gashteovski, Kiril, Glavaš, Goran, and Šnajder, Jan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) -- identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer for TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing negative transfer, in particular when transferring from a high-resource source Wikipedia domain to a low-resource target news domain. Additionally, we combine the extracted relations with masked language modeling on the target domain and obtain further TD performance gains. Finally, we demonstrate that the results are robust to the choice of the OIE system.
Published: 2023
Full Text: View/download PDF

34. Exploring Coreference Uncertainty of Generically Extracted Event Mentions

Author: Glavaš, Goran, Šnajder, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, and Gelbukh, Alexander, editor
Published: 2013
Full Text: View/download PDF

35. Towards a Constraint Grammar Based Morphological Tagger for Croatian

Author: Peradin, Hrvoje, Šnajder, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
Published: 2012
Full Text: View/download PDF

36. Optimizing Sentence Boundary Detection for Croatian

Author: Šarić, Frane, Šnajder, Jan, Dalbelo Bašić, Bojana, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
Published: 2012
Full Text: View/download PDF

37. Semi-supervised Acquisition of Croatian Sentiment Lexicon

Author: Glavaš, Goran, Šnajder, Jan, Dalbelo Bašić, Bojana, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Sojka, Petr, editor, Horák, Aleš, editor, Kopeček, Ivan, editor, and Pala, Karel, editor
Published: 2012
Full Text: View/download PDF

38. From Requirements to Code: Syntax-Based Requirements Analysis for Data-Driven Application Development

Author: Glavaš, Goran, Fertalj, Krešimir, Šnajder, Jan, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Bouma, Gosse, editor, Ittoo, Ashwin, editor, Métais, Elisabeth, editor, and Wortmann, Hans, editor
Published: 2012
Full Text: View/download PDF

39. Question Classification for a Croatian QA System

Author: Lombarović, Tomislav, Šnajder, Jan, Dalbelo Bašić, Bojana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Habernal, Ivan, editor, and Matoušek, Václav, editor
Published: 2011
Full Text: View/download PDF

40. Random Indexing Distributional Semantic Models for Croatian Language

Author: Janković, Vedrana, Šnajder, Jan, Dalbelo Bašić, Bojana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Habernal, Ivan, editor, and Matoušek, Václav, editor
Published: 2011
Full Text: View/download PDF

41. Unsupervised Topic-Oriented Keyphrase Extraction and Its Application to Croatian

Author: Saratlija, Josip, Šnajder, Jan, Dalbelo Bašić, Bojana, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Habernal, Ivan, editor, and Matoušek, Václav, editor
Published: 2011
Full Text: View/download PDF

42. Event graphs for information retrieval and multi-document summarization

Author: Glavaš, Goran and Šnajder, Jan
Published: 2014
Full Text: View/download PDF

43. TermeX: A Tool for Collocation Extraction

Author: Delač, Davor, Krleža, Zoran, Šnajder, Jan, Dalbelo Bašić, Bojana, Šarić, Frane, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, and Gelbukh, Alexander, editor
Published: 2009
Full Text: View/download PDF

44. Detecting Non-covered Questions in Frequently Asked Questions Collections

Author: Karan, Mladen, primary and Šnajder, Jan, additional
Published: 2017
Full Text: View/download PDF

45. Linguistic Features and Newsworthiness: An Analysis of News style

Author: di Buono, Maria Pia, primary and Šnajder, Jan, additional
Published: 2017
Full Text: View/download PDF

46. FAQIR – A Frequently Asked Questions Retrieval Test Collection

Author: Karan, Mladen, primary and Šnajder, Jan, additional
Published: 2016
Full Text: View/download PDF

47. Automatic Personality Assessment Based on Matching Social Media Sentences to Questionnaire Items [Automatska procjena ličnosti bazirana na podudarnosti rečenica s društvenih mreža i upitničkih čestica]

Author: Gjurković, Matej, Vukojević, Iva, Šnajder, Jan, Bratko, Denis, Butković, Ana, Jukić, Josip, Masnikosa, Irina, Pocrnić, Martina, Vukasović Hlupić, Tena, Drobac, Nina, Tucak Junaković, Ivana, Macuka, Ivana, and Tokić, Andrea
Subjects: personality, artificial intelligence, language, social networks, natural language processing
Abstract: Automatic personality assessment based on text from social media is receiving increasing attention in both psychology and artificial intelligence. On the one hand, psychologists' interest is driven primarily by the possibility of using digital behavioral traces for personality assessment. On the other hand, because of the rapid growth in the amount of text data generated by internet users, working with such large and unstructured data presents an interesting challenge for computer scientists. However, current automatic approaches to personality assessment are not geared toward ensuring interpretability (information about exactly which data are relevant for personality assessment) and validity (information about whether the data used are indeed valid cues of personality). Personality questionnaires, by contrast, have precisely interpretability and validity as their necessary foundation. To mitigate these weaknesses of automatic personality assessment, we propose an approach that combines questionnaire-based and automatic personality assessment. Our approach, abbreviated SIMPA (Statement-to-Item Matching Personality Assessment), uses natural language processing methods to detect personality self-descriptions, which are then used for automatic personality assessment. The core of the approach is the notion of dispositional semantic similarity between freely expressed statements and questionnaire items. This similarity combines semantic similarity with knowledge of how a given disposition might manifest. The conceptual basis of SIMPA is the Realistic Accuracy Model (Funder, 1995), which describes the steps in the process of arriving at an accurate personality judgment, and which we extend with a feedback loop mechanism that further improves assessment accuracy. In the talk, we present a simple implementation of SIMPA on data from the social network Reddit. We demonstrate how the approach can be used directly to assess the Big Five personality traits of Reddit users, as confirmed by statistically significant correlations between the traits so assessed and self-reports of the same traits. We also use the approach indirectly to produce features for a supervised machine learning model for automatic personality assessment, obtaining state-of-the-art results on the task of predicting the personality of Reddit users. Finally, we discuss important opportunities and challenges of analyzing language on the internet.
Published: 2022

48. Digital News Media as A Social Resilience Proxy: A Computational Political Economy Perspective

Author: Bilić, Paško, Dukić, David, Arambašić, Lucija, Gjurković, Matej, and Šnajder, Jan
Subjects: Digital news, political economy, platform economy, public sphere, social resilience, COVID-19, computational methods, natural language processing
Abstract: In this paper, we argue that the digital news media may serve as a proxy for social resilience in terms of uncovering meanings and themes that framed the perception of the pandemic in the public sphere. The public sphere approach has been evaluated and reconsidered ad nauseam in the last thirty years. A long list of factors limiting the deliberative, communicative action ideal of the public sphere includes fragmentation, media concentration, marketization, commercialization, digital intermediaries, echo chambers, post-democracy, and fake democracy to name a few. Yet the COVID-19 emergency has painfully exposed the need for just such a rational communication space in which citizens could find relevant information with rational discussion on how to cope, adapt and overcome a global health crisis. When physical mobility was limited, the media served as a focal point providing unprecedented coverage of political, economic, and social conditions affected by the uncontrollable spread of the virus. The media also provided citizens with a range of potential behavioral orientations in line with ongoing scientific research. We observe the digital news media from a systemic and political economy perspective, which allows us to interpret their societal role in fostering (or limiting) social resilience, managing, and potentially overcoming the crisis. To understand the role of the digital news media for social resilience, we deployed computational techniques that allowed us to decipher broad tendencies and shifts in news media reports over the course of four major waves of the pandemic (in terms of daily infection rates) between January 2020 and December 2021. We selected a total of 21 news portals in Croatia based on audience reach, regional coverage, and ownership (public, private, non-profit). We used computational techniques of natural language processing and machine learning to analyze and compare all news reports related to the pandemic and published during the analyzed period (N = 147 050).
Published: 2022

49. From 'kind' to 'I'm a kind dude': The frequency of the use of personality-descriptive adjectives on the social media website Reddit

Author: Vukojević, Iva, Butković, Ana, Gjurković, Matej, Masnikosa, Irina, Pocrnić, Martina, Šnajder, Jan, and Bratko, Denis
Subjects: personality adjectives, lexical hypothesis, Big Five, language, social media
Abstract: Personality adjectives are important in trait theories, but their significance in everyday language use is understudied. Hence, we compiled a set of 706 English adjectives previously linked to Big Five traits and analysed their frequency of occurrence in social media using the dataset PANDORA, which contains 14 million sentences from Reddit. The analysis is funnelled and proceeds in three stages, i.e., we count the occurrence of 1) the personality adjective alone (e.g., kind); 2) bigrams of adjectives and the noun "person" or its equivalent substitutions (e.g., kind dude); 3) the adjectives in self-descriptive sentences (e.g., I'm a kind dude.). Almost 1/4 of the adjectives were unmentioned and frequencies of occurrences were uneven between traits and between their positive and negative poles. We take a closer look at the least (e.g., uninquisitive, self-pitying) and most frequently used adjectives (e.g., simple, smart) and discuss the opportunities and challenges of analysing online language.
Published: 2022

50. Natural language processing for personality research: A case study of religiousness on the social media site Reddit

Author: Gjurković, Matej, Masnikosa, Irina, Vukojević, Iva, Butković, Ana, Pocrnić, Martina, Bratko, Denis, and Šnajder, Jan
Subjects: social media language, natural language processing, religiousness, text analysis, SIMPA framework
Abstract: Recently, advances in natural language processing (NLP) and machine learning have enabled research on personality cues that are ubiquitous in social media texts. In this study, we investigate whether the relationship between religiousness and personality traits demonstrated in the literature can also be found in social media texts using NLP. We propose a method based on the implementation of the statement-to-item matching framework (Gjurkovic et al., 2022), which we use to filter over 14 million sentences written by users with self-reported Big5 scores from the PANDORA Reddit dataset to find the most informative self-disclosures of users about their religious affiliation. Based on the filtered sentences alone, raters were able to classify a subset of users (n=299) into religious and non-religious groups. Although median differences on personality traits were in the expected direction, these differences were not significant. We present implications, challenges, and opportunities for using this method in future studies.
Published: 2022
