Author: "Färber, Michael" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Färber, Michael"' showing total 287 results

1. The Effects of Hallucinations in Synthetic Training Data for Relation Extraction

Author: Rogulsky, Steven, Popovic, Nicholas, and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Relation extraction is crucial for constructing knowledge graphs, with large high-quality datasets serving as the foundation for training, fine-tuning, and evaluating models. Generative data augmentation (GDA) is a common approach to expand such datasets. However, this approach often introduces hallucinations, such as spurious facts, whose impact on relation extraction remains underexplored. In this paper, we examine the effects of hallucinations on the performance of relation extraction on the document and sentence levels. Our empirical study reveals that hallucinations considerably compromise the ability of models to extract relations from text, with recall reductions between 19.1% and 39.2%. We identify that relevant hallucinations impair the model's performance, while irrelevant hallucinations have a minimal impact. Additionally, we develop methods for the detection of hallucinations to improve data quality and model performance. Our approaches successfully classify texts as either 'hallucinated' or 'clean,' achieving high F1-scores of 83.8% and 92.2%. These methods not only assist in removing hallucinations but also help in estimating their prevalence within datasets, which is crucial for selecting high-quality data. Overall, our work confirms the profound impact of relevant hallucinations on the effectiveness of relation extraction models., Comment: Accepted at KBC-LM@ISWC'24
Published: 2024

2. Assessing Privacy Policies with AI: Ethical, Legal, and Technical Challenges

Author: Aydin, Irem, Diebel-Fischer, Hermann, Freiberger, Vincent, Möller-Klapperich, Julia, Buchmann, Erik, Färber, Michael, Lauber-Rönsberg, Anne, and Platow, Birte
Subjects: Computer Science - Computers and Society
Abstract: The growing use of Machine Learning and Artificial Intelligence (AI), particularly Large Language Models (LLMs) like OpenAI's GPT series, leads to disruptive changes across organizations. At the same time, there is a growing concern about how organizations handle personal data. Thus, privacy policies are essential for transparency in data processing practices, enabling users to assess privacy risks. However, these policies are often long and complex. This might lead to user confusion and consent fatigue, where users accept data practices against their interests, and abusive or unfair practices might go unnoticed. LLMs can be used to assess privacy policies for users automatically. In this interdisciplinary work, we explore the challenges of this approach in three pillars, namely technical feasibility, ethical implications, and legal compatibility of using LLMs to assess privacy policies. Our findings aim to identify potential for future research, and to foster a discussion on the use of LLM technologies for enabling users to fulfil their important role as decision-makers in a constantly developing AI-driven digital economy., Comment: Published at AISyS 2024
Published: 2024

3. Knowledge Graph Structure as Prompt: Improving Small Language Models Capabilities for Knowledge-based Causal Discovery

Author: Susanti, Yuni and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Causal discovery aims to estimate causal structures among variables based on observational data. Large Language Models (LLMs) offer a fresh perspective to tackle the causal discovery problem by reasoning on the metadata associated with variables rather than their actual data values, an approach referred to as knowledge-based causal discovery. In this paper, we investigate the capabilities of Small Language Models (SLMs, defined as LLMs with fewer than 1 billion parameters) with prompt-based learning for knowledge-based causal discovery. Specifically, we present KG Structure as Prompt, a novel approach for integrating structural information from a knowledge graph, such as common neighbor nodes and metapaths, into prompt-based learning to enhance the capabilities of SLMs. Experimental results on three types of biomedical and open-domain datasets under few-shot settings demonstrate the effectiveness of our approach, surpassing most baselines and even conventional fine-tuning approaches trained on full datasets. Our findings further highlight the strong capabilities of SLMs: in combination with knowledge graphs and prompt-based learning, SLMs demonstrate the potential to surpass LLMs with larger number of parameters. Our code and datasets are available on GitHub., Comment: accepted at ISWC'24
Published: 2024

4. AutoRDF2GML: Facilitating RDF Integration in Graph Machine Learning

Author: Färber, Michael, Lamprecht, David, and Susanti, Yuni
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: In this paper, we introduce AutoRDF2GML, a framework designed to convert RDF data into data representations tailored for graph machine learning tasks. AutoRDF2GML enables, for the first time, the creation of both content-based features -- i.e., features based on RDF datatype properties -- and topology-based features -- i.e., features based on RDF object properties. Characterized by automated feature extraction, AutoRDF2GML makes it possible even for users less familiar with RDF and SPARQL to generate data representations ready for graph machine learning tasks, such as link prediction, node classification, and graph classification. Furthermore, we present four new benchmark datasets for graph machine learning, created from large RDF knowledge graphs using our framework. These datasets serve as valuable resources for evaluating graph machine learning approaches, such as graph neural networks. Overall, our framework effectively bridges the gap between the Graph Machine Learning and Semantic Web communities, paving the way for RDF-based machine learning applications., Comment: accepted at ISWC'24
Published: 2024

5. ComplexTempQA: A Large-Scale Dataset for Complex Temporal Question Answering

Author: Gruber, Raphael, Abdallah, Abdelrahman, Färber, Michael, and Jatowt, Adam
Subjects: Computer Science - Computation and Language
Abstract: We introduce ComplexTempQA, a large-scale dataset consisting of over 100 million question-answer pairs designed to tackle the challenges in temporal question answering. ComplexTempQA significantly surpasses existing benchmarks like HOTPOTQA, TORQUE, and TEQUILA in scale and scope. Utilizing data from Wikipedia and Wikidata, the dataset covers questions spanning over two decades and offers an unmatched breadth of topics. We introduce a unique taxonomy that categorizes questions as attributes, comparisons, and counting questions, each revolving around events, entities, and time periods. One standout feature of ComplexTempQA is the high complexity of its questions, which demand effective capabilities for answering such as across-time comparison, temporal aggregation, and multi-hop reasoning involving temporal event ordering and entity recognition. Additionally, each question is accompanied by detailed metadata, including specific time scopes, allowing for comprehensive evaluation and enhancement of the temporal reasoning abilities of large language models. ComplexTempQA serves both as a testing ground for developing sophisticated AI models and as a foundation for advancing research in question answering, information retrieval, and language understanding.
Published: 2024

6. Advanced Equalization in 112 Gb/s Upstream PON Using a Novel Fourier Convolution-based Network

Author: Shao, Chen, Giacoumidis, Elias, Matalla, Patrick, Li, Jialei, Li, Shi, Randel, Sebastian, Richter, Andre, Faerber, Michael, and Kaefer, Tobias
Subjects: Computer Science - Machine Learning
Abstract: We experimentally demonstrate a novel, low-complexity Fourier Convolution-based Network (FConvNet) based equalizer for 112 Gb/s upstream PAM4-PON. At a BER of 0.005, FConvNet enhances the receiver sensitivity by 2 and 1 dB compared to a 51-tap Sato equalizer and benchmark machine learning algorithms respectively., Comment: 4 pages, 5 figures
Published: 2024

7. Machine Learning in Short-Reach Optical Systems: A Comprehensive Survey

Author: Shao, Chen, Giacoumidis, Elias, Billah, Syed Moktacim, Li, Shi, Li, Jialei, Sahu, Prashasti, Richter, Andre, Kaefer, Tobias, and Faerber, Michael
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning
Abstract: In recent years, extensive research has been conducted to explore the utilization of machine learning algorithms in various direct-detected and self-coherent short-reach communication applications. These applications encompass a wide range of tasks, including bandwidth request prediction, signal quality monitoring, fault detection, traffic prediction, and digital signal processing (DSP)-based equalization. As a versatile approach, machine learning demonstrates the ability to address stochastic phenomena in optical systems networks where deterministic methods may fall short. However, when it comes to DSP equalization algorithms, their performance improvements are often marginal, and their complexity is prohibitively high, especially in cost-sensitive short-reach communications scenarios such as passive optical networks (PONs). They excel in capturing temporal dependencies, handling irregular or nonlinear patterns effectively, and accommodating variable time intervals. Within this extensive survey, we outline the application of machine learning techniques in short-reach communications, specifically emphasizing their utilization in high-bandwidth demanding PONs. Notably, we introduce a novel taxonomy for time-series methods employed in machine learning signal processing, providing a structured classification framework. Our taxonomy categorizes current time series methods into four distinct groups: traditional methods, Fourier convolution-based methods, transformer-based models, and time-series convolutional networks. Finally, we highlight prospective research directions within this rapidly evolving field and outline specific solutions to mitigate the complexity associated with hardware implementations. We aim to pave the way for more practical and efficient deployment of machine learning approaches in short-reach optical communication systems by addressing complexity concerns., Comment: 23 pages, 2 figures, 3 tables, Accepted as MDPI Photonics Journal Special Issue Machine Learning Applied to Optical Communication Systems
Published: 2024

8. A Novel Machine Learning-based Equalizer for a Downstream 100G PAM-4 PON

Author: Shao, Chen, Giacoumidis, Elias, Li, Shi, Li, Jialei, Faerber, Michael, Kaefer, Tobias, and Richter, Andre
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning
Abstract: A frequency-calibrated SCINet (FC-SCINet) equalizer is proposed for down-stream 100G PON with 28.7 dB path loss. At 5 km, FC-SCINet improves the BER by 88.87% compared to FFE and a 3-layer DNN with 10.57% lower complexity., Comment: 3 pages, 6 figures, accepted by Optical Fiber Communications Conference and Exhibition 2024
Published: 2024

9. GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism

Author: Yuan, Shuzhou and Färber, Michael
Subjects: Computer Science - Computation and Language
Abstract: Pretrained Language Models (PLMs) benefit from external knowledge stored in graph structures for various downstream tasks. However, bridging the modality gap between graph structures and text remains a significant challenge. Traditional methods like linearizing graphs for PLMs lose vital graph connectivity, whereas Graph Neural Networks (GNNs) require cumbersome processes for integration into PLMs. In this work, we propose a novel graph-guided self-attention mechanism, GraSAME. GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts. As an end-to-end, lightweight multimodal module, GraSAME follows a multi-task learning strategy and effectively bridges the gap between graph and textual modalities, facilitating dynamic interactions between GNNs and PLMs. Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets. Furthermore, compared to SOTA models, GraSAME eliminates the need for extra pre-training tasks to adjust graph inputs and reduces the number of trainable parameters by over 100 million., Comment: NAACL 2024 Findings
Published: 2024

10. A formal specification of the jq language

Author: Färber, Michael
Subjects: Computer Science - Logic in Computer Science, Computer Science - Programming Languages, D.3.1
Abstract: jq is a widely used tool that provides a programming language to manipulate JSON data. However, the jq language is currently only specified by its implementation, making it difficult to reason about its behaviour. To this end, we provide a formal syntax and denotational semantics for a large subset of the jq language. Our most significant contribution is to provide a new way to interpret updates that allows for more predictable and performant execution.
Published: 2024

11. GreeDy and CoDy: Counterfactual Explainers for Dynamic Graphs

Author: Qu, Zhan, Gomm, Daniel, and Färber, Michael
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Temporal Graph Neural Networks (TGNNs), crucial for modeling dynamic graphs with time-varying interactions, face a significant challenge in explainability due to their complex model structure. Counterfactual explanations, crucial for understanding model decisions, examine how input graph changes affect outcomes. This paper introduces two novel counterfactual explanation methods for TGNNs: GreeDy (Greedy Explainer for Dynamic Graphs) and CoDy (Counterfactual Explainer for Dynamic Graphs). They treat explanations as a search problem, seeking input graph alterations that alter model predictions. GreeDy uses a simple, greedy approach, while CoDy employs a sophisticated Monte Carlo Tree Search algorithm. Experiments show both methods effectively generate clear explanations. Notably, CoDy outperforms GreeDy and existing factual methods, with up to 59% higher success rate in finding significant counterfactual inputs. This highlights CoDy's potential in clarifying TGNN decision-making, increasing their transparency and trustworthiness in practice.
Published: 2024

12. Embedded Named Entity Recognition using Probing Classifiers

Author: Popovič, Nicholas and Färber, Michael
Subjects: Computer Science - Computation and Language
Abstract: Streaming text generation has become a common way of increasing the responsiveness of language model powered applications, such as chat assistants. At the same time, extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose an approach called EMBER which enables streaming named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments show that EMBER maintains high token generation rates, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline. We make our code and data available online, including a toolkit for training, testing, and deploying efficient token classification models optimized for streaming text generation., Comment: EMNLP 2024 (main)
Published: 2024

13. Decomposed Prompting: Unveiling Multilingual Linguistic Structure Knowledge in English-Centric Large Language Models

Author: Nie, Ercong, Yuan, Shuzhou, Ma, Bolei, Schmid, Helmut, Färber, Michael, Kreuter, Frauke, and Schütze, Hinrich
Subjects: Computer Science - Computation and Language
Abstract: Despite the predominance of English in their training data, English-centric Large Language Models (LLMs) like GPT-3 and LLaMA display a remarkable ability to perform multilingual tasks, raising questions about the depth and nature of their cross-lingual capabilities. This paper introduces the decomposed prompting approach to probe the linguistic structure understanding of these LLMs in sequence labeling tasks. Diverging from the single text-to-text prompt, our method generates for each token of the input sentence an individual prompt which asks for its linguistic label. We assess our method on the Universal Dependencies part-of-speech tagging dataset for 38 languages, utilizing both English-centric and multilingual LLMs. Our findings show that decomposed prompting surpasses the iterative prompting baseline in efficacy and efficiency under zero- and few-shot settings. Further analysis reveals the influence of evaluation methods and the use of instructions in prompts. Our multilingual investigation shows that English-centric language models perform better on average than multilingual models. Our study offers insights into the multilingual transferability of English-centric LLMs, contributing to the understanding of their multilingual linguistic knowledge., Comment: 18 pages, 7 figures
Published: 2024

14. GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network

Author: Yuan, Shuzhou, Nie, Ercong, Färber, Michael, Schmid, Helmut, and Schütze, Hinrich
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) exhibit strong In-Context Learning (ICL) capabilities when prompts with demonstrations are used. However, fine-tuning still remains crucial to further enhance their adaptability. Prompt-based fine-tuning proves to be an effective fine-tuning method in low-data scenarios, but high demands on computing resources limit its practicality. We address this issue by introducing a prompt-based parameter-efficient fine-tuning (PEFT) approach. GNNavi leverages insights into ICL's information flow dynamics, which indicates that label words act in prompts as anchors for information propagation. GNNavi employs a Graph Neural Network (GNN) layer to precisely guide the aggregation and distribution of information flow during the processing of prompts by hardwiring the desired information flow into the GNN. Our experiments on text classification tasks with GPT-2 and Llama2 show GNNavi surpasses standard prompt-based fine-tuning methods in few-shot settings by updating just 0.2% to 0.5% of parameters. We compare GNNavi with prevalent PEFT approaches, such as prefix tuning, LoRA and Adapter in terms of performance and efficiency. Our analysis reveals that GNNavi enhances information flow and ensures a clear aggregation process., Comment: ACL2024 Findings
Published: 2024

15. Why Lift so Heavy? Slimming Large Language Models by Cutting Off the Layers

Author: Yuan, Shuzhou, Nie, Ercong, Ma, Bolei, and Färber, Michael
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) possess outstanding capabilities in addressing various natural language processing (NLP) tasks. However, the sheer size of these models poses challenges in terms of storage, training and inference due to the inclusion of billions of parameters through layer stacking. While traditional approaches such as model pruning or distillation offer ways for reducing model size, they often come at the expense of performance retention. In our investigation, we systematically explore the approach of reducing the number of layers in LLMs. Surprisingly, we observe that even with fewer layers, LLMs maintain similar or better performance levels, particularly in prompt-based fine-tuning for text classification tasks. Remarkably, in certain cases, models with a single layer outperform their fully layered counterparts. These findings offer valuable insights for future work aimed at mitigating the size constraints of LLMs while preserving their performance, thereby opening avenues for significantly more efficient use of LLMs., Comment: 6 pages, 2 figures
Published: 2024

16. ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks

Author: Ma, Bolei, Nie, Ercong, Yuan, Shuzhou, Schmid, Helmut, Färber, Michael, Kreuter, Frauke, and Schütze, Hinrich
Subjects: Computer Science - Computation and Language
Abstract: Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks., Comment: EACL 2024
Published: 2024

17. HyperPIE: Hyperparameter Information Extraction from Scientific Publications

Author: Saier, Tarek, Ohta, Mayumi, Asakura, Takuto, and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Automatic extraction of information from publications is key to making scientific knowledge machine readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% F1 over a state-of-the-art baseline. For large language models, we develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% F1 in entity recognition over using JSON. With our best performing model we extract hyperparameter information from a large number of unannotated papers, and analyze patterns across disciplines. All our data and source code is publicly available at https://github.com/IllDepence/hyperpie, Comment: accepted at ECIR2024
Published: 2023

18. Linked Papers With Code: The Latest in Machine Learning as an RDF Knowledge Graph

Author: Färber, Michael and Lamprecht, David
Subjects: Computer Science - Digital Libraries, Computer Science - Artificial Intelligence
Abstract: In this paper, we introduce Linked Papers With Code (LPWC), an RDF knowledge graph that provides comprehensive, current information about almost 400,000 machine learning publications. This includes the tasks addressed, the datasets utilized, the methods implemented, and the evaluations conducted, along with their results. Compared to its non-RDF-based counterpart Papers With Code, LPWC not only translates the latest advancements in machine learning into RDF format, but also enables novel ways for scientific impact quantification and scholarly key content recommendation. LPWC is openly accessible at https://linkedpaperswithcode.com and is licensed under CC-BY-SA 4.0. As a knowledge graph in the Linked Open Data cloud, we offer LPWC in multiple formats, from RDF dump files to a SPARQL endpoint for direct web queries, as well as a data source with resolvable URIs and links to the data sources SemOpenAlex, Wikidata, and DBLP. Additionally, we supply knowledge graph embeddings, enabling LPWC to be readily applied in machine learning applications., Comment: Published at ISWC'23
Published: 2023

19. Analyzing the Impact of Companies on AI Research Based on Publications

Author: Färber, Michael and Tampakis, Lazaros
Subjects: Computer Science - Artificial Intelligence, Computer Science - Digital Libraries, Computer Science - Information Retrieval
Abstract: Artificial Intelligence (AI) is one of the most momentous technologies of our time. Thus, it is of major importance to know which stakeholders influence AI research. Besides researchers at universities and colleges, researchers in companies have hardly been considered in this context. In this article, we consider how the influence of companies on AI research can be made measurable on the basis of scientific publishing activities. We compare academic- and company-authored AI publications published in the last decade and use scientometric data from multiple scholarly databases to look for differences across these groups and to disclose the top contributing organizations. While the vast majority of publications is still produced by academia, we find that the citation count an individual publication receives is significantly higher when it is (co-)authored by a company. Furthermore, using a variety of altmetric indicators, we notice that publications with company participation receive considerably more attention online. Finally, we place our analysis results in a broader context and present targeted recommendations to safeguard a harmonious balance between academia and industry in the realm of AI research., Comment: Published in Scientometrics
Published: 2023

20. A Full-fledged Commit Message Quality Checker Based on Machine Learning

Author: Faragó, David, Färber, Michael, and Petrov, Christian
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Commit messages (CMs) are an essential part of version control. By providing important context in regard to what has changed and why, they strongly support software maintenance and evolution. But writing good CMs is difficult and often neglected by developers. So far, there is no tool suitable for practice that automatically assesses how well a CM is written, including its meaning and context. Since this task is challenging, we ask the research question: how well can the CM quality, including semantics and context, be measured with machine learning methods? By considering all rules from the most popular CM quality guideline, creating datasets for those rules, and training and evaluating state-of-the-art machine learning models to check those rules, we can answer the research question with: sufficiently well for practice, with the lowest F1 score of 82.9%, for the most challenging task. We develop a full-fledged open-source framework that checks all these CM quality rules. It is useful for research, e.g., automatic CM generation, but most importantly for software practitioners to raise the quality of CMs and thus the maintainability and evolution speed of their software., Comment: published at COMPSAC'23
Published: 2023

21. SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples

Author: Färber, Michael, Lamprecht, David, Krause, Johan, Aung, Linn, and Haase, Peter
Subjects: Computer Science - Digital Libraries, Computer Science - Artificial Intelligence
Abstract: We present SemOpenAlex, an extensive RDF knowledge graph that contains over 26 billion triples about scientific publications and their associated entities, such as authors, institutions, journals, and concepts. SemOpenAlex is licensed under CC0, providing free and open access to the data. We offer the data through multiple channels, including RDF dump files, a SPARQL endpoint, and as a data source in the Linked Open Data cloud, complete with resolvable URIs and links to other data sources. Moreover, we provide embeddings for knowledge graph entities using high-performance computing. SemOpenAlex enables a broad range of use-case scenarios, such as exploratory semantic search via our website, large-scale scientific impact quantification, and other forms of scholarly big data analytics within and across scientific disciplines. Additionally, it enables academic recommender systems, such as recommending collaborators, publications, and venues, including explainability capabilities. Finally, SemOpenAlex can serve for RDF query optimization benchmarks, creating scholarly knowledge-guided language models, and as a hub for semantic scientific publishing., Comment: accepted at ISWC'23
Published: 2023

22. Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election

Author: Färber, Michael, Schwade, Jannik, and Jatowt, Adam
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Determining and measuring diversity in news articles is important for a number of reasons, including preventing filter bubbles and fueling public discourse, especially before elections. So far, the identification and analysis of diversity have been illuminated in a variety of ways, such as measuring the overlap of words or topics between news articles related to US elections. However, the question of how diversity in news articles can be measured holistically, i.e., with respect to (1) variety, (2) balance, and (3) disparity, considering individuals, parties, and topics, has not been addressed. In this paper, we present a framework for determining diversity in news articles according to these dimensions. Furthermore, we create and provide a dataset of Google Top Stories, encompassing more than 26,000 unique headlines from more than 900 news outlets collected within two weeks before and after the 2021 German federal election. While we observe high diversity for more general search terms (e.g., "election"), a range of search terms ("education," "Europe," "climate protection," "government") resulted in news articles with high diversity in two out of three dimensions. This reflects a more subjective, dedicated discussion on rather future-oriented topics.
Published: 2023

23. Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings

Author: Färber, Michael and Popovic, Nicholas
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this paper, we propose Vocab-Expander at https://vocab-expander.com, an online tool that enables end-users (e.g., technology scouts) to create and expand a vocabulary of their domain of interest. It utilizes an ensemble of state-of-the-art word embedding techniques based on web text and ConceptNet, a common-sense knowledge base, to suggest related terms for already given terms. The system has an easy-to-use interface that allows users to quickly confirm or reject term suggestions. Vocab-Expander offers a variety of potential use cases, such as improving concept-based information retrieval in technology and innovation management, enhancing communication and collaboration within organizations or interdisciplinary projects, and creating vocabularies for specific courses in education., Comment: accepted at RANLP'23
Published: 2023

24. Evaluating Generative Models for Graph-to-Text Generation

Author: Yuan, Shuzhou and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have been widely employed for graph-to-text generation tasks. However, the process of finetuning LLMs requires significant training resources and annotation work. In this paper, we explore the capability of generative models to generate descriptive text from graph data in a zero-shot setting. Specifically, we evaluate GPT-3 and ChatGPT on two graph-to-text datasets and compare their performance with that of finetuned LLM models such as T5 and BART. Our results demonstrate that generative models are capable of generating fluent and coherent text, achieving BLEU scores of 10.57 and 11.08 for the AGENDA and WebNLG datasets, respectively. However, our error analysis reveals that generative models still struggle with understanding the semantic relations between entities, and they also tend to generate text with hallucinations or irrelevant information. As a part of error analysis, we utilize BERT to detect machine-generated text and achieve high macro-F1 scores. We have made the text generated by generative models publicly available., Comment: Accepted as short paper in RANLP2023
Published: 2023

25. CoCon: A Data Set on Combined Contextualized Research Artifact Use

Author: Saier, Tarek, Dong, Youxiang, and Färber, Michael
Subjects: Computer Science - Digital Libraries, Computer Science - Computation and Language
Abstract: In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets. To enable more holistic analyses and systems dealing with academic publications and their content, we propose CoCon, a large scholarly data set reflecting the combined use of research artifacts, contextualized in academic publications' full-text. Our data set comprises 35 k artifacts (data sets, methods, models, and tasks) and 340 k publications. We additionally formalize a link prediction task for "combined research artifact use prediction" and provide code to utilize analyses of and the development of ML applications on our data. All data and code is publicly available at https://github.com/IllDepence/contextgraph., Comment: submitted to JCDL2023
Published: 2023
Full Text: View/download PDF

26. unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network

Author: Saier, Tarek, Krause, Johan, and Färber, Michael
Subjects: Computer Science - Digital Libraries, Computer Science - Computation and Language
Abstract: Large-scale data sets on scholarly publications are the basis for a variety of bibliometric analyses and natural language processing (NLP) applications. Especially data sets derived from publication's full-text have recently gained attention. While several such data sets already exist, we see key shortcomings in terms of their domain and time coverage, citation network completeness, and representation of full-text content. To address these points, we propose a new version of the data set unarXive. We base our data processing pipeline and output format on two existing data sets, and improve on each of them. Our resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years. It furthermore has a more complete citation network than its predecessors and retains a richer representation of document structure as well as non-textual publication content such as mathematical notation. In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification. All data and source code is publicly available at https://github.com/IllDepence/unarXive., Comment: submitted to JCDL2023
Published: 2023
Full Text: View/download PDF

27. Denotational Semantics and a Fast Interpreter for jq

Author: Färber, Michael
Subjects: Computer Science - Logic in Computer Science, Computer Science - Programming Languages
Abstract: jq is a widely used tool that provides a programming language to manipulate JSON data. However, its semantics are currently only specified by its implementation, making it difficult to reason about its behaviour. To this end, I provide a syntax and denotational semantics for a subset of the jq language. In particular, the semantics provide a new way to interpret updates. I implement an extended version of the semantics in a novel interpreter for the jq language called jaq. Although jaq uses a significantly simpler approach to execute jq programs than jq, jaq is faster than jq on ten out of thirteen benchmarks., Comment: Submitted to OOPSLA 2023
Published: 2023

28. Biases in Scholarly Recommender Systems: Impact, Prevalence, and Mitigation

Author: Färber, Michael, Coutinho, Melissa, and Yuan, Shuzhou
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence
Abstract: With the remarkable increase in the number of scientific entities such as publications, researchers, and scientific topics, and the associated information overload in science, academic recommender systems have become increasingly important for millions of researchers and science enthusiasts. However, it is often overlooked that these systems are subject to various biases. In this article, we first break down the biases of academic recommender systems and characterize them according to their impact and prevalence. In doing so, we distinguish between biases originally caused by humans and biases induced by the recommender system. Second, we provide an overview of methods that have been used to mitigate these biases in the scholarly domain. Based on this, third, we present a framework that can be used by researchers and developers to mitigate biases in scholarly recommender systems and to evaluate recommender systems fairly. Finally, we discuss open challenges and possible research directions related to scholarly biases., Comment: 44 pages, 6 figures. To be published in Scientometrics
Published: 2023

29. Impact, Attention, Influence: Early Assessment of Autonomous Driving Datasets

Author: Bogdoll, Daniel, Hendl, Jonas, Schreyer, Felix, Gowda, Nishanth, Färber, Michael, and Zöllner, J. Marius
Subjects: Computer Science - Digital Libraries, Computer Science - Robotics
Abstract: Autonomous Driving (AD), the area of robotics with the greatest potential impact on society, has gained a lot of momentum in the last decade. As a result of this, the number of datasets in AD has increased rapidly. Creators and users of datasets can benefit from a better understanding of developments in the field. While scientometric analysis has been conducted in other fields, it rarely revolves around datasets. Thus, the impact, attention, and influence of datasets on autonomous driving remains a rarely investigated field. In this work, we provide a scientometric analysis for over 200 datasets in AD. We perform a rigorous evaluation of relations between available metadata and citation counts based on linear regression. Subsequently, we propose an Influence Score to assess a dataset already early on without the need for a track-record of citations, which is only available with a certain delay., Comment: Daniel Bogdoll and Jonas Hendl contributed equally. Accepted for publication at ICCRE 2023
Published: 2023
Full Text: View/download PDF

30. KITspotlight: A System for Spotlighting Researchers in the Media

Author: Färber, Michael, Zagoruiko, Benjamin, and Wambach, Markus
Editor: Stefanidis, Kostas, Systä, Kari, Matera, Maristella, Heil, Sebastian, Kondylakis, Haridimos, and Quintarelli, Elisa
Published: 2024
Full Text: View/download PDF

31. HyperPIE: Hyperparameter Information Extraction from Scientific Publications

Author: Saier, Tarek, Ohta, Mayumi, Asakura, Takuto, and Färber, Michael
Editor: Goharian, Nazli, Tonellotto, Nicola, He, Yulan, Lipani, Aldo, McDonald, Graham, Macdonald, Craig, and Ounis, Iadh
Published: 2024
Full Text: View/download PDF

32. Predicting Companies' ESG Ratings from News Articles Using Multivariate Timeseries Analysis

Author: Aue, Tanja, Jatowt, Adam, and Färber, Michael
Subjects: Quantitative Finance - General Finance, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Environmental, social and governance (ESG) engagement of companies moved into the focus of public attention over recent years. With the requirements of compulsory reporting being implemented and investors incorporating sustainability in their investment decisions, the demand for transparent and reliable ESG ratings is increasing. However, automatic approaches for forecasting ESG ratings have been quite scarce despite the increasing importance of the topic. In this paper, we build a model to predict ESG ratings from news articles using the combination of multivariate timeseries construction and deep learning techniques. A news dataset for about 3,000 US companies together with their ratings is also created and released for training. Through the experimental evaluation we find out that our approach provides accurate results outperforming the state-of-the-art, and can be used in practice to support a manual determination or analysis of ESG ratings.
Published: 2022

33. Benefits of international collaboration in computer science: a case study of China, the European Union, and the United States

Author: Gómez-Espés, Alberto, Färber, Michael, and Jatowt, Adam
Published: 2024
Full Text: View/download PDF

34. Analyzing the impact of companies on AI research based on publications

Author: Färber, Michael and Tampakis, Lazaros
Published: 2024
Full Text: View/download PDF

35. Few-Shot Document-Level Relation Extraction

Author: Popovic, Nicholas and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We present FREDo, a few-shot document-level relation extraction (FSDLRE) benchmark. As opposed to existing benchmarks which are built on sentence-level relation extraction corpora, we argue that document-level corpora provide more realism, particularly regarding none-of-the-above (NOTA) distributions. Therefore, we propose a set of FSDLRE tasks and construct a benchmark based on two existing supervised learning data sets, DocRED and sciERC. We adapt the state-of-the-art sentence-level method MNAV to the document-level and develop it further for improved domain adaptation. We find FSDLRE to be a challenging setting with interesting new characteristics such as the ability to sample NOTA instances from the support set. The data, code, and trained models are available online (https://github.com/nicpopovic/FREDo)., Comment: Published at NAACL 2022
Published: 2022

36. How Does Author Affiliation Affect Preprint Citation Count? Analyzing Citation Bias at the Institution and Country Level

Author: Nishioka, Chifumi, Färber, Michael, and Saier, Tarek
Subjects: Computer Science - Digital Libraries
Abstract: Citing is an important aspect of scientific discourse and important for quantifying the scientific impact of researchers. Previous works observed that citations are made not only based on the pure scholarly contributions but also based on non-scholarly attributes, such as the affiliation or gender of authors. In this way, citation bias is produced. Existing works, however, have not analyzed preprints with respect to citation bias, although they play an increasingly important role in modern scholarly communication. In this paper, we investigate whether preprints are affected by citation bias with respect to the author affiliation. We measure citation bias for bioRxiv preprints and their publisher versions at the institution level and country level, using the Lorenz curve and Gini coefficient. This allows us to mitigate the effects of confounding factors and see whether or not citation biases related to author affiliation have an increased effect on preprint citations. We observe consistently higher Gini coefficients for preprints than for publisher versions. Thus, we can confirm that citation bias exists and that it is more severe in the case of preprints. As preprints are on the rise, affiliation-based citation bias is, thus, an important topic not only for authors (e.g., when deciding what to cite), but also for people and institutions that use citations for scientific impact quantification (e.g., funding agencies deciding about funding based on citation counts)., Comment: Accepted at the ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2022
Published: 2022
Full Text: View/download PDF

37. AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First -- Using Relation Extraction to Identify Entities

Author: Popovic, Nicholas, Laurito, Walter, and Färber, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In this paper, we present an end-to-end joint entity and relation extraction approach based on transformer-based language models. We apply the model to the task of linking mathematical symbols to their descriptions in LaTeX documents. In contrast to existing approaches, which perform entity and relation extraction in sequence, our system incorporates information from relation extraction into entity extraction. This means that the system can be trained even on data sets where only a subset of all valid entity spans is annotated. We provide an extensive evaluation of the proposed system and its strengths and weaknesses. Our approach, which can be scaled dynamically in computational complexity at inference time, produces predictions with high precision and reaches 3rd place in the leaderboard of SemEval-2022 Task 12. For inputs in the domain of physics and math, it achieves high relation extraction macro F1 scores of 95.43% and 79.17%, respectively. The code used for training and evaluating our models is available at: https://github.com/nicpopovic/RE1st, Comment: Camera ready version
Published: 2022

38. Are Investors Biased Against Women? Analyzing How Gender Affects Startup Funding in Europe

Author: Färber, Michael and Klein, Alexander
Subjects: Computer Science - Social and Information Networks, Computer Science - Information Retrieval
Abstract: One of the main challenges of startups is to raise capital from investors. For startup founders, it is therefore crucial to know whether investors have a bias against women as startup founders and in which way startups face disadvantages due to gender bias. Existing works on gender studies have mainly analyzed the US market. In this paper, we aim to give a more comprehensive picture of gender bias in early-stage startup funding. We examine European startups listed on Crunchbase using Semantic Web technologies and analyze how the share of female founders in a founding team affects the funding amount. We find that the relative number of female founders has a negative impact on the funding raised. Furthermore, we observe that founder characteristics have an effect on the funding raised based on the founders' gender. Moreover, we find that gender bias in early-stage funding is less prevalent for serial founders with entrepreneurial experience, as female founders benefit three times more than male founders from already having founded a startup. Overall, our study suggests that gender bias exists and is worth considering in the context of startup funding., Comment: 35 pages
Published: 2021

39. Towards Full-Fledged Argument Search: A Framework for Extracting and Clustering Arguments from Unstructured Text

Author: Färber, Michael and Steyer, Anna
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Argument search aims at identifying arguments in natural language texts. In the past, this task has been addressed by a combination of keyword search and argument identification on the sentence- or document-level. However, existing frameworks often address only specific components of argument search and do not address the following aspects: (1) argument-query matching: identifying arguments that frame the topic slightly differently than the actual search query; (2) argument identification: identifying arguments that consist of multiple sentences; (3) argument clustering: selecting retrieved arguments by topical aspects. In this paper, we propose a framework for addressing these shortcomings. We suggest (1) to combine the keyword search with precomputed topic clusters for argument-query matching, (2) to apply a novel approach based on sentence-level sequence-labeling for argument identification, and (3) to present aggregated arguments to users based on topic-aware argument clustering. Our experiments on several real-world debate data sets demonstrate that density-based clustering algorithms, such as HDBSCAN, are particularly suitable for argument-query matching. With our sentence-level, BiLSTM-based sequence-labeling approach we achieve a macro F1 score of 0.71. Finally, evaluating our argument clustering method indicates that a fine-grained clustering of arguments by subtopics remains challenging but is worth exploring.
Published: 2021

40. Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Usage, and Impact

Author: Saier, Tarek, Färber, Michael, and Tsereteli, Tornike
Subjects: Computer Science - Digital Libraries, Computer Science - Information Retrieval, Computer Science - Machine Learning, H.3.3, H.3.7, I.2.7
Abstract: Citation information in scholarly data is an important source of insight into the reception of publications and the scholarly discourse. Outcomes of citation analyses and the applicability of citation based machine learning approaches heavily depend on the completeness of such data. One particular shortcoming of scholarly data nowadays is that non-English publications are often not included in data sets, or that language metadata is not available. Because of this, citations between publications of differing languages (cross-lingual citations) have only been studied to a very limited degree. In this paper, we present an analysis of cross-lingual citations based on over one million English papers, spanning three scientific disciplines and a time span of three decades. Our investigation covers differences between cited languages and disciplines, trends over time, and the usage characteristics as well as impact of cross-lingual citations. Among our findings are an increasing rate of citations to publications written in Chinese, citations being primarily to local non-English languages, and consistency in citation intent between cross- and monolingual citations. To facilitate further research, we make our collected data and source code publicly available., Comment: to be published in the International Journal on Digital Libraries
Published: 2021
Full Text: View/download PDF

41. Explaining Convolutional Neural Networks by Tagging Filters

Author: Nguyen, Anna, Hagenmayer, Daniel, Weller, Tobias, and Färber, Michael
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Convolutional neural networks (CNNs) have achieved astonishing performance on various image classification tasks, but it is difficult for humans to understand how a classification comes about. Recent literature proposes methods to explain the classification process to humans. These focus mostly on visualizing feature maps and filter weights, which are not very intuitive for non-experts in analyzing a CNN classification. In this paper, we propose FilTag, an approach to effectively explain CNNs even to non-experts. The idea is that when images of a class frequently activate a convolutional filter, then that filter is tagged with that class. These tags provide an explanation to a reference of a class-specific feature detected by the filter. Based on the tagging, individual image classifications can then be intuitively explained in terms of the tags of the filters that the input image activates. Finally, we show that the tags are helpful in analyzing classification errors caused by noisy input images and that the tags can be further processed by machines.
Published: 2021

42. A Curiously Effective Backtracking Strategy for Connection Tableaux

Author: Färber, Michael
Subjects: Computer Science - Logic in Computer Science, F.4.1
Abstract: Automated proof search with connection tableaux, such as implemented by Otten's leanCoP prover, depends on backtracking for completeness. Otten's restricted backtracking strategy loses completeness, yet for many problems, it significantly reduces the time required to find a proof. I introduce a new, less restricted backtracking strategy based on the notion of exclusive cuts. I implement the strategy in a new prover called meanCoP and show that it greatly improves upon the previous best strategy in leanCoP., Comment: Accepted at AReCCa 2023
Published: 2021

43. Safe, Fast, Concurrent Proof Checking for the lambda-Pi Calculus Modulo Rewriting

Author: Färber, Michael
Subjects: Computer Science - Logic in Computer Science
Abstract: Several proof assistants, such as Isabelle or Coq, can concurrently check multiple proofs. In contrast, the vast majority of today's small proof checkers either does not support concurrency at all or only limited forms thereof, restricting the efficiency of proof checking on multi-core processors. This work shows the design of a small, memory- and thread-safe kernel that efficiently checks proofs both concurrently and non-concurrently. This design is implemented in a new proof checker called Kontroli for the lambda-Pi calculus modulo rewriting, which is an established framework to uniformly express a multitude of logical systems. Kontroli is faster than the reference proof checker for this calculus, Dedukti, on all of five evaluated datasets obtained from proof assistants and interactive theorem provers. Furthermore, Kontroli reduces the time of the most time-consuming part of proof checking using eight threads by up to 6.6x., Comment: 11th ACM SIGPLAN International Conference on Certified Programs and Proofs (CPP '22), Jan 2022, Philadelphia, PA, United States
Published: 2021
Full Text: View/download PDF

44. Ablesbarkeitsmesser: A System for Assessing the Readability of German Text

Author: Pickelmann, Florian, Färber, Michael, and Jatowt, Adam
Editor: Kamps, Jaap, Goeuriot, Lorraine, Crestani, Fabio, Maistro, Maria, Joho, Hideo, Davis, Brian, Gurrin, Cathal, Kruschwitz, Udo, and Caputo, Annalina
Published: 2023
Full Text: View/download PDF

45. Right for the Right Reason: Making Image Classification Robust

Author: Nguyen, Anna, Oberföll, Adrian, and Färber, Michael
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The effectiveness of Convolutional Neural Networks (CNNs) in classifying image data has been thoroughly demonstrated. In order to explain the classification to humans, methods for visualizing classification evidence have been developed in recent years. These explanations reveal that sometimes images are classified correctly, but for the wrong reasons, i.e., based on incidental evidence. Of course, it is desirable that images are classified correctly for the right reasons, i.e., based on the actual evidence. To this end, we propose a new explanation quality metric to measure object-aligned explanation in image classification, which we refer to as the ObAlEx metric. Using object detection approaches, explanation approaches, and ObAlEx, we quantify the focus of CNNs on the actual evidence. Moreover, we show that additional training of the CNNs can improve the focus of CNNs without decreasing their accuracy.
Published: 2020

46. Citation Recommendation: Approaches and Datasets

Author: Färber, Michael and Jatowt, Adam
Subjects: Computer Science - Information Retrieval, Computer Science - Digital Libraries, Computer Science - Machine Learning
Abstract: Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction into automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods, and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles., Comment: to be published in the International Journal on Digital Libraries
Published: 2020
Full Text: View/download PDF

47. HybridCite: A Hybrid Model for Context-Aware Citation Recommendation

Author: Färber, Michael and Sampath, Ashwath
Subjects: Computer Science - Information Retrieval, Computer Science - Digital Libraries, Computer Science - Machine Learning
Abstract: Citation recommendation systems aim to recommend citations for either a complete paper or a small portion of text called a citation context. The process of recommending citations for citation contexts is called local citation recommendation and is the focus of this paper. Firstly, we develop citation recommendation approaches based on embeddings, topic modeling, and information retrieval techniques. We combine, for the first time to the best of our knowledge, the best-performing algorithms into a semi-genetic hybrid recommender system for citation recommendation. We evaluate the single approaches and the hybrid approach offline based on several data sets, such as the Microsoft Academic Graph (MAG) and the MAG in combination with arXiv and ACL. We further conduct a user study for evaluating our approaches online. Our evaluation results show that a hybrid model containing embedding and information retrieval-based components outperforms its individual components and further algorithms by a large margin., Comment: to be published in the Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL '20)
Published: 2020
Full Text: View/download PDF

48. Making Neural Networks FAIR

Author: Nguyen, Anna, Weller, Tobias, Färber, Michael, and Sure-Vetter, York
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Research on neural networks has gained significant momentum over the past few years. Because training is a resource-intensive process and training data cannot always be made available to everyone, there has been a trend to reuse pre-trained neural networks. As such, neural networks themselves have become research data. In this paper, we first present the neural network ontology FAIRnets Ontology, an ontology to make existing neural network models findable, accessible, interoperable, and reusable according to the FAIR principles. Our ontology allows us to model neural networks on a meta-level in a structured way, including the representation of all network layers and their characteristics. Secondly, we have modeled over 18,400 neural networks from GitHub based on this ontology, which we provide to the public as a knowledge graph called FAIRnets, ready to be used for recommending suitable neural networks to data scientists.
Published: 2019

49. Linked Crunchbase: A Linked Data API and RDF Data Set About Innovative Companies

Author: Färber, Michael
Subjects: Computer Science - Databases, Computer Science - Artificial Intelligence, Computer Science - Information Retrieval
Abstract: Crunchbase is an online platform collecting information about startups and technology companies, including attributes and relations of companies, people, and investments. Data contained in Crunchbase is, to a large extent, not available elsewhere, making Crunchbase a unique data source. In this paper, we present how to bring Crunchbase to the Web of Data so that its data can be used in the machine-readable RDF format by anyone on the Web. First, we give insights into how we developed and hosted a Linked Data API for Crunchbase and how sameAs links to other data sources are integrated. Then, we present our method for crawling RDF data based on this API to build a custom Crunchbase RDF knowledge graph. We created an RDF data set with over 347 million triples, including 781k people, 659k organizations, and 343k investments. Our Crunchbase Linked Data API is available online at http://linked-crunchbase.org.
Published: 2019

50. SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples

Author: Färber, Michael, Lamprecht, David, Krause, Johan, Aung, Linn, and Haase, Peter
Published: 2023
Full Text: View/download PDF
