Author: "Leser, Ulf" / Topic: data mining - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Leser, Ulf"' showing total 22 results

1. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.

Author: Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, and Leser U
Subjects: Natural Language Processing, Software, Computational Biology methods, Humans, Data Mining methods
Abstract: Motivation: With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that are moderately to extremely different from those used for training, varying, e.g. in focus, genre or text type. This raises the question whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications. Results: Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central.
Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools in "the wild" and show that further research is necessary for more robust BTM tools. Availability and Implementation: All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments. (© The Author(s) 2024. Published by Oxford University Press.)
Published: 2024
Full Text: View/download PDF

2. BELB: a biomedical entity linking benchmark.

Author: Garda S, Weber-Genzel L, Martin R, and Leser U
Subjects: Software, Language, Natural Language Processing, Benchmarking, Data Mining methods
Abstract: Motivation: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance on more specialized ones, e.g. genes or variants, understudied. Results: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need for further studies towards entity-agnostic models. Availability and Implementation: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp. (© The Author(s) 2023. Published by Oxford University Press.)
Published: 2023
Full Text: View/download PDF

3. RegEl corpus: identifying DNA regulatory elements in the scientific literature.

Author: Garda S, Lenihan-Geels F, Proft S, Hochmuth S, Schülke M, Seelow D, and Leser U
Subjects: DNA genetics, Databases, Factual, Humans, PubMed, Algorithms, Data Mining methods
Abstract: High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48-0.91 for entity detection and 0.71-0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg., (© The Author(s) 2022. Published by Oxford University Press.)
Published: 2022
Full Text: View/download PDF
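
Note on the agreement figure in record 3: the abstract reports inter-annotator agreement as an F1 score but does not say how spans were matched. The following is a minimal sketch of one common way to compute such a score, assuming exact matching on (start, end, type) triples; the function name `pairwise_f1` and the toy annotations are hypothetical and are not taken from the RegEl paper.

```python
from typing import Set, Tuple

# A span is (start offset, end offset, entity type); this layout is an assumption.
Span = Tuple[int, int, str]


def pairwise_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """F1 of annotator B scored against annotator A under exact span matching."""
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing, so they agree fully
    matches = len(annotator_a & annotator_b)
    if matches == 0:
        return 0.0
    precision = matches / len(annotator_b)
    recall = matches / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Toy data: the annotators disagree on the end offset of the promoter mention.
a = {(0, 8, "enhancer"), (15, 23, "promoter"), (40, 44, "gene")}
b = {(0, 8, "enhancer"), (15, 22, "promoter"), (40, 44, "gene")}
agreement = pairwise_f1(a, b)
```

Under exact matching the score is symmetric in the two annotators, so it does not matter which one is treated as the reference. A relaxed (overlap-based) matching rule would generally give higher values.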

4. Deep learning with word embeddings improves biomedical named entity recognition.

Author: Habibi M, Weber L, Neves M, Wiegandt DL, and Leser U
Subjects: Animals, Humans, Mice, Software, Data Mining methods, Machine Learning
Abstract: Motivation: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. Availability and Implementation: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/. Contact: habibima@informatik.hu-berlin.de. (© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com)
Published: 2017
Full Text: View/download PDF

5. SETH detects and normalizes genetic variants in text.

Author: Thomas P, Rocktäschel T, Hakenberg J, Lichtblau Y, and Leser U
Subjects: Computational Biology methods, Genes, Humans, Information Storage and Retrieval methods, Natural Language Processing, PubMed, Publications, Terminology as Topic, Data Curation, Data Mining, Genetic Variation
Abstract: Descriptions of genetic variations and their effect are widely spread across the biomedical literature. However, finding all mentions of a specific variation, or all mentions of variations in a specific gene, is difficult to achieve due to the many ways such variations are described. Here, we describe SETH, a tool for the recognition of variations from text and their subsequent normalization to dbSNP or UniProt. SETH achieves high precision and recall on several evaluation corpora of PubMed abstracts. It is freely available and encompasses stand-alone scripts for isolated application and evaluation as well as a thorough documentation for integration into other applications. Availability and Implementation: SETH is released under the Apache 2.0 license and can be downloaded from http://rockt.github.io/SETH/. Contact: thomas@informatik.hu-berlin.de or leser@informatik.hu-berlin.de. (© The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.)
Published: 2016
Full Text: View/download PDF

6. A survey on annotation tools for the biomedical literature.

Author: Neves M and Leser U
Subjects: Algorithms, Artificial Intelligence, Computational Biology methods, Data Mining standards, Humans, Natural Language Processing, Data Mining methods, Publications, Software
Abstract: New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks for developing novel text mining methods. This situation led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools which implement some of these functionalities are available. In this article, we present a comprehensive survey of tools for supporting annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered a truly comprehensive solution.
Published: 2014
Full Text: View/download PDF

7. Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts.

Author: Neves M, Damaschun A, Mah N, Lekschas F, Seltmann S, Stachelscheid H, Fontaine JF, Kurtz A, and Leser U
Subjects: Databases as Topic, Humans, Kidney anatomy & histology, Reproducibility of Results, Statistics as Topic, Computational Biology methods, Data Mining, Gene Expression Regulation, Kidney cytology, Kidney metabolism, Publications, Software
Abstract: Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ~50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction. Database URL: http://www.cellfinder.org/
Published: 2013
Full Text: View/download PDF

8. BioCreative III interactive task: an overview.

Author: Arighi CN, Roberts PM, Agarwal S, Bhattacharya S, Cesareni G, Chatr-Aryamontri A, Clematide S, Gaudet P, Giglio MG, Harrow I, Huala E, Krallinger M, Leser U, Li D, Liu F, Lu Z, Maltais LJ, Okazaki N, Perfetto L, Rinaldi F, Sætre R, Salgado D, Srinivasan P, Thomas PE, Toldo L, Hirschman L, and Wu CH
Subjects: Animals, Computational Biology methods, Periodicals as Topic, Plants genetics, Plants metabolism, Data Mining methods, Genes
Abstract: Background: The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested. Results: A User Advisory Group (UAG) actively participated in the IAT design and assessment. The task focused on gene normalization (identifying gene mentions in the article and linking these genes to standard database identifiers), gene ranking based on the overall importance of each gene mentioned in the article, and gene-oriented document retrieval (identifying full text papers relevant to a selected gene). Six systems participated and all processed and displayed the same set of articles. The articles were selected based on content known to be problematic for curation, such as ambiguity of gene names, coverage of multiple genes and species, or introduction of a new gene name. Members of the UAG curated three articles for training and assessment purposes, and each member was assigned a system to review. A questionnaire related to the interface usability and task performance (as measured by precision and recall) was answered after systems were used to curate articles.
Although the limited number of articles analyzed and users involved in the IAT experiment precluded rigorous quantitative analysis of the results, a qualitative analysis provided valuable insight into some of the problems encountered by users when using the systems. The overall assessment indicates that the system usability features appealed to most users, but the system performance was suboptimal (mainly due to low accuracy in gene normalization). Some of the issues included failure of species identification and gene name ambiguity in the gene normalization task leading to an extensive list of gene identifiers to review, which, in some cases, did not contain the relevant genes. The document retrieval suffered from the same shortfalls. The UAG favored achieving high performance (measured by precision and recall), but strongly recommended the addition of features that facilitate the identification of the correct gene and its identifier, such as contextual information to assist in disambiguation. Discussion: The IAT was an informative exercise that advanced the dialog between curators and developers and increased the appreciation of challenges faced by each group. A major conclusion was that the intended users should be actively involved in every phase of software development, and this will be strongly encouraged in future tasks. The IAT Task provides the first steps toward the definition of metrics and functional requirements that are necessary for designing a formal evaluation of interactive curation systems in the BioCreative IV challenge.
Published: 2011
Full Text: View/download PDF

9. Phenotype mining for functional genomics and gene discovery.

Author: Groth P, Leser U, and Weiss B
Subjects: Animals, Databases, Factual, Humans, Software, Data Mining, Genetic Association Studies, Genomics methods, Phenotype
Abstract: In gene prediction, studying phenotypes is highly valuable for reducing the number of locus candidates in association studies and to aid disease gene candidate prioritization. This is due to the intrinsic nature of phenotypes to visibly reflect genetic activity, making them potentially one of the most useful data types for functional studies. However, systematic use of these data has begun only recently. 'Comparative phenomics' is the analysis of genotype-phenotype associations across species and experimental methods. This is an emerging research field of utmost importance for gene discovery and gene function annotation. In this chapter, we review the use of phenotype data in the biomedical field. We will give an overview of phenotype resources, focusing on PhenomicDB--a cross-species genotype-phenotype database--which is the largest available collection of phenotype descriptions across species and experimental methods. We report on its latest extension by which genotype-phenotype relationships can be viewed as graphical representations of similar phenotypes clustered together ('phenoclusters'), supplemented with information from protein-protein interactions and Gene Ontology terms. We show that such 'phenoclusters' represent a novel approach to group genes functionally and to predict novel gene functions with high precision. We explain how these data and methods can be used to supplement the results of gene discovery approaches. The aim of this chapter is to assist researchers interested in understanding how phenotype data can be used effectively in the gene discovery field.
Published: 2011
Full Text: View/download PDF

10. Phenoclustering: online mining of cross-species phenotypes.

Author: Groth P, Kalev I, Kirov I, Traikov B, Leser U, and Weiss B
Subjects: Cluster Analysis, Data Mining methods, Internet, Phenotype
Abstract: Summary: Recently, several methods for analyzing phenotype data have been published, but only a few are able to cope with data sets generated in different studies, with different methods, or for different species. We developed an online system in which more than 300 000 phenotypes from a wide variety of sources and screening methods can be analyzed together. Clusters of similar phenotypes are visualized as networks of highly similar phenotypes, inducing gene groups useful for functional analysis. This system is part of PhenomicDB, providing the world's largest cross-species phenotype data collection with a tool to mine its wealth of information. Availability: Freely available at http://www.phenomicdb.de
Published: 2010
Full Text: View/download PDF

11. A comprehensive benchmark of kernel methods to extract protein-protein interactions from literature.

Author: Tikk D, Thomas P, Palaga P, Hakenberg J, and Leser U
Subjects: Algorithms, Area Under Curve, Decision Trees, Models, Molecular, Reproducibility of Results, Data Mining methods, Databases, Protein, Natural Language Processing, Protein Interaction Mapping methods, Proteins classification
Abstract: The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein-protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed - convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data. 
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
Published: 2010
Full Text: View/download PDF

12. Integrating protein-protein interactions and text mining for protein function prediction.

Author: Jaeger S, Gaudan S, Leser U, and Rebholz-Schuhmann D
Subjects: Algorithms, Reproducibility of Results, Terminology as Topic, Computational Biology methods, Data Mining methods, Databases, Protein, Proteins chemistry, Proteins physiology
Abstract: Background: Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source for yet uncurated functional annotations, but curation work is slow and expensive. Automatic techniques that support this work are still lacking reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature. Results: Using this procedure, more than 80% of the GO annotations for proteins with highly conserved orthologs that are available in UniProtKb/Swiss-Prot could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKb/Swiss-Prot. All predictions were correct (100% precision) according to the verifications from a trained curator. Conclusion: Our method of integrating CCSs and literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.
Published: 2008
Full Text: View/download PDF

13. Human Activity Segmentation Challenge @ ECML/PKDD’23

Author: Ermshaus, Arik, Schäfer, Patrick, Bagnall, Anthony, Guyet, Thomas, Ifrim, Georgiana, Lemaire, Vincent, Leser, Ulf, Leverger, Colin, Malinowski, Simon, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Ifrim, Georgiana, editor, Tavenard, Romain, editor, Bagnall, Anthony, editor, Schaefer, Patrick, editor, Malinowski, Simon, editor, Guyet, Thomas, editor, and Lemaire, Vincent, editor
Published: 2023
Full Text: View/download PDF

14. Window Size Selection in Unsupervised Time Series Analytics: A Review and Benchmark

Author: Ermshaus, Arik, Schäfer, Patrick, Leser, Ulf, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Guyet, Thomas, editor, Ifrim, Georgiana, editor, Malinowski, Simon, editor, Bagnall, Anthony, editor, Schaefer, Patrick, editor, and Lemaire, Vincent, editor
Published: 2023
Full Text: View/download PDF

15. RegEl corpus: identifying DNA regulatory elements in the scientific literature.

Author: Garda, Samuele, Lenihan-Geels, Freyda, Proft, Sebastian, Hochmuth, Stefanie, Schülke, Markus, Seelow, Dominik, and Leser, Ulf
Subjects: SCIENTIFIC literature, DNA, HUMAN DNA, HUMAN genome, DATA mining, GENE enhancers
Abstract: High-throughput technologies led to the generation of a wealth of data on regulatory DNA elements in the human genome. However, results from disease-driven studies are primarily shared in textual form as scientific articles. Information extraction (IE) algorithms allow this information to be (semi-)automatically accessed. Their development, however, is dependent on the availability of annotated corpora. Therefore, we introduce RegEl (Regulatory Elements), the first freely available corpus annotated with regulatory DNA elements comprising 305 PubMed abstracts for a total of 2690 sentences. We focus on enhancers, promoters and transcription factor binding sites. Three annotators worked in two stages, achieving an overall 0.73 F1 inter-annotator agreement and 0.46 for regulatory elements. Depending on the entity type, IE baselines reach F1-scores of 0.48–0.91 for entity detection and 0.71–0.88 for entity normalization. Next, we apply our entity detection models to the entire PubMed collection and extract co-occurrences of genes or diseases with regulatory elements. This generates large collections of regulatory elements associated with 137 870 unique genes and 7420 diseases, which we make openly available. Database URL: https://zenodo.org/record/6418451#.YqcLHvexVqg [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
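
Note on the co-occurrence step in record 15: the abstract mentions extracting co-occurrences of genes or diseases with regulatory elements but does not state the unit of co-occurrence. The sketch below assumes sentence-level co-occurrence over already normalized mentions; the function `count_cooccurrences`, the mention layout and all identifiers are hypothetical placeholders, not the authors' pipeline.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple

# A mention is (entity type, normalized identifier); this layout is an assumption.
Mention = Tuple[str, str]


def count_cooccurrences(
    sentences: List[List[Mention]], left_type: str, right_type: str
) -> Counter:
    """Count, per sentence, each distinct (left id, right id) pair found together."""
    counts: Counter = Counter()
    for mentions in sentences:
        left = {ident for etype, ident in mentions if etype == left_type}
        right = {ident for etype, ident in mentions if etype == right_type}
        # Sets ensure a pair is counted at most once per sentence.
        counts.update(product(left, right))
    return counts


# Toy input: three sentences with made-up identifiers.
sentences = [
    [("gene", "GENE:1"), ("enhancer", "ENH:7")],
    [("gene", "GENE:1"), ("enhancer", "ENH:7"), ("gene", "GENE:2")],
    [("disease", "DIS:3"), ("promoter", "PROM:4")],
]
pairs = count_cooccurrences(sentences, "gene", "enhancer")
```

Counting each pair once per sentence keeps repeated mentions within a sentence from inflating the totals; a document-level variant would only change what is passed in as a "sentence".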

16. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition.

Author: Weber, Leon, Sänger, Mario, Münchmeyer, Jannes, Habibi, Maryam, Leser, Ulf, and Akbik, Alan
Subjects: DATA mining, NAMED-entity recognition, TEXT files, CORPORA
Abstract: Summary: Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, reaches or overcomes state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. Availability and implementation: HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. Supplementary information: Supplementary data are available at Bioinformatics online. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

17. Mining Phenotypes for Protein Function Prediction

Author: Leser, Ulf, Groth, Philip, Weiss, Bertram, and Pohlenz, Hans-Dieter
Subjects: phenotypes, bioinformatics, text mining, Data mining, function prediction
Abstract: Until very recently, phenotypes were only very rarely studied in a systematic manner. While ontologies for describing gene functions now have a 10-year-long tradition, similar vocabularies for describing the phenotype of genes are only emerging now; similarly, the techniques for determining phenotypes on a large scale (especially RNAi) have been available for only a few years, while genomic sequencing or gene expression studies have been established for much longer. In this talk, we describe results from a study for exploiting phenotype descriptions for protein function prediction. We used the data from PhenomicDB, a phenotype database integrated from several publicly available data sources. Due to the lack of standardization, phenotypes in PhenomicDB can only be viewed as text (short statements, abstracts, singular terms, ...). We clustered these texts and analyzed the corresponding gene clusters in terms of their coherence in functional annotation and their interconnectedness by protein-protein interactions. We also devised a method for using the close similarity in their phenotype descriptions to predict the function of proteins. We show that this method yields a very good precision at acceptable coverage.
Published: 2008
Full Text: View/download PDF

18. Recognizing chemicals in patents: a comparative analysis.

Author: Habibi, Maryam, Wiegandt, David, Schmedding, Florian, and Leser, Ulf
Subjects: CHEMICALS, PATENTS, PHARMACEUTICAL industry, SCIENTIFIC community, TEXT mining, DATA mining, COMPARATIVE studies
Abstract: Recently, methods for Chemical Named Entity Recognition (NER) have gained substantial interest, driven by the need for automatically analyzing today's ever-growing collections of biomedical text. Chemical NER for patents is particularly essential due to the high economic importance of pharmaceutical findings. However, NER on patents has long been essentially neglected by the research community, mostly because of the lack of sufficient annotated corpora. A recent international competition specifically targeted this task, but evaluated tools only on gold standard patent abstracts instead of full patents; furthermore, results from such competitions are often difficult to extrapolate to real-life settings due to the relatively high homogeneity of training and test data. Here, we evaluate the two state-of-the-art chemical NER tools, tmChem and ChemSpot, on four different annotated patent corpora, two of which consist of full texts. We study the overall performance of the tools, compare their results at the instance level, report on high-recall and high-precision ensembles, and perform cross-corpus and intra-corpus evaluations. Our findings indicate that full patents are considerably harder to analyze than patent abstracts and clearly confirm the common wisdom that using the same text genre (patent vs. scientific) and text type (abstract vs. full text) for training and testing is a pre-requisite for achieving high quality text mining results. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF
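
Note on record 18: the abstract mentions high-recall and high-precision ensembles of two NER tools but does not say how they were built. One common construction, shown here purely as an illustrative assumption, takes the union of both tools' predicted mentions for recall and their intersection for precision, with exact span matching. The function names, the span representation, and the toy data below are all hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

# A predicted mention, as (start, end) character offsets (hypothetical format).
Span = Tuple[int, int]


def high_recall_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep every mention found by at least one tool (set union)."""
    return a | b


def high_precision_ensemble(a: Set[Span], b: Set[Span]) -> Set[Span]:
    """Keep only mentions found by both tools (set intersection)."""
    return a & b


def precision_recall(pred: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans."""
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for illustration only.
gold = {(0, 7), (12, 20), (30, 38)}
tool_a = {(0, 7), (12, 20), (40, 45)}
tool_b = {(0, 7), (30, 38), (50, 55)}

union_scores = precision_recall(high_recall_ensemble(tool_a, tool_b), gold)
intersection_scores = precision_recall(high_precision_ensemble(tool_a, tool_b), gold)
print("union (precision, recall):", union_scores)
print("intersection (precision, recall):", intersection_scores)
```

On this toy data the union finds all three gold mentions at precision 0.6, while the intersection keeps a single mention that is correct, giving precision 1.0 at recall 1/3. Real evaluations may also use relaxed (overlap-based) matching, which this sketch does not cover.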

19. Computer-assisted curation of a human regulatory core network from the biological literature.

Author: Thomas, Philippe, Durek, Pawel, Solt, Illés, Klinger, Bertram, Witzel, Franziska, Schulthess, Pascal, Mayer, Yvonne, Tikk, Domonkos, Blüthgen, Nils, and Leser, Ulf
Subjects: GENETIC databases, DATA mining, TEXT mining, TRANSCRIPTION factors, GENE expression profiling, GENE regulatory networks, GENE targeting, FULL-text databases
Abstract: Motivation: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome-scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract and regulatory data currently contain together only 503 regulatory relations between human TFs. Results: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top-2500 sentences contained ~900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not present in a regulatory database before. Full-text curation allowed us to obtain detailed information on the strength of experimental evidences supporting a relationship. Conclusions: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

20. Simple tricks for improving pattern-based information extraction from the biomedical literature.

Author: Quang Long Nguyen, Tikk, Domonkos, and Leser, Ulf
Subjects: MEDICAL literature, LIFE sciences literature, TEXT mining, DATA mining, INFORMATION retrieval
Abstract: Background: Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns. Results: We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8% to 51.9%. Conclusions: Even very simple filtering techniques may significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, the application of such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques should also be applicable to other methods using linguistic patterns for information extraction. [ABSTRACT FROM AUTHOR]
Published: 2010
Full Text: View/download PDF

21. A fast and effective dependency graph kernel for PPI relation extraction.

Author: Tikk, Domonkos, Palaga, Peter, and Leser, Ulf
Subjects: DATA mining, KERNEL functions
Abstract: An abstract of the article related to the use of kernel functions for information extraction of protein-protein interactions (PPIs) by Domonkos Tikk, Ulf Leser and colleagues is presented.
Published: 2010

22. Scalable time series similarity search for data analytics

Author: Schäfer, Patrick, Reinefeld, Alexander, Leser, Ulf, and Andrzejak, Artur
Subjects: Similarity Search, Scalable, SK 845, ST 265, 28 Computer Science, Data Processing, Data Analytics, Data Mining, Time Series, ddc:004, 004 Computer Science
Abstract: A time series is a collection of values sequentially recorded from sensors or live observations over time. Sensors for recording time series have become cheap and omnipresent. While data volumes explode, research in the field of time series data analytics has focused on the availability of (a) pre-processed and (b) moderately sized time series datasets in the last decades. The analysis of real-world datasets raises two major problems: Firstly, state-of-the-art similarity models require the time series to be pre-processed. Pre-processing aims at extracting approximately aligned characteristic subsequences and reducing noise. It is typically performed by a domain expert, may be more time consuming than the data mining part itself, and simply does not scale to large data volumes. Secondly, time series research has been driven by accuracy metrics and not by reasonable execution times for large data volumes. This results in quadratic to biquadratic computational complexities of state-of-the-art similarity models. This dissertation addresses both issues by introducing a symbolic time series representation and three different similarity models. These contribute to the state of the art by being pre-processing-free, noise-robust, and scalable. Our experimental evaluation on 91 real-world and benchmark datasets shows that our methods provide higher accuracy for most datasets when compared to 15 state-of-the-art similarity models. Meanwhile, they are up to three orders of magnitude faster, require less pre-processing for noise or alignment, or scale to large data volumes.
Published: 2015
