112 results for "Tsujii J"
Search Results
2. Next big challenges in core AI technology
- Author
-
Dengel, A., Etzioni, O., DeCario, N., Hoos, H.H., Li, F.F., Tsujii, J., Traverso, P., Braunschweig, B., and Ghallab, M.
- Subjects
Computer science, Perception, Group coordination, Robot, Data science - Abstract
The field of AI is rich in scientific and technical challenges. Progress needs to be made in machine learning paradigms to make them more efficient and less data intensive. Bridges between data-based and model-based AI are needed in order to benefit from the best of both approaches. Many real-life situations cannot yet be addressed by current robots, demanding progress in perception, scene interpretation or group coordination. This chapter addresses some of the major scientific and technological challenges in core AI technology.
- Published
- 2021
3. GENIA corpus—a semantically annotated corpus for bio-textmining
- Author
-
Kim, J.-D., Ohta, T., Tateisi, Y., and Tsujii, J.
- Published
- 2003
4. The Importance of Being Recurrent for Modeling Hierarchical Structure
- Author
-
Tran, K., Bisazza, A., Monz, C., Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., Faculty of Science, and Information and Language Processing Syst (IVI, FNWI)
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language (cs.CL), Artificial neural network, Recurrent neural network, Machine translation, Language model, Computer science, Artificial intelligence - Abstract
Recent work has shown that recurrent neural networks (RNNs) can implicitly capture and exploit hierarchical information when trained to solve common natural language processing tasks such as language modeling (Linzen et al., 2016) and neural machine translation (Shi et al., 2016). In contrast, the ability to model structured data with non-recurrent neural networks has received little attention despite their success in many NLP tasks (Gehring et al., 2017; Vaswani et al., 2017). In this work, we compare the two architectures, recurrent versus non-recurrent, with respect to their ability to model hierarchical structure and find that recurrency is indeed important for this purpose. Comment: EMNLP 2018
- Published
- 2018
- Full Text
- View/download PDF
5. Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation
- Author
-
Fadaee, M., Monz, C., Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., and Information and Language Processing Syst (IVI, FNWI)
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language (cs.CL), Machine translation, Computer science, Sampling (statistics), Context (language use), BLEU, Artificial intelligence, Natural language processing - Abstract
Neural Machine Translation has achieved state-of-the-art performance for several language pairs using a combination of parallel and synthetic data. Synthetic data is often generated by back-translating sentences randomly sampled from monolingual data using a reverse translation model. While back-translation has been shown to be very effective in many cases, it is not entirely clear why. In this work, we explore different aspects of back-translation, and show that words with high prediction loss during training benefit most from the addition of synthetic data. We introduce several variations of sampling strategies targeting difficult-to-predict words using prediction losses and frequencies of words. In addition, we also target the contexts of difficult words and sample sentences that are similar in context. Experimental results for the WMT news translation task show that our method improves translation quality by up to 1.7 and 1.2 BLEU points over back-translation using random sampling for German-English and English-German, respectively. Comment: 11 pages, 2 figures. Accepted at EMNLP 2018
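The loss-targeted sampling described above can be illustrated with a short, hypothetical sketch (this is not the authors' code): `word_losses` stands in for the per-word training losses the paper derives from its NMT model, the top-fraction cutoff is arbitrary, and only the simplest variant, weighting monolingual sentences by how many difficult words they contain, is shown.

```python
# Minimal sketch of loss-targeted sampling for back-translation (assumptions
# noted in the lead-in); selected sentences would then be back-translated
# with a reverse (target->source) model to create synthetic parallel data.
import random

def difficult_words(word_losses, top_fraction=0.1):
    """Return the set of words with the highest mean prediction loss."""
    ranked = sorted(word_losses, key=word_losses.get, reverse=True)
    return set(ranked[: max(1, int(len(ranked) * top_fraction))])

def sample_for_backtranslation(monolingual_sentences, word_losses, k, seed=0):
    """Sample k sentences, weighting each by its count of difficult words."""
    hard = difficult_words(word_losses)
    weights = [1 + sum(w in hard for w in s.split()) for s in monolingual_sentences]
    rng = random.Random(seed)
    return rng.choices(monolingual_sentences, weights=weights, k=k)

# Toy usage with invented losses and monolingual target-side sentences
losses = {"kostenlos": 4.2, "und": 0.3, "Haus": 1.1}
mono = ["das Haus ist kostenlos", "und dann", "das Haus und der Garten"]
print(sample_for_backtranslation(mono, losses, k=2))
```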
- Published
- 2018
- Full Text
- View/download PDF
6. BanditSum: Extractive Summarization as a Contextual Bandit
- Author
-
Dong, Y., Shen, Y., Crawford, E., van Hoof, H., Cheung, J.C.K., Riloff, E., Chiang, D., Hockenmaier, J., Tsujii, J., and Amsterdam Machine Learning lab (IVI, FNWI)
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language (cs.CL), Artificial neural network, Computer science, Sequence, Context (language use), Automatic summarization, Artificial intelligence, Natural language processing - Abstract
In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. We call our approach BanditSum as it treats extractive summarization as a contextual bandit (CB) problem, where the model receives a document to summarize (the context), and chooses a sequence of sentences to include in the summary (the action). A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. We perform a series of experiments demonstrating that BanditSum is able to achieve ROUGE scores that are better than or comparable to the state-of-the-art for extractive summarization, and converges using significantly fewer update steps than competing approaches. In addition, we show empirically that BanditSum performs significantly better than competing approaches when good summary sentences appear late in the source document. Comment: 12 pages, 2 figures, EMNLP 2018
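A rough sketch of the single-step bandit update the abstract describes, under stated assumptions: `scorer` is any module that assigns one affinity score per sentence and `rouge_reward` is a placeholder for a ROUGE scorer; neither the BanditSum architecture nor its learned baseline is reproduced here.

```python
# REINFORCE-style update for summary-as-one-action selection (illustrative only)
import torch

def training_step(scorer, sent_embs, reference, rouge_reward, optimizer, m=3, baseline=0.0):
    logits = scorer(sent_embs).squeeze(-1)              # one affinity score per sentence
    log_prob, chosen = torch.tensor(0.0), []
    mask = torch.zeros_like(logits, dtype=torch.bool)
    for _ in range(min(m, logits.numel())):             # sample m sentences without replacement
        dist = torch.distributions.Categorical(logits=logits.masked_fill(mask, float("-inf")))
        idx = dist.sample()
        log_prob = log_prob + dist.log_prob(idx)
        chosen.append(int(idx))
        mask[idx] = True
    reward = rouge_reward(chosen, reference)             # scalar ROUGE of the sampled summary
    loss = -(reward - baseline) * log_prob               # policy-gradient surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```

The key design choice is that selecting the whole summary counts as one action, so each document is a single bandit episode rather than a long reinforcement-learning trajectory.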
- Published
- 2018
7. Empirical Analysis of Aggregation Methods for Collective Annotation
- Author
-
Qing, C., Endriss, U., Fernández, R., Kruger, J., Tsujii, J., Hajic, J., ILLC (FNWI), Logic and Computation (ILLC, FNWI/FGw), Brain and Cognition, and Logic and Language (ILLC, FNWI/FGw)
- Published
- 2014
8. Class-Based Language Modeling for Translating into Morphologically Rich Languages
- Author
-
Bisazza, A., Monz, C., Tsujii, J., Hajic, J., and Information and Language Processing Syst (IVI, FNWI)
- Abstract
Class-based language modeling (LM) is a long-studied and effective approach to overcome data sparsity in the context of n-gram model training. In statistical machine translation (SMT), different forms of class-based LMs have been shown to improve baseline translation quality when used in combination with standard word-level LMs, but no published work has systematically compared different kinds of classes, model forms and LM combination methods in a unified SMT setting. This paper aims to fill these gaps by focusing on the challenging problem of translating into Russian, a language with rich inflectional morphology and complex agreement phenomena. We conduct our evaluation in a large-data scenario and report statistically significant BLEU improvements of up to 0.6 points when using a refined variant of the class-based model originally proposed by Brown et al. (1992).
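For context, the refined model mentioned above builds on the classic two-sided class bigram decomposition of Brown et al. (1992); the refinement itself is not reproduced, but the base form factors the history through hard word classes c(·):

```latex
% Class-based bigram LM (Brown et al., 1992); c(w) denotes the hard class of word w.
P(w_i \mid w_{i-1}) \;\approx\; P\bigl(w_i \mid c(w_i)\bigr)\, P\bigl(c(w_i) \mid c(w_{i-1})\bigr)
```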
- Published
- 2014
9. Latent Domain Translation Models in Mix-of-Domains Haystack
- Author
-
Cuong, H., Simaan, K., Tsujii, J., Hajic, J., Faculty of Science, Brain and Cognition, Language and Computation (ILLC, FNWI/FGw), and ILLC (FNWI)
- Published
- 2014
10. Applying automatically parsed corpora to the study of language variation
- Author
-
Bloem, J., Versloot, A., Weerman, F., Tsujii, J., Hajic, J., and ACLC (FGw)
- Abstract
In this work, we discuss the benefits of using automatically parsed corpora to study language variation. The study of language variation is an area of linguistics in which quantitative methods have been particularly successful. We argue that the large datasets that can be obtained using automatic annotation can help drive further research in this direction, providing sufficient data for the increasingly complex models used to describe variation. We demonstrate this by replicating and extending a previous quantitative variation study that used manually and semi-automatically annotated data. We show that while the study cannot be replicated completely due to limitations of the existing automatic annotation, we can draw at least the same conclusions as the original study. In addition, we demonstrate the flexibility of this method by extending the findings to related linguistic constructions and to another domain of text, using additional data.
- Published
- 2014
11. Prior-informed distant supervision for temporal evidence classification
- Author
-
Reinanda, R., de Rijke, M., Tsujii, J., Hajic, J., and Information and Language Processing Syst (IVI, FNWI)
- Subjects
Computing Methodologies: Pattern Recognition - Abstract
Temporal evidence classification, i.e., finding associations between temporal expressions and relations expressed in text, is an important part of temporal relation extraction. To capture the variations found in this setting, we employ a distant supervision approach, modeling the task as multi-class text classification. There are two main challenges with distant supervision: (1) noise generated by incorrect heuristic labeling, and (2) distribution mismatch between the target and distant supervision examples. We are particularly interested in addressing the second problem and propose a sampling approach to handle the distribution mismatch. Our prior-informed distant supervision approach improves over basic distant supervision and outperforms a purely supervised approach when evaluated on TAC-KBP data, both on classification and end-to-end metrics.
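One way to read the sampling idea above is as resampling the distant-supervision pool so that its class distribution matches a prior taken from the target domain; the sketch below is a hypothetical illustration of that reading, with invented class labels and priors rather than the paper's actual procedure.

```python
# Resample distantly supervised examples toward a target-domain class prior
# (illustrative sketch; labels and prior values are invented).
import random
from collections import defaultdict

def resample_to_prior(examples, prior, n, seed=0):
    """examples: list of (features, label); prior: dict label -> probability."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    sample = []
    for label, p in prior.items():
        pool = by_label.get(label, [])
        if pool:
            sample.extend(rng.choices(pool, k=round(n * p)))  # sample with replacement
    rng.shuffle(sample)
    return sample

# Toy usage with made-up class labels
data = [("x1", "BEGIN"), ("x2", "END"), ("x3", "NONE"), ("x4", "NONE")]
print(resample_to_prior(data, {"BEGIN": 0.3, "END": 0.3, "NONE": 0.4}, n=10))
```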
- Published
- 2014
12. NLPRS2001
- Author
-
Tsujii, J and Rijksuniversiteit Groningen
- Published
- 2001
13. U-Compare bio-event meta-service: compatible BioNLP event extraction services
- Author
-
Kano, Y, Bjoerne, J, Ginter, F, Salakoski, T, Buyko, E, Hahn, U, Cohen, KB, Verspoor, K, Roeder, C, Hunter, LE, Kilicoglu, H, Bergler, S, Van Landeghem, S, Van Parys, T, Van de Peer, Y, Miwa, M, Ananiadou, S, Neves, M, Pascual-Montano, A, Ozgur, A, Radev, DR, Riedel, S, Saetre, R, Chun, H-W, Kim, J-D, Pyysalo, S, Ohta, T, and Tsujii, J
- Abstract
BACKGROUND: Bio-molecular event extraction from literature is recognized as an important task of bio text mining and, as such, many relevant systems have been developed and made available during the last decade. While such systems provide useful services individually, there is a need for a meta-service to enable comparison and ensemble of such services, offering optimal solutions for various purposes. RESULTS: We have integrated nine event extraction systems in the U-Compare framework, making them intercompatible and interoperable with other U-Compare components. The U-Compare event meta-service provides various meta-level features for comparison and ensemble of multiple event extraction systems. Experimental results show that the performance improvements achieved by the ensemble are significant. CONCLUSIONS: While individual event extraction systems themselves provide useful features for bio text mining, the U-Compare meta-service is expected to improve the accessibility to the individual systems, and to enable meta-level uses over multiple event extraction systems such as comparison and ensemble.
- Published
- 2011
14. Maximum entropy models with inequality constraints: A case study on text categorization
- Author
-
Kazama, J and Tsujii, J
- Abstract
Data sparseness or overfitting is a serious problem in natural language processing employing machine learning methods. This is still true even for the maximum entropy (ME) method, whose flexible modeling capability has alleviated data sparseness more successfully than the other probabilistic models in many NLP tasks. Although we usually estimate the model so that it completely satisfies the equality constraints on feature expectations with the ME method, complete satisfaction leads to undesirable overfitting, especially for sparse features, since the constraints derived from a limited amount of training data are always uncertain. To control overfitting in ME estimation, we propose the use of box-type inequality constraints, where equality can be violated up to certain predefined levels that reflect this uncertainty. The derived models, inequality ME models, in effect have regularized estimation with L_1 norm penalties of bounded parameters. Most importantly, this regularized estimation enables the model parameters to become sparse. This can be thought of as automatic feature selection, which is expected to improve generalization performance further. We evaluate the inequality ME models on text categorization datasets, and demonstrate their advantages over standard ME estimation, similarly motivated Gaussian MAP estimation of ME models, and support vector machines (SVMs), which are one of the state-of-the-art methods for text categorization. (Identifiers: 08856125; https://dspace.jaist.ac.jp/dspace/handle/10119/3305)
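In the notation the abstract implies (empirical expectation Ẽ[f_i], model expectation E_p[f_i], and non-negative violation widths A_i, B_i; exact symbols assumed), the relaxation can be written as:

```latex
% Equality constraints of standard ME estimation:
E_{p}[f_i] \;=\; \tilde{E}[f_i] \qquad (i = 1,\dots,k)
% Box-type inequality relaxation; the dual behaves like L1-regularized
% estimation with bounded parameters, so many parameters become exactly zero:
-B_i \;\le\; \tilde{E}[f_i] - E_{p}[f_i] \;\le\; A_i, \qquad A_i, B_i \ge 0
```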
- Published
- 2005
15. Information flow analysis with Chinese text
- Author
-
Su, K Y, Lee, J H, Kwong, O, Tsujii, J, Cheong, Paulo, Song, Dawei, Bruza, Peter, and Wong, Kam-Fai
- Abstract
This article investigates the effectiveness of an information inference mechanism on Chinese text. The information inference derives implicit associations via computation of information flow on a high dimensional conceptual space, which is approximated by a cognitively motivated lexical semantic space model, namely Hyperspace Analogue to Language (HAL). A dictionary-based Chinese word segmentation system was used to segment words. To evaluate the Chinese-based information flow model, it is applied to query expansion, in which a set of test queries are expanded automatically via information flow computations and documents are retrieved. Standard recall-precision measures are used to measure performance. Experimental results for TREC-5 Chinese queries and the People's Daily corpus suggest that the Chinese information flow model significantly increases average precision, though the increase is not as high as those achieved using an English corpus. Nevertheless, there is justification to believe that the HAL-based information flow model, and in turn our psychologistic stance on the next generation of information processing systems, have a promising degree of language independence.
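A minimal sketch of building a HAL-style space, the kind of lexical semantic space the abstract says approximates the conceptual space; the window size, the linear distance weighting, and the toy segmented tokens are illustrative, only the preceding-context half of the usual HAL matrix is built, and the information-flow computation itself is not reproduced.

```python
# Build a HAL-style distance-weighted co-occurrence space (illustrative sketch)
from collections import defaultdict

def hal_space(tokens, window_size=5):
    """Return dict: word -> {preceding context word -> distance-weighted count}."""
    space = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, window_size + 1):
            if i - d < 0:
                break
            space[word][tokens[i - d]] += window_size - d + 1  # closer words weigh more
    return space

# Toy usage on a pre-segmented (e.g., dictionary-segmented) token sequence
tokens = "信息 流 分析 中文 文本 信息 检索".split()
vectors = hal_space(tokens, window_size=3)
print(dict(vectors["信息"]))
```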
- Published
- 2004
16. U-Compare: A modular NLP workflow construction and evaluation system
- Author
-
Kano, Y., Miwa, M., Cohen, K. B., Hunter, L. E., Ananiadou, S., and Tsujii, J.
- Published
- 2011
- Full Text
- View/download PDF
17. AGRA: analysis of gene ranking algorithms
- Author
-
Kocbek, S., Saetre, R., Stiglic, G., Kim, J.-D., Pernek, I., Tsuruoka, Y., Kokol, P., Ananiadou, S., and Tsujii, J.
- Published
- 2011
- Full Text
- View/download PDF
18. MaSTerClass: a case-based reasoning system for the classification of biomedical terms
- Author
-
Spasic, I., Ananiadou, S., and Tsujii, J.
- Published
- 2005
- Full Text
- View/download PDF
19. Design and Implementation of GXP Make -- A Workflow System Based on Make.
- Author
-
Taura, K., Matsuzaki, T., Miwa, M., Kamoshida, Y., Yokoyama, D., Dun, N., Shibata, T., Choi, S.J., and Tsujii, J.
- Published
- 2010
- Full Text
- View/download PDF
20. Highly scalable Text Mining - parallel tagging application.
- Author
-
Tekiner, F., Tsuruoka, Y., Tsujii, J., and Ananiadou, S.
- Published
- 2009
- Full Text
- View/download PDF
21. Move Prediction in Go with the Maximum Entropy Method.
- Author
-
Araki, N., Yoshida, K., Tsuruoka, Y., and Tsujii, J.
- Published
- 2007
- Full Text
- View/download PDF
22. Machine translation from Japanese into English.
- Author
-
Nagao, M., Tsujii, J., and Nakamura, J.
- Published
- 1986
- Full Text
- View/download PDF
23. Automatic acquisition and classification of terminology using a tagged corpus in the molecular biology domain
- Author
-
Collier, N., Nobata, C., and Tsujii, J.
- Abstract
This article describes our work to identify and classify terms in the domain of molecular biology according to examples that have been marked up by a domain expert in a corpus of abstracts taken from a controlled search of the Medline database. Automatic acquisition of biomedical term lists has so far been slow due to high variability in both the terms and their classification scheme, which we attribute to the diversity of research disciplines involved. Nevertheless, the explosive growth in online molecular biology literature makes a persuasive case for automating many tasks. This includes acquisition of records for gene-product databases such as SwissProt which are currently updated by human experts, a task that is both time consuming and often highly idiosyncratic. In this article we report results from a tool based on a hidden-Markov model for extracting and classifying terms that can be used as a key component in an information extraction system. We discuss the results in light of lexical, syntactic and semantic properties of terms that were revealed by our study.
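Since the extractor described above is based on a hidden Markov model, a generic Viterbi decoding sketch may help indicate the kind of model meant; the states, probabilities and orthographic features of the actual tool are not reproduced, and the toy parameters below are invented.

```python
# Generic Viterbi decoding for an HMM tagger (illustrative, not the paper's model)
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for obs; unseen events get a log-prob floor."""
    def logp(p):
        return math.log(p) if p > 0 else -1e9
    V = [{s: logp(start_p.get(s, 0)) + logp(emit_p[s].get(obs[0], 1e-6)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + logp(trans_p[r].get(s, 0)))
            V[t][s] = V[t - 1][prev] + logp(trans_p[prev].get(s, 0)) + logp(emit_p[s].get(obs[t], 1e-6))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy usage: tag tokens with invented PROTEIN vs OTHER parameters
states = ["PROTEIN", "OTHER"]
start = {"PROTEIN": 0.3, "OTHER": 0.7}
trans = {"PROTEIN": {"PROTEIN": 0.6, "OTHER": 0.4}, "OTHER": {"PROTEIN": 0.2, "OTHER": 0.8}}
emit = {"PROTEIN": {"IL-2": 0.5, "receptor": 0.3}, "OTHER": {"the": 0.4, "binds": 0.3}}
print(viterbi(["the", "IL-2", "receptor", "binds"], states, start, trans, emit))
```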
- Published
- 2001
24. Interaction between charged soft microcapsules and red blood cells: effects of PEGylation of microcapsule membranes upon their surface properties
- Author
-
Makino, K., Umetsu, M., Goto, Y., Nakayama, A., Suhara, T., Tsujii, J., Kikuchi, A., Ohshima, H., Sakurai, Y., and Okano, T.
- Published
- 1999
- Full Text
- View/download PDF
25. GENIA corpus--a semantically annotated corpus for bio-textmining
- Author
-
Kim, J.-D., Ohta, T., Tateisi, Y., and Tsujii, J.
- Abstract
Motivation: Natural language processing (NLP) methods are regarded as being useful to raise the potential of text mining from biological literature. The lack of an extensively annotated corpus of this literature, however, causes a major bottleneck for applying NLP techniques. GENIA corpus is being developed to provide reference materials to let NLP techniques work for bio-textmining. Results: GENIA corpus version 3.0 consisting of 2000 MEDLINE abstracts has been released with more than 400 000 words and almost 100 000 annotations for biological terms. Availability: GENIA corpus is freely available at http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA Keywords: Text Mining, Information Extraction, Corpus, Natural Language Processing, Computational Molecular Biology
- Published
- 2003
26. An attempt to computerized dictionary data bases
- Author
-
Nagao, M., Tsujii, J., Ueda, Y., and Takiyama, M.
- Published
- 1980
- Full Text
- View/download PDF
27. Science and Technology Agency's Mu machine translation project
- Author
-
Nagao, M., Tsujii, J., and Nakamura, J.
- Published
- 1986
- Full Text
- View/download PDF
28. A machine translation system from Japanese into English
- Author
-
Nagao, M., Tsujii, J., Mitamura, K., Hirakawa, H., and Kume, M.
- Published
- 1980
- Full Text
- View/download PDF
29. Machine Translation in Natural Language Understanding
- Author
-
TSUJII, J.-I.
- Published
- 1989
- Full Text
- View/download PDF
30. Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011
- Author
-
Pyysalo Sampo, Ohta Tomoko, Rak Rafal, Sullivan Dan, Mao Chunhong, Wang Chunxia, Sobral Bruno, Tsujii Jun'ichi, and Ananiadou Sophia
- Subjects
Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5 - Abstract
We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend on previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.
- Published
- 2012
- Full Text
- View/download PDF
31. Event extraction for DNA methylation
- Author
-
Ohta Tomoko, Pyysalo Sampo, Miwa Makoto, and Tsujii Jun’ichi
- Subjects
Computer applications to medicine. Medical informatics, R858-859.7 - Abstract
Background: We consider the task of automatically extracting DNA methylation events from the biomedical domain literature. DNA methylation is a key mechanism of epigenetic control of gene expression and implicated in many cancers, but there has been little study of automatic information extraction for DNA methylation. Results: We present an annotation scheme for DNA methylation following the representation of the BioNLP shared task on event extraction, select a set of 200 abstracts including a representative sample of all PubMed citations relevant to DNA methylation, and introduce manual annotation for this corpus marking nearly 3000 gene/protein mentions and 1500 DNA methylation and demethylation events. We retrain a state-of-the-art event extraction system on the corpus and find that automatic extraction of DNA methylation events, the methylated genes, and their methylation sites can be performed at 78% precision and 76% recall. Conclusions: Our results demonstrate that reliable extraction methods for DNA methylation events can be created through corpus annotation and straightforward retraining of a general event extraction system. The introduced resources are freely available for use in research from the GENIA project homepage http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA.
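As a quick consistency check (assuming the standard balanced F-score; the paper may quote a slightly different rounded figure), the reported precision and recall imply:

```latex
F_1 \;=\; \frac{2PR}{P + R} \;=\; \frac{2 \times 0.78 \times 0.76}{0.78 + 0.76} \;\approx\; 0.77
```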
- Published
- 2011
- Full Text
- View/download PDF
32. An analysis of gene/protein associations at PubMed scale
- Author
-
Pyysalo Sampo, Ohta Tomoko, and Tsujii Jun’ichi
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Background: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. Results: In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. Conclusions: We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.
- Published
- 2011
- Full Text
- View/download PDF
33. Combining statistical models with symbolic grammar in parsing.
- Author
-
Tsujii, J.
- Published
- 2007
- Full Text
- View/download PDF
34. New challenges for text mining: mapping between text and manually curated pathways
- Author
-
Tateisi Yuka, Matsuzaki Takuya, Okanohara Daisuke, Ohta Tomoko, Kim Jin-Dong, Oda Kanae, and Tsujii Jun'ichi
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Background: Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. Results: To address these challenges, we constructed new resources to link the text with a model pathway; they are: the GENIA pathway corpus with event annotation and the NF-kB pathway. Through their detailed analysis, we address the untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representation. Here, we show the precise comparisons of their representations and the nine classes of ‘bio-inference’ schemes observed in the pathway corpus. Conclusions: We believe that the creation of such rich resources and their detailed analysis is a significant first step for accelerating the research of the automatic construction of pathways from text.
- Published
- 2008
- Full Text
- View/download PDF
35. Improving protein coreference resolution by simple semantic classification
- Author
-
Nguyen Ngan, Kim Jin-Dong, Miwa Makoto, Matsuzaki Takuya, and Tsujii Junichi
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Background: Current research has shown that major difficulties in event extraction for the biomedical domain are traceable to coreference. Therefore, coreference resolution is believed to be useful for improving event extraction. To address coreference resolution in molecular biology literature, the Protein Coreference (COREF) task was arranged in the BioNLP Shared Task (BioNLP-ST, hereafter) 2011, as a supporting task. However, the shared task results indicated that transferring coreference resolution methods developed for other domains to the biological domain was not a straightforward task, due to the domain differences in the coreference phenomena. Results: We analyzed the contribution of domain-specific information, including the information that indicates the protein type, in a rule-based protein coreference resolution system. In particular, the domain-specific information is encoded into semantic classification modules for which the output is used in different components of the coreference resolution. We compared our system with the top four systems in the BioNLP-ST 2011; surprisingly, we found that the minimal configuration had outperformed the best system in the BioNLP-ST 2011. Analysis of the experimental results revealed that semantic classification, using protein information, has contributed to an increase in performance by 2.3% on the test data, and 4.0% on the development data, in F-score. Conclusions: The use of domain-specific information in semantic classification is important for effective coreference resolution. Since it is difficult to transfer domain-specific information across different domains, we need to continue to seek methods to utilize such information in coreference resolution.
- Published
- 2012
- Full Text
- View/download PDF
36. U-Compare bio-event meta-service: compatible BioNLP event extraction services
- Author
-
Kano Yoshinobu, Björne Jari, Ginter Filip, Salakoski Tapio, Buyko Ekaterina, Hahn Udo, Cohen K Bretonnel, Verspoor Karin, Roeder Christophe, Hunter Lawrence E, Kilicoglu Halil, Bergler Sabine, Van Landeghem Sofie, Van Parys Thomas, Van de Peer Yves, Miwa Makoto, Ananiadou Sophia, Neves Mariana, Pascual-Montano Alberto, Özgür Arzucan, Radev Dragomir R, Riedel Sebastian, Sætre Rune, Chun Hong-Woo, Kim Jin-Dong, Pyysalo Sampo, Ohta Tomoko, and Tsujii Jun'ichi
- Subjects
Computer applications to medicine. Medical informatics, R858-859.7, Biology (General), QH301-705.5 - Published
- 2011
- Full Text
- View/download PDF
37. Investigating heterogeneous protein annotations toward cross-corpora utilization
- Author
-
Pyysalo Sampo, Sætre Rune, Kim Jin-Dong, Wang Yue, and Tsujii Jun'ichi
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Background: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.
- Published
- 2009
- Full Text
- View/download PDF
38. Corpus annotation for mining biomedical events from literature
- Author
-
Tsujii Jun'ichi, Ohta Tomoko, and Kim Jin-Dong
- Subjects
Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Background: Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results: We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion: The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.
- Published
- 2008
- Full Text
- View/download PDF
39. Poster: Analysis of gene ranking algorithms with extraction of relevant biomedical concepts from PubMed publications.
- Author
-
Kocbek, S., Saetre, R., Stiglic, G., Kim, J.-D., Pernek, I., Tsuruoka, Y., Kokol, P., Ananiadou, S., and Tsujii, J.
- Published
- 2011
- Full Text
- View/download PDF
40. How to link information in text with knowledge: Case study of text mining for pathway construction.
- Author
-
Tsujii, J.
- Published
- 2007
- Full Text
- View/download PDF
41. Domain ontology and top-level ontology: how can we co-ordinate the two?
- Author
-
Tsujii, J.
- Published
- 2003
- Full Text
- View/download PDF
42. CHEMICAL INDUSTRY AND ATOMIC ENERGY.
- Author
-
Tsujii, J
- Published
- 1968
43. Finding Zelig in text: A measure for normalising linguistic accommodation
- Author
-
Jones, S., Cotterill, R., Dewdney, N., Muir, K., Joinson, A., Tsujii, J., and Hajic, J.
- Subjects
BF ,P1 - Abstract
Linguistic accommodation is a recognised indicator of social power and social distance. However, different individuals will vary their language to different degrees, and only a portion of this variance will be due to accommodation. This paper presents the 'Zelig Quotient', a method of normalising linguistic variation towards a particular individual, using an author’s other communications as a baseline, thence to derive a method for identifying accommodation-induced variation with statistical significance. This work provides a platform for future efforts towards examining the importance of such phenomena in large communications datasets.
44. Periplasmic chitooligosaccharide-binding protein requires a three-domain organization for substrate translocation.
- Author
-
Ohnuma T, Tsujii J, Kataoka C, Yoshimoto T, Takeshita D, Lampela O, Juffer AH, Suginta W, and Fukamizo T
- Subjects
- Humans, Chitin metabolism, Carrier Proteins metabolism, Molecular Dynamics Simulation, Ligands, Translocation, Genetic, Crystallography, X-Ray, Periplasmic Binding Proteins metabolism, Chitosan metabolism
- Abstract
Periplasmic solute-binding proteins (SBPs) specific for chitooligosaccharides, (GlcNAc)n (n = 2, 3, 4, 5 and 6), are involved in the uptake of chitinous nutrients and the negative control of chitin signal transduction in Vibrios. Most translocation processes by SBPs across the inner membrane have been explained thus far by a two-domain open/closed mechanism. Here we propose a three-domain mechanism of (GlcNAc)n translocation based on experiments using a recombinant VcCBP, an SBP specific for (GlcNAc)n from Vibrio cholerae. X-ray crystal structures of unliganded or (GlcNAc)3-liganded VcCBP solved at 1.2-1.6 Å revealed three distinct domains, the Upper1, Upper2 and Lower domains, for this protein. Molecular dynamics simulation indicated that the motions of the three domains are independent and that in the (GlcNAc)3-liganded state the Upper2/Lower interface fluctuated more intensively, compared to the Upper1/Lower interface. The Upper1/Lower interface bound two GlcNAc residues tightly, while the Upper2/Lower interface appeared to loosen and release the bound sugar molecule. The three-domain mechanism proposed here was fully supported by binding data obtained by thermal unfolding experiments and ITC, and may be applicable to other translocation systems involving SBPs belonging to the same cluster., (© 2023. The Author(s).)
- Published
- 2023
- Full Text
- View/download PDF
45. Improving clinical named entity recognition in Chinese using the graphical and phonetic feature.
- Author
-
Wang Y, Ananiadou S, and Tsujii J
- Subjects
- Data Curation, Electronic Health Records, Humans, Semantics, Language, Machine Learning, Natural Language Processing, Phonetics
- Abstract
Background: Clinical Named Entity Recognition is the task of finding the names of diseases, body parts and other related terms in a given text. Because the Chinese language is quite different from English, a machine cannot simply obtain graphical and phonetic information from Chinese characters, so the method for Chinese should differ from that for English. Chinese characters present abundant information through their graphical features, and recent research on Chinese word embeddings tries to use such graphical information as subwords. This paper uses both graphical and phonetic features to improve Chinese Clinical Named Entity Recognition, based on the presence of phono-semantic characters., Methods: This paper proposes three different embedding models and tests them on the annotated data. The data have been divided into two sections to explore the effect of the proportion of phono-semantic characters., Results: The model using the primary radical and pinyin can improve Clinical Named Entity Recognition in Chinese, reaching an F-measure of 0.712. A higher proportion of phono-semantic characters does not give a better result., Conclusions: The paper shows that combining graphical and phonetic features can improve Clinical Named Entity Recognition in Chinese.
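A minimal sketch (not the paper's model) of the feature combination described above: each character is represented by concatenating a character embedding with embeddings of its primary radical and its pinyin before tagging. The module name, vocabulary sizes and dimensions are illustrative.

```python
# Concatenate character, radical and pinyin embeddings (illustrative sketch)
import torch
import torch.nn as nn

class CharRadicalPinyinEmbedding(nn.Module):
    def __init__(self, n_chars, n_radicals, n_pinyin, d_char=64, d_rad=16, d_pin=16):
        super().__init__()
        self.char = nn.Embedding(n_chars, d_char)
        self.rad = nn.Embedding(n_radicals, d_rad)
        self.pin = nn.Embedding(n_pinyin, d_pin)
        self.out_dim = d_char + d_rad + d_pin

    def forward(self, char_ids, radical_ids, pinyin_ids):
        # each input: LongTensor of shape (batch, seq_len)
        return torch.cat([self.char(char_ids), self.rad(radical_ids), self.pin(pinyin_ids)], dim=-1)

# Toy usage: ids would come from lookup tables mapping characters to their
# primary radical and pinyin (e.g., 病 -> radical 疒, pinyin "bing4").
emb = CharRadicalPinyinEmbedding(n_chars=100, n_radicals=50, n_pinyin=60)
x = emb(torch.tensor([[1, 2, 3]]), torch.tensor([[4, 4, 5]]), torch.tensor([[6, 7, 8]]))
print(x.shape)  # torch.Size([1, 3, 96])
```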
- Published
- 2019
- Full Text
- View/download PDF
46. Mapping anatomical related entities to human body parts based on wikipedia in discharge summaries.
- Author
-
Wang Y, Fan X, Chen L, Chang EI, Ananiadou S, Tsujii J, and Xu Y
- Subjects
- Algorithms, Humans, Anatomy, Data Mining, Human Body, Knowledge Bases, Patient Discharge
- Abstract
Background: Consisting of dictated free-text documents such as discharge summaries, medical narratives are widely used in medical natural language processing. Relationships between anatomical entities and human body parts are crucial for building medical text mining applications. To achieve this, we establish a mapping system consisting of a Wikipedia-based scoring algorithm and a named entity normalization method (NEN). The mapping system makes full use of information available on Wikipedia, which is a comprehensive Internet medical knowledge base. We also built a new ontology, Tree of Human Body Parts (THBP), from core anatomical parts by referring to anatomical experts and Unified Medical Language Systems (UMLS) to make the mapping system efficacious for clinical treatments. Results: The gold standard is derived from 50 discharge summaries from our previous work, in which 2,224 anatomical entities are included. The F1-measure of the baseline system is 70.20%, while our algorithm based on Wikipedia achieves 86.67% with the assistance of NEN. Conclusions: We construct a framework to map anatomical entities to THBP ontology using normalization and a scoring algorithm based on Wikipedia. The proposed framework is proven to be much more effective and efficient than the main baseline system.
- Published
- 2019
- Full Text
- View/download PDF
47. Involvement of SRF coactivator MKL2 in BDNF-mediated activation of the synaptic activity-responsive element in the Arc gene.
- Author
-
Kikuchi K, Ihara D, Fukuchi M, Tanabe H, Ishibashi Y, Tsujii J, Tsuda M, Kaneda M, Sakagami H, Okuno H, Bito H, Yamazaki Y, Ishikawa M, and Tabuchi A
- Subjects
- Animals, Brain-Derived Neurotrophic Factor metabolism, Brain-Derived Neurotrophic Factor pharmacology, Cytoskeletal Proteins genetics, Female, Nerve Tissue Proteins genetics, Neurons drug effects, Rats, Rats, Sprague-Dawley, Serum Response Factor genetics, Serum Response Factor metabolism, Transcriptional Activation drug effects, Cytoskeletal Proteins biosynthesis, Nerve Tissue Proteins biosynthesis, Neurons physiology, Transcription Factors metabolism, Transcriptional Activation physiology
- Abstract
The expression of immediate early genes (IEGs) is thought to be an essential molecular basis of neuronal plasticity for higher brain function. Many IEGs contain serum response element in their transcriptional regulatory regions and their expression is controlled by serum response factor (SRF). SRF is known to play a role in concert with transcriptional cofactors. However, little is known about how SRF cofactors regulate IEG expression during the process of neuronal plasticity. We hypothesized that one of the SRF-regulated neuronal IEGs, activity-regulated cytoskeleton-associated protein (Arc; also termed Arg3.1), is regulated by an SRF coactivator, megakaryoblastic leukemia (MKL). To test this hypothesis, we initially investigated which binding site of the transcription factor or SRF cofactor contributes to brain-derived neurotrophic factor (BDNF)-induced Arc gene transcription in cultured cortical neurons using transfection and reporter assays. We found that BDNF caused robust induction of Arc gene transcription through a cAMP response element, binding site of myocyte enhancer factor 2, and binding site of SRF in an Arc enhancer, the synaptic activity-responsive element (SARE). Regardless of the requirement for the SRF-binding site, the binding site of a ternary complex factor, another SRF cofactor, did not affect BDNF-mediated Arc gene transcription. In contrast, chromatin immunoprecipitation revealed occupation of MKL at the SARE. Furthermore, knockdown of MKL2, but not MKL1, significantly decreased BDNF-mediated activation of the SARE. Taken together, these findings suggest a novel mechanism by which MKL2 controls the Arc SARE in response to BDNF stimulation., (© 2018 International Society for Neurochemistry.)
- Published
- 2019
- Full Text
- View/download PDF
48. Annotation and detection of drug effects in text for pharmacovigilance.
- Author
-
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, and Ananiadou S
- Abstract
Pharmacovigilance (PV) databases record the benefits and risks of different drugs, as a means to ensure their safe and effective use. Creating and maintaining such resources can be complex, since a particular medication may have divergent effects in different individuals, due to specific patient characteristics and/or interactions with other drugs being administered. Textual information from various sources can provide important evidence to curators of PV databases about the usage and effects of drug targets in different medical subjects. However, the efficient identification of relevant evidence can be challenging, due to the increasing volume of textual data. Text mining (TM) techniques can support curators by automatically detecting complex information, such as interactions between drugs, diseases and adverse effects. This semantic information supports the quick identification of documents containing information of interest (e.g., the different types of patients in which a given adverse drug reaction has been observed to occur). TM tools are typically adapted to different domains by applying machine learning methods to corpora that are manually labelled by domain experts using annotation guidelines to ensure consistency. We present a semantically annotated corpus of 597 MEDLINE abstracts, PHAEDRA, encoding rich information on drug effects and their interactions, whose quality is assured through the use of detailed annotation guidelines and the demonstration of high levels of inter-annotator agreement (e.g., 92.6% F-Score for identifying named entities and 78.4% F-Score for identifying complex events, when relaxed matching criteria are applied). To our knowledge, the corpus is unique in the domain of PV, according to the level of detail of its annotations. To illustrate the utility of the corpus, we have trained TM tools based on its rich labels to recognise drug effects in text automatically. The corpus and annotation guidelines are available at: http://www.nactem.ac.uk/PHAEDRA/ .
- Published
- 2018
- Full Text
- View/download PDF
49. Bilingual term alignment from comparable corpora in English discharge summary and Chinese discharge summary.
- Author
-
Xu Y, Chen L, Wei J, Ananiadou S, Fan Y, Qian Y, Chang EI, and Tsujii J
- Subjects
- Asian People, England, Humans, Information Storage and Retrieval, Medical Informatics, Multilingualism, Natural Language Processing, Patient Discharge standards, Software, Translating
- Abstract
Background: Electronic medical record (EMR) systems have become widely used throughout the world to improve the quality of healthcare and the efficiency of hospital services. A bilingual medical lexicon of Chinese and English is needed to meet the demand for multi-lingual and multi-national treatment. We make efforts to extract a bilingual lexicon from English and Chinese discharge summaries with a small seed lexicon. The lexical terms can be classified into two categories: single-word terms (SWTs) and multi-word terms (MWTs). For SWTs, we use a label propagation (LP; context-based) method to extract candidates of translation pairs. For MWTs, which are pervasive in the medical domain, we propose a term alignment method, which firstly obtains translation candidates for each component word of a Chinese MWT, and then generates their combinations, from which the system selects a set of plausible translation candidates., Results: We compare our LP method with a baseline method based on simple context-similarity. The LP-based method outperforms the baseline with accuracies of 4.44% Acc1, 24.44% Acc10, and 62.22% Acc100, where AccN means the top-N accuracy. The accuracy of the LP method drops to 5.41% Acc10 and 8.11% Acc20 for MWTs. Our experiments show that the method based on term alignment improves the performance for MWTs to 16.22% Acc10 and 27.03% Acc20., Conclusions: We constructed a framework for building an English-Chinese term dictionary from discharge summaries in the two languages. Our experiments have shown that the LP-based method, augmented with the term alignment method, will contribute to reducing the manual work required to compile a bilingual dictionary of clinical terms.
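The MWT step described above can be sketched as follows; the combination-by-Cartesian-product idea is from the abstract, while the frequency-based plausibility score and the toy lexicon are illustrative stand-ins for the paper's actual selection criterion.

```python
# Combine per-component translation candidates for a Chinese multi-word term
# and rank the combinations (illustrative sketch, not the paper's scorer).
from itertools import product

def mwt_candidates(components, seed_lexicon, english_term_counts, top_k=5):
    """components: list of Chinese words; seed_lexicon: zh word -> list of en words."""
    options = [seed_lexicon.get(w, []) for w in components]
    if not all(options):
        return []  # at least one component word has no known translation
    combos = (" ".join(c) for c in product(*options))
    scored = [(t, english_term_counts.get(t, 0)) for t in combos]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy usage with an invented seed lexicon and corpus counts
lexicon = {"出院": ["discharge"], "小结": ["summary", "note"]}
counts = {"discharge summary": 120, "discharge note": 3}
print(mwt_candidates(["出院", "小结"], lexicon, counts))
```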
- Published
- 2015
- Full Text
- View/download PDF
50. Overview of the Cancer Genetics and Pathway Curation tasks of BioNLP Shared Task 2013.
- Author
-
Pyysalo S, Ohta T, Rak R, Rowley A, Chun HW, Jung SJ, Choi SP, Tsujii J, and Ananiadou S
- Subjects
- Humans, Natural Language Processing, Gene Regulatory Networks, Genes, Information Storage and Retrieval, Knowledge Bases, Models, Theoretical, Neoplasms genetics, Neoplasms pathology
- Abstract
Background: Since their introduction in 2009, the BioNLP Shared Task events have been instrumental in advancing the development of methods and resources for the automatic extraction of information from the biomedical literature. In this paper, we present the Cancer Genetics (CG) and Pathway Curation (PC) tasks, two event extraction tasks introduced in the BioNLP Shared Task 2013. The CG task focuses on cancer, emphasizing the extraction of physiological and pathological processes at various levels of biological organization, and the PC task targets reactions relevant to the development of biomolecular pathway models, defining its extraction targets on the basis of established pathway representations and ontologies., Results: Six groups participated in the CG task and two groups in the PC task, together applying a wide range of extraction approaches including both established state-of-the-art systems and newly introduced extraction methods. The best-performing systems achieved F-scores of 55% on the CG task and 53% on the PC task, demonstrating a level of performance comparable to the best results achieved in similar previously proposed tasks., Conclusions: The results indicate that existing event extraction technology can generalize to meet the novel challenges represented by the CG and PC task settings, suggesting that extraction methods are capable of supporting the construction of knowledge bases on the molecular mechanisms of cancer and the curation of biomolecular pathway models. The CG and PC tasks continue as open challenges for all interested parties, with data, tools and resources available from the shared task homepage.
- Published
- 2015
- Full Text
- View/download PDF