Author: "Robert Gaizauskas" - Searchworks@Jio Institute Digital Library Search Results

114 results on '"Robert Gaizauskas"'

1. Mapping and Aligning Units from Comparable Corpora

Author: Dan Ștefănescu, Sabine Hunsicker, Yang Feng, Alexandru Ceaușu, Dan Tufiș, Elena Irimia, Robert Gaizauskas, Radu Ion, and Ahmet Aker
Subjects: Machine translation, Computer science, business.industry, Computation, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Comparability, Translation (geometry), computer.software_genre, Extractor, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, business, computer, Natural language processing
Abstract: Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel sentences extracted from comparable corpora are able to improve statistical machine translation (SMT). Yet, the existing body of research on the subject does not take into account the degree of comparability of the corpus being processed nor the computation time that it takes to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor crucially depends on the degree of comparability, such that it is more difficult to mine for parallel data in a weakly comparable corpus than in a strongly comparable corpus.
Published: 2019

2. Introduction

Author: Inguna Skadiņa, Robert Gaizauskas, Andrejs Vasiļjevs, and Monica Lestari Paramita
Published: 2019

3. Collecting Comparable Corpora

Author: Radu Ion, Ahmet Aker, Nikos Mastropavlos, Dan Tufiș, Robert Gaizauskas, Judita Preiss, Paul Clough, Alexandru Ceaușu, Dan Ștefănescu, Olga Yannoutsou, Nikos Glaros, and Monica Lestari Paramita
Subjects: Information retrieval, Computer science, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Parallel corpora
Abstract: The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identify these documents from the Web are needed for the purpose of automatically building comparable corpora for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used in collecting these comparable documents from different Web sources? In this chapter, we firstly present a review of previous techniques that have been developed for collecting comparable documents from the Web. Then we describe in detail three new techniques to gather comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.
Published: 2019

4. Cross-Language Comparability and Its Applications for MT

Author: Paul Clough, Fangzhong Su, Robert Gaizauskas, Bogdan Babych, Monica Lestari Paramita, Anthony Hartley, and Ahmet Aker
Subjects: Machine translation, business.industry, Computer science, Closeness, Comparability, computer.software_genre, Field (computer science), Parallel corpora, Task (project management), Artificial intelligence, Computational linguistics, business, computer, Natural language processing
Abstract: The concept of comparability, or linguistic relatedness, or closeness between textual units or corpora has many possible applications in computational linguistics. Consequently, the task of measuring comparability has increasingly become a core technological challenge in the field, and needs to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. From this perspective, it is important to understand the linguistic and technological mechanisms and implications of comparability and develop a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles and (3) applying the developed metrics for the tasks of MT for under-resourced languages and measuring their effectiveness for corpora with unknown degrees of comparability. This has led to redefining the vague linguistic concept of comparability in terms of task-specific performance of the tools, which extract phrase-level translation equivalents from comparable texts.
Published: 2019

5. Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Author: Inguna Skadiņa, Robert Gaizauskas, Bogdan Babych, Nikola Ljubešić, Dan Tufiş, and Andrejs Vasiļjevs
Subjects: Natural language processing (Computer science), Computational linguistics, Data mining
Abstract: This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that can be used for the machine translation task. It is divided into several sections, each covering a specific task such as building, processing, and using comparable corpora, focusing particularly on under-resourced language pairs and domains. The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.
Published: 2019

6. Contradictions and incompleteness: a reexamination of Goedel's result

Author: Robert Gaizauskas
Published: 2018

7. Visual and Semantic Knowledge Transfer for Large Scale Semi-supervised Object Detection

Author: Robert Gaizauskas, Liming Chen, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Yuxing Tang, Xiaofang Wang, National Institutes of Health [Bethesda] (NIH), University of Sheffield [Sheffield], Extraction de Caractéristiques et Identification (imagine), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), and Université de Lyon-Université Lumière - Lyon 2 (UL2)
Subjects: FOS: Computer and information sciences, semantic similarity, semi-supervised learning, weakly supervised object detection, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Semi-supervised learning, 010501 environmental sciences, transfer learning, Machine learning, computer.software_genre, 01 natural sciences, Convolutional neural network, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Object-class detection, Semantic similarity, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Artificial Intelligence, Minimum bounding box, convolutional neural networks, 0202 electrical engineering, electronic engineering, information engineering, 0105 earth and related environmental sciences, business.industry, Applied Mathematics, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Pattern recognition, object detection, Object detection, Computational Theory and Mathematics, [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], 020201 artificial intelligence & image processing, Viola–Jones object detection framework, Computer Vision and Pattern Recognition, Artificial intelligence, visual similarity, business, Classifier (UML), computer, Software
Abstract: Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories, e.g. a better detector would result by transforming the differences between a dog classifier and a dog detector onto the cat class, than would by transforming from the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting. Comments: TPAMI.
Published: 2018

8. Using Section Headings to Compute Cross-Lingual Similarity of Wikipedia Articles

Author: Monica Lestari Paramita, Robert Gaizauskas, and Paul Clough
Subjects: 060201 languages & linguistics, Cross lingual, Measure (data warehouse), Information retrieval, Similarity (network science), Computer science, 0602 languages and literature, Section (typography), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 06 humanities and the arts, 02 engineering and technology
Abstract: Measuring the similarity of interlanguage-linked Wikipedia articles often requires the use of suitable language resources (e.g., dictionaries and MT systems) which can be problematic for languages with limited or poor translation resources. The size of Wikipedia can also present computational demands when computing similarity. This paper presents a ‘lightweight’ approach to measure cross-lingual similarity in Wikipedia using section headings rather than the entire Wikipedia article, and language resources derived from Wikipedia and Wiktionary to perform translation. Using an existing dataset we evaluate the approach for 7 language pairs. Results show that the performance using section headings is comparable to using all article content, dictionaries derived from Wikipedia and Wiktionary are sufficient to compute cross-lingual similarity and combinations of features can further improve results.
Published: 2017

9. The SENSEI Overview of Newspaper Readers’ Comments

Author: Emma Barker, Ahmet Aker, Mark Hepple, Robert Gaizauskas, Adam Funk, and Monica Lestari Paramita
Subjects: 060201 languages & linguistics, Information retrieval, Computer science, 06 humanities and the arts, 02 engineering and technology, Viewpoints, Automatic summarization, Newspaper, Task (project management), World Wide Web, 0602 languages and literature, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Social media, User interface
Abstract: Automatic summarization of reader comments in on-line news is a challenging but clearly useful task. Work to date has produced extractive summaries using well-known techniques from other areas of NLP. But do users really want these, and do they support users in realistic tasks? We specify an alternative summary type for reader comments, based on the notions of issues and viewpoints, and demonstrate our user interface to present it. An evaluation to assess how well summarization systems support users in time-limited tasks (identifying issues and characterizing opinions) gives good results for this prototype.
Published: 2017

10. Generating descriptive multi-document summaries of geo-located entities using entity type models

Author: Ahmet Aker and Robert Gaizauskas
Subjects: Information Systems and Management, Information retrieval, Dependency (UML), Computer Networks and Communications, business.industry, Computer science, Representation (systemics), Library and Information Sciences, computer.software_genre, Automatic summarization, Signature (logic), Artificial intelligence, Language model, business, Composition (language), computer, Sentence, Natural language, Natural language processing, Information Systems
Abstract: In this article, we investigate the application of entity type models in extractive multi-document summarization using automatic caption generation for images of geo-located entities (e.g., Westminster Abbey) as an application scenario. Entity type models contain sets of patterns aiming to capture the ways geo-located entities are described in natural language. They are automatically derived from texts about geo-located entities of the same type (e.g., churches, lakes). We integrate entity type models into a multi-document summarizer and use them to address the 2 major tasks in extractive multi-document summarization: sentence scoring and summary composition. We experiment with 3 different representation methods for entity type models: signature words, n-gram language models, and dependency patterns. We evaluate the summarizer with integrated entity type models relative to (a) a summarizer using standard text-related features commonly used in text summarization and (b) the Wikipedia location descriptions. Our results show that entity type models significantly improve the quality of output summaries over that of summaries generated using standard summarization features and Wikipedia summaries. The representation of entity type models using dependency patterns is superior to the representations using signature words and n-gram language models.
Published: 2014

11. Summarizing Online Reviews Using Aspect Rating Distributions and Language Modeling

Author: Robert Gaizauskas, Ahmet Aker, and G. Di Fabbrizio
Subjects: Service (systems architecture), Information retrieval, Computer Networks and Communications, Computer science, business.industry, Feature extraction, computer.software_genre, Artificial Intelligence, The Internet, Product (category theory), Language model, Artificial intelligence, Computational linguistics, business, Relevant information, computer, Natural language processing
Abstract: Product and service reviews are abundantly available online, but selecting relevant information from them involves a significant amount of time. The authors address this problem with Starlet, a novel approach for extracting multidocument summarizations that considers aspect rating distributions and language modeling. These features encourage the inclusion of sentences in the summary that preserve the overall opinion distribution and reflect the reviews' original language.
Published: 2013

12. Do humans have conceptual models about geographic objects? A user study

Author: Ahmet Aker, Elena Lloret, Laura Plaza, and Robert Gaizauskas
Subjects: Human-Computer Interaction, World Wide Web, Knowledge modeling, Artificial Intelligence, Computer Networks and Communications, Computer science, Search engine indexing, Object type, Software, Information Systems
Abstract: In this article, we investigate what sorts of information humans request about geographical objects of the same type. For example, Edinburgh Castle and Bodiam Castle are two objects of the same type: “castle.” The question is whether specific information is requested for the object type “castle” and how this information differs for objects of other types (e.g., church, museum, or lake). We aim to answer this question using an online survey. In the survey, we showed 184 participants 200 images pertaining to urban and rural objects and asked them to write questions for which they would like to know the answers when seeing those objects. Our analysis of the 6,169 questions collected in the survey shows that humans have shared ideas of what to ask about geographical objects. When the object types resemble each other (e.g., church and temple), the requested information is similar for the objects of these types. Otherwise, the information is specific to an object type. Our results may be very useful in guiding Natural Language Processing tasks involving automatic generation of templates for image descriptions and their assessment, as well as image indexing and organization.
Published: 2013

13. General Overview of ImageCLEF at the CLEF 2016 Labs

Author: Mauricio Villegas, Henning Müller, Alba García Seco de Herrera, Roger Schaer, Stefano Bromuri, Andrew Gilbert, Luca Piras, Josiah Wang, Fei Yan, Arnau Ramisa, Emmanuel Dellandrea, Robert Gaizauskas, Krystian Mikolajczyk, Joan Puigcerver, Alejandro H. Toselli, Joan-Andreu Sánchez, and Enrique Vidal
Subjects: Image annotation, Evaluation campaign, 02 engineering and technology, Compound figures from biomedical literature, 021001 nanoscience & nanotechnology, 01 natural sciences, 3. Good health, Handwritten retrieval, 0103 physical sciences, ImageCLEF, 010306 general physics, 0210 nano-technology, LENGUAJES Y SISTEMAS INFORMATICOS, CLEF labs
Abstract: This paper presents an overview of the ImageCLEF 2016 evaluation campaign, an event that was organized as part of the CLEF (Conference and Labs of the Evaluation Forum) labs 2016. ImageCLEF is an ongoing initiative that promotes the evaluation of technologies for annotation, indexing and retrieval for providing information access to collections of images in various usage scenarios and domains. In 2016, the 14th edition of ImageCLEF, three main tasks were proposed: 1) identification, multi-label classification and separation of compound figures from biomedical literature; 2) automatic annotation of general web images; and 3) retrieval from collections of scanned handwritten documents. The handwritten retrieval task was the only completely novel task this year, although the other two tasks introduced several modifications to keep the proposed tasks challenging. The general coordination and the handwritten retrieval task have been supported by the European Union (EU) Horizon 2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943), EU project HIMANIS (JPICH programme, Spanish grant Ref: PCIN-2015-068) and MINECO/FEDER, UE under project TIN2015-70924-C2-1-R. The image annotation task is co-organized by the VisualSense (ViSen) consortium under the ERA-NET CHIST-ERA D2K 2011 Programme, jointly supported by UK EPSRC Grants EP/K01904X/1 and EP/K019082/1, French ANR Grant ANR-12-CHRI-0002-04 and Spanish MINECO Grant PCIN-2013-047. This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM), and Lister Hill National Center for Biomedical Communications (LHNCBC).
Published: 2016

14. Large Scale Semi-Supervised Object Detection Using Visual and Semantic Knowledge Transfer

Author: Robert Gaizauskas, Liming Chen, Josiah Wang, Boyang Gao, Emmanuel Dellandréa, Yuxing Tang, Extraction de Caractéristiques et Identification (imagine), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Université Lumière - Lyon 2 (UL2), École Centrale de Lyon (ECL), Université de Lyon, University of Sheffield [Sheffield], and Istituto Italiano di Tecnologia (IIT)
Subjects: Computer science, business.industry, Detector, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], 02 engineering and technology, 010501 environmental sciences, Machine learning, computer.software_genre, 01 natural sciences, Object detection, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Semantic similarity, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Minimum bounding box, [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], 0202 electrical engineering, electronic engineering, information engineering, Semantic memory, 020201 artificial intelligence & image processing, Artificial intelligence, business, Classifier (UML), computer, 0105 earth and related environmental sciences
Abstract: Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors. This is done by modeling the differences between the two on categories with both image-level and bounding box annotations, and transferring this information to convert classifiers to detectors for categories without bounding box annotations. We improve this previous work by incorporating knowledge about object similarities from visual and semantic domains during the transfer process. The intuition behind our proposed method is that visually and semantically similar categories should exhibit more common transferable properties than dissimilar categories, e.g. a better detector would result by transforming the differences between a dog classifier and a dog detector onto the cat class, than would by transforming from the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object similarity based knowledge transfer methods outperforms the baseline methods. We found strong evidence that visual similarity and semantic relatedness are complementary for the task, and when combined notably improve detection, achieving state-of-the-art detection performance in a semi-supervised setting.
Published: 2016

15. Automatic label generation for news comment clusters

Author: Emma Barker, Emina Kurtic, Mark Hepple, Monica Lestari Paramita, Ahmet Aker, Robert Gaizauskas, and Adam Funk
Subjects: Information retrieval, Computer science, Pie chart, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, law.invention, Task (project management), ComputingMethodologies_PATTERNRECOGNITION, law, 0202 electrical engineering, electronic engineering, information engineering, Cluster (physics), 020201 artificial intelligence & image processing, Baseline (configuration management), Feature set, 0105 earth and related environmental sciences
Abstract: We present a supervised approach to automatically labelling topic clusters of reader comments to online news. We use a feature set that includes both features capturing properties local to the cluster and features that capture aspects from the news article and from comments outside the cluster. We evaluate the approach in an automatic and a manual, task-based setting. Both evaluations show the approach to outperform a baseline method, which uses tf*idf to select comment-internal terms for use as topic labels. We illustrate how cluster labels can be used to generate cluster summaries and present two alternative summary formats: a pie chart summary and an abstractive summary.
Published: 2016

16. Summarizing Multi-Party Argumentative Conversations in Reader Comment on News

Author: Robert Gaizauskas and Emma Barker
Subjects: Scheme (programming language), Argumentative, Computer science, business.industry, media_common.quotation_subject, 020206 networking & telecommunications, 02 engineering and technology, Representation (arts), computer.software_genre, Linguistics, Argument, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Conversation, Artificial intelligence, business, computer, Natural language processing, media_common, computer.programming_language
Abstract: Existing approaches to summarizing multi-party argumentative conversations in reader comment are extractive and fail to capture the argumentative nature of these conversations. Work on argument mining proposes schemes for identifying argument elements and relations in text but has not yet addressed how summaries might be generated from a global analysis of a conversation based on these schemes. In this paper we: (1) propose an issue-centred scheme for analysing and graphically representing argument in reader comment discussion in on-line news, and (2) show how summaries capturing the argumentative nature of reader comment can be generated from our graphical representation.
Published: 2016

17. Don't Mention the Shoe! A Learning to Rank Approach to Content Selection for Image Description Generation

Author: Robert Gaizauskas and Josiah Wang
Subjects: Information retrieval, Point (typography), Computer science, business.industry, 05 social sciences, 02 engineering and technology, Object (computer science), Machine learning, computer.software_genre, Image (mathematics), Minimum bounding box, Bounding overwatch, Content (measure theory), 0202 electrical engineering, electronic engineering, information engineering, Selection (linguistics), 020201 artificial intelligence & image processing, Learning to rank, Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, business, computer
Abstract: We tackle the sub-task of content selection as part of the broader challenge of automatically generating image descriptions. More specifically, we explore how decisions can be made to select what object instances should be mentioned in an image description, given an image and labelled bounding boxes. We propose casting the content selection problem as a learning to rank problem, where object instances that are most likely to be mentioned by humans when describing an image are ranked higher than those that are less likely to be mentioned. Several features are explored: those derived from bounding box localisations, from concept labels, and from image regions. Object instances are then selected based on the ranked list, where we investigate several methods for choosing a stopping criterion as the ‘cut-off’ point for objects in the ranked list. Our best-performing method achieves state-of-the-art performance on the ImageCLEF2015 sentence generation challenge.
Published: 2016

18. Experimental IR Meets Multilinguality, Multimodality, and Interaction

Author: Robert Gaizauskas and Paulo Quaresma
Published: 2016

19. Combining geometric, textual and visual features for predicting prepositions in image descriptions

Author: Arnau Ramisa, Josiah Wang, Ying Lu, Emmanuel Dellandrea, Francesc Moreno-Noguer, Robert Gaizauskas, Institut de Robòtica i Informàtica Industrial, Universitat Politècnica de Catalunya. ROBiri - Grup de Robòtica de l'IRI, Extraction de Caractéristiques et Identification (imagine), Laboratoire d'InfoRmatique en Image et Systèmes d'information (LIRIS), Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Institut National des Sciences Appliquées (INSA)-Université de Lyon-Institut National des Sciences Appliquées (INSA)-Centre National de la Recherche Scientifique (CNRS)-Université Claude Bernard Lyon 1 (UCBL), Université de Lyon-École Centrale de Lyon (ECL), Université de Lyon-Université Lumière - Lyon 2 (UL2)-Institut National des Sciences Appliquées de Lyon (INSA Lyon), Université de Lyon-Université Lumière - Lyon 2 (UL2), Institut de Robòtica i Informàtica Industrial (IRI), Consejo Superior de Investigaciones Científicas [Madrid] (CSIC)-Universitat Politècnica de Catalunya [Barcelona] (UPC), Agence Nationale de la Recherche (France), China Scholarship Council, Ministerio de Economía y Competitividad (España), and Dellandrea, Emmanuel
Subjects: [INFO.INFO-MM] Computer Science [cs]/Multimedia [cs.MM], [INFO.INFO-TS] Computer Science [cs]/Signal and Image Processing, [INFO.INFO-NE] Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], 0211 other engineering and technologies, [INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM], [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], 02 engineering and technology, [INFO.INFO-NE]Computer Science [cs]/Neural and Evolutionary Computing [cs.NE], computer vision, [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], [INFO.INFO-TI] Computer Science [cs]/Image Processing [eess.IV], [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, 020204 information systems, [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], 0202 electrical engineering, electronic engineering, information engineering, natural language processing, Informàtica::Robòtica [Àrees temàtiques de la UPC], Pattern recognition::Computer vision [Classificació INSPEC], 021101 geological & geomatics engineering
Abstract: Paper presented at the Conference on Empirical Methods in Natural Language Processing, held in Lisbon (Portugal), 17 to 21 September 2015. We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the labels of such entities are known and unknown. In all situations we found clear evidence that all three features contribute to the prediction task. This work was funded by the ERA-Net CHISTERA D2K VisualSense project (Spanish MINECO PCIN-2013-047, UK EPSRC EP/K019082/1 and French ANR Grant ANR-12-CHRI-0002-04) and the Spanish MINECO RobInstruct project TIN2014-58178-R. Ying Lu was also supported by the China Scholarship Council.
Published: 2016

20. A Graph-Based Approach to Topic Clustering for Online Comments to News

Author: Emina Kurtic, A. R. Balamurali, Mark Hepple, Ahmet Aker, Robert Gaizauskas, Monica Lestari Paramita, and Emma Barker
Subjects: Fuzzy clustering, Training set, Information retrieval, Computer science, Graph based, Correlation clustering, 02 engineering and technology, Latent Dirichlet allocation, symbols.namesake, ComputingMethodologies_PATTERNRECOGNITION, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, symbols, Cluster (physics), Graph (abstract data type), 020201 artificial intelligence & image processing, Cluster analysis
Abstract: This paper investigates graph-based approaches to labeled topic clustering of reader comments in online news. For graph-based clustering we propose a linear regression model of similarity between the graph nodes (comments) based on similarity features and weights trained using automatically derived training data. To label the clusters our graph-based approach makes use of DBPedia to abstract topics extracted from the clusters. We evaluate the clustering approach against gold standard data created by human annotators and compare its results against LDA – currently reported as the best method for the news comment clustering task. Evaluation of cluster labelling is set up as a retrieval task, where human annotators are asked to identify the best cluster given a cluster label. Our clustering approach significantly outperforms the LDA baseline and our evaluation of abstract cluster labels shows that graph-based approaches are a promising method of creating labeled clusters of news comments, although we still find cases where the automatically generated abstractive labels are insufficient to allow humans to correctly associate a label with its cluster.
Published: 2016

21. The SENSEI Project: Making Sense of Human Conversations

Author: Udo Kruschwitz, Frédéric Béchet, Massimo Poesio, Morena Danieli, Robert Gaizauskas, Benoit Favre, and Giuseppe Riccardi
Subjects: Multimedia, business.industry, Computer science, Context (language use), 02 engineering and technology, computer.file_format, computer.software_genre, Automatic summarization, Style (sociolinguistics), Metadata, World Wide Web, Analytics, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Social media, Executable, business, computer, Sentence
Abstract: Conversational interaction is the most natural and persistent paradigm for personal and business relations. In contact centres customer spoken conversations are handled daily. On social media platforms conversations are delivered in different forms, lengths and for different purposes. In both cases, conversations have little impact on the intended target listeners, due to the volume, velocity and diversity (media, style, social context) of the document streams (spoken conversations and blog posts). Most language analytics technology is limited in that it performs keyword search, which does not provide automatic descriptions of what happened, who said what, which opinions are held on what subject, in a coherent, readable and executable form. In the SENSEI project we plan to go beyond keyword search and sentence-based analysis of conversations. We adapt lightweight and large coverage linguistic models of semantic and discourse resources to learn a layered model of conversations. SENSEI addresses the issue of multidimensional (textual, spoken and metadata) descriptors in terms of semantic, para-semantic and discourse structures. Automated generation of readable analytics documents (summaries) will support end-users in the context of large data analysis tasks. Summarization technology developed in SENSEI has been evaluated with respect to users' task requirements and performances in the context of contact centre and social media conversations.
Published: 2016

22. The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News

Author: Emina Kurtic, Ahmet Aker, Monica Lestari Paramita, Mark Hepple, Robert Gaizauskas, and Emma Barker
Subjects: 060201 languages & linguistics, Argumentative, Computer science, business.industry, media_common.quotation_subject, 06 humanities and the arts, 02 engineering and technology, computer.software_genre, Linguistics, 0602 languages and literature, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Social media, Conversation, Artificial intelligence, Line (text file), business, computer, Natural language processing, media_common
Abstract: Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in on-line news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of this kind of interchange is so difficult. In this paper we propose one type of reader comment summary – the conversation overview summary – that aims to capture the key argumentative content of a reader comment conversation. We describe a method we have developed to support humans in authoring conversation overview summaries and present a publicly available corpus – the first of its kind – of news articles plus comment sets, each multiply annotated, according to our method, with conversation overview summaries.
Published: 2016

23. The TempEval challenge: identifying temporal relations in text

Author: Mark Hepple, Robert Gaizauskas, Jessica Moszkowicz, James Pustejovsky, Frank Schilder, and Marc Verhagen
Subjects: Linguistics and Language, Information retrieval, business.industry, Event (computing), Computer science, Context (language use), Library and Information Sciences, Temporal annotation, computer.software_genre, Language and Linguistics, SemEval, Education, Task (project management), Annotation, TimeML, Information extraction, Artificial intelligence, Computational linguistics, business, computer, Natural language processing, Sentence
Abstract: TempEval is a framework for evaluating systems that automatically annotate texts with temporal relations. It was created in the context of the SemEval 2007 workshop and uses the TimeML annotation language. The evaluation consists of three subtasks of temporal annotation: anchoring an event to a time expression in the same sentence, anchoring an event to the document creation time, and ordering main events in consecutive sentences. In this paper we describe the TempEval task and the systems that participated in the evaluation. In addition, we describe how further task decomposition can bring even more structure to the evaluation of temporal relations.
Published: 2009

24. Exploring relation types for literature-based discovery

Author: Mark Stevenson, Judita Preiss, and Robert Gaizauskas
Subjects: Relation (database), Computer science, literature based discovery, knowledge discovery, Information Storage and Retrieval, Health Informatics, 02 engineering and technology, text mining, USable, Machine learning, computer.software_genre, Literature-based discovery, 03 medical and health sciences, Knowledge extraction, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, Focus on Natural Language Processing, natural language processing, 030304 developmental biology, 0303 health sciences, Shallow parsing, business.industry, Linguistics, Replication (computing), Range (mathematics), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing
Abstract: Objective: Literature-based discovery (LBD) aims to identify “hidden knowledge” in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LBD approaches use simple techniques to identify semantically weak relations between concepts, for example, document co-occurrence. These generate huge numbers of hypotheses, difficult for humans to assess. More complex techniques rely on linguistic analysis, for example, shallow parsing, to identify semantically stronger relations. Such approaches generate fewer hypotheses, but may miss hidden knowledge. The authors investigate this trade-off in detail, comparing techniques for identifying related concepts to discover which are most suitable for LBD. Materials and methods: A generic LBD system that can utilize a range of relation types was developed. Experiments were carried out comparing a number of techniques for identifying relations. Two approaches were used for evaluation: replication of existing discoveries and the “time slicing” approach. Results: Previous LBD discoveries could be replicated using relations based either on document co-occurrence or linguistic analysis. Using relations based on linguistic analysis generated many fewer hypotheses, but a significantly greater proportion of them were candidates for hidden knowledge. Discussion and Conclusion: The use of linguistic analysis-based relations improves accuracy of LBD without overly damaging coverage. LBD systems often generate huge numbers of hypotheses, which are infeasible to manually review. Improving their accuracy has the potential to make these systems significantly more usable.
Published: 2015

25. Extracting bilingual terms from the Web

Author: Emma Barker, Marcis Pinnis, Marta Pahisa Solé, Monica Lestari Paramita, Ahmet Aker, and Robert Gaizauskas
Subjects: Data collection, Machine translation, Terminology extraction, business.industry, Computer science, Communication, Library and Information Sciences, computer.software_genre, Language and Linguistics, Term (time), Terminology, Domain (software engineering), World Wide Web, Constructed language, Set (abstract data type), Artificial intelligence, business, computer, Natural language processing
Abstract: In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System) designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights about the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
Published: 2015

26. Defining Visually Descriptive Language

Author: Robert Gaizauskas, Josiah Wang, and Arnau Ramisa
Subjects: Interpretation (logic), Computer science, business.industry, media_common.quotation_subject, Contrast (statistics), Descriptive language, computer.software_genre, Agreement, Annotation, Range (mathematics), Artificial intelligence, business, computer, Natural language processing, media_common
Abstract: In this paper, we introduce the notion of visually descriptive language (VDL) – intuitively a text segment whose truth can be confirmed by visual sense alone. VDL can be exploited in many vision-based tasks, e.g. image interpretation and story illustration. In contrast to previous work requiring pre-aligned texts and images, we propose a broader definition of VDL that extends to a much larger range of texts without associated images. We also discuss possible VDL annotation tasks and make recommendations for difficult cases. Lastly, we demonstrate the viability of our definition via an annotation exercise across several text genres and analyse inter-annotator agreement. Results show reasonably high levels of agreement between annotators can be reached.
Published: 2015

27. Generating Image Descriptions with Gold Standard Visual Inputs: Motivation, Evaluation and Baselines

Author: Josiah Wang and Robert Gaizauskas
Subjects: Computer science, Minimum bounding box, Metric (mathematics), Process (computing), Selection (linguistics), Natural language generation, Noise (video), Data mining, Focus (optics), computer.software_genre, computer, Task (project management)
Abstract: In this paper, we present the task of generating image descriptions with gold standard visual detections as input, rather than directly from an image. This allows the Natural Language Generation community to focus on the text generation process, rather than dealing with the noise and complications arising from the visual detection process. We propose a fine-grained evaluation metric specifically for evaluating the content selection capabilities of image description generation systems. To demonstrate the evaluation metric on the task, several baselines are presented using bounding box information and textual information as priors for content selection. The baselines are evaluated using the proposed metric, showing that the fine-grained metric is useful for evaluating the content selection phase of an image description generation system.
Published: 2015

28. Web Service Architectures for Text Mining

Author: George Demetriou, Robert Gaizauskas, Ian Roberts, Yikun Guo, and Neil Davis
Subjects: medicine.medical_specialty, Computer Networks and Communications, Computer science, Unstructured data, Scientific literature, computer.software_genre, Data science, World Wide Web, Text processing, Web mining, medicine, Web service, WS-Policy, Web intelligence, Web modeling, computer, Software, Information Systems
Abstract: Text mining technology can be used to assist in finding relevant or novel information in large volumes of unstructured data, such as that which is increasingly available in the electronic scientific literature. However, publishers are not text mining specialists, nor typically are the end user scientists who consume their products. This situation suggests a web services based solution, where text mining specialists process the literature obtained from publishers and make their results available to remote consumers (research scientists). In this paper we discuss the integration of web services and text mining within the domain of scientific publishing and explore the strengths and weaknesses of three generic architectural designs for delivering text mining web services. We argue for the superiority of one of these and demonstrate its viability by reference to an application designed to provide access to the results of text mining over the PubMed database of scientific abstracts.
Published: 2006

29. The Role of Inference in the Temporal Annotation and Analysis of Text

Author: Robert Gaizauskas, Andrea Setzer, and Mark Hepple
Subjects: Linguistics and Language, Information retrieval, Relation (database), business.industry, Process (engineering), Computer science, Closure (topology), General Social Sciences, Inference, Library and Information Sciences, Temporal annotation, computer.software_genre, Language and Linguistics, Education, Annotation, TimeML, Artificial intelligence, Computational linguistics, business, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries), computer, Natural language processing
Abstract: In this paper we argue for the importance of doing inference over the information expressed by the annotations of temporally annotated corpora. We describe the process of inferential closure which can be applied to determine the full temporal content that follows from an annotation. We illustrate the importance of temporal inference and temporal closure in relation to three tasks, which are: (a) the comparison of different temporal annotations, (b) facilitating the manual annotation process needed to create temporally annotated corpora and (c) empirical investigations done over temporally annotated data.
Published: 2005

30. Information retrieval for question answering: a SIGIR 2004 workshop

Author: Mark A. Greenwood, Mark Hepple, and Robert Gaizauskas
Subjects: World Wide Web, Information retrieval, Hardware and Architecture, Computer science, Question answering, Natural language, Management Information Systems
Abstract: Open domain question answering has become a very active research area over the past few years, due in large measure to the stimulus of the TREC Question Answering track. This track addresses the task of finding answers to natural language (NL) questions (e.g. How tall is the Eiffel Tower? Who is Aaron Copland?) from large text collections. This task stands in contrast to the more conventional IR task of retrieving documents relevant to a query, where the query may be simply a collection of keywords (e.g. Eiffel Tower, American composer, born Brooklyn NY 1900, ...).
Published: 2004

31. Corpus Linguistics and South Asian Languages: Corpus Creation and Tool Development

Author: Andrew Hardie, Diana Maynard, Paul Baker, B. D. Jayaram, Valentin Tablan, Cristian Ursu, Oana Hamza, Kalina Bontcheva, Tony McEnery, Mark Leisher, Richard Xiao, Robert Gaizauskas, and Hamish Cunningham
Subjects: Text corpus, Demonstrative, Hindi, Linguistics and Language, Computer science, business.industry, computer.software_genre, Unicode, Language and Linguistics, language.human_language, Linguistics, Resource (project management), Corpus linguistics, language, Urdu, Artificial intelligence, business, Minority language, computer, Natural language processing, Information Systems
Abstract: This paper describes the work carried out on the EMILLE Project (Enabling Minority Language Engineering), which was undertaken by the Universities of Lancaster and Sheffield. The primary resource developed by the project is the EMILLE Corpus, which consists of a series of monolingual corpora for fourteen South Asian languages, totalling more than 96 million words, and a parallel corpus of English and five of these languages. The EMILLE Corpus also includes an annotated component, namely, part-of-speech tagged Urdu data, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use in Hindi. In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. The development of tools for EMILLE has contributed to the ongoing development of the LE architecture GATE, which has been extended to make use of Unicode. GATE thus plugs some of the gaps for language processing R&D necessary for the exploitation of the EMILLE corpora.
Published: 2004

32. Book Review

Author: Robert Gaizauskas
Subjects: Cognitive science, Linguistics and Language, Artificial Intelligence, Philosophy, Language and Linguistics, Classics, Computer Science Applications, Terminology
Published: 2003

33. [Untitled]

Author: Robert Gaizauskas and H. M. Harmain
Subjects: Computer science, business.industry, Software development, computer.software_genre, Object-oriented analysis and design, Software framework, Software development process, Unified Modeling Language, Software sizing, Problem domain, Goal-Driven Software Development Process, Software construction, Software design, Package development process, Software verification and validation, Software requirements, business, Computer-aided software engineering, Software engineering, computer, Software, computer.programming_language
Abstract: Graphical CASE (Computer Aided Software Engineering) tools provide considerable help in documenting the output of the Analysis and Design stages of software development and can assist in detecting incompleteness and inconsistency in an analysis. However, these tools do not contribute to the initial, difficult stage of the analysis process, that of identifying the object classes, attributes and relationships used to model the problem domain. This paper describes an NL-Based CASE tool called Class Model Builder (CM-Builder) which aims at supporting this aspect of the Analysis stage of software development in an Object-Oriented framework. CM-Builder uses robust Natural Language Processing techniques to analyse software requirements texts written in English and constructs, either automatically or interactively with an analyst, an initial UML Class Model representing the object classes mentioned in the text and the relationships among them. The initial model can be directly input to a graphical CASE tool for further refinement by a human analyst. CM-Builder has been quantitatively evaluated in blind trials against a collection of unseen software requirements texts and we present the results of this evaluation, together with the evaluation method. The results are very encouraging and demonstrate that tools such as CM-Builder have the potential to play an important role in the software development process.
Published: 2003

34. A Hybrid Approach to Multi-document Summarization of Opinions in Reviews

Author: Robert Gaizauskas, Giuseppe Di Fabbrizio, and Amanda Stent
Subjects: Information retrieval, Computer science, business.industry, Natural language generation, computer.software_genre, Hybrid approach, Salient, Multi-document summarization, Selection (linguistics), Product (category theory), Artificial intelligence, business, computer, Sentence, Natural language processing, Natural language
Abstract: We present a hybrid method to generate summaries of product and services reviews by combining natural language generation and salient sentence selection techniques. Our system, STARLET-H, receives as input textual reviews with associated rated topics, and produces as output a natural language document summarizing the opinions expressed in the reviews. STARLET-H operates as a hybrid
Published: 2014

35. Assigning Terms to Domains by Document Classification

Author: Emma Barker, Robert Gaizauskas, Ahmet Aker, and Monica Lestari Paramita
Subjects: Thesaurus (information retrieval), Information retrieval, Exploit, Computer science, business.industry, Document classification, computer.software_genre, Term (time), Domain (software engineering), Identification (information), ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, business, computer, Natural language processing
Abstract: In this paper we investigate a number of questions relating to the identification of the domain of a term by domain classification of the document in which the term occurs. We propose and evaluate a straightforward method for domain classification of documents in 24 languages that exploits a multilingual thesaurus and Wikipedia. We investigate and provide quantitative results about the extent to which humans agree about the domain classification of documents and terms, and also the extent to which terms are likely to “inherit” the domain of their parent document.
Published: 2014

36. Graph Ranking for Collective Named Entity Disambiguation

Author: Robert Gaizauskas and Ayman Alhelbawy
Subjects: Named entity, Entity linking, Information retrieval, Ranking, business.industry, Computer science, Graph (abstract data type), Artificial intelligence, business, computer.software_genre, computer, Natural language processing
Abstract: Named Entity Disambiguation (NED) refers to the task of mapping different named entity mentions in running text to their correct interpretations in a specific knowledge base (KB). This paper presents a collective disambiguation approach using a graph model. All possible NE candidates are represented as nodes in the graph and associations between different candidates are represented by edges between the nodes. Each node has an initial confidence score, e.g. entity popularity. Page-Rank is used to rank nodes and the final rank is combined with the initial confidence for candidate selection. Experiments on 27,819 NE textual mentions show the effectiveness of using Page-Rank in conjunction with initial confidence: 87% accuracy is achieved, outperforming both baseline and state-of-the-art approaches.
Published: 2014

37. A Poodle or a Dog? Evaluating Automatic Image Annotation Using Human Descriptions at Different Levels of Granularity

Author: Robert Gaizauskas, Ahmet Aker, Josiah Wang, and Fei Yan
Subjects: Automatic image annotation, Information retrieval, Computer science, Granularity, Variation (game tree), Object (computer science), Image (mathematics), Task (project management)
Abstract: Different people may describe the same object in different ways, and at varied levels of granularity (“poodle”, “dog”, “pet” or “animal”?). In this paper, we propose the idea of ‘granularity-aware’ groupings where semantically related concepts are grouped across different levels of granularity to capture the variation in how different people describe the same image content. The idea is demonstrated in the task of automatic image annotation, where these semantic groupings are used to alter the results of image annotation in a manner that affords different insights from its initial, category-independent rankings. The semantic groupings are also incorporated during evaluation against image descriptions written by humans. Our experiments show that semantic groupings result in image annotations that are more informative and flexible than without groupings, although being too flexible may result in image annotations that are less informative.
Published: 2014

38. Natural language question answering: the view from here

Author: Robert Gaizauskas and Lynette Hirschman
Subjects: World Wide Web, Linguistics and Language, Information retrieval, Artificial Intelligence, Computer science, Ask price, Natural language question answering, Everyday language, Question answering, Context (language use), Text Retrieval Conference, Language and Linguistics, Software
Abstract: As users struggle to navigate the wealth of on-line information now available, the need for automated question answering systems becomes more urgent. We need systems that allow a user to ask a question in everyday language and receive an answer quickly and succinctly, with sufficient context to validate the answer. Current search engines can return ranked lists of documents, but they do not deliver answers to the user. Question answering systems address this problem. Recent successes have been reported in a series of question-answering evaluations that started in 1999 as part of the Text Retrieval Conference (TREC). The best systems are now able to answer more than two thirds of factual questions in this evaluation.
Published: 2001

39. Visual Tools for Natural Language Processing

Author: Peter Rodgers, Kevin Humphreys, and Robert Gaizauskas
Subjects: business.industry, Programming language, Computer science, Computer programming, computer.file_format, Modular design, computer.software_genre, Language and Linguistics, QA76, Computer Science Applications, Visualization, Human-Computer Interaction, Data flow diagram, Data dependency, Graph (abstract data type), Artificial intelligence, Executable, business, computer, Natural language processing, Visual programming language
Abstract: We describe GATE, the General Architecture for Text Engineering, an integrated visual development environment to support the visual assembly, execution and analysis of modular natural language processing systems. The visual model is an executable data flow program graph, automatically synthesised from data dependency declarations of language processing modules. The graph is then directly executable: modules are run interactively in the graph, and results are accessible via generic text visualisation tools linked to the modules. These tools lighten the ‘cognitive load’ of viewing and comparing module results by relating data produced by modules back to the underlying text, by reducing the amount of search in examining results, and by displaying results in context. Overall, the GATE integrated visual development environment leads to rapid understanding of system behaviour and hence to rapid system refinement, therefore demonstrating the utility of visual programming and visualisation techniques for the development of natural language processing systems.
Published: 2001

40. Named Entity Disambiguation Using HMMs

Author: Ayman Alhelbawy and Robert Gaizauskas
Subjects: Sequence, Markov chain, Computer science, business.industry, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Viterbi algorithm, computer.software_genre, symbols.namesake, Entity linking, Hidden variable theory, symbols, State space, Forward algorithm, Artificial intelligence, Hidden Markov model, business, computer, Natural language processing
Abstract: In this paper we present a novel approach to disambiguate textual mentions of named entities against the Wikipedia knowledge base. The conditional dependencies between different named entities across Wikipedia are represented as a Markov network. In our approach, named entities are treated as hidden variables and textual mentions as observations. The number of states and observations is huge and naively using the Viterbi algorithm to find the hidden state sequence that emits the query observation sequence is computationally infeasible, given a state space of this size. Based on an observation that is specific to the disambiguation problem, we propose an approach that uses a tailored approximation to reduce the size of the state space, making the Viterbi algorithm feasible. Results show good improvement in disambiguation accuracy relative to the baseline approach and to some state-of-the-art approaches. Also, our approach shows how, with suitable approximations, HMMs can be used in such large-scale state space problems.
Published: 2013

41. Bioinformatics applications of information extraction from scientific journal articles

Author: George Demetriou, Robert Gaizauskas, and Kevin Humphreys
Subjects: Computer science, 05 social sciences, 02 engineering and technology, Library and Information Sciences, computer.software_genre, Bioinformatics, Data science, World Wide Web, Information extraction, Agency (sociology), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0509 other social sciences, 050904 information & library sciences, computer, Information Systems
Abstract: Information extraction technology, as defined and developed through the US Defense Advanced Research Projects Agency (DARPA) Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this paper, the application of this technology to the extraction of information from scientific journal papers in the area of molecular biology is considered. In particular, it is described how an information extraction system designed to participate in the MUC exercises has been modified for two bioinformatics applications: one concerned with enzyme and metabolic pathways; the other with protein structure. Progress to date provides convincing grounds for believing that information extraction techniques will deliver novel and effective ways for scientists to make use of the core literature which defines their disciplines.
Published: 2000

42. Evaluating two methods for Treebank grammar compaction

Author: Mark Hepple, Alexander Krotov, Yorick Wilks, and Robert Gaizauskas
Subjects: ID/LP grammar, Linguistics and Language, Parsing, Computer science, business.industry, Link grammar, Mildly context-sensitive grammar formalism, computer.software_genre, Language and Linguistics, Tree-adjoining grammar, TheoryofComputation_MATHEMATICALLOGICANDFORMALLANGUAGES, Artificial Intelligence, Stochastic grammar, Synchronous context-free grammar, Artificial intelligence, L-attributed grammar, business, computer, Software, Natural language processing
Abstract: Treebanks, such as the Penn Treebank, provide a basis for the automatic creation of broad coverage grammars. In the simplest case, rules can simply be ‘read off’ the parse-annotations of the corpus, producing either a simple or probabilistic context-free grammar. Such grammars, however, can be very large, presenting problems for the subsequent computational costs of parsing under the grammar. In this paper, we explore ways by which a treebank grammar can be reduced in size or ‘compacted’, which involve the use of two kinds of technique: (i) thresholding of rules by their number of occurrences; and (ii) a method of rule-parsing, which has both probabilistic and non-probabilistic variants. Our results show that by a combined use of these two techniques, a probabilistic context-free grammar can be reduced in size by 62% without any loss in parsing performance, and by 71% to give a gain in recall, but some loss in precision.
Published: 1999

43. Using a language independent domain model for multilingual information extraction

Author: Kevin Humphreys, Robert Gaizauskas, Saliha Azzam, and Yorick Wilks
Subjects: Information retrieval, Machine translation, business.industry, Computer science, Volume (computing), Domain model, Construct (python library), computer.software_genre, Domain (software engineering), Information extraction, Artificial Intelligence, Artificial intelligence, Representation (mathematics), business, computer, Natural language processing
Abstract: The volume of electronic text in different languages, particularly on the World Wide Web, is growing significantly, and the problem of users who are restricted in the number of languages they read obtaining information from this text is becoming more widespread. This article investigates some of the issues involved in achieving multilingual information extraction (IE), describes the approach adopted in the M-LaSIE-II IE system, which addresses these problems, and presents the results of evaluating the approach against a small parallel corpus of English/French newswire texts. The approach is based on the assumption that it is possible to construct a language independent representation of concepts relevant to the domain, at least for the small well-defined domains typical of IE tasks, allowing multilingual IE to be successfully carried out without requiring full machine translation.
Published: 1999

44. Evaluation in language and speech technology

Author: Robert Gaizauskas
Subjects: Human-Computer Interaction, Computer science, Speech technology, Measure (physics), Software, Linguistics, Theoretical Computer Science
Abstract: I often say that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be. Lord Kelvin, Popular Lectures and Addresses, (1889), vol. 1, p. 73.
Published: 1998

45. Karen Sparck Jones and Julia Galliers, Evaluating Natural Language Processing Systems: An Analysis and Review. Berlin: Springer-Verlag, 1996. ISBN 3 540 61309 9, Price DM54.00 (paperback). 228 pages

Author: Robert Gaizauskas
Subjects: Linguistics and Language, Artificial Intelligence, Computer science, Art history, Language and Linguistics, Software
Published: 1998

46. Information extraction: beyond document retrieval

Author: Robert Gaizauskas and Yorick Wilks
Subjects: Information retrieval, Machine translation, Computer science, Information processing, Library and Information Sciences, computer.software_genre, Information science, Information extraction, Text processing, Information system, Document retrieval, computer, Natural language, Information Systems
Abstract: In this paper we give a synoptic view of the growth of the text processing technology of information extraction (IE) whose function is to extract information about a pre‐specified set of entities, relations or events from natural language texts and to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960s and 70s till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
Published: 1998

47. Using a semantic network for information extraction

Author: Robert Gaizauskas and Kevin Humphreys
Subjects: Message Understanding Conference, Linguistics and Language, Generality, Knowledge representation and reasoning, Computer science, business.industry, WordNet, computer.software_genre, Rotation formalisms in three dimensions, Language and Linguistics, Semantic network, Information extraction, Artificial Intelligence, Artificial intelligence, Heuristics, business, computer, Software, Natural language processing
Abstract: This paper describes the approach to knowledge representation taken in the LaSIE Information Extraction (IE) system. Unlike many IE systems that skim texts and use large collections of shallow, domain-specific patterns and heuristics to fill in templates, LaSIE attempts a fuller text analysis, first translating individual sentences to a quasi-logical form, and then constructing a weak discourse model of the entire text from which template fills are finally derived. Underpinning the system is a general ‘world model’, represented as a semantic net, which is extended during the processing of a text by adding the classes and instances described in that text. In the paper we describe the system's knowledge representation formalisms, their use in the IE task, and how the knowledge represented in them is acquired, including experiments to extend the system's coverage using the WordNet general purpose semantic network. Preliminary evaluations of our approach, through the Sixth DARPA Message Understanding Conference, indicate comparable performance to shallower approaches. However, we believe its generality and extensibility offer a route towards the higher precision that is required of IE systems if they are to become genuinely usable technologies.
Published: 1997

48. Information Retrieval for Temporal Bounding

Author: Leon Derczynski and Robert Gaizauskas
Subjects: Reference Document, Information retrieval, Bounding overwatch, Computer science, Interval temporal logic, Assertion, Event mining, Reference dataset, Task (project management)
Abstract: The temporal bounding problem is that of finding the beginning and ending times of a temporal interval during which an assertion holds. Existing approaches to temporal bounding have assumed the provision of a reference document from which to extract temporal bounds. We argue that a real-world setting does not include a reference document and that an information retrieval step is often required in order to locate documents containing candidate beginning and end times. We call this task "Information Retrieval for Temporal Bounding". This paper defines the task and discusses suitable evaluation metrics, as well as demonstrating the task's difficulty using a reference dataset.
Published: 2013

49. Methods for Collection and Evaluation of Comparable Documents

Author: Paul Clough, Mark Sanderson, Evangelos Kanoulas, Monica Lestari Paramita, Robert Gaizauskas, and David Guthrie
Subjects: Information retrieval, Machine translation, business.industry, Computer science, media_common.quotation_subject, Evaluation methods, Quality (business), Artificial intelligence, business, computer.software_genre, computer, Natural language processing, media_common
Abstract: Considerable attention is being paid to methods for gathering and evaluating comparable corpora, not only to improve Statistical Machine Translation (SMT) but for other applications as well, e.g. the extraction of paraphrases. The potential value of such corpora requires efficient and effective methods for gathering and evaluating them. Most of these methods have been tested in retrieving document pairs for well-resourced languages; however, there is a lack of work on less popular (under-resourced) languages or domains. This chapter describes the work in developing methods for automatically gathering comparable corpora from the Web, specifically for under-resourced languages. Different online sources are investigated and an evaluation method is developed to assess the quality of the retrieved documents.
Published: 2013

50. Summarizing Opinion-Related Information for Mobile Devices

Author: Giuseppe Di Fabbrizio, Robert Gaizauskas, and Amanda Stent
Subjects: Information retrieval, Computer science, Reading (process), media_common.quotation_subject, Sentiment analysis, Hybrid approach, Mobile device, Ordinal regression, Automatic summarization, media_common
Abstract: Reviews about products and services are abundantly available online. However, gathering information relevant to shoppers involves a significant amount of time reading reviews and weeding out extraneous information. While recent work in multi-document summarization has attempted to some degree to address this challenge, many questions about extracting and aggregating opinions remain unanswered. This chapter demonstrates a novel approach to review summarization, using three techniques: (1) graphical summarization; (2) review summarization; and (3) a hybrid approach, which combines abstractive and extractive summarization methods, to extract relevant opinions and relative ratings from text documents. All three methods allow a consistent approach to preserve the overall opinion distribution that is expressed in the original reviews.
Published: 2012
