15 results on '"Jatowt, Adam"'
Search Results
2. The Corpora They Are a-Changing: a Case Study in Italian Newspapers
- Author
-
Basile, Pierpaolo, Caputo, Annalina, Caselli, Tommaso, Cassotti, Pierluigi, Varvara, Rossella, Tahmasebi, Nina, Jatowt, Adam, Xu, Yang, Hengchen, Simon, Montariol, Syrielle, Dubossarsky, Haim, and Computational Linguistics (CL)
- Subjects
business.industry, Computer science, media_common.quotation_subject, Computational linguistics, computer.software_genre, Newspaper, Semantic change, Robustness (computer science), Benchmark (surveying), Quality (business), Hofstede's cultural dimensions theory, Artificial intelligence, business, Set (psychology), computer, Reliability (statistics), Natural language processing, media_common
- Abstract
The use of automatic methods for the study of lexical semantic change (LSC) has led to the creation of evaluation benchmarks. Benchmark datasets, however, are intimately tied to the corpus used for their creation, which calls into question their reliability as well as the robustness of automatic methods. This contribution investigates these aspects, showing the impact of unforeseen social and cultural dimensions. We also identify a set of additional issues (OCR quality, named entities) that impact the performance of the automatic methods, especially when used to discover LSC.
- Published
- 2021
3. Non-Parametric Subject Prediction
- Author
-
Wang, Shenghui, Koopman, Rob, Englebienne, Gwenn, Doucet, Antoine, Isaac, Antoine, Golub, Koraljka, Aalberg, Trond, and Jatowt, Adam
- Subjects
Similarity (geometry), business.industry, Computer science, Random projection, 05 social sciences, Search engine indexing, 02 engineering and technology, Digital library, Machine learning, computer.software_genre, Non-parametric method, Set (abstract data type), Scalability, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Embedding, Semantic embedding, 020201 artificial intelligence & image processing, Artificial intelligence, 0509 other social sciences, 050904 information & library sciences, business, computer, Subject prediction
- Abstract
Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an “extreme multi-label classification” problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
- Published
- 2019
4. A Framework for Citing Nanopublications
- Author
-
Fabris, Erika, Kuhn, Tobias, Silvello, Gianmaria, Doucet, Antoine, Isaac, Antoine, Golub, Koraljka, Aalberg, Trond, Jatowt, Adam, Business Web and Media, Network Institute, and Intelligent Information Systems
- Subjects
0303 health sciences, Focus (computing), Computer science, 05 social sciences, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Nanopublication, Data science, Data citation, 03 medical and health sciences, 0509 other social sciences, 050904 information & library sciences, Citation, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries), DisGeNET, 030304 developmental biology
- Abstract
In this paper we discuss the role of the Nanopublication (nanopub) model for scholarly publications, with particular focus on the citation of nanopubs. To this end, we contribute to the state of the art in data citation by proposing: the nanocitation framework, which defines the main steps to create a text snippet and a machine-readable citation given a single nanopub; an ad-hoc metadata schema for encoding nanopub citations; and an open-source and publicly available citation system.
- Published
- 2019
5. Coner: A Collaborative Approach for Long-Tail Named Entity Recognition in Scientific Publications
- Author
-
Vliegenthart, Daniel, Mesbah, S., Lofi, C., Aizawa, Akiko, Bozzon, A., Doucet, Antoine, Isaac, Antoine, Golub, Koraljka, Aalberg, Trond, and Jatowt, Adam
- Subjects
Training set, business.industry, Computer science, 020207 software engineering, 02 engineering and technology, computer.software_genre, Machine learning, Small set, Task (project management), Domain (software engineering), Named-entity recognition, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Relevance (information retrieval), Artificial intelligence, Heuristics, business, computer, Test data
- Abstract
Named Entity Recognition (NER) for rare long-tail entities, such as those often found in domain-specific scientific publications, is a challenging task, as the extensive training and test data needed for fine-tuning NER algorithms are typically lacking. Recent approaches presented promising solutions relying on training NER algorithms in an iterative weakly-supervised fashion, thus limiting human interaction to only providing a small set of seed terms. Such approaches rely heavily on heuristics in order to cope with the limited training data size. As these heuristics are prone to failure, the overall achievable performance is limited. In this paper, we therefore introduce a collaborative approach which incrementally incorporates human feedback on the relevance of extracted entities into the training cycle of such iterative NER algorithms. This approach, called Coner, makes it possible to train new domain-specific rare long-tail NER extractors at low cost, with ever-increasing performance while the algorithm is actively used in an application.
- Published
- 2019
6. Using Web Archive for Improving Search Engine Results.
- Author
-
Xiaofang Zhou, Jianzhong Li, Heng Tao Shen, Kitsuregawa, Masaru, Yanchun Zhang, Jatowt, Adam, Kawai, Yukiko, and Tanaka, Katsumi
- Abstract
Search engines affect page popularity by making it difficult for currently unpopular pages to reach the top ranks in the search results. This is because people tend to visit and create links to the top-ranked pages. We have addressed this problem by analyzing the previous content of web pages. Our approach is based on the observation that the quality of this content greatly affects link accumulation and hence the final rank of the page. We propose detecting the content that has the greatest impact on the link accumulation process of top-ranked pages and using it for detecting high quality but unpopular web pages. Such pages would have higher ranks assigned. [ABSTRACT FROM AUTHOR]
- Published
- 2006
7. Personalized Detection of Fresh Content and Temporal Annotation for Improved Page Revisiting.
- Author
-
Bressan, Stephane, Küng, Josef, Wagner, Roland, Jatowt, Adam, Kawai, Yukiko, and Tanaka, Katsumi
- Abstract
Page revisiting is a popular browsing activity on the Web. In this paper we describe a method for improving page revisiting by detecting and highlighting the information on browsed Web pages that is fresh for a user. Content freshness is determined based on comparison with the previously viewed versions of pages. Any new content for the user is marked, enabling the user to quickly spot it. We also describe a mechanism for visually informing users about the degree of freshness of linked pages. By indicating the freshness level of content on linked pages, the system enables users to navigate the Web more effectively. Finally, we propose and demonstrate the concept of determining the user-dependent, subjective age of page contents. Using this method, elements of Web pages are annotated with dates indicating the first time the elements were accessed by the user. Keywords: page revisiting, fresh information retrieval, change detection. [ABSTRACT FROM AUTHOR]
- Published
- 2006
8. Temporal Ranking of Search Engine Results.
- Author
-
Ngu, Anne H. H., Kitsuregawa, Masaru, Neuhold, Erich J., Chung, Jen-Yao, Sheng, Quan Z., Jatowt, Adam, Kawai, Yukiko, and Tanaka, Katsumi
- Abstract
Existing search engines contain a picture of the Web from the past, and their ranking algorithms are based on data crawled some time ago. However, a user requires not only relevant but also fresh information. We have developed a method for adjusting the ranking of search engine results from the point of view of page freshness and relevance. It uses an algorithm that post-processes search engine results based on the changed contents of the pages. By analyzing archived versions of web pages we estimate temporal qualities of pages, that is, general freshness and relevance of the page to the query topic over certain time frames. For the top quality web pages, the content differences between past snapshots of the pages indexed by a search engine and their present versions are analyzed. Based on these differences, the algorithm assigns new ranks to the web pages without the need to maintain a constantly updated index of web documents. [ABSTRACT FROM AUTHOR]
- Published
- 2005
9. Web Content Transformed into Humorous Dialogue-Based TV-Program-Like Content.
- Author
-
Maybury, Mark, Stock, Oliviero, Wahlster, Wolfgang, Nadamoto, Akiyo, Jatowt, Adam, Hayashi, Masaki, and Tanaka, Katsumi
- Abstract
A browsing system is described for transforming declarative web content into humorous-dialogue TV-program-like content that is presented through character agent animation and synthesized speech. We call this system Web2Talkshow; it enables users to obtain web content in a manner similar to watching TV. Web content is transformed into humorous dialogues based on the keyword set of the web page. By using Web2Talkshow, users will be able to watch and listen to desired web content in an easy, pleasant, and user-friendly way, like watching a comedy show. [ABSTRACT FROM AUTHOR]
- Published
- 2005
10. Report on the WebQuality 2015 Workshop.
- Author
-
Nielek, Radoslaw, Wierzbicki, Adam, Jatowt, Adam, and Tanaka, Katsumi
- Subjects
WORLD Wide Web, INTERNET content, INFORMATION retrieval, COMPUTER science, CONFERENCES & conventions
- Abstract
The 5th International Workshop on Web Quality (WebQuality 2015) was held in conjunction with the 24th International World Wide Web Conference in Florence, Italy, on 18 May 2015. This report briefly summarizes the workshop. [ABSTRACT FROM AUTHOR]
- Published
- 2016
11. Multilingual Epidemic Event Extraction
- Author
-
Mutuvi, Stephen, Boros, Emanuela, Doucet, Antoine, Lejeune, Gaël, Jatowt, Adam, Odeo, Moses, Multimedia University (MMU), Laboratoire Informatique, Image et Interaction - EA 2118 (L3I), Université de La Rochelle (ULR), Sens, Texte, Informatique, Histoire (STIH), Sorbonne Université (SU), Universität Innsbruck [Innsbruck], Multimedia university (MMU), Hao-Ren Ke, Chei Sian Lee, and Kazunari Sugiyama
- Subjects
business.industry, Computer science, Event (relativity), Epidemiological surveillance, Multilingualism, Semi-supervised learning, Epidemiological surveillance, Multilingualism, 02 engineering and technology, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [INFO.INFO-IR]Computer Science [cs]/Information Retrieval [cs.IR], 020204 information systems, Semi-supervised learning, 0202 electrical engineering, electronic engineering, information engineering, [INFO.INFO-DL]Computer Science [cs]/Digital Libraries [cs.DL], 020201 artificial intelligence & image processing, Extraction (military), [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], Artificial intelligence, business, computer, Natural language processing
- Abstract
In this paper, we focus on epidemic event extraction in multilingual and low-resource settings. The task of extracting epidemic events is defined as the detection of disease names and locations in a document. We experiment with a multilingual dataset comprising news articles from the medical domain with diverse morphological structures (Chinese, English, French, Greek, Polish, and Russian). We investigate various Transformer-based models, also adopting a two-stage strategy, first finding the documents that contain events and then performing event extraction. Our results show that error propagation to the downstream task was higher than expected. We also perform an in-depth analysis of the results, concluding that different entity characteristics can influence the performance. Moreover, we perform several preliminary experiments for the low-resourced languages present in the dataset using the mean teacher semi-supervised technique. Our findings show the potential of pre-trained language models benefiting from the incorporation of unannotated data in the training process.
12. The Rise and Rise of Interdisciplinary Research: Understanding the Interaction Dynamics of Three Major Fields – Physics, Mathematics and Computer Science
- Author
-
Hazra, Rima, Singh, Mayank, Goyal, Pawan, Adhikari, Bibhas, Mukherjee, Animesh, Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Woeginger, Gerhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Jatowt, Adam (editor), Maeda, Akira (editor), and Syn, Sue Yeon (editor)
- Published
- 2019
13. Clipping the Page – Automatic Article Detection and Marking Software in Production of Newspaper Clippings of a Digitized Historical Journalistic Collection
- Author
-
Kimmo Kettunen, Erno Liukkonen, Tuula Pääkkönen, Doucet, Antoine, Isaac, Antoine, Golub, Koraljka, Aalberg, Trond, Jatowt, Adam, and The National Library of Finland, Research Library
- Subjects
business.industry, National library, Computer science, 518 Media and communications, 05 social sciences, Usability, 113 Computer and information sciences, 050905 science studies, Newspaper, Set (abstract data type), World Wide Web, Software, Clipping (photography), 0509 other social sciences, 050904 information & library sciences, business
- Abstract
This paper describes the use of article detection and extraction on the Finnish Digi (https://digi.kansalliskirjasto.fi/etusivu?set_language=en) newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869–1918. We use PIVAJ software [1] for detection and marking of articles in our collection. From the separated articles we can produce automatic clippings for the user. The user can collect clippings for their own use both as images and as OCRed text. Together these functionalities improve the usability of the digitized journalistic collection by providing structured access to the contents of a page.
- Published
- 2019
14. Stable Word-Clouds for Visualising Text-Changes Over Time
- Author
-
Jörg Ritter, Marcus Pöckelmann, Mark M. Hall, Elisa Herold, Christian Berg, Doucet, Antoine, Isaac, Antoine, Golub, Koraljka, Aalberg, Trond, and Jatowt, Adam
- Subjects
Layout algorithm, Information retrieval, Orthogonality (programming), Series (mathematics), Computer science, 020207 software engineering, Context (language use), 02 engineering and technology, Space (commercial competition), 01 natural sciences, Visualization, 010104 statistics & probability, Simple (abstract algebra), 0202 electrical engineering, electronic engineering, information engineering, 0101 mathematics, Word (computer architecture)
- Abstract
Word-clouds are a useful tool for providing overviews of texts, visualising relevant words. Multiple word-clouds can also be used to visualise changes over time in a text. This requires that the words in the individual word-clouds have stable positions, as otherwise it is very difficult to see what changed between two consecutive word-clouds. Existing approaches have used coordinated positioning algorithms, which do not allow for their use in an online, dynamic context. In this paper we present a fast word-cloud algorithm that uses word orthogonality to determine which words can share the same space in the word-clouds, combined with a simple but fast spiral-based layout algorithm. The evaluation shows that the algorithm achieves its goal of creating series of word-clouds fast enough to enable use in an online, dynamic context.
- Published
- 2019
15. Linguistic change and historical periodization of Old Literary Finnish
- Author
-
Mika Hämäläinen, Jack Rueter, Niko Partanen, Khalid Alnajjar, Tahmasebi, Nina, Jatowt, Adam, Xu, Yang, Hengchen, Simon, Montariol, Syrielle, Dubossarsky, Haim, Department of Finnish, Finno-Ugrian and Scandinavian Studies, Department of Digital Humanities, Department of Computer Science, and Language Technology
- Subjects
Periodization, Computer science, Lemmatisation, education, Word error rate, 6121 Languages, Linguistic change, 113 Computer and information sciences, Proxy (statistics), Linguistics, Word (computer architecture)
- Abstract
In this study, we have normalized and lemmatized an Old Literary Finnish corpus using a lemmatization model trained on texts from Agricola. We analyse the error types that occur in different decades, and use word error rate (WER) and different error types as a proxy for measuring linguistic innovation and change. We show that the proposed approach works, and that the errors are connected to accumulating changes and innovations, which also result in a continuous decrease in the accuracy of the model. The described error types also guide further work in improving these models, and document the currently observed issues. We have also trained word embeddings for four centuries of lemmatized Old Literary Finnish, which are available on Zenodo.
Catalog
Discovery Service for Jio Institute Digital Library