Author: "Andrey Kutuzov" / Topic: business - Searchworks@Jio Institute Digital Library Search Results

1. Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

Author: Tatyana Iazykova, Olga Bystrova, Andrey Kutuzov, and Denis Kapelyushnik
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Exploit, business.industry, Computer science, Natural language understanding, Rule-based system, computer.software_genre, Machine learning, Modern language, Benchmark (computing), Artificial intelligence, Language model, Heuristics, business, Set (psychology), Computation and Language (cs.CL), computer
Abstract: Leader-boards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams as well as their resources to collaborate and solve a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even higher than human performance. These results encouraged more thorough analysis of whether the benchmark datasets featured any statistical cues that machine-learning-based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts. This allows solving certain tasks with very simple rules and achieving competitive rankings. In this paper, a similar analysis was done for the Russian SuperGLUE (RSG), a recently published benchmark set and leader-board for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics. Approaches based on simple rules often outperform or come close to the results of the notorious pre-trained language models like GPT-3 or BERT. It is likely (as the simplest explanation) that a significant part of the SOTA models' performance in the RSG leader-board is due to exploiting these shallow heuristics and has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leader-board even more representative of the real progress in Russian NLU. Comment: Accepted to Dialogue'2021
Published: 2021

2. Semantic Recommendation System for Bilingual Corpus of Academic Papers

Author: Irina Nikishina, Anna Safaryan, Petr Filchenkov, Andrey Kutuzov, and Weijia Yan
Subjects: Word embedding, Computer science, business.industry, Bilingual dictionary, Cosine similarity, Semantic search, Recommender system, computer.software_genre, Task (project management), Semantic similarity, Relevance (information retrieval), Artificial intelligence, business, computer, Natural language processing
Abstract: We tested four methods of making document representations cross-lingual for the task of semantic search for similar papers, based on a corpus of papers from three Russian conferences on NLP: Dialogue, AIST and AINL. The pipeline consisted of three stages: preprocessing, word-by-word vectorisation using models obtained with various methods to map vectors from two independent vector spaces to a common one, and search for the most similar papers based on the cosine similarity of text vectors. The four methods used can be grouped into two approaches: 1) aligning two pretrained monolingual word embedding models with a bilingual dictionary on our own (for example, with the VecMap algorithm) and 2) using pre-aligned cross-lingual word embedding models (MUSE). To find out which approach brings more benefit to the task, we conducted a manual evaluation of the results and calculated the average precision of recommendations for all the methods mentioned above. MUSE turned out to have the highest search relevance, but the other methods produced more recommendations in a language other than that of the target paper.
Published: 2021
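
The last two pipeline stages in this abstract (word-by-word vectorisation in a shared space, then cosine-similarity search over text vectors) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: averaging word vectors into a text vector is an assumption, and the toy `EMBEDDINGS` table, the `PAPERS` dictionary and all function names are hypothetical stand-ins for real cross-lingual embeddings and a real corpus.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of all tokens present in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known words, so no text vector can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_id, papers, embeddings, top_n=3):
    """Rank all other papers by cosine similarity to the query paper."""
    doc_vecs = {pid: average_vector(toks, embeddings) for pid, toks in papers.items()}
    query = doc_vecs[query_id]
    if query is None:
        return []
    scored = [
        (pid, cosine(query, vec))
        for pid, vec in doc_vecs.items()
        if pid != query_id and vec is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors standing in for an aligned English-Russian space.
EMBEDDINGS = {
    "parser": [0.9, 0.1, 0.0],
    "парсер": [0.88, 0.12, 0.02],
    "syntax": [0.8, 0.2, 0.1],
    "синтаксис": [0.79, 0.22, 0.08],
    "sentiment": [0.1, 0.9, 0.2],
    "тональность": [0.12, 0.88, 0.18],
}

# Hypothetical preprocessed papers (already tokenised and lemmatised).
PAPERS = {
    "en_parsing": ["parser", "syntax"],
    "ru_parsing": ["парсер", "синтаксис"],
    "ru_sentiment": ["тональность"],
}

if __name__ == "__main__":
    for paper_id, score in most_similar("en_parsing", PAPERS, EMBEDDINGS):
        print(paper_id, round(score, 3))
```

With vectors from a well-aligned space, a paper in one language can retrieve related papers in the other, which is the property the manual evaluation in the abstract measures.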

3. Representing ELMo embeddings as two-dimensional text online

Author: Elizaveta Kuzmenko and Andrey Kutuzov
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Word embedding, Computer science, business.industry, Context (language use), Hyperlink, computer.software_genre, Part of speech, Embedding, Artificial intelligence, Web service, business, computer, Computation and Language (cs.CL), Natural language processing, Word (computer architecture), Sentence
Abstract: We describe a new addition to the WebVectors toolkit which is used to serve word embedding models over the Web. The new ELMoViz module adds support for contextualized embedding architectures, in particular for ELMo models. The provided visualizations follow the metaphor of 'two-dimensional text' by showing lexical substitutes: words which are most semantically similar in context to the words of the input sentence. The system allows the user to change the ELMo layers from which token embeddings are inferred. It also conveys corpus information about the query words and their lexical substitutes (namely their frequency tiers and parts of speech). The module is well integrated into the rest of the WebVectors toolkit, providing lexical hyperlinks to word representations in static embedding models. Two web services have already implemented the new functionality with pre-trained ELMo models for Russian, Norwegian and English. Comment: EACL'2021 demo paper
Published: 2021

4. UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection

Author: Andrey Kutuzov and Mario Giulianelli
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, business.industry, Computer science, Cosine similarity, computer.software_genre, SemEval, Task (project management), Ranking (information retrieval), Semantic change, Ranking, Margin (machine learning), Test set, Artificial intelligence, business, Computation and Language (cs.CL), computer, Natural language processing
Abstract: We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set. To appear in Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020)
Published: 2020
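
The two families of measures named in this abstract (cosine similarity between averaged token embeddings, and pairwise distances between token embeddings) can be illustrated with a small sketch. It assumes the contextualised token embeddings for each word have already been extracted from two time periods; the function names and the toy `USAGES` data are hypothetical, and the exact formulations used in the paper may differ.

```python
import math
from itertools import product


def cosine_distance(a, b):
    """1 minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def averaged_embedding_distance(old, new):
    """Cosine distance between the averaged token embeddings of two periods."""
    return cosine_distance(mean_vector(old), mean_vector(new))


def average_pairwise_distance(old, new):
    """Mean cosine distance over all cross-period pairs of token embeddings."""
    total = sum(cosine_distance(a, b) for a, b in product(old, new))
    return total / (len(old) * len(new))


def rank_by_change(usages, measure):
    """Sort words from most to least changed according to the given measure."""
    scores = {word: measure(old, new) for word, (old, new) in usages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical 2-d token embeddings: (period 1 usages, period 2 usages).
USAGES = {
    "stable": ([[1.0, 0.0], [0.9, 0.1]], [[1.0, 0.05], [0.95, 0.1]]),
    "shifted": ([[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]),
}

if __name__ == "__main__":
    print(rank_by_change(USAGES, averaged_embedding_distance))
    print(rank_by_change(USAGES, average_pairwise_distance))
```

Both functions return a graded score per word, which is what a ranking subtask needs; they can disagree when a word's usages are spread out within a period, since averaging hides that spread while the pairwise measure does not.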

5. RuSemShift: a dataset of historical lexical semantic change in Russian

Author: Julia Rodina and Andrey Kutuzov
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer science, business.industry, Process (engineering), ComputingMilieux_LEGALASPECTSOFCOMPUTING, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Task (project management), Annotation, Semantic change, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Computation and Language (cs.CL), computer, Period (music), Natural language processing, Sentence, 0105 earth and related environmental sciences
Abstract: We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sentence contexts extracted from the Russian National Corpus. Additionally, we report the performance of several distributional approaches on RuSemShift, achieving promising results, which at the same time leave room for other researchers to improve. Accepted to COLING 2020
Published: 2020

6. TAXONOMY ENRICHMENT FOR RUSSIAN: SYNSET CLASSIFICATION OUTPERFORMS LINEAR HYPONYM-HYPERNYM PROJECTIONS

Author: A. Plum, Maria Kunilovskaya, and Andrey Kutuzov
Subjects: business.industry, Computer science, Taxonomy (general), Artificial intelligence, computer.software_genre, business, computer, Natural language processing
Abstract: We present the description of our system that was ranked third in the noun sub-track of the Taxonomy Enrichment for the Russian Language shared task offered by Dialogue Evaluation 2020. Our best-performing system appears against the backdrop of other methods and their combinations attempted, and its results argue in favour of Occam’s razor for this task. A simple supervised classifier was trained on static distributional embeddings of hyponym words as features and their numeric hypernym synset identifiers from the taxonomy as class labels. It outperformed more complicated approaches based on learning linear projections from hyponym embeddings to hypernym embeddings and returning synset identifiers for the nearest neighbours of the predicted vectors. Training specially tailored word embeddings for ruWordNet multi-word expressions proved to be one of the key factors for both approaches.
Published: 2020

7. ÚFAL-Oslo at MRP 2019: Garage Sale Semantic Parsing

Author: Andrey Kutuzov, Kira Droganova, Nikita Mediankin, and Daniel Zeman
Subjects: Parsing, Computer science, business.industry, Artificial intelligence, Representation (arts), Meaning (existential), business, computer.software_genre, computer, ComputingMilieux_MISCELLANEOUS, Natural language processing, Task (project management)
Abstract: This paper describes the ÚFAL-Oslo system submission to the shared task on Cross-Framework Meaning Representation Parsing (MRP, Oepen et al. 2019). The submission is based on several third-party parsers. Within the official shared task results, the submission ranked 11th out of 13 participating systems.
Published: 2019

8. Double-Blind Peer-Reviewing and Inclusiveness in Russian NLP Conferences

Author: Irina Nikishina and Andrey Kutuzov
Subjects: Double blind, business.industry, Gender distribution, Selection (linguistics), Sociology, Artificial intelligence, computer.software_genre, business, nobody, computer, Natural language processing
Abstract: Double-blind peer reviewing has proved to be a fairly effective and fair way of selecting academic work. However, to the best of our knowledge, nobody has yet analysed the effects caused by its introduction at the Russian NLP conferences. We investigate how double-blind peer reviewing influences gender and location (according to authors’ affiliations) biases and whether it makes two of the conferences under analysis more inclusive. The results show that gender distribution has become more equal for the Dialogue conference, but did not change for the AIST conference. The authors’ location distribution (roughly divided into ‘central’ and ‘not central’) has become more equal for AIST, but, interestingly, less equal for Dialogue.
Published: 2019

9. One-to-X analogical reasoning on word embeddings: a case for diachronic armed conflict prediction from news texts

Author: Erik Velldal, Andrey Kutuzov, and Lilja Øvrelid
Subjects: FOS: Computer and information sciences, Word embedding, Computer Science - Computation and Language, Computer science, business.industry, media_common.quotation_subject, Analogy, computer.software_genre, Task (project management), Test set, False positive paradox, Artificial intelligence, Function (engineering), business, computer, Publication, Computation and Language (cs.CL), Natural language processing, Word (computer architecture), media_common
Abstract: We extend the well-known word analogy task to a one-to-X formulation, including one-to-none cases, when no correct answer exists. The task is cast as a relation discovery problem and applied to historical armed conflicts datasets, attempting to predict new relations of type 'location:armed-group' based on data about past events. As the source of semantic information, we use diachronic word embedding models trained on English news texts. A simple technique to improve diachronic performance in such a task is demonstrated, using a threshold based on a function of cosine distance to decrease the number of false positives; this approach is shown to be beneficial on two different corpora. Finally, we publish a ready-to-use test set for one-to-X analogy evaluation on historical armed conflicts data. Comment: 1st International Workshop on Computational Approaches to Historical Language Change (ACL 2019)
Published: 2019
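
The thresholding idea in this abstract can be sketched as below: instead of always returning the nearest candidate, only candidates within a cosine-distance threshold of the predicted vector are accepted, so the answer may contain several items or none. The abstract does not state which function of cosine distance is used, so the mean-plus-`k`-standard-deviations rule in `fit_threshold` is an assumption made purely for illustration, as are all names and the toy vectors.

```python
import math
import statistics


def cosine_distance(a, b):
    """1 minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def fit_threshold(train_distances, k=1.0):
    """Assumed rule: mean plus k standard deviations of the distances observed
    between predicted and true vectors on training pairs."""
    return statistics.mean(train_distances) + k * statistics.pstdev(train_distances)


def predict_related(predicted, candidates, threshold):
    """Return every candidate within the threshold, nearest first.

    An empty list is a valid one-to-none answer."""
    hits = [(name, cosine_distance(predicted, vec)) for name, vec in candidates.items()]
    return sorted([hit for hit in hits if hit[1] <= threshold], key=lambda hit: hit[1])


# Hypothetical distances from training pairs and hypothetical candidate vectors.
TRAIN_DISTANCES = [0.05, 0.10, 0.15]
CANDIDATES = {
    "group_a": [0.9, 0.1],
    "group_b": [0.8, 0.3],
    "group_c": [0.0, 1.0],
}

if __name__ == "__main__":
    threshold = fit_threshold(TRAIN_DISTANCES)
    print(predict_related([1.0, 0.0], CANDIDATES, threshold))   # several answers
    print(predict_related([-1.0, 0.0], CANDIDATES, threshold))  # possibly none
```

A tighter threshold (smaller `k`) trades recall for fewer false positives, which is the trade-off the abstract refers to.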

10. RusNLP: Semantic Search Engine for Russian NLP Conference Papers

Author: Irina Nikishina, Andrey Kutuzov, and Amir Bakarov
Subjects: Service (systems architecture), Source code, Computer science, business.industry, media_common.quotation_subject, 05 social sciences, Semantic search, 02 engineering and technology, Recommender system, computer.software_genre, Metadata, Search engine, Semantic similarity, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, 0509 other social sciences, Web service, 050904 information & library sciences, business, computer, Natural language processing, media_common
Abstract: We present RusNLP, a web service implementing a semantic search engine and recommendation system over the proceedings of three major Russian NLP conferences (Dialogue, AIST and AINL). The collected corpus spans 12 years and contains about 400 academic papers in English. The presented web service allows searching for publications semantically similar to arbitrary user queries or to any given paper. Search results can be filtered by authors and their affiliations, conferences or years. They are also interlinked with the NLPub.ru service, making it easier to quickly capture the general focus of each paper. The search engine source code and the publications metadata are freely available for all interested researchers.
Published: 2018

11. Learning Graph Embeddings from WordNet-based Similarity Measures

Author: Andrey Kutuzov, Chris Biemann, Mohammad Dorgham, Alexander Panchenko, and Oleksiy Oliynyk
Subjects: FOS: Computer and information sciences, Theoretical computer science, Computer Science - Computation and Language, Computer science, Graph embedding, business.industry, WordNet, 02 engineering and technology, Distance measures, Semantic similarity, 020204 information systems, Shortest path problem, 0202 electrical engineering, electronic engineering, information engineering, Graph (abstract data type), 020201 artificial intelligence & image processing, Pairwise comparison, Artificial intelligence, business, Computation and Language (cs.CL), Distance
Abstract: We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the proposed model on semantic similarity and word sense disambiguation tasks, using various WordNet-based similarity measures, shows that our approach yields competitive results, outperforming strong graph embedding baselines. The model is computationally efficient, being orders of magnitude faster than the direct computation of graph-based distances. Comment: Accepted to StarSem 2019
Published: 2018

12. Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

Author: Maria Kunilovskaya and Andrey Kutuzov
Subjects: Structure (mathematical logic), Word embedding, business.industry, Computer science, Process (engineering), 02 engineering and technology, computer.software_genre, Task (project management), Set (abstract data type), Semantic similarity, Learning curve, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Publication, computer, Natural language processing
Abstract: In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to the model trained on the Russian National Corpus (RNC). The two corpora differ greatly in size and compilation procedures. We test these differences by evaluating the trained models against the Russian part of the Multilingual SimLex999 semantic similarity dataset. We detect and describe numerous issues in this dataset and publish a new corrected version. Aside from the already known fact that the RNC is generally a better training corpus than web corpora, we enumerate and explain fine differences in how the models process the semantic similarity task, which parts of the evaluation set are difficult for particular models, and why. Additionally, the learning curves for both models are described, showing that the RNC is generally more robust as training material for this task.
Published: 2017

13. Two centuries in two thousand words

Author: Elizaveta Kuzmenko and Andrey Kutuzov
Subjects: Computer science, business.industry, Embedding, Artificial intelligence, computer.software_genre, business, computer, Natural language processing
Published: 2017

14. WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models

Author: Elizaveta Kuzmenko and Andrey Kutuzov
Subjects: Computer science, business.industry, Semantics, computer.software_genre, Social Semantic Web, World Wide Web, Semantic analytics, Semantic Web Stack, User interface, Web service, business, Semantic Web, computer, Data Web
Abstract: The paper presents a free and open source toolkit whose aim is to quickly deploy web services handling distributed vector models of semantics. It fills the gap between training such models (many tools are already available for this) and dissemination of the results to the general public. Our toolkit, WebVectors, provides all the necessary routines for organizing online access to querying trained models via a modern web interface. We also describe two demo installations of the toolkit, featuring several efficient models for English, Russian and Norwegian.
Published: 2017

15. Tracing armed conflicts with diachronic word embedding models

Author: Lilja Øvrelid, Erik Velldal, and Andrey Kutuzov
Subjects: 060201 languages & linguistics, Word embedding, Basis (linear algebra), business.industry, Computer science, 06 humanities and the arts, Tracing, computer.software_genre, Field (computer science), Task (project management), Set (abstract data type), Dynamics (music), 0602 languages and literature, Artificial intelligence, business, computer, Natural language processing, TRACE (psycholinguistics)
Abstract: Recent studies have shown that word embedding models can be used to trace time-related (diachronic) semantic shifts for particular words. In this paper, we evaluate some of these approaches on the new task of predicting the dynamics of global armed conflicts on a year-to-year basis, using a dataset from the field of conflict research as the gold standard and the Gigaword news corpus as the training data. The results show that much work still remains in extracting ‘cultural’ semantic shifts from diachronic word embedding models. At the same time, we present a new task complete with an evaluation set and introduce the ‘anchor words’ method which outperforms previous approaches on this data.
Published: 2017

16. Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants

Author: Erik Velldal, Lilja Øvrelid, and Andrey Kutuzov
Subjects: FOS: Computer and information sciences, 021110 strategic, defence & security studies, Vocabulary, Computer Science - Computation and Language, Word embedding, Relation (database), Computer science, business.industry, media_common.quotation_subject, 05 social sciences, 0211 other engineering and technologies, 02 engineering and technology, computer.software_genre, 0506 political science, Task (project management), 050602 political science & public administration, Artificial intelligence, business, computer, Computation and Language (cs.CL), Natural language processing, Word (computer architecture), TRACE (psycholinguistics), media_common
Abstract: This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation. The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994-2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done. Comment: to appear in EMNLP 2017 proceedings
Published: 2017
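
The "learned transformation matrices" in this abstract map the vector of one member of a relation (a location) towards the vector of the other (an armed group). The sketch below fits such a matrix on toy pairs. It is only an illustration under stated assumptions: the abstract does not say how the matrices are learned, so plain gradient descent on squared error is used here, the incremental model updating is not shown, and all names and data are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fit_projection(sources, targets, lr=0.1, epochs=2000):
    """Fit W minimising the mean squared error between W @ source and target,
    using batch gradient descent (an assumed training procedure)."""
    dim_out, dim_in = len(targets[0]), len(sources[0])
    weights = [[0.0] * dim_in for _ in range(dim_out)]
    n = len(sources)
    for _ in range(epochs):
        grad = [[0.0] * dim_in for _ in range(dim_out)]
        for src, tgt in zip(sources, targets):
            pred = matvec(weights, src)
            for i in range(dim_out):
                err = pred[i] - tgt[i]
                for j in range(dim_in):
                    grad[i][j] += 2 * err * src[j] / n
        for i in range(dim_out):
            for j in range(dim_in):
                weights[i][j] -= lr * grad[i][j]
    return weights


# Hypothetical training pairs: each "armed group" vector is the "location"
# vector rotated by 90 degrees, so an exact linear map exists.
LOCATIONS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
GROUPS = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

if __name__ == "__main__":
    projection = fit_projection(LOCATIONS, GROUPS)
    # Project an unseen location; the result would then be compared with
    # candidate armed-group vectors, for example by cosine similarity.
    print(matvec(projection, [2.0, 0.0]))
```

In a diachronic setting, one such matrix would be fitted on the relation pairs known up to a given year and applied to the embeddings updated with the following year's texts.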

17. Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit

Author: Elizaveta Kuzmenko and Andrey Kutuzov
Subjects: World Wide Web, Web browser, Software, Computer science, business.industry, Semantic computing, 0202 electrical engineering, electronic engineering, information engineering, 020207 software engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, business, Present moment
Abstract: We present WebVectors, a toolkit that facilitates using distributional semantic models in everyday research. Our toolkit has two main features: it allows building web interfaces to query models using a web browser, and it provides an API to query models automatically. Our system is easy to use and can be tuned according to individual demands. This software can be of use to those who need to work with vector semantic models but do not want to develop their own interfaces, or to those who need to deliver their trained models to a large audience. WebVectors features visualizations for various kinds of semantic queries. At present, web services with Russian, English and Norwegian models, built using WebVectors, are available.
Published: 2017

18. Redefining part-of-speech classes with distributional semantic models

Author: Erik Velldal, Andrey Kutuzov, and Lilja Øvrelid
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer science, business.industry, Process (engineering), 02 engineering and technology, 010501 environmental sciences, computer.software_genre, Part of speech, 01 natural sciences, Set (abstract data type), Range (mathematics), Annotation, British National Corpus, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Computation and Language (cs.CL), computer, Natural language processing, Word (computer architecture), 0105 earth and related environmental sciences
Abstract: This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies of the annotation process or guidelines. At the same time, it supports the notion of ‘soft’ or ‘graded’ part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features. © 2016 Association for Computational Linguistics
Published: 2016

19. Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian

Author: Elizaveta Kuzmenko and Andrey Kutuzov
Subjects: Text corpus, Multimedia, Semantic similarity, Computer science, Corpus linguistics, business.industry, Deep learning, Word2vec, Artificial intelligence, computer.software_genre, business, computer, Linguistics
Abstract: In this paper we compare the Russian National Corpus to a larger Russian web corpus composed in 2014; the assumption behind our work is that the National corpus, being limited by the texts it contains and their proportions, presents lexical contexts (and thus meanings) which are different from those found ‘in the wild’ or in a language in use.
Published: 2015

20. Semantic Clustering of Russian Web Search Results: Possibilities and Problems

Author: Andrey Kutuzov
Subjects: Information retrieval, Computer science, business.industry, Word-sense induction, Semantic clustering, Distributional semantics, Construct (python library), Artificial intelligence, computer.software_genre, Cluster analysis, business, computer, Natural language processing
Abstract: The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply the data to cluster the results of Mail.ru search according to meanings in the query. We compare different methods of performing such clustering and different source corpora. Models of applying distributional semantics to big linguistic data are described.
Published: 2015

21. Untangling the Semantic Web: Microdata Use in Russian Video Content Delivery Sites

Author: Andrey Kutuzov and Maxim Ionov
Subjects: Information retrieval, Semantic HTML, Computer science, business.industry, Content delivery, computer.file_format, World Wide Web, Software deployment, Microdata (HTML), Web page, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, The Internet, RDF, business, Semantic Web, computer
Abstract: Nowadays, more and more sites incorporate semantic markup into their pages. This allows search engines to better understand the content of the webpage. This paper investigates the deployment of semantic markup in the form of Microdata on Russian video content delivery sites. We point out commonalities and common problems and link our data set to DBpedia. A general description of quantitative and qualitative features of semantic markup usage is given, based on a large dataset crawled from the Russian Internet segment.
Published: 2014

21 results on '"Andrey Kutuzov"'