21 results for "multilingual language models"
Search Results
2. What Is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models.
- Author
- Yu, Jeongrok, Kim, Seong Ug, Choi, Jacob, and Choi, Jinho D.
- Subjects
- LANGUAGE models, SEX discrimination, CHINESE language, TRANSFORMER models, ENGLISH language
- Abstract
Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based masked language models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is now needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few have explored gender bias in other languages. This paper proposes a multilingual approach to estimating gender bias in MLMs from five languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias. For each language, lexicon-based and model-based methods are applied to create two datasets, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and three new scoring metrics. Our results show that the previous approach is data-sensitive and unstable, suggesting that gender bias should be assessed on a large dataset using multiple evaluation metrics for best practice. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
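To make the scoring idea in the record above concrete, here is a minimal sketch of pseudo-log-likelihood (PLL) scoring for a gendered sentence pair with a masked language model. The checkpoint, the German example pair, and the PLL metric itself are illustrative assumptions; the paper defines its own lexicon-based and model-based datasets and its own four scoring metrics.

```python
# A minimal sketch: compare masked-LM pseudo-log-likelihoods of a
# gendered sentence pair. Model and example are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum log-probabilities of each token when masked one at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# A model that strongly prefers one variant suggests a gender association.
male = pseudo_log_likelihood("Er ist ein erfolgreicher Ingenieur.")
female = pseudo_log_likelihood("Sie ist eine erfolgreiche Ingenieurin.")
print(f"male-vs-female PLL gap: {male - female:+.2f}")
```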
3. Applied Hedge Algebra Approach with Multilingual Large Language Models to Extract Hidden Rules in Datasets for Improvement of Generative AI Applications.
- Author
- Pham, Hai Van and Moore, Philip
- Subjects
- GENERATIVE artificial intelligence, WINDOWS (Graphical user interfaces), LANGUAGE models, CHATGPT, CHATBOTS, BIG data
- Abstract
Generative AI applications have played an increasingly significant role in real-time tracking applications in many domains including, for example, healthcare, consultancy, dialog boxes (a common type of window in an operating system's graphical user interface), monitoring systems, and emergency response. This paper considers generative AI and presents an approach which combines hedge algebra and a multilingual large language model to find hidden rules in big data for ChatGPT. We present a novel method for extracting natural language knowledge from large datasets by leveraging fuzzy sets and hedge algebra to extract these rules, presented as metadata for ChatGPT and generative AI applications. The proposed model has been developed to minimize the computational and staff costs for medium-sized enterprises, which are typically resource- and time-limited, and has been designed to automate question–response interactions for rules extracted from large data in a multiplicity of domains. Experimental results on datasets from specific healthcare domains validate the effectiveness of the proposed model. The ChatGPT application is tested in healthcare case studies using English- and Vietnamese-language datasets. In comparative experimental testing, the proposed model outperformed the state of the art, achieving in the range of 96.70–97.50% performance using a heart dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
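For intuition on how linguistic hedges can act on data, here is a toy illustration of Zadeh-style fuzzy hedges as membership modifiers. This is an assumption-laden simplification: the paper's hedge algebra is a richer symbolic formalism over ordered linguistic terms, not these two functions.

```python
# A toy illustration of linguistic hedges reshaping fuzzy membership
# values (Zadeh-style concentration/dilation). Not the paper's hedge
# algebra formalism; it only conveys how "very" and "slightly"
# systematically modify a base linguistic term.
import math

def very(mu: float) -> float:        # concentration: sharpens the term
    return mu ** 2

def slightly(mu: float) -> float:    # dilation: weakens the term
    return math.sqrt(mu)

# Membership of a patient's heart rate in the fuzzy set "high".
mu_high = 0.7
print(f"high:          {mu_high:.2f}")
print(f"very high:     {very(mu_high):.2f}")       # 0.49
print(f"slightly high: {slightly(mu_high):.2f}")   # 0.84
```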
4. Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews
- Author
- Boluki, Ali, Pourmostafa Roshan Sharami, Javad, Shterionov, Dimitar, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, and Arai, Kohei, editor
- Published
- 2024
- Full Text
- View/download PDF
5. A Generative Artificial Intelligence Using Multilingual Large Language Models for ChatGPT Applications.
- Author
- Tuan, Nguyen Trung, Moore, Philip, Thanh, Dat Ha Vu, and Pham, Hai Van
- Subjects
- GENERATIVE artificial intelligence, LANGUAGE models, CHATGPT, SMALL business, VIETNAMESE language, CHATBOTS
- Abstract
ChatGPT plays significant roles in the third decade of the 21st century. Smart city applications can be integrated with ChatGPT in various fields. This research proposes an approach for developing large language models using generative artificial intelligence models suitable for small- and medium-sized enterprises with limited hardware resources. There are many generative AI systems in operation and in development. However, the technological, human, and financial resources required to develop generative AI systems are impractical for small- and medium-sized enterprises. In this study, we present a proposed approach to reduce training time and computational cost that is designed to automate question–response interactions for specific domains in smart cities. The proposed model utilises the BLOOM approach as its backbone for using generative AI to maximise the effectiveness of small- and medium-sized enterprises. We have conducted a set of experiments on several datasets associated with specific domains to validate the effectiveness of the proposed model. Experiments using datasets for the English and Vietnamese languages have been combined with model training using low-rank adaptation to reduce training time and computational cost. In comparative experimental testing, the proposed model outperformed the 'Phoenix' multilingual chatbot model, achieving 92% of 'ChatGPT' performance on the English benchmark. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
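The low-rank adaptation mentioned above is the part most readers will want to see spelled out. Below is a minimal sketch using the Hugging Face peft library on a small BLOOM checkpoint; the bloom-560m checkpoint and all hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# A minimal sketch of low-rank adaptation (LoRA) on a small BLOOM
# checkpoint. Checkpoint and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the updates
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights train
```

Because only the small low-rank matrices receive gradients, both training time and memory drop sharply, which is the cost reduction the abstract highlights for resource-limited enterprises.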
6. DSG-KD: Knowledge Distillation From Domain-Specific to General Language Models
- Author
- Sangyeon Cho, Jangyeong Jeon, Dongjoon Lee, Changhee Lee, and Junyeong Kim
- Subjects
- Bilingual medical data analysis, emergency room electronic health records, code switching, knowledge distillation, multilingual language models, natural language processing, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
The use of pre-trained language models fine-tuned to address specific downstream tasks is a common approach in natural language processing (NLP). However, acquiring domain-specific knowledge via fine-tuning is challenging. Traditional methods involve pretraining language models using vast amounts of domain-specific data before fine-tuning for particular tasks. This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea. Our findings reveal that existing domain-specific pre-trained language models underperform compared to general language models in handling the N-lingual free-text data characteristic of non-English-speaking regions. To address these limitations, we propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning. This study demonstrates the effective transfer of specialized knowledge between models by defining a general language model as the student model and a domain-specific pre-trained model as the teacher model. In particular, we address the complexities of EMR data obtained from PEDs in non-English-speaking regions, such as Korea, and demonstrate that the proposed method enhances classification performance in such contexts. The proposed methodology not only outperforms baseline models on Korean PED EMR data, but also promises broader applicability in various professional and technical domains. In future work, we intend to extend this methodology to diverse non-English-speaking regions and additional downstream tasks, with the aim of developing advanced model architectures using state-of-the-art KD techniques. The code is available at https://github.com/JoSangYeon/DSG-KD.
- Published
- 2024
- Full Text
- View/download PDF
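The teacher-student setup above rests on the standard soft-label distillation objective. Here is a minimal sketch of that objective; the temperature, mixing weight, and binary label setup are illustrative assumptions rather than DSG-KD's exact loss.

```python
# A minimal sketch of soft-label knowledge distillation: the student
# matches the teacher's temperature-softened distribution while also
# fitting the gold labels. Hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 2.0, alpha: float = 0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                          # rescale gradients for temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 2)              # batch of 4, binary emergency label
teacher = torch.randn(4, 2)
labels = torch.tensor([0, 1, 1, 0])
print(distillation_loss(student, teacher, labels))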
7. What Is Your Favorite Gender, MLM? Gender Bias Evaluation in Multilingual Masked Language Models
- Author
- Jeongrok Yu, Seong Ug Kim, Jacob Choi, and Jinho D. Choi
- Subjects
- bias evaluation, multilingual bias benchmark, multilingual language models, Information technology, T58.5-58.64
- Abstract
Bias is a disproportionate prejudice in favor of one side against another. Due to the success of transformer-based masked language models (MLMs) and their impact on many NLP tasks, a systematic evaluation of bias in these models is now needed more than ever. While many studies have evaluated gender bias in English MLMs, only a few have explored gender bias in other languages. This paper proposes a multilingual approach to estimating gender bias in MLMs from five languages: Chinese, English, German, Portuguese, and Spanish. Unlike previous work, our approach does not depend on parallel corpora coupled with English to detect gender bias in other languages using multilingual lexicons. Moreover, a novel model-based method is presented to generate sentence pairs for a more robust analysis of gender bias. For each language, lexicon-based and model-based methods are applied to create two datasets, which are used to evaluate gender bias in an MLM specifically trained for that language using one existing and three new scoring metrics. Our results show that the previous approach is data-sensitive and unstable, suggesting that gender bias should be assessed on a large dataset using multiple evaluation metrics for best practice.
- Published
- 2024
- Full Text
- View/download PDF
8. adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds.
- Author
- Lankford, Séamus, Afli, Haithem, and Way, Andy
- Subjects
- LANGUAGE models, NATURAL language processing, MACHINE translating, PLAYGROUNDS, WORKFLOW
- Abstract
The advent of Multilingual Language Models (MLLMs) and Large Language Models (LLMs) has spawned innovation in many areas of natural language processing. Despite the exciting potential of this technology, its impact on developing high-quality Machine Translation (MT) outputs for low-resource languages remains relatively under-explored. Furthermore, an open-source application dedicated to both fine-tuning MLLMs and managing the complete MT workflow for low-resource languages remains unavailable. We aim to address these imbalances through the development of adaptMLLM, which streamlines all processes involved in the fine-tuning of MLLMs for MT. This open-source application is tailored for developers, translators, and users who are engaged in MT. It is particularly useful for newcomers to the field, as it significantly streamlines the configuration of the development environment. An intuitive interface allows for easy customisation of hyperparameters, and the application offers a range of metrics for model evaluation and the capability to deploy models as a translation service directly within the application. As a multilingual tool, we used adaptMLLM to fine-tune models for two low-resource language pairs: English to Irish (EN ↔ GA) and English to Marathi (EN ↔ MR). Compared with baselines from the LoResMT2021 Shared Task, the adaptMLLM system demonstrated significant improvements. In the EN → GA direction, an improvement of 5.2 BLEU points was observed, and an increase of 40.5 BLEU points was recorded in the GA → EN direction, representing relative improvements of 14% and 117%, respectively. Significant improvements in the translation performance of the EN ↔ MR pair were also observed, notably in the MR → EN direction, with an increase of 21.3 BLEU points, which corresponds to a relative improvement of 68%. Finally, a fine-grained human evaluation of the MLLM output on the EN → GA pair was conducted using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The application and models are freely available. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
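The paired absolute and relative gains in the abstract above are self-consistent: a relative improvement is (new - old) / old, so each BLEU gain implies a baseline score. The quick check below derives those baselines; the derived numbers are implications of the reported figures, not values reported in the paper.

```python
# Consistency check of the reported adaptMLLM gains. The implied
# baselines are derived from gain / relative_improvement, not reported.
gains = {                 # direction: (absolute BLEU gain, relative gain)
    "EN->GA": (5.2, 0.14),
    "GA->EN": (40.5, 1.17),
    "MR->EN": (21.3, 0.68),
}
for direction, (delta, rel) in gains.items():
    baseline = delta / rel
    print(f"{direction}: implied baseline {baseline:.1f} BLEU, "
          f"fine-tuned {baseline + delta:.1f} BLEU")
```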
9. Applied Hedge Algebra Approach with Multilingual Large Language Models to Extract Hidden Rules in Datasets for Improvement of Generative AI Applications
- Author
- Hai Van Pham and Philip Moore
- Subjects
- generative AI, language comprehension, multilingual language models, large language models, support systems, technological determinism, Information technology, T58.5-58.64
- Abstract
Generative AI applications have played an increasingly significant role in real-time tracking applications in many domains including, for example, healthcare, consultancy, dialog boxes (a common type of window in an operating system's graphical user interface), monitoring systems, and emergency response. This paper considers generative AI and presents an approach which combines hedge algebra and a multilingual large language model to find hidden rules in big data for ChatGPT. We present a novel method for extracting natural language knowledge from large datasets by leveraging fuzzy sets and hedge algebra to extract these rules, presented as metadata for ChatGPT and generative AI applications. The proposed model has been developed to minimize the computational and staff costs for medium-sized enterprises, which are typically resource- and time-limited, and has been designed to automate question–response interactions for rules extracted from large data in a multiplicity of domains. Experimental results on datasets from specific healthcare domains validate the effectiveness of the proposed model. The ChatGPT application is tested in healthcare case studies using English- and Vietnamese-language datasets. In comparative experimental testing, the proposed model outperformed the state of the art, achieving in the range of 96.70–97.50% performance using a heart dataset.
- Published
- 2024
- Full Text
- View/download PDF
10. A Generative Artificial Intelligence Using Multilingual Large Language Models for ChatGPT Applications
- Author
- Nguyen Trung Tuan, Philip Moore, Dat Ha Vu Thanh, and Hai Van Pham
- Subjects
- generative AI, language comprehension, multilingual language models, large language models, support systems, technological determinism, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
ChatGPT plays significant roles in the third decade of the 21st century. Smart city applications can be integrated with ChatGPT in various fields. This research proposes an approach for developing large language models using generative artificial intelligence models suitable for small- and medium-sized enterprises with limited hardware resources. There are many generative AI systems in operation and in development. However, the technological, human, and financial resources required to develop generative AI systems are impractical for small- and medium-sized enterprises. In this study, we present a proposed approach to reduce training time and computational cost that is designed to automate question–response interactions for specific domains in smart cities. The proposed model utilises the BLOOM approach as its backbone for using generative AI to maximise the effectiveness of small- and medium-sized enterprises. We have conducted a set of experiments on several datasets associated with specific domains to validate the effectiveness of the proposed model. Experiments using datasets for the English and Vietnamese languages have been combined with model training using low-rank adaptation to reduce training time and computational cost. In comparative experimental testing, the proposed model outperformed the ‘Phoenix’ multilingual chatbot model, achieving 92% of ‘ChatGPT’ performance on the English benchmark.
- Published
- 2024
- Full Text
- View/download PDF
11. ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis.
- Author
- de la Rosa, Javier, Pérez Pozo, Álvaro, Ros, Salvador, and González Blanco, Elena
- Subjects
- LANGUAGE models, NATURAL language processing, POETRY (Literary form), GERMAN language, ELECTRIC transformers
- Published
- 2023
- Full Text
- View/download PDF
12. It’s All in the Name: Entity Typing Using Multilingual Language Models
- Author
- Biswas, Russa, Chen, Yiyi, Paulheim, Heiko, Sack, Harald, Alam, Mehwish, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Groth, Paul, editor, Rula, Anisa, editor, Schneider, Jodi, editor, Tiddi, Ilaria, editor, Simperl, Elena, editor, Alexopoulos, Panos, editor, Hoekstra, Rinke, editor, Alam, Mehwish, editor, Dimou, Anastasia, editor, and Tamper, Minna, editor
- Published
- 2022
- Full Text
- View/download PDF
13. adaptMLLM: Fine-Tuning Multilingual Language Models on Low-Resource Languages with Integrated LLM Playgrounds
- Author
- Séamus Lankford, Haithem Afli, and Andy Way
- Subjects
- MLLMs, LLMs, multilingual language models, large language models, low-resource languages, neural machine translation, Information technology, T58.5-58.64
- Abstract
The advent of Multilingual Language Models (MLLMs) and Large Language Models (LLMs) has spawned innovation in many areas of natural language processing. Despite the exciting potential of this technology, its impact on developing high-quality Machine Translation (MT) outputs for low-resource languages remains relatively under-explored. Furthermore, an open-source application dedicated to both fine-tuning MLLMs and managing the complete MT workflow for low-resource languages remains unavailable. We aim to address these imbalances through the development of adaptMLLM, which streamlines all processes involved in the fine-tuning of MLLMs for MT. This open-source application is tailored for developers, translators, and users who are engaged in MT. It is particularly useful for newcomers to the field, as it significantly streamlines the configuration of the development environment. An intuitive interface allows for easy customisation of hyperparameters, and the application offers a range of metrics for model evaluation and the capability to deploy models as a translation service directly within the application. As a multilingual tool, we used adaptMLLM to fine-tune models for two low-resource language pairs: English to Irish (EN ↔ GA) and English to Marathi (EN ↔ MR). Compared with baselines from the LoResMT2021 Shared Task, the adaptMLLM system demonstrated significant improvements. In the EN → GA direction, an improvement of 5.2 BLEU points was observed, and an increase of 40.5 BLEU points was recorded in the GA → EN direction, representing relative improvements of 14% and 117%, respectively. Significant improvements in the translation performance of the EN ↔ MR pair were also observed, notably in the MR → EN direction, with an increase of 21.3 BLEU points, which corresponds to a relative improvement of 68%. Finally, a fine-grained human evaluation of the MLLM output on the EN → GA pair was conducted using the Multidimensional Quality Metrics and Scalar Quality Metrics error taxonomies. The application and models are freely available.
- Published
- 2023
- Full Text
- View/download PDF
14. Automated coding of student chats, a trans-topic and language approach
- Author
- Adelson de Araujo, Pantelis M. Papadopoulos, Susan McKenney, and Ton de Jong
- Subjects
- Automatic Content analysis, Collaborative learning, Machine learning, Multilingual language models, Transferability, Electronic computers. Computer science, QA75.5-76.95
- Abstract
Computer-Supported Collaborative Learning (CSCL) is known to be productive if well structured. In CSCL, students construct knowledge by performing learning tasks while communicating about their work. This communication is most often done through online written chats. Understanding what is happening in chats is important from both research and practical perspectives. From a research perspective, insight into chat content offers a window into student interaction and learning. From a more practical standpoint, insight into chat content can (potentially) be used to trigger supportive elements in CSCL environments (e.g., context-sensitive tips or conversational agents). The latter requires real-time, and therefore automated, analysis of the chats. Such an automated analysis is also helpful from the research perspective, since hand-coding of chats is a very time- and labour-consuming activity. In this article, we propose a new machine learning-based system for automated coding of student chats, which we labelled ConSent. The core of ConSent is an algorithm that uses contextual information and sentence encoding to produce a reliable estimation of chat message content (i.e. code). To optimize usability, ConSent was designed in such a way that it can cover various topics and various languages. To evaluate our approach, we used two sets of chats coming from different topics (within the domain of physics) and different languages (Dutch and Portuguese). We tested different algorithm configurations, including two multilingual sentence encoders, to find the model that yields the best reliability. As a result, analysis revealed that ConSent models can perform with substantial reliability levels and can transfer to reliable coding of chats on a similar topic in a different language. Finally, we discuss how ConSent can form the basis for a conversational agent, we explain the limitations of our approach, and we indicate possible paths for future work to contribute towards reliable and transferable models.
- Published
- 2023
- Full Text
- View/download PDF
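The core recipe the record above describes, a multilingual sentence encoder plus contextual features feeding a plain classifier, fits in a few lines. The encoder checkpoint, the two-message context window, the toy chats, and the logistic-regression classifier below are all illustrative assumptions, not ConSent's published configuration.

```python
# A minimal sketch of a ConSent-style chat coder: embed each message with
# a multilingual sentence encoder, append mean-pooled context embeddings,
# and train a simple classifier over the codes. All choices are
# illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def featurize(messages: list[str], context: int = 2) -> np.ndarray:
    """Concatenate each message's embedding with its recent context."""
    emb = encoder.encode(messages)
    rows = []
    for i in range(len(messages)):
        ctx = (emb[max(0, i - context):i].mean(axis=0)
               if i else np.zeros_like(emb[0]))
        rows.append(np.concatenate([emb[i], ctx]))
    return np.array(rows)

chats = ["ik denk dat de massa gelijk blijft",   # Dutch
         "concordo!",                             # Portuguese
         "wat is de volgende stap?"]              # Dutch
codes = ["claim", "agreement", "coordination"]    # hand-coded labels
clf = LogisticRegression(max_iter=1000).fit(featurize(chats), codes)
```

Because the encoder maps all supported languages into one vector space, the same trained classifier can, in principle, be applied to chats in a language it never saw during training, which is the transferability the article evaluates.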
15. Exploring source languages for Faroese in single-source and multi-source transfer learning using language-specific and multilingual language models
- Author
- Fischer, Kristóf
- Abstract
Cross-lingual transfer learning has been the driving force of low-resource natural language processing in recent years, relying on massively multilingual language models with hopes of solving the data scarcity issue for languages with a limited digital presence. However, this "one-size-fits-all" approach is not equally applicable to all low-resource languages, suggesting limitations of such models in cross-lingual transfer. Besides, known similarities and phylogenetic relationships between source and target languages are often overlooked. In this work, the emphasis is placed on Faroese, a low-resource North Germanic language with several closely related resource-rich sibling languages. The cross-lingual transfer potential from these strong Scandinavian source candidates, as well as from additional genetically related, geographically proximate, and syntactically similar source languages, is studied in single-source and multi-source experiments, in terms of Faroese syntactic parsing and part-of-speech tagging. In addition, the effect of task-specific fine-tuning on monolingual, linguistically informed smaller multilingual, and massively multilingual pre-trained language models is explored. The results suggest Icelandic as a strong source candidate; however, this holds only when fine-tuning a monolingual model. With multilingual models, task-specific fine-tuning in Norwegian and Swedish seems even more beneficial. Although they do not surpass fully Scandinavian fine-tuning, models trained on genetically related and syntactically similar languages produce good results. Additionally, the findings indicate that multilingual models outperform models pre-trained on a single language, and that even better results can be achieved using a smaller, linguistically informed model, compared to a massively multilingual one.
- Published
- 2024
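The single-source setup in the record above amounts to fine-tuning on a related source language and evaluating zero-shot on the target. Here is a rough sketch under stated assumptions: `icelandic_pos` and `faroese_pos` are hypothetical, already-tokenized and label-aligned datasets, and the checkpoint and label count (17 Universal Dependencies POS tags) are illustrative, not the thesis's exact configuration.

```python
# A rough sketch of single-source cross-lingual transfer for POS tagging:
# train on a source language, evaluate zero-shot on Faroese.
# icelandic_pos and faroese_pos are hypothetical preprocessed datasets.
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

checkpoint = "bert-base-multilingual-cased"
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=17)            # 17 UD part-of-speech tags

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pos-transfer", num_train_epochs=3),
    train_dataset=icelandic_pos,          # hypothetical source-language data
    eval_dataset=faroese_pos,             # hypothetical target-language data
)
trainer.train()
print(trainer.evaluate())                 # zero-shot transfer to Faroese
```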
16. PREDICTING COLLECTIVE VIOLENCE FROM COORDINATED HOSTILE INFORMATION CAMPAIGNS IN SOCIAL MEDIA
- Author
- Mendieta, Milton V., Warren, Timothy C., Yoshida, Ruriko, Defense Analysis (DA), and Operations Research (OR)
- Abstract
The ability to predict conflicts prior to their occurrence can help deter the outbreak of collective violence and avoid human suffering. Existing approaches use statistical and machine learning models, and even social network analysis techniques; however, they are generally confined to long-range predictions in specific regions and are based on only a few languages. Understanding collective violence from signals in multiple or mixed languages in social media remains understudied. In this work, we construct a multilingual language model (MLLM) that can accept input from any language in social media, a model that is language-agnostic in nature. The purpose of this study is twofold. First, it aims to collect a multilingual violence corpus from archived Twitter data using a proposed set of heuristics that account for spatial-temporal features around past and future violent events. And second, it attempts to compare the performance of traditional machine learning classifiers against deep learning MLLMs for predicting message classes linked to past and future occurrences of violent events. Our findings suggest that MLLMs substantially outperform traditional ML models in predictive accuracy. One major contribution of our work is that military commands now have a tool to evaluate and learn the language of violence across all human languages. Finally, we made the data, code, and models publicly available. (Outstanding Thesis. Commander, Ecuadorian Navy. Approved for public release; distribution is unlimited.)
- Published
- 2023
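For contrast with the MLLM side of the comparison above, here is a minimal sketch of the kind of traditional-ML baseline the thesis evaluates: TF-IDF features with a linear classifier over labeled tweets. The toy tweets, labels, and pipeline are illustrative assumptions.

```python
# A minimal sketch of a traditional-ML baseline for violence-related
# tweet classification. Data, labels, and pipeline are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["march to the square tonight", "calma, todo tranquilo",
          "ils arrivent armés", "great weather today"]
labels = ["pre-violence", "neutral", "pre-violence", "neutral"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
baseline.fit(tweets, labels)
print(baseline.predict(["they are coming armed"]))
```

A bag-of-words baseline like this cannot share signal across languages, since each language's vocabulary gets disjoint features; that gap is exactly what a language-agnostic MLLM is built to close.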
17. Effective and Efficient Search Across Languages
- Author
- Nair, Suraj
- Abstract
In the digital era, the abundance of text content in multiple languages has created a need to develop search systems to meet the diverse information needs of users. Cross-Language Information Retrieval (CLIR) plays an essential role in overcoming language barriers, allowing users to retrieve content in a language that differs from their query language. However, a challenge in designing retrieval systems lies in balancing their effectiveness, which reflects the quality of the ranked outputs, with their efficiency, which encompasses document processing latency at indexing time (indexing latency) and content retrieval latency at query time (query latency). This dissertation focuses on designing neural CLIR systems that offer a Pareto-optimal balance between the competing objectives of effectiveness and efficiency. While neural ranking models that rely on query-document term interactions, such as cross-encoder models, are highly effective, they are computationally prohibitive for processing large document collections in response to every query. One solution is to build a cascaded pipeline of multiple ranking stages, where a first-stage retrieval system generates a set of documents, which is then reranked by the cross-encoder. Ensuring that the first-stage retrieval system produces an accurate and rapid triage of large document collections is crucial for the success of the cascaded pipeline. This dissertation introduces BLADE, a first-stage system that strikes a better balance between retrieval effectiveness and indexing/query latency on the Pareto frontier by leveraging traditional inverted indexes. Once a smaller set of documents is generated, less efficient techniques can be applied to the output from the first stage. In addition, this dissertation introduces ColBERT-X, the best-known second-stage technique in terms of the balance between retrieval effectiveness and indexing latency on the Pareto frontier. To further tackle the efficiency challenges of cross-encoders
- Published
- 2023
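The cascaded pipeline described above reads naturally as two stages: a cheap triage followed by an expensive rerank. Below is a minimal sketch under stated assumptions: `sparse_retrieve` is a hypothetical stand-in for a BLADE-style inverted-index first stage, and the multilingual MS MARCO cross-encoder checkpoint is an illustrative choice, not the dissertation's ColBERT-X model.

```python
# A minimal sketch of a two-stage cascaded retrieval pipeline.
# sparse_retrieve is a hypothetical first-stage triage function.
from sentence_transformers import CrossEncoder

# Stage 2: an expensive query-document scorer, applied only to survivors.
reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def search(query: str, collection: list[str], k: int = 100, top: int = 10):
    # Stage 1 (hypothetical): cheap inverted-index triage of the collection.
    candidates = sparse_retrieve(query, collection, k)
    # Stage 2: rerank the shortlist with full query-document interactions.
    scores = reranker.predict([(query, doc) for doc in candidates])
    return sorted(zip(scores, candidates), reverse=True)[:top]
```

The efficiency argument is that the cross-encoder's cost is paid on k candidates rather than the whole collection, so the first stage's accuracy bounds the whole pipeline's effectiveness.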
18. Alberti, a Multilingual Domain Specific Language Model for Poetry Analysis
- Author
- Rosa, Javier de la, Pérez Pozo, Álvaro, Ros, Salvador, and González-Blanco García, Elena
- Abstract
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English and German. In both cases, Alberti outperforms multilingual BERT and other transformers-based models of similar sizes, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain.
- Published
- 2023
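The domain-specific pre-training described above is continued masked-language-model training of multilingual BERT on verse data. Here is a minimal sketch; the verses.txt file name, the default training arguments, and the 15% masking rate are illustrative assumptions rather than Alberti's reported setup.

```python
# A minimal sketch of domain-specific pre-training (DSP): continue
# multilingual BERT's masked-LM training on a verse corpus.
# File name and training arguments are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

verses = load_dataset("text", data_files="verses.txt")  # one verse per line
tokenized = verses.map(lambda x: tokenizer(x["text"], truncation=True),
                       batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alberti-dsp"),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
```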
19. ALBERTI, a Multilingual Domain Specific Language Model for Poetry Analysis
- Author
- Javier de la Rosa, Álvaro Pérez Pozo, Salvador Ros, and Elena González-Blanco
- Subjects
- FOS: Computer and information sciences, Stanzas, Computer Science - Computation and Language, Multilingual Language Models, Poetry, Computation and Language (cs.CL), Scansion, Natural Language Processing
- Abstract
The computational analysis of poetry is limited by the scarcity of tools to automatically analyze and scan poems. In multilingual settings, the problem is exacerbated as scansion and rhyme systems only exist for individual languages, making comparative studies very challenging and time-consuming. In this work, we present Alberti, the first multilingual pre-trained large language model for poetry. Through domain-specific pre-training (DSP), we further trained multilingual BERT on a corpus of over 12 million verses from 12 languages. We evaluated its performance on two structural poetry tasks: Spanish stanza type classification, and metrical pattern prediction for Spanish, English and German. In both cases, Alberti outperforms multilingual BERT and other transformers-based models of similar sizes, and even achieves state-of-the-art results for German when compared to rule-based systems, demonstrating the feasibility and effectiveness of DSP in the poetry domain. (Accepted for publication at SEPLN 2023: 39th International Conference of the Spanish Society for Natural Language Processing.)
- Published
- 2023
20. PREDICTING COLLECTIVE VIOLENCE FROM COORDINATED HOSTILE INFORMATION CAMPAIGNS IN SOCIAL MEDIA
- Author
- Mendieta, Milton V., Warren, Timothy C., Yoshida, Ruriko, and Defense Analysis (DA), Operations Research (OR)
- Subjects
- social media, violence prediction, multilingual language models, deep learning, Twitter, hostile information campaigns, NLP
- Abstract
The ability to predict conflicts prior to their occurrence can help deter the outbreak of collective violence and avoid human suffering. Existing approaches use statistical and machine learning models, and even social network analysis techniques; however, they are generally confined to long-range predictions in specific regions and are based on only a few languages. Understanding collective violence from signals in multiple or mixed languages in social media remains understudied. In this work, we construct a multilingual language model (MLLM) that can accept input from any language in social media, a model that is language-agnostic in nature. The purpose of this study is twofold. First, it aims to collect a multilingual violence corpus from archived Twitter data using a proposed set of heuristics that account for spatial-temporal features around past and future violent events. And second, it attempts to compare the performance of traditional machine learning classifiers against deep learning MLLMs for predicting message classes linked to past and future occurrences of violent events. Our findings suggest that MLLMs substantially outperform traditional ML models in predictive accuracy. One major contribution of our work is that military commands now have a tool to evaluate and learn the language of violence across all human languages. Finally, we made the data, code, and models publicly available. (Outstanding Thesis. Commander, Ecuadorian Navy. Approved for public release; distribution is unlimited.)
- Published
- 2022
21. Multilingual information retrieval in the language modeling framework.
- Author
- Rahimi, Razieh, Shakery, Azadeh, and King, Irwin
- Subjects
- CROSS-language information retrieval, PROGRAMMING languages, COMPUTATIONAL complexity, MEMORY, COMPUTER performance
- Abstract
Multilingual information retrieval (MLIR) provides results that are more comprehensive than those of mono- and cross-lingual retrieval. Methods for MLIR are categorized as: (1) fusion-based methods that merge results from multiple retrieval runs, and (2) direct methods that build a unique index for the entire collection. Merging results of individual runs reduces the overall effectiveness, while more effective direct methods suffer either from time complexity and memory overhead or from over-weighting of index terms. In this paper, we propose a direct MLIR approach using the language modeling framework that includes a novel multilingual language model estimation for documents and a new way to globally estimate word statistics. These contributions enable ranking documents in multiple languages in one retrieval phase without the problems of the previous direct methods. Moreover, our approach has the advantage of accommodating multilingual feedback information, which helps to prevent query drift and consequently improves performance. Finally, we effectively address the common case of incomplete coverage of translation resources in our proposed estimation methods. Experimental results show that the proposed approach outperforms the previous MLIR approaches. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
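For intuition about the framework in the record above, here is a toy query-likelihood scorer: each document is ranked by the probability its smoothed unigram model assigns to the query, with collection-level counts standing in for the globally estimated word statistics the paper proposes. Jelinek-Mercer smoothing and the toy bilingual data are illustrative assumptions, not the paper's estimation method.

```python
# A toy query-likelihood scorer in the language modeling framework.
# Smoothing choice and data are illustrative assumptions.
import math
from collections import Counter

def score(query: list[str], doc: list[str], collection: Counter,
          total: int, lam: float = 0.7) -> float:
    tf = Counter(doc)
    s = 0.0
    for w in query:
        p_doc = tf[w] / len(doc)            # document language model
        p_coll = collection[w] / total      # global word statistics
        s += math.log(lam * p_doc + (1 - lam) * p_coll + 1e-12)
    return s

docs = [["machine", "translation"], ["traduction", "automatique"]]
collection = Counter(w for d in docs for w in d)
total = sum(collection.values())
for d in docs:                              # rank both languages in one pass
    print(d, score(["translation"], d, collection, total))
```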