2,234 results for "Language Modeling"
Search Results
2. Unlock Dramatically Improved Search Relevance with Cutting-Edge Language Modeling Techniques
- Subjects
Internet/Web search services -- Models -- Methods -- Reports, Natural language interfaces -- Methods -- Reports -- Models, Electronic marketing -- Methods -- Models -- Reports, Computational linguistics -- Methods -- Reports -- Models, Language processing -- Methods -- Reports -- Models, Text search and retrieval software, Internet search software, Internet/Web search service, Banking, finance and accounting industries, Business - Abstract
Fort Worth, Oct. 13, 2023 (GLOBE NEWSWIRE) -- Fort Worth, Texas - Jordan P. Fowler, SEO and founder of full-service digital marketing agency Moon and Owl, has published a new [...]
- Published
- 2023
3. New Experimental Biology and Medicine Findings Have Been Reported by Investigators at Food and Drug Administration (RxBERT: Enhancing Drug Labeling Text Mining and Analysis With AI Language Modeling)
- Subjects
United States. Food and Drug Administration -- Analysis, Biological research -- Analysis -- Models, Biology, Experimental -- Analysis -- Models, Natural language interfaces -- Models -- Analysis, Computational linguistics -- Models -- Analysis, Language processing -- Models -- Analysis, Data mining -- Models -- Analysis, Data warehousing/data mining, Business, Government, Political science - Abstract
2024 FEB 15 (VerticalNews) -- By a News Reporter-Staff News Editor at Politics & Government Business -- Current study results on Life Sciences - Experimental Biology and Medicine have been [...]
- Published
- 2024
4. A language modeling-like approach to sketching
- Author
-
Lisa Graziani, Stefano Melacci, and Marco Gori
- Subjects
Neural Networks ,Computer science ,Cognitive Neuroscience ,computer.software_genre ,Computer ,Artificial Intelligence ,Humans ,Set (psychology) ,Recurrent Neural Networks ,Probability ,Language ,Artificial neural network ,business.industry ,Statistical model ,Language Modeling ,Sketch generation ,Neural Networks, Computer ,Sketch ,Recurrent neural network ,Beam search ,Artificial intelligence ,Language model ,business ,computer ,Natural language processing ,Natural language - Abstract
Sketching is a universal communication tool that, despite its simplicity, is able to efficiently express a large variety of concepts and, in some limited contexts, can be even more immediate and effective than natural language. In this paper we explore the feasibility of using neural networks to approach sketching in the same way they are commonly used in language modeling. We propose a novel approach to what we refer to as “Sketch Modeling”, in which a neural network is exploited to learn a probabilistic model that estimates the probability of sketches. We focus on simple sketches and, in particular, on the case in which sketches are represented as sequences of segments. Segments and sequences can be either given – when the sketches are originally drawn in this format – or automatically generated from the input drawing by means of a procedure that we designed to create short sequences, loosely inspired by human behavior. A recurrent neural network is used to learn the sketch model and, afterward, the network is seeded with an incomplete sketch that it is asked to complete, generating one segment at each time step. We propose a set of measures to evaluate the outcome of a beam-search-based generation procedure (a generic beam-search sketch follows this entry), showing how they can be used to identify the most promising generations. Our experimental analysis assesses the feasibility of this way of modeling sketches, including the case in which several different categories of sketches are considered.
- Published
- 2021
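As a generic illustration of the beam-search generation step evaluated in the entry above, the following Python sketch runs beam search over a toy, hand-specified next-token scorer. The scorer, vocabulary, and probabilities are hypothetical stand-ins, not the authors' trained sketch model.

```python
# Generic beam search over a toy next-token scorer (illustrative only).
import math

def toy_next_scores(prefix):
    # Hypothetical conditional log-probabilities; a real model would condition on `prefix`.
    return {"a": math.log(0.5), "b": math.log(0.3), "<end>": math.log(0.2)}

def beam_search(beam_width=2, max_len=4):
    beams = [([], 0.0)]                      # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == "<end>":   # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, logp in toy_next_scores(seq).items():
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for seq, score in beam_search():
    print(seq, round(score, 3))
```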
5. Improving Trigram Language Modeling with the World Wide Web
- Author
-
Roni Rosenfeld and Xiaojin Zhu
- Subjects
FOS: Computer and information sciences ,Information retrieval ,Phrase ,Computer science ,business.industry ,Word error rate ,computer.software_genre ,World Wide Web ,Trigram tagger ,Test set ,Web page ,89999 Information and Computing Sciences not elsewhere classified ,Trigram ,Language model ,Artificial intelligence ,business ,computer ,80107 Natural Language Processing ,Natural language ,Natural language processing - Abstract
We propose a novel method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to web search engines. The search engines return the number of web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates (a minimal interpolation sketch follows this entry). We show that the interpolated models improve speech recognition word error rate significantly over a small test set.
- Published
- 2023
- Full Text
- View/download PDF
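A minimal sketch of the interpolation idea in the entry above: estimate a trigram probability from hypothetical web page-hit counts and blend it linearly with a corpus-based estimate. All counts, probabilities, and the interpolation weight below are invented for illustration.

```python
def web_trigram_prob(count_w1w2w3: int, count_w1w2: int) -> float:
    """Approximate P(w3 | w1, w2) from web phrase-query page counts."""
    return count_w1w2w3 / count_w1w2 if count_w1w2 > 0 else 0.0

def interpolate(p_corpus: float, p_web: float, lam: float = 0.7) -> float:
    """Linearly interpolate corpus-based and web-based trigram estimates."""
    return lam * p_corpus + (1.0 - lam) * p_web

# Hypothetical page counts for the phrases "language model training" and "language model".
p_web = web_trigram_prob(count_w1w2w3=120_000, count_w1w2=4_500_000)
print(interpolate(p_corpus=0.012, p_web=p_web))
```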
6. FedMed: A Federated Learning Framework for Language Modeling
- Author
-
Jianjia Wang, Xing Wu, and Zhaowang Liang
- Subjects
Scheme (programming language) ,Information privacy ,language modeling ,Perplexity ,Phrase ,Computer science ,Treebank ,02 engineering and technology ,Machine learning ,computer.software_genre ,lcsh:Chemical technology ,Biochemistry ,Article ,Analytical Chemistry ,law.invention ,law ,0202 electrical engineering, electronic engineering, information engineering ,lcsh:TP1-1185 ,Electrical and Electronic Engineering ,Instrumentation ,Virtual keyboard ,computer.programming_language ,federated learning ,business.industry ,020206 networking & telecommunications ,Atomic and Molecular Physics, and Optics ,topK ranking ,communication efficiency ,020201 artificial intelligence & image processing ,Applications of artificial intelligence ,Artificial intelligence ,Language model ,business ,Mobile device ,computer - Abstract
Federated learning (FL) is a privacy-preserving technique for training on vast amounts of decentralized data and making inferences on mobile devices. As a typical language modeling problem, mobile keyboard prediction aims at suggesting a probable next word or phrase and facilitating human-machine interaction in the virtual keyboard of a smartphone or laptop. Mobile keyboard prediction with FL hopes to satisfy the growing demand that high-level data privacy be preserved in artificial intelligence applications even with distributed model training. However, there are two major problems in federated optimization for the prediction task: (1) aggregating model parameters on the server side and (2) reducing the communication costs caused by model weight collection. Traditional FL methods simply use averaging aggregation or ignore communication costs, leaving these issues unaddressed. We propose a novel Federated Mediation (FedMed) framework with adaptive aggregation, a mediation incentive scheme, and a topK strategy to address model aggregation and communication costs (a minimal sketch of averaged, top-K-sparsified updates follows this entry). Performance is evaluated in terms of perplexity and communication rounds. Experiments are conducted on three datasets (i.e., Penn Treebank, WikiText-2, and Yelp) and the results demonstrate that our FedMed framework achieves robust performance and outperforms baseline approaches.
- Published
- 2020
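The following is a minimal sketch, not the FedMed implementation: plain averaging of client weight updates combined with a simple top-K sparsification step to cut communication, illustrating the two concerns named in the abstract above. Array sizes and client updates are synthetic.

```python
import numpy as np

def top_k_sparsify(delta: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude entries of a client update."""
    sparse = np.zeros_like(delta)
    keep = np.argsort(np.abs(delta))[-k:]
    sparse[keep] = delta[keep]
    return sparse

def aggregate(global_weights: np.ndarray, client_deltas: list, k: int) -> np.ndarray:
    """Average the sparsified client updates and apply them to the global model."""
    sparsified = [top_k_sparsify(d, k) for d in client_deltas]
    return global_weights + np.mean(sparsified, axis=0)

rng = np.random.default_rng(0)
weights = np.zeros(10)
deltas = [rng.normal(size=10) for _ in range(3)]   # synthetic client updates
print(aggregate(weights, deltas, k=3))
```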
7. Effects of a Coaching Intervention on Paraeducator Use of Aided Language Modeling in Classroom Settings: A Pilot Investigation
- Author
-
Prema Polit, Elena Dukhovny, and Shubha Kashinath
- Subjects
Speech and Hearing ,Linguistics and Language ,Medical education ,Augmentative and alternative communication ,business.industry ,Intervention (counseling) ,Language model ,Psychology ,business ,Coaching - Abstract
Paraeducators are the most frequent communication partners during the school day for students who use augmentative and alternative communication (AAC), yet they often lack training in AAC best practices. This intervention study examined the effect of an in-class coaching intervention on the aided language modeling (ALM) skills of paraeducators who work with students who use AAC. An intervention protocol using evidence-based coaching strategies was used to support paraeducator implementation of ALM in typical classroom activities. The multiple-baseline single-subject design measured the use of ALM by four paraeducators. Data were analyzed visually and by calculating Tau-U and gain scores. Results suggest a strong effect from the coaching intervention on ALM skills for each of the paraeducators. Challenges and benefits of paraeducator-focused interventions in classroom settings are presented.
- Published
- 2021
- Full Text
- View/download PDF
8. Psychological Human Traits Detection based on Universal Language Modeling
- Author
-
Mahmoud A. Ismail Shoman, Reda A. El-Khoribi, Sherif M. Abdou, and Kamal El-Demerdash
- Subjects
Feature engineering ,Word embedding ,Process (engineering) ,Computer science ,02 engineering and technology ,Management Science and Operations Research ,computer.software_genre ,NLP ,Deep Learning ,Inductive transfer ,0202 electrical engineering, electronic engineering, information engineering ,Big Five personality traits ,Interpretability ,Personality Traits ,business.industry ,Deep learning ,020206 networking & telecommunications ,QA75.5-76.95 ,Computer Science Applications ,Big Five Personality Model ,Electronic computers. Computer science ,020201 artificial intelligence & image processing ,Artificial intelligence ,Text Analytics ,business ,Transfer of learning ,LSTM ,computer ,Natural language processing ,Information Systems - Abstract
Personality traits detection is an important text analytics task in Natural Language Processing (NLP). Text analytics is the process of extracting insight from written text. Although most deep learning models give high performance, they often lack interpretability. Computer Vision (CV) has benefited significantly from inductive transfer learning; however, training from scratch and task-specific modifications are still required in many NLP techniques. This paper addresses the problem of personality traits classification. We adopt the Universal Language Model Fine-Tuning (ULMFiT) approach for personality traits detection. The model makes use of transfer learning rather than the classical shallow methods of word embedding, and transfer learning has proved to be powerful in many NLP problems. The basic advantage of using this model is that there is no need to do feature engineering before classification. When applied to a benchmark dataset, the proposed method shows an accuracy improvement of about 1% compared to the state-of-the-art results for the big five personality traits.
- Published
- 2021
9. Enhancing E-Business Communication with a Hybrid Rule-Based and Extractive-Based Chatbot
- Author
-
Onur Dogan and Omer Faruk Gurcan
- Subjects
AI in e-business ,chatbot ,large language modeling ,customer satisfaction ,service quality ,Business ,HF5001-6182 - Abstract
E-businesses often face challenges related to customer service and communication, leading to increased dissatisfaction among customers and potential damage to the brand. To address these challenges, data-driven and AI-based approaches have emerged, including predictive analytics for optimizing customer interactions and chatbots powered by AI and NLP technologies. This study focuses on developing a hybrid rule-based and extractive-based chatbot for e-business, which can handle both routine and complex inquiries, ensuring quick and accurate responses to improve communication problems. The rule-based QA method used in the chatbot demonstrated high precision and accuracy in providing answers to user queries. The rule-based approach achieved impressive 98% accuracy and 97% precision rates among 1684 queries. The extractive-based approach received positive feedback, with 91% of users rating it as “good” or “excellent” and an average user satisfaction score of 4.38. General user satisfaction was notably high, with an average Likert score of 4.29, and 54% of participants gave the highest score of 5. Communication time was significantly improved, as the chatbot reduced average response times to 41 s, compared to the previous 20-min average for inquiries.
- Published
- 2024
- Full Text
- View/download PDF
10. Deep dynamic neural networks for temporal language modeling in author communities
- Author
-
Edouard Delasalles, Sylvain Lamprier, and Ludovic Denoyer
- Subjects
Artificial neural network ,Computer science ,business.industry ,Statistical model ,Latent variable ,Space (commercial competition) ,computer.software_genre ,Variety (linguistics) ,Human-Computer Interaction ,Artificial Intelligence ,Hardware and Architecture ,Leverage (statistics) ,Language model ,Artificial intelligence ,business ,computer ,Software ,Natural language processing ,Word (computer architecture) ,Information Systems - Abstract
Language models are at the heart of numerous works, notably in the text mining and information retrieval communities. These statistical models aim at extracting word distributions, from simple unigram models to recurrent approaches with latent variables that capture subtle dependencies in texts. However, those models are learned from word sequences only, and authors’ identities, as well as publication dates, are seldom considered. We propose a neural model, based on recurrent language modeling (e.g., LSTM), which aims at capturing language diffusion tendencies in author communities through time. By conditioning language models on author and dynamic temporal vector states, we are able to leverage the latent dependencies between text contexts. The model captures the language evolution of authors via a shared temporal prediction function in a latent space, which allows it to handle a variety of modeling tasks, including completion and prediction of language models through time. Experiments show the performance of the approach compared to several temporal and non-temporal language baselines on two real-world corpora.
- Published
- 2021
- Full Text
- View/download PDF
11. Unsupervised stemmed text corpus for language modeling and transcription of Telugu broadcast news
- Author
-
Laxminarayana Parayitam, Venkataramana Appala, and Mythilisharan Pala
- Subjects
Hindi ,Text corpus ,Linguistics and Language ,Root (linguistics) ,business.industry ,Computer science ,Supervised learning ,computer.software_genre ,Language and Linguistics ,language.human_language ,Telugu ,Human-Computer Interaction ,ComputingMethodologies_PATTERNRECOGNITION ,Transcription (linguistics) ,language ,Computer Vision and Pattern Recognition ,Language model ,Artificial intelligence ,business ,computer ,Software ,Natural language processing ,Smoothing - Abstract
In Indian languages, root words are either combined or modified to match the context with respect to tense, number, and/or gender. As a result, the number of unique words increases compared to many European languages. Whatever the size of the text corpus used for language modeling, it cannot contain all possible inflected words. A word that occurs during testing but not in the training data is called an out-of-vocabulary (OOV) word. Similarly, the text corpus cannot contain all possible sequences of words. Due to this data sparsity, an automatic speech recognition (ASR) system cannot accommodate all the words of the language in its language model, irrespective of the size of the text corpus. It also becomes computationally challenging if the volume of data increases exponentially due to morphological changes to the root word. To reduce the OOVs in the language model, a new unsupervised stemming method is proposed in this paper for one Indian language, Telugu, based on a method proposed for Hindi (a toy suffix-stripping sketch follows this entry). Other issues in language modeling for Telugu, using techniques such as smoothing and interpolation with supervised and unsupervised stemming data, are also analyzed. It is observed that the Witten–Bell and Kneser–Ney smoothing techniques perform well compared to other techniques on pre-processed data with supervised learning. The ASR accuracy is improved by 0.76% and 0.94% with supervised and unsupervised stemming, respectively.
- Published
- 2020
- Full Text
- View/download PDF
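A toy suffix-stripping sketch of the general stemming idea in the entry above (not the paper's unsupervised method): collapsing inflected forms onto shared stems shrinks the vocabulary and therefore the OOV rate. The suffix list and word forms below are invented for illustration.

```python
from collections import Counter

SUFFIXES = ["lu", "ki", "lo", "nu"]   # hypothetical inflectional endings

def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["pustakalu", "pustakam", "intiki", "intilo", "inti"]
print(Counter(stem(w) for w in words))   # fewer unique items than raw word forms
```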
12. Context, Language Modeling, and Multimodal Data in Finance
- Author
-
Shuai Zheng, Sheng Zha, Rob van Dusen, Nagpurnanand Prabhala, George Karypis, Dylan Slack, John He, Krishnamurthy Sandeep, Shenghua Yue, Connor Goggins, Mitali Mahajan, and Sanjiv Ranjan Das
- Subjects
Finance ,Class (computer programming) ,Information Systems and Management ,business.industry ,Computer science ,Strategy and Management ,Big data ,Context (language use) ,Commission ,Readability ,Credit rating ,Computational Theory and Mathematics ,Artificial Intelligence ,Business, Management and Accounting (miscellaneous) ,Language model ,Business and International Management ,business ,Encoder ,Information Systems - Abstract
The authors enhance pretrained language models with Securities and Exchange Commission filings data to create better language representations for features used in a predictive model. Specifically, they train RoBERTa class models with additional financial regulatory text, which they denote as a class of RoBERTa-Fin models. Using different datasets, the authors assess whether there is material improvement over models that use only text-based numerical features (e.g., sentiment, readability, polarity), which is the traditional approach adopted in academia and practice. The RoBERTa-Fin models also outperform generic bidirectional encoder representations from transformers (BERT) class models that are not trained with financial text. The improvement in classification accuracy is material, suggesting that full text and context are important in classifying financial documents and that the benefits from the use of mixed data (i.e., enhancing numerical tabular data with text) are feasible and fruitful in machine learning models in finance.
TOPICS: Quantitative methods, big data/machine learning, legal/regulatory/public policy, information providers/credit ratings
Key Findings:
- Machine learning based on multimodal data provides meaningful improvement over models based on numerical data alone.
- Context-rich models perform better than context-free models.
- Pretrained language models that mix common text and financial text do better than those pretrained on financial text alone.
- Published
- 2021
- Full Text
- View/download PDF
13. Bidirectional Language Modeling: A Systematic Literature Review
- Author
-
Anam Amjad, Habib Ullah Khan, Shahzad Akbar, Sarah Gul, Muhammad Umar Farooq, and Muhammad Shah Jahan
- Subjects
Dependency (UML) ,Computer science ,Computational linguistics ,Context (language use) ,02 engineering and technology ,Machine learning ,computer.software_genre ,QA76.75-76.765 ,03 medical and health sciences ,0202 electrical engineering, electronic engineering, information engineering ,Computer software ,Representation (mathematics) ,030304 developmental biology ,Transformer (machine learning model) ,0303 health sciences ,Feedforward neural networks ,business.industry ,Computer Science Applications ,Recurrent neural network ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language model ,Recurrent neural network (RNN) ,business ,Transfer of learning ,Encoder ,computer ,Software - Abstract
In transfer learning, two major activities, i.e., pretraining and fine-tuning, are carried out to perform downstream tasks. The advent of the transformer architecture and bidirectional language models, e.g., bidirectional encoder representations from transformers (BERT), enables the functionality of transfer learning. Besides, BERT bridges the limitations of unidirectional language models by removing the dependency on the recurrent neural network (RNN). BERT also supports the attention mechanism to read input from either side and understand sentence context better. The performance of downstream tasks in transfer learning depends upon various factors such as dataset size, step size, and the number of selected parameters. In the state of the art, various research studies have produced efficient results by contributing to the pretraining phase. However, a comprehensive investigation and analysis of these research studies is not yet available. Therefore, in this article, a systematic literature review (SLR) is presented investigating thirty-one (31) influential research studies published during 2018–2020. The following contributions are made in this paper: (1) thirty-one (31) models inspired by BERT are extracted. (2) Every model is compared with RoBERTa (a replicated BERT model) trained with a large dataset and batch size but a small step size. It is concluded that seven (7) of the thirty-one (31) models in this SLR outperform RoBERTa, of which three were trained on a larger dataset while the other four were trained on a smaller dataset. Besides, among these seven models, six share both the feedforward network (FFN) and attention across layers. The rest of the twenty-four (24) models are also studied in this SLR with different parameter settings. Furthermore, it is concluded that a pretrained model with a large dataset, hidden layers, attention heads, and a small step size with parameter sharing produces better results. This SLR will help researchers pick a suitable model based on their requirements.
- Published
- 2021
- Full Text
- View/download PDF
14. SpaceTransformers: Language Modeling for Space Systems
- Author
-
Paul Darm, Annalisa Riccardi, and Audrey Berquand
- Subjects
concept recognition ,Word embedding ,General Computer Science ,TL ,Computer science ,Space (commercial competition) ,computer.software_genre ,Field (computer science) ,Ranking (information retrieval) ,Data modeling ,General Materials Science ,space systems ,requirements ,business.industry ,General Engineering ,transformers ,Language model ,TK1-9971 ,Task analysis ,TJ ,Artificial intelligence ,Electrical engineering. Electronics. Nuclear engineering ,Transfer of learning ,business ,computer ,Natural language processing - Abstract
The transformer architecture and transfer learning have radically modified the Natural Language Processing (NLP) landscape, enabling new applications in fields where open-source labelled datasets are scarce. Space systems engineering is a field with limited access to large labelled corpora and a need for enhanced knowledge reuse of accumulated design data. Transformer models such as the Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimised BERT Pretraining Approach (RoBERTa) are, however, trained on general corpora. To answer the need for domain-specific contextualised word embeddings in the space field, we propose SpaceTransformers, a novel family of three models, SpaceBERT, SpaceRoBERTa and SpaceSciBERT, further pre-trained from BERT, RoBERTa and SciBERT, respectively, on our domain-specific corpus. We collect and label a new dataset of space systems concepts based on space standards. We fine-tune and compare our domain-specific models to their general counterparts on a domain-specific Concept Recognition (CR) task. Our study demonstrates that the models further pre-trained on a space corpus outperform their respective baseline models in the Concept Recognition task, with SpaceRoBERTa achieving a significantly higher ranking overall.
- Published
- 2021
15. Language modeling and bidirectional coders representations: an overview of key technologies
- Author
-
D. I. Kachkou
- Subjects
Computer science ,computer.software_genre ,transformer architecture ,03 medical and health sciences ,model bert ,Text processing ,information technology ,language models ,informatics ,natural language processing ,030304 developmental biology ,Transformer (machine learning model) ,0303 health sciences ,Class (computer programming) ,business.industry ,030302 biochemistry & molecular biology ,Information technology ,QA75.5-76.95 ,Electronic computers. Computer science ,Artificial intelligence ,Language model ,business ,attention mechanism ,Knowledge transfer ,computer ,Strengths and weaknesses ,Natural language ,Natural language processing - Abstract
The article is an essay on the development of the natural language processing technologies that formed the basis of BERT (Bidirectional Encoder Representations from Transformers), a language model from Google that shows high results on a whole class of problems associated with natural language understanding. Two key ideas implemented in BERT are knowledge transfer and the attention mechanism. The model is designed to solve two problems on a large unlabeled dataset and can reuse the identified language patterns for effective learning on a specific text processing problem. The Transformer architecture is based on the attention mechanism, i.e., it involves evaluating relationships between input tokens. In addition, the article notes the strengths and weaknesses of BERT and directions for further model improvement.
- Published
- 2021
16. Natural language modeling with syntactic structure dependency
- Author
-
Yue Sun, Yijia Tan, Zhun Cai, and Kai Shuang
- Subjects
Information Systems and Management ,Perplexity ,Dependency (UML) ,Machine translation ,Computer science ,Treebank ,02 engineering and technology ,computer.software_genre ,Theoretical Computer Science ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,business.industry ,05 social sciences ,050301 education ,Syntax ,Computer Science Applications ,Recurrent neural network ,Control and Systems Engineering ,020201 artificial intelligence & image processing ,Syntactic structure ,Artificial intelligence ,Language model ,business ,0503 education ,computer ,Software ,Sentence ,Natural language processing ,Natural language - Abstract
In natural language, the relationship among the constituents of a sentence is usually tree-like: words, phrases, and clauses constitute a sentence hierarchically, and the dependency between different constituents induces the syntactic structure. Such a complex tree-like structure is vital for understanding natural languages. However, recurrent neural networks (RNNs) model languages sequentially and fail to encode hierarchical syntactic dependencies comprehensively, which causes the networks to underperform on comprehension-based tasks. In this paper, we propose a novel neural language model, called relative syntactic distance LSTM (RSD-LSTM), to capture syntactic structure dependency dynamically. RSD-LSTM employs a convolutional neural network to compute relative syntactic distances within a sentence, representing the degree of dependency between words, and modifies the gating mechanism of the LSTM through the relative syntactic distance (a toy gate-modulation sketch follows this entry). Furthermore, we add a direct connection between hidden states to fuse high-level and low-level syntactic features. We conducted extensive experiments on language modeling. The results suggest that RSD-LSTM achieves improvements of 1.82 and 2.03 in perplexity compared with current top methods on the Penn Treebank and WikiText-2 datasets, respectively. Moreover, we conducted experiments on a machine translation application task. Experimental results on this task also show significant improvements for RSD-LSTM compared with baseline models.
- Published
- 2020
- Full Text
- View/download PDF
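A hedged toy sketch of the general flavor of the mechanism in the entry above: a scalar "syntactic distance" in [0, 1] modulates an LSTM-style forget gate so that syntactically distant material is forgotten more strongly. This illustrates the idea only; it is not the RSD-LSTM formulation.

```python
import torch

def distance_modulated_forget(forget_gate: torch.Tensor, distance: torch.Tensor) -> torch.Tensor:
    """Scale forget-gate activations by (1 - distance): larger distance, stronger forgetting."""
    return forget_gate * (1.0 - distance)

forget = torch.sigmoid(torch.randn(4))            # toy forget-gate activations
distance = torch.tensor([0.1, 0.8, 0.3, 0.5])     # hypothetical relative syntactic distances
print(distance_modulated_forget(forget, distance))
```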
17. Alternating Language Modeling for Cross-Lingual Pre-Training
- Author
-
Jian Yang, Ming Zhou, Dongdong Zhang, Shuming Ma, Shuangzhi Wu, and Zhoujun Li
- Subjects
Cross lingual ,Machine translation ,business.industry ,Computer science ,Translation language ,Concatenation ,Context (language use) ,General Medicine ,computer.software_genre ,Simple (abstract algebra) ,Artificial intelligence ,Language model ,business ,computer ,Natural language processing ,Sentence - Abstract
Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt a Translation Language Model to predict masked words from the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, aiming to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases with target translations to create code-switched sentences (a minimal code-switching sketch follows this entry). Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform previous pre-training methods on three benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
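A minimal sketch of the code-switching step described in the entry above (not the authors' pipeline): source-language phrases are randomly replaced by their target-language translations to produce mixed sentences for pre-training. The tiny phrase table is invented for illustration.

```python
import random

PHRASE_TABLE = {"the cat": "le chat", "sat on": "s'est assis sur", "the mat": "le tapis"}

def code_switch(sentence: str, ratio: float = 0.5, seed: int = 0) -> str:
    """Replace each known source phrase with its translation with probability `ratio`."""
    random.seed(seed)
    switched = sentence
    for source, target in PHRASE_TABLE.items():
        if source in switched and random.random() < ratio:
            switched = switched.replace(source, target)
    return switched

print(code_switch("the cat sat on the mat"))
```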
18. Enhanced Language Modeling with Proximity and Sentence Relatedness Information for Extractive Broadcast News Summarization
- Author
-
Kuan-Yu Chen, Shih-Hung Liu, and Berlin Chen
- Subjects
General Computer Science ,Computer science ,business.industry ,02 engineering and technology ,Construct (python library) ,computer.software_genre ,Automatic summarization ,Ensemble learning ,Task (project management) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Selection (linguistics) ,020201 artificial intelligence & image processing ,Language model ,Artificial intelligence ,Set (psychology) ,business ,computer ,Sentence ,Natural language processing - Abstract
The primary task of extractive summarization is to automatically select a set of representative sentences from a text or spoken document that can concisely express the most important theme of the original document. Recently, language modeling (LM) has been proven to be a promising modeling framework for performing this task in an unsupervised manner. However, there still remain three fundamental challenges facing the existing LM-based methods, which we set out to tackle in this article. The first one is how to construct a more accurate sentence model in this framework without resorting to external sources of information. The second is how to take into account sentence-level structural relationships, in addition to word-level information within a document, for important sentence selection. The last one is how to exploit the proximity cues inherent in sentences to obtain a more accurate estimation of respective sentence models. Specifically, for the first and second challenges, we explore a novel, principled approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized, which can be used not only to enhance the estimation of various sentence models but also to render sentence-level structural relationships within the document, leading to better summarization effectiveness. For the third challenge, we investigate several formulations of proximity cues for use in sentence modeling involved in the LM-based summarization framework, free of the strict bag-of-words assumption. Furthermore, we also present various ensemble methods that seamlessly integrate proximity and sentence relatedness information into sentence modeling. Extensive experiments conducted on a Mandarin broadcast news summarization task show that such integration of proximity and sentence relatedness information is indeed beneficial for speech summarization. Our proposed summarization methods can significantly boost the performance of an LM-based strong baseline (e.g., with a maximum ROUGE-2 improvement of 26.7% relative) and also outperform several state-of-the-art unsupervised methods compared in the article.
- Published
- 2020
- Full Text
- View/download PDF
19. A Sequential and Intensive Weighted Language Modeling Scheme for Multi-Task Learning-Based Natural Language Understanding
- Author
-
Seonjeong Hwang, Sohyeun Bae, Suhyune Son, Jang Hwan Choi, and Soo Jun Park
- Subjects
Scheme (programming language) ,language modeling ,Computer science ,Natural language understanding ,multi-task learning ,Multi-task learning ,02 engineering and technology ,Machine learning ,computer.software_genre ,lcsh:Technology ,supervised learning ,Task (project management) ,lcsh:Chemistry ,0202 electrical engineering, electronic engineering, information engineering ,General Materials Science ,lcsh:QH301-705.5 ,Instrumentation ,computer.programming_language ,Fluid Flow and Transfer Processes ,Artificial neural network ,lcsh:T ,business.industry ,Process Chemistry and Technology ,Supervised learning ,General Engineering ,neural networks ,021001 nanoscience & nanotechnology ,lcsh:QC1-999 ,Computer Science Applications ,Weighting ,lcsh:Biology (General) ,lcsh:QD1-999 ,lcsh:TA1-2040 ,natural language understanding ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language model ,lcsh:Engineering (General). Civil engineering (General) ,0210 nano-technology ,business ,computer ,lcsh:Physics - Abstract
Multi-task learning (MTL) approaches are actively used for various natural language processing (NLP) tasks. The Multi-Task Deep Neural Network (MT-DNN) has contributed significantly to improving the performance of natural language understanding (NLU) tasks. However, one drawback is that confusion about the language representation of the various tasks arises during training of the MT-DNN model. Inspired by the internal-transfer weighting of MTL in medical imaging, we introduce a Sequential and Intensive Weighted Language Modeling (SIWLM) scheme. The SIWLM consists of two stages: (1) sequential weighted learning (SWL), which trains a model to learn all tasks sequentially and concentrically, and (2) intensive weighted learning (IWL), which enables the model to focus on the central task (a minimal weighted multi-task loss sketch follows this entry). We apply this scheme to the MT-DNN model and call the resulting model MTDNN-SIWLM. Our model achieves higher performance than the existing reference algorithms on six of the eight GLUE benchmark tasks. Moreover, our model outperforms MT-DNN by 0.77 on average across all tasks. Finally, we conducted a thorough empirical investigation to determine the optimal weight for each GLUE task.
- Published
- 2021
- Full Text
- View/download PDF
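A small sketch of the generic mechanism that the scheme in the entry above builds on: combining several task losses with weights that emphasize a chosen central task. The task names, weights, and loss values are invented; the actual SWL/IWL schedules are described in the paper and not reproduced here.

```python
import torch

task_losses = {"task_a": torch.tensor(0.9), "task_b": torch.tensor(1.4), "central": torch.tensor(0.7)}
weights = {"task_a": 0.2, "task_b": 0.2, "central": 0.6}   # emphasize the central task

total_loss = sum(weights[name] * loss for name, loss in task_losses.items())
print(total_loss)   # the weighted sum that would be backpropagated
```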
20. Split Attention Pointer Network for Source Code Language Modeling
- Author
-
Zhongwen Chen and Zhimin Zhou
- Subjects
Source code ,Computer Networks and Communications ,Computer science ,business.industry ,Programming language ,media_common.quotation_subject ,Deep learning ,computer.software_genre ,Computer Graphics and Computer-Aided Design ,Recurrent neural network ,Artificial Intelligence ,Pointer (computer programming) ,Program completion ,Leverage (statistics) ,Language model ,Artificial intelligence ,business ,computer ,Software ,media_common - Abstract
There is a growing interest in leveraging Deep Learning (DL) for automating software engineering tasks such as program completion. In this paper, we leverage Recurrent Neural Networks (RNNs) for Abstract Syntax Tree (AST)-based code completion. Our approach converts source code into AST nodes, and a language model predicts the type and value attributes of the next tokens. Our work demonstrates that attention-augmented RNN-based language models are able to understand local context and copy from recent past tokens which have never appeared in the training data set. We observed a drop in performance for both type and value predictions when using a traditional pointer network architecture for out-of-vocabulary (OoV) copying and context understanding, which we call multi-task conflict. To address this challenge, we have devised a new self-attention structure called Split Attention, where two separate dot-product layers are applied to different parts of the history cache. Based on this structure, we propose a new network called Split Attention Pointer Network (SAPN), which is efficient and flexible in both learning local context and copying OoV tokens from history. The empirical results suggest that our model is superior in syntax-aware generation and OoV token prediction, demonstrating attention behavior similar to human programmers. The results also indicate that our model outperforms previous state-of-the-art approaches by more than 6% on widely recognized program completion benchmarks.
- Published
- 2020
- Full Text
- View/download PDF
21. Analysis of Neural Network Based Language Modeling
- Author
-
P. Karuppusamy
- Subjects
Artificial neural network ,Computer science ,business.industry ,0202 electrical engineering, electronic engineering, information engineering ,020206 networking & telecommunications ,020201 artificial intelligence & image processing ,02 engineering and technology ,Artificial intelligence ,Language model ,business - Abstract
Language modeling, usually referred to as statistical language modeling, is a fundamental and core process of natural language processing. It is also vital for other natural language processing tasks such as sentence completion, automatic speech recognition, statistical machine translation, and text generation. The success of viable natural language processing relies on the quality of the language model. Over previous years, language modeling has been studied in research fields such as linguistics, psychology, speech recognition, data compression, neuroscience, and machine translation. As neural networks are a very good choice for high-quality language modeling, this paper presents an analysis of neural networks for language modeling. Using datasets such as the Penn Treebank, the Billion Word Benchmark, and WikiText, the neural network models are evaluated on the basis of word error rate, perplexity, and bilingual evaluation understudy (BLEU) scores to identify the optimal model.
- Published
- 2020
- Full Text
- View/download PDF
22. Variational Sentence Augmentation for Masked Language Modeling
- Author
-
Mehmet Fatih Amasyali and M. Safak Bilici
- Subjects
Space (punctuation) ,business.industry ,Computer science ,Turkish ,computer.software_genre ,Semantics ,Autoencoder ,language.human_language ,Data modeling ,language ,Artificial intelligence ,Language model ,Representation (mathematics) ,business ,computer ,Sentence ,Natural language processing - Abstract
We introduce a variational sentence augmentation method that combines a Variational Autoencoder [1] and a Gated Recurrent Unit [2]. The proposed data augmentation method benefits from its latent space representation, which encodes semantic and syntactic properties of the language. After learning the representation of the language, the model generates sentences from its latent space with the sequential structure of the Gated Recurrent Unit (a minimal generation sketch follows this entry). By augmenting an existing unstructured corpus, the model improves masked language modeling in pre-training and, as a result, improves fine-tuning as well. In pre-training, our method increases the prediction rate of masked tokens. In fine-tuning, we show that variational sentence augmentation can help both semantic and syntactic tasks. We make our experiments and evaluations on a limited dataset containing Turkish sentences, which also represents a contribution to low-resource languages.
- Published
- 2021
- Full Text
- View/download PDF
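A minimal sketch of the generation step in the entry above: sample a latent vector and let a GRU decoder unroll it into a token sequence. The modules below are untrained stand-ins with made-up sizes, so the output is random; this shows only the mechanics, not the authors' trained model.

```python
import torch
import torch.nn as nn

latent_dim, hidden_dim, vocab_size = 16, 32, 50
to_hidden = nn.Linear(latent_dim, hidden_dim)
decoder = nn.GRU(input_size=vocab_size, hidden_size=hidden_dim, batch_first=True)
readout = nn.Linear(hidden_dim, vocab_size)

z = torch.randn(1, latent_dim)                    # sample from the latent space
h = torch.tanh(to_hidden(z)).unsqueeze(0)         # initial hidden state: (layers, batch, hidden)
token = torch.zeros(1, 1, vocab_size)             # start token as a one-hot-style input
generated = []
for _ in range(8):
    output, h = decoder(token, h)
    logits = readout(output[:, -1])
    next_id = int(torch.distributions.Categorical(logits=logits).sample())
    generated.append(next_id)
    token = torch.zeros(1, 1, vocab_size)
    token[0, 0, next_id] = 1.0                    # feed the sampled token back in
print(generated)
```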
23. Deep-Learning Language-Modeling Approach for Automated, Personalized, and Iterative Radiology-Pathology Correlation
- Author
-
Ross W. Filice
- Subjects
Pathology ,medicine.medical_specialty ,Databases, Factual ,Computer science ,media_common.quotation_subject ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,030218 nuclear medicine & medical imaging ,Automation ,03 medical and health sciences ,Deep Learning ,0302 clinical medicine ,Artificial Intelligence ,Multidisciplinary approach ,medicine ,Humans ,Radiology, Nuclear Medicine and imaging ,Quality (business) ,Precision Medicine ,Peer learning ,Adaptation (computer science) ,Natural Language Processing ,media_common ,Pathology, Clinical ,business.industry ,Deep learning ,Quality Improvement ,Radiology Information Systems ,Binary classification ,030220 oncology & carcinogenesis ,Language model ,Artificial intelligence ,Radiology ,business ,Quality assurance ,Forecasting - Abstract
Purpose: Radiology-pathology correlation has long been foundational to continuing education, peer learning, quality assurance, and multidisciplinary patient care. The objective of this study was to determine whether modern deep-learning language-modeling techniques could reliably match pathology reports to pertinent radiology reports.
Methods: The recently proposed Universal Language Model Fine-Tuning for Text Classification methodology was used. Two hundred thousand radiology and pathology reports were used for adaptation to the radiology-pathology space. One hundred thousand candidate radiology-pathology pairs, evenly split into match and no-match categories, were used for training the final binary classification model. Matches were defined by a previous-generation artificial intelligence anatomic concept radiology-pathology correlation system.
Results: The language model rapidly adapted very closely to the prior anatomic concept-matching approach, with 100% specificity, 65.1% sensitivity, and 73.7% accuracy. For comparison, the previous methodology, which was intentionally designed to be specific at the expense of sensitivity, had 98.0% specificity, 65.1% sensitivity, and 73.2% accuracy.
Conclusions: Modern deep-learning language-modeling approaches are promising for radiology-pathology correlation. Because of their rapid adaptation to underlying training labels, these models advance previous artificial intelligence work in that they can be continuously improved and tuned to improve performance and adjust to user and site-level preference.
- Published
- 2019
- Full Text
- View/download PDF
24. Neural Language Modeling for Molecule Generation
- Author
-
Adilov S
- Subjects
Matching (statistics) ,Artificial neural network ,business.industry ,Computer science ,Deep learning ,Benchmarking ,Machine learning ,computer.software_genre ,chEMBL ,Recurrent neural network ,Artificial intelligence ,Language model ,business ,computer ,Generative grammar - Abstract
Generative neural networks have shown promising results in de novo drug design. Recent studies suggest that one of the efficient ways to produce novel molecules matching target properties is to model SMILES sequences using deep learning, in a way similar to language modeling in natural language processing (a toy character-level sketch follows this entry). In this paper, we present a survey of various machine learning methods for SMILES-based language modeling and report our benchmarking results on a standardized subset of the ChEMBL database.
- Published
- 2021
- Full Text
- View/download PDF
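A toy character-level language model over SMILES strings, sketching the general idea in the entry above rather than the paper's neural models: count character bigrams and sample new strings from the resulting conditional distributions. The SMILES strings below are arbitrary examples.

```python
import random
from collections import defaultdict, Counter

smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN(CC)CC"]
bigram_counts = defaultdict(Counter)
for s in smiles:
    s = "^" + s + "$"                        # start / end markers
    for current, following in zip(s, s[1:]):
        bigram_counts[current][following] += 1

def sample(max_len: int = 20, seed: int = 1) -> str:
    """Sample a string character by character from the bigram statistics."""
    random.seed(seed)
    out, char = [], "^"
    for _ in range(max_len):
        choices = bigram_counts[char]
        char = random.choices(list(choices), weights=choices.values())[0]
        if char == "$":
            break
        out.append(char)
    return "".join(out)

print(sample())   # a (not necessarily valid) SMILES-like string
```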
25. Transformer-Based Deep Neural Language Modeling for Construct-Specific Automatic Item Generation
- Author
-
Hommel B, Kotova, Schmukle Sc, Wollang F, and Hannes Zacher
- Subjects
business.industry ,Computer science ,Automatic item generation ,Artificial intelligence ,Language model ,Construct (python library) ,business ,computer.software_genre ,computer ,Natural language processing ,Transformer (machine learning model) - Abstract
Algorithmic automatic item generation can be used to obtain large quantities of cognitive items in the domains of knowledge and aptitude testing. However, conventional item models used by template-based automatic item generation techniques are not ideal for the creation of items for non-cognitive constructs. Progress in this area has been made recently by employing long short-term memory recurrent neural networks to produce word sequences that syntactically resemble items typically found in personality questionnaires. To date, such items have been produced unconditionally, without the possibility of selectively targeting personality domains. In this article, we offer a brief synopsis on past developments in natural language processing and explain why the automatic generation of construct-specific items has become attainable only due to recent technological progress. We propose that pre-trained causal transformer models can be fine-tuned to achieve this task using implicit parameterization in conjunction with conditional generation. We demonstrate this method in a tutorial-like fashion and finally compare aspects of validity in human- and machine-authored items using empirical data. Our study finds that approximately two-thirds of the automatically generated items show good psychometric properties (factor loadings above .40) and that one-third even have properties equivalent to established and highly curated human-authored items. Our work thus demonstrates the practical use of deep neural networks for non-cognitive automatic item generation.
- Published
- 2021
- Full Text
- View/download PDF
26. Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition
- Author
-
Darko Pekar, Branislav Popović, and Edvin Pakoci
- Subjects
Vocabulary ,Article Subject ,General Computer Science ,Computer science ,General Mathematics ,media_common.quotation_subject ,02 engineering and technology ,lcsh:Computer applications to medicine. Medical informatics ,computer.software_genre ,Semantics ,lcsh:RC321-571 ,03 medical and health sciences ,0302 clinical medicine ,0202 electrical engineering, electronic engineering, information engineering ,Humans ,Speech ,lcsh:Neurosciences. Biological psychiatry. Neuropsychiatry ,Lemma (morphology) ,media_common ,Dictation ,business.industry ,General Neuroscience ,Recognition, Psychology ,General Medicine ,language.human_language ,Grammatical number ,Speech Perception ,language ,lcsh:R858-859.7 ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language model ,Serbian ,business ,computer ,030217 neurology & neurosurgery ,Word (computer architecture) ,Natural language processing ,Research Article - Abstract
Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, often a wrong word ending occurs, which is nevertheless counted as an error. This effect is larger for contexts not present in the language model training corpus. In this manuscript, an approach which takes into account different morphological categories of words for language modeling is examined, and the benefits in terms of word error rates and perplexities are presented. These categories include word type, word case, grammatical number, and gender, and they were all assigned to words in the system vocabulary, where applicable. These additional word features helped to produce significant improvements in relation to the baseline system, both for n-gram-based and neural network-based language models. The proposed system can help overcome a lot of tedious errors in a large vocabulary system, for example, for dictation, both for Serbian and for other languages with similar characteristics.
- Published
- 2019
- Full Text
- View/download PDF
27. Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
- Author
-
Shancheng Fang, Yongdong Zhang, Hongtao Xie, Zhendong Mao, and Yuxin Wang
- Subjects
FOS: Computer and information sciences ,Computer science ,business.industry ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Text recognition ,Feature (machine learning) ,Code (cryptography) ,Limited capacity ,Noise (video) ,Language model ,Artificial intelligence ,Representation (mathematics) ,business ,Block (data storage) - Abstract
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous principle blocks gradient flow between the vision and language models to enforce explicit language modeling (a minimal gradient-blocking sketch follows this entry). Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative-correction execution manner for the language model, which can effectively alleviate the impact of noisy input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition. Code is available at https://github.com/FangShancheng/ABINet. (Accepted by CVPR 2021.)
- Published
- 2021
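A minimal sketch of the gradient-blocking idea in the entry above: the language branch receives the vision branch's prediction with gradients detached, so the language model is forced to perform explicit language modeling. The module shapes and names here are illustrative stand-ins, not ABINet's architecture.

```python
import torch
import torch.nn as nn

vision_head = nn.Linear(32, 10)      # stand-in for the vision model's character classifier
language_model = nn.Linear(10, 10)   # stand-in for the language refinement model

features = torch.randn(4, 32)
vision_logits = vision_head(features)
# detach() blocks gradient flow: the language loss cannot update the vision branch.
language_logits = language_model(vision_logits.softmax(dim=-1).detach())
print(language_logits.shape)
```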
28. From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
- Author
-
Ibrahim Sharaf, Alan Ramponi, Siti Oryza Khairunnisa, Barbara Plank, Marija Stepanović, Ahmet Üstün, Aizhan Imankulova, Rob van der Goot, and Mamoru Komachi
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Syntax (programming languages) ,Machine translation ,business.industry ,Computer science ,02 engineering and technology ,010501 environmental sciences ,Reuse ,computer.software_genre ,01 natural sciences ,Zero (linguistics) ,0202 electrical engineering, electronic engineering, information engineering ,Benchmark (computing) ,Key (cryptography) ,020201 artificial intelligence & image processing ,Language model ,Artificial intelligence ,business ,Computation and Language (cs.CL) ,computer ,Natural language processing ,0105 earth and related environmental sciences ,Spoken language - Abstract
The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification. (To appear in the proceedings of NAACL 2021.)
- Published
- 2021
- Full Text
- View/download PDF
29. A Cognitive Regularizer for Language Modeling
- Author
-
Clara Meister, Jason Wei, Ryan Cotterell, Zong, Chengqing, Xia, Fei, Li, Wenjie, and Navigli, Roberto
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,Perplexity ,Operationalization ,Inductive bias ,Computer science ,business.industry ,SIGNAL (programming language) ,Cognition ,computer.software_genre ,Psycholinguistics ,Artificial intelligence ,Language model ,business ,Computation and Language (cs.CL) ,computer ,Regularization (linguistics) ,Natural language processing - Abstract
The uniform information density (UID) hypothesis, which posits that speakers behaving optimally tend to distribute information uniformly across a linguistic signal, has gained traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. In this work, we explore whether the UID hypothesis can be operationalized as an inductive bias for statistical language modeling. Specifically, we augment the canonical MLE objective for training language models with a regularizer that encodes UID (a minimal regularizer sketch follows this entry). In experiments on ten languages spanning five language families, we find that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited. Moreover, via an analysis of generated sequences, we find that UID-regularized language models have other desirable properties, e.g., they generate text that is more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools. (Published in the Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ISBN 978-1-954085-52-7.)
- Published
- 2021
- Full Text
- View/download PDF
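A sketch of a UID-style regularizer of the kind described in the entry above, under the assumption that it penalizes the variance of per-token surprisals; the weight beta and the toy logits are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def uid_regularized_loss(logits: torch.Tensor, targets: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Canonical MLE objective plus a penalty on uneven per-token information."""
    surprisal = F.cross_entropy(logits, targets, reduction="none")   # -log p(target token)
    nll = surprisal.mean()
    uid_penalty = surprisal.var(unbiased=False)                      # uniform information density term
    return nll + beta * uid_penalty

logits = torch.randn(5, 100)             # 5 toy positions over a 100-word vocabulary
targets = torch.randint(0, 100, (5,))
print(uid_regularized_loss(logits, targets))
```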
30. Cultural Understanding Using In-context Learning and Masked Language Modeling
- Author
-
Charles Newton, Davis Qian, and Ming Qian
- Subjects
Descriptive knowledge ,Computer science ,business.industry ,Deep learning ,Variety (linguistics) ,computer.software_genre ,Schema (psychology) ,Language technology ,Language model ,Artificial intelligence ,business ,computer ,Natural language processing ,Generative grammar ,Gesture - Abstract
With the rapid advancement of natural language processing (NLP) as a sub-field of artificial intelligence (AI), a number of unsupervised pre-trained language models trained on large corpora have become available (e.g., BERT and GPT-3). While these models have tremendous linguistic knowledge, many other types of knowledge are embedded in them as well. We perform cross-cultural analysis experiments using AI-based Masked Language Modeling (MLM) and GPT-based generative language modeling (in-context learning). The designed approach is to set up a cultural context in sentences with masked words (for MLM) or in a human-prompted text segment (for GPT-based NLG); a minimal fill-mask query sketch follows this entry. Consequently, the predicted masked words or the machine-generated stories reflect measurable intercultural differences, because language models are trained on different corpora in different languages, and on English corpora containing a significant amount of knowledge about foreign cultures. We show a variety of examples: geopolitical knowledge, holidays, gestures, customs, social norms, emotion schemas, role schemas, procedure schemas, and emotion change detection based on a diplomatic speech. A deep learning neural network model encodes its knowledge in the weights of a neural network rather than as organized semantic concepts. The model can reflect biases brought in by the training data and can give us inaccurate or faulty answers. Overall, with the rapid advancement of language technology, pre-trained language models have grown more powerful and have great potential to serve as a culturalization tool.
- Published
- 2021
- Full Text
- View/download PDF
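A minimal sketch of the kind of masked-language-model query described in the entry above, using the Hugging Face transformers fill-mask pipeline (this assumes the transformers package is installed and downloads a pretrained model; the example sentence is ours, not from the study).

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("In this country, the most important holiday is [MASK].")
for prediction in predictions:
    # Each prediction carries the filled-in token and the model's confidence.
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```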
31. Assisting Text Localization and Transcreation Tasks Using AI-Based Masked Language Modeling
- Author
-
Ming Qian and Jessie Liu
- Subjects
Parsing ,Conceptualization ,Computer science ,business.industry ,Natural language understanding ,Context (language use) ,Query language ,computer.software_genre ,Schema (psychology) ,Language model ,Artificial intelligence ,Transcreation ,business ,computer ,Natural language processing - Abstract
Localization refers to the adaptation of a document’s content to meet the linguistic, cultural, and other requirements of a specific target market―a locale. Transcreation describes the process of adapting a message from one language to another, while maintaining its intent, style, tone, and context. In recent years, pre-trained language models have pushed the limits of natural language understanding and generation and dominated the NLP progress. We foresee that the AI-based pre-trained language models (e.g. masked language modeling) and other existing and upcoming language modeling techniques will be integrated as effective tools to support localization/transcreation efforts in the coming years. To support localization/transcreation tasks, we use AI-based Masked Language Modeling (MLM) to provide a powerful human-machine teaming tool to query language models for the most proper words/phrases to reflect the proper linguistical and cultural characteristics of the target language. For linguistic applications, we list examples on logical connectives, pronouns and antecedents, and unnecessary redundant nouns and verbs. For intercultural conceptualization applications, we list examples of cultural event schema, role schema, emotional schema, and propositional schema. There are two possible approaches to determine where to put masks: a human-based approach or an algorithm-based approach. For the algorithm-based approach, constituency parsing can be used to break a text into sub-phrases, or constituents, after which typical linguistic patterns can be detected and then finally masking tasks can be attempted on the related texts.
- Published
- 2021
- Full Text
- View/download PDF
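The "query the model for the most appropriate word" workflow described above can be sketched with a fill-mask pipeline whose targets argument restricts scoring to a candidate list; the sentence, the candidate connectives, and the checkpoint below are illustrative.
```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-cased")
mask = fill.tokenizer.mask_token

# Which logical connective reads most naturally here? Restrict scoring to a
# short candidate list and compare the model's scores for each option.
sentence = f"The service was slow; {mask}, the staff were very friendly."
candidates = ["however", "therefore", "moreover"]

for pred in fill(sentence, targets=candidates):
    print(f"{pred['token_str']:>10s}  score={pred['score']:.4f}")
```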
32. Researchers Submit Patent Application, "Systems And Methods For Language Modeling With Textual Clinical Data", for Approval (USPTO 20240095445).
- Abstract
A patent application has been submitted for a system and methods for language modeling with textual clinical data. The invention aims to improve the analysis of text-based attributes in electronic health records (EHRs) by leveraging machine learning and natural language processing (NLP) techniques. The system includes a trainer module that fine-tunes a pre-trained language model using clinician feedback and multiple task-specific NLP models. The system can convert EHRs into a second format, redacting protected health information, and then feed the data into the language model for analysis. The system can perform tasks such as classification, search and ranking, autocomplete, and topic modeling. [Extracted from the article]
- Published
- 2024
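A very rough sketch of the redact-then-analyze flow described in the abstract above; the regular expressions and placeholder tags are invented for illustration, and a real system would rely on a trained de-identification model rather than patterns like these.
```python
import re

# Placeholder PHI redaction applied before the note is passed to a language
# model for downstream tasks such as classification or search.
PHI_PATTERNS = {
    "[DATE]": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "[PHONE]": r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b",
    "[MRN]": r"\bMRN[:\s]*\d+\b",
}

def redact(note: str) -> str:
    for tag, pattern in PHI_PATTERNS.items():
        note = re.sub(pattern, tag, note, flags=re.IGNORECASE)
    return note

print(redact("Pt seen on 03/14/2023, MRN: 884211, call 555-123-4567 to follow up."))
```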
33. On Sampling-Based Training Criteria for Neural Language Modeling
- Author
-
Hermann Ney, Yingbo Gao, Khoa Viet Tran, Alexander Gerstenberger, Ralf Schlüter, and David Thulke
- Subjects
FOS: Computer and information sciences ,Vocabulary ,Sound (cs.SD) ,Mean squared error ,Computer science ,media_common.quotation_subject ,Monte Carlo method ,Machine Learning (stat.ML) ,02 engineering and technology ,Machine learning ,computer.software_genre ,Computer Science - Sound ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Audio and Speech Processing (eess.AS) ,Statistics - Machine Learning ,0202 electrical engineering, electronic engineering, information engineering ,FOS: Electrical engineering, electronic engineering, information engineering ,media_common ,Computer Science - Computation and Language ,business.industry ,Sampling (statistics) ,020206 networking & telecommunications ,Tree traversal ,Artificial intelligence ,Language model ,0305 other medical science ,business ,computer ,Computation and Language (cs.CL) ,Word (computer architecture) ,Importance sampling ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria have been proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem with the current landscape of such sampling methods is the lack of systematic comparison and the myths about preferring one method over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups., Accepted at INTERSPEECH 2021
- Published
- 2021
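The correction the authors refer to can be illustrated with a generic sampled-softmax criterion that subtracts the log proposal probability from each sampled logit; this is a textbook-style PyTorch sketch, not the paper's compensated partial summation method.
```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, out_emb, out_bias, targets, proposal_probs, k=100):
    """Sampled softmax with the usual log-proposal correction.

    hidden:         (B, H) context vectors from the language model
    out_emb:        (V, H) output embedding matrix
    out_bias:       (V,)   output bias
    targets:        (B,)   gold next-word ids
    proposal_probs: (V,)   sampling distribution (e.g. unigram counts)
    """
    B = hidden.size(0)
    # Draw k shared negative samples from the proposal distribution.
    negatives = torch.multinomial(proposal_probs, k, replacement=True)        # (k,)
    classes = torch.cat([targets.unsqueeze(1),
                         negatives.unsqueeze(0).expand(B, k)], dim=1)          # (B, 1+k)

    logits = torch.einsum("bh,bch->bc", hidden, out_emb[classes]) + out_bias[classes]
    # Subtracting log q(class) keeps the sampled objective consistent with the
    # full-softmax class posteriors, which is the correction discussed above.
    logits = logits - torch.log(proposal_probs[classes] + 1e-10)

    # The gold class sits at column 0 of `classes`.
    return F.cross_entropy(logits, torch.zeros(B, dtype=torch.long))
```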
34. Deep neural language modeling enables functional protein generation across families
- Author
-
Nikhil Naik, Caiming Xiong, Ali Madani, Zachary Z. Sun, Subu Subramanian, Richard Socher, Jose L. Olmos, Ben Krause, James S. Fraser, Eric R. Greene, Benjamin P. Mohr, and James M. Holton
- Subjects
chemistry.chemical_classification ,Protein family ,business.industry ,Deep learning ,Computational biology ,Biology ,Amino acid ,White (mutation) ,chemistry ,Chorismate mutase ,Identity (object-oriented programming) ,Artificial intelligence ,Language model ,business ,Natural language - Abstract
Bypassing nature’s evolutionary trajectory, de novo protein generation—defined as creating artificial protein sequences from scratch—could enable breakthrough solutions for biomedical and environmental challenges. Viewing amino acid sequences as a language, we demonstrate that a deep learning-based language model can generate functional artificial protein sequences across families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. Our protein language model is trained by simply learning to predict the next amino acid for over 280 million protein sequences from thousands of protein families, without biophysical or coevolutionary modeling. We experimentally evaluate model-generated artificial proteins on five distinct antibacterial lysozyme families. Artificial proteins show similar activities and catalytic efficiencies as representative natural lysozymes, including hen egg white lysozyme, while reaching as low as 44% identity to any known naturally-evolved protein. The X-ray crystal structure of an enzymatically active artificial protein recapitulates the conserved fold and positioning of active site residues found in natural proteins. We demonstrate our language model’s ability to be adapted to different protein families by accurately predicting the functionality of artificial chorismate mutase and malate dehydrogenase proteins. These results indicate that neural language models successfully perform de novo protein generation across protein families and may prove to be a tool to shortcut evolution.
- Published
- 2021
- Full Text
- View/download PDF
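A toy next-residue sampler showing the "predict the next amino acid" formulation described above; the tiny untrained LSTM below is only a stand-in for the large model trained on 280 million sequences.
```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # 20 standard residues
stoi = {a: i for i, a in enumerate(AMINO_ACIDS)}

class TinyProteinLM(nn.Module):
    """Next-residue language model: p(x_t | x_<t)."""
    def __init__(self, vocab=20, emb=32, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.head(h)                      # (B, T, vocab) logits

@torch.no_grad()
def sample(model, prefix="MKT", length=30, temperature=1.0):
    ids = [stoi[a] for a in prefix]
    for _ in range(length):
        logits = model(torch.tensor([ids]))[0, -1] / temperature
        ids.append(torch.multinomial(torch.softmax(logits, -1), 1).item())
    return "".join(AMINO_ACIDS[i] for i in ids)

model = TinyProteinLM()                          # untrained, so output is random here
print(sample(model))
```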
35. Processing Large Text Corpus Using N-Gram Language Modeling and Smoothing
- Author
-
Sandhya Avasthi, D. P. Acharjya, and Ritu Chauhan
- Subjects
Text corpus ,Perplexity ,Phrase ,business.industry ,Computer science ,computer.software_genre ,n-gram ,User experience design ,Language model ,Artificial intelligence ,business ,computer ,Turing ,Natural language processing ,Word (computer architecture) ,computer.programming_language - Abstract
Predicting the next word, letter, or phrase while the user is typing is a valuable tool for improving user experience. Users communicate, write reviews, and express opinions on social platforms frequently, often while on the move, so it has become necessary to provide applications that reduce typing effort and spelling errors when time is limited. Because text data keeps growing with the extensive use of social media platforms, implementing a text-prediction application is difficult given the volume of text that must be processed for language modeling. This paper's primary objective is to process a large text corpus and implement a probabilistic model such as N-grams to predict the next word from the user's input. In this exploratory study, n-gram models are discussed and evaluated using Good-Turing estimation, the perplexity measure, and the type-to-token ratio.
- Published
- 2021
- Full Text
- View/download PDF
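A minimal bigram sketch of the pipeline described above, using add-one smoothing to keep it short (the paper evaluates Good-Turing estimation) together with a perplexity measure and next-word prediction; the corpus is a toy.
```python
import math
from collections import Counter

corpus = "the users write reviews and the users share opinions".split()
V = len(set(corpus))

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # Add-one (Laplace) smoothing; the paper uses Good-Turing estimation instead.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(words):
    log_p = sum(math.log(p_bigram(a, b)) for a, b in zip(words, words[1:]))
    return math.exp(-log_p / (len(words) - 1))

def predict_next(w_prev, k=3):
    # Rank candidate next words by their smoothed bigram probability.
    return sorted(set(corpus), key=lambda w: -p_bigram(w_prev, w))[:k]

print(predict_next("the"), perplexity("the users write opinions".split()))
```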
36. PolyLM: Learning about Polysemy through Language Modeling
- Author
-
Felipe Bravo-Marquez, Alan Ansell, and Bernhard Pfahringer
- Subjects
FOS: Computer and information sciences ,Contextualization ,Computer Science - Computation and Language ,Computer science ,business.industry ,Context (language use) ,Meaning (non-linguistic) ,Conflation ,computer.software_genre ,Code (cryptography) ,Artificial intelligence ,Language model ,Polysemy ,business ,Computation and Language (cs.CL) ,computer ,Word (computer architecture) ,Natural language processing - Abstract
To avoid the "meaning conflation deficiency" of word embeddings, a number of models have aimed to embed individual word senses. These methods at one time performed well on tasks such as word sense induction (WSI), but they have since been overtaken by task-specific techniques which exploit contextualized embeddings. However, sense embeddings and contextualization need not be mutually exclusive. We introduce PolyLM, a method which formulates the task of learning sense embeddings as a language modeling problem, allowing contextualization techniques to be applied. PolyLM is based on two underlying assumptions about word senses: firstly, that the probability of a word occurring in a given context is equal to the sum of the probabilities of its individual senses occurring; and secondly, that for a given occurrence of a word, one of its senses tends to be much more plausible in the context than the others. We evaluate PolyLM on WSI, showing that it performs considerably better than previous sense embedding techniques, and matches the current state-of-the-art specialized WSI method despite having six times fewer parameters. Code and pre-trained models are available at https://github.com/AlanAnsell/PolyLM., EACL 2021
- Published
- 2021
- Full Text
- View/download PDF
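The two assumptions behind PolyLM described above can be illustrated numerically; the sense inventory and the probabilities below are made up.
```python
import numpy as np

# Hypothetical per-sense probabilities for the word "bank" in one context,
# e.g. produced by a sense-aware language model.
senses = ["bank_river", "bank_finance", "bank_tilt"]
p_sense_in_context = np.array([0.02, 0.55, 0.01])

# Assumption 1: p(word | context) is the sum of its sense probabilities.
p_word = p_sense_in_context.sum()

# Assumption 2: one sense tends to dominate; disambiguate by normalizing
# the sense probabilities and taking the argmax.
posterior = p_sense_in_context / p_word
print(f"p(bank | context) = {p_word:.2f}")
print("most plausible sense:", senses[int(posterior.argmax())], posterior.round(2))
```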
37. Language Modeling and Text Generation Using Hybrid Recurrent Neural Network
- Author
-
Rizwan Hasan Khan, Iftikhar Ahmad, Suleman Khan, Samreen, and Muhammad Iqbal
- Subjects
Perplexity ,Computer science ,business.industry ,Character (computing) ,Treebank ,computer.software_genre ,Recurrent neural network ,Artificial intelligence ,Language model ,business ,computer ,Natural language ,Natural language processing ,Word (computer architecture) ,Sentence - Abstract
As machines become increasingly capable of understanding complex behavior and taking over tasks that once required human effort, automatic text generation (ATG) has attracted wide attention. Language modeling, or text generation, is the task of predicting the next character or word in a sequence by analyzing the input data. ATG enables machines to write and helps reduce human effort; it also supports the understanding and analysis of languages and provides techniques that allow machines to exchange information in natural language. Text data is created everywhere at large scale (WhatsApp, Facebook, tweets, etc.) and is freely available online, so an effective system is needed to automate text generation and to analyze the text data for meaningful information. In this work, we present a case study on developing a text generation model for English using a hybrid recurrent neural network (HRNN). The explored model learns dependencies between characters and the conditional probabilities of characters in sequences from the available input text, and generates entirely new character sequences resembling human writing (correct in meaning, spelling, and sentence structure). A comprehensive comparison among LSTM, deep LSTM, GRU, and HRNN models is also presented. RNN models have previously been used for text prediction and automatic text generation, but they suffer from the vanishing-gradient (short-memory) problem when processing long text, which the GRU and LSTM models were created to solve. Because the text generated by GRU and LSTM still contains many spelling errors and incorrect sentence structures, we explore the HRNN model to fill this gap. The HRNN model is a combination of LSTM, GRU, and a dense layer. Experiments were performed on the Penn Treebank, Shakespeare, and Nietzsche datasets. The HRNN model achieves a perplexity of 3.27, 1.18 bits per character, and an average word-prediction accuracy of 0.63. Compared with the baseline work and previous models (LSTM, deep LSTM, and GRU), our model's perplexity and bits per character are lower, and the texts generated by HRNN contain fewer spelling errors and sentence-structure mistakes. A closer analysis of the explored models' performance and efficiency is presented with graph plots and with texts generated from sample input strings.
- Published
- 2021
- Full Text
- View/download PDF
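A Keras sketch of an LSTM + GRU + dense stack for character-level prediction, loosely following the HRNN description above; the vocabulary size, layer widths, and random toy batch are placeholders, not the paper's configuration.
```python
import numpy as np
import tensorflow as tf

vocab_size, seq_len = 60, 40          # placeholder character vocabulary and window

# LSTM followed by GRU followed by a dense softmax, as in the HRNN description.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy batch: predict the character that follows each 40-character window.
x = np.random.randint(0, vocab_size, size=(32, seq_len))
y = np.random.randint(0, vocab_size, size=(32,))
model.fit(x, y, epochs=1, verbose=0)
```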
38. MG-BERT: Multi-Graph Augmented BERT for Masked Language Modeling
- Author
-
Mahdieh Soleymani Baghshah, Hossein Zakerinia, and Parishad BehnamGhader
- Subjects
Text corpus ,business.industry ,Computer science ,Context (language use) ,02 engineering and technology ,computer.software_genre ,Task (project management) ,03 medical and health sciences ,0302 clinical medicine ,030221 ophthalmology & optometry ,0202 electrical engineering, electronic engineering, information engineering ,Graph (abstract data type) ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language model ,business ,Encoder ,computer ,Sentence ,Word (computer architecture) ,Natural language processing - Abstract
Pre-trained models like Bidirectional Encoder Representations from Transformers (BERT), have recently made a big leap forward in Natural Language Processing (NLP) tasks. However, there are still some shortcomings in the Masked Language Modeling (MLM) task performed by these models. In this paper, we first introduce a multi-graph including different types of relations between words. Then, we propose Multi-Graph augmented BERT (MG-BERT) model that is based on BERT. MG-BERT embeds tokens while taking advantage of a static multi-graph containing global word co-occurrences in the text corpus beside global real-world facts about words in knowledge graphs. The proposed model also employs a dynamic sentence graph to capture local context effectively. Experimental results demonstrate that our model can considerably enhance the performance in the MLM task.
- Published
- 2021
- Full Text
- View/download PDF
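The static global word co-occurrence graph used alongside the knowledge graph in the abstract above can be sketched as windowed co-occurrence counts; the corpus and window size below are toys, and this is only one ingredient of MG-BERT.
```python
from collections import Counter

corpus = [
    "language models embed tokens",
    "knowledge graphs describe real world facts",
    "co-occurrence graphs capture global word statistics",
]

def cooccurrence_graph(sentences, window=3):
    """Undirected edge weights = number of times two words co-occur
    within `window` positions of each other."""
    edges = Counter()
    for sent in sentences:
        words = sent.split()
        for i, w in enumerate(words):
            for v in words[i + 1 : i + window]:
                if w != v:
                    edges[tuple(sorted((w, v)))] += 1
    return edges

for (u, v), weight in cooccurrence_graph(corpus).most_common(5):
    print(f"{u} -- {v}: {weight}")
```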
39. Noobs at Semeval-2021 Task 4: Masked Language Modeling for abstract answer prediction
- Author
-
Sarthak Sarthak, Karm Veer Arya, and Shikhar Shukla
- Subjects
Computer science ,business.industry ,computer.software_genre ,SemEval ,Task (project management) ,Reading comprehension ,Binary classification ,Benchmark (computing) ,Language model ,Artificial intelligence ,business ,computer ,Word (computer architecture) ,Natural language processing ,Meaning (linguistics) - Abstract
This paper presents the system developed by our team for SemEval 2021 Task 4: Reading Comprehension of Abstract Meaning. The aim of the task was to benchmark NLP techniques in understanding the abstract concepts present in a passage and then predicting the missing word in a human-written summary of the passage. We trained a RoBERTa-Large model with a masked language modeling objective. In cases where this model failed to predict one of the available options, another RoBERTa-Large model, trained as a binary classifier, was used to predict correct and incorrect options. We used the passage summary generated by a Pegasus model, together with the question, as inputs. Our best solution was an ensemble of these two systems. We achieved an accuracy of 86.22% on subtask 1 and 87.10% on subtask 2.
- Published
- 2021
- Full Text
- View/download PDF
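The mask-scoring step can be sketched by comparing candidate answers at the summary's blank under a masked language model; this uses the generic roberta-base checkpoint and assumes single-token options, whereas the submitted system fine-tuned RoBERTa-Large and added a binary-classifier fallback.
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()

summary = f"The committee reached an {tok.mask_token} on the new policy."
options = [" agreement", " argument", " animal"]   # leading space matters for RoBERTa BPE

inputs = tok(summary, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = mlm(**inputs).logits[0, mask_pos]
log_probs = torch.log_softmax(logits, dim=-1)

# Assumes each option maps to a single BPE token; multi-token options would
# need their sub-token scores combined.
for opt in options:
    ids = tok.encode(opt, add_special_tokens=False)
    print(opt.strip(), float(log_probs[ids[0]]))
```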
40. FICOBU: Filipino WordNet Construction Using Decision Tree and Language Modeling
- Author
-
Ria A. Sagum, Aldrin D. Ramos, and Monique T. Llanes
- Subjects
Information Systems and Management ,Artificial Intelligence ,business.industry ,Computer science ,Decision tree ,WordNet ,Artificial intelligence ,Language model ,business ,computer.software_genre ,computer ,Natural language processing ,Computer Science Applications - Published
- 2019
- Full Text
- View/download PDF
41. Morphology Matters: A Multilingual Language Modeling Analysis
- Author
-
Han Liu, Lane Schwartz, Kenneth Steimel, Katherine J. Zhang, Hyunji Hayley Park, and Coleman Haley
- Subjects
FOS: Computer and information sciences ,050101 languages & linguistics ,Linguistics and Language ,Bible translations ,Computer science ,Morphology (biology) ,02 engineering and technology ,computer.software_genre ,Artificial Intelligence ,0202 electrical engineering, electronic engineering, information engineering ,0501 psychology and cognitive sciences ,Segmentation ,Computer Science - Computation and Language ,business.industry ,Communication ,05 social sciences ,Computer Science Applications ,Human-Computer Interaction ,020201 artificial intelligence & image processing ,Language model ,Artificial intelligence ,Compiler ,business ,Computation and Language (cs.CL) ,On Language ,computer ,Natural language processing - Abstract
Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically-motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language's morphology on language modeling., To appear in TACL, a pre-MIT Press publication version; 15 pages, 3 figures; for the datasets, see https://github.com/hayleypark/MorphologyMatters
- Published
- 2020
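BPE is one of the segmentation schemes compared in the study above; a minimal sketch with the Hugging Face tokenizers library follows, with a toy corpus and vocabulary size standing in for a Bible translation.
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus standing in for one Bible translation.
corpus = [
    "in the beginning was the word",
    "the word was with god",
    "all things were made by him",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Morphologically rich word forms get split into smaller reusable pieces;
# the surprisal comparison in the paper is computed over such segments.
print(tokenizer.encode("beginnings").tokens)
```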
42. The Go Transformer: Natural Language Modeling for Game Play
- Author
-
David Noever, Josh Kalin, and Matthew Ciolino
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,0209 industrial biotechnology ,Computer Science - Computation and Language ,business.industry ,Computer science ,Deep learning ,ComputingMilieux_PERSONALCOMPUTING ,02 engineering and technology ,Machine Learning (cs.LG) ,Visualization ,020901 industrial engineering & automation ,Human–computer interaction ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Language model ,Artificial intelligence ,Championship ,business ,Computation and Language (cs.CL) ,Natural language ,Transformer (machine learning model) - Abstract
This work applies natural language modeling to generate plausible strategic moves in the ancient game of Go. We train the Generative Pretrained Transformer (GPT-2) to mimic the style of Go champions as archived in Smart Game Format (SGF), which offers a text description of move sequences. The trained model further generates valid but previously unseen strategies for Go. Because GPT-2 preserves punctuation and spacing, the raw output of the text generator provides inputs to game visualization and creative patterns, such as the Sabaki project's game engine using auto-replays. Results demonstrate that language modeling can capture both the sequencing format of championship Go games and their strategic formations. Compared to random game boards, the GPT-2 fine-tuning shows efficient opening move sequences favoring corner play over less advantageous center and side play. Game generation as a language modeling task offers novel approaches to more than 40 other board games where historical text annotation provides training data (e.g., Amazons & Connect 4/6)., 8 Pages, 5 Figures, 1 Table, IEEE Format, Ai4i 2020
- Published
- 2020
- Full Text
- View/download PDF
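Sampling SGF-style continuations from GPT-2 can be sketched with the generate API; the checkpoint below is the generic gpt2 model, so without the paper's fine-tuning on archived championship games the output will not be a valid game record.
```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# An SGF-style opening as the prompt; the paper fine-tunes GPT-2 on archived
# games before sampling continuations like this.
prompt = "(;GM[1]SZ[19];B[pd];W[dp];B[pq];W["
inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0]))
```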
43. Generating Reasonable Legal Text through the Combination of Language Modeling and Question Answering
- Author
-
Shaojun Wang, Xianfeng Liao, Bojin Zhuang, Zhiqiang Xie, Xiao Jing, Jiang Qian, and Weijing Huang
- Subjects
Computer science ,business.industry ,Question answering ,Language model ,Artificial intelligence ,computer.software_genre ,business ,computer ,Natural language processing - Abstract
Thanks to improvements in language modeling, emerging NLP assistant tools for text generation greatly reduce the human workload of writing documents. However, generating legal text poses greater challenges than ordinary text because of its strict requirement that the logic remain sound, which language modeling alone cannot yet guarantee. To generate reasonable legal documents, we propose a novel method, CoLMQA, which (1) combines language modeling and question answering, (2) generates text with slots via language modeling, and (3) fills the slots using our proposed question answering method, Transformer-based Key-Value Memory Networks. In CoLMQA, the slots represent the parts of the text that must be tightly constrained by logic, such as the name of a law or the number of a law article, and question answering fills the slots in context with the help of a legal knowledge base to keep the logic sound. Experiments verify the quality of the legal documents generated by CoLMQA, which surpass documents generated by pure language modeling.
- Published
- 2020
- Full Text
- View/download PDF
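The "generate text with slots, then fill the slots from a knowledge base" idea can be shown with a toy template and lookup table; the slot names, the draft sentence, and the dictionary are invented stand-ins for the language model and the Transformer-based Key-Value Memory Network.
```python
import re

# Stand-in for text produced by the language model, with the logic-critical
# spans left as named slots instead of free-form generation.
draft = ("According to [LAW_NAME], Article [ARTICLE_NO], the defendant "
         "shall compensate the plaintiff for the losses incurred.")

# Stand-in for the key-value legal knowledge base queried by the QA module.
legal_kb = {
    "LAW_NAME": "the Contract Law",
    "ARTICLE_NO": "107",
}

def fill_slots(text, kb):
    # Replace each [SLOT] with its knowledge-base value, leaving unknown slots intact.
    return re.sub(r"\[([A-Z_]+)\]", lambda m: kb.get(m.group(1), m.group(0)), text)

print(fill_slots(draft, legal_kb))
```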
44. Cold-start Active Learning through Self-supervised Language Modeling
- Author
-
Jordan Boyd-Graber, Hsuan-Tien Lin, and Michelle Yuan
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Active learning (machine learning) ,media_common.quotation_subject ,02 engineering and technology ,010501 environmental sciences ,Machine learning ,computer.software_genre ,01 natural sciences ,Machine Learning (cs.LG) ,Cold start ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Proxy (statistics) ,0105 earth and related environmental sciences ,media_common ,Computer Science - Computation and Language ,business.industry ,Surprise ,Active learning ,Artificial intelligence ,Language model ,business ,computer ,Computation and Language (cs.CL) - Abstract
Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can find examples that surprise the model and should be labeled for efficient fine-tuning. Therefore, we treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy in fewer sampling iterations and less computation time., Published in EMNLP 2020
- Published
- 2020
- Full Text
- View/download PDF
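Treating the masked-language-modeling loss as an acquisition score can be sketched as below, using deterministic leave-one-out masking as a simplified stand-in for the paper's strategy; the texts and checkpoint are illustrative.
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

unlabeled = [
    "the quarterly report exceeded expectations",
    "colorless green ideas sleep furiously",
    "please reset my account password",
]

def mlm_surprise(text):
    """Average MLM loss when each token is masked in turn."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    losses = []
    for i in range(1, len(ids) - 1):                  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        labels = torch.full_like(ids, -100)           # -100 = ignored positions
        labels[i] = ids[i]
        with torch.no_grad():
            out = mlm(input_ids=masked.unsqueeze(0), labels=labels.unsqueeze(0))
        losses.append(float(out.loss))
    return sum(losses) / len(losses)

# Label the examples the pretrained model finds most surprising first.
scores = {text: mlm_surprise(text) for text in unlabeled}
for text in sorted(scores, key=scores.get, reverse=True):
    print(f"{scores[text]:.2f}  {text}")
```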
45. Natural Language Processing Methods for Language Modeling
- Author
-
Nemeskey, Dávid Márk
- Subjects
Computer science ,business.industry ,Hungarian Language ,computer.software_genre ,Lexicographical order ,Pipeline (software) ,Set (abstract data type) ,Software ,Language model ,Artificial intelligence ,F1 score ,business ,computer ,Word (computer architecture) ,Natural language processing - Abstract
In this thesis we concentrated on two issues: how modern NLP (or traditional linguistic) techniques can improve language modeling, and how to improve the state of the art for Hungarian. Chapter 2 highlighted the problems of word-based language modeling for Hungarian and introduced the “gluten-free” format, a morphological segmentation algorithm that alleviates the adverse effects of the overabundance of word forms in the language. Chapter 3 proposed a novel method for evaluating multi-sense embeddings based on lexicographical resources. Chapter 5 gave an example for the opposite direction, when language models are used to improve the performance of an NLP system. We improved on the state of the art in Hungarian language modeling in several ways. First, we presented a set of language modeling benchmarks on three Hungarian corpora in Chapter 2. A preprocessed version of the Hungarian Webcorpus has been released to serve as a standard dataset for language model assessment. A new Hungarian corpus has been created in Chapter 4. Webcorpus 2.0 was compiled from Hungarian pages in the Common Crawl and the Hungarian Wikipedia. At 9 billion tokens, it is 3.5 times the size of the previous largest (commercial) corpus, and can serve as training data for large-scale language models. Finally, the emBERT module developed in Chapter 5 enables the integration of modern contextualized embedding-based classifiers into the e-magyar pipeline. Based on our preliminary Hungarian BERT model, the NP chunker outperforms the previous best system by 2.8% in F1 score. All resources (corpora, models and software alike) presented in the thesis are freely downloadable under permissive licenses.
- Published
- 2020
- Full Text
- View/download PDF
46. On the Inductive Bias of Masked Language Modeling: From Statistical to Syntactic Dependencies
- Author
-
Tianyi Zhang and Tatsunori Hashimoto
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer science ,Computer Science - Artificial Intelligence ,02 engineering and technology ,Minimum spanning tree ,Lexicon ,computer.software_genre ,01 natural sciences ,Machine Learning (cs.LG) ,010104 statistics & probability ,0202 electrical engineering, electronic engineering, information engineering ,Graphical model ,0101 mathematics ,Structure (mathematical logic) ,Parsing ,Computer Science - Computation and Language ,business.industry ,Inductive bias ,Construct (python library) ,Artificial Intelligence (cs.AI) ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language model ,business ,computer ,Computation and Language (cs.CL) ,Natural language processing - Abstract
We study how masking and predicting tokens in an unsupervised fashion can give rise to linguistic structures and downstream performance gains. Recent theories have suggested that pretrained language models acquire useful inductive biases through masks that implicitly act as cloze reductions for downstream tasks. While appealing, we show that the success of the random masking strategy used in practice cannot be explained by such cloze-like masks alone. We construct cloze-like masks using task-specific lexicons for three different classification datasets and show that the majority of pretrained performance gains come from generic masks that are not associated with the lexicon. To explain the empirical success of these generic masks, we demonstrate a correspondence between the Masked Language Model (MLM) objective and existing methods for learning statistical dependencies in graphical models. Using this, we derive a method for extracting these learned statistical dependencies in MLMs and show that these dependencies encode useful inductive biases in the form of syntactic structures. In an unsupervised parsing evaluation, simply forming a minimum spanning tree on the implied statistical dependence structure outperforms a classic method for unsupervised parsing (58.74 vs. 55.91 UUAS)., NAACL-HLT 2021
- Published
- 2021
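The final step described above, turning pairwise statistical-dependence scores into an unsupervised parse via a maximum spanning tree, can be sketched as follows; the score matrix here is random, whereas the paper extracts it from how masking one token shifts the MLM's predictions for another.
```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

words = ["the", "cat", "sat", "on", "the", "mat"]

# Made-up symmetric dependence scores between word positions; in the paper
# these come from perturbing the masked language model.
rng = np.random.default_rng(0)
scores = rng.random((len(words), len(words)))
scores = (scores + scores.T) / 2
np.fill_diagonal(scores, 0.0)

# Maximum spanning tree == minimum spanning tree on negated weights.
mst = minimum_spanning_tree(-scores).toarray()
for i, j in zip(*np.nonzero(mst)):
    print(f"{words[i]:>4s} -- {words[j]}  (score {scores[i, j]:.2f})")
```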
47. Segmented Regression via the Shape Language Modeling for Multi-Slope Path-Loss Modeling
- Author
-
Dayan A. Guimarães
- Subjects
business.industry ,Computer science ,Path loss ,Pattern recognition ,Language model ,Artificial intelligence ,Segmented regression ,business - Published
- 2021
- Full Text
- View/download PDF
48. UDALM: Unsupervised Domain Adaptation through Language Modeling
- Author
-
Georgios Paraskevopoulos, Alexandros Potamianos, and Constantinos Karouzos
- Subjects
FOS: Computer and information sciences ,Domain adaptation ,Computer Science - Computation and Language ,Computer science ,business.industry ,Sample (statistics) ,02 engineering and technology ,010501 environmental sciences ,Machine learning ,computer.software_genre ,01 natural sciences ,Domain (software engineering) ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Language model ,Artificial intelligence ,business ,Computation and Language (cs.CL) ,computer ,0105 earth and related environmental sciences - Abstract
In this work we explore Unsupervised Domain Adaptation (UDA) of pretrained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure, using a mixed classification and Masked Language Model loss, that can adapt to the target domain distribution in a robust and sample efficient manner. Our experiments show that performance of models trained with the mixed loss scales with the amount of available target data and the mixed loss can be effectively used as a stopping criterion during UDA training. Furthermore, we discuss the relationship between A-distance and the target error and explore some limitations of the Domain Adversarial Training approach. Our method is evaluated on twelve domain pairs of the Amazon Reviews Sentiment dataset, yielding $91.74\%$ accuracy, which is a $1.11\%$ absolute improvement over the state-of-the-art., Accepted for publication in 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)
- Published
- 2021
- Full Text
- View/download PDF
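The mixed classification + MLM objective can be sketched with a shared toy encoder and two heads; the architecture, the 0.5 mixing weight, and the random tensors are placeholders rather than the UDALM implementation.
```python
import torch
import torch.nn as nn

class MixedLossModel(nn.Module):
    """Shared encoder with a classification head (labeled source data)
    and an MLM head (unlabeled target-domain data), trained jointly."""
    def __init__(self, vocab=1000, dim=64, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.cls_head = nn.Linear(dim, num_classes)
        self.mlm_head = nn.Linear(dim, vocab)

    def forward(self, src_ids, src_labels, tgt_ids, tgt_mlm_labels, alpha=0.5):
        src_h = self.encoder(self.emb(src_ids))
        tgt_h = self.encoder(self.emb(tgt_ids))
        cls_loss = nn.functional.cross_entropy(self.cls_head(src_h[:, 0]), src_labels)
        mlm_loss = nn.functional.cross_entropy(
            self.mlm_head(tgt_h).flatten(0, 1), tgt_mlm_labels.flatten(),
            ignore_index=-100,                       # only masked positions contribute
        )
        return alpha * cls_loss + (1 - alpha) * mlm_loss

model = MixedLossModel()
loss = model(
    src_ids=torch.randint(0, 1000, (8, 16)),
    src_labels=torch.randint(0, 2, (8,)),
    tgt_ids=torch.randint(0, 1000, (8, 16)),
    tgt_mlm_labels=torch.full((8, 16), -100).index_fill_(1, torch.tensor([3, 7]), 42),
)
loss.backward()
```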
49. Towards Zero-shot Language Modeling
- Author
-
Roi Reichart, Ivan Vulić, Edoardo Maria Ponti, Anna Korhonen, and Ryan Cotterell
- Subjects
FOS: Computer and information sciences ,Computer Science - Computation and Language ,business.industry ,Computer science ,Sample (statistics) ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Task (project management) ,Zero (linguistics) ,Language technology ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Language modelling ,Language model ,Artificial intelligence ,business ,Construct (philosophy) ,computer ,Computation and Language (cs.CL) ,Natural language processing ,Scope (computer science) ,0105 earth and related environmental sciences - Abstract
Can we construct a neural language model which is inductively biased towards learning human language? Motivated by this question, we aim at constructing an informative prior for held-out languages on the task of character-level, open-vocabulary language modelling. We obtain this prior as the posterior over network weights conditioned on the data from a sample of training languages, which is approximated through Laplace’s method. Based on a large and diverse sample of languages, the use of our prior outperforms baseline models with an uninformative prior in both zero-shot and few-shot settings, showing that the prior is imbued with universal linguistic knowledge. Moreover, we harness broad language-specific information available for most languages of the world, i.e., features from typological databases, as distant supervision for held-out languages. We explore several language modelling conditioning techniques, including concatenation and meta-networks for parameter generation. They appear beneficial in the few-shot setting, but ineffective in the zero-shot setting. Since the paucity of even plain digital text affects the majority of the world’s languages, we hope that these insights will broaden the scope of applications for language technology.
- Published
- 2021
- Full Text
- View/download PDF
50. 6VecLM: Language Modeling in Vector Space for IPv6 Target Generation
- Author
-
Gaopeng Gou, Gang Xiong, Tianyu Cui, Wei Xia, and Junzheng Shi
- Subjects
Sequence ,IPv6 address ,Address space ,Computer science ,business.industry ,Deep learning ,020206 networking & telecommunications ,02 engineering and technology ,Machine learning ,computer.software_genre ,Field (computer science) ,020204 information systems ,0202 electrical engineering, electronic engineering, information engineering ,Language model ,Artificial intelligence ,business ,computer ,Word (computer architecture) ,Transformer (machine learning model) - Abstract
Fast IPv6 scanning is challenging in the field of network measurement, as it requires exploring the whole IPv6 address space but is limited by current computational power. Researchers propose to obtain possible active target candidate sets to probe by algorithmically analyzing active seed sets. However, IPv6 addresses lack semantic information and follow numerous addressing schemes, making it difficult to design effective algorithms. In this paper, we introduce 6VecLM, our approach to building such target generation algorithms. The architecture maps addresses into a vector space to interpret semantic relationships and uses a Transformer network to build IPv6 language models for predicting address sequences. Experiments indicate that our approach can perform semantic classification of the address space. By adding a new generation approach, our model possesses a controllable word-innovation capability compared to conventional language models. The work outperforms state-of-the-art target generation algorithms on two active address datasets by producing higher-quality candidate sets.
- Published
- 2021
- Full Text
- View/download PDF
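The "map addresses into a vector space" idea can be illustrated by splitting exploded IPv6 addresses into position-tagged nybble words and embedding them with gensim's Word2Vec; this swaps in a skip-gram model purely for illustration (6VecLM itself uses a Transformer-based model), and the seed addresses are documentation examples.
```python
import ipaddress
from gensim.models import Word2Vec

seeds = [
    "2001:db8::1",
    "2001:db8::53",
    "2001:db8:0:1::80",
]

def to_words(addr):
    """Expand an IPv6 address and turn each nybble into a position-tagged word."""
    nybbles = ipaddress.ip_address(addr).exploded.replace(":", "")
    return [f"{i}_{c}" for i, c in enumerate(nybbles)]      # 32 words per address

sentences = [to_words(a) for a in seeds]
model = Word2Vec(sentences, vector_size=16, window=5, min_count=1, sg=1, epochs=50)

# Nearby vectors hint at nybble values that play similar roles across
# positions in the seed set.
print(model.wv.most_similar("31_1", topn=3))
```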