Generation of Cross-Lingual Word Vectors for Low-Resourced Languages Using Deep Learning and Topological Metrics in a Data-Efficient Way
- Source :
- Electronics, Volume 10, Issue 12, Article 1372 (2021)
- Publication Year :
- 2021
- Publisher :
- MDPI AG, 2021.
Abstract
- Linguists have long focused on qualitative comparison of semantics across languages. Evaluating semantic interpretation for a disparate language pair such as English and Tamil is an even more formidable task than for Slavic languages. Word embeddings in Natural Language Processing (NLP) offer an opportunity to quantify linguistic semantics. Multi-lingual tasks can be performed by projecting the word embeddings of one language onto the semantic space of the other. This research presents a suite of data-efficient deep learning approaches to deduce the transfer function from the embedding space of English to that of Tamil, deploying three popular embedding algorithms: Word2Vec, GloVe and FastText. A novel evaluation paradigm was devised to assess the effectiveness of the generated embeddings, using the original embeddings as ground truths. Transferability of the proposed model to other target languages was assessed via pre-trained Word2Vec embeddings for Hindi and Chinese. We show empirically that with a bilingual dictionary of a thousand words and a corresponding small monolingual target (Tamil) corpus, useful embeddings can be generated by transfer learning from a well-trained source (English) embedding. Furthermore, we demonstrate the usability of the generated target embeddings in several NLP use-case tasks, such as text summarization, part-of-speech (POS) tagging, and bilingual dictionary induction (BDI); these are not the only possible applications.
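The projection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' deep learning model: it swaps in a plain linear map fitted by gradient descent, and every vector, word pair, and function name below (`fit_linear_map`, `translate`, the toy 2-D embeddings) is a hypothetical stand-in chosen for illustration. It fits a map from a source space to a target space using a tiny seed dictionary, then performs bilingual dictionary induction for a held-out word by nearest-neighbour search.

```python
import math

def fit_linear_map(src_vecs, tgt_vecs, lr=0.1, epochs=2000):
    """Fit W minimising the mean squared error ||W x - y||^2 by batch gradient descent."""
    d_src, d_tgt = len(src_vecs[0]), len(tgt_vecs[0])
    n = len(src_vecs)
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(epochs):
        grad = [[0.0] * d_src for _ in range(d_tgt)]
        for x, y in zip(src_vecs, tgt_vecs):
            pred = [sum(W[i][j] * x[j] for j in range(d_src)) for i in range(d_tgt)]
            for i in range(d_tgt):
                err = pred[i] - y[i]
                for j in range(d_src):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(d_tgt):
            for j in range(d_src):
                W[i][j] -= lr * grad[i][j]
    return W

def project(W, x):
    """Map a source-space vector into the target space."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(v * v for v in a))
    nb = math.sqrt(sum(v * v for v in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(a, b)) / (na * nb)

# Toy 2-D embeddings (assumed values, not real Word2Vec/GloVe/FastText vectors).
source = {"king": [1.0, 0.0], "queen": [0.0, 1.0], "water": [1.0, 1.0], "tree": [2.0, 1.0]}
target = {"arasan": [0.0, 1.0], "arasi": [-1.0, 0.0], "thanneer": [-1.0, 1.0], "maram": [-1.0, 2.0]}

# Seed bilingual dictionary; "tree" is held out to test induction.
seed_pairs = [("king", "arasan"), ("queen", "arasi"), ("water", "thanneer")]

W = fit_linear_map([source[s] for s, _ in seed_pairs],
                   [target[t] for _, t in seed_pairs])

def translate(word):
    """Project a source word and return the nearest target word by cosine similarity."""
    mapped = project(W, source[word])
    return max(target, key=lambda t: cosine(mapped, target[t]))

print(translate("tree"))
```

In this toy setup the target space is a rotation of the source space, so a linear map recovers it exactly from three word pairs. Real embedding spaces are only approximately aligned, which is one reason a non-linear model and a larger seed dictionary, such as the thousand-word dictionary mentioned in the abstract, may be needed.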
- Subjects :
- Word embedding
TK7800-8360
Computer Networks and Communications
Computer science
Semantic interpretation
bilingual embedding
02 engineering and technology
transfer learning
Semantics
ontology engineering
topological measures
cross-lingual embedding
0202 electrical engineering, electronic engineering, information engineering
Word2vec
Electrical and Electronic Engineering
Bilingual dictionary
Deep learning
English–Tamil
020206 networking & telecommunications
Automatic summarization
low resourced languages
Hardware and Architecture
Control and Systems Engineering
Tamil
Signal Processing
020201 artificial intelligence & image processing
Artificial intelligence
Electronics
Natural language processing
Details
- Language :
- English
- ISSN :
- 20799292
- Volume :
- 10
- Issue :
- 12
- Database :
- OpenAIRE
- Journal :
- Electronics
- Accession number :
- edsair.doi.dedup.....c74f7b8261e6d9b3dbd5f97e98792a7e