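MORPHOLOGICAL EVALUATION OF THE SUBWORD VOCABULARIES USED BY LARGE LANGUAGE MODELS [original title: EVALUACIÓN MORFOLÓGICA DE LOS VOCABULARIOS DE SUBPALABRAS UTILIZADOS POR LOS GRANDES MODELOS DE LENGUAJE].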
- Source :
- Revista Espanola de Linguistica. Jan-Jun 2024, Vol. 54 Issue 1, p103-129. 27p.
- Publication Year :
- 2024
Abstract
- Statistical segmentation algorithms have replaced traditional tokenization methods based on linguistic rules. Although these algorithms are more efficient and can build subword vocabularies from large corpora without human supervision, the resulting subwords do not consistently correspond to morphemes. This paper addresses the issue by proposing an evaluation methodology and applying it to assess the morphological quality of the Spanish vocabularies produced by three prominent subword tokenization algorithms commonly used in Large Language Models (LLMs): BPE, WordPiece, and Unigram. Three gold standards were created to measure the relevance, coherence, and morphological accuracy of the vocabularies of six tokenizers trained on a Spanish corpus, exploring different vocabulary sizes. The evaluation results indicate that none of the three algorithms is suitable for accurately representing Spanish morphology. [ABSTRACT FROM AUTHOR]
- Subjects :
- *LANGUAGE models
- *SPANISH language
- *EVALUATION methodology
- *VOCABULARY
- *MORPHEMICS
Details
- Language :
- Spanish
- ISSN :
- 02101874
- Volume :
- 54
- Issue :
- 1
- Database :
- Academic Search Index
- Journal :
- Revista Espanola de Linguistica
- Publication Type :
- Academic Journal
- Accession number :
- 178704907
- Full Text :
- https://doi.org/10.31810/rsel.54.1.4