EVALUACIÓN MORFOLÓGICA DE LOS VOCABULARIOS DE SUBPALABRAS UTILIZADOS POR LOS GRANDES MODELOS DE LENGUAJE. [Morphological evaluation of the subword vocabularies used by large language models.]

Authors :
GARCÍA-SIERRA, Óscar
FERNÁNDEZ-PAMPILLÓN CESTEROS, Ana
ORTEGA-MARTÍN, Miguel
Source :
Revista Española de Lingüística. Jan-Jun 2024, Vol. 54 Issue 1, p103-129. 27p.
Publication Year :
2024

Abstract

Traditional tokenization methods based on linguistic rules have been replaced by statistical segmentation algorithms. Although these algorithms are more efficient and can build subword vocabularies from large corpora without human supervision, the resulting subwords do not consistently correspond to morphemes. This paper addresses the issue by proposing an evaluation methodology and applying it to the morphological quality of the Spanish vocabularies produced by three prominent subword tokenization algorithms (BPE, WordPiece, and Unigram) commonly used in Large Language Models (LLMs). Three gold standards were created to measure the relevance, coherence, and morphological accuracy of the vocabularies of six tokenizers trained on a Spanish corpus, exploring different vocabulary sizes. The evaluation results indicate that none of the three algorithms is suitable for accurately representing Spanish morphology. [ABSTRACT FROM AUTHOR]

Details

Language :
Spanish
ISSN :
0210-1874
Volume :
54
Issue :
1
Database :
Academic Search Index
Journal :
Revista Española de Lingüística
Publication Type :
Academic Journal
Accession number :
178704907
Full Text :
https://doi.org/10.31810/rsel.54.1.4