HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology.

Authors :
Rigouts Terryn, Ayla
Hoste, Véronique
Lefever, Els
Source :
Terminology; 2021, Vol. 27 Issue 2, p254-293, 40p
Publication Year :
2021

Abstract

Automatic term extraction (ATE) is an important task within natural language processing, both on its own and as a preprocessing step for other tasks. In recent years, research has moved far beyond the traditional hybrid approach where candidate terms are extracted based on part-of-speech patterns and filtered and sorted with statistical termhood and unithood measures. While there has been an explosion of different types of features and algorithms, including machine learning methodologies, some of the fundamental problems remain unsolved, such as the ambiguous nature of the concept "term". This has been a hurdle in the creation of data for ATE, meaning that datasets for both training and testing are scarce, and system evaluations are often limited and rarely cover multiple languages and domains. The ACTER corpora (Annotated Corpora for Term Extraction Research) contain manual term annotations in four domains and three languages and have been used to investigate a supervised machine learning approach for ATE, using a binary random forest classifier with multiple types of features. The resulting system (HAMLET: Hybrid Adaptable Machine Learning approach to Extract Terminology) provides detailed insights into its strengths and weaknesses. It highlights a certain unpredictability as an important drawback of machine learning methodologies, but also shows how the system appears to have learnt a robust definition of terms, producing state-of-the-art results that contain few errors that are not (part of) terms in any way. Both the amount and the relevance of the training data have a substantial effect on results, and by varying the training data, it appears to be possible to adapt the system to various desired outputs, e.g., different types of terms. While certain issues remain difficult – such as the extraction of rare terms and multiword terms – this study shows how supervised machine learning is a promising methodology for ATE. [ABSTRACT FROM AUTHOR]

Details

Language :
English
ISSN :
0929-9971
Volume :
27
Issue :
2
Database :
Complementary Index
Journal :
Terminology
Publication Type :
Academic Journal
Accession number :
152889752
Full Text :
https://doi.org/10.1075/term.20017.rig