Semi-automatic extraction of multiword terms from domain-specific corpora
- Source :
- The Electronic Library, Emerald Publishing, 2018, 36 (3), pp. 550-567. ⟨10.1108/EL-06-2017-0128⟩
- Publication Year :
- 2018
- Publisher :
- Emerald, 2018.
Abstract
- Purpose: A hybrid approach is presented that combines linguistic and statistical information to semi-automatically extract multiword term candidates from texts.
Design/methodology/approach: The method is designed to be domain- and language-independent, with a focus on languages with rich morphology. As a use case, it is applied here to extract multiword terms from Serbian texts in the agricultural engineering domain. Multiword terms were described by predefined syntactic structures. For each structure, a finite-state transducer was developed that recognizes text sequences having that structure and outputs each sequence in a normalized form, so that different inflectional forms of the same multiword term are counted together. Term candidates were then filtered by frequency and evaluated by two domain experts.
Findings: Using language resources such as electronic dictionaries and grammars, 1,523 multiword term candidates were recognized in a corpus containing 42,260 different simple word forms, and 928 of them were accepted as multiword terms. Of these, 870 were new, that is, not already contained in the existing electronic dictionary of compounds for Serbian, and they were used to enrich that dictionary.
Originality/value: The paper presents a methodology that can contribute significantly to the development of terminology lexicons in different areas. In this use case, important agricultural engineering concepts were extracted from the text, but the approach could be used for other domains and languages as well.
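The pipeline described in the abstract (match a syntactic structure, emit a normalized form, count, filter by frequency) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the paper uses finite-state transducers and Serbian electronic dictionaries, whereas the sketch below assumes a pre-tagged English toy corpus, matches part-of-speech patterns with a sliding window, and approximates normalization by joining lemmas. The names `TOKENS`, `PATTERNS`, `extract_candidates`, and `filter_by_frequency` are hypothetical.

```python
from collections import Counter

# Hypothetical toy input: (surface form, lemma, part-of-speech) triples.
# A real system would obtain these from electronic dictionaries.
TOKENS = [
    ("irrigation", "irrigation", "N"),
    ("systems", "system", "N"),
    ("use", "use", "V"),
    ("drip", "drip", "N"),
    ("irrigation", "irrigation", "N"),
    ("system", "system", "N"),
]

# Assumed syntactic structures: adjective + noun, noun + noun.
PATTERNS = [("A", "N"), ("N", "N")]


def extract_candidates(tokens, patterns):
    """Count normalized multiword candidates matching any POS pattern."""
    counts = Counter()
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if tuple(pos for _, _, pos in window) == pattern:
                # Simplified normalization: join the lemmas so that
                # inflected variants collapse onto one candidate.
                normalized = " ".join(lemma for _, lemma, _ in window)
                counts[normalized] += 1
    return counts


def filter_by_frequency(counts, min_freq):
    """Keep only candidates seen at least min_freq times."""
    return {term: c for term, c in counts.items() if c >= min_freq}


counts = extract_candidates(TOKENS, PATTERNS)
frequent = filter_by_frequency(counts, min_freq=2)
```

Here "irrigation systems" and "irrigation system" collapse onto one candidate, which is the purpose of normalization. In a morphologically rich language, joining lemmas is not sufficient, because components such as adjectives must agree with the head noun in the normalized form. For scale, the figures reported in the abstract correspond to 928 / 1,523 ≈ 60.9% of candidates accepted as terms, and 870 / 928 = 93.75% of accepted terms being new to the dictionary.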
- Subjects :
- Unitex
Computer science
Multiword expression
Foreign languages
Data analysis
02 engineering and technology
Library and Information Sciences
computer.software_genre
Hybrid language processing
[INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL]
Terminology
Domain (software engineering)
Digital documents
Data retrieval
Rule-based machine translation
Electronic dictionary
Digital document
0202 electrical engineering, electronic engineering, information engineering
Information retrieval
Evaluation
Term extraction
Document handling
060201 languages & linguistics
Structure (mathematical logic)
Hybrid NLP
business.industry
NLP dictionary
06 humanities and the arts
E-dictionary
Term (logic)
Computer Science Applications
Data processing
[INFO.INFO-TT]Computer Science [cs]/Document and Text Processing
0602 languages and literature
020201 artificial intelligence & image processing
Artificial intelligence
business
computer
Natural language processing
Word (computer architecture)
Details
- ISSN :
- 0264-0473
- Volume :
- 36
- Database :
- OpenAIRE
- Journal :
- The Electronic Library
- Accession number :
- edsair.doi.dedup.....4d96240d7a595594ac1d4e6eef9a8513
- Full Text :
- https://doi.org/10.1108/el-06-2017-0128