Towards Automatic Definition Extraction for Serbian

Authors :: Stanković, R.
Krstev, Cvetana
Stijović, R.
Gočanin, M.
Škorić, M.
Gavriilidou, Z., Mitits, L., Kiosses, S.
Source :: Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Scopus-Elsevier
Publication Year :: 2021
Publisher :: Democritus University of Thrace, 2021.
Abstract: The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in Serbian, with the aim of accelerating dictionary development. Definitions in the dictionary of the Serbian Academy of Sciences and Arts (SASA) were used to model different definition types (descriptive, grammatical, reference-based and synonym-based), which have different syntactic and lexical features. The research corpus consists of 61,213 definitions of nouns, which were analysed using Serbian morphological e-dictionaries and local grammars implemented as finite-state transducers in the open-source corpus processing suite Unitex. The 21 models developed so far cover 57% of the dictionary definitions, 83% of which were fully recognized. The analysis has shown that many definitions have a structure that can be modelled, as evidenced by the statistics of definitions grouped by type. These models were used to retrieve noun definitions from a 1.4-million-word corpus of 25 primary and secondary school textbooks covering various domains. The results obtained were analysed in detail, and guidelines are given for their improvement.

Details

Database :: OpenAIRE
Journal :: Proceedings of the XIX EURALEX Congress of the European Association for Lexicography: Lexicography for Inclusion (Volume 2). 7-9 September (virtual), Scopus-Elsevier
Accession number :: edsair.dedup.wf.001..d78e821a6aece20b5bf259fd800b8bf5