A random walk on an ontology: Using thesaurus structure for automatic subject indexing
- Source :
- Journal of the American Society for Information Science and Technology. 64:1330-1344
- Publication Year :
- 2013
- Publisher :
- Wiley, 2013.
Abstract
- Relationships between terms and features are an essential component of thesauri, ontologies, and a range of controlled vocabularies. In this article, we describe ways to identify important concepts in documents using the relationships in a thesaurus or other vocabulary structures. We introduce a methodology for the analysis and modeling of the indexing process based on a weighted random walk algorithm. The primary goal of this research is the analysis of the contribution of thesaurus structure to the indexing process. The resulting models are evaluated in the context of automatic subject indexing using four collections of documents pre-indexed with four different thesauri (AGROVOC [UN Food and Agriculture Organization], high-energy physics taxonomy [HEP], National Agricultural Library Thesaurus [NALT], and Medical Subject Headings [MeSH]). We also introduce a thesaurus-centric matching algorithm intended to improve the quality of candidate concepts. In all cases, the weighted random walk improves automatic indexing performance over matching alone, with an increase in average precision (AP) of 9% for HEP, 11% for MeSH, 35% for NALT, and 37% for AGROVOC. The results of the analysis support our hypothesis that subject indexing is in part a browsing process, and that using the vocabulary and its structure in a thesaurus contributes to the indexing process. The amount that the vocabulary structure contributes was found to differ among the four thesauri, possibly due to the vocabulary used in the corresponding thesauri and the structural relationships between the terms. Each of the thesauri and the manual indexing associated with it is characterized using the methods developed here.
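The abstract does not give the details of the weighted random walk, so the following is only a minimal sketch of the general idea, assuming a random walk with restart over a small concept graph. The function name `weighted_random_walk`, the toy concepts, the edge weights, and the `restart` parameter are all hypothetical and are not taken from the article. Concepts matched in a document act as seeds, and the walk spreads their score along thesaurus relationships so that structurally related concepts also receive weight.

```python
from collections import defaultdict


def weighted_random_walk(edges, seeds, restart=0.15, iterations=50):
    """Random walk with restart over a weighted, undirected concept graph.

    edges: dict mapping (concept_a, concept_b) -> relationship weight
    seeds: dict mapping concept -> match score from the document text
    Returns a dict mapping each concept to its stationary-like score.
    """
    # Build a symmetric adjacency map from the relationship list.
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    nodes = set(neighbors) | set(seeds)
    total = sum(seeds.values())
    # Restart distribution: proportional to the seed match scores.
    start = {n: seeds.get(n, 0.0) / total for n in nodes}
    scores = dict(start)

    for _ in range(iterations):
        # With probability `restart`, the walker jumps back to a seed.
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            out = neighbors.get(n)
            if not out:
                # Isolated concept: send its mass back to the seeds.
                for m in nodes:
                    nxt[m] += (1 - restart) * scores[n] * start[m]
                continue
            norm = sum(out.values())
            # Otherwise, follow a relationship in proportion to its weight.
            for m, weight in out.items():
                nxt[m] += (1 - restart) * scores[n] * weight / norm
        scores = nxt
    return scores


# Hypothetical toy vocabulary: hierarchical links weighted 1.0,
# an associative ("related term") link weighted 0.5.
EDGES = {
    ("crops", "cereals"): 1.0,
    ("cereals", "maize"): 1.0,
    ("cereals", "wheat"): 1.0,
    ("maize", "corn oil"): 0.5,
}
# Concepts matched in a document, with illustrative match counts.
SEEDS = {"maize": 2.0, "wheat": 1.0}

scores = weighted_random_walk(EDGES, SEEDS)
for concept, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{concept}: {score:.3f}")
```

In this sketch, "cereals" receives a nonzero score even though it was never matched in the text, which illustrates how vocabulary structure could surface candidate index terms beyond matching alone.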
- Subjects :
- Vocabulary
Computer Networks and Communications
Computer science
InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL
Context (language use)
Ontology (information science)
Artificial Intelligence
Controlled vocabulary
Index term
Thesaurus (information retrieval)
Information retrieval
Subject indexing
Search engine indexing
Human-Computer Interaction
Automatic indexing
Ontology
Artificial intelligence
Software
Natural language processing
Information Systems
Details
- ISSN :
- 1532-2882
- Volume :
- 64
- Database :
- OpenAIRE
- Journal :
- Journal of the American Society for Information Science and Technology
- Accession number :
- edsair.doi...........1be2f5fe03b4d803d839db9c8087d938
- Full Text :
- https://doi.org/10.1002/asi.22853