Start Over

Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Authors :: Seifollahi, S
Piccardi, M
Publication Year :: 2023
Publisher :: Springer Nature, 2023.
Abstract: Extracting meaningful features from documents can prove critical for a variety of tasks such as classification, clustering and semantic analysis. However, traditional approaches to document feature extraction mainly rely on first-order word statistics that are very high dimensional and do not capture well the semantic of the documents. For this reason, in this paper we present a novel approach that extracts document features based on a combination of a constructed word taxonomy and a word embedding in vector space. The feature extraction consists of three main steps: first, a word embedding technique is used to map all the words in the vocabulary onto a vector space. Second, the words in the vocabulary are organised into a hierarchy of clusters (word clusters) by using k-means hierarchically. Lastly, the individual documents are projected onto the word clusters based on a predefined set of keywords, leading to a compact representation as a mixture of keywords. The extracted features can be used for a number of tasks including document classification and clustering as well as semantic analysis of the documents generated by specific individuals over time. For the experiments, we have employed a dataset of transcripts of phone calls between claim managers and clients collected by the Transport Accident Commission of the Victorian Government. The experimental results show that the proposed approach has been capable of achieving comparable or higher accuracy than conventional feature extraction approaches and with a much more compact representation.

Subjects :: Artificial Intelligence & Image Processing

Details

Language :: English
Database :: OpenAIRE
Accession number :: edsair.od.......363..6871f1d5acdb5f33c7a4429c11a97a3c

Tools

Email
Cite

Printer

Authors Abstract Subjects Details

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Abstract

Subjects

Details

Tools

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Taxonomy-Based Feature Extraction for Document Classification, Clustering and Semantic Analysis

Abstract

Subjects

Details

Tools

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources