1. Training Data Size Requirements for Topic Classification in a Speech-Oriented Guidance System
- Author
Torres, Rafael; Kawanami, Hiromichi; Matsui, Tomoko; Saruwatari, Hiroshi; and Shikano, Kiyohiro
- Abstract
In this work, we address the topic classification of Japanese utterances received by a speech-oriented guidance system operating in a real environment. Implementing this kind of system requires collecting and manually labeling actual user utterances, which is a costly process. We are therefore interested in evaluating how the amount of training data influences topic classification. To this end, we compared the performance of a Support Vector Machine and a Maximum Entropy classifier using training data of different sizes. We used actual data collected from adults and children by the speech-oriented guidance system Takemaru-kun, and also evaluated the effect of automatic speech recognition (ASR) errors on classification performance. To deal with the shortness of the utterances, we proposed using characters as features, which is possible in Japanese due to the presence of kanji, ideograms derived from Chinese characters that represent not only sound but also meaning. Experimental results show an average performance decrease of 4.6% for ASR results of utterances from adults, and 2.8% for children, when the training data is reduced to 25% of its original size; and a classification performance improvement from 92.2% to 94.1% for adults and from 87.2% to 88.3% for children when using characters as features instead of words.
APSIPA Annual Summit and Conference 2010, December 14-17, 2010, Biopolis, Singapore.
- Published
- 2023