Discriminative Keyword Spotting for limited-data applications

Authors :
Hadas Benisty
David Malah
Itamar Katz
Koby Crammer
Source :
Speech Communication. 99:1-11
Publication Year :
2018
Publisher :
Elsevier BV, 2018.

Abstract

Mobile devices are widely used around the world, frequently by people speaking local languages or dialects that are not well documented. For these languages, it might not be beneficial for commercial companies to develop Automatic Speech Recognition (ASR) systems, so users of these languages cannot utilize the voice activation features of their devices, which often rely on Keyword Spotting (KWS). Standard KWS methods aim to statistically model the generation process of the speech signal, requiring hours of recorded and transcribed speech for training, and are therefore not adequate for limited-data scenarios. In this paper we propose a new KWS method, suitable for limited-data scenarios, which can be easily applied by developers. The proposed method uses a new histogram representation for words, obtained with respect to a pre-trained Gaussian Mixture Model (GMM). Sentences are represented by fixed-length global feature vectors, extracted from the response curves obtained by a word classifier. Word and sentence classifiers are trained using a discriminative approach, which is typically robust to training-set size. The dataset for training the GMM is easy to obtain, since no annotation is required. We compared the proposed system to a Hidden Markov Model (HMM) based system, trained under the same low-resource conditions as ours, and to a state-of-the-art ASR system, trained either in the limited-data scenario or using many hours of recorded speech. In the limited-data scenario, our system performs better than both benchmarks in all experiments except for clean speech of children (CSLU dataset), where it performs as well as the HMM. Since the ASR benchmark performs poorly without enough training data, we also trained it without limiting the available data. In this case the ASR benchmark performs better when tested on speech of adults (TED-LIUM dataset of TED lectures) for all noise conditions, and our system performs better when tested on speech of children with low to moderate SNR values. The results demonstrate the advantages of the proposed system, and the conditions under which it performs better.
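
The abstract does not spell out how the word histogram is built, so the following is only a minimal sketch of one plausible reading: each feature frame of a word is assigned to its most likely component of a pre-trained diagonal-covariance GMM, and the normalized assignment counts form a fixed-length vector regardless of word duration. The hard-assignment rule, the function names (`log_gaussian_diag`, `gmm_histogram`), and the toy parameters are all assumptions made for illustration, not the authors' published procedure.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_histogram(frames, weights, means, variances):
    """Normalized histogram of most-likely GMM components over a word's frames.

    Assumption: hard assignment per frame; the paper may define the
    histogram differently (e.g. with soft posteriors).
    """
    counts = [0] * len(weights)
    for x in frames:
        scores = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        counts[scores.index(max(scores))] += 1
    total = len(frames)
    return [c / total for c in counts] if total else [0.0] * len(counts)

# Toy 2-component GMM over 2-D features (illustrative values only).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [4.8, 5.1], [5.2, 4.9], [0.3, 0.0]]
hist = gmm_histogram(frames, weights, means, variances)
```

The output length equals the number of GMM components, so words of different durations map to vectors of the same size, which is what a discriminative word classifier would need as input under this reading.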

Details

ISSN :
0167-6393
Volume :
99
Database :
OpenAIRE
Journal :
Speech Communication
Accession number :
edsair.doi...........32612050d9d4b81a9c5243d1fc2ba94a
Full Text :
https://doi.org/10.1016/j.specom.2018.02.003