Author: "Georges Linarès" / Topic: speech recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Georges Linarès"' showing total 55 results

Start Over Author "Georges Linarès" Topic speech recognition

55 results on '"Georges Linarès"'

1. Denoised Bottleneck Features From Deep Autoencoders for Telephone Conversation Analysis

Author: Richard Dufour, Mohamed Morchid, Killian Janod, Georges Linarès, Renato De Mori, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Acoustics and Ultrasonics, Computer science, media_common.quotation_subject, Speech recognition, Feature extraction, 02 engineering and technology, computer.software_genre, Bottleneck, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 030507 speech-language pathology & audiology, 03 medical and health sciences, Transcription (linguistics), Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Conversation, Electrical and Electronic Engineering, ComputingMilieux_MISCELLANEOUS, media_common, business.industry, Speech processing, Autoencoder, [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing, Computational Mathematics, Conversation analysis, 020201 artificial intelligence & image processing, Artificial intelligence, 0305 other medical science, business, computer, Natural language processing
Abstract: Automatic transcription of spoken documents is affected by automatic transcription errors that are especially frequent when speech is acquired in severe noisy conditions. Automatic speech recognition errors induce errors in the linguistic features used for a variety of natural language processing tasks. Recently, denoisng autoencoders (DAE) and stacked autoencoders (SAE) have been proposed with interesting results for acoustic feature denoising tasks. This paper deals with the recovery of corrupted linguistic features in spoken documents. Solutions based on DAEs and SAEs are considered and evaluated in a spoken conversation analysis task. In order to improve conversation theme classification accuracy, the possibility of combining abstractions obtained from manual and automatic transcription features is considered. As a result, two original representations of highly imperfect spoken documents are introduced. They are based on bottleneck features of a supervised autoencoder that takes advantage of both noisy and clean transcriptions to improve the robustness of error prone representations. Experimental results on a spoken conversation theme identification task show substantial accuracy improvements obtained with the proposed recovery of corrupted features.
Published: 2017
Full Text: View/download PDF

2. E2E-SINCNET: TOWARD FULLY END-TO-END SPEECH RECOGNITION

Author: Mohamed Morchid, Titouan Parcollet, Georges Linarès, Parcollet, Titouan, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and PNRIA
Subjects: Lossless compression, [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], Computer science, Speech recognition, 05 social sciences, Word error rate, 010501 environmental sciences, [INFO] Computer Science [cs], 01 natural sciences, Signal, Field (computer science), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], End-to-end principle, 0502 economics and business, Waveform, [INFO]Computer Science [cs], Mel-frequency cepstrum, 050207 economics, Joint (audio engineering), 0105 earth and related environmental sciences
Abstract: International audience; Modern end-to-end (E2E) Automatic Speech Recognition (ASR) systems rely on Deep Neural Networks (DNN) that are mostly trained on handcrafted and pre-computed acoustic features such as Mel-filter-banks or Mel-frequency cepstral coefficients. Nonetheless , and despite worse performances, E2E ASR models processing raw waveforms are an active research field due to the lossless nature of the input signal. In this paper, we propose the E2E-SincNet, a novel fully E2E ASR model that goes from the raw waveform to the text transcripts by merging two recent and powerful paradigms: SincNet and the joint CTC-attention training scheme. The conducted experiments on two different speech recognition tasks show that our approach outperforms previously investigated E2E systems relying either on the raw waveform or pre-computed acoustic features, with a reported top-of-the-line Word Error Rate (WER) of 4.7% on the Wall Street Journal (WSJ) dataset.
Published: 2020

3. M2H-GAN: A GAN-Based Mapping from Machine to Human Transcripts for Speech Understanding

Author: Xavier Bost, Mohamed Morchid, Georges Linarès, Titouan Parcollet, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and Parcollet, Titouan
Subjects: [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, business.industry, Computer science, Speech recognition, Deep learning, Machine Learning (stat.ML), Context (language use), [INFO] Computer Science [cs], [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Machine Learning (cs.LG), Term (time), Identification (information), Statistics - Machine Learning, [INFO]Computer Science [cs], Artificial intelligence, Representation (mathematics), business, Computation and Language (cs.CL), Generative grammar, Spoken language
Abstract: Deep learning is at the core of recent spoken language understanding (SLU) related tasks. More precisely, deep neural networks (DNNs) drastically increased the performances of SLU systems, and numerous architectures have been proposed. In the real-life context of theme identification of telephone conversations, it is common to hold both a human, manual (TRS) and an automatically transcribed (ASR) versions of the conversations. Nonetheless, and due to production constraints, only the ASR transcripts are considered to build automatic classifiers. TRS transcripts are only used to measure the performances of ASR systems. Moreover, the recent performances in term of classification accuracy, obtained by DNN related systems are close to the performances reached by humans, and it becomes difficult to further increase the performances by only considering the ASR transcripts. This paper proposes to distillates the TRS knowledge available during the training phase within the ASR representation, by using a new generative adversarial network called M2H-GAN to generate a TRS-like version of an ASR document, to improve the theme identification performances., Comment: Submitted at INTERSPEECH 2019
Published: 2019
Full Text: View/download PDF

4. Real to H-space Encoder for Speech Recognition

Author: Georges Linarès, Mohamed Morchid, Renato De Mori, Titouan Parcollet, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, McGill University = Université McGill [Montréal, Canada], and Parcollet, Titouan
Subjects: [INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI], FOS: Computer and information sciences, Sound (cs.SD), Relation (database), Computer science, Speech recognition, Computer Science::Neural and Evolutionary Computation, [INFO] Computer Science [cs], Computer Science - Sound, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], Quaternion, Representation (mathematics), Index Terms: quaternion neural networks, Computer Science - Computation and Language, Artificial neural network, Frame (networking), Process (computing), speech recognition, recurrent neural net- works, Recurrent neural network, Encoder, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep neural networks (DNNs) and more precisely recurrent neural networks (RNNs) are at the core of modern automatic speech recognition systems, due to their efficiency to process input sequences. Recently, it has been shown that different input representations, based on multidimensional algebras, such as complex and quaternion numbers, are able to bring to neural networks a more natural, compressive and powerful representation of the input signal by outperforming common real-valued NNs. Indeed, quaternion-valued neural networks (QNNs) better learn both internal dependencies, such as the relation between the Mel-filter-bank value of a specific time frame and its time derivatives, and global dependencies, describing the relations that exist between time frames. Nonetheless, QNNs are limited to quaternion-valued input signals, and it is difficult to benefit from this powerful representation with real-valued input data. This paper proposes to tackle this weakness by introducing a real-to-quaternion encoder that allows QNNs to process any one dimensional input features, such as traditional Mel-filter-banks for automatic speech recognition., Comment: Accepted at INTERSPEECH 2019
Published: 2019

5. Bidirectional Quaternion Long Short-term Memory Recurrent Neural Networks for Speech Recognition

Author: Mohamed Morchid, Georges Linarès, Titouan Parcollet, and Renato De Mori
Subjects: Signal Processing (eess.SP), FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Sequence, Computer science, Speech recognition, Machine Learning (stat.ML), 020206 networking & telecommunications, 02 engineering and technology, Computer Science - Sound, Machine Learning (cs.LG), Term (time), Long short term memory, Recurrent neural network, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Electrical Engineering and Systems Science - Signal Processing, Element (category theory), Quaternion, Representation (mathematics), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recurrent neural networks (RNN) are at the core of modern automatic speech recognition (ASR) systems. In particular, long-short term memory (LSTM) recurrent neural networks have achieved state-of-the-art results in many speech recognition tasks, due to their efficient representation of long and short term dependencies in sequences of inter-dependent features. Nonetheless, internal dependencies within the element composing multidimensional features are weakly considered by traditional real-valued representations. We propose a novel quaternion long-short term memory (QLSTM) recurrent neural network that takes into account both the external relations between the features composing a sequence, and these internal latent structural dependencies with the quaternion algebra. QLSTMs are compared to LSTMs during a memory copy-task and a realistic application of speech recognition on the Wall Street Journal (WSJ) dataset. QLSTM reaches better performances during the two experiments with up to $2.8$ times less learning parameters, leading to a more expressive representation of the information., Submitted at ICASSP 2019. arXiv admin note: text overlap with arXiv:1806.04418
Published: 2019
Full Text: View/download PDF

6. Impact of Word Error Rate on theme identification task of highly imperfect human–human conversations

Author: Georges Linarès, Mohamed Morchid, and Richard Dufour
Subjects: 0209 industrial biotechnology, Computer science, media_common.quotation_subject, Speech recognition, Word error rate, 02 engineering and technology, computer.software_genre, Latent Dirichlet allocation, Theoretical Computer Science, symbols.namesake, 020901 industrial engineering & automation, Discriminative model, Transcription (linguistics), 0202 electrical engineering, electronic engineering, information engineering, Speech analytics, Conversation, Gaussian process, media_common, business.industry, Human-Computer Interaction, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, symbols, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Software, Natural language processing
Abstract: HighlightsReview of the impact of dialogue representations and classification methods.We discuss the impact of discriminative words in terms of transcription accuracy.Original study evaluating the impact of the WER in the LDA topic space. A review is proposed of the impact of word representations and classification methods in the task of theme identification of telephone conversation services having highly imperfect automatic transcriptions. We firstly compare two word-based representations using the classical Term Frequency-Inverse Document Frequency with Gini purity criteria (TF-IDF-Gini) method and the latent Dirichlet allocation (LDA) approach. We then introduce a classification method that takes advantage of the LDA topic space representation, highlighted as the best word representation. To do so, two assumptions about topic representation led us to choose a Gaussian Process (GP) based method. Its performance is compared with a classical Support Vector Machine (SVM) classification method. Experiments showed that the GP approach is a better solution to deal with the multiple theme complexity of a dialogue, no matter the conditions studied (manual or automatic transcriptions) (Morchid et al., 2014). In order to better understand results obtained using different word representation methods and classification approaches, we then discuss the impact of discriminative and non-discriminative words extracted by both word representations methods in terms of transcription accuracy (Morchid et al., 2014). Finally, we propose a novel study that evaluates the impact of the Word Error Rate (WER) in the LDA topic space learning process as well as during the theme identification task. This original qualitative study points out that selecting a small subset of words having the lowest WER (instead of using all the words) allows the system to better classify automatic transcriptions with an absolute gain of 0.9 point, in comparison to the best performance achieved on this dialogue classification task (precision of 83.3%).
Published: 2016
Full Text: View/download PDF

7. Audiovisual speaker diarization of TV series

Author: Georges Linarès, Xavier Bost, and Serigne Gueye
Subjects: Speaker diarisation, FOS: Computer and information sciences, Modality (human–computer interaction), Computer Science - Computation and Language, Series (mathematics), Computer science, Speech recognition, ComputerApplications_MISCELLANEOUS, Intonation (linguistics), Set (psychology), Computation and Language (cs.CL), Computer Science - Multimedia, Multimedia (cs.MM)
Abstract: Speaker diarization may be difficult to achieve when applied to narrative films, where speakers usually talk in adverse acoustic conditions: background music, sound effects, wide variations in intonation may hide the inter-speaker variability and make audio-based speaker diarization approaches error prone. On the other hand, such fictional movies exhibit strong regularities at the image level, particularly within dialogue scenes. In this paper, we propose to perform speaker diarization within dialogue scenes of TV series by combining the audio and video modalities: speaker diarization is first performed by using each modality; the two resulting partitions of the instance set are then optimally matched, before the remaining instances, corresponding to cases of disagreement between both modalities, are finally processed. The results obtained by applying such a multi-modal approach to fictional films turn out to outperform those obtained by relying on a single modality.
Published: 2018

8. Quaternion Convolutional Neural Networks for End-to-End Automatic Speech Recognition

Author: Chiheb Trabelsi, Titouan Parcollet, Yoshua Bengio, Ying Zhang, Renato De Mori, Mohamed Morchid, Georges Linarès, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Montreal Institute for Learning Algorithms [Montréal] (MILA), Centre de Recherches Mathématiques [Montréal] (CRM), and Université de Montréal (UdeM)-Université de Montréal (UdeM)
Subjects: FOS: Computer and information sciences, Sound (cs.SD), 0209 industrial biotechnology, Computer science, Speech recognition, Word error rate, Machine Learning (stat.ML), TIMIT, 02 engineering and technology, Convolutional neural network, Computer Science - Sound, Machine Learning (cs.LG), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], 020901 industrial engineering & automation, Audio and Speech Processing (eess.AS), Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), [INFO]Computer Science [cs], Quaternion, Index Terms: quaternion convolutional neural networks, Artificial neural network, Quaternion algebra, business.industry, Deep learning, deep learning, auto- matic speech recognition, Computer Science - Learning, 020201 artificial intelligence & image processing, Artificial intelligence, business, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, the connectionist temporal classification (CTC) model coupled with recurrent (RNN) or convolutional neural networks (CNN), made it easier to train speech recognition systems in an end-to-end fashion. However in real-valued models, time frame components such as mel-filter-bank energies and the cepstral coefficients obtained from them, together with their first and second order derivatives, are processed as individual elements, while a natural alternative is to process such components as composed entities. We propose to group such elements in the form of quaternions and to process these quaternions using the established quaternion algebra. Quaternion numbers and quaternion neural networks have shown their efficiency to process multidimensional inputs as entities, to encode internal dependencies, and to solve many tasks with less learning parameters than real-valued models. This paper proposes to integrate multiple feature views in quaternion-valued convolutional neural network (QCNN), to be used for sequence-to-sequence mapping with the CTC model. Promising results are reported using simple QCNNs in phoneme recognition experiments with the TIMIT corpus. More precisely, QCNNs obtain a lower phoneme error rate (PER) with less learning parameters than a competing model based on real-valued CNNs., Accepted at INTERSPEECH 2018
Published: 2018
Full Text: View/download PDF

9. Audio-Based Video Genre Identification

Author: Georges Linarès, Stanislas Oger, Bernard Merialdo, Mickael Rouvier, Driss Matrouf, Yingbo Li, Laboratoire d'Informatique Fondamental, Université Pierre et Marie Curie - Paris 6 (UPMC), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and Eurecom [Sophia Antipolis]
Subjects: Volunteered geographic information, Acoustics and Ultrasonics, Computer science, video genre classification, Speech recognition, Feature extraction, Index Terms—Automatic classification, Word error rate, Pragmatics, linguistic feature extrac-tion, Support vector machine, Computational Mathematics, Identification (information), Complementarity (molecular biology), Cepstrum, Computer Science (miscellaneous), [INFO]Computer Science [cs], Electrical and Electronic Engineering
Abstract: International audience; —This paper presents investigations about the automatic identification of video genre by audio channel analysis. Genre refers to editorial styles such commercials, movies, sports… We propose and evaluate some methods based on both low and high level descriptors, in cepstral or time domains, but also by analyzing the global structure of the document and the linguistic contents. Then, the proposed features are combined and their complementarity is evaluated. On a database composed of single-stories web-videos, the best audio-only based system performs 9% of Classification Error Rate (CER). Finally, we evaluate the complementarity of the proposed audio features and video features that are classically used for Video Genre Identification (VGI). Results demonstrate the complementarity of the modalities for genre recognition, the final audio-video system reaching 6% CER.
Published: 2015
Full Text: View/download PDF

10. Quaternion Denoising Encoder-Decoder for Theme Identification of Telephone Conversations

Author: Mohamed Morchid, Georges Linarès, Titouan Parcollet, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Computer science, Speech recognition, Index Terms: Spoken language understanding, 020206 networking & telecommunications, 02 engineering and technology, Overfitting, Perceptron, Autoencoder, Small set, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Denoising encoder-decoder neural networks, Identification (information), symbols.namesake, Gaussian noise, Quaternion algebra, 0202 electrical engineering, electronic engineering, information engineering, symbols, [INFO]Computer Science [cs], 020201 artificial intelligence & image processing, Quaternion, Neural networks, Subspace topology
Abstract: International audience; In the last decades, encoder-decoders or autoencoders (AE) have received a great interest from researchers due to their capability to construct robust representations of documents in a low dimensional subspace. Nonetheless, autoencoders reveal little in way of spoken document internal structure by only considering words or topics contained in the document as an isolate basic element, and tend to overfit with small corpus of documents. Therefore, Quaternion Multi-layer Perceptrons (QMLP) have been introduced to capture such internal latent dependencies , whereas denoising autoencoders (DAE) are composed with different stochastic noises to better process small set of documents. This paper presents a novel autoencoder based on both hitherto-proposed DAE (to manage small corpus) and the QMLP (to consider internal latent structures) called "Quater-nion denoising encoder-decoder" (QDAE). Moreover, the paper defines an original angular Gaussian noise adapted to the speci-ficity of hyper-complex algebra. The experiments, conduced on a theme identification task of spoken dialogues from the DE-CODA framework, show that the QDAE obtains the promising gains of 3% and 1.5% compared to the standard real valued de-noising autoencoder and the QMLP respectively.
Published: 2017
Full Text: View/download PDF

11. Web-based possibilistic language models for automatic speech recognition

Author: Stanislas Oger and Georges Linarès
Subjects: Training set, Computer science, business.industry, Speech recognition, Probabilistic logic, Word error rate, computer.software_genre, Machine learning, Speech processing, Theoretical Computer Science, Human-Computer Interaction, Transcription (linguistics), Web application, Artificial intelligence, Language model, business, computer, Software, Natural language processing, Possibility theory
Abstract: This paper describes a new kind of language models based on the possibility theory. The purpose of these new models is to better use the data available on the Web for language modeling. These models aim to integrate information relative to impossible word sequences. We address the two main problems of using this kind of model: how to estimate the measures for word sequences and how to integrate this kind of model into the ASR system. We propose a word-sequence possibilistic measure and a practical estimation method based on word-sequence statistics, which is particularly suited for estimating from Web data. We develop several strategies and formulations for using these models in a classical automatic speech recognition engine, which relies on a probabilistic modeling of the speech recognition process. This work is evaluated on two typical usage scenarios: broadcast news transcription with very large training sets and transcription of medical videos, in a specialized domain, with only very limited training data. The results show that the possibilistic models provide significantly lower word error rate on the specialized domain task, where classical n-gram models fail due to the lack of training materials. For the broadcast news, the probabilistic models remain better than the possibilistic ones. However, a log-linear combination of the two kinds of models outperforms all the models used individually, which indicates that possibilistic models bring information that is not modeled by probabilistic ones.
Published: 2014
Full Text: View/download PDF

12. Modelling Semantic Context of OOV Words in Large Vocabulary Continuous Speech Recognition

Author: Georges Linarès, Dominique Fohr, Irina Illina, Imran Sheikh, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), Grid'5000, ANR-12-BS02-0009,ContNomina,Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio(2012), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: Topic model, Vocabulary, Acoustics and Ultrasonics, Computer science, media_common.quotation_subject, Speech recognition, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Context (language use), 02 engineering and technology, 010501 environmental sciences, Semantics, computer.software_genre, 01 natural sciences, Latent Dirichlet allocation, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], symbols.namesake, 0202 electrical engineering, electronic engineering, information engineering, Computer Science (miscellaneous), Electrical and Electronic Engineering, out-of-vocabulary, proper names, 0105 earth and related environmental sciences, media_common, Context model, business.industry, Computational Mathematics, large vocabulary continuous speech recognition, Automatic indexing, semantic context, symbols, 020201 artificial intelligence & image processing, Artificial intelligence, Language model, business, computer, Natural language processing
Abstract: International audience; The diachronic nature of broadcast news data leads to the problem of Out-Of-Vocabulary (OOV) words in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. Analysis of OOV words reveals that a majority of them are Proper Names (PNs). However PNs are important for automatic indexing of audio-video content and for obtaining reliable automatic transcriptions. In this paper, we focus on the problem of OOV PNs in diachronic audio documents. To enable recovery of the PNs missed by the LVCSR system, relevant OOV PNs are retrieved by exploiting the semantic context of the LVCSR transcriptions. For retrieval of OOV PNs, we explore topic and semantic context derived from Latent Dirichlet Allocation (LDA) topic models, continuous word vector representations and the Neural Bag-of-Words (NBOW) model which is capable of learning task specific word and context representations. We propose a Neural Bag-of-Weighted Words (NBOW2) model which learns to assign higher weights to words that are important for retrieval of an OOV PN. With experiments on French broadcast news videos we show that the NBOW and NBOW2 models outperform the methods based on raw embeddings from LDA and Skip-gram models. Combining the NBOW and NBOW2 models gives a faster convergence during training. Second pass speech recognition experiments, in which the LVCSR vocabulary and language model are updated with the retrieved OOV PNs, demonstrate the effectiveness of the proposed context models.
Published: 2017
Full Text: View/download PDF

13. Parallel Long Short-Term Memory for multi-stream classification

Author: Richard Dufour, Mohamed Morchid, Georges Linarès, Renato De Mori, Mohamed Bouaziz, Département de Recherche en Ingéniérie des Véhicules pour l'Environnement (DRIVE), Université de Bourgogne (UB), Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: FOS: Computer and information sciences, Sequence, Computer Science - Computation and Language, business.industry, Computer science, Speech recognition, Process (computing), Pattern recognition, Context (language use), 02 engineering and technology, Multi stream, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Machine Learning (cs.LG), Task (computing), Computer Science - Learning, Recurrent neural network, 020204 information systems, Logic gate, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, business, Hidden Markov model, Computation and Language (cs.CL), ComputingMilieux_MISCELLANEOUS
Abstract: Recently, machine learning methods have provided a broad spectrum of original and efficient algorithms based on Deep Neural Networks (DNN) to automatically predict an outcome with respect to a sequence of inputs. Recurrent hidden cells allow these DNN-based models to manage long-term dependencies such as Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM). Nevertheless, these RNNs process a single input stream in one (LSTM) or two (Bidirectional LSTM) directions. But most of the information available nowadays is from multistreams or multimedia documents, and require RNNs to process these information synchronously during the training. This paper presents an original LSTM-based architecture, named Parallel LSTM (PLSTM), that carries out multiple parallel synchronized input sequences in order to predict a common output. The proposed PLSTM method could be used for parallel sequence classification purposes. The PLSTM approach is evaluated on an automatic telecast genre sequences classification task and compared with different state-of-the-art architectures. Results show that the proposed PLSTM method outperforms the baseline n-gram models as well as the state-of-the-art LSTM approach., Comment: 2016 IEEE Workshop on Spoken Language Technology
Published: 2016
Full Text: View/download PDF

14. Improved Neural Bag-of-Words Model to Retrieve Out-of-Vocabulary Words in Speech Recognition

Author: Irina Illina, Imran Sheikh, Georges Linarès, Dominique Fohr, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), ANR-12-BS02-0009,ContNomina,Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio(2012), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), and Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)
Subjects: Computer science, Speech recognition, Process (computing), oov, Context (language use), 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Task (project management), Bag-of-words model, 0202 electrical engineering, electronic engineering, information engineering, lvcsr, Embedding, Proper noun, 020201 artificial intelligence & image processing, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], Layer (object-oriented design), proper names, 0105 earth and related environmental sciences
Abstract: International audience; Many Proper Names (PNs) are Out-Of-Vocabulary (OOV) words for speech recognition systems used to process di-achronic audio data. To enable recovery of the PNs missed by the system, relevant OOV PNs can be retrieved by exploiting the semantic context of the spoken content. In this paper, we explore the Neural Bag-of-Words (NBOW) model, proposed previously for text classification, to retrieve relevant OOV PNs. We propose a Neural Bag-of-Weighted-Words (NBOW2) model in which the input embedding layer is augmented with a context anchor layer. This layer learns to assign importance to input words and has the ability to capture (task specific) keywords in a NBOW model. With experiments on French broadcast news videos we show that the NBOW and NBOW2 models outper-form earlier methods based on raw embeddings from LDA and Skip-gram. Combining NBOW with NBOW2 gives faster convergence during training.
Published: 2016
Full Text: View/download PDF

15. Learning Word Importance with the Neural Bag-of-Words Model

Author: Imran Sheikh, Georges Linarès, Dominique Fohr, Irina Illina, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, ANR-12-BS02-0009,ContNomina,Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio(2012), Fohr, Dominique, and BLANC - Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio - - ContNomina2012 - ANR-12-BS02-0009 - BLANC - VALID
Subjects: business.industry, Computer science, Speech recognition, 02 engineering and technology, 010501 environmental sciences, computer.software_genre, 01 natural sciences, Task (project management), Bag-of-words model, Classifier (linguistics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], [INFO.INFO-HC] Computer Science [cs]/Human-Computer Interaction [cs.HC], business, computer, Word (computer architecture), Natural language processing, 0105 earth and related environmental sciences
Abstract: The Neural Bag-of-Words (NBOW) model performs classification with an average of the input word vectors and achieves an impressive performance. While the NBOW model learns word vectors targeted for the classification task it does not explicitly model which words are important for given task. In this paper we propose an improved NBOW model with this ability to learn task specific word importance weights. The word importance weights are learned by introducing a new weighted sum composition of the word vectors. With experiments on standard topic and sentiment classification tasks, we show that (a) our proposed model learns meaningful word importance for a given task (b) our model gives best accuracies among the BOW approaches. We also show that the learned word importance weights are comparable to tf-idf based word weights when used as features in a BOWSVM classifier.
Published: 2016

16. Temporal and Lexical Context of Diachronic Text Documents for Automatic Out-Of-Vocabulary Proper Name Retrieval

Author: Irina Illina, Georges Linarès, Dominique Fohr, Imane Nkairi, Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Zygmunt Vetulani, Hans Uszkoreit, Marek Kubis, and ANR-12-BS02-0009,ContNomina,Exploitation du contexte pour la reconnaissance de noms propres dans les documents diachroniques audio(2012)
Subjects: Vocabulary, Out-of-vocabulary words, Computer science, media_common.quotation_subject, Word error rate, Context (language use), 02 engineering and technology, Proper names, Speech recognition, computer.software_genre, Out of vocabulary, Task (project management), 0202 electrical engineering, electronic engineering, information engineering, Selection (linguistics), Proper noun, [INFO]Computer Science [cs], media_common, business.industry, 020206 networking & telecommunications, Key (cryptography), 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing, Vocabulary augmentation
Abstract: International audience; Proper name recognition is a challenging task in information retrieval from large audio/video databases. Proper names are semantically rich and are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary coverage of a speech transcription system by automatically retrieving proper names from contemporary diachronic text documents. We proposed methods that dynamically augment the automatic speech recognition system vocabulary using lexical and temporal features in diachronic documents. We also studied different metrics for proper name selection in order to limit the vocabulary augmentation and therefore the impact on the ASR performances. Recognition results show a significant reduction of the proper name error rate using an augmented vocabulary.
Published: 2016
Full Text: View/download PDF

17. Integrating imperfect transcripts into speech recognition systems for building high-quality corpora

Author: Georges Linarès, Benjamin Lecouteux, Stanislas Oger, Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Computer science, Speech recognition, 02 engineering and technology, acoustic model training, computer.software_genre, text-to-speech alignment, [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Theoretical Computer Science, 030507 speech-language pathology & audiology, 03 medical and health sciences, Search algorithm, 0202 electrical engineering, electronic engineering, information engineering, Voice activity detection, business.industry, Acoustic model, Speech corpus, Speech processing, Human-Computer Interaction, Scripting language, 020201 artificial intelligence & image processing, Artificial intelligence, Transcription (software), 0305 other medical science, business, Error detection and correction, computer, Software, Natural language processing
Abstract: (Impact-F 1.46 estim. in 2012); International audience; The training of state-of-the-art automatic speech recognition (ASR) systems requires huge relevant training corpora. The cost of such databases is high and remains a major limitation for the development of speech-enabled applications in particular contexts (e.g. low-density languages, or specialized domains). On the other hand, a large amount of data can be found in news prompts, movie subtitles or scripts, etc. The use of such data as training corpus could provide a low-cost solution to the acoustic model estimation problem. Unfortunately, prior transcripts are seldom exact with respect to the content of the speech signal, and suffer from a lack of temporal information. This paper tackles the issue of prompt-based speech corpora improvement, by addressing the problems mentioned above. We propose a method allowing to locate accurate transcript segments in speech signals and automatically correct errors or lack of transcript surrounding these segments. This method relies on a new decoding strategy where the search algorithm is driven by the imperfect transcription of the input utterances. The experiments are conducted on the French language, by using the ESTER database and a set of records (and associated prompts) from RTBF (Radio Télévision Belge Francophone). The results demonstrate the effectiveness of the proposed approach, in terms of both error correction and text-to-speech alignment.
Published: 2012
Full Text: View/download PDF

18. Modeling nuisance variabilities with factor analysis for GMM-based audio pattern classification

Author: Georges Linarès, Florian Verdet, Mickael Rouvier, Driss Matrouf, and Jean-François Bonastre
Subjects: business.industry, Computer science, Speech recognition, Context (language use), Pattern recognition, computer.software_genre, Speaker recognition, Theoretical Computer Science, Human-Computer Interaction, Support vector machine, Statistical classification, ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Feature (machine learning), Artificial intelligence, Computational linguistics, Audio signal processing, business, computer, Software
Abstract: Audio pattern classification represents a particular statistical classification task and includes, for example, speaker recognition, language recognition, emotion recognition, speech recognition and, recently, video genre classification. The feature being used in all these tasks is generally based on a short-term cepstral representation. The cepstral vectors contain at the same time useful information and nuisance variability, which are difficult to separate in this domain. Recently, in the context of GMM-based recognizers, a novel approach using a Factor Analysis (FA) paradigm has been proposed for decomposing the target model into a useful information component and a session variability component. This approach is called Joint Factor Analysis (JFA), since it models jointly the nuisance variability and the useful information, using the FA statistical method. The JFA approach has even been combined with Support Vector Machines, known for their discriminative power. In this article, we successfully apply this paradigm to three automatic audio processing applications: speaker verification, language recognition and video genre classification. This is done by applying the same process and using the same free software toolkit. We will show that this approach allows for a relative error reduction of over 50% in all the aforementioned audio processing tasks.
Published: 2011
Full Text: View/download PDF

19. Query-Driven Strategy for On-the-Fly Term Spotting in Spontaneous Speech

Author: Georges Linarès, Mickael Rouvier, Benjamin Lecouteux, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Audio mining, Voice activity detection, Acoustics and Ultrasonics, Computer science, business.industry, Speech recognition, Search engine indexing, lcsh:QC221-246, Speech corpus, Spotting, Speech processing, computer.software_genre, lcsh:QA75.5-76.95, lcsh:Acoustics. Sound, [INFO]Computer Science [cs], Speech analytics, lcsh:Electronic computers. Computer science, Artificial intelligence, Electrical and Electronic Engineering, business, computer, Utterance, Natural language processing
Abstract: International audience; Spoken utterance retrieval was largely studied in the last decades, with the purpose of indexing large audio databases or of detecting keywords in continuous speech streams. While the indexing of closed corpora can be performed via a batch process, on-line spotting systems have to synchronously detect the targeted spoken utterances. We propose a two-level architecture for on-the-fly term spotting. The first level performs a fast detection of the speech segments that probably contain the targeted utterance. The second level refines the detection on the selected segments, by using a speech recognizer based on a query-driven decoding algorithm. Experiments are conducted on both broadcast and spontaneous speech corpora. We investigate the impact of the spontaneity level on system performance. Results show that our method remains effective even if the recognition rates are significantly degraded by disfluencies.
Published: 2010
Full Text: View/download PDF

20. Topic-space based setup of a neural network for theme identification of highly imperfect transcriptions

Author: Mohamed Morchid, Georges Linarès, Richard Dufour, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Artificial neural network, Computer science, Time delay neural network, business.industry, Hidden layer, Speech recognition, Index Terms— Artificial neural network, Initialization, Latent Dirichlet allocation, Machine learning, computer.software_genre, symbols.namesake, Categorization, Prior probability, symbols, Weights initialization, Speech analytics, [INFO]Computer Science [cs], Artificial intelligence, business, computer, Classifier (UML)
Abstract: International audience; This paper presents a method for speech analytics that integrates topic-space based representation into a feed-forward artificial neural network (FFANN), working as a document classifier. The proposed method consists in configuring the FFANN's topology and in initializing the weights according to a previously estimated topic-space. Setup based on thematic priors is expected to improve the efficiency of the FFANN's weight optimization process, while speeding-up the training process and improving the classification accuracy. This method is evaluated on a spoken dialogue categorization task which is composed of customer-agent dialogues from the call-centre of Paris Public Transportation Company. Results show the interest of the proposed setup method, with a gain of more than 4 points in terms of classification accuracy, compared to the baseline. Moreover, experiments highlight that performance is weakly dependent to FFANN's topology with the LDA-based configuration, in comparison to classical empirical setup.
Published: 2015
Full Text: View/download PDF

21. OOV Proper Name retrieval using topic and lexical context models

Author: Imran Sheikh, Georges Linarès, Dominique Fohr, and Irina Illina
Subjects: Topic model, Vocabulary, Context model, Computer science, business.industry, media_common.quotation_subject, Speech recognition, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Search engine indexing, Context (language use), computer.software_genre, Keyword spotting, Selection (linguistics), Proper noun, Artificial intelligence, business, computer, Natural language processing, media_common
Abstract: Retrieving Proper Names (PNs) specific to an audio document can be useful for vocabulary selection and OOV recovery in speech recognition, as well as in keyword spotting and audio indexing tasks. We propose methods to infer and retrieve OOV PNs relevant to an audio news document by using probabilistic topic models trained over diachronic text news. LVCSR hypothesis on the audio news document is analysed for latent topics, which is then used to retrieve relevant OOV PNs. Using an LDA topic model we obtain Recall up to 0.87 and Mean Average Precision (MAP) of 0.26 with only top 10% of the retrieved OOV PNs. We further propose methods to re-score and retrieve rare OOV PNs, and a lexical context model to improve the target OOV PN rankings assigned by the topic model, which may be biased due to prominence of certain news events. Re-scoring rare OOV PNs improves Recall whereas the lexical context model improves MAP.
Published: 2015
Full Text: View/download PDF

22. Author-topic based representation of call-center conversations

Author: Richard Dufour, Mohamed Morchid, Mohamed Bouallegue, Georges Linarès, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Process (engineering), Computer science, Human/human con-versation, Context (language use), 02 engineering and technology, Space (commercial competition), Speech recognition, computer.software_genre, 01 natural sciences, Latent Dirichlet allocation, Task (project management), 010104 statistics & probability, symbols.namesake, Transcription (linguistics), 0202 electrical engineering, electronic engineering, information engineering, Speech analytics, [INFO]Computer Science [cs], Latent Dirichlet Allocation, 0101 mathematics, business.industry, Representation (systemics), Index Terms— Author-Topic model, Classification, symbols, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, Natural language processing
Abstract: International audience; Performance of Automatic Speech Recognition (ASR) systems drops dramatically when transcribing conversations recorded in noisy conditions. Speech analytics suffer from this poor automatic transcription quality. To tackle this difficulty , a solution consists in mapping transcriptions into a space of hidden topics. This abstract representation allows to substantiate the drawbacks of the ASR process. The well-known and commonly used one is the topic-based representation from a Latent Dirichlet Allocation (LDA). Several studies demonstrate the effectiveness and reliability of this high-level representation. During the LDA learning process, distribution of words into each topic is estimated automatically. Nonetheless, in the context of a classification task, no consideration is made for the targeted classes. Thus, if the targeted application is to find out the main theme related to a dialogue, this information should be taken into consideration. In this paper, we propose to compare a classical topic-based representation of a dialogue, with a new one based not only on the dialogue content itself (words), but also on the theme related to the dialogue. This original representation is based on the author-topic (AT) model. The effectiveness of the proposed representation is evaluated on a classification task from automatic dialogue transcriptions between an agent and a customer of the Paris Transportation Company. Experiments confirmed that this author-topic model approach outperforms by far the classical topic representation, with a substantial gain of more than 7% in terms of correctly labeled conversations.
Published: 2014
Full Text: View/download PDF

23. Constrained speaker diarization of TV series based on visual patterns

Author: Xavier Bost and Georges Linarès
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Series (mathematics), Computer science, Speech recognition, Process (computing), Multimedia (cs.MM), Task (project management), Hierarchical clustering, Constraint (information theory), Speaker diarisation, Visual patterns, Computation and Language (cs.CL), Computer Science - Multimedia
Abstract: Speaker diarization, usually denoted as the ''who spoke when'' task, turns out to be particularly challenging when applied to fictional films, where many characters talk in various acoustic conditions (background music, sound effects...). Despite this acoustic variability , such movies exhibit specific visual patterns in the dialogue scenes. In this paper, we introduce a two-step method to achieve speaker diarization in TV series: a speaker diarization is first performed locally in the scenes detected as dialogues; then, the hypothesized local speakers are merged in a second agglomerative clustering process, with the constraint that speakers locally hypothesized to be distinct must not be assigned to the same cluster. The performances of our approach are compared to those obtained by standard speaker diarization tools applied to the same data.
Published: 2014
Full Text: View/download PDF

24. Factor Analysis based Semantic Variability Compensation for Automatic Conversation Representation

Author: Mohamed Bouallegue, Driss Matrouf, Georges Linarès, Mohamed Morchid, Richard Dufour, Renato De Mori, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, McGill University = Université McGill [Montréal, Canada], and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: Computer science, Speech recognition, media_common.quotation_subject, 02 engineering and technology, [INFO] Computer Science [cs], computer.software_genre, Latent Dirichlet allocation, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Dimension (vector space), Se-mantic variability, 0202 electrical engineering, electronic engineering, information engineering, Conversation, [INFO]Computer Science [cs], Latent Dirichlet Allocation, Variability compensation, Representation (mathematics), Index Terms: Human/Human conversation representation, media_common, business.industry, 020206 networking & telecommunications, Speech processing, Identification (information), symbols, Automatic classification, Noise (video), Artificial intelligence, Factor analysis, 0305 other medical science, business, computer, Subspace topology, Natural language processing
Abstract: The main objective of this paper is to identify themes from dialogues of telephone conversations in a real-life customer care service. In this task, the word semantic variability contained in these conversations may impact the classification performance by retaining the noise in their vectorial representation. In this article, we propose an original method to compensate this semantic variability using the Factor Analysis (FA) paradigm, initially designed for speech processing tasks to compensate the acoustic variability, mainly in Speaker Verification (SV) and Automatic Speech Recognition (ASR). In our proposal, we used the FA paradigm to estimate the semantic variability as an additive component located in a subspace of low dimension (with respect to the super-vector space). This additive semantic variability is estimated in Factor Analysis model space. From this estimation, a specific vector transformation is obtained and is applied to vectors of dialogue representation. Experiments are reported using a corpus collected in the call center of the Paris Transportation Service. Results show the effectiveness of the proposed representation paradigm with a theme identification accuracy of 80.0%, showing a significant improvement with respect to previous results on the same corpus. Index Terms: Human/Human conversation representation, Semantic variability, Factor analysis, Variability compensation, Automatic classification, Latent Dirichlet Allocation.
Published: 2014

25. Multimodal understanding for person recognition in video broadcasts

Author: Frédéric Béchet, Pierre Tirilly, Corinne Fredouille, Benjamin Bigot, Mickael Rouvier, Gregory Senay, Meriem Bendris, Benoit Favre, Rémi Auguste, Georges Linarès, Géraldine Damnati, Richard Dufour, Delphine Charlet, Jean Martinet, Laboratoire d'informatique Fondamentale de Marseille - UMR 6166 (LIF), Université de la Méditerranée - Aix-Marseille 2-Université de Provence - Aix-Marseille 1-Centre National de la Recherche Scientifique (CNRS), Laboratoire d'informatique Fondamentale de Marseille (LIF), Aix Marseille Université (AMU)-École Centrale de Marseille (ECM)-Centre National de la Recherche Scientifique (CNRS), France Télécom Recherche & Développement (FT R&D), France Télécom, France Télécom Recherche et Développement [Lannion] (FTR&D), Traitement Automatique du Langage Ecrit et Parlé (TALEP), Laboratoire d'Informatique et Systèmes (LIS), Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-Université de Toulon (UTLN)-Centre National de la Recherche Scientifique (CNRS), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, FOX MIIRE (LIFL), Laboratoire d'Informatique Fondamentale de Lille (LIFL), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS), Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Centre National de la Recherche Scientifique (CNRS)-École Centrale de Marseille (ECM)-Aix Marseille Université (AMU), and Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU)
Subjects: Modality (human–computer interaction), Multimedia, Computer science, Speech recognition, 020207 software engineering, 02 engineering and technology, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Speaker diarisation, Information extraction, Identification (information), Transcription (linguistics), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Person recognition, Set (psychology), computer, ComputingMilieux_MISCELLANEOUS
Abstract: This paper describes a multi-modal person recognition system for video broadcast developed for participating in the DefiRepere challenge. The main track of this challenge targets the identification of all persons occurring in a video either in the audio modality (speakers) or the image modality (faces). This system is developed by the PERCOL team involving 4 research labs in France and was ranked first at the 2014 Defi-Repere challenge. The main scientific issue addressed by this challenge is the combination of audio and video information extraction processes for improving the extraction performance in both modalities. In this paper, we present the strategy followed by the PERCOL team for speaker identification based on enriching the speaker diarization with features related to the ”understanding” of the video scenes: text overlay transcription and analysis, automatic situation identification (TV set, report), the amount of people visible, TV set disposition and even the camera when available. Experiments on the REPERE corpus show interesting results on the speaker identification system enriched by the scene understanding features and the usefulness of the speaker to identify faces.
Published: 2014

26. I-vector based Representation of Highly Imperfect Automatic Transcriptions

Author: Driss Matrouf, Renato De Mori, Mohamed Morchid, Georges Linarès, Richard Dufour, Mohamed Bouallegue, Déposants HAL-Avignon, bibliothèque Universitaire, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and McGill University = Université McGill [Montréal, Canada]
Subjects: Topic model, Computer science, Speech recognition, media_common.quotation_subject, Context (language use), 02 engineering and technology, [INFO] Computer Science [cs], Latent Dirichlet allocation, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, Transcription (linguistics), 0202 electrical engineering, electronic engineering, information engineering, Speech analytics, Conversation, [INFO]Computer Science [cs], Latent Dirichlet Allocation, Baseline (configuration management), Representation (mathematics), media_common, speech recognition, joint factor analysis, 020206 networking & telecommunications, Speaker recognition, symbols, Index Terms: human/human conversation, 0305 other medical science, i-vectors
Abstract: International audience; The performance of Automatic Speech Recognition (ASR) systems drops dramatically when used in noisy environments. Speech analytics suffer from this poor quality of automatic transcriptions. In this paper, we seek to identify themes from dialogues of telephone conversation services using multiple topic-spaces estimated with a Latent Dirichlet Allocation (LDA) approach. This technique consists in estimating several topic models that offer different views of the document. Unfortunately, such a multi-model approach also introduces additional vari-abilities due to the model diversity. We propose to extract the useful information from the full model-set by using an i-vector based approach, previously developed in the context of speaker recognition. Experiments are conducted on the DECODA corpus , that contains records from the call center of the Paris Transportation Company. Results show the effectiveness of the proposed representation paradigm, our identification system reaching an accuracy of 84.7%, with a gain of 3.3 points compared to the baseline.
Published: 2014

27. Combining acoustic name spotting and continuous context models to improve spoken person name recognition in speech

Author: Richard Dufour, Corinne Fredouille, Georges Linarès, Benjamin Bigot, Gregory Senay, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and Fredouille, Corinne
Subjects: spoken name spotting, business.industry, Computer science, Speech recognition, Search engine indexing, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, Spotting, Spoken name recognition, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], 030507 speech-language pathology & audiology, 03 medical and health sciences, linguistic context modelling, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], 0202 electrical engineering, electronic engineering, information engineering, Artificial intelligence, Transcription (software), 0305 other medical science, business, Set (psychology), phoneme confusion network, computer, Natural language processing
Abstract: International audience; Retrieving pronounced person names in spoken documents is a critical problematic in the context of audiovisual content indexing. In this paper, we present a cascading strategy for two methods dedicated to spoken name recognition in speech. The first method is an acoustic name spotting in phoneme confusion networks. It is based on a phonetic edition distance criterion based on phoneme probabilities held in confusion networks. The second method is a continuous context modelling approach applied on the 1-best transcription output. It relies on a probabilistic modelling of name-to-context dependencies. We assume that the combination of these methods, based on different types of information, may improve spoken name recognition performance. This assumption is studied through experiments done on a set of audiovisual documents from the development set of the REPERE challenge. Results report that combining acoustic and linguistic methods produces an absolute gain of 3% in terms of F-measure compared to the best system taken alone.
Published: 2013
Full Text: View/download PDF

28. Person name spotting by combining acoustic matching and LDA topic models

Author: Richard Dufour, Corinne Fredouille, Gregory Senay, Georges Linarès, Benjamin Bigot, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: Topic model, business.industry, Computer science, Speech recognition, 020206 networking & telecommunications, Pattern recognition, 02 engineering and technology, Spotting, [INFO] Computer Science [cs], Latent Dirichlet allocation, 030507 speech-language pathology & audiology, 03 medical and health sciences, symbols.namesake, 0202 electrical engineering, electronic engineering, information engineering, symbols, [INFO]Computer Science [cs], Artificial intelligence, 0305 other medical science, business, ComputingMilieux_MISCELLANEOUS
Abstract: In this article, we are interested in spoken term detection task, with a particular focus on Person Name (PN) spotting in automatic speech recognition (ASR) system outputs. We propose a two-step method that combines an acoustic matching based on a Phoneme Confusion Network (PCN) with a semantic rescoring based on the Latent Dirichlet Allocation (LDA) models. The first module allows to find, in the PCN, potential PN candidates in speech segments, while the second is in charge of ranking the competing PN, according to a LDA topic model. The proposed LDA-based approach outperforms significantly the baseline system based on a search in the ASR phoneme lattice, obtaining a F-measure score of 77.04% on PN detection.
Published: 2013

29. Theme Identification in Telephone Service Conversations using Quaternions of Speech Features

Author: Mohamed Morchid, Georges Linarès, Marc El-Bèze, Renato De Mori, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Déposants HAL-Avignon, bibliothèque Universitaire, and Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU)
Subjects: Service (systems architecture), human/human conversation analysis, Computer science, Speech recognition, media_common.quotation_subject, Index Terms: Speech analytics, 020206 networking & telecommunications, 02 engineering and technology, [INFO] Computer Science [cs], topic identification, 030507 speech-language pathology & audiology, 03 medical and health sciences, Identification (information), Word lists by frequency, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), quaternion algebra, Conversation, [INFO]Computer Science [cs], 0305 other medical science, Quaternion, Theme (computing), media_common
Abstract: International audience; The paper introduces new features for describing possible focus variation in a human/human conversation. The application considered is a real-life telephone customer care service. The purpose is to hypothesize the dominant theme of conversations between a casual customer calling. Conversations are processed by an automatic speech recognition system that provides hypotheses used for extracting word frequency. Features are extracted in different, broadly defined and partially overlapped, time segments. Combinations of each feature in different segments are represented in a quaternion algebra framework. The advantage of the proposed approach is made evident by the statistically significant improvements in theme classification accuracy .
Published: 2013

30. A LDA-based method for automatic tagging of Youtube videos

Author: Mohamed Morchid, Georges Linarès, Laboratoire Informatique d'Avignon (LIA), and Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU)
Subjects: Topic model, Computer science, business.industry, Keyword extraction, speech recognition, Pattern recognition, 02 engineering and technology, Latent Dirichlet allocation, Small set, symbols.namesake, Robustness (computer science), keyword extraction, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, symbols, 020201 artificial intelligence & image processing, [INFO]Computer Science [cs], Artificial intelligence, Granularity, Transcription (software), business, Tag system, Index Terms— audio categorization, structuring multimedia collection
Abstract: International audience; This article presents a method for automatic tagging of Youtube videos. The proposed method combines an automatic speech recognition (ASR) system, that extracts the spoken contents, and a keyword extraction component that aims at finding a small set of tags representing a video. In order to improve the robustness of the tagging system to the recognition errors, a video transcription is represented in a topic space obtained by a Latent Dirichlet Allocation (LDA), in which each dimension is automatically characterized by a list of weighted terms. Tags are extracted by combining the weighted word list of the best LDA classes. We evaluate this method by employing the user-provided tags of Youtube videos as reference and we investigate the impact of the topic model granularity. The obtained results demonstrate the interest of such model to improve the robustness of the tagging system.
Published: 2013
Full Text: View/download PDF

31. Person name recognition in ASR outputs using continuous context models

Author: Corinne Fredouille, Richard Dufour, Gregory Senay, Benjamin Bigot, Georges Linarès, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: business.industry, Computer science, InformationSystems_INFORMATIONINTERFACESANDPRESENTATION(e.g.,HCI), Speech recognition, Index Terms— spoken document retrieval, lexical context representation, 020206 networking & telecommunications, Context (language use), 02 engineering and technology, Latent variable, computer.software_genre, spoken name detection, 030507 speech-language pathology & audiology, 03 medical and health sciences, Content analysis, Order (business), 0202 electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], Artificial intelligence, 0305 other medical science, business, computer, Natural language processing, ComputingMilieux_MISCELLANEOUS
Abstract: International audience; The detection and characterization, in audiovisual documents, of speech utterances where person names are pronounced, is an important cue for spoken content analysis. This paper tackles the problematic of retrieving spoken person names in the 1-Best ASR outputs of broadcast TV shows. Our assumption is that a person name is a latent variable produced by the lexical context it appears in. Thereby, a spoken name could be derived from ASR outputs even if it has not been proposed by the speech recognition system. A new context modelling is proposed in order to capture lexical and structural information surrounding a spoken name. The fundamental hypothesis of this study has been validated on broadcast TV documents available in the context of the REPERE challenge.
Published: 2013
Full Text: View/download PDF

32. Dynamic Combination of Automatic Speech Recognition Systems by Driven Decoding

Author: Yannick Estève, Guillaume Gravier, Georges Linarès, Benjamin Lecouteux, Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Laboratoire d'Informatique de l'Université du Maine (LIUM), Le Mans Université (UM)-Centre National de la Recherche Scientifique (CNRS), Multimedia content-based indexing (TEXMEX), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique
Subjects: Voice activity detection, Acoustics and Ultrasonics, Computer science, business.industry, Speech recognition, Speech coding, [INFO.INFO-MM]Computer Science [cs]/Multimedia [cs.MM], Word error rate, 02 engineering and technology, Integrated approach, 030507 speech-language pathology & audiology, 03 medical and health sciences, Search algorithm, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Electrical and Electronic Engineering, 0305 other medical science, business, Radio broadcasting, Decoding methods
Abstract: International audience; Combining automatic speech recognition (ASR) systems generally relies on the posterior merging of the outputs or on acoustic cross-adaptation. In this paper, we propose an integrated approach where outputs of secondary systems are integrated in the search algorithm of a primary one. In this driven decoding algorithm (DDA), the secondary systems are viewed as observation sources that should be evaluated and combined to others by a primary search algorithm. DDA is evaluated on a subset of the ESTER I corpus consisting of 4 hours of French radio broadcast news. Results demonstrate DDA significantly outperforms vote-based approaches: we obtain an improvement of 14.5% relative word error rate over the best single-systems, as opposed to the the 6.7% with a ROVER combination. An in-depth analysis of the DDA shows its ability to improve robustness (gains are greater in adverse conditions) and a relatively low dependency on the search algorithm. The application of DDA to both A* and beam-search-based decoder yields similar performances.
Published: 2013

33. Acoustic modeling for under-resourced languages based on vectorial HMM-states representation using Subspace Gaussian Mixture Models

Author: Emmanuel Ferreira, Maria Goudi, Pascal Nocera, Driss Matrouf, Mohamed Bouallegue, and Georges Linarès
Subjects: Computer science, business.industry, Speech recognition, Decision tree, Pattern recognition, Context (language use), Mixture model, symbols.namesake, Simple (abstract algebra), symbols, Artificial intelligence, Representation (mathematics), Hidden Markov model, business, Gaussian process, Subspace topology
Abstract: This paper explores a novel method for context-dependent models in automatic speech recognition (ASR), in the context of under-resourced languages. We present a simple way to realize a tying states approach, based on a new vectorial representation of the HMM states. This vectorial representation is considered as a vector of a low number of parameters obtained by the Subspace Gaussian Mixture Models paradigm (SGMM). The proposed method does not require phonetic knowledge or a large amount of data, which represent the major problems of acoustic modeling for under-resourced languages. This paper shows how this representation can be obtained and used for tying states. Our experiments, applied on Vietnamese, show that this approach achieves a stable gain compared to the classical approach which is based on decision trees. Furthermore, this method appears to be portable to other languages, as shown in the preliminary study conducted on Berber.
Published: 2012
Full Text: View/download PDF

34. Factor analysis based session variability compensation for Automatic Speech Recognition

Author: Mohamed Bouallegue, Mickael Rouvier, Driss Matrouf, and Georges Linarès
Subjects: Normalization (statistics), Voice activity detection, Computer science, business.industry, Speech recognition, Speech coding, Acoustic model, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Pattern recognition, Speaker recognition, Speech processing, Computer Science::Sound, Artificial intelligence, Hidden Markov model, business, Test data
Abstract: In this paper we propose a new feature normalization based on Factor Analysis (FA) for the problem of acoustic variability in Automatic Speech Recognition (ASR). The FA paradigm was previously used in the field of ASR, in order to model the usefull information: the HMM state dependent acoustic information. In this paper, we propose to use the FA paradigm to model the useless information (speaker- or channel-variability) in order to remove it from acoustic data frames. The transformed training data frames are then used to train new HMM models using the standard training algorithm. The transformation is also applied to the test data before the decoding process. With this approach we obtain, on french broadcast news, an absolute WER reduction of 1.3%.
Published: 2011
Full Text: View/download PDF

35. Bag of n-gram driven decoding for LVCSR system harnessing

Author: Georges Linarès, Yannick Estève, Fethi Bougares, Paul Deléglise, Déposants HAL-Avignon, bibliothèque Universitaire, and AMOKRANE, HAKIM
Subjects: Voice activity detection, Matching (graph theory), System combination, Computer science, business.industry, Speech recognition, Speech coding, system combination, [INFO] Computer Science [cs], Machine learning, computer.software_genre, n-gram, [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], Search algorithm, Auxiliary system, Artificial intelligence, Index Terms—speech recognition, business, computer, ComputingMilieux_MISCELLANEOUS, Decoding methods, bag of n-gram driven decoding
Abstract: —This paper focuses on automatic speech recognition systems combination based on driven decoding paradigms. The driven decoding algorithm (DDA) involves the use of a 1-best hypothesis provided by an auxiliary system as another knowledge source in the search algorithm of a primary system. In previous studies, it was shown that DDA outperforms ROVER when the primary system is guided by a more accurate system. In this paper we propose a new method to manage auxiliary transcriptions which are presented as a bag-of-n-grams (BONG) without temporal matching. These modifications allow to make easier the combination of several hypotheses given by different auxiliary systems. Using BONG combination with hypotheses provided by two auxiliary systems, each of which obtained more than 23% of WER on the same data, our experiments show that a CMU Sphinx based ASR system can reduce its WER from 19.85% to 18.66% which is better than the results reached with DDA or classical ROVER combination.
Published: 2011

36. A SEGMENT-LEVEL CONFIDENCE MEASURE FOR SPOKEN DOCUMENT RETRIEVAL

Author: Georges Linarès, Gregory Senay, Benjamin Lecouteux, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)
Subjects: Computer science, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, 02 engineering and technology, confidence measures, Speech recognition, Semantics, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Task (project management), Search engine, Text mining, 0202 electrical engineering, electronic engineering, information engineering, 0501 psychology and cognitive sciences, Relevance (information retrieval), Document retrieval, 050107 human factors, Measure (data warehouse), Information retrieval, business.industry, 05 social sciences, Search engine indexing, 020206 networking & telecommunications, spoken document retrieval, Index (publishing), Artificial intelligence, Transcription (software), business, computer, Natural language processing
Abstract: International audience; This paper presents a semantic confidence measure that aims to predict the relevance of automatic transcripts for a task of Spoken Document Retrieval (SDR). The proposed predicting method relies on the combination of Automatic Speech Recognition (ASR) confidence measure and a Semantic Com-pacity Index (SCI), that estimates the relevance of the words considering the semantic context in which they occurred. Experiments are conducted on the French Broadcast news corpus ESTER, by simulating a classical SDR usage scenario : users submit text-queries to a search engine that is expected to return the most relevant documents regarding the query. Results demonstrate the interest of using semantic level information to predict the transcription indexability.
Published: 2011

37. Semantic cache model driven speech recognition

Author: Benjamin Lecouteux, Georges Linarès, Pascal Nocera, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Latent semantic analysis, Computer science, Speech recognition, Speech coding, speech recognition, Word error rate, Semantics, cache model, Latent Semantic Analysis, Search algorithm, driven decoding, [INFO]Computer Science [cs], Language model, Cache, Decoding methods
Abstract: International audience; This paper proposes an improved semantic based cache model: our method boils down to using the first pass of the ASR system, associated to confidence scores and semantic fields, for driving the second pass. In previous papers, we had introduced a Driven Decoding Algorithm (DDA), which allows us to combine speech recognition systems, by guiding the search algorithm of a primary ASR system by the one-best hypothesis of an auxiliary system. We propose a strategy using DDA to drive a semantic cache, according to the confidence measures. The combination between semantic-cache and DDA optimizes the new decoding process, like an unsupervised language model adaptation. Experiments evaluate the proposed method on 8 hours of speech. Results show that semantic-DDA yields significant improvements to the baseline: we obtain a 4% word error rate relative improvement without acoustic adaptation, and 1.9% after adaptation with a 3xRT ASR system.
Published: 2010
Full Text: View/download PDF

38. Transcription-based video genre classification

Author: Mickael Rouvier, Stanislas Oger, and Georges Linarès
Subjects: Artificial neural network, Transcription (linguistics), business.industry, Computer science, Speech recognition, Feature extraction, Artificial intelligence, Pragmatics, business, computer.software_genre, computer, Natural language processing
Abstract: In this paper, we present a new method for video genre identification based on the linguistic content analysis. This approach relies on the analysis of the most frequent words in the video transcriptions provided by an automatic speech recognition system. Experiments are conducted on a corpus composed of cartoons, movies, news, commercials, documentary, sport and music. On this 7-genre identification task, the proposed transcription-based method obtains up to 80% of correct identification. Finally, this rate is increased to 95% by combining the proposed linguistic-level features with low-level acoustic features.
Published: 2010
Full Text: View/download PDF

39. On-the-fly video genre classification by combination of audio features

Author: Driss Matrouf, Mickael Rouvier, and Georges Linarès
Subjects: Motion analysis, Computer science, business.industry, Speech recognition, Feature extraction, Frame (networking), Pattern recognition, Speech processing, computer.software_genre, Identification (information), Feature (computer vision), Artificial intelligence, Mel-frequency cepstrum, Audio signal processing, business, computer
Abstract: Video genre identification methods are frequently based on image or motion analysis, which are relatively timeconsuming processes. Since such approaches are tractable by batch processing, as-soon-as-possible identification requires faster methods. In this paper, we investigate the use of audio-only methods for on-the-fly video classification. We propose to use several acoustic feature streams and we evaluate various combination schemes at the frame or at the score level. Results are compared to those obtained by humans, according to the listening duration. Although the system based on model combination slightly outperforms the humans on very soon detection. The latter remain significantly more accurate on long sessions.
Published: 2010
Full Text: View/download PDF

40. Factor Analysis for Audio-based Video Genre Classification

Author: Driss Matrouf, Mickael Rouvier, Georges Linarès, and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: Channel (digital image), Computer science, business.industry, Speech recognition, Pattern recognition, automatic classification, [INFO] Computer Science [cs], Mixture model, Domain (software engineering), Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, Factor (programming language), Index Terms: video genre identification, Feature (machine learning), Artificial intelligence, business, Factor Analy-sis, computer, computer.programming_language
Abstract: Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on a Latent Factor Analysis (LFA) proposed a model decomposition into usefull and useless components. This method was successfully applied to speaker and language recognition tasks. In this paper, we study the use of LFA for video genre classification by using only the audio channel. We propose a classification method based on short-term cep-stral features and Gaussian Mixture Models (GMM) or Support Vector Machine (SVM) classifiers, that are combined with Factor Analysis (FA). Experiments are conducted on a corpus composed of 5 types of video (musics, commercials, cartoons, movies and news). The relative classification error reduction obtained by using the best factor analysis configuration with respect to the baseline system, Gaussian Mixture Model Universal Background Model (GMM-UBM), is about 56%, corresponding to a correct identification rate of about 90%.
Published: 2009

41. Compact Acoustic Models for Embedded Speech Recognition

Author: Jean-François Bonastre, Christophe Lévy, and Georges Linarès
Subjects: Acoustics and Ultrasonics, Computer science, Speech recognition, Resource constraints, lcsh:QC221-246, Acoustic model, Probability density function, Speaker recognition, Speech processing, lcsh:QA75.5-76.95, Transformation (function), Computer Science::Sound, lcsh:Acoustics. Sound, lcsh:Electronic computers. Computer science, Electrical and Electronic Engineering, Adaptation (computer science), Hidden Markov model
Abstract: Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition only authorizes few KB of memory, few MIPS, and small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density functions, authorizing to store only the transformation parameters. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique of acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.
Published: 2009

42. Combined low level and high level features for Out-Of-Vocabulary Word detection

Author: Benjamin Lecouteux, Georges Linarès, Benoit Favre, Favre, Benoit, Laboratoire Informatique d'Avignon (LIA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, International Computer Science Institute [Berkeley] (ICSI), International Computer Science Institute, and Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU)
Subjects: [INFO.INFO-CL] Computer Science [cs]/Computation and Language [cs.CL], speech recognition, confidence measures, Index Terms: OOV word detection, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], ComputingMilieux_MISCELLANEOUS
Abstract: International audience; no abstract
Published: 2009

43. Using prompts to produce quality corpus for training automatic speech recognition systems

Author: Georges Linarès, Benjamin Lecouteux, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Closed captioning, Matching (statistics), Computer science, business.industry, closed captioning, Speech recognition, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Feature extraction, Speech coding, speech recognition, Spotting, computer.software_genre, automatic segmentation, corpus building, Unsupervised learning, [INFO]Computer Science [cs], Artificial intelligence, Timestamp, business, computer, Decoding methods, Natural language processing
Abstract: International audience; — In this paper we present an integrated unsupervised method to produce a quality corpus for training automatic speech recognition system (ASR) using prompts or closed captions. Closed captions and prompts do not always have timestamps and do not necessarily correspond to the exact speech. We propose a method allowing to extract quality corpus from imperfect transcript. The proposed approach works in two steps. During the search, the ASR system finds matching segments in a large prompt database. Matching segments are then used inside a Driven Decoding Algorithm (DDA) to produce a high quality corpus. Results show a F-measure of 96% in term of spotting while the DDA corrects the output according to the prompts: a high quality corpus is easily extracted. 1 Index Terms— speech recognition, closed captioning, corpus building, automatic segmentation I. INTRODUCTION The training of an automatic speech recognition system (ASR) requires large amounts of exact annotated speech. The transcription task is expensive and takes a lot of time. In some situations imperfect transcripts like journalist prompts, closed captions or abstracts are available. This material is available in large quantities. However, these transcripts present two issues: the distance compared to the audio stream and the lack of timestamps. Various approaches propose to use imperfect transcripts for unsupervised ASR training (section II-D). But existing methods are not integrated and have shortcomings: processes are iterative and take a lot of computing time; the lack of times-tamp is forgotten. Moreover existing methods do not use all the potential of imperfect transcripts. The first part of this paper is dedicated to the related work on these issues: the prompts quality, methods to perform ASR alignment with imperfect transcripts, the automatic imperfect transcript segmentation, and finally how to use them for training an ASR system. In a second part, we describe an integrated approach allowing us to solve the two main approximated transcription issues
Published: 2008
Full Text: View/download PDF

44. Frame-Based Acoustic Feature Integration for Speech Understanding

Author: Loïc Barrault, Driss Matrouf, R. De Mori, Christophe Servan, Georges Linarès, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: speech understanding, 0209 industrial biotechnology, Computer science, Speech recognition, Word error rate, Topology (electrical circuits), 02 engineering and technology, Set (abstract data type), Reduction (complexity), 030507 speech-language pathology & audiology, 03 medical and health sciences, 020901 industrial engineering & automation, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, frame based combination, business.industry, speech recognition, posterior probabilities combination, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), Pattern recognition, Computer Science::Sound, Feature (computer vision), Artificial intelligence, 0305 other medical science, business, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, Spoken language, Interpolation
Abstract: International audience; With the purpose of improving Spoken Language Un- derstanding (SLU) performance, a combination of different acoustic speech recognition (ASR) systems is proposed. State a posteriori probabilities obtained with systems using different acoustic feature sets are combined with log-linear inter- polation. In order to perform a coherent combination of these probabilities, acoustic models must have the same topology (i.e. same set of states). For this purpose, a fast and efficient twin model training protocol is proposed. By a wise choice of acoustic feature sets and log-linear interpolation of their like- lihood ratios, a substantial Concept Error Rate (CER) reduction has been observed on the test part of the French MEDIA corpus.
Published: 2008
Full Text: View/download PDF

45. On-demand new word learning using world wide web

Author: Pascal Nocera, Georges Linarès, Stanislas Oger, Frédéric Béchet, Laboratoire Informatique d'Avignon (LIA), and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Computer science, Word processing, Context (language use), 02 engineering and technology, Natural languages, Speech recognition, Lexicon, computer.software_genre, World Wide Web, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, Information retrieval, Relevance (information retrieval), [INFO]Computer Science [cs], Index Terms— Lexical modeling, Semantic Web, business.industry, 020201 artificial intelligence & image processing, Artificial intelligence, Transcription (software), 0305 other medical science, business, computer, Word (computer architecture), Natural language, Natural language processing
Abstract: International audience; Most of the Web-based methods for lexicon augmenting consist in capturing global semantic features of the targeted domain in order to collect relevant documents from the Web. We suggest that the local context of the out-of-vocabulary (OOV) words contains relevant information on the OOV words. With this information, we propose to use the Web to build locally-augmented lexicons which are used in a final local decoding pass. Our experiments confirm the relevance of the Web for the OOV word retrieval. Different methods are proposed to retrieve the hypothesis words. Finally we present the integration of new words in the transcription process based on part-of-speech models. This technique allows to recover 7.6% of the significant OOV words and the accuracy of the system is improved.
Published: 2008
Full Text: View/download PDF

46. GENERALIZED DRIVEN DECODING FOR SPEECH RECOGNITION SYSTEM COMBINATION

Author: Benjamin Lecouteux, Georges Linarès, Yannick Estève, Guillaume Gravier, Laboratoire d'Informatique de Grenoble (LIG), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS), Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), Laboratoire d'Informatique de l'Université du Mans (LIUM), Le Mans Université (UM), Creating and exploiting explicit links between multimedia fragments (LinkMedia), Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-MEDIA ET INTERACTIONS (IRISA-D6), Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Télécom Bretagne-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-École normale supérieure - Rennes (ENS Rennes)-Université de Bretagne Sud (UBS)-Centre National de la Recherche Scientifique (CNRS)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA), Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI, Speech and sound data modeling and processing (METISS), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National de Recherche en Informatique et en Automatique (Inria), ANR-06-MDCA-0006,EPAC,Exploration de masse de documents audio pour l'extraction et le traitement de la parole conversationnelle(2006), Université de Rennes 1 (UR1), Université de Rennes (UNIV-RENNES)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes 1 (UR1), Institut National des Sciences Appliquées (INSA)-Université de Rennes (UNIV-RENNES)-Institut National des Sciences Appliquées (INSA)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Rennes – Bretagne Atlantique, Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Institut National de Recherche en Informatique et en Automatique (Inria)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA), and Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-Université de Bretagne Sud (UBS)-École normale supérieure - Rennes (ENS Rennes)-Télécom Bretagne-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)
Subjects: System combination, Computer science, business.industry, Speech recognition, Speech coding, Word error rate, system combination, Pattern recognition, 02 engineering and technology, Integrated approach, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], 030507 speech-language pathology & audiology, 03 medical and health sciences, Search algorithm, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, [INFO]Computer Science [cs], Artificial intelligence, 0305 other medical science, business, Decoding methods, ComputingMilieux_MISCELLANEOUS
Abstract: International audience; Driven Decoding Algorithm (DDA) is initially an integrated approach for the combination of 2 speech recognition (ASR) systems. It consists in guiding the search algorithm of a primary ASR system by the one-best hypothesis of an auxiliary system. In this paper , we generalize DDA to confusion-network driven decoding and we propose new combination schemes for multiple system combination. Since previous experiments involved 2 ASR systems on broadcast news data, the proposed extended DDA is evaluated using 3 ASR systems from different labs. Results show that generalized-DDA outperforms significantly ROVER method: we obtain a 15.7% relative word error rate improvement with respect to the best single system, as opposed to 8.5% with the ROVER combination.
Published: 2008

47. Fast adaptation of GMM-based compact models

Author: Jean-François Bonastre, Christophe Lévy, Georges Linarès, and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: Computer science, business.industry, Speech recognition, Pattern recognition, adaptation, [INFO] Computer Science [cs], Speaker recognition, Domain (software engineering), Set (abstract data type), compact acoustic models, Task (computing), ComputingMethodologies_PATTERNRECOGNITION, Computer Science::Sound, Artificial intelligence, Hidden Markov model, business, Adaptation (computer science), Index Terms: speech recognition
Abstract: In this paper, a new strategy for a fast adaptation of acoustic models is proposed for embedded speech recognition. It relies on a general GMM, which represents the whole acoustic space, associated with a set of HMM state-dependent probability functions modeled as transformations of this GMM. The work presented here takes advantage of this architecture to propose a fast and efficient way to adapt the acoustic models. The adaptation is performed only on the general GMM model, using techniques gathered from the speaker recognition domain. It does not require state-dependent adaptation data and it is very efficient in terms of computational cost. Weevaluate our approach in the voice-command task, using a car-based corpus. This adaptation method achieved a relative error-rate decrease of about 10% even if few adaptation data are available. The complete system allows a total relative gain of more than 20% compared to a basic HMM-based system. Index Terms: speech recognition, compact acoustic models, adaptation
Published: 2007
Full Text: View/download PDF

48. The LIA Speech Recognition System: From 10xRT to 1xRT

Author: Pascal Nocera, D Massonié, Georges Linarès, Driss Matrouf, and Déposants HAL-Avignon, bibliothèque Universitaire
Subjects: Search algorithm, Computer science, Speech recognition, Word error rate, Acoustic model, Language model, [INFO] Computer Science [cs], Transcription (software), Speaker recognition, Lexicon, Real-time operating system
Abstract: The LIA developed a speech recognition toolkit providing most of the components required by speech-to-text systems. This toolbox allowed to build a Broadcast News (BN) transcription system was involved in the ESTER evaluation campaign ([1]), on unconstrained transcription and real-time transcription tasks. In this paper, we describe the techniques we used to reach the real-time, starting from our baseline 10xRT system. We focus on some aspects of the A* search algorithm which are critical for both efficiency and accuracy. Then, we evaluate the impact of the different system components (lexicon, language models and acoustic models) to the trade-off between efficiency and accuracy. Experiments are carried out in framework of the ESTER evaluation campaign. Our results show that the real time system reaches performance on about 5.6% absolute WER whorses than the standard 10xRT system, with an absolute WER (Word Error Rate) of about 26.8%.
Published: 2007
Full Text: View/download PDF

49. System Combination by Driven Decoding

Author: Georges Linarès, Benjamin Lecouteux, Yannick Estève, Julie Mauclair, Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), Laboratoire d'Informatique de l'Université du Mans (LIUM), and Le Mans Université (UM)
Subjects: Vocabulary, System combination, Computer science, Speech recognition, media_common.quotation_subject, Speech coding, broadcast news transcription, system combination, 020206 networking & telecommunications, confidence measures, 02 engineering and technology, decoding algorithms, 030507 speech-language pathology & audiology, 03 medical and health sciences, 0202 electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], 0305 other medical science, Decoding methods, media_common
Abstract: International audience; The combination of Automatic Speech Recognition (ASR) systems generally relies on a posteriori merge of system outputs or on a cross-adaptation. In this paper, we propose an integrated approach where the search of a primary system is driven by the outputs of a secondary one. This method allows to drive the primary system search by using the one-best hypotheses and the word posteriors gathered from the secondary system. Experiments are carried out within the experimental framework of the ESTER evaluation campaign [1]. Results show that the driven decoding algorithm significantly outper-forms the two single ASR systems (-8% of relative WER,-1.7% absolute). Finally, we investigate the interactions between driven decoding and cross-adaptations. The best cross-adaptation strategy in combination with the driven decoding process brings to a final absolute gain of about 1.9% WER.
Published: 2007
Full Text: View/download PDF

50. Text island spotting in large speech databases

Author: Benjamin Lecouteux, Frédéric Beaugendre, Pascal Nocera, Georges Linarès, Laboratoire d'Informatique de Grenoble (LIG), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP )-Institut National Polytechnique de Grenoble (INPG)-Centre National de la Recherche Scientifique (CNRS)-Université Pierre Mendès France - Grenoble 2 (UPMF)-Université Joseph Fourier - Grenoble 1 (UJF), Université Grenoble Alpes [2016-2019] (UGA [2016-2019]), Laboratoire Informatique d'Avignon (LIA), Centre d'Enseignement et de Recherche en Informatique - CERI-Avignon Université (AU), Déposants HAL-Avignon, bibliothèque Universitaire, and Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
Subjects: Closed captioning, Noisy text analytics, Computer science, Speech recognition, closed captioning, Context (language use), 02 engineering and technology, [INFO] Computer Science [cs], Lexicon, computer.software_genre, [INFO.INFO-CL]Computer Science [cs]/Computation and Language [cs.CL], Set (abstract data type), 030507 speech-language pathology & audiology, 03 medical and health sciences, Dimension (vector space), 0202 electrical engineering, electronic engineering, information engineering, [INFO]Computer Science [cs], Database, business.industry, speech recognition, 020206 networking & telecommunications, Speech corpus, Spotting, corpus building, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Artificial intelligence, 0305 other medical science, business, computer, Natural language processing, Word (computer architecture)
Abstract: International audience; Automatic transcript aligned and corrected This competition is arbitrated by a matching score Wi. Experimental context :-First experiments assessed on 3 hours of radio ESTER (with exact transcript and a 10% WER transcripts)-Second experiments assessed on 11 hours of RTBF on wich time stamps where manually added.-All words available in database are added to the language model-Language model : about 67000 words trained on « lemonde »-Speech recognition system : SPEERAL, an asynchronous decoder based on the A* algorithm. Results : Conclusions :-On ESTER tests approximative transcripts bring a WER gain of about 14% relative, while exact ones allows a WER gain close to 24% relative.-Spotting performance is good; more than 95.3% of segments have been found, with a precision of about 96.7%.-On RTBF tests, spotting performance is good; more than 95.3% of segments have been found, with a precision of about 96.7%. Fast-match to transcript island-The principle of the proposed method is close to approaches used in the field of information retrieval.-In our case, the hypothesis is a query which may be answered by one of the transcript island.-The lexicon is represented by a lexical space Ls where each dimension is associated to a word. The coefficients of these vectors represent the frequencies of words in the document.-As the current hypothesis is developed, a set of word clusters Ci is built and updated.-These clusters result from the intersection of hc and the transcript island Ii.-For each new word added to the hypothesis hc, transcript islands are considered as candidates for guiding the search.
Published: 2007

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Journal

Database

Publisher

55 results on '"Georges Linarès"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources