Author: "Luboš Šmídl" - Searchworks@Jio Institute Digital Library Search Results

1. Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer

Author: Jan Švec, Jan Lehečka, and Luboš Šmídl
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Spoken Term Detection, Wav2Vec, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years, standard hybrid DNN-HMM speech recognizers have been outperformed by end-to-end speech recognition systems. One of the most promising approaches is the grapheme Wav2Vec 2.0 model, which uses the self-supervised pretraining approach combined with transfer learning of the fine-tuned speech recognizer. Since it lacks the pronunciation vocabulary and language model, the approach is suitable for tasks where obtaining such models is not easy or almost impossible. In this paper, we use the Wav2Vec speech recognizer in the task of spoken term detection over a large set of spoken documents. The method employs a deep LSTM network which maps the recognized hypothesis and the searched term into a shared pronunciation embedding space in which the term occurrences and the assigned scores are easily computed. The paper describes a bootstrapping approach that allows the transfer of the knowledge contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec. The proposed method outperforms the previously published system based on the combination of the DNN-HMM hybrid ASR and phoneme recognizer by a large margin on the MALACH data in both English and Czech.
Published: 2022

2. Spoken Term Detection and Relevance Score Estimation using Dot-Product of Pronunciation Embeddings

Author: Aleš Pražák, Jan Švec, Josef Psutka, and Luboš Šmídl
Subjects: Estimation, FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, spoken term detection, business.industry, Computer science, Dot product, Pronunciation, computer.software_genre, Computer Science - Sound, Term (time), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Relevance (information retrieval), relevance-score estimation, Artificial intelligence, speech embeddings, business, computer, Computation and Language (cs.CL), Natural language processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
Published: 2022
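The scoring step named in the abstract above (a dot product in the embedding space, calibrated by a sigmoid) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the calibration parameters `scale` and `bias`, and the simple thresholding used to pick out likely occurrences are hypothetical and are not taken from the paper.

```python
import math

def relevance_probability(segment_emb, term_emb, scale=1.0, bias=0.0):
    """Sigmoid-calibrated dot product of two embeddings (hypothetical helper).

    `scale` and `bias` stand in for whatever calibration the real system learns.
    """
    score = sum(s * t for s, t in zip(segment_emb, term_emb))
    return 1.0 / (1.0 + math.exp(-(scale * score + bias)))

def locate_term(segment_embs, term_emb, threshold=0.5):
    """Return (indices of segments above the threshold, all probabilities).

    Thresholding is an assumption; the abstract only says the location is
    estimated from the sequence of output probabilities.
    """
    probs = [relevance_probability(e, term_emb) for e in segment_embs]
    hits = [i for i, p in enumerate(probs) if p > threshold]
    return hits, probs
```

In a real system both embeddings would come from the two trained LSTM networks; here they are plain lists of floats.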

3. Air traffic control communication (ATCC) speech corpora and their use for ASR and TTS development

Author: Luboš Šmídl, Jan Švec, Jan Romportl, Daniel Tihelka, Pavel Ircing, and Jindřich Matoušek
Subjects: rozpoznávání řeči, 050101 languages & linguistics, Linguistics and Language, komunikace při řízení letového provozu, Computer science, Speech recognition, 05 social sciences, speech recognition, Process (computing), 02 engineering and technology, Library and Information Sciences, Air traffic control, Language and Linguistics, Education, ComputingMethodologies_PATTERNRECOGNITION, řečový korpus, syntéza řeči, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, text-to-speech, Computational linguistics, speech corpus, air traffic control communication, Communication channel
Abstract: The paper introduces the motivation for creating dedicated speech corpora of air traffic control communication, describes in detail the process of preparing corpora for both automatic speech recognition and text-to-speech synthesis, presents an illustrative example of a speech recognition system developed using the automatic speech recognition corpora, and finally describes the technical aspects of the data and the distribution channel.
Published: 2019

4. Transformer-Based Automatic Punctuation Prediction and Word Casing Reconstruction of the ASR Output

Author: Jan Lehečka, Pavel Ircing, Luboš Šmídl, and Jan Švec
Subjects: Punctuation predictor, Czech, Word casing reconstruction, Computer science, media_common.quotation_subject, Speech recognition, Punctuation, Readability, language.human_language, T5, ASR, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, language, Casing, Word (computer architecture), BERT, media_common, Transformer (machine learning model)
Abstract: The paper proposes a module for automatic punctuation prediction and casing reconstruction based on transformer architectures (BERT/T5) that constitute the current state of the art in many similar NLP tasks. The main motivation for our work was to increase the readability of the ASR output. The ASR output is usually in the form of a continuous stream of text, without punctuation marks and with all words in lowercase. The resulting punctuation and casing reconstruction module is evaluated on both written text and actual ASR output in three languages (English, Czech and Slovak).
Published: 2021

5. Initial Experiments on Question Answering from the Intrinsic Structure of Oral History Archives

Author: Adam Chýlek, Jan Švec, and Luboš Šmídl
Subjects: Structure (mathematical logic), Questions and answers, Language representation, odpovídání na otázky, Information retrieval, archiv MALACH, Computer science, Natural language question answering, The MALACH archive, Oral history, Transformers, Question answering, Datasets, Natural (music), Active listening, datové sady, transfromery
Abstract: Large audio archives with spoken content are natural candidates for question answering systems. Oral history archives generally contain many facts and stories that would otherwise be hard to obtain without listening to hours of recordings. We strive to make the archive more accessible by allowing natural language question answering. In this paper, we present the challenges that our dataset poses. We propose our initial approach, which uses questions and answers mined from the archive itself, and evaluate the performance in experiments with pretrained language representation and question answering models.
Published: 2021

6. Automatic Correction of i/y Spelling in Czech ASR Output

Author: Pavel Ircing, Luboš Šmídl, Jan Lehečka, and Jan Švec
Subjects: Czech, Grammar rules, 050101 languages & linguistics, Correctness, Computer science, Speech recognition, 05 social sciences, 02 engineering and technology, language.human_language, Spelling, Perceived quality, 0202 electrical engineering, electronic engineering, information engineering, language, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Encoder, Grammatical error correction, ASR, BERT
Abstract: This paper concentrates on the design and evaluation of a method that automatically corrects the spelling of i/y in Czech words at the output of the ASR decoder. After analysis of both the Czech grammar rules and the data, we have decided to deal only with the endings consisting of consonants b/f/l/m/p/s/v/z followed by i/y in both short and long forms. The correction is framed as a classification task in which each word belongs to the “i” class, the “y” class or the “empty” class. Using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT) architecture, we were able to substantially improve the correctness of the i/y spelling on both the simulated and the real ASR output. Since the misspelling of i/y in Czech texts is seen by the majority of native Czech speakers as a blatant error, the corrected output greatly improves the perceived quality of the ASR system.
Published: 2020
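As a rough illustration of the candidate-selection step described in the abstract above, the sketch below picks out words that end in one of the consonants b/f/l/m/p/s/v/z followed by a short or long i/y. It assumes that "ending" means the last two letters of a word; the function name and the regular expression are hypothetical, and the BERT classifier that decides between the “i”, “y” and “empty” classes is not shown.

```python
import re

# Consonants listed in the abstract, followed by short or long i/y.
# Assumption: "ending" is read here as the final two letters of a word.
CANDIDATE = re.compile(r"\b\w*[bflmpsvz][iyíý]\b", re.IGNORECASE)

def find_candidates(text):
    """Return the words of `text` whose i/y ending a classifier would re-check."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]
```

For example, in "byli jsme doma s kamarády" only "byli" is selected, because "kamarády" ends in "d" + "y" and "d" is not among the listed consonants.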

7. BERT-Based Sentiment Analysis Using Distillation

Author: Luboš Šmídl, Pavel Ircing, Jan Švec, and Jan Lehečka
Subjects: business.industry, Computer science, Pooling, Sentiment analysis, Knowledge distillation, Machine learning, computer.software_genre, Production model, Artificial intelligence, Layer (object-oriented design), business, computer, Encoder, BERT, Transformer (machine learning model), Movie reviews
Abstract: In this paper, we present our experiments with BERT (Bidirectional Encoder Representations from Transformers) models in the task of sentiment analysis, which aims to predict the sentiment polarity of a given text. We trained an ensemble of BERT models on a large self-collected movie reviews dataset and distilled the knowledge into a single production model. Moreover, we propose an improved pooling layer architecture for BERT, which outperforms the standard classification layer while enabling per-token sentiment predictions. We demonstrate our improvements on a publicly available dataset of Czech movie reviews.
Published: 2020

8. Adjusting BERT’s Pooling Layer for Large-Scale Multi-Label Text Classification

Author: Jan Švec, Pavel Ircing, Jan Lehečka, and Luboš Šmídl
Subjects: 050101 languages & linguistics, Sequence, business.industry, Computer science, 05 social sciences, Pooling, Text document, 02 engineering and technology, computer.software_genre, Security token, Class (biology), Task (project management), Text classification, BERT model, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0501 psychology and cognitive sciences, Artificial intelligence, Layer (object-oriented design), business, Scale (map), computer, Natural language processing
Abstract: In this paper, we present our experiments with BERT models in the task of Large-scale Multi-label Text Classification (LMTC). In the LMTC task, each text document can have multiple class labels, while the total number of classes is in the order of thousands. We propose a pooling layer architecture on top of BERT models, which improves the quality of classification by using information from the standard [CLS] token in combination with pooled sequence output. We demonstrate the improvements on Wikipedia datasets in three different languages using public pre-trained BERT models.
Published: 2020
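The idea of combining the [CLS] token with pooled sequence output, as mentioned in the abstract above, might look roughly like the sketch below. The abstract does not say how the two are combined or which pooling is used, so the mean pooling, the concatenation, the single linear layer and all names here are assumptions made for illustration only.

```python
import math

def mean_pool(token_vectors):
    """Average the token vectors dimension-wise (one possible pooling choice)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def classify(cls_vector, token_vectors, weights, biases):
    """Multi-label probabilities from [CLS] concatenated with pooled tokens.

    `weights` holds one row per label; each row has as many entries as the
    concatenated feature vector.
    """
    features = list(cls_vector) + mean_pool(token_vectors)
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    # One independent sigmoid per label, since a document may carry several labels.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]
```

In practice the vectors would be the hidden states of a pre-trained BERT model and the weights would be learned; plain lists are used here to keep the sketch self-contained.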

9. Použití stavových LSTM sítí pro detekci klíčových frází

Author: Martin Bulín, Luboš Šmídl, and Jan Švec
Subjects: 0209 industrial biotechnology, Context model, Phrase, LSTM, stavové modelování kontextu, detekce klíčových frází, ASR, business.industry, Computer science, LSTM, Stateful Context modeling, Key-phrase detection, ASR, Process (computing), Context (language use), 02 engineering and technology, Machine learning, computer.software_genre, 01 natural sciences, 020901 industrial engineering & automation, Recurrent neural network, Stateful firewall, 0103 physical sciences, Key (cryptography), State (computer science), Artificial intelligence, business, 010301 acoustics, computer
Abstract: In this paper, we focus on LSTM (Long Short-Term Memory) networks and their implementation in a popular framework called Keras. The goal is to show how to take advantage of their ability to pass the context by holding the state and to clarify what the stateful property of an LSTM recurrent neural network implemented in Keras actually means. The main outcome of the work is a general algorithm for packing arbitrary context-dependent data, capable of (1) packing the data to fit the stateful models; (2) making the training process efficient by supplying multiple frames together; (3) on-the-fly (frame-by-frame) prediction by the trained model. Two training methods are presented: a window-based approach is compared with a fully stateful approach. The analysis is performed on the Speech Commands dataset. Finally, we give guidance on how to use stateful LSTMs to create a key-phrase detection system.
Published: 2019
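The layout constraint behind "packing the data to fit the stateful models" can be sketched as follows. A stateful recurrent layer in Keras reuses the final state of sample i in one batch as the initial state of sample i in the next batch, so consecutive windows of the same stream must sit at the same batch index. The function below is a minimal illustration of that layout, not the packing algorithm from the paper; its name and the choice to drop an incomplete tail window are assumptions.

```python
def pack_stateful(streams, steps):
    """Cut parallel streams into batches of `steps`-frame windows.

    Row i of every batch continues row i of the previous batch, which is the
    ordering a stateful recurrent layer relies on.
    """
    length = min(len(stream) for stream in streams)
    usable = length - length % steps  # assumption: drop the incomplete tail
    batches = []
    for start in range(0, usable, steps):
        batches.append([stream[start:start + steps] for stream in streams])
    return batches
```

With two streams of five frames and `steps=2`, this produces two batches of two windows each, and the fifth frame of each stream is discarded.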

10. Dialogový systém pro vyhledávání znalostí v rozsáhlých audiovizuálních archivech

Author: Adam Chýlek, Jan Švec, and Luboš Šmídl
Subjects: Generování znalostní báze, zpracování přirozeného jazyka, zodpovězení otázek, dialogový systém, Information retrieval, business.industry, Interface (Java), Computer science, Knowledge base generation, Natural language processing, Question answering, Dialog system, 02 engineering and technology, computer.software_genre, 030507 speech-language pathology & audiology, 03 medical and health sciences, Knowledge base, Named-entity recognition, 0202 electrical engineering, electronic engineering, information engineering, Question answering, 020201 artificial intelligence & image processing, Spoken dialog, Dialog system, Dialog box, 0305 other medical science, business, computer, Natural language
Abstract: In this paper, we present our spoken dialog system that serves as a search interface of the MALACH archive. The voice interface and natural language input allow the users to retrieve information contained in large audiovisual archives more comfortably. In particular, finding answers to a more structured question should be easier in comparison with typical search input options. The dialog is built on top of a system that automatically annotates and indexes the archive using automatic speech recognition. Until now, these indexes were searchable only through a full-text search for an arbitrary text query. Our proposed approach improves this system and leverages named entity recognition to create a knowledge base of semantic information contained in the recognized utterances. We describe the design of the dialog system, as well as the automatic knowledge base generation and the approach to creating queries using spoken natural language as an input.
Published: 2019

11. Towards Network Simplification for Low-Cost Devices by Removing Synapses

Author: Luboš Šmídl, Martin Bulín, and Jan Švec
Subjects: rozpoznávání řeči, Artificial neural network, nízkonákladová zařízení, Computer science, 020209 energy, speech recognition, 02 engineering and technology, zjednodušení sítě, prořezávání synpsí, Power (physics), pruning synapses, Reduction (complexity), Task (computing), low-cost devices, Computer engineering, Software deployment, minimal network structure, 0202 electrical engineering, electronic engineering, information engineering, Memory footprint, 020201 artificial intelligence & image processing, Pruning (decision trees), Simple speech, struktura minimální sítě, network simplification
Abstract: The deployment of robust neural network based models on low-cost devices runs into hardware constraints such as limited memory footprint and computing power. This work presents a general method for a rapid reduction of parameters (80–90%) in a trained (DNN or LSTM) network by removing its redundant synapses, while the classification accuracy is not significantly hurt. The massive reduction of parameters leads to a notable decrease of the model’s size and the actual prediction time of on-board classifiers. We show the pruning results on a simple speech recognition task; however, the method is applicable to any classification data.
Published: 2018

12. Choosing a Dialogue System’s Modality in Order to Minimize User’s Workload

Author: Jakub Nedvěd, Adam Chýlek, and Luboš Šmídl
Subjects: 050210 logistics & transportation, Focus (computing), Modality (human–computer interaction), Computer science, 05 social sciences, Word error rate, Workload, Task (project management), Order (business), Human–computer interaction, 0502 economics and business, 0501 psychology and cognitive sciences, Duration (project management), 050107 human factors, Cognitive load
Abstract: Communication during human-machine interaction often happens only as a secondary task that distracts the user’s main focus from a primary task. In our study, the primary task was driving a vehicle and the secondary task was an interaction with a dialogue system on a tablet device using touch and speech. In this paper we present the design and the analysis of a study that can be used to create an optimal strategy for a dialogue manager that takes into consideration several metrics. These include the type of information we require from the user, the expected cognitive load on the user, the expected duration of a user’s response and the expected error rate.
Published: 2018

13. Learning to Interrupt the User at the Right Time in Incremental Dialogue Systems

Author: Adam Chýlek, Luboš Šmídl, and Jan Švec
Subjects: Annotation, Process (engineering), Computer science, Human–computer interaction, Interrupt, Utterance
Abstract: Continuous processing of input in incremental dialogue systems might result in the need to interrupt a user’s utterance when clarification or rapport is needed. Being able to predict the right time to interrupt the utterance can be another step towards a more human-like dialogue. On the other hand, annotation of corpora with different types of possible interruptions requires additional human resources. In this paper, we discuss how to process a corpus that does not have interruptions specifically annotated. We also present initial experiments on two corpora and show that it is possible to model the desired behaviour from these corpora.
Published: 2018

14. Semi-Supervised Training of DNN-Based Acoustic Model for ATC Speech Recognition

Author: Aleš Pražák, Luboš Šmídl, Jan Trmal, and Jan Švec
Subjects: Czech, 050210 logistics & transportation, Unlabelled data, Computer science, Speech recognition, 05 social sciences, Acoustic model, 02 engineering and technology, Air traffic control, Training methods, language.human_language, ComputingMethodologies_PATTERNRECOGNITION, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, language, 020201 artificial intelligence & image processing, Baseline (configuration management), Supervised training, Data selection
Abstract: In this paper, we describe a semi-supervised training method used to generalize the Air Traffic Control (ATC) speech recognizer. The paper introduces the problems and challenges in ATC English recognition, describes available datasets and ongoing research projects. The baseline recognition model is then used to recognize the unlabelled data from a publicly available source. We used the LiveATC community portal which records and archives the recordings of ATC communication near the airports. The recognized unlabelled data are filtered using the data selection procedure based on confidence scores and the recognition acoustic model is retrained to obtain a more general model. The results on accented Czech and French data are reported.
Published: 2018
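The confidence-based data selection mentioned in the abstract above can be illustrated with a minimal filter. The record layout, the function name and the threshold value of 0.9 are assumptions made for this sketch; the abstract does not give the actual selection criterion.

```python
def select_confident(hypotheses, min_confidence=0.9):
    """Keep automatically recognized utterances that are confident enough.

    Each hypothesis is assumed to be a dict with a "confidence" value in [0, 1].
    """
    return [h for h in hypotheses if h["confidence"] >= min_confidence]
```

The selected utterances, together with their automatic transcripts, would then be added to the training data before the acoustic model is retrained.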

15. A Relevance Score Estimation for Spoken Term Detection Based on RNN-Generated Pronunciation Embeddings

Author: Luboš Šmídl, Jan Švec, Jan Trmal, and Josef Psutka
Subjects: Estimation, Computer science, business.industry, Speech recognition, 02 engineering and technology, Pronunciation, computer.software_genre, Term (time), 03 medical and health sciences, 0302 clinical medicine, 030221 ophthalmology & optometry, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Relevance (information retrieval), Artificial intelligence, business, computer, Natural language processing
Published: 2017

16. An Analysis of the RNN-Based Spoken Term Detection Training

Author: Luboš Šmídl, Jan Švec, and Josef Psutka
Subjects: Computer science, Speech recognition, Grapheme, Realization (linguistics), 020206 networking & telecommunications, 02 engineering and technology, Pronunciation, 01 natural sciences, Term (time), Task (project management), Recurrent neural network, 0103 physical sciences, 0202 electrical engineering, electronic engineering, information engineering, Embedding, Representation (mathematics), 010301 acoustics
Abstract: This paper studies the training process of the recurrent neural networks used in the spoken term detection (STD) task. The method used in the paper employs two jointly trained Siamese networks using unsupervised data. The grapheme representation of a searched term and the phoneme realization of a putative hit are projected into the pronunciation embedding space using such networks. The score is estimated as the relative distance of these embeddings. The paper studies the influence of different loss functions, the amount of unsupervised data and the meta-parameters on the performance of the STD system.
Published: 2017

17. A study of different weighting schemes for spoken language understanding based on convolutional neural networks

Author: Pavel Ircing, Jan Švec, Adam Chylek, and Luboš Šmídl
Subjects: Artificial neural network, business.industry, Computer science, Estimation theory, Speech recognition, computer.software_genre, 01 natural sciences, Convolutional neural network, Weighting, 030507 speech-language pathology & audiology, 03 medical and health sciences, Robustness (computer science), 0103 physical sciences, Artificial intelligence, 0305 other medical science, business, 010301 acoustics, computer, Natural language processing, Spoken language
Abstract: This paper describes the development of a stateless spoken language understanding (SLU) module based on artificial neural networks that is able to deal with the uncertainty of the automatic speech recognition (ASR) output. The work builds upon the concept of weighted neurons introduced by the authors previously and presents a generalized weighting term for such a neuron. The effect of different forms and parameter estimation methods of the weighting term is experimentally evaluated on the multi-task training corpus, created by merging two different semantically annotated corpora. The robustness of the best performing weighting schemes is then demonstrated by experiments involving hybrid word-semantic (WSE) lattices and also a limited-data scenario.
Published: 2016

18. An Intelligent Telephony Interface of Multiagent Decision Support Systems

Author: Petr Becvar, Michal Pechoucek, Luboš Šmídl, and Josef Psutka
Subjects: Decision support system, Voice over IP, business.industry, Computer science, Interface (Java), Information technology, VoiceXML, Application software, computer.software_genre, Computer Science Applications, Human-Computer Interaction, Production planning, Control and Systems Engineering, Human–computer interaction, Embedded system, The Internet, Telephony, Electrical and Electronic Engineering, User interface, business, computer, Software, Information Systems
Abstract: ExtraPlanT is a multiagent production planning system designed for small and medium-sized enterprises with project-oriented production. In order to make the results of the system available even to users who are located away from the enterprise, it has been equipped with the possibility of remote access: a Web and telephony interface. The multiagent design of ExtraPlanT makes the integration of these interfaces robust and simple. The telephony interface uses VoiceXML technology so that it can be built without extensive knowledge of speech processing. The interface also uses innovative techniques to overcome the common disadvantages of speech as a medium for machine output.
Published: 2007

19. Word-semantic lattices for spoken language understanding

Author: Pavel Ircing, Tomáš Valenta, Luboš Šmídl, Adam Chylek, and Jan Švec
Subjects: Vocabulary, Computer science, business.industry, Speech recognition, media_common.quotation_subject, Word error rate, Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing), computer.software_genre, Semantics, Semantic computing, Artificial intelligence, Dialog box, business, Hidden Markov model, computer, Natural language processing, Word (computer architecture), Spoken language, media_common
Abstract: The paper presents a method for converting word-based automatic speech recognition (ASR) lattices into word-semantic (W-SE) lattices that contain the original words together with partial semantic information, the so-called semantic entities. The semantic entity detection algorithm generates semantic entities based on expert-defined knowledge. The generated W-SE lattices have a smaller vocabulary and consequently reduce the sparsity of the training data. The format of the W-SE lattices also naturally preserves the inherent uncertainty of the ASR output that can be exploited in subsequent dialog modules. The presented technique employs the framework of weighted finite state transducers, which allows for efficient optimization of word-semantic lattices. We have evaluated the method in two different spoken language understanding tasks and obtained more than 10% reduction of concept error rate in comparison with using the 1-best word hypothesis in both of those tasks.
Published: 2015

20. WebTransc — A WWW Interface for Speech Corpora Production and Processing

Author: Luboš Šmídl and Tomáš Valenta
Subjects: Web browser, business.product_category, Multimedia, Computer science, business.industry, Speech synthesis, computer.software_genre, Annotation, ComputingMethodologies_PATTERNRECOGNITION, Internet access, Web application, Artificial intelligence, Transcription (software), business, computer, Natural language processing
Abstract: This paper describes a web application that was designed to prepare and process speech corpora, key data sources for automatic speech recognition (ASR), natural language processing (NLP), speech synthesis (TTS) and many other tasks. The application allows users to process the corpora with no equipment other than a web browser with an internet connection. The application has been used, upgraded and improved for several years, and its history is also described here. During that time, valuable experience with speech corpora processing has been gained, and it is presented here as a set of good practices.
Published: 2015

21. On the impact of sentence length on recognition accuracy

Author: Tomáš Valenta and Luboš Šmídl
Subjects: Audio mining, Czech, Vocabulary, Sentence length, Computer science, business.industry, Speech recognition, media_common.quotation_subject, Speech corpus, computer.software_genre, VoxForge, language.human_language, language, Automatic speech, Artificial intelligence, Hidden Markov model, business, computer, Natural language processing, media_common
Abstract: The goal of this article is to analyse how the length of utterances affects the performance of an automatic speech recognizer (ASR). Benchmarks of an ASR system were performed for utterances of various lengths on English and Czech corpora. We then attempt to explain the observed phenomena theoretically. Finally, the results are summarized and some conclusions are drawn.
Published: 2014

22. Semantic Entity Detection in the Spoken Air Traffic Control Data

Author: Luboš Šmídl and Jan Švec
Subjects: Rule-based machine translation, Weighted finite state transducer, Computer science, business.industry, Process (computing), Word error rate, Artificial intelligence, Dialog box, Air traffic control, computer.software_genre, business, computer, Natural language processing
Abstract: The paper deals with semantic entity detection (SED) in the ASR lattices obtained by recognizing air traffic control dialogs. The presented method is intended for use in an automatic training tool for air traffic controllers. The semantic entities are modeled using expert-defined context-free grammars. We use a novel approach which allows processing of uncertain input in the form of a weighted finite state transducer. The method was experimentally evaluated on real data. We also compare two methods for utilizing the knowledge about the dialog environment in the SED process. The results show that SED with the knowledge about target semantic entities improves the equal error rate from 24.7% to 17.1% in comparison to generic SED.
Published: 2014
Full Text: View/download PDF

23. Semantic entity detection from multiple ASR hypotheses within the WFST framework

Author: Luboš Šmídl, Pavel Ircing, and Jan Švec
Subjects: Computer science, Speech recognition, Semantic interpretation, Context-free grammar, Automaton, Automata theory, Finite state, Artificial intelligence, Named entity detection, Natural language processing, Confusion
Abstract: The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach the semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFST) - the ASR lattices are nowadays frequently represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can be first expressed in the form of a context free grammar and then converted to the FST form. We use a WFST optimization to obtain a compact representation of the ASR lattice. The WFST framework also allows word confusion networks to be used as another representation of multiple ASR hypotheses. That way we can use the full power of composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this approach allows us to overcome the need for a filler model and consequently makes the method more general. The paper includes experimental evaluation of the proposed algorithm and compares the performance obtained by using the one-best word hypothesis, optimized lattices and word confusion networks.
Published: 2013
Full Text: View/download PDF

24. Hierarchical discriminative model for spoken language understanding

Author: Pavel Ircing, Jan Švec, and Luboš Šmídl
Subjects: Parsing, Grammar, Computer science, Speech recognition, Discriminative model, Rule-based machine translation, Artificial intelligence, Tuple, Natural language processing, Sentence, Spoken language, Spoken dialog systems
Abstract: The paper presents a new discriminative model for statistical spoken language understanding designed for use in spoken dialog systems. The parsing algorithm uses lexicalized grammar derived from unaligned training data with probability estimates generated by multiclass classifiers. The generated semantic trees are partially aligned with the input sentence to provide lexical realisation of semantic concepts. The model was evaluated on two semantically annotated corpora and in both tasks it outperforms the baseline Hidden Vector State parser and Semantic Tuple Classifiers model. The experiments were performed using both transcribed data and recognized lattices. The innovative aspect of using phoneme lattices in the understanding process instead of word lattices is examined and described.
Published: 2013
Full Text: View/download PDF

25. On the Use of Phoneme Lattices in Spoken Language Understanding

Author: Luboš Šmídl and Jan Švec
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Discriminative model, Phoneme recognition, Computer science, Speech recognition, Artificial intelligence, Word (computer architecture), Natural language processing, Spoken language
Abstract: This paper presents a novel approach to spoken language understanding in dialogue systems. Unlike prevalent methods that use only the word lattices, the presented approach works with phoneme lattices generated by a phoneme recognizer. The hierarchical discriminative model for speech understanding was used together with modifications proposed in this paper. The method was experimentally evaluated using two semantic corpora and the results are presented.
Published: 2013
Full Text: View/download PDF

26. Spoken Dialogue System Design in 3 Weeks

Author: Tomáš Valenta, Luboš Šmídl, and Jan Švec
Subjects: Service (systems architecture), Vocabulary, Parsing, Computer science, VoiceXML, Context-free grammar, Task (project management), Rule-based machine translation, Systems design, Artificial intelligence, Natural language processing
Abstract: This article describes knowledge-based spoken dialogue system design from scratch. It covers all stages which were performed during the period of three weeks: definition of semantic goals and entities, data collection and recording of sample dialogues, data annotation, parser and grammars design, dialogue manager design and testing. The work was focused mainly on rapid development of such a dialogue system. The final implementation was written in dynamically generated VoiceXML. The large vocabulary continuous speech recognition system was used and the language understanding module was implemented using non-recursive probabilistic context free grammars which were converted to finite states transducers. The design and implementation has been verified on a railway information service task with a real large-scale database. The paper describes an innovative combination of data, expert knowledge and state-of-the-art methods which allow fast spoken dialogue system design.
Published: 2012
Full Text: View/download PDF

27. On the Impact of Annotation Errors on Unit-Selection Speech Synthesis

Author: Daniel Tihelka, Luboš Šmídl, and Jindřich Matoušek
Subjects: Process (engineering), Computer science, Speech recognition, Speech synthesis, Speech corpus, Unit (housing), Annotation, ComputingMethodologies_PATTERNRECOGNITION, Manual annotation, Selection (linguistics), Quality (business), Artificial intelligence, Natural language processing
Abstract: Unit selection is a very popular approach to speech synthesis. It is known for its ability to produce nearly natural-sounding synthetic speech, but, at the same time, also for its need for very large speech corpora. In addition, unit selection is also known to be very sensitive to the quality of the source speech corpus the speech is synthesised from and its textual, phonetic and prosodic annotations and indexation. Given the enormous size of current speech corpora, manual annotation of the corpora is a lengthy process. Despite this fact, human annotators do make errors. In this paper, the impact of annotation errors on the quality of unit-selection-based synthetic speech is analysed. Firstly, an analysis and categorisation of annotation errors is presented. Then, a speech synthesis experiment, in which the same utterances were synthesised by unit-selection systems with and without annotation errors, is described. Results of the experiment and the options for fixing the annotation errors are discussed as well.
Published: 2012
Full Text: View/download PDF

28. Unsupervised Synchronization of Hidden Subtitles with Audio Track Using Keyword Spotting Algorithm

Author: Luboš Šmídl, Jan Švec, and Petr Stanislav
Subjects: Similarity (geometry), Computer science, Speech recognition, Set (abstract data type), Software framework, Longest common subsequence problem, Consistency (database systems), Keyword spotting, Synchronization (computer science), Subtitle, Artificial intelligence, Algorithm, Natural language processing
Abstract: This paper deals with the processing of hidden subtitles and with the assignment of subtitles without time alignment to the corresponding parts of audio recordings. The first part of this paper describes the processing of hidden subtitles using a software framework designed for handling large volumes of language modelling data. It evaluates characteristics of a corpus built from publicly available subtitles and compares them with corpora created from other sources of data such as news articles. The corpus consistency and similarity to other data sources are evaluated using standard Spearman rank correlation coefficients. The second part presents a novel algorithm for unsupervised alignment of hidden subtitles to the corresponding audio. The algorithm uses no prior time alignment information. The method is based on a keyword spotting algorithm. This algorithm is used for approximate alignment, because a large amount of redundant information is included in the obtained results. The longest common subsequence algorithm then determines the best alignment of the audio and the subtitles. The method was verified on a set of real data (a set of TV shows with hidden subtitles).
Published: 2012
Full Text: View/download PDF

29. Real-time large vocabulary spontaneous speech recognition for spoken dialog systems

Author: Luboš Šmídl and Jan Švec
Subjects: Vocabulary, Computer science, Speech recognition, Word error rate, Speech corpus, Formal language, Text normalization, Written language, Artificial intelligence, Dialog system, Natural language processing, Spoken dialog systems
Abstract: This paper describes a method for modifying a baseline speech recognition system to make it suitable for use in a spoken dialog system with mixed initiative and natural user input. We present three approaches for extending the recognition vocabulary to ensure that the spoken dialog system is able to recognize all entities in the given domain. A colloquial text normalization method is proposed. Experiments performed on a spontaneous speech corpus suggest that the proposed method is very important for languages in which the formal written language and common colloquial speech differ substantially. The overall word error rate was reduced by 16.7%.
Published: 2011
Full Text: View/download PDF

30. Automatic Switchboard Operator

Author: Tomáš Valenta and Luboš Šmídl
Subjects: Voice activity detection, Grammar, Computer science, Speech recognition, Operator (linguistics), Speech synthesis, VoiceXML, Phone, Filter (video), Key (cryptography), Artificial intelligence, Natural language processing
Abstract: This paper describes the Automatic Switchboard Operator system, together with the experience gained and the improvements made on the basis of data collected during full operation of the application. Automatic Switchboard Operator is a voice dialogue application whose purpose is to answer phone calls and transfer them to the requested person. The called person is recognized according to a speech grammar, which has a key effect on the success of the system as a whole. After several months of full operation of the application, the speech grammar was made more robust in order to accept more utterances and filter out substantial information, i.e. a filler model was introduced.
Published: 2011
Full Text: View/download PDF

31. System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive

Author: Pavel Ircing, Josef Psutka, Luboš Šmídl, Aleš Pražák, Jan Vaněk, and Jan Švec
Subjects: Czech, Acoustics and Ultrasonics, Computer science, Engineering and technology, Lexicon, video, Speech-language pathology & audiology, Medical and health sciences, Phonetic search technology, Malach, Electrical engineering, electronic engineering, information engineering, automatic speech recognition, Electrical and Electronic Engineering, Acoustic model, Peace & justice, Term (time), Cultural heritage, Slang, Artificial intelligence & image processing, Language model, Artificial intelligence, Other medical science, Natural language processing
Abstract: The main objective of the work presented in this paper was to develop a complete system that would accomplish the original visions of the MALACH project. Those goals were to employ automatic speech recognition and information retrieval techniques to provide improved access to the large video archive containing recorded testimonies of the Holocaust survivors. The system has so far been developed for the Czech part of the archive only. It takes advantage of the state-of-the-art speech recognition system tailored to the challenging properties of the recordings in the archive (elderly speakers, spontaneous speech and emotionally loaded content) and its close coupling with the actual search engine. The design of the algorithm adopting the spoken term detection approach is focused on the speed of the retrieval. The resulting system is able to search through the 1,000 h of video constituting the Czech portion of the archive and find query word occurrences in a matter of seconds. The phonetic search implemented alongside the search based on the lexicon words makes it possible to find even words outside the ASR system lexicon, such as names, geographic locations or Jewish slang.
Published: 2011

32. Prototype of Czech Spoken Dialog System with Mixed Initiative for Railway Information Service

Author: Jan Švec and Luboš Šmídl
Subjects: Service (systems architecture), Computer science, Natural language understanding, Speech synthesis, Domain (software engineering), Human–computer interaction, The Internet, Artificial intelligence, Dialog box, Dialog system, Utterance, Natural language processing
Abstract: This paper describes a prototype of a Czech dialog system with a mixed dialog initiative and a natural language understanding module. The described dialog system is designed for providing railway information such as arrivals, departures, prices and train types. The dialog can be driven by both a user of the system and a dialog manager to accomplish the dialog goal. In addition, the user can use an almost arbitrary Czech utterance consistent with the dialog domain to interact with the system. The system accesses the train database on-line via the Internet. The version described in this paper works as a desktop computer application and communicates with the user through a headset. The paper describes the modules of the dialog system including automatic speech recognition, natural language understanding, dialog manager, speech generation and speech synthesis.
Published: 2010
Full Text: View/download PDF

33. Voice-supported electronic health record for temporomandibular joint disorders

Author: Tatjana Dostalova, Hippmann R, M. Seydlova, P. Kriz, Jan Trmal, Luboš Šmídl, Jana Zvárová, M. Nagy, and Petr Hanzlicek
Subjects: Interoperability, MEDLINE, Dentistry, Health Informatics, Data entry, Ontology (information science), Field (computer science), Terminology, Open Biomedical Ontologies, Interoperation, Health Information Management, Electronic health record, Component (UML), OBO Foundry, Temporomandibular Joint Disorder, Electronic Health Records, Humans, Medical physics, Advanced and Specialized Nursing, SNOMED CT, Usability, Temporomandibular Joint Disorders, Data science, Temporomandibular joint, Speech Recognition Software
Abstract: Objective: Biomedical ontologies exist to serve integration of clinical and experimental data, and it is critical to their success that they be put to widespread use in the annotation of data. How, then, can ontologies achieve the sort of user-friendliness, reliability, cost-effectiveness, and breadth of coverage that is necessary to ensure extensive usage? Methods: Our focus here is on two different sets of answers to these questions that have been proposed, on the one hand in medicine, by the SNOMED CT community, and on the other hand in biology, by the OBO Foundry. We address more specifically the issue as to how adherence to certain development principles can advance the usability and effectiveness of an ontology or terminology resource, for example by allowing more accurate maintenance, more reliable application, and more efficient interoperation with other ontologies and information resources. Results: SNOMED CT and the OBO Foundry differ considerably in their general approach. Nevertheless, a general trend towards more formal rigor and cross-domain interoperability can be seen in both and we argue that this trend should be accepted by all similar initiatives in the future. Conclusions: Future efforts in ontology development have to address the need for harmonization and integration of ontologies across disciplinary borders, and for this, coherent formalization of ontologies is a pre-requisite.
Published: 2009

34. Rejection and key-phrase spotting techniques using a mumble model in a Czech telephone dialog system

Author: Müller, L., Jurčiček, F., and Luboš Šmídl
Published: 2000
Full Text: View/download PDF

35. Design of Speech Recognition Engine

Author: Luboš Šmídl, Josef Psutka, and Ludek Müller
Subjects: ComputingMethodologies_PATTERNRECOGNITION, Perplexity, Rule-based machine translation, Viterbi decoder, Computer science, Speech recognition, Speech corpus, Markov model, Speaker recognition, Gaussian process
Abstract: This paper concerns a speaker independent recognition engine of Czech continuous speech designed for Czech telephone applications and describes the recognition module as an important component of a telephone dialogue system being designed and constructed at the Department of Cybernetics, the University of West Bohemia. The recognition is based on a statistical approach. The left-to-right three-state HMMs with an output probability density function expressed as multivariate Gaussian mixture are used to model triphones as basic units in acoustic modelling and stochastic regular grammars are implemented to reduce a task perplexity. A real time recognition process is supported by a very computation cost reduction approach estimating log-likelihood scores of Gaussian mixtures and also by a beam pruning used during Viterbi decoding. The present paper concerns the main part of the engine - a speaker independent recognition engine for continuous Czech speech.
Published: 2000
Full Text: View/download PDF

37. An automatic training tool for air traffic control training

Author: Stanislav, P., Luboš Šmídl, and Švec, J.

38. Decision support framework ExtraPlanT with remote access and telephony interface

Author: Luboš Šmídl, Michal Pechoucek, and Petr Becvar
Subjects: Decision support system, Computer science, Interface (Java), Speech synthesis, VoiceXML, User experience design, Models of communication, Transient (computer programming), Telephony, Computer network
Abstract: The ExtraPlanT system is a multi-agent production planning system designed for small factories that need to react quickly to market changes. To deal with this requirement, the ExtraPlanT system has been equipped with an extra-enterprise access feature that allows managers to access and use the system whenever and wherever they need. One possibility for extra-enterprise access is the telephony interface using computer-based speech recognition and synthesis. The interface has been built on VoiceXML technology, and it uses DTMF and speech input and synthesized speech output. VoiceXML documents are generated by Java servlets running on a Tomcat server. To overcome the main disadvantages of telephony interfaces (sequential, transient and slow presentation of information), two techniques have been developed for the ExtraPlanT telephony interface. The first technique is the two-level communication model based on analytical module-knowledge-based system that transforms data into a short summary. On a user request, each summary can be followed by a detailed explanation. The second technique is a dynamic selection of prompt wording, which selects the wording of the prompts according to the estimated user experience in order to find an optimal dialog length and descriptiveness.

39. Design and development of speech corpora for air traffic control training

Author: Luboš Šmídl, Švec, J., Tihelka, D., Matoušek, J., Romportl, J., and Ircing, P.

40. Hierarchical discriminative model for spoken language understanding based on convolutional neural network

Author: Jan Švec, Luboš Šmídl, and Adam Chýlek
Subjects: Discriminative model, Computer science, Speech recognition, Artificial intelligence, Convolutional neural network, Natural language processing, Spoken language

40 results on '"Luboš Šmídl"'