54 results for "Elmar Nöth"
Search Results
2. New Cues in Low-Frequency of Speech for Automatic Detection of Parkinson’s Disease
- Author
-
César Germán Castellanos-Domínguez, Juan Rafael Orozco-Arroyave, Elmar Nöth, Elkyn Alexander Belalcázar-Bolaños, Jesús Francisco Vargas-Bonilla, and Julián D. Arias-Londoño
- Subjects
Parkinson's disease, Computer science, Speech recognition, Medicine, Group delay and phase delay, Energy operator
- Abstract
In this paper, the analysis of the low-frequency zone of speech signals from the five Spanish vowels, by means of the Teager energy operator (TEO) and modified group delay functions (MGDF), is proposed for the automatic detection of Parkinson’s disease.
- Published
- 2013
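The Teager energy operator named in the abstract above has a simple discrete-time form, Ψ[x(n)] = x(n)² − x(n−1)·x(n+1); a minimal sketch of just this operator (the paper's full band-limited pipeline with MGDF is not shown):

```python
import numpy as np

def teager_energy(x):
    """Discrete-time Teager energy operator:
    psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure sinusoid cos(w*n) the TEO equals sin(w)^2, constant over
# time -- a property that makes it useful for analyzing voicing.
n = np.arange(1000)
tone = np.cos(0.1 * n)
psi = teager_energy(tone)
```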
3. Analysis of Speech from People with Parkinson’s Disease through Nonlinear Dynamics
- Author
-
Juan Rafael Orozco-Arroyave, Julián D. Arias-Londoño, Elmar Nöth, and Jesús Francisco Vargas-Bonilla
- Subjects
Nonlinear system, Parkinson's disease, Computer science, Speech recognition, Medicine
- Abstract
Different characterization approaches, including nonlinear dynamics (NLD), have been applied to the automatic detection of Parkinson's disease (PD); however, the discrimination capability obtained when only NLD features are considered has not yet been evaluated.
- Published
- 2013
4. Intelligibility Rating with Automatic Speech Recognition, Prosodic, and Cepstral Evaluation
- Author
-
Frank Rosanowski, Cornelia Moers, Elmar Nöth, Bernd Möbius, and Tino Haderlein
- Subjects
Correlation, Support vector machine, Offset (computer science), Computer science, Speech recognition, Vowel, Cepstrum, Intelligibility (communication), Standard deviation, Jitter
- Abstract
For voice rehabilitation, speech intelligibility is an important criterion. Automatic evaluation of intelligibility by means of automatic speech recognition combined with prosodic analysis has been shown to be successful. In this paper, this method is extended by measures based on the Cepstral Peak Prominence (CPP). 73 hoarse patients (48.3±16.8 years) uttered the vowel /e/ and read the German version of the text "The North Wind and the Sun". Their intelligibility was evaluated perceptually by 5 speech therapists and physicians on a 5-point scale. Support Vector Regression (SVR) revealed a feature set with a human-machine correlation of r = 0.85, consisting of the word accuracy, smoothed CPP computed from a speech section, and three prosodic features (normalized energy of word-pause-word intervals, F0 value at voice offset in a word, and standard deviation of jitter). The average human-human correlation was r = 0.82. Hence, the automatic method can serve as a meaningful objective support for perceptual analysis.
- Published
- 2011
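The Cepstral Peak Prominence used in the abstract above is, in its common form, the height of the cepstral peak in the expected pitch region above a regression line fitted to the cepstrum. A generic sketch (window choice, regression region, and smoothing are assumptions here, not taken from the paper):

```python
import numpy as np

def cepstral_peak_prominence(frame, sr, f0_range=(60.0, 300.0)):
    """Sketch of CPP: cepstral peak height in the plausible-pitch
    quefrency region, measured above a linear regression line fitted
    over that same region."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    quef = np.arange(len(cepstrum)) / sr          # quefrency in seconds
    lo, hi = 1.0 / f0_range[1], 1.0 / f0_range[0]
    region = (quef >= lo) & (quef <= hi)
    peak_idx = np.argmax(cepstrum[region])
    a, b = np.polyfit(quef[region], cepstrum[region], 1)
    peak_q = quef[region][peak_idx]
    return cepstrum[region][peak_idx] - (a * peak_q + b)
```

A strongly periodic frame should yield a clearly higher CPP than noise, which is what makes it a voice-quality cue.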
5. Voice Assessment of Speakers with Laryngeal Cancer by Glottal Excitation Modeling Based on a 2-Mass Model
- Author
-
Tobias Bocklet, Georg Stemmer, and Elmar Nöth
- Subjects
Range (music), Glottis, Computer science, Vocal folds, Speech recognition, Feature extraction, Medicine, Contrast (statistics), Excitation, Vocal tract, Parametric statistics
- Abstract
The paper investigates the automatic evaluation of voice-related criteria of speakers with laryngeal cancer using a parametric two-mass model of the glottis. In contrast to previous approaches based on automatic speech recognition, whose underlying feature extraction techniques model the whole vocal tract, the proposed method allows for a distinct evaluation of voice parameters alone. This work focuses on the separation of vocal folds and vocal tract by LPC, where the vocal folds are represented by a parametric two-mass model which characterizes the excitation signal. The model parameters are optimized by a data-driven procedure that fits the synthetic excitation signal to the LPC residue and the estimated pitch. We found first evidence that the computed parameters are meaningful, in the form of Pearson correlations of |r| ≈ 0.7 between excitation signal parameters and different perceptual voice evaluation criteria.
- Published
- 2011
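The source-filter separation step described above can be sketched with the standard autocorrelation method of LPC: estimate an all-pole vocal-tract filter and inverse-filter the signal to obtain the residual that approximates the glottal excitation (the two-mass model fitting itself is not shown, and order 12 is an assumption):

```python
import numpy as np

def lpc_residual(x, order=12):
    """Estimate LPC coefficients by the autocorrelation method and
    return the inverse-filtered residual e(n) = x(n) - sum_k a_k x(n-k)."""
    x = np.asarray(x, dtype=float)
    # autocorrelation lags 0..order, then solve the normal equations R a = r
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    residual = x.copy()
    for k, ak in enumerate(a, start=1):
        residual[k:] -= ak * x[:-k]
    return residual
```

On a signal generated by an all-pole process, the residual should be much "whiter" (lower variance) than the signal itself.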
6. A Novel Lecture Browsing System Using Ranked Key Phrases and StreamGraphs
- Author
-
Korbinian Riedhammer, Martin Gropp, and Elmar Nöth
- Subjects
Normalized discounted cumulative gain, Information retrieval, Phrase, Ranking, Computer science, Key (cryptography), Learning to rank, Visualization
- Abstract
A growing number of universities offer recordings of lectures, seminars and talks in an online e-learning portal. However, the user is often not interested in the entire recording, but is looking for parts covering a certain topic. Usually, the user has to either watch the whole video or "zap" through the lecture and risk missing important details. We present an integrated web-based platform to help users find relevant sections within recorded lecture videos by providing them with a ranked list of key phrases. For a user-defined subset of these, a StreamGraph visualizes when important key phrases occur and how prominent they are at the given time. To come up with the best key phrase rankings, we evaluate three different key phrase ranking methods using lectures of different topics by comparing automatic with human rankings, and show that human and automatic rankings yield similar scores using Normalized Discounted Cumulative Gain (NDCG).
- Published
- 2011
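The NDCG measure used above to compare automatic and human key-phrase rankings can be sketched as follows; `relevances` are human relevance grades listed in the order produced by the automatic ranker (log base and gain function vary in the literature; this uses the common rel / log2(rank+1) form):

```python
import numpy as np

def ndcg(relevances, k=None):
    """Normalized Discounted Cumulative Gain at cutoff k (or full list)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, len(rel) + 2))   # log2(rank + 1)
    dcg = np.sum(rel / discounts)
    # ideal ordering: the same grades sorted best-first
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum(ideal / discounts)
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking that already lists the most relevant phrases first scores 1.0; any worse ordering scores below 1.0.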
7. Automatic Detection and Evaluation of Edentulous Speakers with Insufficient Dentures
- Author
-
Elmar Nöth, Florian Stelzle, Tobias Bocklet, Florian Hönig, Christian Knipfer, and Tino Haderlein
- Subjects
Orthodontics, Computer science, Assistive technology, Medicine, Dentures, Oral cavity
- Abstract
Dental rehabilitation by complete dentures is a state-of-the-art approach to improve functional aspects of the oral cavity of edentulous patients. It is important to ensure that these dentures have a sufficient fit. We introduce a dataset of 13 edentulous patients who were recorded with and without complete dentures in situ. These patients' dentures had been rated as having an insufficient fit, so additional (sufficient) dentures and additional speech recordings were prepared. In this paper we show that sufficient dentures increase the performance of an ASR system by approx. 27%. Based on these results, we present and discuss three different systems that automatically determine whether the dentures of an edentulous person have a sufficient fit or not. The best-performing system models the recordings by GMMs and uses the mean vectors of these GMMs as features in an SVM. With this system we achieved a recognition rate of 80%.
- Published
- 2010
8. Towards the Automatic Classification of Reading Disorders in Continuous Text Passages
- Author
-
Stefanie Horndasch, Elmar Nöth, Tobias Bocklet, Andreas Maier, and Florian Hönig
- Subjects
Supervisor, Computer science, Speech recognition, Standardized test, Test (assessment), Identification (information), Duration (music), Reading (process), Artificial intelligence, Natural language processing
- Abstract
In this paper, we present an automatic classification approach to identify reading disorders in children. The identification is based on a standardized test. In the original setup, the test is performed by a human supervisor who measures the reading duration and at the same time notes down all reading errors of the child. In this manner we recorded tests of 38 children who were suspected to have reading disorders. The data were then processed by an automatic system which employs speech recognition and prosodic analysis to identify the reading errors. In a subsequent classification experiment -- based on the speech recognizer's output, the duration of the test, and prosodic features -- 94.7% of the children could be classified correctly.
- Published
- 2009
9. Communication Disorders and Speech Technology
- Author
-
Stefan Steidl, Elmar Nöth, and Maria Schuster
- Subjects
Cued speech, Speech production, Computer science, Speech recognition, Speech technology, Medicine, Language translation, Intelligibility (communication), Speech therapist, Paraphasia, Speech error
- Abstract
In this talk we will give an overview of the different kinds of communication disorders. We will concentrate on communication disorders related to language and speech (i.e., not look at disorders like blindness or deafness). Speech and language disorders can range from simple sound substitution to the inability to understand or use language. Thus, a disorder may affect one or several linguistic levels: A patient with an articulation disorder cannot correctly produce speech sounds (phonemes) because of imprecise placement, timing, pressure, speed, or flow of movement of the lips, tongue, or throat. His speech may be acoustically unintelligible, yet the syntactic, semantic, and pragmatic levels are not affected. With other pathologies, e.g. Wernicke's aphasia, the acoustics of the speech signal might be intelligible, yet the patient is --- due to a mix-up of words (semantic paraphasia) or sounds (phonematic paraphasia) --- unintelligible. We will look at what linguistic knowledge has to be modeled in order to analyze different pathologies with speech technology, how difficult the task is, and how speech technology can support the speech therapist in the tasks of diagnosis, therapy control, comparison of therapies, and screening.
- Published
- 2009
10. Objective vs. Subjective Evaluation of Speakers with and without Complete Dentures
- Author
-
Andreas Maier, Christian Knipfer, Elmar Nöth, Tino Haderlein, Tobias Bocklet, and Florian Stelzle
- Subjects
Word accuracy, Correlation, Rehabilitation, Computer science, Speech recognition, Word recognition, Medicine, Objective evaluation, Dentures, Intelligibility (communication)
- Abstract
For dento-oral rehabilitation of edentulous (toothless) patients, speech intelligibility is an important criterion. 28 persons read a standardized text once with and once without wearing complete dentures. Six experienced raters evaluated the intelligibility subjectively on a 5-point scale and the voice on the 4-point Roughness-Breathiness-Hoarseness (RBH) scales. Objective evaluation was performed by Support Vector Regression (SVR) on the word accuracy (WA) and word recognition rate (WR) of a speech recognition system, and a set of 95 word based prosodic features. The word accuracy combined with selected prosodic features showed a correlation of up to r = 0.65 to the subjective ratings for patients with dentures and r = 0.72 for patients without dentures. For the RBH scales, however, the average correlation of the feature subsets to the subjective ratings for both types of recordings was r < 0.4.
- Published
- 2009
11. 3D Tele-Medical Speech Therapy using Time-of-Flight Technology
- Author
-
Elmar Nöth, Jochen Penne, Andreas Maier, R. Handschu, Christian Schaller, M. Scibor, S. Soutschek, and Michael Stürmer
- Subjects
Time of flight, Telemedicine, Computer science, Real-time computing, Speech technology, Speech therapy, Visualization, Original data
- Abstract
Tele-medical applications allow the treatment of patients over long distances. This is of special importance in speech therapy, since speech therapists often operate in specialized centers for certain disorders. In this paper we propose a novel therapy system which employs Time-of-Flight (ToF) 3D cameras that acquire 3D surface data in real time. The distance data are compressed into an MPEG stream and transmitted to a remote client, where a 3D visualization is implemented. The loss in accuracy compared to the original data lies between 1.6 mm and 9.7 mm. The system can be used for real-time streaming.
- Published
- 2009
12. Age Determination of Children in Preschool and Primary School Age with GMM-Based Supervectors and Support Vector Machines/Regression
- Author
-
Elmar Nöth, Andreas Maier, and Tobias Bocklet
- Subjects
Support vector machine, School age child, Mean squared error, Computer science, Independent parameter, Pattern recognition, Artificial intelligence, Training methods, Mixture model, Regression
- Abstract
This paper focuses on the automatic determination of the age of children in preschool and primary school age. For each child a Gaussian Mixture Model (GMM) is trained. As training method, Maximum A Posteriori adaptation (MAP) is used. MAP derives the speaker models from a Universal Background Model (UBM) and does not perform an independent parameter estimation. The means of each GMM are extracted and concatenated, which results in a so-called GMM supervector. These supervectors are then used as meta features for classification with Support Vector Machines (SVM) or for Support Vector Regression (SVR). With the classification system, a precision of 83% and a recall of 66% were achieved. When the regression system was used to determine the age in years, a mean error of 0.8 years and a maximal error of 3 years were obtained. A regression with monthly resolution yielded similar results.
- Published
- 2008
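The GMM-supervector construction described above can be sketched as follows: MAP-adapt the UBM means toward one speaker's frames and concatenate them. The diagonal-covariance UBM and the relevance factor of 16 are common choices assumed here, not taken from the paper:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapted_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM component means to `frames` and return the
    concatenated means (the GMM supervector)."""
    post = ubm.predict_proba(frames)                       # frame-component posteriors
    n = post.sum(axis=0)                                   # soft counts per component
    ex = post.T @ frames / np.maximum(n, 1e-10)[:, None]   # per-component data means
    alpha = (n / (n + relevance))[:, None]                 # data-dependent interpolation
    adapted = alpha * ex + (1 - alpha) * ubm.means_
    return adapted.ravel()                                 # the supervector

# Toy usage: train a small UBM, then adapt it to one "speaker's" frames.
rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(500, 4)))
sv = map_adapted_supervector(ubm, rng.normal(size=(100, 4)))
```

The supervector (here 8 components × 4 dimensions = 32 values) is what would be fed to the SVM/SVR.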
13. Prosodic Events Recognition in Evaluation of Speech-Synthesis System Performance
- Author
-
France Mihelič, Boštjan Vesnicer, Janez Žibert, and Elmar Nöth
- Subjects
Computer science, Speech recognition, System evaluation, Speech synthesis, Notation, Naturalness, Evaluation methods, Objective evaluation, Artificial intelligence, Hidden Markov model, Prosody, Natural language processing
- Abstract
We present an objective evaluation method for the prosody modeling in an HMM-based Slovene speech-synthesis system. The method is based on the results of the automatic recognition of syntactic-prosodic boundary positions and accented words in the synthetic speech. We have shown that the recognition results closely match the prosodic notations labeled by a human expert on the natural-speech counterpart that was used to train the speech-synthesis system. The recognition rate of the prosodic events is proposed as an objective evaluation measure for the quality of the prosodic modeling in the speech-synthesis system. The results of the proposed evaluation method are also in accordance with previous subjective listening assessments, in which high naturalness scores were observed for this type of speech synthesis.
- Published
- 2008
14. Multilingual Weighted Codebooks for Non-native Speech Recognition
- Author
-
Martin Raab, Rainer Gruhn, and Elmar Nöth
- Subjects
Word accuracy, Voice activity detection, Native English, Computer science, Speech recognition, Vector quantization, Codebook, Artificial intelligence, Natural language processing
- Abstract
In many embedded systems, commands and other words in the user's main language must be recognized with maximum accuracy, but it should also be possible to use foreign names, as they frequently occur in music titles or city names. Example systems with constrained resources are navigation systems, mobile phones, and MP3 players. Speech recognizers on embedded systems are typically semi-continuous recognizers based on vector quantization. Recently we introduced Multilingual Weighted Codebooks (MWCs) for such systems. Our previous work shows significant improvements for the recognition of multiple native languages; however, open questions remained regarding the performance on non-native speech. We evaluate on four different non-native accents of English, and our MWCs always produce significantly better results than a native English codebook. Our best result is a 4.4% absolute improvement in word accuracy. Further experiments with non-native accented speech give interesting insights into the attributes of non-native speech in general.
- Published
- 2008
15. Analysis of Hypernasal Speech in Children with Cleft Lip and Palate
- Author
-
Christian Hacker, Andreas Maier, Elmar Nöth, Maria Schuster, and Alexander Reuß
- Subjects
Computer science, Speech recognition, Frame (networking), Medicine, Pronunciation, Mixture model, Hypernasal speech, Word (computer architecture), Connected speech, Confusion
- Abstract
Speech disorders appear often in children with cleft lip and palate. One major disorder among them is hypernasality. This is the first study to show that it is possible to automatically detect hypernasality in connected speech without any invasive means. To this end, we investigated MFCCs and pronunciation features; the pronunciation features are computed from phoneme confusion probabilities. Furthermore, we examine frame-level features based on the Teager Energy operator. The classification of hypernasal speech reaches up to 66.6% (CL) and 86.9% (RR) on the word level; on the frame level, rates of 62.3% (CL) and 90.3% (RR) are reached.
- Published
- 2008
16. Influence of Reading Errors on the Text-Based Automatic Evaluation of Pathologic Voices
- Author
-
Elmar Nöth, Frank Rosanowski, Andreas Maier, Maria Schuster, and Tino Haderlein
- Subjects
Rehabilitation, Computer science, Speech recognition, Intelligibility (communication), Correlation, Pathologic, Vowel, Word recognition, Medicine, Preprocessor, Artificial intelligence, Prosody, Natural language processing
- Abstract
In speech therapy and rehabilitation, a patient's voice has to be evaluated by the therapist. Established methods for objective, automatic evaluation analyze only recordings of sustained vowels. However, an isolated vowel does not reflect a real communication situation. In this paper, a speech recognition system and a prosody module are used to analyze a text that was read out by the patients. The correlation between the perceptive evaluation of speech intelligibility by five medical experts and measures like word accuracy (WA), word recognition rate (WR), and prosodic features was examined. The focus was on the influence of reading errors on this correlation. The test speakers were 85 persons suffering from cancer in the larynx; 65 of them had undergone partial laryngectomy, i.e. partial removal of the larynx. The correlation between the human intelligibility ratings on a five-point scale and the machine was r ≈ −0.61 for WA, r ≈ 0.55 for WR, and r ≈ 0.60 for prosodic features based on word duration and energy. The reading errors did not have a significant influence on the results. Hence, no special preprocessing of the audio files is necessary.
- Published
- 2008
17. An Extension to the Sammon Mapping for the Robust Visualization of Speaker Dependencies
- Author
-
Julian Exner, Tino Haderlein, Stefan Steidl, Andreas Maier, Elmar Nöth, and Anton Batliner
- Subjects
Computer science, Microphone, Dimensionality reduction, Ranging, Extension (predicate logic), Zero (linguistics), Visualization, Sammon mapping, Computer vision, Artificial intelligence
- Abstract
We present a novel, microphone-independent method for the visualization of speakers. To reduce the influence of the recording conditions on the visualization, we compare two methods. The first is a registration of maps created from identical speakers recorded under different conditions, i.e., different microphones and distances, in two steps: dimension reduction followed by a linear registration of the maps. The second method is an extension of the Sammon mapping, which performs a non-linear registration during the dimension reduction itself. The proposed extension surpasses the two-step registration approach, with a mapping error ranging from 17% to 24% and a grouping error close to zero.
- Published
- 2008
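The plain Sammon mapping that the paper extends minimizes the Sammon stress E = (1/c) Σ (D_ij − d_ij)² / D_ij over low-dimensional point positions. A minimal gradient-descent sketch (the paper's registration extension is not shown; learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sammon(D, dim=2, iters=300, lr=0.1, seed=0):
    """Map points with target distance matrix D into `dim` dimensions by
    gradient descent on the Sammon stress."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-2, size=(n, dim))   # small random initialization
    c = D.sum()                                 # stress normalization constant
    Dm = D + np.eye(n)                          # avoid division by zero on the diagonal
    for _ in range(iters):
        diff = Y[:, None, :] - Y[None, :, :]
        d = np.linalg.norm(diff, axis=-1) + np.eye(n)
        factor = (d - Dm) / (Dm * d)            # pairwise gradient weights
        np.fill_diagonal(factor, 0.0)
        grad = (2.0 / c) * (factor[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y

# Toy usage: four points at the corners of a unit square.
X = np.array([[0.0, 0], [1, 0], [0, 1], [1, 1]])
D = np.linalg.norm(X[:, None] - X[None], axis=-1)
Y = sammon(D)
```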
18. Automatic Evaluation of Pathologic Speech – from Research to Routine Clinical Use
- Author
-
Elmar Nöth, Andreas Maier, Frank Rosanowski, Tino Haderlein, Korbinian Riedhammer, and Maria Schuster
- Subjects
Word accuracy, SVM regression, Correlation, Pathologic, Computer science, Speech recognition, Transliteration, Artificial intelligence, Natural language processing
- Abstract
Previously we have shown that ASR technology can be used to objectively evaluate pathologic speech. Here we report on progress for routine clinical use: 1) We introduce an easy-to-use recording and evaluation environment. 2) We confirm our previous results for a larger group of patients. 3) We show that telephone speech can be analyzed with the same methods with only a small loss of agreement with human experts. 4) We show that prosodic information leads to more robust results. 5) We show that text reference instead of transliteration can be used for evaluation. Using word accuracy of a speech recognizer and prosodic features as features for SVM regression, we achieve a correlation of .90 between the automatic analysis and human experts.
- Published
- 2007
19. Intelligibility Is More Than a Single Word: Quantification of Speech Intelligibility by ASR and Prosody
- Author
-
Emeka Nkenke, Andreas Maier, Elmar Nöth, Tino Haderlein, and Maria Schuster
- Subjects
Correlation, Correlation coefficient, Computer science, Speech recognition, Education, Medicine, Speech disorder, Intelligibility (communication), Prosody
- Abstract
In this paper we examine the quality of the prediction of intelligibility scores of human experts. Furthermore, we investigate the differences between subjective expert raters who evaluated speech disorders of laryngectomees and children with cleft lip and palate. We use the recognition rate of a word recognizer and prosodic features to predict the intelligibility score of each individual expert. For each expert and the mean opinion of all experts we present the best features to model their scoring behavior according to the mean rank obtained during a 10-fold cross-validation. In this manner all individual speech experts were modeled with a correlation coefficient of at least r > .75. The mean opinion of all raters is predicted with a correlation of r = .90 for the laryngectomees and r = .86 for the children.
- Published
- 2007
20. An Automatic Version of the Post-Laryngectomy Telephone Test
- Author
-
Andreas Maier, Frank Rosanowski, Elmar Nöth, Hikmet Toy, Korbinian Riedhammer, and Tino Haderlein
- Subjects
Computer science, Speech recognition, Intelligibility (communication), Rate of speech, German, Laryngectomy, Word recognition, Medicine, Artificial intelligence, Natural language processing
- Abstract
Tracheoesophageal (TE) speech is a possibility to restore the ability to speak after total laryngectomy, i.e. the removal of the larynx. The quality of the substitute voice has to be evaluated during therapy. For the intelligibility evaluation of German speakers over telephone, the Post-Laryngectomy Telephone Test (PLTT) was defined. Each patient reads out 20 of 400 different monosyllabic words and 5 out of 100 sentences. A human listener writes down the words and sentences understood and computes an overall score. This paper presents a means of objective and automatic evaluation that can replace the subjective method. The scores of 11 naive raters for a set of 31 test speakers were compared to the word recognition rate of speech recognizers. Correlation values of about 0.9 were reached.
- Published
- 2007
21. Text-Independent Speaker Identification Using Temporal Patterns
- Author
-
Andreas Maier, Tobias Bocklet, and Elmar Nöth
- Subjects
Training set, Computer science, Speech recognition, Text independent, Pattern recognition, Speaker recognition, Mixture model, Linear discriminant analysis, Speaker diarisation, Speaker identification, Artificial intelligence, Mel-frequency cepstrum
- Abstract
In this work we present an approach for text-independent speaker recognition. As features we used Mel Frequency Cepstrum Coefficients (MFCCs) and Temporal Patterns (TRAPs). For each speaker we trained Gaussian Mixture Models (GMMs) with different numbers of densities. The database used contained 36 speakers with very noisy close-talking recordings. For the training, a Universal Background Model (UBM) is built with the EM algorithm on all available training data. This UBM is then used to create speaker-dependent models for each speaker in one of two ways: taking the UBM as an initial model for EM training, or Maximum A Posteriori (MAP) adaptation. For the 36-speaker database, the use of TRAPs instead of MFCCs leads to a frame-wise recognition improvement of 12.0%. The adaptation with MAP enhanced the recognition rate by another 14.2%.
- Published
- 2007
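The identification step behind the abstract above can be sketched with per-speaker GMMs scored by log-likelihood. Here the speaker models are trained directly for brevity (the paper derives them from a UBM by EM initialization or MAP), and the random vectors stand in for MFCC/TRAP frames:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per enrolled speaker, trained on that speaker's feature frames.
rng = np.random.default_rng(1)
speakers = {
    "spk_a": rng.normal(loc=0.0, size=(300, 13)),  # placeholder frames
    "spk_b": rng.normal(loc=3.0, size=(300, 13)),
}
models = {name: GaussianMixture(n_components=4, covariance_type="diag",
                                random_state=0).fit(frames)
          for name, frames in speakers.items()}

# A test utterance is assigned to the model with the highest
# average log-likelihood over its frames.
test_utt = rng.normal(loc=3.0, size=(50, 13))      # drawn like spk_b
scores = {name: m.score(test_utt) for name, m in models.items()}
best = max(scores, key=scores.get)
```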
22. Multimodal Emogram, Data Collection and Presentation
- Author
-
Johann Adelhardt, Carmen Frank, Elmar Nöth, Viktor Zeißler, Heinrich Niemann, and Rui Ping Shi
- Subjects
Facial expression, Data collection, Computer science, Presentation, Use case, Emotional expression, Artificial intelligence, User state, Natural language processing, Gesture
- Abstract
Wizard-of-Oz (WOZ) data have several characteristics that are not optimally suited for user state classification, such as the nonuniform distribution of emotions across utterances and the distribution of emotional utterances across speech, facial expression, and gesture. In particular, the fact that most of the data collected in WOZ experiments show no emotional expression makes it difficult to obtain enough representative data for training the classifiers. Because of this problem we collected our own database. These data are also relevant for several demonstration sessions, where the functionality of the SmartKom system is shown in accordance with the defined use cases.
- Published
- 2006
23. The Gesture Interpretation Module
- Author
-
Rui Ping Shi, Anton Batliner, Elmar Nöth, Heinrich Niemann, Viktor Zeißler, Carmen Frank, and Johann Adelhardt
- Subjects
Communication, Facial expression, Unconscious mind, Interpretation (logic), Computer science, Semantics, Body language, Gesture recognition, Utterance, Gesture
- Abstract
Humans often make conscious and unconscious gestures, which reflect their mind, their thoughts, and the way these are formulated. These inherently complex processes can in general not be substituted by a corresponding verbal utterance with the same semantics (McNeill, 1992). Gesture, a kind of body language, contains important information on the intention and the state of the gesture producer. It is therefore an important communication channel in human-computer interaction.
- Published
- 2006
24. Environmental Adaptation with a Small Data Set of the Target Domain
- Author
-
Andreas Maier, Elmar Nöth, and Tino Haderlein
- Subjects
Microphone array, Small data, Computer science, Microphone, Speech recognition, Adaptation (eye), Pattern recognition, Set (abstract data type), Linear regression, Maximum a posteriori estimation, Artificial intelligence, Test data
- Abstract
In this work we present an approach to adapt speaker-independent recognizers to a new acoustical environment. The recognizers were trained with data recorded using a close-talking microphone and are to be evaluated on distant-talking microphone data. The adaptation set was recorded with the same type of microphone; in order to keep the speaker independency, this set includes 33 speakers. The adaptation itself is done using maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR) adaptation in combination with the Baum-Welch algorithm. Furthermore, the close-talking training data were artificially reverberated to reduce the mismatch between training and test data. In this manner the performance could be increased from 9.9% WA to 40.0% WA in speaker-open conditions. If further speaker-dependent adaptation is applied, this rate increases up to 54.9% WA.
- Published
- 2006
25. Visualization of Voice Disorders Using the Sammon Transform
- Author
-
Maria Schuster, Stefan Steidl, Elmar Nöth, Tino Haderlein, Dominik Zorn, and Makoto Shozakai
- Subjects
Voice pathology, Basis (linear algebra), Computer science, Speech recognition, Projection (set theory), Hidden Markov model, Markov model, Voice disorder, Natural language, Visualization
- Abstract
The Sammon Transform performs data projections in a topology-preserving manner on the basis of an arbitrary distance measure. We use the weights of the observation probabilities of semi-continuous HMMs that were adapted to the current speaker as input. Experiments on laryngectomized speakers with tracheoesophageal substitute voice, hoarse speakers, and normal speakers show encouraging results. Different speaker groups are separated in 2-D space, and the projection of a new speaker into the Sammon map allows prediction of his or her kind of voice pathology. The method can thus be used as an objective, automated support for the evaluation of voice disorders, and it visualizes them in a way that is convenient for speech therapists.
- Published
- 2006
26. Using Artificially Reverberated Training Data in Distant-Talking ASR
- Author
-
Elmar Nöth, Tino Haderlein, Walter Kellermann, Wolfgang Herbordt, and Heinrich Niemann
- Subjects
Reverberation, Microphone array, Computer science, Microphone, Speech recognition, Test set, Word error rate, Room acoustics, Speech processing, Test data
- Abstract
Automatic Speech Recognition (ASR) in reverberant rooms can be improved by choosing training data from the same acoustical environment as the test data. In a real-world application this is often not possible. A solution for this problem is to use speech signals from a close-talking microphone and reverberate them artificially with multiple room impulse responses. This paper shows results on recognizers whose training data differ in size and percentage of reverberated signals in order to find the best combination for data sets with different degrees of reverberation. The average error rate on a close-talking and a distant-talking test set could thus be reduced by 29% relative.
- Published
- 2005
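The artificial reverberation described above amounts to convolving close-talking speech with room impulse responses. A minimal sketch; the synthetic exponentially decaying noise RIR below is only a stand-in for the multiple measured responses the paper uses:

```python
import numpy as np

def reverberate(clean, rir):
    """Convolve a clean signal with a room impulse response and
    truncate to the original length."""
    return np.convolve(clean, rir)[:len(clean)]

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr // 2) / sr
rir = rng.normal(size=len(t)) * np.exp(-t / 0.1)   # decaying noise tail
rir[0], rir[1:] = 1.0, rir[1:] * 0.3               # direct path plus scaled tail
speech = rng.normal(size=sr)                       # placeholder for a clean signal
reverbed = reverberate(speech, rir)
```

Training material generated this way exposes the recognizer to reverberation without requiring distant-talking recordings.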
27. Robust Parallel Speech Recognition in Multiple Energy Bands
- Author
-
Heinrich Niemann, Stefan Steidl, Christian Hacker, Elmar Nöth, and Andreas Maier
- Subjects
Reverberation, Signal-to-noise ratio, Phone, Computer science, Robustness (computer science), Speech recognition, Cepstrum, Mel-frequency cepstrum, Linear discriminant analysis
- Abstract
In this paper we investigate the performance of TRAP features on clean and noisy data. Multiple feature sets are evaluated on a corpus that was recorded in both clean and noisy environments; in addition, the clean version was reverberated artificially. The feature sets are assembled from selected energy bands, and multiple recognizers are trained on the different bands. The outputs of all recognizers are joined with ROVER in order to obtain a single recognition result. This system is compared to a baseline recognizer that uses Mel frequency cepstrum coefficients (MFCC). We show that the use of artificial reverberation leads to more robustness to noise in general. Furthermore, most TRAP-based features excel in phone recognition. While MFCC features prove to be better in a matched training/test situation, TRAP features clearly outperform them in a mismatched one: when we train on clean data and evaluate on noisy data, the word accuracy (WA) is raised by 173% relative (from 12.0% to 32.8% WA).
- Published
- 2005
28. Automatic Recognition and Evaluation of Tracheoesophageal Speech
- Author
-
Tino Haderlein, Maria Schuster, Elmar Nöth, Frank Rosanowski, and Stefan Steidl
- Subjects
Laryngectomy ,Computer science ,Speech recognition ,medicine.medical_treatment ,medicine ,Acoustic model ,Tracheoesophageal Speech ,PSQM ,Intelligibility (communication) ,Markov model ,Speech processing ,Hidden Markov model - Abstract
Tracheoesophageal (TE) speech is one way to restore the ability to speak after laryngectomy, i.e. the removal of the larynx. TE speech often shows low audibility and intelligibility, which also makes it a challenge for automatic speech recognition. We improved the recognition results by adapting a speech recognizer trained on normal, non-pathological voices to individual TE speakers by unsupervised HMM interpolation.
- Published
- 2004
29. Speech Recognition with μ-Law Companded Features on Reverberated Signals
- Author
-
Elmar Nöth, Georg Stemmer, and Tino Haderlein
- Subjects
Reverberation ,Voice activity detection ,Computer science ,Microphone ,Law ,Speech recognition ,Speech corpus ,Mel-frequency cepstrum ,Language model ,Speech processing - Abstract
One of the goals of the EMBASSI project is the creation of a speech interface between a user and a TV set or VCR. The interface should allow spontaneous speech recorded by microphones far away from the speaker. This paper describes experiments evaluating the robustness of a speech recognizer against reverberation. For this purpose, a speech corpus was recorded with several different distortion types under real-life conditions. On these data, the recognition results for reverberated signals using μ-law companded features were compared to an MFCC baseline system. Trained on clear speech, the word accuracy for the μ-law features on highly reverberated signals was 3 percentage points better than the baseline result.
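The μ-law companding function itself is standard; a minimal sketch of applying it to feature values (the choice of μ = 255 follows the telephony convention and is an assumption here, not taken from the paper):

```python
import numpy as np

def mu_law_compand(x, mu=255.0):
    """μ-law companding: compress the dynamic range of values x in [-1, 1].
    y = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)"""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
```

Small magnitudes are expanded relative to large ones, which is what makes the representation less sensitive to the level changes introduced by reverberation.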
- Published
- 2003
30. Optimizing Eigenfaces by Face Masks for Facial Expression Recognition
- Author
-
Elmar Nöth and Carmen Frank
- Subjects
Facial expression ,Modality (human–computer interaction) ,Face hallucination ,Computer science ,business.industry ,Facial recognition system ,InformationSystems_MODELSANDPRINCIPLES ,Eigenface ,Human–computer interaction ,Pattern recognition (psychology) ,Three-dimensional face recognition ,Computer vision ,Artificial intelligence ,business ,Face detection - Abstract
A new direction in improving modern dialogue systems is to make a human-machine dialogue more similar to a human-human dialogue. This can be done by adding more input modalities. One additional modality for automatic dialogue systems is the facial expression of the human user. A common problem in a human-machine dialogue where an angry face may give a clue is the recurrent misunderstanding of the user by the system; a helpless face may indicate a naive user who does not know how to use the system and should be led through the dialogue step by step.
- Published
- 2003
31. Automatic Pixel Selection for Optimizing Facial Expression Recognition Using Eigenfaces
- Author
-
Carmen Frank and Elmar Nöth
- Subjects
Facial expression ,Face hallucination ,Pixel ,Computer science ,business.industry ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Pattern recognition ,Facial recognition system ,Software portability ,ComputingMethodologies_PATTERNRECOGNITION ,Eigenface ,Face (geometry) ,Three-dimensional face recognition ,Artificial intelligence ,business - Abstract
A new direction in improving modern dialogue systems is to make a human-machine dialogue more similar to a human-human dialogue. This can be done by adding more input modalities, e.g. facial expression recognition. A common problem in a human-machine dialogue where an angry face may give a clue is the recurrent misunderstanding of the user by the system. This paper describes recognizing facial expressions in frontal images using eigenspaces. For the classification of facial expressions, rather than using the whole image, we classify regions which do not differ between subjects and at the same time are meaningful for facial expressions. Using this face mask for training and classification of joy and anger expressions of the face, we achieved an improvement of up to 11% absolute. The portability to other classification problems is shown by a gender classification task.
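The masked-eigenspace idea can be sketched as PCA restricted to the pixels selected by a boolean face mask. All names and shapes below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def masked_eigenfaces(images, mask, n_components):
    """PCA over only the pixels selected by a boolean face mask.
    images: (N, H, W) array; mask: (H, W) boolean array."""
    X = images.reshape(len(images), -1)[:, mask.ravel()]
    mean = X.mean(axis=0)
    # SVD of the centered data yields the principal directions (eigenfaces)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(image, mask, mean, components):
    """Coefficients of a new face in the masked eigenspace."""
    return components @ (image.ravel()[mask.ravel()] - mean)
```

Classification (e.g. joy vs. anger) would then operate on the projected coefficients, e.g. with a nearest-neighbour rule.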
- Published
- 2003
32. Improving Children’s Speech Recognition by HMM Interpolation with an Adults’ Speech Recognizer
- Author
-
Stefan Steidl, Elmar Nöth, Christian Hacker, Heinrich Niemann, and Georg Stemmer
- Subjects
Training set ,Computer science ,business.industry ,Speech recognition ,Markov process ,Pattern recognition ,Hidden Markov model ,Set (abstract data type) ,Word accuracy ,symbols.namesake ,Expectation–maximization algorithm ,symbols ,Artificial intelligence ,business ,Interpolation - Abstract
In this paper we address the problem of building a good speech recognizer if there is only a small amount of training data available. The acoustic models can be improved by interpolation with the well-trained models of a second recognizer from a different application scenario. In our case, we interpolate a children’s speech recognizer with a recognizer for adults’ speech. Each hidden Markov model has its own set of interpolation partners; experiments were conducted with up to 50 partners. The interpolation weights are estimated automatically on a validation set using the EM algorithm. The word accuracy of the children’s speech recognizer could be improved from 74.6 % to 81.5 %. This is a relative improvement of almost 10 %.
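The EM estimation of interpolation weights on a validation set can be sketched as follows. The sketch assumes the per-observation likelihoods under each interpolation partner have already been computed; shapes and names are illustrative:

```python
import numpy as np

def em_interpolation_weights(likelihoods, iters=100):
    """Estimate interpolation weights with EM on a validation set.
    likelihoods: (N, K) array with p_k(x_i), the likelihood of validation
    observation i under interpolation partner k."""
    n, k = likelihoods.shape
    lam = np.full(k, 1.0 / k)                       # uniform initialization
    for _ in range(iters):
        weighted = likelihoods * lam                # E-step: joint p(k, x_i)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        lam = resp.mean(axis=0)                     # M-step: new weights
    return lam
```

The interpolated model then scores an observation as the λ-weighted sum of the partners' likelihoods.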
- Published
- 2003
33. Prosodic Classification of Offtalk: First Experiments
- Author
-
Elmar Nöth, Heinrich Niemann, Viktor Zeissler, and Anton Batliner
- Subjects
Facial expression ,Computer science ,Aside ,business.industry ,Speech recognition ,Feature vector ,Multimodal communication ,computer.software_genre ,Facial recognition system ,Artificial intelligence ,ddc:004 ,business ,Prosody ,computer ,Natural language processing ,Spontaneous speech ,Gesture - Abstract
SmartKom is a multi-modal dialogue system which combines speech with gesture and facial expression. In this paper, we deal with one of those phenomena which can be observed in such elaborate systems and which we call 'offtalk', i.e., speech that is not directed to the system (speaking to oneself, speaking aside). We report the classification results of first experiments which use a large prosodic feature vector in combination with part-of-speech information.
- Published
- 2002
34. MOBSY: Integration of Vision and Dialogue in Service Robots
- Author
-
Benno Heigl, Matthias Zobel, Elmar Nöth, Georg Stemmer, Joachim Denzler, Jochen Schmidt, and Dietrich Paulus
- Subjects
Service robot ,Service (systems architecture) ,Computer science ,Human–computer interaction ,business.industry ,Video tracking ,Pattern recognition (psychology) ,Obstacle avoidance ,Robot ,Robotics ,Artificial intelligence ,business ,Humanoid robot - Abstract
MOBSY is a fully integrated autonomous mobile service robot system. It acts as an automatic dialogue-based receptionist for visitors of our institute. MOBSY incorporates many techniques from different research areas into one working stand-alone system. Especially the computer vision and dialogue aspects are of main interest from a pattern recognition point of view. The techniques involved range from object classification over visual self-localization and recalibration to object tracking with multiple cameras. A dialogue component has to deal with speech recognition, understanding, and answer generation. Further techniques needed are navigation, obstacle avoidance, and mechanisms to provide fault-tolerant behavior. This contribution introduces our mobile system MOBSY. Besides the main aspects of vision and speech, we also focus on the integration aspect, both on the methodological and on the technical level. We describe the task and the techniques involved. Finally, we discuss the experiences that we gained with MOBSY during a live performance at the 25th anniversary of our institute.
- Published
- 2001
35. Research Issues for the Next Generation Spoken Dialogue Systems Revisited
- Author
-
Florian Gallwitz, Richard Huber, Julia Fischer, Georg Stemmer, Heinrich Niemann, Jürgen Haas, Elmar Nöth, Manuela Boros, and Volker Warnke
- Subjects
Nonverbal communication ,Computer science ,business.industry ,Human–computer interaction ,media_common.quotation_subject ,Information system ,Conversation ,Artificial intelligence ,computer.software_genre ,business ,computer ,Natural language processing ,media_common - Abstract
In this paper we take a second look at current research issues for conversational dialogue systems addressed in [17]. We look at two systems, a movie information system and a stock information system, which were built based on the experiences with the train information system EVAR described in [17].
- Published
- 2001
36. Towards a Dynamic Adjustment of the Language Weight
- Author
-
Viktor Zeissler, Heinrich Niemann, Elmar Nöth, and Georg Stemmer
- Subjects
Computer science ,business.industry ,Speech recognition ,Acoustic model ,Value (computer science) ,computer.software_genre ,Test set ,State (computer science) ,Language model ,Information source (mathematics) ,Artificial intelligence ,Constant (mathematics) ,business ,computer ,Natural language processing ,Utterance - Abstract
Most speech recognition systems use a language weight to reduce the mismatch between the language model and the acoustic models. Usually a constant value of the language weight is chosen for the whole test set. In this paper, we evaluate the possibility of adapting the language weight dynamically to the state of the dialogue or to the current utterance. Our experiments show that the gain in performance that can be achieved with a dynamic adjustment of the language weight on our data is very limited. This result is independent of the information source that is used for the adaptation of the language weight.
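The role of the language weight is the standard log-linear combination used in HMM decoding. A minimal sketch (the scores and weight values are made up for illustration):

```python
def combined_score(log_p_acoustic, log_p_lm, language_weight=10.0):
    """Log-linear combination used in HMM decoding: the language weight
    scales the LM log-probability against the acoustic log-likelihood."""
    return log_p_acoustic + language_weight * log_p_lm

# hypothetical rescoring: the best hypothesis can flip with the weight
hyps = [(-100.0, -2.0), (-95.0, -4.0)]     # (acoustic, LM) log scores
best_low  = max(hyps, key=lambda h: combined_score(*h, language_weight=1.0))
best_high = max(hyps, key=lambda h: combined_score(*h, language_weight=10.0))
```

A dynamic adjustment would make `language_weight` a function of the dialogue state or the current utterance instead of a constant.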
- Published
- 2001
37. Demonstration von Bildverarbeitung und Sprachverstehen in der Dienstleistungsrobotik
- Author
-
Joachim Denzler, Jochen Schmidt, Benno Heigl, Dietrich Paulus, Elmar Nöth, Georg Stemmer, and Matthias Zobel
- Abstract
The typical target environments for service robots, e.g. hospitals or retirement homes, place very high demands on the human-machine interface. These requirements generally exceed the capabilities of standard sensors such as ultrasound or infrared sensors, so complementary methods must be employed. From a pattern recognition perspective, the use of computer vision and natural-language dialogue is of particular interest. This contribution presents the mobile system MOBSY. MOBSY is a fully integrated autonomous mobile service robot. It serves as an automatic dialogue-based reception service for visitors of our institute. MOBSY combines diverse methods from a wide range of research areas in one stand-alone system. The computer vision methods employed range from object classification over visual self-localization and recalibration to multi-ocular object tracking. The dialogue component comprises methods for speech recognition, speech understanding, and answer generation. The contribution describes the task to be fulfilled and the individual methods.
- Published
- 2001
38. Recognition and Labelling of Prosodic Events in Slovenian Speech
- Author
-
France Mihelič, Jerneja Gros, Elmar Nöth, and Volker Warnke
- Subjects
ComputingMethodologies_PATTERNRECOGNITION ,Computer science ,Duration (music) ,business.industry ,Speech recognition ,Artificial intelligence ,computer.software_genre ,business ,computer ,Pitch period ,Natural language processing ,Pitch contour ,Loudness - Abstract
The paper describes prosodic annotation procedures for the GOPOLIS Slovenian speech database and methods for automatic classification of different prosodic events. Several statistical parameters concerning duration and loudness of words, syllables, and allophones were computed for the Slovenian language, for the first time on such a large amount of speech data. The evaluation of the annotated data showed a close match between automatically determined syntactic-prosodic boundary marker positions and those obtained by a rule-based approach.
- Published
- 2000
39. The Prosody Module
- Author
-
Johann Adelhardt, Heinrich Niemann, Carmen Frank, Rui Ping Shi, Anton Batliner, Elmar Nöth, and Viktor Zeißler
- Subjects
Artificial neural network ,business.industry ,Computer science ,Feature vector ,computer.software_genre ,language.human_language ,German ,Annotation ,language ,Language model ,Artificial intelligence ,ddc:004 ,Prosody ,business ,computer ,Sentence ,Statistic ,Natural language processing - Abstract
We describe the acoustic-prosodic and syntactic-prosodic annotation and classification of boundaries, accents, and sentence mood integrated in the Verbmobil system for the three languages German, English, and Japanese. For the acoustic-prosodic classification, a large feature vector with normalized prosodic features is used. For the three languages, a multilingual prosody module was developed that reduces memory requirements considerably compared to three monolingual modules. For classification, neural networks and statistical language models are used.
- Published
- 2000
40. The Recognition of Emotion
- Author
-
Elmar Nöth, Heinrich Niemann, Anton Batliner, Richard Huber, Jörg Spilker, and Kerstin Fischer
- Subjects
business.industry ,Computer science ,media_common.quotation_subject ,Feature vector ,Anger ,computer.software_genre ,ComputingMethodologies_PATTERNRECOGNITION ,Knowledge sources ,Artificial intelligence ,Dialog system ,Dialog box ,business ,Affective computing ,computer ,Natural language processing ,media_common - Abstract
Detecting emotional user behavior, particularly anger, can be very useful for successful automatic dialog processing. We present databases and prosodic classifiers implemented for the recognition of emotion in Verbmobil. Using a prosodic feature vector alone is, however, not sufficient for the modelling of emotional user behavior. Therefore, a module is described that combines several knowledge sources within an integrated classification of trouble in communication.
- Published
- 2000
41. The Utility of Semantic-Pragmatic Information and Dialogue-State for Speech Recognition in Spoken Dialogue Systems
- Author
-
Georg Stemmer, Elmar Nöth, and Heinrich Niemann
- Subjects
Perplexity ,Knowledge representation and reasoning ,Computer science ,Speech recognition ,Word error rate ,Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing) ,Language model ,Linear interpolation ,Pragmatics ,Semantics ,Interpolation - Abstract
Information about the dialogue state can be integrated into language models to improve the performance of the speech recogniser in a dialogue system. A dialogue state is defined in this paper as the question the user is replying to. One of the main problems in dialogue-state dependent language modelling is the limited amount of training data. In order to obtain robust models, we use the method of rational interpolation to smooth between a dialogue-state dependent and a general language model. In contrast to linear interpolation methods, rational interpolation weights the different predictors according to their reliability. Semantic-pragmatic knowledge is used to enlarge the training data for the language models. Both methods reduce perplexity and word error rate significantly.
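As a point of reference, the linear-interpolation baseline and the perplexity measure can be sketched as follows. The fixed weight `lam` is an illustration value; the paper's rational interpolation additionally weights each predictor by its reliability, which is not reproduced here:

```python
import math

def interpolated_prob(p_state, p_general, lam=0.6):
    """Linear interpolation of a dialogue-state dependent LM probability
    with a general LM probability (baseline method, fixed weight)."""
    return lam * p_state + (1 - lam) * p_general

def perplexity(word_probs):
    """Perplexity of a word sequence from its per-word probabilities."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))
```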
- Published
- 2000
42. Multilingual Speech Recognition
- Author
-
Heinrich Niemann, Stefan Harbeck, and Elmar Nöth
- Subjects
Philosophy of language ,Language identification ,Computer science ,Speech recognition ,Codebook ,Context (language use) ,Language model ,Speech processing ,Word (computer architecture) ,Natural language - Abstract
We present two concepts for systems with language identification in the context of multilingual information retrieval dialogs. The first one has an explicit module for language identification. It is based on training a common codebook for all languages and integrating over the output probabilities of language-specific n-gram models trained on the codebook sequences. The system can decide for one language either after a predefined time interval or when the difference between the probabilities of the languages exceeds a certain threshold. This approach makes it possible to recognize languages that the system cannot process and to play a prerecorded message in that language. In the second approach, the trained recognizers of the languages to be recognized, the lexicons, and the language models are combined into one multilingual recognizer. Since transitions are only allowed between words of the same language, each hypothesized word chain contains words from just one language, and language identification is an implicit by-product of the speech recognizer. First results for both language identification approaches are presented.
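The first concept (competing n-gram models over codebook sequences, with a threshold-based early decision) can be sketched with toy bigram tables. The margin, floor probability, and table format are all assumptions for illustration:

```python
import math

def identify_language(sequence, models, margin=10.0, floor=1e-4):
    """Accumulate per-symbol log-probabilities under each language's bigram
    model and decide as soon as the best language leads the runner-up by
    `margin` nats, or at the end of the sequence.
    models: lang -> {(prev_sym, sym): prob} (toy bigram tables)."""
    scores = {lang: 0.0 for lang in models}
    prev = None
    for sym in sequence:
        for lang, bigrams in models.items():
            scores[lang] += math.log(bigrams.get((prev, sym), floor))
        prev = sym
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] > margin:
            break                    # early decision: threshold exceeded
    return max(scores, key=scores.get)
```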
- Published
- 1999
43. Information Theoretic Based Segments for Language Identification
- Author
-
Uwe Ohler, Elmar Nöth, Heinrich Niemann, and Stefan Harbeck
- Subjects
Phonotactics ,Vocabulary ,Language identification ,Computer science ,business.industry ,Speech recognition ,media_common.quotation_subject ,Codebook ,Speech processing ,computer.software_genre ,Information theory ,language.human_language ,German ,Identification (information) ,language ,Segmentation ,Artificial intelligence ,Language model ,business ,computer ,Natural language processing ,media_common - Abstract
In this paper we present two new approaches for language identification, both based on the use of so-called multigrams, an information-theoretic observation representation. In the first approach we use multigram models for phonotactic modeling of phoneme or codebook sequences. The multigram model can be used to segment a new observation into larger units (e.g. something like words) and calculates a probability for the best segmentation. In the second approach we build a fenon recognizer using the segments of the best segmentation of the training material as "words" in the recognition vocabulary. On the OGI test corpus and on the NIST'95 evaluation corpus we obtained significant improvements with this second approach in comparison to the unsupervised codebook approach when discriminating between English and German utterances.
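Finding the most probable multigram segmentation is a dynamic-programming (Viterbi) search over variable-length units. A minimal sketch, assuming the unit probabilities are given and that at least one full segmentation exists:

```python
import math

def best_segmentation(seq, multigram_probs, max_len=4):
    """Viterbi search for the most probable segmentation of `seq` into
    variable-length units whose probabilities are in `multigram_probs`
    (keys are tuples of symbols). Assumes seq is fully segmentable."""
    n = len(seq)
    best = [(-math.inf, None)] * (n + 1)     # (log-prob, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for l in range(1, min(max_len, i) + 1):
            unit = tuple(seq[i - l:i])
            p = multigram_probs.get(unit)
            if p is None:
                continue
            score = best[i - l][0] + math.log(p)
            if score > best[i][0]:
                best[i] = (score, i - l)
    # backtrack the best segmentation
    segs, i = [], n
    while i > 0:
        j = best[i][1]
        segs.append(tuple(seq[j:i]))
        i = j
    return best[n][0], segs[::-1]
```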
- Published
- 1999
44. Fast and Robust Features for Prosodic Classification?
- Author
-
Elmar Nöth, Volker Warnke, Richard Huber, Anton Batliner, Jan Buckow, and Heinrich Niemann
- Subjects
Set (abstract data type) ,Normalization (statistics) ,ComputingMethodologies_PATTERNRECOGNITION ,Computer science ,Speech recognition ,Multilayer perceptron ,Segmentation ,ddc:004 ,Prosody ,Speech processing ,Word (computer architecture) ,Pitch contour - Abstract
In our previous research, we have shown that prosody can be used to dramatically improve the performance of the automatic speech translation system Verbmobil [5,7,8]. In Verbmobil, prosodic information is made available to the different modules of the system by annotating the output of a word recognizer with prosodic markers. These markers are determined in a classification process. The computation of the prosodic features used for classification was previously based on a time alignment of the phoneme sequence of the recognized words. The phoneme segmentation was needed for the normalization of duration and energy features. This time alignment was very expensive in terms of computational effort and memory requirements. In our new approach, the normalization is done on the word level with precomputed duration and energy statistics, so the phoneme segmentation can be avoided. With the new set of prosodic features, better classification results can be achieved, feature extraction can be sped up by 64%, and the memory requirements are reduced by 92%.
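The word-level normalization idea reduces to a z-score against precomputed per-word statistics, so no phoneme alignment is needed at run time. The table format and function name are hypothetical:

```python
def normalize_word_features(duration, energy, word, stats):
    """Z-score normalize a word's duration and energy using precomputed
    per-word statistics (hypothetical table:
    word -> (mean_dur, std_dur, mean_energy, std_energy))."""
    mean_dur, std_dur, mean_en, std_en = stats[word]
    return (duration - mean_dur) / std_dur, (energy - mean_en) / std_en
```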
- Published
- 1999
45. Research Issues for the Next Generation Spoken Dialogue Systems
- Author
-
Jürgen Haas, Maria Aretoulaki, Elmar Nöth, Florian Gallwitz, Heinrich Niemann, Richard Huber, and Stefan Harbeck
- Subjects
World Wide Web ,German ,Computer science ,business.industry ,language ,Multilingualism ,Artificial intelligence ,business ,Telephone line ,language.human_language ,Natural language - Abstract
In this paper we present extensions to the spoken dialogue system EVAR which address crucial issues for the next generation of dialogue systems. EVAR was developed at the University of Erlangen. In 1994, it became accessible over a telephone line and could answer inquiries in German about German InterCity train connections. It has since been continuously improved and extended, including some unique features, such as the processing of out-of-vocabulary words and a flexible dialogue strategy that adapts to the quality of the recognition of the user input.
- Published
- 1999
46. Prosodische Information: Begriffsbestimmung und Nutzen für das Sprachverstehen
- Author
-
Ralf Kompe, Elmar Nöth, Heinrich Niemann, Andreas Kießling, and Anton Batliner
- Subjects
ddc:004 - Abstract
Prosodic information plays a major role in human-human communication, but this information source had not previously been used in automatic speech processing (ASP). Only since automatic speech processing turned to spontaneous speech and less restricted tasks has the use of prosody become truly essential. We describe the reasons for this in detail and show, through the integration of prosody into the automatic translation system Verbmobil, that this use is also successful. Verbmobil is the world's first complete ASP system that uses prosodic information during linguistic analysis. Currently, the most effective prosodic information is provided by the probabilities for sentence boundaries, which are recognized correctly in 94% of cases. During syntactic parsing of word hypothesis graphs, the use of the sentence boundary information speeds up the syntactic analysis by 92% and reduces the number of syntactic readings by 96%.
- Published
- 1997
47. Automatic classification of dialog acts with Semantic Classification Trees and Polygrams
- Author
-
Elmar Nöth, Ernst Günter Schukat-Talamazzini, Marion Mast, and Heinrich Niemann
- Subjects
Computer science ,business.industry ,Component (UML) ,Speech recognition ,One-class classification ,Artificial intelligence ,Language model ,Dialog box ,computer.software_genre ,business ,computer ,Natural language processing - Abstract
This paper presents automatic methods for the classification of dialog acts. In the Verbmobil application (speech-to-speech translation of face-to-face dialogs), at most 50% of the utterances are analyzed in depth; for the rest, shallow processing takes place. The dialog component keeps track of the dialog with this shallow processing. For the classification of utterances without in-depth processing, two methods are presented: Semantic Classification Trees (SCTs) and Polygrams. For both methods, the classification algorithm is trained automatically from a corpus of labeled data. The novel idea with respect to SCTs is the use of dialog-state dependent trees; with respect to Polygrams, it is the use of competing language models for the classification of dialog acts.
- Published
- 1996
48. Statistical Modeling of Segmental and Suprasegmental Information
- Author
-
A. Kiessling, Konrad Ott, Elmar Nöth, Ralf Kompe, S. Rieck, Thomas Kuhn, Ernst Günter Schukat-Talamazzini, and Heinrich Niemann
- Subjects
Phrase ,Computer science ,Speech recognition ,Word recognition ,Statistical model ,Software system ,Dialog system ,Dialog box ,computer.software_genre ,Prosody ,computer ,Utterance - Abstract
A speech understanding and dialog system has to cope with spontaneous speech (including, for example, hesitations and non-words) and common dialog phenomena (including, for example, elliptic references and changes of topic). Modeling of words by segmental units should be supported by suprasegmental units since valuable information is represented in the prosody of an utterance. A software system for flexible and efficient modeling of speech by segmental and suprasegmental units will be described. Results concerning the use of prosodic information for word recognition, phrase boundaries, and dialog acts will be given. Aspects of integrating these results into an operational dialog system will be discussed.
- Published
- 1995
49. EVAR: Ein sprachverstehendes Dialogsystem
- Author
-
Wieland Eckert, Gernot A. Fink, Gerhard Sagerer, S. Rieck, Elmar Nöth, Heinrich Niemann, Franz Kummert, Andreas Kießling, Thomas Kuhn, R. Prechtel, A. Scheuer, Marion Mast, G. Schukat-Talamazzini, Ralf Kompe, and B. Seestaedt
- Abstract
This article deals with the speech understanding dialogue system EVAR, in particular with the linguistic processing of the system. The task of EVAR is to conduct an information-seeking dialogue about the German InterCity train system. The linguistic knowledge is uniformly represented in a semantic network. The knowledge base is well structured according to a layered linguistic model. The interface to speech recognition is the word hypothesis level. The control algorithm is formulated independently of the application and allows dynamic switching between the two basic analysis strategies, top-down and bottom-up. The knowledge represented in the system is used both to guide the recognition phase and in the understanding phase. The system is able to process queries despite erroneous recognition results. Results for a speaker-dependent and a multi-speaker version of the recognition are presented.
- Published
- 1992
50. Iterative Optimization of the Data Driven Analysis in Continuous Speech
- Author
-
Thomas Kuhn, S. Rieck, Elmar Nöth, Ernst Günter Schukat-Talamazzini, and S. Kunzmann
- Subjects
ComputingMethodologies_PATTERNRECOGNITION ,Basis (linear algebra) ,Estimation theory ,Computer science ,Iterative method ,Speech recognition ,Word recognition ,Statistical parameter ,Initialization ,Bootstrapping (linguistics) ,Algorithm ,Data-driven - Abstract
We present an iterative method to optimize the word recognition rate for a data-driven analysis in continuous speech using a large set of speech samples. After a short description of our system environment, a bootstrapping method for iterative parameter estimation is discussed. The initialization of the bootstrapping procedure uses a limited amount of hand-labeled training data to estimate the statistical parameters roughly. In the second step, the statistical parameters are estimated more precisely on the basis of unlabeled training data. Experimental results for the bootstrapping method performed on unlabeled training data are given in comparison with results achieved by parameter estimation on labeled training data.
- Published
- 1992