60 results for "Elmar Nöth"
Search Results
2. Adult cochlear implant users versus typical hearing persons: an automatic analysis of acoustic–prosodic parameters
- Author
-
Tomás Arias-Vergara, Anton Batliner, Tobias Rader, Daniel Polterauer, Catalina Högerle, Joachim Müller, Juan-Rafael Orozco-Arroyave, Elmar Nöth, and Maria Schuster
- Subjects
Speech and Hearing, Linguistics and Language, Language and Linguistics, ddc:004
- Abstract
Purpose: The aim of this study was to investigate the speech prosody of postlingually deaf cochlear implant (CI) users compared with control speakers without hearing or speech impairment. Method: Speech recordings of 74 CI users (37 males and 37 females) and 72 age-balanced control speakers (36 males and 36 females) are considered. All participants are German native speakers and read Der Nordwind und die Sonne (The North Wind and the Sun), a standard text in pathological speech analysis and phonetic transcriptions. Automatic acoustic analysis is performed considering pitch, loudness, and duration features, including speech rate and rhythm. Results: In general, duration and rhythm features differ between CI users and control speakers. CI users read more slowly and have a lower voiced segment ratio compared with control speakers. A lower voiced ratio goes along with a prolongation of the voiced segments' duration in male CI users and with a prolongation of pauses in female CI users. Rhythm features of CI users show higher variability in the duration of vowels and consonants than those of control speakers. The use of bilateral CIs showed no advantages concerning speech prosody features in comparison to unilateral CI use. Conclusions: Even after cochlear implantation and rehabilitation, the speech of postlingually deaf adults deviates from the speech of control speakers, which might be due to changed auditory feedback. We suggest considering changes in temporal aspects of speech in future rehabilitation strategies. Supplemental Material: https://doi.org/10.23641/asha.21579171
- Published
- 2022
3. The Phonetic Footprint of Parkinson's Disease
- Author
-
Anton Batliner, Juan Camilo Vásquez-Correa, Tomas Arias-Vergara, Elmar Nöth, Philipp Klumpp, Juan Rafael Orozco-Arroyave, and Paula Andrea Pérez-Toro
- Subjects
Speech production, Parkinson's disease, Computer science, Feature vector, Speech recognition, Realization (linguistics), Pronunciation, Intelligibility (communication), Theoretical Computer Science, Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Human-Computer Interaction, Vowel, Muscle tension, Software, ddc:004
- Abstract
As one of the most prevalent neurodegenerative disorders, Parkinson's disease (PD) has a significant impact on the fine motor skills of patients. The complex interplay of different articulators during speech production and the realization of the required muscle tension become increasingly difficult, thus leading to dysarthric speech. Characteristic patterns such as vowel instability, slurred pronunciation, and slow speech can often be observed in affected individuals and were analyzed in previous studies to determine the presence and progression of PD. In this work, we used a phonetic recognizer trained exclusively on healthy speech data to investigate how PD affected the phonetic footprint of patients. We rediscovered numerous patterns that had been described in previous contributions, although our system had never seen any pathological speech before. Furthermore, we could show that intermediate activations from the neural network could serve as feature vectors encoding information related to the disease state of individuals. We were also able to directly correlate the expert-rated intelligibility of a speaker with the mean confidence of phonetic predictions. Our results support the assumption that pathological data is not necessarily required to train systems that are capable of analyzing PD speech. Comment: https://www.sciencedirect.com/science/article/abs/pii/S0885230821001169
- Published
- 2021
4. Prosodic models, automatic speech understanding, and speech synthesis: towards the common ground
- Author
-
Anton Batliner, Bernd Möbius, Gregor Möhler, Antje Schweitzer, and Elmar Nöth
- Subjects
ddc:004
- Published
- 2020
5. The facial expression module
- Author
-
Carmen Frank, Elmar Nöth, Johann Adelhardt, Heinrich Niemann, Viktor Zeißler, Rui Ping Shi, and Anton Batliner
- Subjects
Facial expression, Modalities, Modality (human–computer interaction), Computer science, Human–computer interaction, Gesture recognition, Interface (computing), User interface, Gesture, ddc:004
- Abstract
In current dialogue systems the use of speech as an input modality is common. But this modality is only one of those that human beings use. In human-human interaction, people also use gestures to point or facial expressions to show their moods. To give modern systems a chance to read information from all modalities used by humans, these systems must have multimodal user interfaces. The SmartKom system has such a multimodal interface that analyzes facial expression, speech, and gesture simultaneously. Here we present the module that fulfills the task of facial expression analysis in order to identify the internal state of a user.
- Published
- 2020
6. Combining semantic word classes and sub-word unit speech recognition for robust OOV detection
- Author
-
Anton Batliner, Elmar Nöth, Caroline Kaufhold, and Axel Horndasch
- Subjects
Computer science, Speech recognition, Artificial intelligence, Natural language processing, Word (computer architecture), ddc:004
- Published
- 2020
7. Phoneme-to-grapheme mapping for spoken inquiries to the semantic web
- Author
-
Axel Horndasch, Elmar Nöth, Anton Batliner, and Volker Warnke
- Subjects
ddc:004
- Published
- 2020
8. Does it groove or does it stumble: automatic classification of alcoholic intoxication using prosodic features
- Author
-
Florian Hönig, Anton Batliner, and Elmar Nöth
- Subjects
ddc:004
- Published
- 2020
9. Acoustic-prosodic characteristics of sleepy speech – between performance and interpretation
- Author
-
Elmar Nöth, Sebastian Schnieder, Florian Hönig, Anton Batliner, and Jarek Krajewski
- Subjects
Computer science, Speech recognition, ddc:004
- Published
- 2020
10. Prosody takes over: towards a prosodically guided dialog system
- Author
-
Anton Batliner, Marion Mast, Heinrich Niemann, Ralf Kompe, K. Ott, Andreas Kießling, Thomas Kuhn, and Elmar Nöth
- Subjects
Linguistics and Language, Computer science, Communication, Speech recognition, Intonation (linguistics), Speech synthesis, Language and Linguistics, Computer Science Applications, Modeling and Simulation, Computer Vision and Pattern Recognition, Artificial intelligence, Dialog system, Prosody, Software, Utterance, Natural language processing, ddc:004
- Abstract
The domain of the speech recognition and dialog system EVAR is train timetable inquiry. We observed that in real human-human dialogs, the customer very often interrupts while the officer transmits the information. Many of these interruptions are just repetitions of the time of day given by the officer. The functional role of these interruptions is often determined by prosodic cues only. An important result of experiments where naive persons used the EVAR system is that it is hard to follow the train connection given via speech synthesis. In this case it is even more important than in human-human dialogs that the user has the opportunity to interact during the answer phase. Therefore we extended the dialog module to allow the user to repeat the time of day, and we added a prosody module guiding the continuation of the dialog by analyzing the intonation contour of this utterance.
- Published
- 2020
11. To talk or not to talk with a computer: taking into account the user’s focus of attention
- Author
-
Anton Batliner, Elmar Nöth, and Christian Hacker
- Subjects
Focus (computing), Multimedia, Computer science, Human–computer interaction, Human-Computer Interaction, Signal Processing, Face detection, ddc:004
- Abstract
If no specific precautions are taken, people talking to a computer can, just as when talking to another human, speak aside, either to themselves or to another person. On the one hand, the computer should notice and process such utterances in a special way; on the other hand, such utterances provide us with unique data to contrast these two registers: talking vs. not talking to a computer. In this paper, we present two different databases, SmartKom and SmartWeb, and classify and analyse On-Talk (addressing the computer) vs. Off-Talk (addressing someone else), and by that, the user's focus of attention, found in these two databases, employing uni-modal (prosodic and linguistic) features and multimodal information (additional face detection).
- Published
- 2020
12. Quantification of segmentation and F0 errors and their effect on emotion recognition
- Author
-
Joachim Hornegger, Elmar Nöth, Stefan Steidl, and Anton Batliner
- Subjects
Computer science, Speech recognition, Text segmentation, Pattern recognition, Segmentation, AIBO, Emotion recognition, Artificial intelligence, Energy (signal processing), Word (computer architecture), ddc:004
- Abstract
Prosodic features modelling pitch, energy, and duration play a major role in speech emotion recognition. Our word-level features, especially duration and pitch features, rely on correct word segmentation and F0 extraction. For the FAU Aibo Emotion Corpus, the automatic segmentation from a forced alignment of the spoken word sequence and the automatically extracted F0 values have been manually corrected. Frequencies of different types of segmentation and F0 errors are given, and their influence on emotion recognition using different groups of prosodic features is evaluated. The classification results show that the impact of these errors on emotion recognition is small.
- Published
- 2020
13. Prosody takes over: a prosodically guided dialog system
- Author
-
Ralf Kompe, Andreas Kießling, Thomas Kuhn, Marion Mast, Heinrich Niemann, Elmar Nöth, K. Ott, and Anton Batliner
- Subjects
dialog, prosody, ASU, Künstliche Intelligenz, ddc:620, ddc:004
- Abstract
In this paper, first experiments with naive persons using the speech understanding and dialog system EVAR are discussed. The domain of EVAR is train timetable inquiry. We observed that in real human-human dialogs, the customer very often interrupts while the officer transmits the information. Many of these interruptions are just repetitions of the time of day given by the officer. The functional role of these interruptions is determined by prosodic cues only. An important result of the experiments with EVAR is that it is hard to follow the system giving the train connection via speech synthesis. In this case it is even more important than in human-human dialogs that the user has the opportunity to interact during the answer phase. Therefore we extended the dialog module to allow the user to repeat the time of day, and we added a prosody module guiding the continuation of the dialog.
- Published
- 2020
14. Looking at the last two turns, I'd say this dialogue is doomed – measuring dialogue success
- Author
-
Anton Batliner, Christine Ruff, Stefan Steidl, Christian Hacker, Jürgen Haas, and Elmar Nöth
- Subjects
Computer science, Speech recognition, Statistical analysis, Speech processing, Word (computer architecture), Natural language, Domain (software engineering), ddc:004
- Abstract
Two sets of linguistic features are developed: the first to estimate whether a single step in a dialogue between a human being and a machine is successful or not, the second to classify dialogues as a whole. The features are based on part-of-speech labels (POS), word statistics, and properties of turns and dialogues. Experiments were carried out on the SympaFly corpus, data from a real application in the flight booking domain. A single dialogue step could be classified with an accuracy of 83% (class-wise averaged recognition rate). The recognition rate for whole dialogues was 85%.
- Published
- 2020
15. Can you understand him? Let's look at his word accuracy – automatic evaluation of tracheoesophageal speech
- Author
-
Maria Schuster, Elmar Nöth, Tino Haderlein, Anton Batliner, Stefan Steidl, and Frank Rosanowski
- Subjects
Laryngectomy, Word accuracy, Computer science, Speech recognition, Tracheoesophageal speech, Intelligibility (communication), Speech therapy, ddc:004
- Abstract
Tracheoesophageal (TE) speech is one possibility to restore the ability to speak after laryngectomy. TE speech often shows low intelligibility. An objective means to determine and quantify intelligibility has not existed until now, and an automation of this procedure is desirable. We used a speech recognizer trained on normal, non-pathologic voices. We compared intelligibility scores for TE speech from five experienced raters with the word accuracy (WA) of our speech recognizer. A correlation coefficient of -0.84 shows that WA can be a good indicator of intelligibility for pathologic voices. An outlook for future work is presented.
- Published
- 2020
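A minimal illustrative sketch of the correlation analysis described in entry 15 above: comparing recognizer word accuracy with averaged expert intelligibility ratings via a Pearson correlation. All values and the rating scale are invented for illustration; nothing below is code or data from the paper.

```python
import numpy as np

# Hypothetical per-speaker word accuracy (%) of a recognizer trained on
# normal voices, and mean intelligibility ratings of five raters
# (here assumed as 1 = very good ... 5 = very poor).
word_accuracy = np.array([72.1, 55.3, 63.8, 40.2, 81.0])
expert_rating = np.array([1.8, 3.2, 2.6, 4.1, 1.2])

# The off-diagonal entry of the correlation matrix is the Pearson r; with a
# "lower rating = better" scale, a strongly negative r (the paper reports
# -0.84) means high word accuracy goes along with good intelligibility.
r = np.corrcoef(word_accuracy, expert_rating)[0, 1]
print(f"Pearson r = {r:.2f}")
```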
16. Are you looking at me, are you talking with me: multimodal classification of the focus of attention
- Author
-
Elmar Nöth, Anton Batliner, and Christian Hacker
- Subjects
Focus (computing), Computer science, Speech recognition, Mobile computing, Conversation, Semantic Web, Mobile device, Natural language, Utterance, ddc:004
- Abstract
Automatic dialogue systems get easily confused if speech is recognized which is not directed to the system. Besides noise or other people's conversation, even the user's utterance can cause difficulties when he is talking to someone else or to himself (“Off-Talk”). In this paper the automatic classification of the user's focus of attention is investigated. In the German SmartWeb project, a mobile device is used to get access to the semantic web. In this scenario, two modalities are provided – speech and video signal. This makes it possible to classify whether a spoken request is addressed to the system or not: with the camera of the mobile device, the user's gaze direction is detected; in the speech signal, prosodic features are analyzed. Encouraging recognition rates of up to 93% are achieved in the speech-only condition. Further improvement is expected from the fusion of the two information sources.
- Published
- 2020
17. Comparison and combination of confidence measures
- Author
-
Elmar Nöth, Stefan Steidl, Anton Batliner, Georg Stemmer, and Heinrich Niemann
- Subjects
Training set, Artificial neural network, Computer science, Machine learning, Cross entropy, Posteriori probability, Confidence measures, Entropy (information theory), Beam search, Artificial intelligence, Language model, ddc:004
- Abstract
A set of features for word-level confidence estimation is developed. The features should be easy to implement and should require no additional knowledge beyond the information which is available from the speech recognizer and the training data. We compare a number of features based on a common scoring method, the normalized cross entropy. We also study different ways to combine the features. An artificial neural network leads to the best performance, and a recognition rate of 76% is achieved. The approach is extended not only to detect recognition errors but also to distinguish between insertion and substitution errors.
- Published
- 2020
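Entry 17 above scores its confidence features with normalized cross entropy (NCE); the following is an illustrative sketch of the standard NCE computation, with invented correctness labels and confidence values (not the paper's code or data).

```python
import numpy as np

def normalized_cross_entropy(correct, confidence):
    """correct: 0/1 per word (recognized correctly?); confidence: estimated
    probability of correctness. Returns NCE: 1 means perfect confidence
    estimation, 0 means no better than always predicting the prior."""
    correct = np.asarray(correct, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    p = correct.mean()  # empirical probability of a correct word
    h_base = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # prior entropy
    h_conf = -np.mean(correct * np.log2(confidence)
                      + (1 - correct) * np.log2(1 - confidence))
    return (h_base - h_conf) / h_base

# Toy example: six words, four recognized correctly.
print(normalized_cross_entropy([1, 1, 0, 1, 0, 1],
                               [0.9, 0.8, 0.3, 0.7, 0.4, 0.6]))
```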
18. Integrated recognition of words and phrase boundaries
- Author
-
Florian Gallwitz, Anton Batliner, Jan Buckow, Richard Huber, Heinrich Niemann, and Elmar Nöth
- Subjects
ddc:004
- Published
- 2020
19. Voice source state as a source of information in speech recognition: detection of laryngealizations
- Author
-
Anton Batliner, Ralf Kompe, Elmar Nöth, Heinrich Niemann, and Andreas Kießling
- Subjects
Voice activity detection, Artificial neural network, Computer science, Speech recognition, Computation, Pattern recognition, Linear discriminant analysis, Signal, Constant false alarm rate, Cepstrum, Artificial intelligence, ddc:004
- Abstract
Laryngealizations are irregular voiced portions of speech, which can have morpho-syntactic functions and can disturb the automatic computation of F0. Two methods for the automatic detection of laryngealizations are described in this paper: with a Gaussian classifier using spectral and cepstral features, a recognition rate of 80% (false alarm rate of 8%) could be achieved. As an alternative, a “non-standard” method has been developed: an artificial neural network (ANN) was used for non-linear inverse filtering of speech signals. The inversely filtered signal was directly used as input for another ANN, which was trained to detect laryngealizations. In preliminary experiments we obtained a recognition rate of 65% (12% false alarms).
- Published
- 2020
20. The prosodic marking of phrase boundaries: expectations and results
- Author
-
Ralf Kompe, Elmar Nöth, Anton Batliner, Heinrich Niemann, U. Kilian, and Andreas Kießling
- Subjects
Sentence generation, Phrase, Grammar, Computer science, Perception, Dependent clause, Artificial intelligence, Natural language processing, Sentence, ddc:004
- Abstract
Using sentence templates and a stochastic context-free grammar, a large corpus (10,000 sentences) has been created in which prosodic phrase boundaries are labeled automatically during sentence generation. With perception experiments on a subset of 500 utterances we verified that 92% of the automatically marked boundaries were perceived as prosodically marked. In initial automatic classification experiments for three levels of boundaries, recognition rates of up to 81% could be achieved.
- Published
- 2020
21. Automatic evaluation of prosodic features of tracheoesophageal substitute voice
- Author
-
Joachim Hornegger, Anton Batliner, Frank Rosanowski, Elmar Nöth, Ulrich Eysholdt, Tino Haderlein, Hikmet Toy, and Maria Schuster
- Subjects
Male, Voice Quality, Evaluation data, Speech recognition, Laryngectomy, Speech, Esophageal, Speech outcome, Intelligibility (communication), Loudness, Correlation, Humans, Prosody, Aged, Electronic Data Processing, General Medicine, Middle Aged, Speech, Alaryngeal, Otorhinolaryngology, Head and neck surgery, Tracheoesophageal Fistula, ddc:004
- Abstract
In comparison with laryngeal voice, substitute voice after laryngectomy is characterized by restricted aero-acoustic properties. Until now, an objective measure of the prosodic differences between substitute and normal voices has not existed. In a pilot study, we applied an automatic prosody analysis module to 18 speech samples of laryngectomees (age: 64.2 +/- 8.3 years) and 18 recordings of normal speakers of the same age (65.4 +/- 7.6 years). Ninety-five different features per word were measured, based upon the speech energy, fundamental frequency F0, and duration measures on words, pauses, and voiced/voiceless sections. These reflect aspects of loudness, pitch, and articulation rate. Subjective evaluation of the 18 patients' voices was performed by a panel of five experts on the criteria "noise", "speech effort", "roughness", "intelligibility", "match of breath and sense units", and "overall quality". These ratings were compared to the automatically computed features. Several of the features were found to be twice as high for the laryngectomees as for the normal speakers, or vice versa. Comparing the evaluation data of the human experts and the automatic rating, correlation coefficients of up to 0.84 were measured. The automatic analysis serves as a good means to objectify and quantify the global speech outcome of laryngectomees. Even better results are expected when both the computation of the features and the method of comparison to the human ratings have been revised and adapted to the special properties of the substitute voices.
- Published
- 2020
22. Can you tell apart spontaneous and read speech if you just look at prosody?
- Author
-
Andreas Kießling, Heinrich Niemann, Elmar Nöth, Ralf Kompe, and Anton Batliner
- Subjects
Computer science, Speech recognition, Speech corpus, German, Artificial intelligence, Prosody, Natural language processing, Spontaneous speech, ddc:004
- Abstract
Although the recognition of spontaneous speech is the ultimate aim of speech understanding systems, it has rarely been investigated so far. In this article, first analyses of a German database containing identical utterances of spontaneous and read speech are presented. We describe the differences in prosody between these two registers and report results of a classifier that was trained using prosodic features to discriminate spontaneous and read speech. A systematic difference could be observed that is, however, rather complex and partly speaker-dependent.
- Published
- 2020
23. Automatic modelling of depressed speech: relevant features and relevance of gender
- Author
-
Sebastian Schnieder, Elmar Nöth, Jarek Krajewski, Florian Hönig, and Anton Batliner
- Subjects
Psychomotor retardation, Computer science, Speech recognition, Feature selection, Loudness, Variation (linguistics), Relevance (information retrieval), Artificial intelligence, Prosody, Natural language processing, ddc:004
- Abstract
Depression is an affective disorder characterised by psychomotor retardation; in speech, this shows up in reduction of pitch (variation, range), loudness, and tempo, and in voice qualities different from those of typical modal speech. A similar reduction can be observed in sleepy speech (relaxation). In this paper, we employ a small group of acoustic features modelling prosody and spectrum that have been proven successful in the modelling of sleepy speech, enriched with voice quality features, for the modelling of depressed speech within a regression approach. This knowledge-based approach is complemented by and compared with brute-forcing and automatic feature selection. We further discuss gender differences and the contributions of (groups of) features both for the modelling of depression and across depression and sleepiness.
- Published
- 2020
24. On the use of prosody in automatic dialogue understanding
- Author
-
Elmar Nöth, M. Nutt, Heinrich Niemann, Florian Gallwitz, Richard Huber, Anton Batliner, Jürgen Haas, Volker Warnke, Jan Buckow, and Manuela Boros
- Subjects
Linguistics and Language, Shallow parsing, Parsing, Artificial neural network, Computer science, Communication, Speech synthesis, Language and Linguistics, Computer Science Applications, Modeling and Simulation, Segmentation, Computer Vision and Pattern Recognition, Artificial intelligence, Prosody, Software, Natural language processing, ddc:004
- Abstract
In this paper, we show how prosodic information can be used in automatic dialogue systems and give some examples of promising new approaches. Most of these examples are taken from our own work in the Verbmobil speech-to-speech translation system and in the EVAR train timetable dialogue system. In a `prosodic orbit', we first present units, phenomena, annotations, and statistical methods from the signal (acoustics) to the dialogue understanding phase. We then show how prosody can be used together with other knowledge sources for the task of resegmentation if a first segmentation turns out to be wrong, and how an integrated approach leads to better results than a sequential use of the different knowledge sources; then we present a hybrid approach which is used to perform a shallow parsing and which uses prosody to guide the parsing; finally, we show how a critical system evaluation can help to improve the overall performance of automatic dialogue systems.
- Published
- 2019
25. Prosodic processing and its use in VERBMOBIL
- Author
-
Anton Batliner, Ralf Kompe, Heinrich Niemann, Elmar Nöth, and Andreas Kießling
- Subjects
Parsing, Computer science, Speech recognition, Artificial intelligence, Speech processing, Künstliche Intelligenz, Computational linguistics, Prosody, Natural language processing, Word (computer architecture), Sentence, ddc:620, ddc:004
- Abstract
We present the prosody module of the VERBMOBIL speech-to-speech translation system, the first complete system worldwide that successfully uses prosodic information in the linguistic analysis. This is achieved by computing probabilities for clause boundaries, accentuation, and different types of sentence mood for each of the word hypotheses computed by the word recognizer. These probabilities guide the search of the linguistic analysis. Disambiguation is already achieved during the analysis and not by a prosodic verification of different linguistic hypotheses. So far, the most useful prosodic information is provided by clause boundaries. These are detected with a recognition rate of 94%. For the parsing of word hypotheses graphs, the use of clause boundary probabilities yields a speed-up of 92% and a 96% reduction of alternative readings.
- Published
- 2019
26. Private emotions versus social interaction: a data-driven approach towards analysing emotion in speech
- Author
-
Elmar Nöth, Christian Hacker, Anton Batliner, and Stefan Steidl
- Subjects
Computer science, Speech recognition, Social relation, Computer Science Applications, Education, Data-driven, Arousal, Human-Computer Interaction, Multimedia information systems, AIBO, Valence (psychology), Confusion, Spontaneous speech, Cognitive psychology, ddc:004
- Abstract
The `traditional' first two dimensions in emotion research are VALENCE and AROUSAL. Normally, they are obtained by using elicited, acted data. In this paper, we use realistic, spontaneous speech data from our `AIBO' corpus (human-robot communication, children interacting with Sony's AIBO robot). The recordings were done in a Wizard-of-Oz scenario: the children believed that AIBO obeys their commands; in fact, AIBO followed a fixed script and often disobeyed. Five labellers annotated each word as belonging to one of eleven emotion-related states; seven of these states, which occurred frequently enough, are dealt with in this paper. The confusion matrices of these labels were used in a Non-Metrical Multi-dimensional Scaling to display two dimensions; the first we interpret as VALENCE, the second, however, not as AROUSAL but as INTERACTION, i.e., addressing oneself (angry, joyful) or the communication partner (motherese, reprimanding). We show that it depends on the specificity of the scenario and on the subjects' conceptualizations whether this new dimension can be observed, and discuss impacts on the practice of labelling and processing emotional data. Two-dimensional solutions based on acoustic and linguistic features that were used for automatic classification of these emotional states are interpreted along the same lines.
- Published
- 2019
27. M = Syntax + Prosody: a syntactic–prosodic labelling scheme for large spontaneous speech databases
- Author
-
Andreas Kießling, Heinrich Niemann, Ralf Kompe, Anton Batliner, Elmar Nöth, and Marion Mast
- Subjects
Linguistics and Language, Computer science, Speech recognition, Language and Linguistics, Prosody, Database, Artificial neural network, Communication, Statistical model, Perceptron, Syntax, Computer Science Applications, Modeling and Simulation, Computer Vision and Pattern Recognition, Language model, Artificial intelligence, Software, Natural language processing, ddc:004
- Abstract
In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries, large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation), we developed a syntactic-prosodic labelling scheme where different types of syntactic boundaries are labelled for a large spontaneous speech corpus. This labelling scheme is presented and compared with other labelling schemes for perceptual-prosodic, syntactic, and dialogue act boundaries. Interlabeller consistencies and estimates of the effort needed are discussed. We compare the results of classifiers (multi-layer perceptrons (MLPs) and n-gram language models) trained on these syntactic-prosodic boundary labels with classifiers trained on perceptual-prosodic and pure syntactic labels. The main advantage of the rough syntactic-prosodic labels presented in this paper is that large amounts of data can be labelled with relatively little effort. The classifiers trained with these labels turned out to be superior with respect to purely prosodic or syntactic labelling schemes, yielding recognition rates of up to 96% for the two-class problem `boundary versus no boundary'. The use of boundary information leads to a marked improvement in the syntactic processing of the VM system.
- Published
- 2019
28. 'Of all things the measure is man' - automatic classification of emotions and inter-labeler consistency
- Author
-
Heinrich Niemann, Anton Batliner, Michael Levit, Stefan Steidl, and Elmar Nöth
- Subjects
Support vector machine, Artificial neural network, Computer science, Emotion classification, Speech recognition, Supervised learning, Entropy (information theory), Emotion recognition, Speech processing, ddc:004
- Abstract
In traditional classification problems, the reference needed for training a classifier is given and considered to be absolutely correct. However, this does not apply to all tasks. In emotion recognition in non-acted speech, for instance, one often does not know which emotion was really intended by the speaker. Hence, the data is annotated by a group of human labelers who do not agree on one common class in most cases. Often, similar classes are confused systematically. We propose a new entropy-based method to evaluate classification results taking into account these systematic confusions. We can show that a classifier which achieves a recognition rate of "only" about 60% on a four-class problem performs as well as our five human labelers on average.
- Published
- 2019
29. Verbmobil: the use of prosody in the linguistic components of a speech understanding system
- Author
-
Ralf Kompe, Elmar Nöth, Anton Batliner, Heinrich Niemann, and Andreas Kießling
- Subjects
Parsing, Acoustics and Ultrasonics, Computer science, Speech recognition, Intonation (linguistics), Speech processing, Syntax, Linguistics, Computer Vision and Pattern Recognition, Artificial intelligence, Electrical and Electronic Engineering, Language translation, Prosody, Software, Natural language, Sentence, Natural language processing, ddc:004
- Abstract
We show how prosody can be used in speech understanding systems. This is demonstrated with the VERBMOBIL speech-to-speech translation system which, to our knowledge, is the first complete system which successfully uses prosodic information in the linguistic analysis. Prosody is used by computing probabilities for clause boundaries, accentuation, and different types of sentence mood for each of the word hypotheses computed by the word recognizer. These probabilities guide the search of the linguistic analysis. Disambiguation is already achieved during the analysis and not by a prosodic verification of different linguistic hypotheses. So far, the most useful prosodic information is provided by clause boundaries. These are detected with a recognition rate of 94%. For the parsing of word hypotheses graphs, the use of clause boundary probabilities yields a speed-up of 92% and a 96% reduction of alternative readings.
- Published
- 2019
30. The INTERSPEECH 2015 computational paralinguistics challenge: nativeness, Parkinson's & eating condition
- Author
-
Florian Hönig, Simone Hantke, Juan Rafael Orozco-Arroyave, Elmar Nöth, Felix Weninger, Anton Batliner, Yue Zhang, Stefan Steidl, and Björn Schuller
- Subjects
Computer science, digestive, oral, and skin physiology, Cognitive psychology, ddc:004
- Abstract
The INTERSPEECH 2015 Computational Paralinguistics Challenge addresses three different problems for the first time in a research competition under well-defined conditions: the estimation of the degree of nativeness, the neurological state of patients with Parkinson's condition, and the eating condition of speakers, i.e., whether and which food type they are eating, in a seven-class problem. In this paper, we describe these sub-challenges, their conditions, and the baseline feature extraction and classifiers, as provided to the participants. Index Terms: Computational Paralinguistics, Challenge, Degree of Nativeness, Parkinson's Condition, Eating Condition
- Published
- 2019
31. How to find trouble in communication
- Author
-
Elmar Nöth, Jörg Spilker, Kerstin Fischer, Anton Batliner, and Richard Huber
- Subjects
Linguistics and Language, Computer science, Speech recognition, Anger, Language and Linguistics, Classifier (linguistics), Prosody, Artificial neural network, Communication, Computer Science Applications, Modeling and Simulation, Computer Vision and Pattern Recognition, Artificial intelligence, Software, Natural language processing, ddc:004
- Abstract
Automatic dialogue systems used, for instance, in call centers should be able to determine, in a critical phase of the dialogue indicated by the customer's vocal expression of anger/irritation, when it is better to pass over to a human operator. At first glance, this does not seem to be a complicated task: it is reported in the literature that emotions can be told apart quite reliably on the basis of prosodic features. However, these results are achieved most of the time in a laboratory setting, with experienced speakers (actors), and with elicited, controlled speech. We compare classification results obtained with the same feature set for elicited speech and for a Wizard-of-Oz scenario, where users believe that they are really communicating with an automatic dialogue system. It turns out that the closer we get to a realistic scenario, the less reliable prosody is as an indicator of the speaker's emotional state. As a consequence, we propose to change the target such that we cease looking for traces of particular emotions in the users' speech, but instead look for indicators of TROUBLE IN COMMUNICATION. For this reason, we propose the module Monitoring of User State [especially of] Emotion (MOUSE), in which a prosodic classifier is combined with other knowledge sources, such as conversationally peculiar linguistic behavior, for example, the use of repetitions. For this module, preliminary experimental results are reported showing a more adequate modelling of TROUBLE IN COMMUNICATION.
- Published
- 2019
32. The INTERSPEECH 2019 Computational Paralinguistics Challenge: Styrian Dialects, Continuous Sleepiness, Baby Sounds & Orca Activity
- Author
-
Jarek Krajewski, Simone Hantke, Amanda Seidl, Anton Batliner, Björn Schuller, Florian B. Pokorny, Elika Bergelson, Christian Bergler, Margaret Cychosz, Alejandrina Cristia, Sebastian Schnieder, Ralf Vollmann, Elmar Nöth, Shahin Amiriparian, Maximilian Schmitt, Anne S. Warlaumont, Lisa Yankowitz, and Sonja-Dana Roelen
- Subjects
Computer science, Speech recognition, ddc:004
- Published
- 2019
33. A Survey on perceived speaker traits: Personality, likability, pathology, and the first challenge
- Author
-
Tobias Bocklet, Alessandro Vinciarelli, Rob J.J.H. van Son, Björn Schuller, Elmar Nöth, Felix Weninger, Felix Burkhardt, Anton Batliner, Gelareh Mohammadi, Benjamin Weiss, Stefan Steidl, Florian Eyben, and ACLC (FGw)
- Subjects
Computer science, Intelligibility (communication), Theoretical Computer Science, Human-Computer Interaction, Trait, Personality, Feature set, Baseline (configuration management), Software, Cognitive psychology, ddc:004
- Abstract
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits - the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state of the art in these three fields of research and describe the three sub-challenges in terms of the challenge conditions, the baseline results provided by the organisers, and a new openSMILE feature set, which has been used for computing the baselines and which has been provided to the participants. Furthermore, we summarise the approaches and the results presented by the participants to show the various techniques that are currently applied to solve these classification tasks. Keywords: Computational paralinguistics; Personality; Likability; Pathology; Survey; Challenge
- Published
- 2015
34. Automatic intelligibility assessment of pathologic speech over the telephone
- Author
-
Frank Rosanowski, Ulrich Eysholdt, Anton Batliner, Elmar Nöth, and Tino Haderlein
- Subjects
Adult, Male, Sound Spectrography, Support Vector Machine, Time Factors, Voice Quality, Computer science, Headset, Speech recognition, Laryngectomy, Intelligibility (communication), Speech Acoustics, Standard deviation, Correlation, Automation, Speech and Hearing, Speech Production Measurement, Arts and Humanities (miscellaneous), Germany, Cluster Analysis, Humans, Partial laryngectomy, Aged, Jitter, Aged, 80 and over, Speech Intelligibility, Signal Processing, Computer-Assisted, Middle Aged, Markov Chains, Telephone, Speech, Alaryngeal, Speech Perception, Female, Speech Recognition Software, ddc:004
- Abstract
Objective assessment of intelligibility on the telephone is desirable for voice and speech assessment and rehabilitation. A total of 82 patients after partial laryngectomy read a standardized text which was synchronously recorded by a headset and via telephone. Five experienced raters assessed intelligibility perceptually on a five-point scale. Objective evaluation was performed by support vector regression on the word accuracy (WA) and word correctness (WR) of a speech recognition system, and a set of prosodic features. WA and WR alone exhibited correlations to human evaluation between |r| = 0.57 and |r| = 0.75. The correlation was r = 0.79 for headset and r = 0.86 for telephone recordings when prosodic features and WR were combined. The best feature subset was optimal for both signal qualities. It consists of WR, the average duration of the silent pauses before a word, the standard deviation of the fundamental frequency on the entire sample, the standard deviation of jitter, and the ratio of the durations of the voiced sections and the entire recording.
- Published
- 2011
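Entry 34 above regresses perceptual intelligibility ratings on word recognition scores plus prosodic features with support vector regression and leave-one-out evaluation; here is an illustrative sketch of that kind of pipeline using scikit-learn, with synthetic data and hypothetical feature columns standing in for the paper's real measurements.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 82  # number of speakers in the study
# Hypothetical columns: word correctness, mean silent-pause duration,
# F0 standard deviation, jitter standard deviation, voiced-duration ratio.
X = rng.random((n, 5))
y = 1 + 4 * X[:, 0] + rng.normal(0.0, 0.3, n)  # synthetic 5-point ratings

# cv=n yields leave-one-out predictions, mirroring the per-speaker evaluation.
pred = cross_val_predict(SVR(kernel="rbf", C=1.0), X, y, cv=n)
print(f"r = {np.corrcoef(pred, y)[0, 1]:.2f}")  # predicted vs. perceptual
```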
35. PEAKS – A system for the automatic evaluation of voice and speech disorders
- Author
-
Anton Batliner, Andreas Maier, Frank Rosanowski, Tino Haderlein, Ulrich Eysholdt, Elmar Nöth, and Maria Schuster
- Subjects
Linguistics and Language, Computer science, Communication, Speech recognition, Speech processing, Language and Linguistics, Computer Science Applications, Voice analysis, Correlation, Modeling and Simulation, otorhinolaryngologic diseases, The Internet, Computer Vision and Pattern Recognition, Prosody, Software, ddc:004
- Abstract
We present a novel system for the automatic evaluation of speech and voice disorders. The system can be accessed via the internet platform-independently. The patient reads a text or names pictures. His or her speech is then analyzed by automatic speech recognition and prosodic analysis. For patients who had their larynx removed due to cancer and for children with cleft lip and palate we show that we can achieve significant correlations between the automatic analysis and the judgment of human experts in a leave-one-out experiment (p
- Published
- 2009
36. QMOS – a Robust Visualization Method for Speaker Dependencies With Different Microphones
- Author
-
Tobias Cincarek, Elmar Nöth, Stefan Wenhardt, Maria Schuster, Ulrich Eysholdt, Andreas Maier, Tino Haderlein, Anton Batliner, and Stefan Steidl
- Subjects
Sammon mapping, Computer science, Microphone, Speech recognition, Dimensionality reduction, Principal component analysis, Linear discriminant analysis, Signal, Visualization, ddc:004
- Abstract
There are several methods to create visualizations of speech data. All of them, however, lack the ability to remove microphone-dependent distortions. We examined the use of Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and the COmprehensive Space Map of Objective Signal (COSMOS) method in this work. To solve the problem of lacking microphone independency of PCA, LDA, and COSMOS, we present two methods to reduce the influence of the recording conditions on the visualization. The first one is a rigid registration of maps created from identical speakers recorded under different conditions, i.e. different microphones and distances. The second method is an extension of the COSMOS method, which performs a non-rigid registration during the mapping procedure. As a measure for the quality of the visualization, we computed the mapping error which occurs during the dimension reduction, and the grouping error as the average distance between the representations of the same speaker recorded by different microphones. The best linear method in leave-one-speaker-out evaluation is PCA plus rigid registration, with a mapping error of 47% and a grouping error of 18%. The proposed method, however, surpasses this even further with a mapping error of 24% and a grouping error which is close to zero. Keywords: Speech intelligibility, speech and voice disorders, speech evaluation, dimensionality reduction, Sammon mapping, QMOS, COSMOS, COmprehensive Space Map of Objective Signal
- Published
- 2009
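The rigid-registration baseline compared in entry 36 above can be illustrated with a generic orthogonal Procrustes alignment of two 2-D speaker maps (same speakers, two microphones); the coordinates below are synthetic and the function is a textbook sketch, not the authors' implementation.

```python
import numpy as np

def rigid_register(src, dst):
    """Translate and orthogonally transform src to best match dst."""
    src_c, dst_c = src - src.mean(axis=0), dst - dst.mean(axis=0)
    u, _, vt = np.linalg.svd(src_c.T @ dst_c)
    return src_c @ (u @ vt) + dst.mean(axis=0)  # least-squares transform + shift

rng = np.random.default_rng(1)
map_a = rng.random((10, 2))                       # map from microphone A
theta = np.pi / 5                                 # unknown rotation ...
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
map_b = map_a @ rot + 0.5 + rng.normal(0, 0.01, (10, 2))  # ... plus shift, noise

# Grouping error: mean distance between a speaker's two representations.
aligned = rigid_register(map_a, map_b)
print(np.linalg.norm(aligned - map_b, axis=1).mean())  # close to zero
```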
37. Numeric quantification of intelligibility in schoolchildren with isolated and combined cleft palate
- Author
-
Elmar Nöth, Emeka Nkenke, B. Vogt, Andreas Maier, Anton Batliner, Ulrich Eysholdt, and Maria Schuster
- Subjects
School age child, Otorhinolaryngology, Head and neck surgery, Congenital cleft, Congenital disease, ddc:004
- Abstract
Background: Cleft malformations can cause functional impairments, e.g. articulation disorders, despite adequate treatment. These vary strongly between individuals. Typical effects are, for instance, altered nasal airflow and displaced articulation, which lead to reduced intelligibility. Until now, a relationship between the type of cleft malformation and intelligibility could only be described by means of categorical, subjective ratings. In this study, cleft-dependent intelligibility is for the first time quantified objectively and numerically by means of automatic speech recognition technology.
- Published
- 2007
38. Prosody, empty categories and parsing – a success story
- Author
-
Anton Batliner, A. Feldhaus, S. Geissler, T. Kiss, Ralf Kompe, and Elmar Nöth
- Subjects
Künstliche Intelligenz, artificial intelligence, ddc:004, ddc:620
- Abstract
We describe a number of experiments that demonstrate the usefulness of prosodic information for a processing module which parses spoken utterances with a feature-based grammar employing empty categories. We show that by requiring certain prosodic properties from those positions in the input where the presence of an empty category has to be hypothesized, a derivation can be accomplished more efficiently. The approach has been implemented in the machine translation project Verbmobil and results in a significant reduction of the workload for the parser.
- Published
- 2013
39. Syntactic-prosodic labeling of large spontaneous speech data-bases
- Author
-
Ralf Kompe, Elmar Nöth, Anton Batliner, Andreas Kießling, and Heinrich Niemann
- Subjects
Computer science, Speech recognition, Perceptron, Speech processing, Robustness (computer science), Künstliche Intelligenz, Language model, Artificial intelligence, Natural language processing, Natural language, Spontaneous speech, ddc:620, ddc:004
- Abstract
In automatic speech understanding, the division of continuously running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries, large databases are necessary. For the German VERBMOBIL project (automatic speech-to-speech translation), we developed a syntactic-prosodic labeling scheme where two main types of boundaries (major syntactic boundaries and syntactically ambiguous boundaries) and some other special boundaries are labeled for a large VERBMOBIL spontaneous speech corpus. We compare the results of classifiers (multilayer perceptrons and language models) trained on these syntactic-prosodic boundary labels with classifiers trained on perceptual-prosodic and pure syntactic labels. The main advantage of the rough syntactic-prosodic labels presented in this paper is that large amounts of data could be labeled within a short time. Therefore, the classifiers trained with these labels turned out to be superior (recognition rates of up to 96%).
- Published
- 2013
40. Pitch determination considering laryngealization effects in spoken dialogs
- Author
-
B. Kahles, Joachim Denzler, Volker Strom, Heinrich Niemann, Ralf Kompe, Andreas Kießling, and Elmar Nöth
- Subjects
Artificial neural network, Computer science, Information seeking, Speech recognition, Intonation (linguistics), Context (language use), Fundamental frequency, Interrogative, Künstliche Intelligenz, ddc:004, ddc:620
- Abstract
A frequent phenomenon in spoken dialogs of the information-seeking type is short elliptic utterances whose mood (declarative or interrogative) can only be distinguished by intonation. The main acoustic evidence is conveyed by the fundamental frequency, or F0 contour. Many algorithms for F0 determination have been reported in the literature. A common problem is irregularities of speech known as 'laryngealizations'. This article describes an approach based on neural network techniques for the improved determination of the fundamental frequency. First, an improved version of the authors' neural network algorithm for reconstruction of the voice source signal (glottis signal) is presented. Second, the reconstructed voice source signal is used as input to another neural network distinguishing the three classes 'voiceless', 'voiced non-laryngealized', and 'voiced laryngealized'. Third, the results are used to improve an existing F0 algorithm. Results of this approach are presented and discussed in the context of the application in a spoken dialog system.
- Published
- 2011
41. Associating children's non-verbal and verbal behaviour: Body movements, emotions, and laughter in a human-robot interaction
- Author
-
Anton Batliner, Stefan Steidl, and Elmar Nöth
- Subjects
Laughter, Nonverbal communication, Behavioural sciences, Body movement, AIBO, Pragmatics, Psychology, Human–robot interaction, Cognitive psychology, Gesture, ddc:004
- Abstract
In this article, we associate different types of vocal behaviour denoting emotional user states and laughter with different types of body movements such as gestures, forward bends, or liveliness. Our subjects are German children giving commands to Sony's Aibo robot; the data are fully realistic. The analysis reveals characteristic and significant co-occurrences of body movements and vocal events.
- Published
- 2011
42. An Extension to the Sammon Mapping for the Robust Visualization of Speaker Dependencies
- Author
-
Julian Exner, Tino Haderlein, Stefan Steidl, Andreas Maier, Elmar Nöth, and Anton Batliner
- Subjects
Computer science, Microphone, Dimensionality reduction, Visualization, Sammon mapping, Computer vision, Artificial intelligence, ddc:004
- Abstract
We present a novel method for the visualization of speakers which is microphone independent. To solve the problem of lacking microphone independency, we present two methods to reduce the influence of the recording conditions on the visualization. The first one is a registration of maps created from identical speakers recorded under different conditions, i.e., different microphones and distances, in two steps: dimension reduction followed by linear registration of the maps. The second method is an extension of the Sammon mapping method, which performs a non-linear registration during the dimension reduction procedure. The proposed method surpasses the two-step registration approach, with a mapping error ranging from 17% to 24% and a grouping error which is close to zero.
- Published
- 2008
43. Does multimodality really help? the classification of emotion and of On/Off-focus in multimodal dialogues - two case studies
- Author
-
Christian Hacker, Elmar Nöth, and Anton Batliner
- Subjects
German, Focus (computing), Modalities, Computer science, Human–computer interaction, Speech recognition, Emotion detection, User state, Multimodality, ddc:004
- Abstract
Very often in articles on monomodal human-machine interaction (HMI) it is pointed out that the results can be strongly improved if other modalities are taken into account. In this contribution we look at two different problems in HMI: the detection of emotion or user state, and the question whether the user is currently interacting with the machine, himself, or another person (On/Off-Focus). We present monomodal classification results for these two problems and discuss whether multimodal classification seems promising for the respective problem. Different fusion models are considered. The examples are taken from the German HMI projects "SmartKom" and "SmartWeb".
- Published
- 2007
44. Automatic scoring of the intelligibility in patients with cancer of the oral cavity
- Author
-
Andreas Maier, Emeka Nkenke, Maria Schuster, Elmar Nöth, and Anton Batliner
- Subjects
Computer science, Speech recognition, Word recognition, otorhinolaryngologic diseases, Intelligibility (communication), Oral cavity, Speech processing, ddc:004
- Abstract
After surgical treatment of cancer of the oral cavity patients often suffer from functional restrictions such as speech disorders. In this paper we present a novel approach to assess the outcome of the treatment w.r.t. the intelligibility of the patient using the result of an automatic speech recognition system. The word recognition rate was taken as intelligibility score. Compared to four speech experts this method yields results that are as good as the best speech expert compared to the other experts. The correlation between our system and the mean opinion of the experts is .92. Furthermore we show that our system has better performance than the average expert and is more reliable. Index Terms: Speech intelligibility, Speech processing, Biomedical acoustics, Acoustic applications
- Published
- 2007
45. The Gesture Interpretation Module
- Author
-
Rui Ping Shi, Anton Batliner, Elmar Nöth, Heinrich Niemann, Viktor Zeißler, Carmen Frank, and Johann Adelhardt
- Subjects
Communication, Facial expression, Computer science, Semantics, Body language, Gesture recognition, Utterance, Gesture, ddc:004
- Abstract
Humans often make conscious and unconscious gestures, which reflect their mind, their thoughts, and the way these are formulated. These inherently complex processes can in general not be substituted by a corresponding verbal utterance that has the same semantics (McNeill, 1992). Gesture, which is a kind of body language, contains important information on the intention and the state of the gesture producer. Therefore, it is an important communication channel in human-computer interaction.
- Published
- 2006
46. Language models beyond word strings
- Author
-
Jörg Spilker, Elmar Nöth, Georg Stemmer, Heinrich Niemann, Florian Gallwitz, and Anton Batliner
- Subjects
Audio mining, Computer science, Speech recognition, Word error rate, Speech synthesis, Speech corpus, Speech segmentation, Cache language model, Speech analytics, Language model, Artificial intelligence, Natural language processing, ddc:004
- Abstract
In this paper we want to show how n-gram language models can be used to provide additional information in automatic speech understanding systems beyond the pure word chain. This becomes important in the context of conversational dialogue systems that have to recognize and interpret spontaneous speech. We show how n-grams can: (1) help to classify prosodic events like boundaries and accents; (2) be extended to directly provide boundary information in the speech recognition phase; (3) help to process speech repairs; and (4) detect and semantically classify out-of-vocabulary words. The approaches can work on the best word chain or a word hypotheses graph. Examples and experimental results are provided from our own research within the EVAR information retrieval system and the VERBMOBIL speech-to-speech translation system.
- Published
- 2005
47. An integrated model of acoustics and language using semantic classification trees
- Author
-
Heinrich Niemann, Elmar Nöth, Roland Kuhn, A. Gebhard, S. Harbeck, Marion Mast, Renato De Mori, Ralf Kompe, and J. Fischer
- Subjects
Spoken word, Phrase, Computer science, Speech recognition, Semantic network, Bayes' theorem, Context model, Artificial neural network, Speech processing, Künstliche Intelligenz, Artificial intelligence, Language model, Natural language, Natural language processing, ddc:004, ddc:620
- Abstract
We propose multilevel semantic classification trees to combine different information sources for predicting speech events (e.g. word chains, phrases, etc.). Traditionally, in speech recognition systems these information sources (acoustic evidence, language model) are calculated independently and combined via Bayes' rule. The proposed approach allows one to combine sources of different types; it is no longer necessary for each source to yield a probability. Moreover, the tree can look at several information sources simultaneously. The approach is demonstrated for the prediction of prosodically marked phrase boundaries, combining information about the spoken word chain, word category information, prosodic parameters, and the result of a neural network predicting the boundary on the basis of acoustic-prosodic features. The recognition rates of up to 90% for the two-class problem boundary vs. no boundary are already comparable to results achieved with the above-mentioned Bayes rule approach that combines the acoustic classifier with a 5-gram categorical language model. This is remarkable, since so far only a small set of questions combining information from different sources has been implemented.
- Published
- 2002
48. Prosodic Classification of Offtalk: First Experiments
- Author
-
Elmar Nöth, Heinrich Niemann, Viktor Zeißler, and Anton Batliner
- Subjects
Facial expression, Computer science, Speech recognition, Feature vector, Multimodal communication, Facial recognition system, Artificial intelligence, Prosody, Natural language processing, Spontaneous speech, Gesture, ddc:004
- Abstract
SmartKom is a multi-modal dialogue system which combines speech with gesture and facial expression. In this paper, we deal with one of those phenomena that can be observed in such elaborate systems, which we call 'offtalk', i.e., speech that is not directed to the system (speaking to oneself, speaking aside). We report the classification results of first experiments which use a large prosodic feature vector in combination with part-of-speech information.
- Published
- 2002
49. The Prosody Module
- Author
-
Johann Adelhardt, Heinrich Niemann, Carmen Frank, Rui Ping Shi, Anton Batliner, Elmar Nöth, and Viktor Zeißler
- Subjects
Artificial neural network, Computer science, Feature vector, German, Annotation, Language model, Artificial intelligence, Prosody, Sentence, Natural language processing, ddc:004
- Abstract
We describe the acoustic-prosodic and syntactic-prosodic annotation and classification of boundaries, accents, and sentence mood integrated in the Verbmobil system for the three languages German, English, and Japanese. For the acoustic-prosodic classification, a large feature vector with normalized prosodic features is used. For the three languages, a multilingual prosody module was developed that reduces memory requirements considerably compared to three monolingual modules. For classification, neural networks and statistical language models are used.
- Published
- 2000
50. Fast and Robust Features for Prosodic Classification?
- Author
-
Elmar Nöth, Volker Warnke, Richard Huber, Anton Batliner, Jan Buckow, and Heinrich Niemann
- Subjects
Normalization (statistics), Computer science, Speech recognition, Multilayer perceptron, Segmentation, Prosody, Speech processing, Word (computer architecture), Pitch contour, ddc:004
- Abstract
In our previous research, we have shown that prosody can be used to dramatically improve the performance of the automatic speech translation system Verbmobil [5,7,8]. In Verbmobil, prosodic information is made available to the different modules of the system by annotating the output of a word recognizer with prosodic markers. These markers are determined in a classification process. The computation of the prosodic features used for classification was previously based on a time alignment of the phoneme sequence of the recognized words. The phoneme segmentation was needed for the normalization of duration and energy features. This time alignment was very expensive in terms of computational effort and memory requirements. In our new approach, the normalization is done on the word level with precomputed duration and energy statistics; thus the phoneme segmentation can be avoided. With the new set of prosodic features, better classification results can be achieved, the feature extraction can be sped up by 64%, and the memory requirements are reduced by 92%.
- Published
- 1999
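Entry 50 above replaces the expensive phoneme alignment with word-level normalization against precomputed duration and energy statistics; the sketch below illustrates the idea for durations only, with invented statistics and a hypothetical conditioning on word length (the paper's actual statistics and conditioning are not shown here).

```python
# Hypothetical corpus statistics: (mean, std) word duration in seconds,
# conditioned here simply on word length in characters.
DURATION_STATS = {4: (0.28, 0.07), 5: (0.33, 0.08), 6: (0.38, 0.09)}

def normalized_duration(word, duration_s):
    """z-score of an observed word duration against precomputed statistics."""
    mean, std = DURATION_STATS[len(word)]
    return (duration_s - mean) / std

# A slowly spoken "hello" (0.55 s) gets a clearly positive z-score without
# any phoneme-level time alignment.
print(round(normalized_duration("hello", 0.55), 2))  # 2.75
```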