25 results for '"Soroosh Mariooryad"'
Search Results
2. Speaker Generation.
- Author
- Daisy Stanton, Matt Shannon, Soroosh Mariooryad, R. J. Skerry-Ryan, Eric Battenberg, Tom Bagby, and David Kao
- Published
- 2022
- Full Text
- View/download PDF
3. Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis.
- Author
- Ron J. Weiss, R. J. Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, and Diederik P. Kingma
- Published
- 2021
- Full Text
- View/download PDF
4. Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis.
- Author
- Eric Battenberg, R. J. Skerry-Ryan, Soroosh Mariooryad, Daisy Stanton, David Kao, Matt Shannon, and Tom Bagby
- Published
- 2020
- Full Text
- View/download PDF
5. Semi-Supervised Generative Modeling for Controllable Speech Synthesis.
- Author
- Raza Habib, Soroosh Mariooryad, Matt Shannon, Eric Battenberg, R. J. Skerry-Ryan, Daisy Stanton, David Kao, and Tom Bagby
- Published
- 2020
6. Building a naturalistic emotional speech corpus by retrieving expressive behaviors from existing speech corpora.
- Author
- Soroosh Mariooryad, Reza Lotfian, and Carlos Busso
- Published
- 2014
- Full Text
- View/download PDF
7. Automatic characterization of speaking styles in educational videos.
- Author
- Soroosh Mariooryad, Anitha Kannan, Dilek Hakkani-Tür, and Elizabeth Shriberg
- Published
- 2014
- Full Text
- View/download PDF
8. Feature and model level compensation of lexical content for facial emotion recognition.
- Author
- Soroosh Mariooryad and Carlos Busso
- Published
- 2013
- Full Text
- View/download PDF
9. Analysis and Compensation of the Reaction Lag of Evaluators in Continuous Emotional Annotations.
- Author
- Soroosh Mariooryad and Carlos Busso
- Published
- 2013
- Full Text
- View/download PDF
10. Audiovisual corpus to analyze whisper speech.
- Author
- Tam Tran, Soroosh Mariooryad, and Carlos Busso
- Published
- 2013
- Full Text
- View/download PDF
11. Factorizing speaker, lexical and emotional variabilities observed in facial expressions.
- Author
- Soroosh Mariooryad and Carlos Busso
- Published
- 2012
- Full Text
- View/download PDF
12. Detecting Sleepiness by Fusing Classifiers Trained with Novel Acoustic Features.
- Author
- Tauhidur Rahman, Soroosh Mariooryad, Shalini Keshavamurthy, Gang Liu, John H. L. Hansen, and Carlos Busso
- Published
- 2011
- Full Text
- View/download PDF
13. Speaker Generation
- Author
- Daisy Stanton, Matt Shannon, Soroosh Mariooryad, RJ Skerry-Ryan, Eric Battenberg, Tom Bagby, and David Kao
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), I.2.7, G.3, Computer Science - Machine Learning, Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- This work explores the task of synthesizing speech in nonexistent human-sounding voices. We call this task "speaker generation", and present TacoSpawn, a system that performs competitively at this task. TacoSpawn is a recurrent attention-based text-to-speech model that learns a distribution over a speaker embedding space, which enables sampling of novel and diverse speakers. Our method is easy to implement, and does not require transfer learning from speaker ID systems. We present objective and subjective metrics for evaluating performance on this task, and demonstrate that our proposed objective metrics correlate with human perception of speaker similarity. Audio samples are available on our demo page. Comment: 12 pages, 3 figures, 4 tables, appendix with 2 tables. (A brief code sketch of the sampling idea follows this entry.)
- Published
- 2021
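The abstract above describes learning a distribution over a speaker embedding space and sampling novel speakers from it. The snippet below is a minimal, hypothetical sketch of that sampling idea only; it is not the TacoSpawn implementation, and the embedding table is random stand-in data: fit a mixture model over existing speaker embeddings, then draw new embeddings that a conditioned TTS decoder could, in principle, voice.

```python
# Illustrative sketch (not the TacoSpawn implementation): fit a distribution
# over a table of learned speaker embeddings and sample novel "speakers".
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained TTS model's speaker-embedding table:
# 200 training speakers, 128-dimensional embeddings.
speaker_embeddings = rng.normal(size=(200, 128))

# Learn a parametric distribution over the embedding space.
prior = GaussianMixture(n_components=10, covariance_type="diag", random_state=0)
prior.fit(speaker_embeddings)

# Sample embeddings for novel, nonexistent speakers; a TTS decoder conditioned
# on such an embedding would then synthesize speech in a new voice.
novel_embeddings, _ = prior.sample(n_samples=5)
print(novel_embeddings.shape)  # (5, 128)
```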
14. Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
- Author
- Eric Battenberg, Soroosh Mariooryad, Ron Weiss, RJ Skerry-Ryan, and Diederik P. Kingma
- Subjects
- FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Artificial neural network, Computer science, Speech processing, Computer Science - Sound, Autoregressive model, Flow (mathematics), Cascade, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Spectrogram, Waveform, Algorithm, Computation and Language (cs.CL), Electrical Engineering and Systems Science - Audio and Speech Processing, Block (data storage)
- Abstract
- We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length blocks, each one containing hundreds of samples. The interdependencies of waveform samples within each block are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding blocks. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features or additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed. Comment: 6 pages including supplement, 3 figures. Accepted to ICASSP 2021. (A toy sketch of the block-autoregressive loop follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
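As a rough illustration of the block-autoregressive generation loop sketched in the abstract above, the toy code below emits fixed-length waveform blocks by transforming noise, conditioned on a text encoding and a running summary of previous blocks. The `toy_flow` function is a stand-in affine map, not a real normalizing flow, and all tensors are random placeholders rather than anything from the paper.

```python
# Toy sketch of block-autoregressive waveform generation (illustrative only;
# not the Wave-Tacotron architecture). Each decoder step emits one fixed-length
# block of samples by transforming noise, conditioned on the text encoding and
# a summary of previously generated blocks.
import numpy as np

rng = np.random.default_rng(0)
BLOCK = 256        # samples per block ("hundreds of samples")
N_BLOCKS = 40      # autoregressive decoder steps
COND_DIM = 64

def toy_flow(noise, conditioning):
    """Stand-in for an invertible normalizing flow: an affine map whose
    scale and shift are derived from the conditioning vector."""
    params = np.resize(conditioning, 2 * BLOCK)
    scale = 0.1 + 0.05 * np.tanh(params[:BLOCK])
    shift = 0.01 * params[BLOCK:]
    return scale * noise + shift

text_encoding = rng.normal(size=COND_DIM)       # pretend encoder output
history = np.zeros(COND_DIM)                    # summary of past blocks
blocks = []
for _ in range(N_BLOCKS):
    conditioning = text_encoding + history      # condition on text + past blocks
    z = rng.normal(size=BLOCK)                  # base noise for this block
    block = toy_flow(z, conditioning)           # samples within a block in parallel
    blocks.append(block)
    history = 0.9 * history + 0.1 * np.resize(block, COND_DIM)

waveform = np.concatenate(blocks)
print(waveform.shape)                           # (N_BLOCKS * BLOCK,)
```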
15. The Cost of Dichotomizing Continuous Labels for Binary Classification Problems: Deriving a Bayesian-Optimal Classifier
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Structured support vector machine, Computer science, business.industry, 05 social sciences, 050401 social sciences methods, Pattern recognition, Bayes classifier, Quadratic classifier, Machine learning, computer.software_genre, 050105 experimental psychology, Human-Computer Interaction, Support vector machine, ComputingMethodologies_PATTERNRECOGNITION, 0504 sociology, Binary classification, Margin classifier, Maximum a posteriori estimation, 0501 psychology and cognitive sciences, Artificial intelligence, business, Classifier (UML), computer, Software
- Abstract
- Many pattern recognition problems involve characterizing samples with continuous labels instead of discrete categories. While regression models are suitable for these learning tasks, these labels are often discretized into binary classes to formulate the problem as a conventional classification task (e.g., classes with low versus high values). This methodology brings intrinsic limitations on the classification performance. The continuous labels are typically normally distributed, with many samples close to the boundary threshold, resulting in poor classification rates. Previous studies only use the discretized labels to train binary classifiers, neglecting the original, continuous labels. This study demonstrates that, even in binary classification problems, exploiting the original labels before splitting the classes can lead to better classification performance. This work proposes an optimal classifier based on the Bayesian maximum a posteriori (MAP) criterion for these problems, which effectively utilizes the real-valued labels. We derive the theoretical average performance of this classifier, which can be considered as the expected upper bound performance for the task. Experimental evaluations on synthetic and real data sets show the improvement achieved by the proposed classifier, in contrast to conventional classifiers trained with binary labels. These evaluations clearly demonstrate the optimality of the proposed classifier, and the precision of the expected upper bound obtained by our derivation. (A small numerical sketch of this idea follows this entry.)
- Published
- 2017
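A small numerical sketch of the general idea in the abstract above, under assumptions not taken from the paper (linear-Gaussian synthetic data, a linear regressor, Gaussian residuals): instead of discarding the continuous labels, model p(y | x) and predict the positive class when the posterior probability that y exceeds the dichotomization threshold is at least 0.5.

```python
# Hedged sketch, not the paper's exact derivation: keep the continuous labels,
# model p(y | x), and classify by comparing P(y >= threshold | x) with 0.5.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
w = rng.normal(size=5)
y = X @ w + 0.5 * rng.normal(size=2000)    # continuous labels (e.g., ratings)
t = np.median(y)                           # dichotomization threshold
y_bin = (y >= t).astype(int)

X_tr, X_te = X[:1500], X[1500:]
y_tr, ybin_tr, ybin_te = y[:1500], y_bin[:1500], y_bin[1500:]

# Conventional approach: discard the continuous labels, train on binary ones.
clf = LogisticRegression(max_iter=1000).fit(X_tr, ybin_tr)

# Continuous-label-aware approach: regress y, assume Gaussian residuals,
# then threshold the posterior probability P(y >= t | x).
reg = LinearRegression().fit(X_tr, y_tr)
sigma = np.std(y_tr - reg.predict(X_tr))
p_pos = 1.0 - norm.cdf(t, loc=reg.predict(X_te), scale=sigma)

print("binary-trained accuracy:", np.mean(clf.predict(X_te) == ybin_te))
print("MAP-style accuracy:     ", np.mean((p_pos >= 0.5) == ybin_te))
```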
16. Location-Relative Attention Mechanisms For Robust Long-Form Speech Synthesis
- Author
- Eric Battenberg, David T. H. Kao, Tom Bagby, Soroosh Mariooryad, RJ Skerry-Ryan, Daisy Stanton, and Matt Shannon
- Subjects
- FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer Science - Computation and Language, Mechanism (biology), Computer science, Speech recognition, 020206 networking & telecommunications, Speech synthesis, 02 engineering and technology, computer.software_genre, Computer Science - Sound, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Consistency (database systems), Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), FOS: Electrical engineering, electronic engineering, information engineering, 0305 other medical science, computer, Computation and Language (cs.CL), Energy (signal processing), Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
- Despite the ability to produce human-level speech for in-domain text, attention-based end-to-end text-to-speech (TTS) systems suffer from text alignment failures that increase in frequency for out-of-domain text. We show that these failures can be addressed using simple location-relative attention mechanisms that do away with content-based query/key comparisons. We compare two families of attention mechanisms: location-relative GMM-based mechanisms and additive energy-based mechanisms. We suggest simple modifications to GMM-based attention that allow it to align quickly and consistently during training, and introduce a new location-relative attention mechanism to the additive energy-based family, called Dynamic Convolution Attention (DCA). We compare the various mechanisms in terms of alignment speed and consistency during training, naturalness, and ability to generalize to long utterances, and conclude that GMM attention and DCA can generalize to very long utterances, while preserving naturalness for shorter, in-domain utterances. Comment: Accepted to ICASSP 2020. (A minimal sketch of GMM attention follows this entry.)
- Published
- 2019
- Full Text
- View/download PDF
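The sketch below illustrates the location-relative GMM attention idea from the abstract above in plain numpy. The paper's parameterizations differ, and here the mixture parameters are random placeholders rather than outputs of a decoder network; the point is that each component mean can only move forward, so the alignment advances monotonically regardless of content.

```python
# Minimal, illustrative GMM-attention step (not the paper's exact variants).
import numpy as np

def gmm_attention_step(prev_means, deltas, sigmas, weights, num_enc_steps):
    """One decoder step: advance the Gaussian means, then score encoder positions."""
    means = prev_means + np.exp(deltas)              # exp keeps the step positive -> monotonic
    pos = np.arange(num_enc_steps)[None, :]          # (1, T) encoder indices
    gauss = np.exp(-0.5 * ((pos - means[:, None]) / sigmas[:, None]) ** 2)
    comp = weights[:, None] * gauss / (sigmas[:, None] * np.sqrt(2 * np.pi))
    alpha = comp.sum(axis=0)                         # mixture evaluated at each index
    return means, alpha / (alpha.sum() + 1e-8)       # normalized attention weights

rng = np.random.default_rng(0)
K, T = 5, 120                          # mixture components, encoder length
means = np.zeros(K)
for step in range(3):                  # pretend decoder steps; a real model predicts the params
    means, alpha = gmm_attention_step(means,
                                      deltas=rng.normal(size=K) * 0.1,
                                      sigmas=np.full(K, 3.0),
                                      weights=np.full(K, 1.0 / K),
                                      num_enc_steps=T)
    print(step, alpha.argmax())        # attended encoder position creeps forward
```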
17. Facial Expression Recognition in the Presence of Speech Using Blind Lexical Compensation
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Facial expression, Face hallucination, Speech recognition, Phonetic transcription, 02 engineering and technology, Facial recognition system, Human-Computer Interaction, 030507 speech-language pathology & audiology, 03 medical and health sciences, Face (geometry), 0202 electrical engineering, electronic engineering, information engineering, Three-dimensional face recognition, 020201 artificial intelligence & image processing, Transcription (software), 0305 other medical science, Articulation (phonetics), Psychology, Software
- Abstract
- During spontaneous conversations the articulation process as well as the internal emotional states influence the facial configurations. Inferring the conveyed emotions from the information presented in facial expressions requires decoupling the linguistic and affective messages in the face. Normalizing and compensating for the underlying lexical content have shown improvement in recognizing facial expressions. However, this requires the transcription and phoneme alignment information, which is not available in a broad range of applications. This study uses the asymmetric bilinear factorization model to perform the decoupling of linguistic and affective information when they are not given. The emotion recognition evaluations on the IEMOCAP database show the capability of the proposed approach in separating these factors in facial expressions, yielding statistically significant performance improvements. The achieved improvement is similar to the case when the ground truth phonetic transcription is known. Similarly, experiments on the SEMAINE database using image-based features demonstrate the effectiveness of the proposed technique in practical scenarios. (An illustrative factorization sketch follows this entry.)
- Published
- 2016
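The abstract above relies on an asymmetric bilinear (style/content) factorization to separate lexical and affective factors. The snippet below is a generic SVD-based sketch of that kind of factorization in the spirit of Tenenbaum and Freeman, with random data standing in for facial features; it is not the authors' model, and the dimensions are arbitrary.

```python
# Rough sketch of an asymmetric bilinear style/content factorization
# (illustrative only; random data stands in for facial features).
import numpy as np

rng = np.random.default_rng(0)
S, C, D, J = 4, 10, 30, 5   # styles (e.g., emotions), contents (e.g., phonemes), feature dim, model dim

# Mean observation y[s, c] of facial features for each style/content pair.
Y = rng.normal(size=(S, C, D))

# Stack styles vertically: rows indexed by (style, feature dim), columns by content.
Y_stacked = Y.transpose(0, 2, 1).reshape(S * D, C)

U, sing, Vt = np.linalg.svd(Y_stacked, full_matrices=False)
content = Vt[:J, :]                                # content factors b_c, shape (J, C)
A = (U[:, :J] * sing[:J]).reshape(S, D, J)         # style-specific linear maps A^s

# Reconstruction: y[s, c] is approximated by A[s] @ content[:, c].
recon = np.einsum('sdj,jc->scd', A, content)
print("relative error:", np.linalg.norm(recon - Y) / np.linalg.norm(Y))
```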
18. Correcting Time-Continuous Emotional Labels by Modeling the Reaction Lag of Evaluators
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- business.industry, Emotion classification, Lag, Speech recognition, Feature extraction, Mutual information, Machine learning, computer.software_genre, Human-Computer Interaction, Artificial intelligence, Emotion recognition, Valence (psychology), Psychology, business, computer, Software
- Abstract
- An appealing scheme to characterize expressive behaviors is the use of emotional dimensions such as activation (calm versus active) and valence (negative versus positive). These descriptors offer many advantages to describe the wide spectrum of emotions. Due to the continuous nature of fast-changing expressive vocal and gestural behaviors, it is desirable to continuously track these emotional traces, capturing subtle and localized events (e.g., with FEELTRACE). However, time-continuous annotations introduce challenges that affect the reliability of the labels. In particular, an important issue is the evaluators' reaction lag caused by observing, appraising, and responding to the expressive behaviors. An empirical analysis demonstrates that this delay varies from 1 to 6 seconds, depending on the annotator, expressive dimension, and actual behaviors. Our experiments show accuracy improvements even with fixed delays (1-3 seconds). This paper proposes to compensate for this reaction lag by finding the time-shift that maximizes the mutual information between the expressive behaviors and the time-continuous annotations. The approach is implemented by making different assumptions about the evaluators' reaction lag. The benefits of compensating for the delay are demonstrated with emotion classification experiments. On average, the classifiers trained with facial and speech features show more than 7 percent relative improvements over baseline classifiers trained and tested without shifting the time-continuous annotations. (A short sketch of the lag search follows this entry.)
- Published
- 2015
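A short sketch of the lag search described above, assuming one feature stream and one annotation stream at a common frame rate. Mutual information is estimated here with scikit-learn's nearest-neighbor estimator rather than whatever the authors used, so treat it purely as an illustration of the idea: shift the annotations back by candidate delays and keep the delay that maximizes MI.

```python
# Minimal sketch of MI-based reaction-lag compensation (illustrative only).
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def best_reaction_lag(feature, annotation, max_lag, rate_hz):
    """Return the delay (seconds) maximizing MI between the feature stream and
    the annotation stream shifted back by that delay."""
    scores = []
    for lag in range(max_lag + 1):
        f = feature[: len(feature) - lag]
        a = annotation[lag:]                    # annotation lags the behavior
        mi = mutual_info_regression(f.reshape(-1, 1), a, random_state=0)[0]
        scores.append(mi)
    return int(np.argmax(scores)) / rate_hz, scores

# Synthetic check: the annotation is a noisy, 2-second-delayed copy of the feature.
rng = np.random.default_rng(0)
rate = 10                                       # 10 frames per second
x = np.convolve(rng.normal(size=600), np.ones(5) / 5, mode="same")
y = np.roll(x, 2 * rate) + 0.1 * rng.normal(size=600)
delay, _ = best_reaction_lag(x, y, max_lag=6 * rate, rate_hz=rate)
print("estimated reaction lag (s):", delay)     # expected around 2.0
```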
19. Compensating for speaker or lexical variabilities in speech for emotion recognition
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Normalization (statistics), Linguistics and Language, Computer science, business.industry, Communication, Speech recognition, Mutual information, Speaker recognition, computer.software_genre, Language and Linguistics, Computer Science Applications, Speaker diarisation, Modeling and Simulation, Feature (machine learning), Computer Vision and Pattern Recognition, Artificial intelligence, Set (psychology), business, computer, Software, Human voice, Natural language processing, Uncertainty reduction theory
- Abstract
- Affect recognition is a crucial requirement for future human-machine interfaces to effectively respond to nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, the human voice conveys a mixture of information including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful considerations to compensate for the effect of these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists of building phoneme-level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights on the dependency between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively. (A whitening sketch follows this entry.)
- Published
- 2014
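As a toy illustration of whitening-based feature normalization (the paper builds its scheme around phoneme-level trajectory models; here each hypothetical speaker's features are simply whitened with that speaker's own statistics), the sketch below maps features to zero mean and identity covariance per speaker.

```python
# Sketch of per-speaker whitening normalization (simplified assumption,
# not the paper's exact scheme).
import numpy as np

def whiten(features, eps=1e-6):
    """Zero-mean, identity-covariance transform of (n_frames, n_dims) features."""
    mu = features.mean(axis=0)
    cov = np.cov(features - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T     # Sigma^(-1/2)
    return (features - mu) @ W

# Two hypothetical speakers with different feature means and scales.
rng = np.random.default_rng(0)
per_speaker = {spk: rng.normal(loc=rng.normal(size=8),
                               scale=rng.uniform(0.5, 2.0, size=8),
                               size=(500, 8))
               for spk in ["spk1", "spk2"]}
normalized = {spk: whiten(f) for spk, f in per_speaker.items()}

# After whitening, each speaker's covariance is (numerically) the identity.
print(np.allclose(np.cov(normalized["spk1"], rowvar=False), np.eye(8), atol=1e-3))
```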
20. Iterative Feature Normalization Scheme for Automatic Emotion Detection from Speech
- Author
- Angeliki Metallinou, Soroosh Mariooryad, Carlos Busso, and Shrikanth S. Narayanan
- Subjects
- Normalization (statistics), business.industry, Speech recognition, Feature extraction, Emotion detection, Pattern recognition, Human-Computer Interaction, Emotion recognition, Affine transformation, Artificial intelligence, Natural approach, Psychology, business, Software
- Abstract
- The externalization of emotion is intrinsically speaker-dependent. A robust emotion recognition system should be able to compensate for these differences across speakers. A natural approach is to normalize the features before training the classifiers. However, the normalization scheme should not affect the acoustic differences between emotional classes. This study presents the iterative feature normalization (IFN) framework, which is an unsupervised front-end, especially designed for emotion detection. The IFN approach aims to reduce the acoustic differences between the neutral speech across speakers, while preserving the inter-emotional variability in expressive speech. This goal is achieved by iteratively detecting neutral speech for each speaker, and using this subset to estimate the feature normalization parameters. Then, an affine transformation is applied to both neutral and emotional speech. This process is repeated until the results from the emotion detection system are consistent between consecutive iterations. The IFN approach is exhaustively evaluated using the IEMOCAP database and a data set obtained under free uncontrolled recording conditions with different evaluation configurations. The results show that the systems trained with the IFN approach achieve better performance than systems trained either without normalization or with global normalization. (A structural sketch of the IFN loop follows this entry.)
- Published
- 2013
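A structural sketch of the IFN loop described above. The neutral/emotional detector is a crude stand-in (a median split on feature energy), whereas the paper uses a trained emotion detection system; the loop structure is the point: detect neutral frames, re-estimate normalization parameters from them, re-normalize, and repeat until the decisions stop changing.

```python
# Structural sketch of iterative feature normalization (IFN); the detector and
# data are placeholders, not the authors' setup.
import numpy as np

rng = np.random.default_rng(0)
features = {spk: rng.normal(loc=rng.uniform(-1, 1), scale=rng.uniform(0.8, 1.5),
                            size=(200, 6))
            for spk in ["spk_a", "spk_b"]}

def detect_neutral(norm_feats):
    """Stand-in detector: call the lower-energy half of the frames 'neutral'."""
    energy = np.linalg.norm(norm_feats, axis=1)
    return energy <= np.median(energy)

def ifn(feats, max_iters=10):
    neutral = np.ones(len(feats), dtype=bool)          # start from all frames
    for _ in range(max_iters):
        mu = feats[neutral].mean(axis=0)               # params from detected-neutral subset
        sd = feats[neutral].std(axis=0) + 1e-8
        normalized = (feats - mu) / sd                 # affine transform on everything
        new_neutral = detect_neutral(normalized)
        if np.array_equal(new_neutral, neutral):       # converged: decisions stable
            return normalized
        neutral = new_neutral
    return normalized

normalized = {spk: ifn(f) for spk, f in features.items()}
print({spk: np.round(f.mean(), 3) for spk, f in normalized.items()})
```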
21. Exploring Cross-Modality Affective Reactions for Audiovisual Emotion Recognition
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Communication, Facial expression, Modalities, business.industry, Mutual information, Facial recognition system, Multimodal interaction, Human-Computer Interaction, Dialog box, business, Psychology, psychological phenomena and processes, Software, Human communication, Cognitive psychology, Gesture
- Abstract
- Psycholinguistic studies on human communication have shown that during human interaction individuals tend to adapt their behaviors, mimicking the spoken style, gestures, and expressions of their conversational partners. This synchronization pattern is referred to as entrainment. This study investigates the presence of entrainment at the emotion level in cross-modality settings and its implications on multimodal emotion recognition systems. The analysis explores the relationship between acoustic features of the speaker and facial expressions of the interlocutor during dyadic interactions. The analysis shows that 72 percent of the time the speakers displayed similar emotions, indicating strong mutual influence in their expressive behaviors. We also investigate the cross-modality, cross-speaker dependence using a mutual information framework. The study reveals a strong relation between the facial and acoustic features of one subject and the emotional state of the other subject. It also shows strong dependence between heterogeneous modalities across conversational partners. These findings suggest that the expressive behaviors from one dialog partner provide complementary information to recognize the emotional state of the other dialog partner. The analysis motivates classification experiments exploiting cross-modality, cross-speaker information. The study presents emotion recognition experiments using the IEMOCAP and SEMAINE databases. The results demonstrate the benefit of exploiting this emotional entrainment effect, showing statistically significant improvements.
- Published
- 2013
22. Generating Human-Like Behaviors Using Joint, Speech-Driven Models for Conversational Agents
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Facial expression, Visual perception, Acoustics and Ultrasonics, Computer science, Speech recognition, Animation, computer.software_genre, Speech processing, Electrical and Electronic Engineering, Dialog system, computer, Dynamic Bayesian network, Computer facial animation, Gesture
- Abstract
- During human communication, every spoken message is intrinsically modulated within different verbal and nonverbal cues that are externalized through various aspects of speech and facial gestures. These communication channels are strongly interrelated, which suggests that generating human-like behavior requires a careful study of their relationship. Neglecting the mutual influence of different communicative channels in the modeling of natural behavior for a conversational agent may result in unrealistic behaviors that can affect the intended visual perception of the animation. This relationship exists both between audiovisual information and within different visual aspects. This paper explores the idea of using joint models to preserve the coupling not only between speech and facial expression, but also within facial gestures. As a case study, the paper focuses on building a speech-driven facial animation framework to generate natural head and eyebrow motions. We propose three dynamic Bayesian networks (DBNs), which make different assumptions about the coupling between speech, eyebrow and head motion. Synthesized animations are produced based on the MPEG-4 facial animation standard, using the audiovisual IEMOCAP database. The experimental results based on perceptual evaluations reveal that the proposed joint models (speech/eyebrow/head) outperform audiovisual models that are separately trained (speech/head and speech/eyebrow).
- Published
- 2012
23. Building a naturalistic emotional speech corpus by retrieving expressive behaviors from existing speech corpora
- Author
- Reza Lotfian, Carlos Busso, and Soroosh Mariooryad
- Subjects
- Computer science, business.industry, media_common.quotation_subject, Speech recognition, Speech corpus, Speech processing, computer.software_genre, Task (project management), Phone, Natural (music), Conversation, Artificial intelligence, Affective computing, business, computer, Natural language processing, media_common
- Abstract
- A key element in affective computing is to have large corpora of genuine emotional samples collected during natural conversations. Recording natural interactions through the telephone is an appealing approach to build emotional databases. However, collecting real conversational data with expressive reactions is a challenging task, especially if the recordings are to be shared with the community (e.g., privacy concerns). This study explores a novel approach consisting of retrieving emotional reactions from existing spontaneous speech databases collected for general speech processing problems. Although most of the recordings in these databases are expected to have non-emotional expressions, given the naturalness of the interactions, the flow of the conversation can lead to emotional responses from conversation partners, which we aim to retrieve. We use the IEMOCAP and SEMAINE databases to build emotion detector systems. We use these classifiers to identify emotional behaviors from the FISHER database, which is a large conversational speech corpus recorded over the phone. Subjective evaluations over the retrieved samples demonstrate the potential of the proposed scheme to build a naturalistic emotional speech database. Index Terms: emotion recognition, expressive speech, information retrieval, emotional databases. (A retrieval sketch follows this entry.)
- Published
- 2014
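A schematic sketch of the retrieval pipeline in the abstract above, with synthetic stand-ins for the labeled corpora and the large conversational pool; the feature dimensions, the logistic-regression detector, and the top-k selection rule are assumptions for illustration, not the authors' setup.

```python
# Sketch: train an emotion detector on labeled data, score a large unlabeled
# pool, and keep only the most confidently "emotional" segments for annotation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for acoustic features from an emotionally labeled corpus.
X_lab = rng.normal(size=(1000, 12))
y_lab = (X_lab[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)  # 1 = emotional

detector = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

# Stand-in for a much larger conversational pool (e.g., telephone speech segments).
X_pool = rng.normal(size=(50_000, 12))
scores = detector.predict_proba(X_pool)[:, 1]

# Retrieve the top-scoring segments as candidate expressive samples.
top_k = 500
candidates = np.argsort(scores)[::-1][:top_k]
print("retrieved", len(candidates), "segments; min score:",
      scores[candidates].min().round(3))
```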
24. Analysis and Compensation of the Reaction Lag of Evaluators in Continuous Emotional Annotations
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Artificial neural network, Contextual image classification, business.industry, Speech recognition, Lag, Mutual information, Stimulus (physiology), computer.software_genre, Annotation, Emotion recognition, Artificial intelligence, Affective computing, Psychology, business, GeneralLiterature_REFERENCE(e.g.,dictionaries,encyclopedias,glossaries), computer, Natural language processing
- Abstract
- Defining useful emotional descriptors to characterize expressive behaviors is an important research area in affective computing. Recent studies have shown the benefits of using continuous emotional evaluations to annotate spontaneous corpora. Instead of assigning global labels per segment, this approach captures the temporal dynamic evolution of the emotions. A challenge of continuous assessments is the inherent reaction lag of the evaluators. During the annotation process, an observer needs to sense the stimulus, perceive the emotional message, and define his/her judgment, all this in real time. As a result, we expect a reaction lag between the annotation and the underlying emotional content. This paper uses mutual information to quantify and compensate for this reaction lag. Classification experiments on the SEMAINE database demonstrate that the performance of emotion recognition systems improves when the evaluator reaction lag is considered. We explore annotator-dependent and annotator-independent compensation schemes.
- Published
- 2013
25. Factorizing speaker, lexical and emotional variabilities observed in facial expressions
- Author
- Carlos Busso and Soroosh Mariooryad
- Subjects
- Facial expression, Computer science, Speech recognition, Metric (mathematics), Feature extraction, Mutual information, Affect (psychology), Speaker recognition, Facial recognition system, TRACE (psycholinguistics)
- Abstract
- An effective human-computer interaction system should be equipped with mechanisms to recognize and respond to the affective state of the user. However, a spoken message conveys different communicative aspects such as the verbal content, emotional state and idiosyncrasy of the speaker. Each of these aspects introduces variability that will affect the performance of an emotion recognition system. If the models used to capture the expressive behaviors are constrained by the lexical content and speaker identity, it is expected that the observed uncertainty in the channel will decrease, improving the accuracy of the system. Motivated by these observations, this study aims to quantify and localize the speaker, lexical and emotional variabilities observed in the face during human interaction. A metric inspired by mutual information theory is proposed to quantify the dependency of facial features on these factors. This metric uses the trace of the covariance matrix of facial motion trajectories to measure the uncertainty. The experimental results confirm the strong influence of the lexical information in the lower part of the face. For this facial region, the results demonstrate the benefit of constraining the emotional model on the lexical content. The ultimate goal of this research is to utilize this information to constrain the emotional models on the underlying lexical units to improve the accuracy of emotion recognition systems. (A sketch of the trace-based metric follows this entry.)
- Published
- 2012
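A hypothetical sketch of the covariance-trace metric mentioned above, assuming that "uncertainty" is the trace of the covariance of facial-feature trajectories and that "dependency on a factor" is the relative reduction of that trace when the data are grouped by that factor. The synthetic features below are constructed to depend on a lexical factor but not on an emotional one, so the two printed dependencies should differ sharply.

```python
# Illustrative covariance-trace dependency metric (assumed formulation,
# not necessarily the paper's exact definition).
import numpy as np

def uncertainty_reduction(trajectories, factor_labels):
    """1 - E_f[trace(cov | factor = f)] / trace(cov); larger means stronger dependency."""
    total = np.trace(np.cov(trajectories, rowvar=False))
    labels = np.unique(factor_labels)
    conditional = np.mean([np.trace(np.cov(trajectories[factor_labels == f], rowvar=False))
                           for f in labels])
    return 1.0 - conditional / total

rng = np.random.default_rng(0)
n, d = 3000, 10
phoneme = rng.integers(0, 8, size=n)                  # hypothetical lexical factor
emotion = rng.integers(0, 4, size=n)                  # hypothetical emotional factor

# Lower-face-like features driven mostly by the lexical factor.
centers = rng.normal(size=(8, d)) * 2.0
feats = centers[phoneme] + 0.5 * rng.normal(size=(n, d))

print("dependency on phoneme:", round(uncertainty_reduction(feats, phoneme), 3))  # high
print("dependency on emotion:", round(uncertainty_reduction(feats, emotion), 3))  # near 0
```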