Author: "Busso, Carlos" - Searchworks@Jio Institute Digital Library Search Results

626 results on '"Busso, Carlos"'

1. Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment

Author: Leem, Seong-Gyun, Fulford, Daniel, Onnela, Jukka-Pekka, Gard, David, and Busso, Carlos
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of testing environments to maximize SER performance under noisy conditions. To address this task, we propose a text-guided, environment-aware training where an SER model is trained with contaminated speech samples and their paired noise description. We use a pre-trained text encoder to extract the text-based environment embedding and then fuse it to a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through our experiment with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound repository. Our experiment indicates that the text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise-robustness of the SER system. In addition, our proposed approach with an LLM yields better performance than our environment-agnostic baselines, especially in low signal-to-noise ratio (SNR) conditions. When testing at -5 dB SNR level, our proposed method shows better performance than our best baseline model by 31.8% (arousal), 23.5% (dominance), and 9.5% (valence).
Published: 2024

2. A Layer-Anchoring Strategy for Enhancing Cross-Lingual Speech Emotion Recognition

Author: Upadhyay, Shreya G., Busso, Carlos, and Lee, Chi-Chun
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications. While recent SER research relies heavily on large pretrained models for emotion training, existing studies often concentrate solely on the final transformer layer of these models. However, given the task-specific nature and hierarchical architecture of these models, each transformer layer encapsulates different levels of information. Leveraging this hierarchical structure, our study focuses on the information embedded across different layers. Through an examination of layer feature similarity across different languages, we propose a novel strategy called a layer-anchoring mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our approach is evaluated using two distinct language affective corpora (MSP-Podcast and BIIC-Podcast), achieving a best UAR performance of 60.21% on the BIIC-podcast corpus. The analysis uncovers interesting insights into the behavior of popular pretrained models.
Published: 2024

3. We Need Variations in Speech Synthesis: Sub-center Modelling for Speaker Embeddings

Author: Ulgen, Ismail Rasim, Busso, Carlos, Hansen, John H. L., and Sisman, Berrak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning
Abstract: In speech synthesis, modeling of the rich emotions and prosodic variations present in the human voice is crucial to synthesize natural speech. Although speaker embeddings have been widely used in personalized speech synthesis as conditioning inputs, they are designed to lose variation to optimize speaker recognition accuracy. Thus, they are suboptimal for speech synthesis in terms of modeling the rich variations at the output speech distribution. In this work, we propose a novel speaker embedding network which utilizes multiple class centers in the speaker classification training rather than a single class center as in traditional embeddings. The proposed approach introduces variations in the speaker embedding while retaining the speaker recognition performance, since the model does not have to map all of the utterances of a speaker into a single class center. We apply our proposed embedding in a voice conversion task and show that our method provides better naturalness and prosody in synthesized speech.
Comment: Submitted to IEEE Signal Processing Letters
Published: 2024

4. Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline

Author: Salman, Ali N., Du, Zongyang, Chandra, Shreeram Suresh, Ulgen, Ismail Rasim, Busso, Carlos, and Sisman, Berrak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Voice conversion (VC) research traditionally depends on scripted or acted speech, which lacks the natural spontaneity of real-life conversations. While natural speech data is limited for VC, our study focuses on filling in this gap. We introduce a novel data-sourcing pipeline that enables the release of a natural speech dataset for VC, named NaturalVoices. The pipeline extracts rich information in speech such as emotion and signal-to-noise ratio (SNR) from raw podcast data, utilizing recent deep learning methods and providing flexibility and ease of use. NaturalVoices is a large-scale, spontaneous, expressive, and emotional speech dataset, comprising over 3,800 hours of speech sourced from the original podcasts in the MSP-Podcast dataset. Objective and subjective evaluations demonstrate the effectiveness of using our pipeline for providing natural and expressive data for VC, suggesting the potential of NaturalVoices for broader speech generation tasks.
Published: 2024

5. emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

Author: Rajapakshe, Thejan, Rana, Rajib, Khalifa, Sara, Sisman, Berrak, Schuller, Bjorn W., and Busso, Carlos
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. The Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports the selection of CNN and LSTM coupling to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations in conjunction using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. We demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM by evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets.
Comment: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402
Published: 2024

6. Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

Author: Ulgen, Ismail Rasim, Du, Zongyang, Busso, Carlos, and Sisman, Berrak
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. In order to leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach involves the sampling of positive and negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significant improvement in SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting.
Comment: Accepted to ICASSP 2024
Published: 2024

7. Versatile audio-visual learning for emotion recognition

Author: Goncalves, Lucas, Leem, Seong-Gyun, Lin, Wei-Cheng, Sisman, Berrak, and Busso, Carlos
Subjects: Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. It is also a challenge to robustly handle missing or partial information while allowing direct switch between regression or classification tasks. This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression or emotion classification tasks. We implement an audio-visual framework that can be trained even when audio and visual paired data is not available for part of the training set (i.e., audio only or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on the CREMA-D, MSP-IMPROV, and CMU-MOSEI corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus.
Comment: 18 pages, 4 figures, 3 tables (published at IEEE Transactions on Affective Computing)
Published: 2023

8. Mixed-EVC: Mixed Emotion Synthesis and Control in Voice Conversion

Author: Zhou, Kun, Sisman, Berrak, Busso, Carlos, Ma, Bin, and Li, Haizhou
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Emotional voice conversion (EVC) traditionally targets the transformation of spoken utterances from one emotional state to another, with previous research mainly focusing on discrete emotion categories. This paper departs from the norm by introducing a novel perspective: a nuanced rendering of mixed emotions and enhancing control over emotional expression. To achieve this, we propose a novel EVC framework, Mixed-EVC, which only leverages discrete emotion training labels. We construct an attribute vector that encodes the relationships among these discrete emotions, which is predicted using a ranking-based support vector machine and then integrated into a sequence-to-sequence (seq2seq) EVC framework. Mixed-EVC not only learns to characterize the input emotional style but also quantifies its relevance to other emotions during training. As a result, users have the ability to assign these attributes to achieve their desired rendering of mixed emotions. Objective and subjective evaluations confirm the effectiveness of our approach in terms of mixed emotion synthesis and control while surpassing traditional baselines in the conversion of discrete emotions from one to another.
Published: 2022

9. Speech emotion recognition in real static and dynamic human-robot interaction scenarios

Author: Grágeda, Nicolás, Busso, Carlos, Alvarado, Eduardo, García, Ricardo, Mahu, Rodrigo, Huenupan, Fernando, and Yoma, Néstor Becerra
Published: 2025

10. Driving Anomaly Detection Using Conditional Generative Adversarial Network

Author: Qiu, Yuning, Misu, Teruhisa, and Busso, Carlos
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Anomaly driving detection is an important problem in advanced driver assistance systems (ADAS). It is important to identify potential hazard scenarios as early as possible to avoid potential accidents. This study proposes an unsupervised method to quantify driving anomalies using a conditional generative adversarial network (GAN). The approach predicts upcoming driving scenarios by conditioning the models on the previously observed signals. The system uses the difference of the output from the discriminator between the predicted and actual signals as a metric to quantify the anomaly degree of a driving segment. We take a driver-centric approach, considering physiological signals from the driver and controller area network-Bus (CAN-Bus) signals from the vehicle. The approach is implemented with convolutional neural networks (CNNs) to extract discriminative feature representations, and with long short-term memory (LSTM) cells to capture temporal information. The study is implemented and evaluated with the driving anomaly dataset (DAD), which includes 250 hours of naturalistic recordings manually annotated with driving events. The experimental results reveal that recordings annotated with events that are likely to be anomalous, such as avoiding on-road pedestrians and traffic rule violations, have higher anomaly scores than recordings without any event annotation. The results are validated with perceptual evaluations, where annotators are asked to assess the risk and familiarity of the videos detected with high anomaly scores. The results indicate that the driving segments with higher anomaly scores are more risky and less regularly seen on the road than other driving segments, validating the proposed unsupervised approach.
Comment: 15 pages, 14 figures, 6 tables
Published: 2022

11. Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech

Author: Sridhar, Kusha and Busso, Carlos
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The prediction of valence from speech is an important, but challenging problem. The externalization of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is a hard problem, especially with deep neural networks (DNNs), since it requires optimizing millions of parameters. This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set. Speech samples from the selected speakers are used to create the adaptation set. This approach leverages transfer learning using pre-trained models, which are adapted with these speech samples. We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches. These methods differ on the use of the adaptation set in the personalization of the valence models. The results demonstrate that a valence prediction model can be efficiently personalized with these unsupervised approaches, leading to relative improvements as high as 13.52%.
Comment: 8 figures and 5 tables
Published: 2022

12. Deep temporal clustering features for speech emotion recognition

Author: Lin, Wei-Cheng and Busso, Carlos
Published: 2024

13. The Multimodal Driver Monitoring Database: A Naturalistic Corpus to Study Driver Attention

Author: Jha, Sumit, Marzban, Mohamed F., Hu, Tiancheng, Mahmoud, Mohamed H., Al-Dhahir, Naofal, and Busso, Carlos
Subjects: Computer Science - Computer Vision and Pattern Recognition, 97P30
Abstract: A smart vehicle should be able to monitor the actions and behaviors of the human driver to provide critical warnings or intervene when necessary. Recent advancements in deep learning and computer vision have shown great promise in monitoring human behaviors and activities. While these algorithms work well in a controlled environment, naturalistic driving conditions add new challenges such as illumination variations, occlusions and extreme head poses. A vast amount of in-domain data is required to train models that provide high performance in predicting driving related tasks to effectively monitor driver actions and behaviors. Toward building the required infrastructure, this paper presents the multimodal driver monitoring (MDM) dataset, which was collected with 59 subjects that were recorded performing various tasks. We use the Fi-Cap device that continuously tracks the head movement of the driver using fiducial markers, providing frame-based annotations to train head pose algorithms in naturalistic driving conditions. We ask the driver to look at predetermined gaze locations to obtain accurate correlation between the driver's facial image and visual attention. We also collect data when the driver performs common secondary activities such as navigation using a smart phone and operating the in-car infotainment system. All of the driver's activities are recorded with high definition RGB cameras and time-of-flight depth camera. We also record the controller area network-bus (CAN-Bus), extracting important information. These high quality recordings serve as the ideal resource to train various efficient algorithms for monitoring the driver, providing further advancements in the field of in-vehicle safety systems.
Comment: 14 pages, 12 figures, 3 tables
Published: 2020

14. Estimation of Driver's Gaze Region from Head Position and Orientation using Probabilistic Confidence Regions

Author: Jha, Sumit and Busso, Carlos
Subjects: Computer Science - Computer Vision and Pattern Recognition, 68T45
Abstract: A smart vehicle should be able to understand human behavior and predict their actions to avoid hazardous situations. Specific traits in human behavior can be automatically predicted, which can help the vehicle make decisions, increasing safety. One of the most important aspects pertaining to the driving task is the driver's visual attention. Predicting the driver's visual attention can help a vehicle understand the awareness state of the driver, providing important contextual information. While estimating the exact gaze direction is difficult in the car environment, a coarse estimation of the visual attention can be obtained by tracking the position and orientation of the head. Since the relation between head pose and gaze direction is not one-to-one, this paper proposes a formulation based on probabilistic models to create salient regions describing the visual attention of the driver. The area of the predicted region is small when the model has high confidence on the prediction, which is directly learned from the data. We use Gaussian process regression (GPR) to implement the framework, comparing the performance with different regression formulations such as linear regression and neural network based methods. We evaluate these frameworks by studying the tradeoff between spatial resolution and accuracy of the probability map using naturalistic recordings collected with the UTDrive platform. We observe that the GPR method produces the best result creating accurate predictions with localized salient regions. For example, the 95% confidence region is defined by an area that covers 3.77% region of a sphere surrounding the driver.
Comment: 13 pages, 12 figures, 2 tables
Published: 2020

15. Multimodal attention for lip synthesis using conditional generative adversarial networks

Author: Vidal, Andrea and Busso, Carlos
Published: 2023

16. The Ambiguous World of Emotion Representation

Author: Sethu, Vidhyasaharan, Provost, Emily Mower, Epps, Julien, Busso, Carlos, Cummins, Nicholas, and Narayanan, Shrikanth
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Artificial intelligence and machine learning systems have demonstrated huge improvements and human-level parity in a range of activities, including speech recognition, face recognition and speaker verification. However, these diverse tasks share a key commonality that is not true in affective computing: the ground truth information that is inferred can be unambiguously represented. This observation provides some hints as to why affective computing, despite having attracted the attention of researchers for years, may not still be considered a mature field of research. A key reason for this is the lack of a common mathematical framework to describe all the relevant elements of emotion representations. This paper proposes the AMBiguous Emotion Representation (AMBER) framework to address this deficiency. AMBER is a unified framework that explicitly describes categorical, numerical and ordinal representations of emotions, including time varying representations. In addition to explaining the core elements of AMBER, the paper also discusses how some of the commonly employed emotion representation schemes can be viewed through the AMBER framework, and concludes with a discussion of how the proposed framework can be used to reason about current and future affective computing systems.
Published: 2019

17. Report of 2017 NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps

Author: Chang, Shih-Fu, Hauptmann, Alex, Morency, Louis-Philippe, Antani, Sameer, Bulterman, Dick, Busso, Carlos, Chai, Joyce, Hirschberg, Julia, Jain, Ramesh, Mayer-Patel, Ketan, Meth, Reuven, Mooney, Raymond, Nahrstedt, Klara, Narayanan, Shri, Natarajan, Prem, Oviatt, Sharon, Prabhakaran, Balakrishnan, Smeulders, Arnold, Sundaram, Hari, Zhang, Zhengyou, and Zhou, Michelle
Subjects: Computer Science - Multimedia
Abstract: With the transformative technologies and the rapidly changing global R&D landscape, the multimedia and multimodal community is now faced with many new opportunities and uncertainties. With the open source dissemination platform and pervasive computing resources, new research results are being discovered at an unprecedented pace. In addition, the rapid exchange and influence of ideas across traditional discipline boundaries have made the emphasis on multimedia multimodal research even more important than before. To seize these opportunities and respond to the challenges, we have organized a workshop to specifically address and brainstorm the challenges, opportunities, and research roadmaps for MM research. The two-day workshop, held on March 30 and 31, 2017 in Washington DC, was sponsored by the Information and Intelligent Systems Division of the National Science Foundation of the United States. Twenty-three (23) invited participants were asked to review and identify research areas in the MM field that are most important over the next 10-15 year timeframe. Important topics were selected through discussion and consensus, and then discussed in depth in breakout groups. Breakout groups reported initial discussion results to the whole group, who continued with further extensive deliberation. For each identified topic, a summary was produced after the workshop to describe the main findings, including the state of the art, challenges, and research roadmaps planned for the next 5, 10, and 15 years in the identified area.
Comment: Long Report of NSF Workshop on Multimedia Challenges, Opportunities and Research Roadmaps, held in March 2017, Washington DC. Short report available separately
Published: 2019

18. Semi-Supervised Speech Emotion Recognition with Ladder Networks

Author: Parthasarathy, Srinivas and Busso, Carlos
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech emotion recognition (SER) systems find applications in various fields such as healthcare, education, and security and defense. A major drawback of these systems is their lack of generalization across different conditions. This problem can be solved by training models on large amounts of labeled data from the target domain, which is expensive and time-consuming. Another approach is to increase the generalization of the models. An effective way to achieve this goal is by regularizing the models through multitask learning (MTL), where auxiliary tasks are learned along with the primary task. These methods often require the use of labeled data which is computationally expensive to collect for emotion recognition (gender, speaker identity, age or other emotional descriptors). This study proposes the use of ladder networks for emotion recognition, which utilizes an unsupervised auxiliary task. The primary task is a regression problem to predict emotional attributes. The auxiliary task is the reconstruction of intermediate feature representations using a denoising autoencoder. This auxiliary task does not require labels so it is possible to train the framework in a semi-supervised fashion with abundant unlabeled data from the target domain. This study shows that the proposed approach creates a powerful framework for SER, achieving superior performance than fully supervised single-task learning (STL) and MTL baselines. The approach is implemented with several acoustic features, showing that ladder networks generalize significantly better in cross-corpus settings. Compared to the STL baselines, the proposed approach achieves relative gains in concordance correlation coefficient (CCC) between 3.0% and 3.5% for within corpus evaluations, and between 16.1% and 74.1% for cross corpus evaluations, highlighting the power of the architecture.
Published: 2019

19. A comprehensive analysis of sialolith proteins and the clinical implications

Author: Busso, Carlos S, Guidry, Jessie J, Gonzalez, Jhanis J, Zorba, Vassilia, Son, Leslie S, Winsauer, Peter J, and Walvekar, Rohan R
Subjects: Medical Biochemistry and Metabolomics, Biomedical and Clinical Sciences, Urologic Diseases, Dental/Oral and Craniofacial Disease, Sialolithiasis, Sialolith, Protein profiling, Extracellular exosomes, Biochemistry & Molecular Biology, Biochemistry and cell biology, Clinical sciences, Medical biochemistry and metabolomics
Abstract: Background: Sialolithiasis or salivary gland stones are associated with high clinical morbidity. The advances in the treatment of sialolithiasis have been limited, however, by our understanding of their composition. More specifically, there is little information regarding the formation and composition of the protein matrix, the role of mineralogical deposition, or the contributions of cell epithelium and secretions from the salivary glands. A better understanding of these stone characteristics could pave the way for future non-invasive treatment strategies. Methods: Twenty-nine high-quality ductal stone samples were analyzed. The preparation included successive washings to avoid contamination from saliva and blood. The sialoliths were macerated in liquid nitrogen and the maceration was subjected to a sequential, four-step, protein extraction. The four fractions were pooled together, and a standardized aliquot was subjected to tandem liquid chromatography mass spectrometry (LC-MS). The data output was subjected to a basic descriptive statistical analysis for parametric confirmation and a subsequent G.O.-KEGG database functional analysis and classification for biological interpretation. Results: The LC-MS output detected 6934 proteins, 824 of which were unique for individual stones. An example of our sialolith protein data is available via ProteomeXchange with the identifier PXD012422. More important, the sialoliths averaged 53% homology with bone-forming proteins that served as a standard comparison, which favorably compared with 62% homology identified among all sialolith sample proteins. The non-homologous protein fraction had a highly variable protein identity. The G.O.-KEGG functional analysis indicated that extracellular exosomes are a primary cellular component in sialolithiasis. Light and electron microscopy also confirmed the presence of exosomal-like features and the presence of intracellular microcrystals. Conclusion: Sialolith formation presents similarities with the hyperoxaluria that forms kidney stones, which suggests the possibility of a common origin. Further verification of a common origin could fundamentally change the way in which lithiasis is studied and treated.
Published: 2020

20. End-to-end Audiovisual Speech Activity Detection with Bimodal Recurrent Neural Models

Author: Tao, Fei and Busso, Carlos
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech activity detection (SAD) plays an important role in current speech processing systems, including automatic speech recognition (ASR). SAD is particularly difficult in environments with acoustic noise. A practical solution is to incorporate visual information, increasing the robustness of the SAD approach. An audiovisual system has the advantage of being robust to different speech modes (e.g., whisper speech) or background noise. Recent advances in audiovisual speech processing using deep learning have opened opportunities to capture in a principled way the temporal relationships between acoustic and visual features. This study explores this idea proposing a bimodal recurrent neural network (BRNN) framework for SAD. The approach models the temporal dynamic of the sequential audiovisual data, improving the accuracy and robustness of the proposed SAD system. Instead of estimating hand-crafted features, the study investigates an end-to-end training approach, where acoustic and visual features are directly learned from the raw data during training. The experimental evaluation considers a large audiovisual corpus with over 60.8 hours of recordings, collected from 105 speakers. The results demonstrate that the proposed framework leads to absolute improvements up to 1.2% under practical scenarios over a VAD baseline using only audio implemented with deep neural network (DNN). The proposed approach achieves 92.7% F1-score when it is evaluated using the sensors from a portable tablet under noisy acoustic environment, which is only 1.0% lower than the performance obtained under ideal conditions (e.g., clean speech obtained with a high definition camera and a close-talking microphone).
Comment: Submitted to Speech Communication
Published: 2018
Full Text: View/download PDF

21. Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks

Author: Sadoughi, Najmeh and Busso, Carlos
Subjects: Computer Science - Human-Computer Interaction
Abstract: Articulation, emotion, and personality play strong roles in the orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important that we carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion and lexical content in a principled manner. This model uses a set of articulatory and emotional features directly extracted from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in terms of objective and subjective evaluations. When the target emotion is known, we propose to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted), or adding emotional conditions as the input of the model (CSG-Emo-Aware). Objective evaluations of these models show improvements for the CSG-Emo-Adapted compared with the CSG model, as the trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.
Published: 2018
Full Text: View/download PDF

22. Curriculum Learning for Speech Emotion Recognition from Crowdsourced Labels

Author: Lotfian, Reza and Busso, Carlos
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This study introduces a method to design a curriculum for machine-learning to maximize the efficiency during the training process of deep neural networks (DNNs) for speech emotion recognition. Previous studies in other machine-learning problems have shown the benefits of training a classifier following a curriculum where samples are gradually presented in increasing level of difficulty. For speech emotion recognition, the challenge is to establish a natural order of difficulty in the training set to create the curriculum. We address this problem by assuming that ambiguous samples for humans are also ambiguous for computers. Speech samples are often annotated by multiple evaluators to account for differences in emotion perception across individuals. While some sentences with clear emotional content are consistently annotated, sentences with more ambiguous emotional content present important disagreement between individual evaluations. We propose to use the disagreement between evaluators as a measure of difficulty for the classification task. We propose metrics that quantify the inter-evaluation agreement to define the curriculum for regression problems and binary and multi-class classification problems. The experimental results consistently show that relying on a curriculum based on agreement between human judgments leads to statistically significant improvements over baselines trained without a curriculum. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2018
Full Text: View/download PDF

23. Ladder Networks for Emotion Recognition: Using Unsupervised Auxiliary Tasks to Improve Predictions of Emotional Attributes

Author: Parthasarathy, Srinivas and Busso, Carlos
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recognizing emotions using a few attribute dimensions such as arousal, valence and dominance provides the flexibility to effectively represent a complex range of emotional behaviors. Conventional methods to learn these emotional descriptors primarily focus on separate models to recognize each of these attributes. Recent work has shown that learning these attributes together regularizes the models, leading to better feature representations. This study explores new forms of regularization by adding unsupervised auxiliary tasks to reconstruct hidden layer representations. This auxiliary task requires the denoising of hidden representations at every layer of an auto-encoder. The framework relies on ladder networks that utilize skip connections between encoder and decoder layers to learn powerful representations of emotional dimensions. The results show that ladder networks improve the performance of the system compared to baselines that individually learn each attribute, and conventional denoising autoencoders. Furthermore, the unsupervised auxiliary tasks have promising potential to be used in a semi-supervised setting, where few labeled sentences are available. Comment: Submitted to Interspeech 2018
Published: 2018
Full Text: View/download PDF

24. Domain Adversarial for Acoustic Emotion Recognition

Author: Abdelwahab, Mohammed and Busso, Carlos
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The performance of speech emotion recognition is affected by the differences in data distributions between train (source domain) and test (target domain) sets used to build and evaluate the models. This is a common problem, as multiple studies have shown that the performance of emotional classifiers drops when they are exposed to data that does not match the distribution used to build the emotion classifiers. The difference in data distributions becomes very clear when the training and testing data come from different domains, causing a large performance gap between validation and testing performance. Due to the high cost of annotating new data and the abundance of unlabeled data, it is crucial to extract as much useful information as possible from the available unlabeled data. This study looks into the use of adversarial multitask training to extract a common representation between train and test domains. The primary task is to predict emotional attribute-based descriptors for arousal, valence, or dominance. The secondary task is to learn a common representation where the train and test domains cannot be distinguished. By using a gradient reversal layer, the gradients coming from the domain classifier are used to bring the source and target domain representations closer. We show that exploiting unlabeled data consistently leads to better emotion recognition performance across all emotional dimensions. We visualize the effect of adversarial training on the feature representation across the proposed deep learning architecture. The analysis shows that the data representations for the train and test domains converge as the data is passed to deeper layers of the network. We also evaluate the difference in performance when we use a shallow neural network versus a deep neural network (DNN) and the effect of the number of shared layers used by the task and domain classifiers. Comment: Submitted to IEEE Transactions on Signal Processing
Published: 2018
Full Text: View/download PDF

25. Face detection and grimace scale prediction of white furred mice

Author: Vidal, Andrea, Jha, Sumit, Hassler, Shayne, Price, Theodore, and Busso, Carlos
Published: 2022
Full Text: View/download PDF

26. Speech-driven Animation with Meaningful Behaviors

Author: Sadoughi, Najmeh and Busso, Carlos
Subjects: Computer Science - Human-Computer Interaction
Abstract: Conversational agents (CAs) play an important role in human computer interaction. Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Studies in the past have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors conveying the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures. However, they create behaviors disregarding the meaning of the message. This study proposes to bridge the gap between these two approaches overcoming their limitations. The approach builds a dynamic Bayesian network (DBN), where a discrete variable is added to constrain the behaviors on the underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on the discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class learning the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer creating trajectories that are timely synchronized with speech. The study proposes a DBN structure and a training approach that (1) models the cause-effect relationship between the constraint and the gestures, (2) initializes the state configuration models increasing the range of the generated behaviors, and (3) captures the differences in the behaviors across constraints by enforcing sparse transitions between shared and exclusive states per constraint. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained model. Comment: 13 pages, 12 figures, 5 tables
Published: 2017
Full Text: View/download PDF

27. Using machine learning to increase access to and engagement with trauma‐focused interventions for posttraumatic stress disorder

Author: Lenton‐Brym, Ariella P., primary, Collins, Alexis, additional, Lane, Jeanine, additional, Busso, Carlos, additional, Ouyang, Jessica, additional, Fitzpatrick, Skye, additional, Kuo, Janice R., additional, and Monson, Candice M., additional
Published: 2024
Full Text: View/download PDF

28. SPEECH EMOTION RECOGNITION IN REAL STATIC AND DYNAMIC HUMAN-ROBOT INTERACTION SCENARIOS

Author: Grágeda, Nicolás, primary, Busso, Carlos, additional, Alvarado, Eduardo, additional, García, Ricardo, additional, Mahu, Rodrigo, additional, Huenupan, Fernando, additional, and Yoma, Néstor Becerra, additional
Published: 2024
Full Text: View/download PDF

29. Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition

Author: Naini, Abinay Reddy, primary, Kohler, Mary A., additional, Richerson, Elizabeth, additional, Robinson, Donita, additional, and Busso, Carlos, additional
Published: 2024
Full Text: View/download PDF

30. Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

Author: Ulgen, Ismail Rasim, primary, Du, Zongyang, additional, Busso, Carlos, additional, and Sisman, Berrak, additional
Published: 2024
Full Text: View/download PDF

31. Dynamic Speech Emotion Recognition Using A Conditional Neural Process

Author: Martinez-Lucas, Luz, primary and Busso, Carlos, additional
Published: 2024
Full Text: View/download PDF

32. Smartphone sensing of social interactions in people with and without schizophrenia

Author: Fulford, Daniel, Mote, Jasmine, Gonzalez, Rachel, Abplanalp, Samuel, Zhang, Yuting, Luckenbaugh, Jarrod, Onnela, Jukka-Pekka, Busso, Carlos, and Gard, David E.
Published: 2021
Full Text: View/download PDF

33. Plant Species and Defoliation Effects on Soil Nitrogen Mineralization in a Semiarid Rangeland of Argentina

Author: Ambrosino, Mariela Lis, Martínez, Juan Manuel, Busso, Carlos Alberto, Minoldo, Gabriela Verónica, Torres, Yanina Alejandra, Ithurrart, Leticia Soledad, and Cardillo, Daniela Solange
Published: 2021
Full Text: View/download PDF

34. Computer-assisted discrimination of cancerous and pre-cancerous from benign oral lesions based on multispectral autofluorescence imaging endoscopy.

Author: Duran Sierra, Elvis de Jesus, Shuna Cheng, Cuenca, Rodrigo, Ahmed, Beena, Ji, Jim, Yakovlev, Vladislav V., Martinez, Mathias, Al-Khalil, Moustafa, Al-Enazi, Hussain, Busso, Carlos, and Jo, Javier A.
Published: 2024
Full Text: View/download PDF

35. Analyzing Continuous-Time and Sentence-Level Annotations for Speech Emotion Recognition.

Author: Martinez-Lucas, Luz, Lin, Wei-Cheng, and Busso, Carlos
Abstract: The emotional content of several databases is annotated with continuous-time (CT) annotations, providing traces with frame-by-frame scores describing the instantaneous value of an emotional attribute. However, having a single score describing the global emotion of a short segment is more convenient for several emotion recognition formulations. A common approach is to derive sentence-level (SL) labels from CT annotations by aggregating the values of the emotional traces across time and annotators. How similar are these aggregated SL labels to labels originally collected at the sentence level? The release of the MSP-Podcast (SL annotations) and MSP-Conversation (CT annotations) corpora provides the resources to explore the validity of aggregating SL labels from CT annotations. There are 2,884 speech segments that belong to both corpora. Using this set, this study (1) compares both types of annotations using statistical metrics, (2) evaluates their inter-evaluator agreements, and (3) explores the effect of these SL labels on speech emotion recognition (SER) tasks. The analysis reveals benefits of using SL labels derived from CT annotations in the estimation of valence. This analysis also provides insights on how the two types of labels differ and how that could affect a model. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

36. Total and structure colonization by arbuscular mycorrhizal fungi in native, perennial grasses of different forage quality exposed to defoliation

Author: Ambrosino, Mariela Lis, Busso, Carlos Alberto, Cabello, Marta Noemí, Velázquez, María Silvana, Torres, Yanina Alejandra, Ithurrart, Leticia Soledad, Cardillo, Daniela Solange, and Palomo, Iris Rosana
Published: 2020
Full Text: View/download PDF

37. Multimodal Behavior Modeling for Socially Interactive Agents

Author: Pelachaud, Catherine, primary, Busso, Carlos, additional, and Heylen, Dirk, additional
Published: 2021
Full Text: View/download PDF

38. Adequate management of post-fire defoliation would not affect the metabolic activity of axillary buds in grasses

Author: Ithurrart, Leticia S., Busso, Carlos A., Torres, Yanina A., Giorgetti, Hugo D., Rodriguez, Gustavo D., and Ambrosino, Mariela L.
Published: 2019

39. Differences in root surface adsorption, root uptake, subcellular distribution, and chemical forms of Cd between low- and high-Cd-accumulating wheat cultivars

Author: Xiao, Ya-Tao, Du, Zhen-Jie, Busso, Carlos-A, Qi, Xue-Bin, Wu, Hai-Qing, Guo, Wei, and Wu, Da-Fu
Published: 2020
Full Text: View/download PDF

40. Head Motion Generation

Author: Sadoughi, Najmeh, Busso, Carlos, Müller, Bertram, Editor-in-Chief, Wolf, Sebastian I., Editor-in-Chief, Brüggemann, Gert-Peter, Section Editor, Deng, Zhigang, Section Editor, McIntosh, Andrew S., Section Editor, Miller, Freeman, Section Editor, and Selbie, W. Scott, Section Editor
Published: 2018
Full Text: View/download PDF

41. End-to-end audiovisual speech activity detection with bimodal recurrent neural models

Author: Tao, Fei and Busso, Carlos
Published: 2019
Full Text: View/download PDF

42. Speech-driven animation with meaningful behaviors

Author: Sadoughi, Najmeh and Busso, Carlos
Published: 2019
Full Text: View/download PDF

43. Combining Relative and Absolute Learning Formulations to Predict Emotional Attributes From Speech

Author: Naini, Abinay Reddy, primary, Subramanium, Shruthi, additional, Leem, Seong-Gyun, additional, and Busso, Carlos, additional
Published: 2023
Full Text: View/download PDF

44. Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization

Author: Lin, Wei-Cheng, primary, Goncalves, Lucas, additional, and Busso, Carlos, additional
Published: 2023
Full Text: View/download PDF

45. 8 Head Pose as an Indicator of Drivers’ Visual Attention

Author: Jha, Sumit, primary and Busso, Carlos, additional
Published: 2020
Full Text: View/download PDF

46. Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory

Author: Sadoughi, Najmeh, Busso, Carlos, Hutchison, David, Editorial Board Member, Kanade, Takeo, Editorial Board Member, Kittler, Josef, Editorial Board Member, Kleinberg, Jon M., Editorial Board Member, Mattern, Friedemann, Editorial Board Member, Mitchell, John C., Editorial Board Member, Naor, Moni, Editorial Board Member, Pandu Rangan, C., Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Terzopoulos, Demetri, Editorial Board Member, Tygar, Doug, Editorial Board Member, Weikum, Gerhard, Series Editor, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Beskow, Jonas, editor, Peters, Christopher, editor, Castellano, Ginevra, editor, O'Sullivan, Carol, editor, Leite, Iolanda, editor, and Kopp, Stefan, editor
Published: 2017
Full Text: View/download PDF

47. Communities of arbuscular mycorrhizal fungi associated with perennial grasses of different forage quality exposed to defoliation

Author: Ambrosino, Mariela Lis, Cabello, Marta Noemí, Busso, Carlos Alberto, Velázquez, María Silvana, Torres, Yanina Alejandra, Cardillo, Daniela Solange, Ithurrart, Leticia Soledad, Montenegro, Oscar Alberto, Giorgetti, Hugo, and Rodriguez, Gustavo
Published: 2018
Full Text: View/download PDF

48. Calibration free, user-independent gaze estimation with tensor analysis

Author: Li, Nanxiang and Busso, Carlos
Published: 2018
Full Text: View/download PDF

49. MSP-DISK: Naturalistic and Diverse In-Vehicle Database for Joint Pose and Seat Belt Detection

Author: Brooks, Isaac, primary, Gogineni, Susmitha, additional, Jha, Sumit, additional, Ray, Soumitry Jagadev, additional, Narasimha, Rajesh, additional, Al-Dhahir, Naofal, additional, and Busso, Carlos, additional
Published: 2023
Full Text: View/download PDF

50. Computation and Memory Efficient Noise Adaptation of Wav2Vec2.0 for Noisy Speech Emotion Recognition with Skip Connection Adapters

Author: Leem, Seong-Gyun, primary, Fulford, Daniel, additional, Onnela, Jukka-Pekka, additional, Gard, David, additional, and Busso, Carlos, additional
Published: 2023
Full Text: View/download PDF
