Author: "Leglaive, Simon" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Leglaive, Simon"' showing total 111 results

1. AnCoGen: Analysis, Control and Generation of Speech with a Masked Autoencoder

Author: Sadok, Samir, Leglaive, Simon, Girin, Laurent, Richard, Gaël, and Alameda-Pineda, Xavier
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and allow precise control of the synthesized speech by modifying them. Extensive experiments demonstrated the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement. Comment: 5 pages, https://samsad35.github.io/site-ancogen
Published: 2025

2. VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Author: Fiche, Guénolé, Leglaive, Simon, Alameda-Pineda, Xavier, Agudo, Antonio, Moreno-Noguer, Francesc, Goos, Gerhard (Series Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Leonardis, Aleš (editor), Ricci, Elisa (editor), Roth, Stefan (editor), Russakovsky, Olga (editor), Sattler, Torsten (editor), and Varol, Gül (editor)
Published: 2025
Full Text: View/download PDF

3. MEGA: Masked Generative Autoencoder for Human Mesh Recovery

Author: Fiche, Guénolé, Leglaive, Simon, Alameda-Pineda, Xavier, and Moreno-Noguer, Francesc
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Human Mesh Recovery (HMR) from a single RGB image is a highly ambiguous problem, as an infinite set of 3D interpretations can explain the 2D observation equally well. Nevertheless, most HMR methods overlook this issue and make a single prediction without accounting for this ambiguity. A few approaches generate a distribution of human meshes, enabling the sampling of multiple predictions; however, none of them is competitive with the latest single-output model when making a single prediction. This work proposes a new approach based on masked generative modeling. By tokenizing the human pose and shape, we formulate the HMR task as generating a sequence of discrete tokens conditioned on an input image. We introduce MEGA, a MaskEd Generative Autoencoder trained to recover human meshes from images and partial human mesh token sequences. Given an image, our flexible generation scheme allows us to predict a single human mesh in deterministic mode or to generate multiple human meshes in stochastic mode. Experiments on in-the-wild benchmarks show that MEGA achieves state-of-the-art performance in deterministic and stochastic modes, outperforming single-output and multi-output approaches.
Published: 2024

4. Objective and subjective evaluation of speech enhancement methods in the UDASE task of the 7th CHiME challenge

Author: Leglaive, Simon, Fraticelli, Matthieu, ElGhazaly, Hend, Borne, Léonie, Sadeghi, Mostafa, Wisdom, Scott, Pariente, Manuel, Hershey, John R., Pressnitzer, Daniel, and Barker, Jon P.
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Supervised models for speech enhancement are trained using artificially generated mixtures of clean speech and noise signals. However, the synthetic training conditions may not accurately reflect real-world conditions encountered during testing. This discrepancy can result in poor performance when the test domain significantly differs from the synthetic training domain. To tackle this issue, the UDASE task of the 7th CHiME challenge aimed to leverage real-world noisy speech recordings from the test domain for unsupervised domain adaptation of speech enhancement models. Specifically, this test domain corresponds to the CHiME-5 dataset, characterized by real multi-speaker and conversational speech recordings made in noisy and reverberant domestic environments, for which ground-truth clean speech signals are not available. In this paper, we present the objective and subjective evaluations of the systems that were submitted to the CHiME-7 UDASE task, and we provide an analysis of the results. This analysis reveals a limited correlation between subjective ratings and several supervised nonintrusive performance metrics recently proposed for speech enhancement. Conversely, the results suggest that more traditional intrusive objective metrics can be used for in-domain performance evaluation using the reverberant LibriCHiME-5 dataset developed for the challenge. The subjective evaluation indicates that all systems successfully reduced the background noise, but always at the expense of increased distortion. Out of the four speech enhancement methods evaluated subjectively, only one demonstrated an improvement in overall quality compared to the unprocessed noisy speech, highlighting the difficulty of the task. The tools and audio material created for the CHiME-7 UDASE task are shared with the community.
Published: 2024
Full Text: View/download PDF

5. VQ-HPS: Human Pose and Shape Estimation in a Vector-Quantized Latent Space

Author: Fiche, Guénolé, Leglaive, Simon, Alameda-Pineda, Xavier, Agudo, Antonio, and Moreno-Noguer, Francesc
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Previous works on Human Pose and Shape Estimation (HPSE) from RGB images can be broadly categorized into two main groups: parametric and non-parametric approaches. Parametric techniques leverage a low-dimensional statistical body model for realistic results, whereas recent non-parametric methods achieve higher precision by directly regressing the 3D coordinates of the human body mesh. This work introduces a novel paradigm to address the HPSE problem, involving a low-dimensional discrete latent representation of the human mesh and framing HPSE as a classification task. Instead of predicting body model parameters or 3D vertex coordinates, we focus on predicting the proposed discrete latent representation, which can be decoded into a registered human mesh. This innovative paradigm offers two key advantages. Firstly, predicting a low-dimensional discrete representation confines our predictions to the space of anthropomorphic poses and shapes even when little training data is available. Secondly, by framing the problem as a classification task, we can harness the discriminative power inherent in neural networks. The proposed model, VQ-HPS, predicts the discrete latent representation of the mesh. The experimental results demonstrate that VQ-HPS outperforms the current state-of-the-art non-parametric approaches while yielding results as realistic as those produced by parametric methods when trained with little data. VQ-HPS also shows promising results when training on large-scale datasets, highlighting the significant potential of the classification approach for HPSE. See the project page at https://g-fiche.github.io/research-pages/vqhps/
Published: 2023

6. The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

Author: Leglaive, Simon, Borne, Léonie, Tzinis, Efthymios, Sadeghi, Mostafa, Fraticelli, Matthieu, Wisdom, Scott, Pariente, Manuel, Pressnitzer, Daniel, and Hershey, John R.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Supervised speech enhancement models are trained using artificially generated mixtures of clean speech and noise signals, which may not match real-world recording conditions at test time. This mismatch can lead to poor performance if the test domain significantly differs from the synthetic training domain. This paper introduces the unsupervised domain adaptation for conversational speech enhancement (UDASE) task of the 7th CHiME challenge. This task aims to leverage real-world noisy speech recordings from the target domain for unsupervised domain adaptation of speech enhancement models. The target domain corresponds to the multi-speaker reverberant conversational speech recordings of the CHiME-5 dataset, for which the ground-truth clean speech reference is unavailable. Given a CHiME-5 recording, the task is to estimate the clean, potentially multi-speaker, reverberant speech, removing the additive background noise. We discuss the motivation for the CHiME-7 UDASE task and describe the data, the task, and the baseline system.
Published: 2023
Full Text: View/download PDF

7. Unsupervised speech enhancement with deep dynamical generative speech and noise models

Author: Lin, Xiaoyu, Leglaive, Simon, Girin, Laurent, and Alameda-Pineda, Xavier
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: This work builds on a previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both. This DDGM can be trained in three configurations: noise-agnostic, noise-dependent and noise adaptation after noise-dependent training. Experimental results show that the proposed method achieves competitive performance compared to state-of-the-art unsupervised speech enhancement methods, while the noise-dependent training configuration yields a much more time-efficient inference process.
Published: 2023

8. Motion-DVAE: Unsupervised learning for fast human motion denoising

Author: Fiche, Guénolé, Leglaive, Simon, Alameda-Pineda, Xavier, and Séguier, Renaud
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Pose and motion priors are crucial for recovering realistic and accurate human motion from noisy observations. Substantial progress has been made on pose and shape estimation from images, and recent works showed impressive results using priors to refine frame-wise predictions. However, many motion priors only model transitions between consecutive poses and are used in time-consuming optimization procedures, which is problematic for many applications requiring real-time motion capture. We introduce Motion-DVAE, a motion prior to capture the short-term dependencies of human motion. As part of the dynamical variational autoencoder (DVAE) model family, Motion-DVAE combines the generative capability of VAE models and the temporal modeling of recurrent architectures. Together with Motion-DVAE, we introduce an unsupervised learned denoising method unifying regression- and optimization-based approaches in a single framework for real-time 3D human pose estimation. Experiments show that the proposed approach reaches competitive performance with state-of-the-art methods while being much faster.
Published: 2023

9. A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Author: Sadok, Samir, Leglaive, Simon, Girin, Laurent, Alameda-Pineda, Xavier, and Séguier, Renaud
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture. Comment: 14 figures, https://samsad35.github.io/site-mdvae/
Published: 2023
Full Text: View/download PDF

10. A vector quantized masked autoencoder for audiovisual speech emotion recognition

Author: Sadok, Samir, Leglaive, Simon, and Séguier, Renaud
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by vector quantized variational autoencoders. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence, which are then used for an SER downstream task. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods. Extensive ablation experiments are also provided to assess the contribution of the different model components. Comment: 15 pages, 5 figures, https://samsad35.github.io/VQ-MAE-AudioVisual/
Published: 2023

11. A vector quantized masked autoencoder for speech emotion recognition

Author: Sadok, Samir, Leglaive, Simon, and Séguier, Renaud
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised learning has recently emerged as a promising solution to address this challenge. In this paper, we propose the vector quantized masked autoencoder for speech (VQ-MAE-S), a self-supervised model that is fine-tuned to recognize emotions from speech signals. The VQ-MAE-S model is based on a masked autoencoder (MAE) that operates in the discrete latent space of a vector-quantized variational autoencoder. Experimental results show that the proposed VQ-MAE-S model, pre-trained on the VoxCeleb2 dataset and fine-tuned on emotional speech data, outperforms an MAE working on the raw spectrogram representation and other state-of-the-art methods in SER. Comment: https://samsad35.github.io/VQ-MAE-Speech/
Published: 2023

12. Speech Modeling with a Hierarchical Transformer Dynamical VAE

Author: Lin, Xiaoyu, Bie, Xiaoyu, Leglaive, Simon, Girin, Laurent, and Alameda-Pineda, Xavier
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: The dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all the DVAEs of the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE), which is a DVAE with two levels of latent variable (sequence-wise and frame-wise) and in which the temporal dependencies are implemented with the Transformer architecture. We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure, revealing its high potential for downstream low-level speech processing tasks such as speech enhancement.
Published: 2023

13. Expectation-Maximization Based Defense Mechanism for Distributed Model Predictive Control

Author: Nogueira, Rafael Accácio, Bourdais, Romain, Leglaive, Simon, and Guéguen, Hervé
Subjects: Electrical Engineering and Systems Science - Systems and Control
Abstract: Controlling large-scale systems sometimes requires decentralized computation. Communication among agents is crucial to achieving consensus and optimal global behavior. These negotiation mechanisms are sensitive to attacks on those exchanges. This paper proposes an algorithm based on Expectation Maximization to mitigate the effects of attacks in a resource allocation based distributed model predictive control. The performance is assessed through an academic example of the temperature control of multiple rooms under input power constraints.
Published: 2022

14. Learning and controlling the source-filter representation of speech with a variational autoencoder

Author: Sadok, Samir, Leglaive, Simon, Girin, Laurent, Alameda-Pineda, Xavier, and Séguier, Renaud
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, we show that these subspaces are orthogonal, and based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$. Comment: 23 pages, 7 figures, companion website: https://samsad35.github.io/site-sfvae/
Published: 2022
Full Text: View/download PDF

15. HiT-DVAE: Human Motion Generation via Hierarchical Transformer Dynamical VAE

Author: Bie, Xiaoyu, Guo, Wen, Leglaive, Simon, Girin, Laurent, Moreno-Noguer, Francesc, and Alameda-Pineda, Xavier
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Studies on the automatic processing of 3D human pose data have flourished in the recent past. In this paper, we are interested in the generation of plausible and diverse future human poses following an observed 3D pose sequence. Current methods address this problem by injecting random variables from a single latent space into a deterministic motion prediction framework, which precludes the inherent multi-modality in human motion generation. In addition, to our knowledge, previous works rarely explore the use of attention to select which frames are to be used to inform the generation process. To overcome these limitations, we propose Hierarchical Transformer Dynamical Variational Autoencoder, HiT-DVAE, which implements auto-regressive generation with transformer-like attention mechanisms. HiT-DVAE simultaneously learns the evolution of data and latent space distribution with time correlated probabilistic dependencies, thus enabling the generative model to learn a more complex and time-varying latent space as well as diverse and realistic human motions. Furthermore, the auto-regressive generation brings more flexibility on observation and prediction, i.e., one can have any length of observation and predict arbitrarily large sequences of poses with a single pre-trained model. We evaluate the proposed method on HumanEva-I and Human3.6M with various evaluation methods, and outperform the state-of-the-art methods on most of the metrics.
Published: 2022

16. Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders

Author: Bie, Xiaoyu, Leglaive, Simon, Alameda-Pineda, Xavier, and Girin, Laurent
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the interest of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only requires clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning of speech signals and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
Published: 2021
Full Text: View/download PDF

17. A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling

Author: Bie, Xiaoyu, Girin, Laurent, Leglaive, Simon, Hueber, Thomas, and Alameda-Pineda, Xavier
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data, that not only model the latent space, but also model the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we present the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling. Comment: Accepted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2008.12595
Published: 2021

18. Dynamical Variational Autoencoders: A Comprehensive Review

Author: Girin, Laurent, Leglaive, Simon, Bie, Xiaoyu, Diard, Julien, Hueber, Thomas, and Alameda-Pineda, Xavier
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Variational autoencoders (VAEs) are powerful deep generative models widely used to represent high-dimensional complex data through a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, the input data vectors are processed independently. Recently, a series of papers have presented different extensions of the VAE to process sequential data, which model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks or state-space models. In this paper, we perform a literature review of these models. We introduce and discuss a general class of models, called dynamical variational autoencoders (DVAEs), which encompasses a large subset of these temporal VAE extensions. Then, we present in detail seven recently proposed DVAE models, with an aim to homogenize the notations and presentation lines, as well as to relate these models with existing classical temporal models. We have reimplemented those seven DVAE models and present the results of an experimental benchmark conducted on the speech analysis-resynthesis task (the PyTorch code is made publicly available). The paper concludes with a discussion on important issues concerning the DVAE class of models and future research guidelines.
Published: 2020
Full Text: View/download PDF

19. A Recurrent Variational Autoencoder for Speech Enhancement

Author: Leglaive, Simon, Alameda-Pineda, Xavier, Girin, Laurent, and Horaud, Radu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents a generative approach to speech enhancement based on a recurrent variational autoencoder (RVAE). The deep generative speech model is trained using clean speech signals only, and it is combined with a nonnegative matrix factorization noise model for speech enhancement. We propose a variational expectation-maximization algorithm where the encoder of the RVAE is fine-tuned at test time, to approximate the distribution of the latent variables given the noisy speech observations. Compared with previous approaches based on feed-forward fully-connected architectures, the proposed recurrent deep generative speech model induces a posterior temporal dynamic over the latent variables, which is shown to improve the speech enhancement results.
Published: 2019

20. Audio-visual Speech Enhancement Using Conditional Variational Auto-Encoders

Author: Sadeghi, Mostafa, Leglaive, Simon, Alameda-Pineda, Xavier, Girin, Laurent, and Horaud, Radu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Variational auto-encoders (VAEs) are deep generative latent variable models that can be used for learning the distribution of complex data. VAEs have been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. One advantage of this generative approach is that it does not require pairs of clean and noisy speech signals at training. In this paper, we propose audio-visual variants of VAEs for single-channel and speaker-independent speech enhancement. We develop a conditional VAE (CVAE) where the audio speech generative process is conditioned on visual information of the lip region. At test time, the audio-visual speech generative model is combined with a noise model based on nonnegative matrix factorization, and speech enhancement relies on a Monte Carlo expectation-maximization algorithm. Experiments are conducted with the recently published NTCD-TIMIT dataset as well as the GRID corpus. The results confirm that the proposed audio-visual CVAE effectively fuses audio and visual information, and it improves the speech enhancement performance compared with the audio-only VAE model, especially when the speech signal is highly corrupted by noise. We also show that the proposed unsupervised audio-visual speech enhancement approach outperforms a state-of-the-art supervised deep learning method. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2019
Full Text: View/download PDF

21. Audio-noise Power Spectral Density Estimation Using Long Short-term Memory

Author: Li, Xiaofei, Leglaive, Simon, Girin, Laurent, and Horaud, Radu
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a method using a long short-term memory (LSTM) network to estimate the noise power spectral density (PSD) of single-channel audio signals represented in the short time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained, which processes each frequency band individually by mapping the noisy STFT magnitude sequence to its corresponding noise PSD sequence. Unlike deep-learning-based speech enhancement methods that learn the full-band spectral structure of speech segments, the proposed method exploits the sub-band STFT magnitude evolution of noise with a long time dependency, in the spirit of the unsupervised noise estimators described in the literature. Speaker- and speech-independent experiments with different types of noise show that the proposed method outperforms the unsupervised estimators, and generalizes well to noise types that are not present in the training set. Comment: Submitted to IEEE Signal Processing Letters
Published: 2019
Full Text: View/download PDF

22. Speech enhancement with variational autoencoders and alpha-stable distributions

Author: Leglaive, Simon, Simsekli, Umut, Liutkus, Antoine, Girin, Laurent, and Horaud, Radu
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: This paper focuses on single-channel semi-supervised speech enhancement. We learn a speaker-independent deep generative speech model using the framework of variational autoencoders. The noise model remains unsupervised because we do not assume prior knowledge of the noisy recording environment. In this context, our contribution is to propose a noise model based on alpha-stable distributions, instead of the more conventional Gaussian non-negative matrix factorization approach found in previous studies. We develop a Monte Carlo expectation-maximization algorithm for estimating the model parameters at test time. Experimental results show the superiority of the proposed approach both in terms of perceptual quality and intelligibility of the enhanced speech signal., Comment: 5 pages, 3 figures, audio examples and code available online: https://team.inria.fr/perception/research/icassp2019-asvae/. arXiv admin note: text overlap with arXiv:1811.06713
Published: 2019
Full Text: View/download PDF

23. A variance modeling framework based on variational autoencoders for speech enhancement

Author: Leglaive, Simon, Girin, Laurent, and Horaud, Radu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: In this paper we address the problem of enhancing speech signals in noisy mixtures using a source separation approach. We explore the use of neural networks as an alternative to a popular speech variance model based on supervised non-negative matrix factorization (NMF). More precisely, we use a variational autoencoder as a speaker-independent supervised generative speech model, highlighting the conceptual similarities that this approach shares with its NMF-based counterpart. In order to be free of generalization issues regarding the noisy recording environments, we follow the approach of having a supervised model only for the target speech signal, the noise model being based on unsupervised NMF. We develop a Monte Carlo expectation-maximization algorithm for inferring the latent variables in the variational autoencoder and estimating the unsupervised model parameters. Experiments show that the proposed method outperforms a semi-supervised NMF baseline and a state-of-the-art fully supervised deep learning approach., Comment: 6 pages, 3 figures
Published: 2019
Full Text: View/download PDF

24. Semi-supervised multichannel speech enhancement with variational autoencoders and non-negative matrix factorization

Author: Leglaive, Simon, Girin, Laurent, and Horaud, Radu
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: In this paper we address speaker-independent multichannel speech enhancement in unknown noisy environments. Our work is based on a well-established multichannel local Gaussian modeling framework. We propose to use a neural network for modeling the speech spectro-temporal content. The parameters of this supervised model are learned using the framework of variational autoencoders. The noisy recording environment is assumed to be unknown, so the noise spectro-temporal modeling remains unsupervised and is based on non-negative matrix factorization (NMF). We develop a Monte Carlo expectation-maximization algorithm and we experimentally show that the proposed approach outperforms its NMF-based counterpart, where speech is modeled using supervised NMF., Comment: 5 pages, 2 figures, audio examples and code available online at https://team.inria.fr/perception/icassp-2019-mvae/
Published: 2018
Full Text: View/download PDF

25. Towards Improving Speech Emotion Recognition Using Synthetic Data Augmentation from Emotion Conversion

Author: Ibrahim, Karim M., primary, Perzo, Antony, additional, and Leglaive, Simon, additional
Published: 2024
Full Text: View/download PDF

26. SwimXYZ: A large-scale dataset of synthetic swimming motions and videos

Author: Fiche, Guénolé, primary, Sevestre, Vincent, additional, Gonzalez-Barral, Camila, additional, Leglaive, Simon, additional, and Séguier, Renaud, additional
Published: 2023
Full Text: View/download PDF

27. Motion-DVAE: Unsupervised learning for fast human motion denoising

Author: Fiche, Guénolé, primary, Leglaive, Simon, additional, Alameda-Pineda, Xavier, additional, and Séguier, Renaud, additional
Published: 2023
Full Text: View/download PDF

28. The CHiME-7 UDASE task: Unsupervised domain adaptation for conversational speech enhancement

Author: Leglaive, Simon, primary, Borne, Léonie, additional, Tzinis, Efthymios, additional, Sadeghi, Mostafa, additional, Fraticelli, Matthieu, additional, Wisdom, Scott, additional, Pariente, Manuel, additional, Pressnitzer, Daniel, additional, and Hershey, John, additional
Published: 2023
Full Text: View/download PDF

29. Unsupervised speech enhancement with deep dynamical generative speech and noise models

Author: Lin, Xiaoyu, primary, Leglaive, Simon, additional, Girin, Laurent, additional, and Alameda-Pineda, Xavier, additional
Published: 2023
Full Text: View/download PDF

30. Speech Modeling with a Hierarchical Transformer Dynamical VAE

Author: Lin, Xiaoyu, primary, Bie, Xiaoyu, additional, Leglaive, Simon, additional, Girin, Laurent, additional, and Alameda-Pineda, Xavier, additional
Published: 2023
Full Text: View/download PDF

31. A Vector Quantized Masked Autoencoder for Speech Emotion Recognition

Author: Sadok, Samir, primary, Leglaive, Simon, additional, and Séguier, Renaud, additional
Published: 2023
Full Text: View/download PDF

32. LatentForensics: Towards frugal deepfake detection in the StyleGAN latent space

Author: Delmas, Matthieu, Kacete, Amine, Paquelet, Stephane, Leglaive, Simon, and Seguier, Renaud
Abstract: The classification of forged videos has been a challenge for the past few years. Deepfake classifiers can now reliably predict whether or not video frames have been tampered with. However, their performance is tied to both the dataset used for training and the analyst's computational power. We propose a deepfake detection method that operates in the latent space of a state-of-the-art generative adversarial network (GAN) trained on high-quality face images. The proposed method leverages the structure of the latent space of StyleGAN to learn a lightweight binary classification model. Experimental results on standard datasets reveal that the proposed approach outperforms other state-of-the-art deepfake classification methods, especially in contexts where the data available to train the models is rare, such as when a new manipulation method is introduced. To the best of our knowledge, this is the first study showing the interest of the latent space of StyleGAN for deepfake classification. Combined with other recent studies on the interpretation and manipulation of this latent space, we believe that the proposed approach can further help in developing frugal deepfake classification methods based on interpretable high-level properties of face images., Comment: 7 pages, 3 figures, 5 tables
Published: 2023

33. Learning and controlling the source-filter representation of speech with a variational autoencoder

Author: Sadok, Samir, primary, Leglaive, Simon, additional, Girin, Laurent, additional, Alameda-Pineda, Xavier, additional, and Séguier, Renaud, additional
Published: 2023
Full Text: View/download PDF

34. LatentForensics: Towards lighter deepfake detection in the StyleGAN latent space

Author: Delmas, Matthieu, Kacete, Amine, Paquelet, Stephane, Leglaive, Simon, and Seguier, Renaud
Subjects: FOS: Computer and information sciences, I.2.10, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 68T45
Abstract: The classification of forged videos has been a challenge for the past few years. Deepfake classifiers can now reliably predict whether or not video frames have been tampered with. However, their performance is tied to both the dataset used for training and the analyst's computational power. We propose a deepfake classification method that operates in the latent space of a state-of-the-art generative adversarial network (GAN) trained on high-quality face images. The proposed method leverages the structure of the latent space of StyleGAN to learn a lightweight classification model. Experimental results on a standard dataset reveal that the proposed approach outperforms other state-of-the-art deepfake classification methods. To the best of our knowledge, this is the first study showing the interest of the latent space of StyleGAN for deepfake classification. Combined with other recent studies on the interpretation and manipulation of this latent space, we believe that the proposed approach can help in developing robust deepfake classification methods based on interpretable high-level properties of face images., Comment: 5 pages, 5 figures, 1 table, submitted to ICIP 2023
Published: 2023
Full Text: View/download PDF

35. A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

Author: Sadok, Samir, primary, Leglaive, Simon, additional, Girin, Laurent, additional, Alameda-Pineda, Xavier, additional, and Seguier, Renaud, additional
Published: 2023
Full Text: View/download PDF

36. A Multimodal Dynamical Variational Autoencoder for Audiovisual Speech Representation Learning

Author: Leglaive, Simon, primary
Published: 2022
Full Text: View/download PDF

37. Les auto-encodeurs variationnels dynamiques et leur application à la modélisation de spectrogrammes de parole

Author: Girin, Laurent, Bie, Xiaoyu, Leglaive, Simon, Hueber, Thomas, Alameda-Pineda, Xavier, GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA), Vers des robots à l’intelligence sociale au travers de l’apprentissage, de la perception et de la commande (ROBOTLEARN), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université Grenoble Alpes (UGA), Institut d'Électronique et des Technologies du numéRique (IETR), Université de Rennes (UR)-Institut National des Sciences Appliquées - Rennes (INSA Rennes), Institut National des Sciences Appliquées (INSA)-Institut National des Sciences Appliquées (INSA)-CentraleSupélec-Centre National de la Recherche Scientifique (CNRS)-Nantes Université - pôle Sciences et technologie, Nantes Université (Nantes Univ)-Nantes Université (Nantes Univ), Université de Nantes, and ANR-19-P3IA-0003,MIAI,MIAI @ Grenoble Alpes(2019)
Subjects: speech signals modeling, [SPI]Engineering Sciences [physics], speech spectrograms, dynamical variational autoencoders, speech analysis-resynthesis
Abstract: The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE for sequential data that model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In the present paper, we present this class of models and illustrate their high potential for modeling (spectrograms of) speech signals with speech analysis-resynthesis experiments.
Published: 2022
Full Text: View/download PDF

38. Les auto-encodeurs variationnels dynamiques et leur application à la modélisation de spectrogrammes de parole

Author: Girin, Laurent, primary, Bie, Xiaoyu, additional, Leglaive, Simon, additional, Hueber, Thomas, additional, and Alameda-Pineda, Xavier, additional
Published: 2022
Full Text: View/download PDF

39. Expectation-Maximization Based Defense Mechanism for Distributed Model Predictive Control

Author: Nogueira, Rafael Accácio, primary, Bourdais, Romain, additional, Leglaive, Simon, additional, and Guéguen, Hervé, additional
Published: 2022
Full Text: View/download PDF

40. Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Author: Bie, Xiaoyu, primary, Leglaive, Simon, additional, Alameda-Pineda, Xavier, additional, and Girin, Laurent, additional
Published: 2022
Full Text: View/download PDF

41. A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling

Author: Bie, Xiaoyu, primary, Girin, Laurent, additional, Leglaive, Simon, additional, Hueber, Thomas, additional, and Alameda-Pineda, Xavier, additional
Published: 2021
Full Text: View/download PDF

42. Dynamical Variational Autoencoders: A Comprehensive Review

Author: Girin, Laurent, primary, Leglaive, Simon, primary, Bie, Xiaoyu, primary, Diard, Julien, primary, Hueber, Thomas, primary, and Alameda-Pineda, Xavier, primary
Published: 2021
Full Text: View/download PDF

43. A Recurrent Variational Autoencoder for Speech Enhancement

Author: Leglaive, Simon, primary, Alameda-Pineda, Xavier, additional, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2020
Full Text: View/download PDF

44. Modeling reverberant mixtures for multichannel audio source separation

Author: Leglaive, Simon, Laboratoire Traitement et Communication de l'Information (LTCI), Télécom ParisTech-Institut Mines-Télécom [Paris] (IMT)-Centre National de la Recherche Scientifique (CNRS), Télécom ParisTech, Roland Badeau, Gaël Richard, STAR, ABES, Institut Mines-Télécom [Paris] (IMT)-Télécom Paris, and Leglaive, Simon
Subjects: Multichannel reverberant mixtures, Under-determined audio source separation, Probabilistic models, Variational inference, Non-negative matrix factorization, Statistical inference, Statistical room acoustics, [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: In this thesis we address the problem of audio source separation for multichannel mixtures recorded in a reverberant environment. Our work focuses on the under-determined case, that is, when the number of sources to be separated is greater than the number of channels in the mixture. In order to tackle such a problem, it is often useful to develop a parametric model that explains the observed data. In this thesis we adopt a probabilistic and hierarchical approach in which the modeling of the monophonic source signals is distinguished from that of the mixing process. The sources are characterized in a time-frequency domain in order to obtain a sparse representation, which is well suited to modeling because it highlights the specific structure of audio signals, particularly musical ones. We rely on a probabilistic modeling of the sources where their time-frequency coefficients are represented as latent random variables. Defining the source model then amounts to defining the prior joint distribution of these coefficients. The source models in this thesis are mainly based on the Gaussian and the Student's t distributions. We will also use non-negative matrix factorization approaches. One advantage of this rank reduction technique, which matters given the under-determined nature of the problem, is that the number of parameters to be estimated is reduced. The main contributions of this thesis concern the modeling of the mixture in the presence of reverberation. Such a mixture is naturally represented in the time domain by the convolution of the source signals with the room impulse responses which characterize the acoustic path between each source and each microphone. These responses are called mixing filters in the context of source separation. The latter are generally treated in the literature as deterministic parameters that are estimated only from the observed data. It is known, however, that they correspond to room responses, so they have a very specific structure that could be used to guide their estimation. In a first part we consider a common approximation in the literature, which consists in approximating the temporal convolution by a simple multiplication in the short-time Fourier transform domain, under the hypothesis that the impulse response of the mixing filters is short. The mixture is then characterized by the frequency response of the filters. Based on geometrical room acoustics concepts, we model the direct path and the first echoes of the room response by an autoregressive process in the frequency domain. According to statistical room acoustics results, late reverberation is modeled as a Gaussian random process also in the frequency domain. We exploit the exponential temporal decay of late reverberation to obtain theoretical expressions of the autocovariance function and power spectral density of this process. We also propose an autoregressive moving average parametrization of these two quantities. Finally, we develop a source separation method based on an expectation-maximization algorithm which exploits priors on the mixing filters in order to perform maximum a posteriori estimation. In a second part, we wish to relax the short mixing filters assumption because it fundamentally limits the separation performance for highly reverberant mixtures. We propose to infer the time-frequency source coefficients from the time-domain mixture observations, using a variational method. This approach makes it possible to exactly represent the convolutive mixing process, in the time domain. Preliminary results obtained by assuming that the mixing filters are known show the robustness of this approach in the presence of high reverberation. We then develop a room impulse response model based on the Student's t distribution. This distribution allows us to take into account the direct path and the first echoes which, from a statistical point of view, correspond to outliers with respect to the Gaussian reverberation model with exponentially decaying amplitude. Finally, we develop a source separation method based on a variational inference technique where the mixing filters are considered as latent random variables in the time domain. We also show that this approach allows us to adapt the time-frequency representation to each individual source in the mixture, especially in terms of resolution.
Published: 2017

45. Audio-Visual Speech Enhancement Using Conditional Variational Auto-Encoders

Author: Sadeghi, Mostafa, primary, Leglaive, Simon, additional, Alameda-Pineda, Xavier, additional, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2020
Full Text: View/download PDF

46. Audio-Noise Power Spectral Density Estimation Using Long Short-Term Memory

Author: Li, Xiaofei, primary, Leglaive, Simon, additional, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2019
Full Text: View/download PDF

47. Speech Enhancement with Variational Autoencoders and Alpha-stable Distributions

Author: Leglaive, Simon, primary, Simsekli, Umut, additional, Liutkus, Antoine, additional, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2019
Full Text: View/download PDF

48. Semi-supervised Multichannel Speech Enhancement with Variational Autoencoders and Non-negative Matrix Factorization

Author: Leglaive, Simon, primary, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2019
Full Text: View/download PDF

49. Séparation de sources audio en milieu réverbérant : Factorisation en matrices non-négatives et représentation temporelle du mélange convolutif

Author: Leglaive, Simon, Badeau, Roland, Richard, Gael, Télécom ParisTech, Projet ANR EDISON 3D, ANR-13-CORD-0008,EDISON 3D,Edition et Diffusion Sonore spatialisée en 3 dimensions(2013), Badeau, Roland, and CONTENUS NUMERIQUES ET INTERACTIONS - Edition et Diffusion Sonore spatialisée en 3 dimensions - - EDISON 3D2013 - ANR-13-CORD-0008 - CONTINT - VALID
Subjects: [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing
Abstract: This paper addresses the problem of multichannel audio source separation in under-determined reverberant mixtures. We target a semi-blind scenario assuming that the mixing filters are known. The proposed method consists in working directly with the time-domain mixture signals. This approach makes it possible to represent the convolutive mixing process exactly, and it is therefore suitable for the separation of highly reverberant mixtures. The source signals are represented in the modified discrete cosine transform domain with a Gaussian model based on non-negative matrix factorization (NMF). Source inference is based on a variational expectation-maximization algorithm. We experimentally show the advantage of jointly using a time-domain representation of the convolutive mixture and a source model based on NMF.
Published: 2017

50. A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement

Author: Leglaive, Simon, primary, Girin, Laurent, additional, and Horaud, Radu, additional
Published: 2018
Full Text: View/download PDF
