
Audio-visual graphical models for speech processing

Authors :
John R. Hershey
Hagai Attias
Nebojsa Jojic
Trausti Kristjansson
Source :
ICASSP (5)
Publication Year :
2004
Publisher :
IEEE, 2004.

Abstract

Perceiving sounds in a noisy environment is a challenging problem. Visual lip-reading can provide relevant information but is also challenging because lips are moving and a tracker must deal with a variety of conditions. Typically, audio-visual systems have been assembled from individually engineered modules. We propose to fuse audio and video in a probabilistic generative model that implements cross-modal, self-supervised learning, enabling adaptation to audio-visual data. The video model features a Gaussian mixture model embedded in a linear subspace of a sprite that translates within the video frame. The system can learn to detect and enhance speech in noise given only a short (30 second) sequence of audio-visual data. We show some results for speech detection and enhancement, and discuss extensions to the model that are under investigation.
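
As an illustration of the video model described in the abstract, the following is a minimal NumPy sketch of sampling a frame from a generative model of this kind: a Gaussian mixture over coefficients in a linear subspace generates a sprite appearance, which is then translated within the frame and observed with pixel noise. All dimensions, parameter values, and names here are illustrative assumptions, not the authors' implementation; in the paper the parameters would be learned (e.g., by EM) from the short audio-visual sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed, not from the paper).
H, W = 32, 32                # sprite size
D = H * W                    # pixels per sprite
K = 5                        # number of mixture components
L = 10                       # subspace dimensionality
FRAME_H, FRAME_W = 64, 64    # video frame size

# Model parameters (random here; learned from audio-visual data in the paper).
pi = np.full(K, 1.0 / K)               # mixture weights
mu = rng.normal(size=(K, L))           # component means in the subspace
sigma = np.full((K, L), 0.5)           # diagonal std devs in the subspace
A = rng.normal(scale=0.1, size=(D, L)) # linear subspace basis
b = np.zeros(D)                        # mean sprite appearance
noise_std = 0.05                       # pixel observation noise

def sample_frame():
    """Sample one video frame: mixture component -> subspace coefficients
    -> sprite appearance -> translated placement in the frame."""
    k = rng.choice(K, p=pi)                     # pick a mixture component
    z = mu[k] + sigma[k] * rng.normal(size=L)   # subspace coefficients
    sprite = (A @ z + b).reshape(H, W)          # sprite appearance
    ty = rng.integers(0, FRAME_H - H + 1)       # random vertical shift
    tx = rng.integers(0, FRAME_W - W + 1)       # random horizontal shift
    frame = np.zeros((FRAME_H, FRAME_W))
    frame[ty:ty + H, tx:tx + W] = sprite
    frame += noise_std * rng.normal(size=frame.shape)  # observation noise
    return frame, k, (ty, tx)

frame, component, shift = sample_frame()
print(frame.shape, component, shift)
```

Inference in such a model would invert this process, estimating the sprite translation, subspace coefficients, and mixture component from observed frames; the abstract's enhancement results rely on coupling this video model with an audio model, which is not sketched here.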

Details

Database :
OpenAIRE
Journal :
2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
Accession number :
edsair.doi...........929ed6515aaa2646e6625807b2a2589c