Descriptor: "audiovisual fusion" / Database: OpenAIRE - Searchworks@Jio Institute Digital Library Search Results

Your search for "audiovisual fusion" returned 15 results

1. Rethinking the Mechanisms Underlying the McGurk Illusion

Author: Brenna Mandujano, Kristina C. Backer, Antoine J. Shahin, and Mariel G Gonzales
Subjects: Consonant, medicine.medical_specialty, media_common.quotation_subject, Place of articulation, Illusion, phonemic representations, Audiology, 050105 experimental psychology, lcsh:RC321-571, 03 medical and health sciences, Behavioral Neuroscience, 0302 clinical medicine, medicine, 0501 psychology and cognitive sciences, Visual dominance, lcsh:Neurosciences. Biological psychiatry. Neuropsychiatry, Biological Psychiatry, media_common, Original Research, multisensory integration, 05 social sciences, Multisensory integration, audiovisual fusion, Psychiatry and Mental health, Neuropsychology and Physiological Psychology, Neurology, cross-modal phonetic encoding, McGurk illusion, Percept, Psychology, 030217 neurology & neurosurgery, Neuroscience
Abstract: The McGurk illusion occurs when listeners hear an illusory percept (e.g., “da”) resulting from mismatched pairings of audiovisual (AV) speech stimuli (e.g., auditory /ba/ paired with visual /ga/). Hearing a third percept, distinct from both the auditory and visual input, has been used as evidence of AV fusion. We examined whether the McGurk illusion is instead driven by visual dominance, whereby the third percept, e.g., “da,” represents a default percept for visemes with an ambiguous place of articulation (POA), like /ga/. Participants watched videos of a talker uttering various consonant vowels (CVs) with (AV) and without (V-only) audio of /ba/. Individuals transcribed the CV they saw (V-only) or heard (AV). In the V-only condition, individuals predominantly saw “da”/“ta” when viewing CVs with indiscernible POAs. Likewise, in the AV condition, upon perceiving an illusion, they predominantly heard “da”/“ta” for CVs with indiscernible POAs. The illusion was stronger in individuals who exhibited weak /ba/ auditory encoding (examined using a control auditory-only task). In Experiment 2, we attempted to replicate these findings using stimuli recorded from a different talker. The V-only results were not replicated, but again individuals predominantly heard “da”/“ta”/“tha” as an illusory percept for various AV combinations, and the illusion was stronger in individuals who exhibited weak /ba/ auditory encoding. These results demonstrate that when visual CVs with indiscernible POAs are paired with a weakly encoded auditory /ba/, listeners default to hearing “da”/“ta”/“tha”, thus tempering the AV fusion account and favoring a default mechanism triggered when both AV stimuli are ambiguous.
Published: 2020

2. A Laboratory Study of the McGurk Effect in 324 Monozygotic and Dizygotic Twins

Author: Guo Feng, Bin Zhou, Wen Zhou, Michael S. Beauchamp, and John F. Magnotti
Subjects: medicine.medical_specialty, Speech perception, media_common.quotation_subject, Illusion, Audiology, 050105 experimental psychology, lcsh:RC321-571, 03 medical and health sciences, 0302 clinical medicine, Perception, medicine, 0501 psychology and cognitive sciences, behavioral genetics, lcsh:Neurosciences. Biological psychiatry. Neuropsychiatry, Behavioural genetics, media_common, Original Research, General Neuroscience, multisensory integration, 05 social sciences, Multisensory integration, audiovisual fusion, Twin study, twin studies, McGurk effect, Percept, Psychology, 030217 neurology & neurosurgery, Neuroscience
Abstract: Multisensory integration of information from the talker's voice and the talker's mouth facilitates human speech perception. A popular assay of audiovisual integration is the McGurk effect, an illusion in which incongruent visual speech information categorically changes the percept of auditory speech. There is substantial interindividual variability in susceptibility to the McGurk effect. To better understand possible sources of this variability, we examined the McGurk effect in 324 native Mandarin speakers, consisting of 73 monozygotic (MZ) and 89 dizygotic (DZ) twin pairs. When tested with 9 different McGurk stimuli, some participants never perceived the illusion and others always perceived it. Within participants, perception was similar across time (r = 0.55 at a 2-year retest in 150 participants) suggesting that McGurk susceptibility reflects a stable trait rather than short-term perceptual fluctuations. To examine the effects of shared genetics and prenatal environment, we compared McGurk susceptibility between MZ and DZ twins. Both twin types had significantly greater correlation than unrelated pairs (r = 0.28 for MZ twins and r = 0.21 for DZ twins) suggesting that the genes and environmental factors shared by twins contribute to individual differences in multisensory speech perception. Conversely, the existence of substantial differences within twin pairs (even MZ co-twins) and the overall low percentage of explained variance (5.5%) argues against a deterministic view of individual differences in multisensory integration.
Published: 2019

3. Prediction-Based Audiovisual Fusion for Classification of Non-Linguistic Vocalisations

Author: Stavros Petridis, Maja Pantic, and Commission of the European Communities
Subjects: METIS-315566, Technology, HMI-HF: Human Factors, DATABASE, HEAR, EWI-26753, Computer science, Speech recognition, 02 engineering and technology, Machine learning, computer.software_genre, Computer Science, Artificial Intelligence, FACIAL ANIMATION, 03 medical and health sciences, EXPRESSIONS, 0302 clinical medicine, 0202 electrical engineering, electronic engineering, information engineering, Feature (machine learning), Computer Science, Cybernetics, HMM, SPEECH RECOGNITION, LAUGHTER, Set (psychology), Hidden Markov model, Computer facial animation, Computational model, Science & Technology, MODULAR NEURAL-NETWORKS, IDENTIFICATION, business.industry, Frame (networking), audiovisual fusion, DRIVEN, Class (biology), n/a OA procedure, Nonlinguistic Vocalisation Classification, Visualization, Human-Computer Interaction, Computer Science, Prediction-based Fusion, IR-99374, 020201 artificial intelligence & image processing, Artificial intelligence, business, computer, EC Grant Agreement nr.: FP7/611153, Audio-visual Fusion, 030217 neurology & neurosurgery, Software
Abstract: Prediction plays a key role in recent computational models of the brain, and it has been suggested that the brain constantly makes multisensory spatiotemporal predictions. Inspired by these findings, we tackle the problem of audiovisual fusion from a new perspective based on prediction. We train predictive models which model the spatiotemporal relationship between audio and visual features by learning the audio-to-visual and visual-to-audio feature mapping for each class. Similarly, we train predictive models which model the time evolution of audio and visual features by learning the past-to-future feature mapping for each class. In classification, all the class-specific regression models produce a prediction of the expected audio/visual features, and their prediction errors are combined for each class. The set of class-specific regressors which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, is chosen to label the input frame. We perform cross-database experiments, using the AMI, SAL and MAHNOB databases, in order to classify laughter and speech, and subject-independent experiments on the AVIC database in order to classify laughter, hesitation and consent. In virtually all cases, prediction-based audiovisual fusion outperforms the two most commonly used fusion approaches, decision-level and feature-level fusion.
Published: 2016
Full Text: View/download PDF
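
The classification rule described in the abstract of result 3 (choose the class whose regressors give the lowest prediction error) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class name `PredictionFusionClassifier`, the helper `make_toy_data`, the synthetic features, and the use of plain linear least-squares regressors in place of whatever predictive models the paper actually trains. Only the audio-to-visual and visual-to-audio mappings are covered; the past-to-future models mentioned in the abstract are left out.

```python
import numpy as np


class PredictionFusionClassifier:
    """Toy prediction-based fusion using linear regressors (hypothetical)."""

    def fit(self, audio, visual, labels):
        self.classes_ = np.unique(labels)
        self.a2v_, self.v2a_ = {}, {}
        for c in self.classes_:
            a, v = audio[labels == c], visual[labels == c]
            # One least-squares map per class and per direction.
            self.a2v_[c] = np.linalg.lstsq(a, v, rcond=None)[0]
            self.v2a_[c] = np.linalg.lstsq(v, a, rcond=None)[0]
        return self

    def predict(self, audio, visual):
        errors = []
        for c in self.classes_:
            # Squared prediction error of each cross-modal regressor.
            err_v = np.sum((audio @ self.a2v_[c] - visual) ** 2, axis=1)
            err_a = np.sum((visual @ self.v2a_[c] - audio) ** 2, axis=1)
            errors.append(err_v + err_a)  # combine the errors per class
        # Label each frame with the class of lowest combined error.
        best = np.argmin(np.stack(errors, axis=1), axis=1)
        return self.classes_[best]


def make_toy_data(n, seed):
    """Synthetic frames: the audio-to-visual relation depends on the class."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    audio = rng.normal(size=(n, 3))
    sign = np.where(labels == 0, 1.0, -1.0)[:, None]  # class 0: v = a, class 1: v = -a
    visual = sign * audio + 0.01 * rng.normal(size=(n, 3))
    return audio, visual, labels


audio, visual, labels = make_toy_data(400, seed=0)
clf = PredictionFusionClassifier().fit(audio[:300], visual[:300], labels[:300])
predicted = clf.predict(audio[300:], visual[300:])
accuracy = float(np.mean(predicted == labels[300:]))
```

In this toy setup the two classes differ only in how the visual features relate to the audio features, so neither modality alone separates them; the cross-modal prediction error is what carries the class information.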

4. Audio-Visual Speech Recognition With A Hybrid CTC/Attention Architecture

Author: Stavros Petridis, Pingchuan Ma, Themos Stafylakis, Maja Pantic, and Georgios Tzimiropoulos
Subjects: FOS: Computer and information sciences, Technology, Computer science, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, Word error rate, 02 engineering and technology, Computer Science, Artificial Intelligence, Attention Architectures, 030507 speech-language pathology & audiology, 03 medical and health sciences, Engineering, Connectionism, 0202 electrical engineering, electronic engineering, information engineering, Hidden Markov model, Audiovisual Speech Recognition, Science & Technology, 020206 networking & telecommunications, Audio-visual speech recognition, Engineering, Electrical & Electronic, CTC, Visualization, Conditional independence, Computer Science, Noise (video), 0305 other medical science, Audiovisual Fusion
Abstract: Recent works in speech recognition rely either on connectionist temporal classification (CTC) or sequence-to-sequence models for character-level recognition. CTC assumes conditional independence of individual characters, whereas attention-based models can provide nonsequential alignments. Therefore, we could use a CTC loss in combination with an attention-based model in order to force monotonic alignments and at the same time get rid of the conditional independence assumption. In this paper, we use the recently proposed hybrid CTC/attention architecture for audio-visual recognition of speech in-the-wild. To the best of our knowledge, this is the first time that such a hybrid architecture is used for audio-visual recognition of speech. We use the LRS2 database and show that the proposed audio-visual model leads to a 1.3% absolute decrease in word error rate over the audio-only model and achieves the new state-of-the-art performance on the LRS2 database (7% word error rate). We also observe that the audio-visual model significantly outperforms the audio-based model (up to 32.9% absolute improvement in word error rate) for several different types of noise as the signal-to-noise ratio decreases. Comment: Accepted to IEEE SLT 2018
Published: 2018
Full Text: View/download PDF

5. End-to-end Audiovisual Speech Recognition

Author: Stavros Petridis, Pingchuan Ma, Georgios Tzimiropoulos, Themos Stafylakis, Maja Pantic, Feipeng Cai, and Commission of the European Communities
Subjects: FOS: Computer and information sciences, Technology, Computer science, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Engineering, ComputerApplications_MISCELLANEOUS, 0202 electrical engineering, electronic engineering, information engineering, Waveform, Hidden Markov model, Audiovisual Speech Recognition, Science & Technology, Modality (human–computer interaction), Audio signal, business.industry, Deep learning, Engineering, Electrical & Electronic, Acoustics, End-to-End Training, Noise, BGRUs, Residual Networks, Word recognition, 020201 artificial intelligence & image processing, Artificial intelligence, Mel-frequency cepstrum, 0305 other medical science, business, Audiovisual Fusion
Abstract: Several end-to-end deep learning approaches have been recently presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. In this work, we present an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of our knowledge, this is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one for each modality, which extract features directly from mouth regions and raw waveforms. The temporal dynamics in each stream/modality are modeled by a 2-layer BGRU, and the fusion of multiple streams/modalities takes place via another 2-layer BGRU. A slight improvement in the classification rate over an end-to-end audio-only model and an MFCC-based model is reported in clean audio conditions and low levels of noise. In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models. Comment: Accepted to ICASSP 2018
Published: 2018
Full Text: View/download PDF

6. Online learning of audiovisual signatures for recognition and tracking of people within an ambient sensor network

Author: François-Xavier Decroix, Frédéric Lerasle, Julien Pinquier, Équipe Robotique, Action et Perception (LAAS-RAP), Laboratoire d'analyse et d'architecture des systèmes (LAAS), Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse III - Paul Sabatier (UT3), Université Toulouse 1 Capitole (UT1), Université Toulouse - Jean Jaurès (UT2J), Institut National des Sciences Appliquées - Toulouse (INSA Toulouse), Institut National Polytechnique (Toulouse) (Toulouse INP), Centre National de la Recherche Scientifique (CNRS), Université Fédérale Toulouse Midi-Pyrénées, and Université de Toulouse (UT)
Subjects: Audiovisual fusion, Multi-target tracking, Computer vision, Automatic speech processing, [PHYS.MECA.ACOU] Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph], [SPI.AUTO] Engineering Sciences [physics]/Automatic
Abstract: The neOCampus operation, started in 2013 by Paul Sabatier University in Toulouse, aims to create a connected, innovative, intelligent and sustainable campus by drawing on the skills of 11 laboratories and several industrial partners. These multidisciplinary skills are combined to improve the daily comfort of campus users (students, teachers, administrative staff) and to reduce the ecological footprint of the campus. The intelligence we want to bring to the campus of the future requires giving its buildings a perception of their internal activity. Indeed, optimizing energy resources requires characterizing users' activities so that the building can adapt to them automatically. Since human activity is open to multiple levels of interpretation, our work focuses on extracting people's trajectories, its most elementary component. Characterizing users' activities in terms of movement relies on data extracted from cameras and microphones distributed in a room, forming a sparse network of heterogeneous sensors. From these data, we seek to extract audiovisual signatures and rough localizations of the people transiting through this sensor network. While protecting personal privacy, the signatures must be discriminative, to distinguish one person from another, and compact, to limit computational cost and enable the building to adapt itself. Given these constraints, the characteristics we model are the speaker's timbre and the person's clothing appearance in terms of colorimetric distribution. The scientific contributions of this thesis thus lie at the intersection of speech processing and computer vision, introducing new methods for fusing the audio and visual signatures of individuals. To achieve this fusion, new sound source localization cues and an audiovisual adaptation of a multi-target tracking method were introduced; these are the main contributions of this work. The thesis is structured in 4 chapters. The first presents the state of the art in visual person re-identification and speaker recognition. Because the acoustic and visual modalities are not correlated, two signatures are computed separately, one for video and one for audio, using existing methods from the literature; the details of their computation are the subject of chapter 2. The fusion of the signatures is then treated as a problem of matching audio and video observations whose corresponding detections are spatially coherent and compatible, for which two novel association strategies are introduced in chapter 3. The spatio-temporal coherence of the bimodal observations is then addressed in chapter 4, in the context of multi-target tracking.
Published: 2017

7. Apprentissage en ligne de signatures audiovisuelles pour la reconnaissance et le suivi de personnes au sein d'un réseau de capteurs ambiants

Author: François-Xavier Decroix, Frédéric Lerasle, Julien Pinquier, Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse III - Paul Sabatier (UT3), Université Toulouse 1 Capitole (UT1), Université Toulouse - Jean Jaurès (UT2J), Institut National Polytechnique (Toulouse) (Toulouse INP), Centre National de la Recherche Scientifique (CNRS), and Université Fédérale Toulouse Midi-Pyrénées
Subjects: Audiovisual fusion, Multi-target tracking, Computer vision, Automatic speech processing, [PHYS.MECA.ACOU] Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph]
Abstract: The neOCampus operation, started in 2013 by Paul Sabatier University in Toulouse, aims to create a connected, innovative, intelligent and sustainable campus by drawing on the skills of 11 laboratories and several industrial partners. These multidisciplinary skills are combined to improve the daily comfort of campus users (students, teachers, administrative staff) and to reduce the ecological footprint of the campus. The intelligence we want to bring to the campus of the future requires giving its buildings a perception of their internal activity. Indeed, optimizing energy resources requires characterizing users' activities so that the building can adapt to them automatically. Since human activity is open to multiple levels of interpretation, our work focuses on extracting people's trajectories, its most elementary component. Characterizing users' activities in terms of movement relies on data extracted from cameras and microphones distributed in a room, forming a sparse network of heterogeneous sensors. From these data, we seek to extract audiovisual signatures and rough localizations of the people transiting through this sensor network. While protecting personal privacy, the signatures must be discriminative, to distinguish one person from another, and compact, to limit computational cost and enable the building to adapt itself. Given these constraints, the characteristics we model are the speaker's timbre and the person's clothing appearance in terms of colorimetric distribution. The scientific contributions of this thesis thus lie at the intersection of speech processing and computer vision, introducing new methods for fusing the audio and visual signatures of individuals. To achieve this fusion, new sound source localization cues and an audiovisual adaptation of a multi-target tracking method were introduced; these are the main contributions of this work. The thesis is structured in 4 chapters. The first presents the state of the art in visual person re-identification and speaker recognition. Because the acoustic and visual modalities are not correlated, two signatures are computed separately, one for video and one for audio, using existing methods from the literature; the details of their computation are the subject of chapter 2. The fusion of the signatures is then treated as a problem of matching audio and video observations whose corresponding detections are spatially coherent and compatible, for which two novel association strategies are introduced in chapter 3. The spatio-temporal coherence of the bimodal observations is then addressed in chapter 4, in the context of multi-target tracking.
Published: 2017

8. Online Audiovisual Signature Training for Person Re-identification

Author: François-Xavier Decroix, Isabelle Ferrané, Frédéric Lerasle, Julien Pinquier, Équipe Robotique, Action et Perception (LAAS-RAP), Laboratoire d'analyse et d'architecture des systèmes (LAAS), Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Institut National Polytechnique de Toulouse (Toulouse INP), Institut National des Sciences Appliquées - Toulouse (INSA Toulouse), Université Toulouse III - Paul Sabatier (UT3), Université Toulouse 1 Capitole (UT1), Université Toulouse - Jean Jaurès (UT2J), Centre National de la Recherche Scientifique (CNRS), Université Fédérale Toulouse Midi-Pyrénées, Université de Toulouse (UT), and Toulouse Mind & Brain Institut (TMBI)
Subjects: Computer science, Energy management, Speech recognition, Multimodal signature, Vision par ordinateur et reconnaissance de formes, 02 engineering and technology, Intelligence artificielle, Re identification, Traitement des images, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audiovisual fusion, Salience (neuroscience), Human–computer interaction, Activity detection, [INFO.INFO-TI]Computer Science [cs]/Image Processing [eess.IV], [ INFO.INFO-TI ] Computer Science [cs]/Image Processing, 0202 electrical engineering, electronic engineering, information engineering, Traitement du signal et de l'image, 020201 artificial intelligence & image processing, Consumer confidence index, 0305 other medical science, Synthèse d'image et réalité virtuelle
Abstract: International audience; In intelligent environments, activity detection is a necessary pre-processing step for adaptive energy management and interaction with humans. To characterize the interactions between individuals or between an individual and the infrastructure of a building, a re-identification process is required and using multimodal models improves its robustness. In this paper, we propose a method for audiovisual fusion, which introduces a novel confidence index of audio-video salience zones, for training an audiovisual signature of a person within a sparse network of cameras and microphones.
Published: 2016

9. Audio-visual speech scene analysis: Characterization of the dynamics of unbinding and rebinding the McGurk effect

Author: Jean-Luc Schwartz, Frédéric Berthommier, Olha Nahorna, GIPSA - Perception, Contrôle, Multimodalité et Dynamiques de la parole (GIPSA-PCMD), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3, Université Pierre Mendès France - Grenoble 2 (UPMF), Université Joseph Fourier - Grenoble 1 (UJF), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Centre National de la Recherche Scientifique (CNRS), and European Project: 339152,EC:FP7:ERC,ERC-2013-ADG,SPEECH UNIT(E)S(2014)
Subjects: Speech perception, Visual perception, Acoustics and Ultrasonics, Computer science, attentional mechanisms, audiovisual speech perception, Speech recognition, Acoustics, media_common.quotation_subject, multisensory coherence, audiovisual fusion, Context (language use), conditional binding, Speech processing, [SHS]Humanities and Social Sciences, Silence, Arts and Humanities (miscellaneous), Perception, [SCCO.PSYC]Cognitive science/Psychology, McGurk effect, [INFO]Computer Science [cs], Syllable, media_common
Abstract: International audience; While audiovisual interactions in speech perception have long been considered as automatic, recent data suggest that this is not the case. In a previous study, Nahorna et al. [(2012). J. Acoust. Soc. Am. 132, 1061–1077] showed that the McGurk effect is reduced by a previous incoherent audiovisual context. This was interpreted as showing the existence of an audiovisual binding stage controlling the fusion process. Incoherence would produce unbinding and decrease the weight of the visual input in fusion. The present paper explores the audiovisual binding system to characterize its dynamics. A first experiment assesses the dynamics of unbinding, and shows that it is rapid: An incoherent context less than 0.5 s long (typically one syllable) suffices to produce a maximal reduction in the McGurk effect. A second experiment tests the rebinding process, by presenting a short period of either coherent material or silence after the incoherent unbinding context. Coherence provides rebinding, with a recovery of the McGurk effect, while silence provides no rebinding and hence freezes the unbinding process. These experiments are interpreted in the framework of an audiovisual speech scene analysis process assessing the perceptual organization of an audiovisual speech input before decision takes place at a higher processing stage.
Published: 2015

10. Analysis of multisensory speech scenes : behavioral demonstration and characterization of the audiovisual binding system

Author: Nahorna, Olha, Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Pierre Mendès France - Grenoble 2 (UPMF), Université Stendhal - Grenoble 3, Université Joseph Fourier - Grenoble 1 (UJF), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Centre National de la Recherche Scientifique (CNRS), Université de Grenoble, Jean-Luc Schwartz, Frédéric Berthommier, GIPSA - Perception, Contrôle, Multimodalité et Dynamiques de la parole (GIPSA-PCMD), Département Parole et Cognition (GIPSA-DPC), This work was supported by the French National Research Agency (ANR) through funding of the MULTISTAP project (MULTISTability and binding in Audition and sPeech: ANR-08-BLAN-0167 MULTISTAP), and STAR, ABES
Subjects: fusion audiovisuelle, binding, perception de la parole, Multimodal speech, [SDV.NEU.PC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, [SCCO.NEUR]Cognitive science/Neuroscience, [SCCO.NEUR] Cognitive science/Neuroscience, effet McGurk, [SDV.NEU.PC] Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Psychology and behavior, scene analysis, audiovisual fusion, liage, [SDV.NEU.SC]Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Cognitive Sciences, speech perception, Parole multimodale, L'effet McGurk, McGurk effect, analyse des scenès, Liage et fusion audiovisuelle, [SDV.NEU.SC] Life Sciences [q-bio]/Neurons and Cognition [q-bio.NC]/Cognitive Sciences, Audiovisual fusion and binding
Abstract: In audiovisual speech the coherent auditory and visual streams are generally fused into a single percept. This results in enhanced intelligibility in noise, or in visual modification of the auditory percept in the famous “McGurk effect” (the dubbing of the sound “ba” on the image of the speaker uttering “ga” is often perceived as “da”). It is classically considered that processing is done independently in the auditory and visual systems before interaction occurs at a certain representational stage, resulting in an integrated percept. However, some behavioral and neurophysiological data suggest the existence of a two-stage process. A first stage would involve binding together the appropriate pieces of audio and video information, before fusion in a second stage. To demonstrate the existence of this first stage, we have designed an original paradigm aiming at possibly “unbinding” the audio and visual streams. Our paradigm consists in presenting, before a McGurk stimulus (used as an indicator of audiovisual fusion), an audiovisual context that is either coherent or incoherent. In the case of an incoherent context we observe a significant decrease of the McGurk effect, implying a reduction of the amount of audiovisual fusion. Various kinds of incoherence (acoustic syllables dubbed on video sentences, phonetic or temporal modifications of the acoustic content of a regular sequence of audiovisual syllables) can significantly reduce the McGurk effect. The unbinding process is fast since one incoherent syllable is enough to produce maximal unbinding. On the other hand, the inverse process of “rebinding” by a coherent context following unbinding is progressive, since it appears that at least three coherent syllables are needed to completely recover from unbinding. The subject can also be “frozen” in an “unbound” state by adding a pause between an incoherent context and the McGurk target.
In total seven experiments were performed to demonstrate and describe the binding process in audiovisual speech perception. The data are interpreted in the framework of a two-stage “binding and fusion” model.
Published: 2013

11. Capacités audiovisuelles en robot humanoïde NAO

Author: Sanchez-Riera, Jordi, Interpretation and Modelling of Images and Videos (PERCEPTION), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire Jean Kuntzmann (LJK), Université Pierre Mendès France - Grenoble 2 (UPMF), Université Joseph Fourier - Grenoble 1 (UJF), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Centre National de la Recherche Scientifique (CNRS), Université de Grenoble, and Radu Horaud
Subjects: Audiovisual fusion, Fusion audiovisuelle, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Stereo vision, Reconnaissance d'actions, Action recognition
Abstract: In this thesis we plan to investigate the complementarity of auditory and visual sensory data for building a high-level interpretation of a scene. The audiovisual (AV) input received by the robot is a function of both the external environment and of the robot's actual localization, which is closely related to its actions. Current research in AV scene analysis has tended to focus on fixed perceivers. However, psychophysical evidence suggests that humans use small head and body movements in order to optimize the location of their ears with respect to the source. Similarly, by walking or turning, the robot may be able to improve the incoming visual data. For example, in binocular perception, it is desirable to reduce the viewing distance to an object of interest. This allows the 3D structure of the object to be analyzed at a higher depth-resolution.
Published: 2013

12. Capacités audiovisuelles en robot humanoïde NAO

Author: Sanchez-Riera, Jordi, Interpretation and Modelling of Images and Videos (PERCEPTION), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire Jean Kuntzmann (LJK), Université Pierre Mendès France - Grenoble 2 (UPMF), Université Joseph Fourier - Grenoble 1 (UJF), Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Centre National de la Recherche Scientifique (CNRS), Université de Grenoble, Radu Horaud, and STAR, ABES
Subjects: [INFO.INFO-OH] Computer Science [cs]/Other [cs.OH], Audiovisual fusion, Fusion audiovisuelle, [INFO.INFO-OH]Computer Science [cs]/Other [cs.OH], Stereo vision, Reconnaissance d'actions, Action recognition
Abstract: In this thesis we plan to investigate the complementarity of auditory and visual sensory data for building a high-level interpretation of a scene. The audiovisual (AV) input received by the robot is a function of both the external environment and of the robot's actual localization, which is closely related to its actions. Current research in AV scene analysis has tended to focus on fixed perceivers. However, psychophysical evidence suggests that humans use small head and body movements in order to optimize the location of their ears with respect to the source. Similarly, by walking or turning, the robot may be able to improve the incoming visual data. For example, in binocular perception, it is desirable to reduce the viewing distance to an object of interest. This allows the 3D structure of the object to be analyzed at a higher depth-resolution.
Published: 2013

13. Audiovisual diarization of people in video content

Author: Christine Sénac, Philippe Joly, Elie Khoury, IDIAP Research Institute, Laboratoire d'Informatique de l'Université du Maine (LIUM), Le Mans Université (UM), Centre National de la Recherche Scientifique (CNRS), Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse 1 Capitole (UT1), Université Toulouse - Jean Jaurès (UT2J), Université Toulouse III - Paul Sabatier (UT3), Institut National Polytechnique (Toulouse) (Toulouse INP), and Université Fédérale Toulouse Midi-Pyrénées
Subjects: Multimedia, Computer Networks and Communications, Computer science, Speech recognition, Search engine indexing, computer.software_genre, Speaker diarisation, Segmentation, Hardware and Architecture, Audiovisual fusion, Video indexing, Content (measure theory), Media Technology, People diarization, [INFO]Computer Science [cs], Unsupervised clustering, computer, Software
Abstract: International audience; Audio-Visual People Diarization (AVPD) is an original framework that simultaneously improves audio, video, and audiovisual diarization results. Following a literature review of people diarization for both audio and video content and their limitations, which includes our own contributions, we describe a proposed method for associating both audio and video information by using co-occurrence matrices and present experiments which were conducted on a corpus containing TV news, TV debates, and movies. Results show the effectiveness of the overall diarization system and confirm the gains audio information can bring to video indexing and vice versa.
Published: 2012

14. Indexation vidéo non-supervisée basée sur la caractérisation des personnes

Author: El Khoury, Elie, Équipe Structuration, Analyse et MOdélisation de documents Vidéo et Audio (IRIT-SAMoVA), Institut de recherche en informatique de Toulouse (IRIT), Université Toulouse 1 Capitole (UT1), Université Toulouse - Jean Jaurès (UT2J), Université Toulouse III - Paul Sabatier (UT3), Centre National de la Recherche Scientifique (CNRS), Institut National Polytechnique (Toulouse) (Toulouse INP), Université Fédérale Toulouse Midi-Pyrénées, Université Paul Sabatier - Toulouse III, and Philippe JOLY(joly@irit.fr)
Subjects: Face clustering, Speaker clustering, Fusion audiovisuelle, Extraction du costume, Regroupement en locuteurs, Face detection, Clothing extraction, GLR-BIC segmentation, Audiovisual fusion, Détection des visages, Diarization, Regroupement des visages, [INFO.INFO-HC]Computer Science [cs]/Human-Computer Interaction [cs.HC], Segmentation en locuteurs, Speaker segmentation
Abstract: This thesis proposes a method for the unsupervised characterization of persons within audiovisual documents, exploiting data related to their physical appearance and their voice. In general, automatic recognition methods, whether in video or audio, need a large amount of a priori knowledge about the content. In this work, the goal is to study the two modes in a correlated way and to exploit their respective properties in a collaborative and robust way, in order to produce a reliable result as independent as possible from any a priori knowledge. More particularly, we studied the characteristics of the audio stream and proposed several methods for speaker segmentation and clustering, which we evaluated in a French evaluation campaign. Then, we carried out an in-depth study of visual descriptors (face, clothing) that helped us propose novel approaches for detecting, tracking, and clustering people within the document. Finally, the work focused on audiovisual fusion by proposing a method based on computing a co-occurrence matrix, which allowed us to establish an association between the audio and video indexes and to correct them. This enables us to produce a dynamic audiovisual model for each speaker.
Published: 2010

15. Pairing audio speech and various visual displays: binding or not binding ?

Author: Aymeric Devergie, Frédéric Berthommier, Nicolas Grimault, Neurosciences Sensorielles Comportement Cognition, Centre National de la Recherche Scientifique (CNRS), Université Claude Bernard Lyon 1 (UCBL), Université de Lyon, GIPSA - Parole, Multimodalité, Développement (GIPSA-PMD), Département Parole et Cognition (GIPSA-DPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Stendhal - Grenoble 3, Université Pierre Mendès France - Grenoble 2 (UPMF), Université Joseph Fourier - Grenoble 1 (UJF), and Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP)
Subjects: Audiovisual fusion, perceptual binding, [SCCO.PSYC]Cognitive science/Psychology, [SCCO.PSYC] Cognitive science/Psychology, multimodal phonetic processing
Abstract: International audience; Recent findings demonstrate that audiovisual fusion during speech perception may involve pre-phonetic processing. The aim of the current experiment is to investigate this hypothesis using a pairing task between auditory sequences of vowels and non-speech visual cues. The audio sequences are composed of 6 auditory French vowels alternating in pitch (or not) in order to build 2 interleaved streams of 3 vowels each. Various elementary visual displays are mounted in synchrony with one vowel stream out of the two. Our hypothesis is that, in a forced-choice pairing task, the AV synchronized vowels will be found more frequently if such a perceptual binding operates. We show that the most efficient visual feature for increasing pairing performance is movement. Surprisingly, some features we manipulated do not increase pairing performance. The visual cue of contrast variation is not correctly paired with the synchronized auditory vowels. Moreover, auditory segregation, based on the pitch difference between the vowel streams, has no additional effect on pairing. In addition, the modulation of the auditory envelope, synchronized with the variation of the visual cue, also has no effect. Finally, when we introduce a phonetic cue in the visual display, pairing increases in comparison with non-specific visual cues. The relative contribution of perceptual binding and late phonetic fusion is discussed.
