Audio-visual Speaker Recognition with a Cross-modal Discriminative Network
- Source :
- INTERSPEECH
- Publication Year :
- 2020
Abstract
- Audio-visual speaker recognition is one of the tasks in the 2019 NIST speaker recognition evaluation (SRE). Studies in both neuroscience and computer science indicate that visual and auditory neural signals interact in the cognitive process. This motivated us to study a cross-modal network, the voice-face discriminative network (VFNet), which establishes a general relation between the human voice and face. Experiments show that VFNet provides additional speaker-discriminative information. With VFNet, we achieve a 16.54% relative reduction in equal error rate over the score-level fusion audio-visual baseline on the evaluation set of the 2019 NIST SRE.
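The abstract reports its result in terms of score-level fusion and a relative reduction in equal error rate (EER). The following is a minimal Python sketch of how those quantities are commonly computed. It is not the authors' implementation: the function names, the equal fusion weight, and the simple threshold sweep are illustrative assumptions, and no scores from the paper are used.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Returns a fraction in [0, 1]. This is a coarse estimate (assumption:
    no interpolation between operating points).
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials at or above the threshold.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        # Keep the operating point where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def fuse_scores(audio_scores, visual_scores, weight=0.5):
    """Score-level fusion as a weighted sum of per-trial scores.

    The weight of 0.5 is a hypothetical default, not a value from the paper.
    """
    return [weight * a + (1 - weight) * v
            for a, v in zip(audio_scores, visual_scores)]


def relative_reduction(baseline_eer, improved_eer):
    """Relative EER reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline_eer - improved_eer) / baseline_eer
```

Under this definition, a system that lowers a baseline EER from 10.0% to 8.0% achieves a 20% relative reduction, although the absolute reduction is only 2 percentage points.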
- Subjects :
- 0209 industrial biotechnology
Computer science
Speech recognition
Word error rate
Cognition
02 engineering and technology
Speaker recognition
020901 industrial engineering & automation
Discriminative model
Audio and Speech Processing (eess.AS)
Face (geometry)
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
NIST
020201 artificial intelligence & image processing
Set (psychology)
Human voice
Electrical Engineering and Systems Science - Audio and Speech Processing
Details
- Language :
- English
- Database :
- OpenAIRE
- Journal :
- INTERSPEECH
- Accession number :
- edsair.doi.dedup.....9ab0976f663c7a921fc9846c06126ea4