Speaker-aware speech-transformer
- Source: ASRU
- Publication Year: 2020
- Publisher: arXiv, 2020.
Abstract
- Recently, end-to-end (E2E) models have become a competitive alternative to conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch between training and testing conditions. In this paper, we use the Speech-Transformer (ST) as the study platform to investigate speaker-aware training of E2E models. We propose a model called the Speaker-Aware Speech-Transformer (SAST), a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) made up of i-vectors. At each time step, the encoder output attends to the i-vectors in the block and generates a weighted combination of them as a speaker embedding vector, which helps the model normalize speaker variations. The SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of the SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5% CER reduction (CERR) over the speaker-independent (SI) baseline. Moreover, we demonstrate that SAST still works quite well even when the i-vectors in the SKB all come from a data source other than the acoustic training set.
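- The following is a minimal PyTorch sketch of the attention mechanism the abstract describes: encoder outputs act as queries over a static bank of i-vectors (the SKB), and the resulting soft speaker embedding is fused back into the encoder stream. The projection layers, scaled dot-product scoring, and concatenation-based fusion are assumptions for illustration; the abstract does not specify these details of SAST.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerAttentionModule(nn.Module):
    """Sketch of a speaker attention module (SAM).

    At each time step, the encoder output attends over a frozen bank of
    i-vectors (the speaker knowledge block, SKB) and produces a weighted
    speaker embedding. Dimensions and the fusion scheme are hypothetical.
    """

    def __init__(self, d_model: int, ivector_dim: int, ivector_bank: torch.Tensor):
        super().__init__()
        # Static SKB of shape (num_speakers, ivector_dim); registered as a
        # buffer so it is not updated by backpropagation.
        self.register_buffer("skb", ivector_bank)
        # Hypothetical projections: query into i-vector space, then fuse
        # the attended speaker embedding back into the model dimension.
        self.query_proj = nn.Linear(d_model, ivector_dim)
        self.out_proj = nn.Linear(d_model + ivector_dim, d_model)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, time, d_model)
        q = self.query_proj(enc_out)                           # (B, T, ivector_dim)
        scores = q @ self.skb.t() / self.skb.size(-1) ** 0.5   # (B, T, num_speakers)
        weights = F.softmax(scores, dim=-1)                    # attention over speakers
        spk_emb = weights @ self.skb                           # (B, T, ivector_dim)
        # Concatenate the soft speaker embedding with the encoder output
        # and project back to d_model (one plausible fusion choice).
        return self.out_proj(torch.cat([enc_out, spk_emb], dim=-1))

# Usage with made-up sizes: 256-dim encoder states, 100-dim i-vectors,
# a bank of 500 training speakers.
sam = SpeakerAttentionModule(256, 100, torch.randn(500, 100))
out = sam(torch.randn(8, 50, 256))  # -> (8, 50, 256)
```

- Because the bank is only read through attention, the i-vectors need not come from the acoustic training speakers, which is consistent with the abstract's cross-data-source result.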
- Subjects:
- Data source
- FOS: Computer and information sciences
- Sound (cs.SD)
- Training set
- Computer Science - Computation and Language
- Computer science
- Speech recognition
- Time step
- I-vector
- Computer Science - Sound
- Audio and Speech Processing (eess.AS)
- FOS: Electrical engineering, electronic engineering, information engineering
- Embedding
- Encoder
- Computation and Language (cs.CL)
- Speaker adaptation
- Transformer (machine learning model)
- Electrical Engineering and Systems Science - Audio and Speech Processing
Details
- Database: OpenAIRE
- Journal: ASRU
- Accession number: edsair.doi.dedup.....9036310dd864aca6340e44aa5745eb6e
- Full Text: https://doi.org/10.48550/arxiv.2001.01557