Author: "Jaejin Cho" / Publisher: arxiv - Searchworks@Jio Institute Digital Library Search Results

Your search for "Jaejin Cho" returned 3 results.

1. Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

Author: Jesus Villalba, Najim Dehak, Jaejin Cho, and Laureano Moro Velazquez
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), Signal Processing, FOS: Electrical engineering, electronic engineering, information engineering, Electrical and Electronic Engineering, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representations can be used in speech applications that require a discriminative representation of attributes that are consistent within an utterance: speaker, language, emotion, and age. Existing frame-level self-supervised speech representations, e.g., wav2vec, can be used as utterance-level representations with pooling, but the models are usually large. There are also SSL techniques that learn utterance-level representations. One of the most successful is a contrastive method, which requires negative sampling: selecting alternative samples to contrast with the current sample (the anchor). However, without labels, there is no guarantee that all the negative samples belong to classes different from the anchor class. This paper applies a non-contrastive self-supervised method to learn utterance-level embeddings. We adapted DIstillation with NO labels (DINO) from computer vision to speech. Unlike contrastive methods, DINO does not require negative sampling. We compared DINO to an x-vector trained in a supervised manner. When transferred to downstream tasks (speaker verification, speech emotion recognition (SER), and Alzheimer's disease detection), DINO outperformed the x-vector. We studied the influence of several aspects of transfer learning, such as dividing the fine-tuning process into steps, chunk lengths, and augmentation. During fine-tuning, tuning the last affine layers first and then the whole network outperformed fine-tuning all at once. Using shorter chunk lengths, although they generate more diverse inputs, did not necessarily improve performance, implying that each application requires speech segments of at least a certain length for better performance. Augmentation was helpful in SER.
Comment: Early access, IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing
Published: 2022
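
The abstract's key contrast is that DINO learns without negative samples. A minimal NumPy sketch of a DINO-style objective follows, assuming the formulation of the original computer-vision DINO method: a student is matched to a teacher whose output is centered and sharpened, and whose weights are an exponential moving average (EMA) of the student's. The linear "encoder", the two noisy views standing in for two chunks of the same utterance, and all temperature and momentum values are hypothetical choices for illustration, not the paper's speech configuration; the student's gradient step is omitted.

```python
import numpy as np

def log_softmax(x, temperature):
    """Row-wise log-softmax with a temperature; subtracting the row max avoids overflow."""
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the teacher distribution (centered, then sharpened by a
    low temperature) and the student distribution, averaged over the batch.
    Only views of the same utterance are compared: there are no negative samples.
    The teacher receives no gradient in DINO (stop-gradient)."""
    p_teacher = np.exp(log_softmax(teacher_out - center, t_teacher))
    log_p_student = log_softmax(student_out, t_student)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher, student, momentum):
    """Teacher weights track an exponential moving average of the student weights."""
    return momentum * teacher + (1.0 - momentum) * student

rng = np.random.default_rng(0)
dim_in, dim_out, batch = 16, 8, 4
w_student = rng.normal(size=(dim_in, dim_out))  # hypothetical linear "encoder"
w_teacher = w_student.copy()                     # teacher starts as a copy of the student
center = np.zeros(dim_out)

# Hypothetical features of two chunks ("views") taken from the same 4 utterances.
view_a = rng.normal(size=(batch, dim_in))
view_b = view_a + 0.1 * rng.normal(size=(batch, dim_in))

s_a, s_b = view_a @ w_student, view_b @ w_student
t_a, t_b = view_a @ w_teacher, view_b @ w_teacher
# Each student view is matched to the teacher's output for the other view.
loss = 0.5 * (dino_loss(s_a, t_b, center) + dino_loss(s_b, t_a, center))

# After the student's gradient step (omitted), update the teacher and the center by EMA.
w_teacher = ema_update(w_teacher, w_student, momentum=0.996)
center = 0.9 * center + 0.1 * np.concatenate([t_a, t_b]).mean(axis=0)
print(f"DINO loss on this toy batch: {loss:.4f}")
```

In the DINO formulation, centering and sharpening the teacher output, together with the EMA teacher, are the mechanisms that discourage collapse to a trivial embedding in place of the negative pairs a contrastive loss would use.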

2. Multilingual sequence-to-sequence speech recognition: architecture, transfer learning, and language modeling

Author: Matthew Wiesner, Martin Karafiat, Nelson Yalta, Shinji Watanabe, Takaaki Hori, Sri Harish Mallidi, Murali Karthick Baskar, Ruizhi Li, and Jaejin Cho
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Computer science, Speech recognition, 02 engineering and technology, Lexicon, Computer Science - Sound, Data modeling, Machine Learning (cs.LG), 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Sequence, Computer Science - Computation and Language, 020206 networking & telecommunications, Convolution (computer science), Recurrent neural network, Language model, 0305 other medical science, Transfer of learning, Computation and Language (cs.CL), Decoding methods, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The sequence-to-sequence (seq2seq) approach to low-resource ASR is a relatively new direction in speech research. The approach has the benefit of training models without a lexicon or alignments. However, it poses a new problem: it requires more data than conventional DNN-HMM systems. In this work, we attempt to use data from 10 BABEL languages to build a multilingual seq2seq model as a prior model, and then port it to 4 other BABEL languages using a transfer learning approach. We also explore different architectures for improving the prior multilingual seq2seq model. The paper also discusses the effect of integrating a recurrent neural network language model (RNNLM) with a seq2seq model during decoding. Experimental results show that transfer learning from the multilingual model yields substantial gains over monolingual models across all 4 BABEL languages. Incorporating an RNNLM also brings significant improvements in %WER and achieves recognition performance comparable to models trained with twice as much training data.
Published: 2018
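
The abstract does not specify how the RNNLM is integrated with the seq2seq model during decoding. A common choice is shallow fusion, in which a hypothesis is scored by its seq2seq log-probability plus a weighted LM log-probability; the Python sketch below assumes that scheme. The 2-best list, its probabilities, and the weight `lm_weight` are hypothetical, and for brevity the fused score is applied to complete hypotheses, whereas in beam search the same sum is computed for partial hypotheses at every step.

```python
import math

def fused_score(log_p_s2s, log_p_lm, lm_weight):
    """Shallow fusion: seq2seq log-probability plus the weighted LM log-probability."""
    return log_p_s2s + lm_weight * log_p_lm

def best_hypothesis(hypotheses, lm_weight):
    """Return the text of the hypothesis with the highest fused score."""
    best = max(hypotheses, key=lambda h: fused_score(h["s2s"], h["lm"], lm_weight))
    return best["text"]

# Hypothetical 2-best list: seq2seq and RNNLM probabilities for each transcript.
hyps = [
    {"text": "the cat sat", "s2s": math.log(0.40), "lm": math.log(0.010)},
    {"text": "the cat sad", "s2s": math.log(0.45), "lm": math.log(0.001)},
]
print(best_hypothesis(hyps, lm_weight=0.0))  # seq2seq alone, prints: the cat sad
print(best_hypothesis(hyps, lm_weight=0.3))  # with the LM, prints: the cat sat
```

Raising `lm_weight` lets the LM overrule a transcript that the seq2seq model slightly prefers but that is linguistically implausible; in practice such a weight is tuned on development data.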

3. End-to-end Speech Recognition with Word-based RNN Language Models

Author: Jaejin Cho, Takaaki Hori, and Shinji Watanabe
Subjects: FOS: Computer and information sciences, Vocabulary, Computer Science - Computation and Language, Computer science, Computer Science - Artificial Intelligence, Speech recognition, media_common.quotation_subject, 020206 networking & telecommunications, 02 engineering and technology, 030507 speech-language pathology & audiology, 03 medical and health sciences, Recurrent neural network, Artificial Intelligence (cs.AI), Test set, 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), Language model, 0305 other medical science, Hidden Markov model, Computation and Language (cs.CL), Word (computer architecture), Decoding methods, media_common
Abstract: This paper investigates the impact of word-based RNN language models (RNN-LMs) on the performance of end-to-end automatic speech recognition (ASR). In our prior work, we proposed a multi-level LM, in which character-based and word-based RNN-LMs are combined in hybrid CTC/attention-based ASR. Although this multi-level approach achieves significant error reduction on the Wall Street Journal (WSJ) task, two different LMs need to be trained and used for decoding, which increases computational cost and memory usage. In this paper, we further propose a novel word-based RNN-LM that allows decoding with only the word-based LM: instead of a character-based LM, it provides look-ahead word probabilities to predict the next characters, yielding competitive accuracy with less computation than the multi-level LM. We demonstrate the efficacy of the word-based RNN-LMs on a larger corpus, LibriSpeech, in addition to the WSJ corpus used in the prior work. Furthermore, we show that the proposed model achieves 5.1% WER on the WSJ Eval’92 test set when the vocabulary size is increased, which is the best WER reported for end-to-end ASR systems on this benchmark.
Published: 2018
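
The look-ahead idea in this abstract can be illustrated with a simplified sketch: the probability of the next character is the share of word-LM probability mass held by vocabulary words that continue the current partial word. The sketch assumes a closed vocabulary and uses a fixed dictionary `word_probs` as a hypothetical stand-in for the word RNN-LM's output distribution given the preceding words; the function names are illustrative, and the paper's actual implementation (including its handling of out-of-vocabulary words) is not reproduced here.

```python
import math

def char_lookahead(word_probs, prefix, next_char):
    """P(next_char | partial word `prefix`) derived from word-level LM probabilities.

    The numerator sums the probabilities of vocabulary words that continue the
    extended prefix; the denominator sums those that continue `prefix` at all.
    A space stands for the word boundary, i.e. `prefix` is itself a complete word.
    """
    denom = sum(p for w, p in word_probs.items() if w.startswith(prefix))
    if denom == 0.0:
        return 0.0  # no in-vocabulary word continues this prefix (OOV handling omitted)
    if next_char == " ":
        numer = word_probs.get(prefix, 0.0)
    else:
        numer = sum(p for w, p in word_probs.items() if w.startswith(prefix + next_char))
    return numer / denom

def word_prob_via_chars(word_probs, word):
    """Multiply the character look-ahead probabilities along `word`, then the boundary."""
    prob = 1.0
    for i, ch in enumerate(word):
        prob *= char_lookahead(word_probs, word[:i], ch)
    return prob * char_lookahead(word_probs, word, " ")

# Hypothetical word-LM output p(w | history) over a tiny closed vocabulary.
word_probs = {"speech": 0.5, "speed": 0.2, "spell": 0.1, "the": 0.2}
print(f"{char_lookahead(word_probs, 'spe', 'e'):.3f}")    # 0.875
print(f"{char_lookahead(word_probs, 'spe', 'l'):.3f}")    # 0.125
print(f"{word_prob_via_chars(word_probs, 'speed'):.3f}")  # 0.200
```

Because the ratios telescope, multiplying the character-level probabilities along a word and its boundary recovers that word's probability (0.2 for "speed" here), which illustrates how a character-level decoder could rely on the word LM alone.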
