Author: "Mošner, Ladislav" / Publisher: arxiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Mošner, Ladislav"' showing total 4 results

Author: Peng, Junyi, Plchot, Oldřich, Stafylakis, Themos, Mošner, Ladislav, Burget, Lukáš, and Černocký, Jan
Subjects: Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received a rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models are pretrained and finetuned on the same dataset. Three pre-trained models including HuBERT, Conformer and WavLM are evaluated on four different speaker verification datasets with varying sizes. Our experiments show that these self-pretrained models achieve competitive performance on downstream speaker verification tasks with only one-third of the data compared to Librispeech pretraining, such as VoxCeleb1 and CNCeleb1. Furthermore, when pre-training only on the VoxCeleb2-dev, the Conformer model outperforms the one pre-trained on 94k hours of data using the same fine-tuning settings.
Comment: Accepted to Interspeech 2023
Published: 2023

Author: Stafylakis, Themos, Mošner, Ladislav, Plchot, Oldřich, Rohdin, Johan, Silnova, Anna, Burget, Lukáš, and Černocký, Jan 'Honza'
Subjects: Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we are using the full VoxCeleb recordings and the name of the celebrities appearing on each video without knowledge of the time intervals the celebrities appear in the video. We show that by combining a baseline speaker diarization algorithm that requires no training or parameter tuning, a modified loss with aggregation over segments, and a two-stage training approach, we are able to train a competitive ResNet-based embedding extractor. Finally, we experiment with two different aggregation functions and analyze their behaviour in terms of their gradients.
Comment: Accepted at Interspeech 2022
Published: 2022

Author: Landini, Federico, Wang, Shuai, Diez, Mireia, Burget, Lukáš, Matějka, Pavel, Žmolíková, Kateřina, Mošner, Ladislav, Plchot, Oldřich, Novotný, Ondřej, Zeinali, Hossein, and Rohdin, Johan
Subjects: Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper describes the systems developed by the BUT team for the four tracks of the second DIHARD speech diarization challenge. For tracks 1 and 2 the systems were based on performing agglomerative hierarchical clustering (AHC) over x-vectors, followed by the Bayesian Hidden Markov Model (HMM) with eigenvoice priors applied at x-vector level followed by the same approach applied at frame level. For tracks 3 and 4, the systems were based on performing AHC using x-vectors extracted on all channels.
Published: 2019

Author: Zeinali, Hossein, Matějka, Pavel, Mošner, Ladislav, Plchot, Oldřich, Silnova, Anna, Novotný, Ondřej, Profant, Ján, Glembek, Ondřej, and Burget, Lukáš
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This is a description of our effort in VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptation and achieved less than ~10% relative improvement. In the submission to open condition, we used 3 x-vector systems and also one i-vector based system.
Published: 2019