Author: "Moritz, Niko" / Publication Year Range: This year - Searchworks@Jio Institute Digital Library Search Results

Your search for "Moritz, Niko" returned 7 results

1. Textless Streaming Speech-to-Speech Translation using Semantic Speech Tokens

Author: Zhao, Jinzheng, Moritz, Niko, Lakomkin, Egor, Xie, Ruiming, Xiu, Zhiping, Zmolikova, Katerina, Ahmed, Zeeshan, Gaur, Yashesh, Le, Duc, and Fuegen, Christian
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Cascaded speech-to-speech translation systems often suffer from the error accumulation problem and high latency, which is a result of cascaded modules whose inference delays accumulate. In this paper, we propose a transducer-based speech translation model that outputs discrete speech tokens in a low-latency streaming fashion. This approach eliminates the need for generating text output first, followed by machine translation (MT) and text-to-speech (TTS) systems. The produced speech tokens can be directly used to generate a speech signal with low latency by utilizing an acoustic language model (LM) to obtain acoustic tokens and an audio codec model to retrieve the waveform. Experimental results show that the proposed method outperforms other existing approaches and achieves state-of-the-art results for streaming translation in terms of BLEU, average latency, and BLASER 2.0 scores for multiple language pairs using the CVSS-C dataset as a benchmark.
Comment: Submitted to ICASSP 2025
Published: 2024

2. M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

Author: Yang, Yufeng, Raj, Desh, Lin, Ju, Moritz, Niko, Jia, Junteng, Keren, Gil, Lakomkin, Egor, Huang, Yiteng, Donley, Jacob, Mahadeokar, Jay, and Kalinli, Ozlem
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.
Comment: In submission to IEEE ICASSP 2025
Published: 2024

3. Effective internal language model training and fusion for factorized transducer model

Author: Guo, Jinxi, Moritz, Niko, Ma, Yingyi, Seide, Frank, Wu, Chunyang, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Mike
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The internal language model (ILM) of the neural transducer has been widely studied. In most prior work, it is mainly used for estimating the ILM score and is subsequently subtracted during inference to facilitate improved integration with external language models. Recently, various factorized transducer models have been proposed, which explicitly embrace a standalone internal language model for non-blank token prediction. However, even with the adoption of factorized transducer models, limited improvement has been observed compared to shallow fusion. In this paper, we propose a novel ILM training and decoding strategy for factorized transducer models, which effectively combines the blank, acoustic and ILM scores. Our experiments show a 17% relative improvement over the standard decoding method when utilizing a well-trained ILM and the proposed decoding strategy on LibriSpeech datasets. Furthermore, when compared to a strong RNN-T baseline enhanced with external LM fusion, the proposed model yields a 5.5% relative improvement on general-sets and an 8.9% WER reduction for rare words. The proposed model can achieve superior performance without relying on external language models, rendering it highly efficient for production use-cases. To further improve the performance, we propose a novel and memory-efficient ILM-fusion-aware minimum word error rate (MWER) training method which improves ILM integration significantly.
Comment: Accepted to ICASSP 2024
Published: 2024

4. AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

Author: Lin, Ju, Moritz, Niko, Huang, Yiteng, Xie, Ruiming, Sun, Ming, Fuegen, Christian, and Seide, Frank
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Wearable devices like smart glasses are approaching the compute capability to seamlessly generate real-time closed captions for live conversations. We build on our recently introduced directional Automatic Speech Recognition (ASR) for smart glasses that have microphone arrays, which fuses multi-channel ASR with serialized output training, for wearer/conversation-partner disambiguation as well as suppression of cross-talk speech from non-target directions and noise. When ASR work is part of a broader system-development process, one may be faced with changes to microphone geometries as system development progresses. This paper aims to make multi-channel ASR insensitive to limited variations of microphone-array geometry. We show that a model trained on multiple similar geometries is largely agnostic and generalizes well to new geometries, as long as they are not too different. Furthermore, training the model this way improves accuracy for seen geometries by 15 to 28% relative. Lastly, we refine the beamforming by a novel Non-Linearly Constrained Minimum Variance criterion.
Comment: Accepted to ICASSP 2024
Published: 2024

5. Biomechanical considerations of semi-anatomic glass fiber-reinforced (GFRC) composite implant for mandibular segmental defects: A technical note

Author: Väisänen, Antti, Hoikkala, Niko, Härkönen, Ville, Moritz, Niko, and Vallittu, Pekka K.
Published: 2024

6. Effective Internal Language Model Training and Fusion for Factorized Transducer Model

Author: Guo, Jinxi, Moritz, Niko, Ma, Yingyi, Seide, Frank, Wu, Chunyang, Mahadeokar, Jay, Kalinli, Ozlem, Fuegen, Christian, and Seltzer, Mike
Published: 2024

7. AGADIR: Towards Array-Geometry Agnostic Directional Speech Recognition

Author: Lin, Ju, Moritz, Niko, Huang, Yiteng, Xie, Ruiming, Sun, Ming, Fuegen, Christian, and Seide, Frank
Published: 2024