Author: "Uhlich, Stefan" / Topic: machine learning (cs.lg) - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Uhlich, Stefan"' showing total 10 results

Start Over Author "Uhlich, Stefan" Topic machine learning (cs.lg)

10 results on '"Uhlich, Stefan"'

1. Autotts: End-to-End Text-to-Speech Synthesis Through Differentiable Duration Modeling

Author: Nguyen, Bac, Cardinaux, Fabien, and Uhlich, Stefan
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Parallel text-to-speech (TTS) models have recently enabled fast and highly-natural speech synthesis. However, they typically require external alignment models, which are not necessarily optimized for the decoder as they are not jointly trained. In this paper, we propose a differentiable duration method for learning monotonic alignments between input and output sequences. Our method is based on a soft-duration mechanism that optimizes a stochastic process in expectation. Using this differentiable duration method, we introduce AutoTTS, a direct text-to-waveform speech synthesis model. AutoTTS enables high-fidelity speech synthesis through a combination of adversarial training and matching the total ground-truth duration. Experimental results show that our model obtains competitive results while enjoying a much simpler training pipeline. Audio samples are available online., ICASSP 2023
Published: 2023
Full Text: View/download PDF

2. Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

Author: Koo, Junghyun, Martínez-Ramírez, Marco A., Liao, Wei-Hsiang, Uhlich, Stefan, Lee, Kyogu, and Mitsufuji, Yuki
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio effects related information from a reference music recording. All our models are trained in a self-supervised manner from an already-processed wet multitrack dataset with an effective data preprocessing method that alleviates the data scarcity of obtaining unprocessed dry data. We analyze the proposed encoder for the disentanglement capability of audio effects and also validate its performance for mixing style transfer through both objective and subjective evaluations. From the results, we show the proposed system not only converts the mixing style of multitrack audio close to a reference but is also robust with mixture-wise style transfer upon using a music source separation model.
Published: 2022

3. A Statistical Model for Predicting Generalization in Few-Shot Classification

Author: Bendou, Yassir, Gripon, Vincent, Pasdeloup, Bastien, Mauch, Lukas, Uhlich, Stefan, Cardinaux, Fabien, Hacene, Ghouthi Boukli, and Garcia, Javier Alonso
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The estimation of the generalization error of classifiers often relies on a validation set. Such a set is hardly available in few-shot learning scenarios, a highly disregarded shortcoming in the field. In these scenarios, it is common to rely on features extracted from pre-trained neural networks combined with distance-based classifiers such as nearest class mean. In this work, we introduce a Gaussian model of the feature distribution. By estimating the parameters of this model, we are able to predict the generalization error on new classification tasks with few samples. We observe that accurate distance estimates between class-conditional densities are the key to accurate estimates of the generalization performance. Therefore, we propose an unbiased estimator for these distances and integrate it in our numerical analysis. We empirically show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy.
Published: 2022
Full Text: View/download PDF

4. Distortion Audio Effects: Learning How to Recover the Clean Signal

Author: Imort, Johannes, Fabbro, Giorgio, Ramírez, Marco A. Martínez, Uhlich, Stefan, Koyama, Yuichiro, and Mitsufuji, Yuki
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, ismir, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, ismir2022, Machine Learning (cs.LG)
Abstract: Given the recent advances in music source separation and automatic mixing, removing audio effects in music tracks is a meaningful step toward developing an automated remixing system. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved by neural networks designed for source separation and audio effect modeling. Our approach proves particularly effective for effects that mix the processed and clean signals. The models achieve better quality and significantly faster inference compared to state-of-the-art solutions based on sparse optimization. We demonstrate that the models are suitable not only for declipping but also for other types of distortion effects. By discussing the results, we stress the usefulness of multiple evaluation metrics to assess different aspects of reconstruction in distortion effect removal., Comment: Audio examples available at https://joimort.github.io/distortionremoval/
Published: 2022
Full Text: View/download PDF

5. Training Speech Enhancement Systems with Noisy Speech Datasets

Author: Saito, Koichi, Uhlich, Stefan, Fabbro, Giorgio, and Mitsufuji, Yuki
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, deep neural network (DNN)-based speech enhancement (SE) systems have been used with great success. During training, such systems require clean speech data - ideally, in large quantity with a variety of acoustic conditions, many different speaker characteristics and for a given sampling rate (e.g., 48kHz for fullband SE). However, obtaining such clean speech data is not straightforward - especially, if only considering publicly available datasets. At the same time, a lot of material for automatic speech recognition (ASR) with the desired acoustic/speaker/sampling rate characteristics is publicly available except being clean, i.e., it also contains background noise as this is even often desired in order to have ASR systems that are noise-robust. Hence, using such data to train SE systems is not straightforward. In this paper, we propose two improvements to train SE systems on noisy speech data. First, we propose several modifications of the loss functions, which make them robust against noisy speech targets. In particular, computing the median over the sample axis before averaging over time-frequency bins allows to use such data. Furthermore, we propose a noise augmentation scheme for mixture-invariant training (MixIT), which allows using it also in such scenarios. For our experiments, we use the Mozilla Common Voice dataset and we show that using our robust loss function improves PESQ by up to 0.19 compared to a system trained in the traditional way. Similarly, for MixIT we can see an improvement of up to 0.27 in PESQ when using our proposed noise augmentation., 5 pages, 3 figures, submitted to WASPAA2021
Published: 2021

6. Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives

Author: Narihira, Takuya, Alonsogarcia, Javier, Cardinaux, Fabien, Hayakawa, Akio, Ishii, Masato, Iwaki, Kazunori, Kemp, Thomas, Kobayashi, Yoshiyuki, Mauch, Lukas, Nakamura, Akira, Obuchi, Yukio, Shin, Andrew, Suzuki, Kenji, Tiedmann, Stephen, Uhlich, Stefan, Yashima, Takuya, and Yoshiyama, Kazuki
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: While there exist a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation on distributed setting, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineer's perspective, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments., https://nnabla.org
Published: 2021

7. Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency

Author: Khan, Mohammad Asif, Cardinaux, Fabien, Uhlich, Stefan, Ferras, Marc, and Fischer, Asja
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In recent years generative adversarial network (GAN) based models have been successfully applied for unsupervised speech-to-speech conversion.The rich compact harmonic view of the magnitude spectrogram is considered a suitable choice for training these models with audio data. To reconstruct the speech signal first a magnitude spectrogram is generated by the neural network, which is then utilized by methods like the Griffin-Lim algorithm to reconstruct a phase spectrogram. This procedure bears the problem that the generated magnitude spectrogram may not be consistent, which is required for finding a phase such that the full spectrogram has a natural-sounding speech waveform. In this work, we approach this problem by proposing a condition encouraging spectrogram consistency during the adversarial training procedure. We demonstrate our approach on the task of translating the voice of a male speaker to that of a female speaker, and vice versa. Our experimental results on the Librispeech corpus show that the model trained with the TF consistency provides a perceptually better quality of speech-to-speech conversion.
Published: 2020

8. Mixed Precision DNNs: All you need is a good parametrization

Author: Uhlich, Stefan, Mauch, Lukas, Cardinaux, Fabien, Yoshiyama, Kazuki, Garcia, Javier Alonso, Tiedemann, Stephen, Kemp, Thomas, and Nakamura, Akira
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Efficient deep neural network (DNN) inference on mobile or embedded devices typically involves quantization of the network parameters and activations. In particular, mixed precision networks achieve better performance than networks with homogeneous bitwidth for the same size constraint. Since choosing the optimal bitwidths is not straight forward, training methods, which can learn them, are desirable. Differentiable quantization with straight-through gradients allows to learn the quantizer's parameters using gradient methods. We show that a suited parametrization of the quantizer is the key to achieve a stable training and a good final performance. Specifically, we propose to parametrize the quantizer with the step size and dynamic range. The bitwidth can then be inferred from them. Other parametrizations, which explicitly use the bitwidth, consistently perform worse. We confirm our findings with experiments on CIFAR-10 and ImageNet and we obtain mixed precision DNNs with learned quantization parameters, achieving state-of-the-art performance., International Conference on Learning Representations (ICLR) 2020; Source code at https://github.com/sony/ai-research-code
Published: 2019

9. Iteratively Training Look-Up Tables for Network Quantization

Author: Cardinaux, Fabien, Uhlich, Stefan, Yoshiyama, Kazuki, Garc��a, Javier Alonso, Tiedemann, Stephen, Kemp, Thomas, and Nakamura, Akira
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Operating deep neural networks on devices with limited resources requires the reduction of their memory footprints and computational requirements. In this paper we introduce a training method, called look-up table quantization, LUT-Q, which learns a dictionary and assigns each weight to one of the dictionary's values. We show that this method is very flexible and that many other techniques can be seen as special cases of LUT-Q. For example, we can constrain the dictionary trained with LUT-Q to generate networks with pruned weight matrices or restrict the dictionary to powers-of-two to avoid the need for multiplications. In order to obtain fully multiplier-less networks, we also introduce a multiplier-less version of batch normalization. Extensive experiments on image recognition and object detection tasks show that LUT-Q consistently achieves better performance than other methods with the same quantization bitwidth., NIPS 2018 workshop on Compact Deep Neural Networks with industrial applications
Published: 2018

10. Improving DNN-based Music Source Separation using Phase Features

Author: Muth, Joachim, Uhlich, Stefan, Perraudin, Nathanael, Kemp, Thomas, Cardinaux, Fabien, and Mitsufuji, Yuki
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music source separation with deep neural networks typically relies only on amplitude features. In this paper we show that additional phase features can improve the separation performance. Using the theoretical relationship between STFT phase and amplitude, we conjecture that derivatives of the phase are a good feature representation opposed to the raw phase. We verify this conjecture experimentally and propose a new DNN architecture which combines amplitude and phase. This joint approach achieves a better signal-to distortion ratio on the DSD100 dataset for all instruments compared to a network that uses only amplitude features. Especially, the bass instrument benefits from the phase information., 7 pages, 9 figures, Joint Workshop on Machine Learning for Music at ICML, IJCAI/ECAI and AAMAS, 2018
Published: 2018

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

10 results on '"Uhlich, Stefan"'

1. Autotts: End-to-End Text-to-Speech Synthesis Through Differentiable Duration Modeling

2. Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects

3. A Statistical Model for Predicting Generalization in Few-Shot Classification

4. Distortion Audio Effects: Learning How to Recover the Clean Signal

5. Training Speech Enhancement Systems with Noisy Speech Datasets

6. Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives

7. Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency

8. Mixed Precision DNNs: All you need is a good parametrization

9. Iteratively Training Look-Up Tables for Network Quantization

10. Improving DNN-based Music Source Separation using Phase Features

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

10 results on '"Uhlich, Stefan"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources