Author: "Kleijn, W. Bastiaan" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kleijn, W. Bastiaan"' showing total 604 results

1. TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

Author: Ma, Wan-Duo Kurt, Lewis, J. P., and Kleijn, W. Bastiaan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is built upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases., Comment: 14 pages, 18 figures, Project Page: https://hohonu-vicml.github.io/Trailblazer.Page/
Published: 2023

2. Exact Diffusion Inversion via Bidirectional Integration Approximation

Author: Zhang, Guoqiang, Lewis, J. P., Kleijn, W. Bastiaan, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

3. Exact Diffusion Inversion via Bi-directional Integration Approximation

Author: Zhang, Guoqiang, Lewis, J. P., and Kleijn, W. Bastiaan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, various methods have been proposed to address the inconsistency issue of DDIM inversion to enable image editing, such as EDICT [36] and Null-text inversion [22]. However, the above methods introduce considerable computational overhead. In this paper, we propose a new technique, named bi-directional integration approximation (BDIA), to perform exact diffusion inversion with negligible computational overhead. Suppose we would like to estimate the next diffusion state $\boldsymbol{z}_{i-1}$ at timestep $t_i$ with the historical information $(i,\boldsymbol{z}_i)$ and $(i+1,\boldsymbol{z}_{i+1})$. We first obtain the estimated Gaussian noise $\hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i)$, and then apply the DDIM update procedure twice for approximating the ODE integration over the next time-slot $[t_i, t_{i-1}]$ in the forward manner and the previous time-slot $[t_i, t_{i+1}]$ in the backward manner. The DDIM step for the previous time-slot is used to refine the integration approximation made earlier when computing $\boldsymbol{z}_i$. A nice property of BDIA-DDIM is that the update expression for $\boldsymbol{z}_{i-1}$ is a linear combination of $(\boldsymbol{z}_{i+1}, \boldsymbol{z}_i, \hat{\boldsymbol{\epsilon}}(\boldsymbol{z}_i,i))$. This allows for exact backward computation of $\boldsymbol{z}_{i+1}$ given $(\boldsymbol{z}_i, \boldsymbol{z}_{i-1})$, thus leading to exact diffusion inversion. It is demonstrated with experiments that (round-trip) BDIA-DDIM is particularly effective for image editing. Our experiments further show that BDIA-DDIM produces markedly better image sampling qualities than DDIM for text-to-image generation. BDIA can also be applied to improve the performance of other ODE solvers in addition to DDIM. In our work, it is found that applying BDIA to the EDM sampling procedure produces consistently better performance over four pre-trained models., Comment: arXiv admin note: text overlap with arXiv:2304.11328.
Our code is available at https://github.com/guoqiang-zhang-x/BDIA
Published: 2023

4. On Accelerating Diffusion-Based Sampling Process via Improved Integration Approximation

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning, Mathematics - Numerical Analysis
Abstract: A popular approach to sampling a diffusion-based generative model is to solve an ordinary differential equation (ODE). In existing samplers, the coefficients of the ODE solvers are pre-determined by the ODE formulation, the reverse discrete timesteps, and the employed ODE methods. In this paper, we consider accelerating several popular ODE-based sampling processes (including EDM, DDIM, and DPM-Solver) by optimizing certain coefficients via improved integration approximation (IIA). We propose to minimize, for each time step, a mean squared error (MSE) function with respect to the selected coefficients. The MSE is constructed by applying the original ODE solver for a set of fine-grained timesteps, which in principle provides a more accurate integration approximation in predicting the next diffusion state. The proposed IIA technique does not require any change of a pre-trained model, and only introduces a very small computational overhead for solving a number of quadratic optimization problems. Extensive experiments show that considerably better FID scores can be achieved by using IIA-EDM, IIA-DDIM, and IIA-DPM-Solver than the original counterparts when the number of neural function evaluations (NFEs) is small (i.e., less than 25).
Published: 2023

5. Lookahead Diffusion Probabilistic Models for Refining Mean Estimation

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the correlation in the outputs of the deep neural networks (DNNs) over subsequent timesteps in diffusion probabilistic models (DPMs) to refine the mean estimation of the conditional Gaussian distributions in the backward process. A typical DPM first obtains an estimate of the original data sample $\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index $i$ into the DNN model and then computes the mean vector of the conditional Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the two estimates of $\boldsymbol{x}$ that are obtained by feeding $(\boldsymbol{z}_{i+1},i+1)$ and $(\boldsymbol{z}_{i},i)$ into the DNN model. The extrapolation can be easily integrated into the backward process of existing DPMs by introducing an additional connection over two consecutive timesteps, and fine-tuning is not required. Extensive experiments showed that plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers leads to a significant performance gain in terms of FID score., Comment: accepted by CVPR, 2023
Published: 2023

6. LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models

Author: Jenrungrot, Teerapat, Chinen, Michael, Kleijn, W. Bastiaan, Skoglund, Jan, Borsos, Zalán, Zeghidour, Neil, and Tagliasacchi, Marco
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec., Comment: 5 pages, accepted to ICASSP 2023, project page: https://mjenrungrot.github.io/chrome-media-audio-papers/publications/lmcodec
Published: 2023

7. Directed Diffusion: Direct Control of Object Placement through Attention Guidance

Author: Ma, Wan-Duo Kurt, Lewis, J. P., Lahiri, Avisek, Leung, Thomas, and Kleijn, W. Bastiaan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable Diffusion are able to generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects such as characters in specified positional relationships. The missing capability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of objects denoted by those words, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines to implement., Comment: Our project page: https://hohonu-vicml.github.io/DirectedDiffusion.Page
Published: 2023

8. Estimation of Source and Receiver Positions, Room Geometry and Reflection Coefficients From a Single Room Impulse Response

Author: Yu, Wangyang and Kleijn, W. Bastiaan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We propose an algorithm to estimate source and receiver positions, room geometry and reflection coefficients from a single room impulse response simultaneously. It is based on a symmetry analysis of the room impulse response. The proposed method utilizes the times of arrival of the direct path, first-order reflections and second-order reflections. The proposed method is robust to erroneous pulses and non-specular reflections. It can be applied to any room with parallel walls as long as the required arrival times of reflections are available. In contrast to the state-of-the-art method, we do not restrict the location of source and receiver.
Published: 2023

9. Ultra-Low-Bitrate Speech Coding with Pretrained Transformers

Author: Siahkoohi, Ali, Chinen, Michael, Denton, Tom, Kleijn, W. Bastiaan, and Skoglund, Jan
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech coding facilitates the transmission of speech over low-bandwidth networks with minimal distortion. Neural-network based speech codecs have recently demonstrated significant improvements in quality over traditional approaches. While this new generation of codecs is capable of synthesizing high-fidelity speech, their use of recurrent or convolutional layers often restricts their effective receptive fields, which prevents them from compressing speech efficiently. We propose to further reduce the bitrate of neural speech codecs through the use of pretrained Transformers, capable of exploiting long-range dependencies in the input signal due to their inductive bias. As such, we use a pretrained Transformer in tandem with a convolutional encoder, which is trained end-to-end with a quantizer and a generative adversarial net decoder. Our numerical experiments show that supplementing the convolutional encoder of a neural speech codec with Transformer speech embeddings yields a speech codec with a bitrate of $600\,\mathrm{bps}$ that outperforms the original neural speech codec in synthesized speech quality when trained at the same bitrate. Subjective human evaluations suggest that the quality of the resulting codec is comparable to or better than that of conventional codecs operating at three to four times the rate., Comment: Proceedings of INTERSPEECH 2022
Published: 2022

10. On the Relevance of Bandwidth Extension for Speaker Verification

Author: Faundez-Zanuy, Marcos, Nilsson, Mattias, and Kleijn, W. Bastiaan
Subjects: Computer Science - Sound, Computer Science - Cryptography and Security, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we consider the effect of a bandwidth extension of narrow-band speech signals (0.3-3.4 kHz) to 0.3-8 kHz on speaker verification. Using covariance matrix based verification systems together with detection error trade-off curves, we compare the performance between systems operating on narrow-band, wide-band (0-8 kHz), and bandwidth-extended speech. The experiments were conducted using different short-time spectral parameterizations derived from microphone and ISDN speech databases. The studied bandwidth-extension algorithm did not introduce artifacts that affected the speaker verification task, and we achieved improvements between 1 and 10 percent (depending on the model order) over the verification system designed for narrow-band speech when mel-frequency cepstral coefficients for the short-time spectral parameterization were used., Comment: 4 pages published in 7th International Conference on Spoken Language Processing, September 16-20, 2002, Denver, Colorado, USA. arXiv admin note: text overlap with arXiv:2202.13865
Published: 2022

11. A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning
Abstract: We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter epsilon within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient g_t and its first momentum m_t before using them to estimate the second momentum. The new optimization method is referred to as Aida. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100 while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet. Code is available at this URL, Comment: 10 pages
Published: 2022

12. On the relevance of bandwidth extension for speaker identification

Author: Faundez-Zanuy, Marcos, Nilsson, Mattias, and Kleijn, W. Bastiaan
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper we discuss the relevance of bandwidth extension for speaker identification tasks. In particular, we study whether it is possible to recognize voices that have been bandwidth-extended. For this purpose, we created two different databases (microphonic and ISDN) of speech signals that were bandwidth-extended from telephone bandwidth ([300, 3400] Hz) to full bandwidth ([100, 8000] Hz). We have evaluated different parameterizations, and we have found that the MELCEPST parameterization can take advantage of the bandwidth extension algorithms in several situations., Comment: 4 pages
Published: 2022

13. Extending AdamW by Leveraging Its Second Moment and Magnitude

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control
Abstract: Recent work [4] analyses the local convergence of Adam in a neighbourhood of an optimal solution for a twice-differentiable function. It is found that the learning rate has to be sufficiently small to ensure local stability of the optimal solution. The above convergence results also hold for AdamW. In this work, we propose a new adaptive optimisation method, which we refer to as Aida, by extending AdamW in two aspects with the aim of relaxing the requirement of a small learning rate for local stability. Firstly, we consider tracking the 2nd moment r_t of the pth power of the gradient-magnitudes. The quantity r_t reduces to v_t of AdamW when p=2. Suppose {m_t} is the first moment of AdamW. It is known that the update direction m_{t+1}/(v_{t+1}+epsilon)^0.5 (or m_{t+1}/(v_{t+1}^0.5+epsilon)) of AdamW (or Adam) can be decomposed as the sign vector sign(m_{t+1}) multiplied elementwise by a vector of magnitudes |m_{t+1}|/(v_{t+1}+epsilon)^0.5 (or |m_{t+1}|/(v_{t+1}^0.5+epsilon)). Aida is designed to compute the qth power of the magnitude in the form of |m_{t+1}|^q/(r_{t+1}+epsilon)^(q/p) (or |m_{t+1}|^q/((r_{t+1})^(q/p)+epsilon)), which reduces to that of AdamW when (p,q)=(2,1). Suppose the origin 0 is a local optimal solution of a twice-differentiable function. It is found theoretically that when q>1 and p>1 in Aida, the origin 0 is locally stable only when the weight-decay is non-zero. Experiments are conducted for solving ten toy optimisation problems and training Transformer and Swin-Transformer for two deep learning (DL) tasks. The empirical study demonstrates that in a number of scenarios (including the two DL tasks), Aida with particular setups of (p,q) not equal to (2,1) outperforms the setup (p,q)=(2,1) of AdamW., Comment: 9 pages
Published: 2021

14. Revisiting the Primal-Dual Method of Multipliers for Optimisation over Centralised Networks

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Mathematics - Optimization and Control
Abstract: The primal-dual method of multipliers (PDMM) was originally designed for solving a decomposable optimisation problem over a general network. In this paper, we revisit PDMM for optimisation over a centralised network. We first note that the recently proposed method FedSplit [1] implements PDMM for a centralised network. In [1], Inexact FedSplit (i.e., gradient based FedSplit) was also studied both empirically and theoretically. We identify the cause for the poor reported performance of Inexact FedSplit, which is due to the improper initialisation in the gradient operations at the client side. To fix the issue of Inexact FedSplit, we propose two versions of Inexact PDMM, which are referred to as gradient-based PDMM (GPDMM) and accelerated GPDMM (AGPDMM), respectively. AGPDMM accelerates GPDMM at the cost of transmitting two times the number of parameters from the server to each client per iteration compared to GPDMM. We provide a new convergence bound for GPDMM for a class of convex optimisation problems. Our new bounds are tighter than those derived for Inexact FedSplit. We also investigate the update expressions of AGPDMM and SCAFFOLD to find their similarities. It is found that when the number K of gradient steps at the client side per iteration is K=1, both AGPDMM and SCAFFOLD reduce to vanilla gradient descent with proper parameter setup. Experimental results indicate that AGPDMM converges faster than SCAFFOLD when K>1 while GPDMM converges slightly worse than SCAFFOLD., Comment: 13 pages
Published: 2021

15. Handling Background Noise in Neural Speech Generation

Author: Denton, Tom, Luebs, Alejandro, Lim, Felicia S. C., Storus, Andrew, Yeh, Hengchin, Kleijn, W. Bastiaan, and Skoglund, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Recent advances in neural-network based generative modeling of speech have shown great potential for speech coding. However, the performance of such models drops when the input is not clean speech, e.g., in the presence of background noise, preventing their use in practical applications. In this paper we examine the reason and discuss methods to overcome this issue. Placing a denoising preprocessing stage when extracting features and targeting clean speech during training is shown to be the best-performing strategy., Comment: 5 pages, 3 figures, presented at the Asilomar Conference on Signals, Systems, and Computers 2020
Published: 2021

16. Generative Speech Coding with Predictive Variance Regularization

Author: Kleijn, W. Bastiaan, Storus, Andrew, Chinen, Michael, Denton, Tom, Lim, Felicia S. C., Luebs, Alejandro, Skoglund, Jan, and Yeh, Hengchin
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, 94, I.m
Abstract: The recent emergence of machine-learning based generative models for speech suggests a significant reduction in bit rate for speech codecs is possible. However, the performance of generative models deteriorates significantly with the distortions present in real-world input signals. We argue that this deterioration is due to the sensitivity of the maximum likelihood criterion to outliers and the ineffectiveness of modeling a sum of independent signals with a single autoregressive model. We introduce predictive-variance regularization to reduce the sensitivity to outliers, resulting in a significant increase in performance. We show that noise reduction to remove unwanted signals can significantly increase performance. We provide extensive subjective performance evaluations that show that our system based on generative modeling provides state-of-the-art coding performance at 3 kb/s for real-world speech signals at reasonable computational complexity.
Published: 2021

17. Distributed Network Privacy using Error Correcting Codes

Author: O'Connor, Matt and Kleijn, W. Bastiaan
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Most current distributed processing research deals with improving the flexibility and convergence speed of algorithms for networks of finite size with no constraints on information sharing and no concept for expected levels of signal privacy. In this work we investigate the concept of data privacy in unbounded public networks, where linear codes are used to create hard limits on the number of nodes contributing to a distributed task. We accomplish this by wrapping local observations in a linear code and intentionally applying symbol errors prior to transmission. If many nodes join the distributed task, a proportional number of symbol errors are introduced into the code leading to decoding failure if the code's predefined symbol error limit is exceeded.
Published: 2019

18. Generative Speech Enhancement Based on Cloned Networks

Author: Chinen, Michael, Kleijn, W. Bastiaan, Lim, Felicia S. C., and Skoglund, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We propose to implement speech enhancement by the regeneration of clean speech from a salient representation extracted from the noisy signal. The network that extracts salient features is trained using a set of weight-sharing clones of the extractor network. The clones receive mel-frequency spectra of different noisy versions of the same speech signal as input. By encouraging the outputs of the clones to be similar for these different input signals, we train a feature extractor network that is robust to noise. At inference, the salient features form the input to a WaveNet network that generates a natural and clean speech signal with the same attributes as the ground-truth clean signal. As the signal becomes noisier, our system produces natural-sounding errors that stay on the speech manifold, in place of traditional artifacts found in other systems. Our experiments confirm that our generative enhancement system provides state-of-the-art enhancement performance within the generative class of enhancers according to a MUSHRA-like test. The clone-based system matches or outperforms the other systems at each input signal-to-noise ratio (SNR) range with statistical significance., Comment: Accepted WASPAA 2019
Published: 2019

19. Salient Speech Representations Based on Cloned Networks

Author: Kleijn, W. Bastiaan, Lim, Felicia S. C., Chinen, Michael, and Skoglund, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We define salient features as features that are shared by signals that are defined as being equivalent by a system designer. The definition allows the designer to contribute qualitative information. We aim to find salient features that are useful as conditioning for generative networks. We extract salient features by jointly training a set of clones of an encoder network. Each network clone receives as input a different signal from a set of equivalent signals. The objective function encourages the network clones to map their input into a set of features that is identical across the clones. It additionally encourages feature independence and, optionally, reconstruction of a desired target signal by a decoder. As an application, we train a system that extracts a time-sequence of feature vectors of speech and uses it as a conditioning of a WaveNet generative system, facilitating both coding and enhancement., Comment: Interspeech 2019
Published: 2019

20. The HSIC Bottleneck: Deep Learning without Back-Propagation

Author: Ma, Wan-Duo Kurt, Lewis, J. P., and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to the conventional cross-entropy loss and backpropagation that has a number of distinct advantages. It mitigates exploding and vanishing gradients, resulting in the ability to learn very deep networks without skip connections. There is no requirement for symmetric feedback or update locking. We find that the HSIC bottleneck provides performance on MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) to reformat the information further improves performance.
Published: 2019

21. Room Geometry Estimation from Room Impulse Responses using Convolutional Neural Networks

Author: Yu, Wangyang and Kleijn, W. Bastiaan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We describe a new method to estimate the geometry of a room given room impulse responses. The method utilises convolutional neural networks to estimate the room geometry and uses the mean square error as the loss function. In contrast to existing methods, we do not require the position or distance of sources or receivers in the room. The method can be used with only a single room impulse response between one source and one receiver for room geometry estimation. The proposed estimation method can achieve an average of six centimetre accuracy. In addition, the proposed method is shown to be computationally efficient compared to state-of-the-art methods.
Published: 2019

22. Rapidly Adapting Moment Estimation

Author: Zhang, Guoqiang, Niwa, Kenta, and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. Differently from existing methods, we make use of the most recent first moment of gradients to compute the individual learning rates per iteration. The motivation behind it is that the dynamic variation of the first moment of gradients may provide useful information to obtain the learning rates. We refer to the new method as the rapidly adapting moment estimation (RAME). The theoretical convergence of deterministic RAME is studied by using an analysis similar to the one used in [1] for Adam. Experimental results for training a number of DNNs show promising performance of RAME w.r.t. the convergence speed and generalization performance compared to the stochastic heavy-ball (SHB) method, Adam, and RMSprop., Comment: 11 pages
Published: 2019

23. Kernel Density Estimation-Based Markov Models with Hidden State

Author: Henter, Gustav Eje, Leijon, Arne, and Kleijn, W. Bastiaan
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Signal Processing, Statistics - Machine Learning, 62M10, 62G07, G.3
Abstract: We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes. The KDE Markov models (KDE-MMs) we discuss are nonlinear, nonparametric, fully probabilistic representations of stationary processes, based on techniques with strong asymptotic consistency properties. The models generate new data by concatenating points from the training data sequences in a context-sensitive manner, together with some additive driving noise. We present novel EM-type maximum-likelihood algorithms for data-driven bandwidth selection in KDE-MMs. Additionally, we augment the KDE-MMs with a hidden state, yielding a new model class, KDE-HMMs. The added state variable captures non-Markovian long memory and signal structure (e.g., slow oscillations), complementing the short-range dependences described by the Markov process. The resulting joint Markov and hidden-Markov structure is appealing for modelling complex real-world processes such as speech signals. We present guaranteed-ascent EM-update equations for model parameters in the case of Gaussian kernels, as well as relaxed update formulas that greatly accelerate training in practice. Experiments demonstrate increased held-out set probability for KDE-HMMs on several challenging natural and synthetic data series, compared to traditional techniques such as autoregressive models, HMMs, and their combinations., Comment: 14 pages, 6 figures
Published: 2018

24. Bregman Monotone Operator Splitting

Author: Niwa, Kenta and Kleijn, W. Bastiaan
Subjects: Mathematics - Optimization and Control, 46N10
Abstract: Monotone operator splitting is a powerful paradigm that facilitates parallel processing for optimization problems where the cost function can be split into two convex functions. We propose a generalized form of monotone operator splitting based on Bregman divergence. We show that an appropriate design of the Bregman divergence leads to faster convergence than conventional splitting algorithms. The proposed Bregman monotone operator splitting (B-MOS) is applied to an application to illustrate its effectiveness. B-MOS was found to significantly improve the convergence rate., Comment: 19 pages, 1 figure
Published: 2018

25. Directional emphasis in ambisonics

Author: Kleijn, W. Bastiaan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We describe an ambisonics enhancement method that increases the signal strength in specified directions at low computational cost. The method can be used in a static setup to emphasize the signal arriving from a particular direction or set of directions. It can also be used in an adaptive arrangement where it sharpens directionality and reduces the distortion in timbre associated with low-degree ambisonics representations. The emphasis operator has very low computational complexity and can be applied to time-domain as well as time-frequency ambisonics representations. The operator upscales a low-degree ambisonics representation to a higher degree representation.
Published: 2018

26. Wavenet based low rate speech coding

Author: Kleijn, W. Bastiaan, Lim, Felicia S. C., Luebs, Alejandro, Skoglund, Jan, Stimberg, Florian, Wang, Quan, and Walters, Thomas C.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model., Comment: 5 pages, 2 figures
Published: 2017

27. On Relationship between Primal-Dual Method of Multipliers and Kalman Filter

Author: Zhang, Guoqiang, Kleijn, W. Bastiaan, and Heusdens, Richard
Subjects: Mathematics - Optimization and Control, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Information Theory
Abstract: Recently the primal-dual method of multipliers (PDMM), a novel distributed optimization method, was proposed for solving a general class of decomposable convex optimizations over graphic models. In this work, we first study the convergence properties of PDMM for decomposable quadratic optimizations over tree-structured graphs. We show that with proper parameter selection, PDMM converges to its optimal solution in a finite number of iterations. We then apply PDMM to the causal estimation problem over a statistical linear state-space model. We show that PDMM and the Kalman filter have the same update expressions, where PDMM can be interpreted as solving a sequence of quadratic optimizations over a growing chain graph., Comment: 11 pages
Published: 2017

28. An evaluation of intrusive instrumental intelligibility metrics

Author: Van Kuyk, Steven, Kleijn, W. Bastiaan, and Hendriks, Richard C.
Subjects: Computer Science - Sound
Abstract: Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and $\text{sEPSM}^\text{corr}$. In addition, this paper investigates the ability of intelligibility metrics to generalize to new types of distortions and analyzes why the top performing metrics have high performance. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech that was distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI had the highest performance, achieving an average correlation with listening test scores of $\rho=0.92$ and $\rho=0.89$, respectively. The high performance of SIIB may, in part, be the result of SIIB's developers having access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the advantage of reducing statistical dependencies between input features is demonstrated. Additionally, the paper presents a new version of SIIB called $\text{SIIB}^\text{Gauss}$, which has similar performance to SIIB and HASPI, but takes less time to compute by two orders of magnitude., Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018
Published: 2017
Full Text: View/download PDF

29. An instrumental intelligibility metric based on information theory

Author: Van Kuyk, Steven, Kleijn, W. Bastiaan, and Hendriks, Richard C.
Subjects: Computer Science - Sound
Abstract: We propose a monaural intrusive instrumental intelligibility metric called speech intelligibility in bits (SIIB). SIIB is an estimate of the amount of information shared between a talker and a listener in bits per second. Unlike existing information theoretic intelligibility metrics, SIIB accounts for talker variability and statistical dependencies between time-frequency units. Our evaluation shows that relative to state-of-the-art intelligibility metrics, SIIB is highly correlated with the intelligibility of speech that has been degraded by noise and processed by speech enhancement algorithms., Comment: Published in IEEE Signal Processing Letters
Published: 2017
Full Text: View/download PDF

30. Derivation and Analysis of the Primal-Dual Method of Multipliers Based on Monotone Operator Theory

Author: Sherson, Thomas, Heusdens, Richard, and Kleijn, W. Bastiaan
Subjects: Mathematics - Optimization and Control
Abstract: In this paper we present a novel derivation for an existing node-based algorithm for distributed optimisation termed the primal-dual method of multipliers (PDMM). In contrast to its initial derivation, in this work monotone operator theory is used to connect PDMM with other first-order methods such as Douglas-Rachford splitting and the alternating direction method of multipliers thus providing insight to the operation of the scheme. In particular, we show how PDMM combines a lifted dual form in conjunction with Peaceman-Rachford splitting to remove the need for collaboration between nodes per iteration. We demonstrate sufficient conditions for strong primal convergence for a general class of functions while under the assumption of strong convexity and functional smoothness, we also introduce a primal geometric convergence bound. Finally we introduce a distributed method of parameter selection in the geometric convergent case, requiring only finite transmissions to implement regardless of network topology., Comment: 13 pages, 6 figures
Published: 2017

31. Cross-modal Subspace Learning for Fine-grained Sketch-based Image Retrieval

Author: Xu, Peng, Yin, Qiyue, Huang, Yongye, Song, Yi-Zhe, Ma, Zhanyu, Wang, Liang, Xiang, Tao, Kleijn, W. Bastiaan, and Guo, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Sketch-based image retrieval (SBIR) is challenging due to the inherent domain-gap between sketch and photo. Compared with pixel-perfect depictions of photos, sketches are highly abstract, iconic renderings of the real world. Therefore, matching sketch and photo directly using low-level visual cues is insufficient, since a common low-level subspace that traverses semantically across the two modalities is non-trivial to establish. Most existing SBIR studies do not directly tackle this cross-modal problem. This naturally motivates us to explore the effectiveness of cross-modal retrieval methods in SBIR, which have been applied successfully in image-text matching. In this paper, we introduce and compare a series of state-of-the-art cross-modal subspace learning methods and benchmark them on two recently released fine-grained SBIR datasets. Through thorough examination of the experimental results, we have demonstrated that subspace learning can effectively model the sketch-photo domain-gap. In addition we draw a few key insights to drive future research., Comment: Accepted by Neurocomputing
Published: 2017

32. Training Deep Neural Networks via Optimization Over Graphs

Author: Zhang, Guoqiang and Kleijn, W. Bastiaan
Subjects: Computer Science - Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In this work, we propose to train a deep neural network by distributed optimization over a graph. Two nonlinear functions are considered: the rectified linear unit (ReLU) and a linear unit with both lower and upper cutoffs (DCutLU). The problem reformulation over a graph is realized by explicitly representing ReLU or DCutLU using a set of slack variables. We then apply the alternating direction method of multipliers (ADMM) to update the weights of the network layerwise by solving subproblems of the reformulated problem. Empirical results suggest that the ADMM-based method is less sensitive to overfitting than the stochastic gradient descent (SGD) and Adam methods., Comment: 5 pages
Published: 2017

33. A Practical Online Multichannel Dereverberation Approach with Data-Reuse Technique

Author: Huang, Weilong, primary, Xue, Cheng, additional, Feng, Jinwei, additional, and Kleijn, W. Bastiaan, additional
Published: 2024
Full Text: View/download PDF

34. Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation

Author: Ghifary, Muhammad, Kleijn, W. Bastiaan, Zhang, Mengjie, Balduzzi, David, and Li, Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Learning, Statistics - Machine Learning
Abstract: In this paper, we propose a novel unsupervised domain adaptation algorithm based on deep learning for visual object recognition. Specifically, we design a new model called Deep Reconstruction-Classification Network (DRCN), which jointly learns a shared encoding representation for two tasks: i) supervised classification of labeled source data, and ii) unsupervised reconstruction of unlabeled target data. In this way, the learnt representation not only preserves discriminability, but also encodes useful information from the target domain. Our new DRCN model can be optimized by using backpropagation in the same way as standard neural networks. We evaluate the performance of DRCN on a series of cross-domain object recognition tasks, where DRCN provides a considerable improvement (up to ~8% in accuracy) over the prior state-of-the-art algorithms. Interestingly, we also observe that the reconstruction pipeline of DRCN transforms images from the source domain into images whose appearance resembles the target dataset. This suggests that DRCN's performance is due to constructing a single composite representation that encodes information about both the structure of target images and the classification of source images. Finally, we provide a formal analysis to justify the algorithm's objective in domain adaptation context., Comment: to appear in European Conference on Computer Vision (ECCV) 2016
Published: 2016

35. Scatter Component Analysis: A Unified Framework for Domain Adaptation and Domain Generalization

Author: Ghifary, Muhammad, Balduzzi, David, Kleijn, W. Bastiaan, and Zhang, Mengjie
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Learning, Statistics - Machine Learning, I.2.6, I.4
Abstract: This paper addresses classification tasks on a particular target domain in which labeled training data are only available from source domains different from (but related to) the target. Two closely related frameworks, domain adaptation and domain generalization, are concerned with such tasks, where the only difference between those frameworks is the availability of the unlabeled target data: domain adaptation can leverage unlabeled target information, while domain generalization cannot. We propose Scatter Component Analysis (SCA), a fast representation learning algorithm that can be applied to both domain adaptation and domain generalization. SCA is based on a simple geometrical measure, i.e., scatter, which operates on reproducing kernel Hilbert space. SCA finds a representation that trades between maximizing the separability of classes, minimizing the mismatch between domains, and maximizing the separability of data; each of which is quantified through scatter. The optimization problem of SCA can be reduced to a generalized eigenvalue problem, which results in a fast and exact solution. Comprehensive experiments on benchmark cross-domain object recognition datasets verify that SCA performs much faster than several state-of-the-art algorithms and also provides state-of-the-art classification accuracy in both domain adaptation and domain generalization. We also show that scatter can be used to establish a theoretical generalization bound in the case of domain adaptation., Comment: to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence
Published: 2015

36. Domain Generalization for Object Recognition with Multi-task Autoencoders

Author: Ghifary, Muhammad, Kleijn, W. Bastiaan, Zhang, Mengjie, and Balduzzi, David
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Learning, Statistics - Machine Learning
Abstract: The problem of domain generalization is to take knowledge acquired from a number of related domains where training data is available, and to then successfully apply it to previously unseen domains. We propose a new feature learning algorithm, Multi-Task Autoencoder (MTAE), that provides good generalization performance for cross-domain object recognition. Our algorithm extends the standard denoising autoencoder framework by substituting artificially induced corruption with naturally occurring inter-domain variability in the appearance of objects. Instead of reconstructing images from noisy versions, MTAE learns to transform the original image into analogs in multiple related domains. It thereby learns features that are robust to variations across domains. The learnt features are then used as inputs to a classifier. We evaluated the performance of the algorithm on benchmark image recognition datasets, where the task is to learn features from multiple datasets and to then predict the image label from unseen datasets. We found that (denoising) MTAE outperforms alternative autoencoder-based models as well as the current state-of-the-art algorithms for domain generalization., Comment: accepted in ICCV 2015
Published: 2015

37. Domain Adaptive Neural Networks for Object Recognition

Author: Ghifary, Muhammad, Kleijn, W. Bastiaan, and Zhang, Mengjie
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: We propose a simple neural network model to deal with the domain adaptation problem in object recognition. Our model incorporates the Maximum Mean Discrepancy (MMD) measure as a regularization in the supervised learning to reduce the distribution mismatch between the source and target domains in the latent space. From experiments, we demonstrate that the MMD regularization is an effective tool to provide good domain adaptation models on both SURF features and raw image pixels of a particular image data set. We also show that our proposed model, preceded by the denoising auto-encoder pretraining, achieves better performance than recent benchmark models on the same data sets. This work represents the first study of MMD measure in the context of neural networks.
Published: 2014

38. Optimal Index Assignment for Multiple Description Scalar Quantization

Author: Zhang, Guoqiang, Klejsa, Janusz, and Kleijn, W. Bastiaan
Subjects: Computer Science - Information Theory
Abstract: We provide a method for designing an optimal index assignment for scalar K-description coding. The method stems from a construction of translated scalar lattices, which provides a performance advantage by exploiting a so-called staggered gain. Interestingly, generation of the optimal index assignment is based on a lattice in K-1 dimensional space. The use of the K-1 dimensional lattice facilitates analytic insight into the performance and eliminates the need for a greedy optimization of the index assignment. It is shown that the optimal index assignment is not unique. This is illustrated for the two-description case, where a periodic index assignment is selected from possible optimal assignments and described in detail. The new index assignment is applied to the design of a K-description quantizer, which is found to outperform a reference K-description quantizer at high rates. The performance advantage due to the staggered gain increases with increasing redundancy among the descriptions., Comment: 21 pages, 4 figures, submitted to IEEE Trans. Signal Processing
Published: 2011

39. On Distribution Preserving Quantization

Author: Li, Minyue, Klejsa, Janusz, and Kleijn, W. Bastiaan
Subjects: Computer Science - Information Theory
Abstract: Upon compressing perceptually relevant signals, conventional quantization generally results in unnatural outcomes at low rates. We propose distribution preserving quantization (DPQ) to solve this problem. DPQ is a new quantization concept that confines the probability space of the reconstruction to be identical to that of the source. A distinctive feature of DPQ is that it facilitates a seamless transition between signal synthesis and quantization. A theoretical analysis of DPQ leads to a distribution preserving rate-distortion function (DP-RDF), which serves as a lower bound on the rate of any DPQ scheme, under a constraint on distortion. In general situations, the DP-RDF approaches the classic rate-distortion function for the same source and distortion measure, in the limit of an increasing rate. A practical DPQ scheme based on a multivariate transformation is also proposed. This scheme asymptotically achieves the DP-RDF for i.i.d. Gaussian sources and the mean squared error., Comment: 29 pages, 4 figures, submitted to IEEE Transactions on Information Theory
Published: 2011

40. Bounding the Rate Region of Vector Gaussian Multiple Descriptions with Individual and Central Receivers

Author: Zhang, Guoqiang, Kleijn, W. Bastiaan, and Østergaard, Jan
Subjects: Computer Science - Information Theory
Abstract: In this work, the rate region of the vector Gaussian multiple description problem with individual and central quadratic distortion constraints is studied. In particular, an outer bound to the rate region of the L-description problem is derived. The bound is obtained by lower bounding a weighted sum rate for each supporting hyperplane of the rate region. The key idea is to introduce at most L-1 auxiliary random variables and further impose upon the variables a Markov structure according to the ordering of the description weights. This makes it possible to greatly simplify the derivation of the outer bound. In the scalar Gaussian case, the complete rate region is fully characterized by showing that the outer bound is tight. In this case, the optimal weighted sum rate for each supporting hyperplane is obtained by solving a single maximization problem. This contrasts with existing results, which require solving a min-max optimization problem., Comment: 34 pages, submitted to IEEE Transactions on Information Theory
Published: 2010
Full Text: View/download PDF

41. A Deep Learning Based Fault Diagnosis Method Combining Domain Knowledge and Transfer Learning

Author: Choudhury, Madhurjya Dev, primary, Kleijn, W. Bastiaan, additional, Blincoe, Kelly, additional, and Dhupia, Jaspreet Singh, additional
Published: 2023
Full Text: View/download PDF

42. A High-Rate Extension to Soundstream

Author: Kang, Hong-Goo, primary, Skoglund, Jan, additional, Kleijn, W. Bastiaan, additional, Storus, Andrew, additional, and Yeh, Hengchin, additional
Published: 2023
Full Text: View/download PDF

43. Multi-Channel Audio Signal Generation

Author: Kleijn, W. Bastiaan, primary, Chinen, Michael, additional, Lim, Felicia S. C., additional, and Skoglund, Jan, additional
Published: 2023
Full Text: View/download PDF

44. Neural Optimization Of Geometry And Fixed Beamformer For Linear Microphone Arrays

Author: Yan, Longfei, primary, Huang, Weilong, additional, Kleijn, W. Bastiaan, additional, and Abhayapala, Thushara D., additional
Published: 2023
Full Text: View/download PDF

45. LMCodec: A Low Bitrate Speech Codec with Causal Transformer Models

Author: Jenrungrot, Teerapat, primary, Chinen, Michael, additional, Kleijn, W. Bastiaan, additional, Skoglund, Jan, additional, Borsos, Zalán, additional, Zeghidour, Neil, additional, and Tagliasacchi, Marco, additional
Published: 2023
Full Text: View/download PDF

46. Lookahead Diffusion Probabilistic Models for Refining Mean Estimation

Author: Zhang, Guoqiang, primary, Niwa, Kenta, additional, and Kleijn, W. Bastiaan, additional
Published: 2023
Full Text: View/download PDF

47. CoordiNet: Constrained Dynamics Learning for State Coordination Over Graph

Author: Niwa, Kenta, primary, Ueda, Naonori, additional, Sawada, Hiroshi, additional, Fujino, Akinori, additional, Takeda, Shoichiro, additional, Zhang, Guoqiang, additional, and Kleijn, W. Bastiaan, additional
Published: 2023
Full Text: View/download PDF

48. A method of speech periodicity enhancement using transform-domain signal decomposition

Author: Huang, Feng, Lee, Tan, Kleijn, W. Bastiaan, and Kong, Ying-Yee
Published: 2015
Full Text: View/download PDF

49. A Linear-time Independence Criterion Based on a Finite Basis Approximation

Author: Yan, Longfei, Kleijn, W Bastiaan, and Abhayapala, Thushara D
Published: 2023
Full Text: View/download PDF

50. Dirichlet Process Mixture of Generalized Inverted Dirichlet Distributions for Positive Vector Data With Extended Variational Inference

Author: Ma, Zhanyu, primary, Lai, Yuping, additional, Xie, Jiyang, additional, Meng, Deyu, additional, Kleijn, W. Bastiaan, additional, Guo, Jun, additional, and Yu, Jingyi, additional
Published: 2022
Full Text: View/download PDF
