Author: "Kawai, Hisashi" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kawai, Hisashi"' showing total 27 results

Start Over Author "Kawai, Hisashi" Database arXiv

27 results on '"Kawai, Hisashi"'

1. Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Transferring linguistic knowledge from a pretrained language model (PLM) to an acoustic model has been shown to greatly improve the performance of automatic speech recognition (ASR). However, due to the heterogeneous feature distributions in cross-modalities, designing an effective model for feature alignment and knowledge transfer between linguistic and acoustic sequences remains a challenging task. Optimal transport (OT), which efficiently measures probability distribution discrepancies, holds great potential for aligning and transferring knowledge between acoustic and linguistic modalities. Nonetheless, the original OT treats acoustic and linguistic feature sequences as two unordered sets in alignment and neglects temporal order information during OT coupling estimation. Consequently, a time-consuming pretraining stage is required to learn a good alignment between the acoustic and linguistic representations. In this paper, we propose a Temporal Order Preserved OT (TOT)-based Cross-modal Alignment and Knowledge Transfer (CAKT) (TOT-CAKT) for ASR. In the TOT-CAKT, local neighboring frames of acoustic sequences are smoothly mapped to neighboring regions of linguistic sequences, preserving their temporal order relationship in feature alignment and matching. With the TOT-CAKT model framework, we conduct Mandarin ASR experiments with a pretrained Chinese PLM for linguistic knowledge transfer. Our results demonstrate that the proposed TOT-CAKT significantly improves ASR performance compared to several state-of-the-art models employing linguistic knowledge transfer, and addresses the weaknesses of the original OT-based method in sequential feature alignment for ASR., Comment: Accepted to IEEE SLT 2024
Published: 2024

2. Generative linguistic representation for spoken language identification

Author: Shen, Peng, Lu, Xuguang, and Kawai, Hisashi
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Effective extraction and application of linguistic features are central to the enhancement of spoken Language IDentification (LID) performance. With the success of recent large models, such as GPT and Whisper, the potential to leverage such pre-trained models for extracting linguistic features for LID tasks has become a promising area of research. In this paper, we explore the utilization of the decoder-based network from the Whisper model to extract linguistic features through its generative mechanism for improving the classification accuracy in LID tasks. We devised two strategies - one based on the language embedding method and the other focusing on direct optimization of LID outputs while simultaneously enhancing the speech recognition tasks. We conducted experiments on the large-scale multilingual datasets MLS, VoxLingua107, and CommonVoice to test our approach. The experimental results demonstrated the effectiveness of the proposed method on both in-domain and out-of-domain datasets for LID tasks., Comment: Accepted by IEEE ASRU2023
Published: 2023

3. Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

Author: Shen, Peng, Lu, Xugang, and Kawai, Hisashi
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this paper, to better address these tasks, we first introduce speaker labels into an autoregressive transformer-based speech recognition model to support multi-speaker overlapped speech recognition. Then, to improve speaker diarization, we propose a novel speaker mask branch to detection the speech segments of individual speakers. With the proposed model, we can perform both speech recognition and speaker diarization tasks simultaneously using a single model. Experimental results on the LibriSpeech-based overlapped dataset demonstrate the effectiveness of the proposed method in both speech recognition and speaker diarization tasks, particularly enhancing the accuracy of speaker diarization in relatively complex multi-talker scenarios.
Published: 2023

4. Neural domain alignment for spoken language recognition based on optimal transport

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Domain shift poses a significant challenge in cross-domain spoken language recognition (SLR) by reducing its effectiveness. Unsupervised domain adaptation (UDA) algorithms have been explored to address domain shifts in SLR without relying on class labels in the target domain. One successful UDA approach focuses on learning domain-invariant representations to align feature distributions between domains. However, disregarding the class structure during the learning process of domain-invariant representations can result in over-alignment, negatively impacting the classification task. To overcome this limitation, we propose an optimal transport (OT)-based UDA algorithm for a cross-domain SLR, leveraging the distribution geometry structure-aware property of OT. An OT-based discrepancy measure on a joint distribution over feature and label information is considered during domain alignment in OT-based UDA. Our previous study discovered that completely aligning the distributions between the source and target domains can introduce a negative transfer, where classes or irrelevant classes from the source domain map to a different class in the target domain during distribution alignment. This negative transfer degrades the performance of the adaptive model. To mitigate this issue, we introduce coupling-weighted partial optimal transport (POT) within our UDA framework for SLR, where soft weighting on the OT coupling based on transport cost is adaptively set during domain alignment. A cross-domain SLR task was used in the experiments to evaluate the proposed UDA. The results demonstrated that our proposed UDA algorithm significantly improved the performance over existing UDA algorithms in a cross-channel SLR task.
Published: 2023

5. Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Due to the modality discrepancy between textual and acoustic modeling, efficiently transferring linguistic knowledge from a pretrained language model (PLM) to acoustic encoding for automatic speech recognition (ASR) still remains a challenging task. In this study, we propose a cross-modality knowledge transfer (CMKT) learning framework in a temporal connectionist temporal classification (CTC) based ASR system where hierarchical acoustic alignments with the linguistic representation are applied. Additionally, we propose the use of Sinkhorn attention in cross-modality alignment process, where the transformer attention is a special case of this Sinkhorn attention process. The CMKT learning is supposed to compel the acoustic encoder to encode rich linguistic knowledge for ASR. On the AISHELL-1 dataset, with CTC greedy decoding for inference (without using any language model), we achieved state-of-the-art performance with 3.64% and 3.94% character error rates (CERs) for the development and test sets, which corresponding to relative improvements of 34.18% and 34.88% compared to the baseline CTC-ASR system, respectively., Comment: Submitted to ICASSP 2024
Published: 2023

6. Cross-modal Alignment with Optimal Transport for CTC-based ASR

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Temporal connectionist temporal classification (CTC)-based automatic speech recognition (ASR) is one of the most successful end to end (E2E) ASR frameworks. However, due to the token independence assumption in decoding, an external language model (LM) is required which destroys its fast parallel decoding property. Several studies have been proposed to transfer linguistic knowledge from a pretrained LM (PLM) to the CTC based ASR. Since the PLM is built from text while the acoustic model is trained with speech, a cross-modal alignment is required in order to transfer the context dependent linguistic knowledge from the PLM to acoustic encoding. In this study, we propose a novel cross-modal alignment algorithm based on optimal transport (OT). In the alignment process, a transport coupling matrix is obtained using OT, which is then utilized to transform a latent acoustic representation for matching the context-dependent linguistic features encoded by the PLM. Based on the alignment, the latent acoustic feature is forced to encode context dependent linguistic information. We integrate this latent acoustic feature to build conformer encoder-based CTC ASR system. On the AISHELL-1 data corpus, our system achieved 3.96% and 4.27% character error rate (CER) for dev and test sets, respectively, which corresponds to relative improvements of 28.39% and 29.42% compared to the baseline conformer CTC ASR system without cross-modal knowledge transfer., Comment: Accepted to IEEE ASRU 2023
Published: 2023

7. Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

Author: Shen, Peng, Lu, Xugang, and Kawai, Hisashi
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: For Mandarin end-to-end (E2E) automatic speech recognition (ASR) tasks, compared to character-based modeling units, pronunciation-based modeling units could improve the sharing of modeling units in model training but meet homophone problems. In this study, we propose to use a novel pronunciation-aware unique character encoding for building E2E RNN-T-based Mandarin ASR systems. The proposed encoding is a combination of pronunciation-base syllable and character index (CI). By introducing the CI, the RNN-T model can overcome the homophone problem while utilizing the pronunciation information for extracting modeling units. With the proposed encoding, the model outputs can be converted into the final recognition result through a one-to-one mapping. We conducted experiments on Aishell and MagicData datasets, and the experimental results showed the effectiveness of the proposed method.
Published: 2022

8. Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

Author: Xin, Detai, Takamichi, Shinnosuke, Okamoto, Takuma, Kawai, Hisashi, and Saruwatari, Hiroshi
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A signal resampling method and an image scaling method are implemented in the proposed method to warp the mel-spectrograms or hidden features of the neural vocoder. We also design and open-source a Japanese speech corpus containing three kinds of speaking rates to evaluate the proposed speaking rate control method. Experimental results of comprehensive objective and subjective evaluations demonstrate that 1) the proposed method outperforms a baseline time-scale modification algorithm in speech naturalness, 2) warping mel-spectrograms by image scaling obtained the best performance among all proposed methods, and 3) the proposed speaking rate control method can be incorporated into HiFi-GAN without losing computational efficiency., Comment: submitted to INTERSPEECH 2022
Published: 2022

9. Transducer-based language embedding for spoken language identification

Author: Shen, Peng, Lu, Xugang, and Kawai, Hisashi
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The acoustic and linguistic features are important cues for the spoken language identification (LID) task. Recent advanced LID systems mainly use acoustic features that lack the usage of explicit linguistic feature encoding. In this paper, we propose a novel transducer-based language embedding approach for LID tasks by integrating an RNN transducer model into a language embedding framework. Benefiting from the advantages of the RNN transducer's linguistic representation capability, the proposed method can exploit both phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were carried out on the large-scale multilingual LibriSpeech and VoxLingua107 datasets. Experimental results showed the proposed method significantly improves the performance on LID tasks with 12% to 59% and 16% to 24% relative improvement on in-domain and cross-domain datasets, respectively., Comment: This paper was accepted by Interspeech 2022
Published: 2022

10. Partial Coupling of Optimal Transport for Spoken Language Identification

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language
Abstract: In order to reduce domain discrepancy to improve the performance of cross-domain spoken language identification (SLID) system, as an unsupervised domain adaptation (UDA) method, we have proposed a joint distribution alignment (JDA) model based on optimal transport (OT). A discrepancy measurement based on OT was adopted for JDA between training and test data sets. In our previous study, it was supposed that the training and test sets share the same label space. However, in real applications, the label space of the test set is only a subset of that of the training set. Fully matching training and test domains for distribution alignment may introduce negative domain transfer. In this paper, we propose an JDA model based on partial optimal transport (POT), i.e., only partial couplings of OT are allowed during JDA. Moreover, since the label of test data is unknown, in the POT, a soft weighting on the coupling based on transport cost is adaptively set during domain alignment. Experiments were carried out on a cross-domain SLID task to evaluate the proposed UDA. Results showed that our proposed UDA significantly improved the performance due to the consideration of the partial couplings in OT., Comment: This work was submitted to INTERSPEECH 2022
Published: 2022

11. Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Generative probability models are widely used for speaker verification (SV). However, the generative models are lack of discriminative feature selection ability. As a hypothesis test, the SV can be regarded as a binary classification task which can be designed as a Siamese neural network (SiamNN) with discriminative training. However, in most of the discriminative training for SiamNN, only the distribution of pair-wised sample distances is considered, and the additional discriminative information in joint distribution of samples is ignored. In this paper, we propose a novel SiamNN with consideration of the joint distribution of samples. The joint distribution of samples is first formulated based on a joint Bayesian (JB) based generative model, then a SiamNN is designed with dense layers to approximate the factorized affine transforms as used in the JB model. By initializing the SiamNN with the learned model parameters of the JB model, we further train the model parameters with the pair-wised samples as a binary discrimination task for SV. We carried out SV experiments on data corpus of speakers in the wild (SITW) and VoxCeleb. Experimental results showed that our proposed model improved the performance with a large margin compared with state of the art models for SV., Comment: arXiv admin note: substantial text overlap with arXiv:2101.03329
Published: 2021

12. CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

Author: Magassouba, Aly, Sugiura, Komei, and Kawai, Hisashi
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Navigation guided by natural language instructions is particularly suitable for Domestic Service Robots that interacts naturally with users. This task involves the prediction of a sequence of actions that leads to a specified destination given a natural language navigation instruction. The task thus requires the understanding of instructions, such as ``Walk out of the bathroom and wait on the stairs that are on the right''. The Visual and Language Navigation remains challenging, notably because it requires the exploration of the environment and at the accurate following of a path specified by the instructions to model the relationship between language and vision. To address this, we propose the CrossMap Transformer network, which encodes the linguistic and visual features to sequentially generate a path. The CrossMap transformer is tied to a Transformer-based speaker that generates navigation instructions. The two networks share common latent features, for mutual enhancement through a double back translation model: Generated paths are translated into instructions while generated instructions are translated into path The experimental results show the benefits of our approach in terms of instruction understanding and instruction generation., Comment: 8 pages, 5 figures, 5 tables. Submitted to IEEE Robotics and Automation Letters
Published: 2021

13. Predicting and Attending to Damaging Collisions for Placing Everyday Objects in Photo-Realistic Simulations

Author: Magassouba, Aly, Sugiura, Komei, Nakayama, Angelica, Hirakawa, Tsubasa, Yamashita, Takayoshi, Fujiyoshi, Hironobu, and Kawai, Hisashi
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Placing objects is a fundamental task for domestic service robots (DSRs). Thus, inferring the collision-risk before a placing motion is crucial for achieving the requested task. This problem is particularly challenging because it is necessary to predict what happens if an object is placed in a cluttered designated area. We show that a rule-based approach that uses plane detection, to detect free areas, performs poorly. To address this, we develop PonNet, which has multimodal attention branches and a self-attention mechanism to predict damaging collisions, based on RGBD images. Our method can visualize the risk of damaging collisions, which is convenient because it enables the user to understand the risk. For this purpose, we build and publish an original dataset that contains 12,000 photo-realistic images of specific placing areas, with daily life objects, in home environments. The experimental results show that our approach improves accuracy compared with the baseline methods., Comment: 18 pages, 7 figures, 5 tables. Submitted to Advanced Robotics
Published: 2021

14. Coupling a generative model with a discriminative learning framework for speaker verification

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The speaker verification (SV) task is to decide whether an utterance is spoken by a target or an imposter speaker. For most studies, a log-likelihood ratio (LLR) score is estimated based on a generative probability model on speaker features and compared with a threshold for making a decision. However, the generative model usually focuses on individual feature distributions, does not have the discriminative feature selection ability, and is easy to be distracted by nuisance features. The SV could be formulated as a binary discrimination task where neural network-based discriminative learning could be applied. In discriminative learning, the nuisance features could be removed with the help of label supervision. However, discriminative learning pays more attention to classification boundaries and is prone to overfitting to a training set which may result in bad generalization on a test set. Thus, we propose a hybrid learning framework, i.e., coupling a joint Bayesian (JB) generative model structure and parameters with a neural discriminative learning framework for SV. A two-branch Siamese neural network is built with dense layers that are coupled with factorized affine transforms as used in the JB model. The LLR score estimation in the JB model is formulated according to the distance metric in the discriminative learning framework. By initializing the two-branch neural network with the generatively learned model parameters of the JB model, we train the model parameters with the pairwise samples as a binary discrimination task. Moreover, a direct evaluation metric in SV based on minimum empirical Bayes risk is designed and integrated as an objective function in discriminative learning. We carried out SV experiments on Speakers in the wild and Voxceleb. Experimental results showed that our proposed model improved the performance with a large margin compared with state-of-art models for SV.
Published: 2021

15. Unsupervised neural adaptation model based on optimal transport for spoken language identification

Author: Lu, Xugang, Shen, Peng, Tsao, Yu, and Kawai, Hisashi
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Due to the mismatch of statistical distributions of acoustic speech between training and testing sets, the performance of spoken language identification (SLID) could be drastically degraded. In this paper, we propose an unsupervised neural adaptation model to deal with the distribution mismatch problem for SLID. In our model, we explicitly formulate the adaptation as to reduce the distribution discrepancy on both feature and classifier for training and testing data sets. Moreover, inspired by the strong power of the optimal transport (OT) to measure distribution discrepancy, a Wasserstein distance metric is designed in the adaptation loss. By minimizing the classification loss on the training data set with the adaptation loss on both training and testing data sets, the statistical distribution difference between training and testing domains is reduced. We carried out SLID experiments on the oriental language recognition (OLR) challenge data corpus where the training and testing data sets were collected from different conditions. Our results showed that significant improvements were achieved on the cross domain test tasks.
Published: 2020

16. Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

Author: Wu, Yi-Chiao, Hayashi, Tomoki, Okamoto, Takuma, Kawai, Hisashi, and Toda, Tomoki
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real time because of its compact model and non-autoregressive (non-AR) and non-causal mechanisms. Although PWG achieves high-fidelity speech generation, the generic and simple network architecture lacks pitch controllability for an unseen auxiliary fundamental frequency ($F_{0}$) feature such as a scaled $F_{0}$. To improve the pitch controllability and speech modeling capability, we apply a QP structure with PDCNNs to PWG, which introduces pitch information to the network by dynamically changing the network architecture corresponding to the auxiliary $F_{0}$ feature. Both objective and subjective experimental results show that QPPWG outperforms PWG when the auxiliary $F_{0}$ feature is scaled. Moreover, analyses of the intermediate outputs of QPPWG also show better tractability and interpretability of QPPWG, which respectively models spectral and excitation-like signals using the cascaded fixed and adaptive blocks of the QP structure., Comment: 15 pages, 10 figures, 8 tables
Published: 2020
Full Text: View/download PDF

17. Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder Network

Author: Ogura, Tadashi, Magassouba, Aly, Sugiura, Komei, Hirakawa, Tsubasa, Yamashita, Takayoshi, Fujiyoshi, Hironobu, and Kawai, Hisashi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Domestic service robots (DSRs) are a promising solution to the shortage of home care workers. However, one of the main limitations of DSRs is their inability to interact naturally through language. Recently, data-driven approaches have been shown to be effective for tackling this limitation; however, they often require large-scale datasets, which is costly. Based on this background, we aim to perform automatic sentence generation of fetching instructions: for example, "Bring me a green tea bottle on the table." This is particularly challenging because appropriate expressions depend on the target object, as well as its surroundings. In this paper, we propose the attention branch encoder--decoder network (ABEN), to generate sentences from visual inputs. Unlike other approaches, the ABEN has multimodal attention branches that use subword-level attention and generate sentences based on subword embeddings. In experiments, we compared the ABEN with a baseline method using four standard metrics in image captioning. Results show that the ABEN outperformed the baseline in terms of these metrics., Comment: 9 pages, 8 figures. accepted for IEEE Robotics and Automation Letters (RA-L) with presentation at IROS 2020
Published: 2020

18. Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

Author: Wu, Yi-Chiao, Hayashi, Tomoki, Okamoto, Takuma, Kawai, Hisashi, and Toda, Tomoki
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generative speed is much faster than real time. While utilizing PWG as a vocoder to generate speech on the basis of acoustic features such as spectral and prosodic features, PWG generates high-fidelity speech. However, when the input acoustic features include unseen pitches, the pitch accuracy of PWG-generated speech degrades because of the fixed and generic network of PWG without prior knowledge of speech periodicity. The proposed QPPWG adopts a pitch-dependent dilated convolution network (PDCNN) module, which introduces the pitch information into PWG via the dynamically changed network architecture, to improve the pitch controllability and speech modeling capability of vanilla PWG. Both objective and subjective evaluation results show the higher pitch accuracy and comparable speech quality of QPPWG-generated speech when the QPPWG model size is only 70 % of that of vanilla PWG., Comment: 5 page, 6 figures, 2 tables. Proc. Interspeech, 2020
Published: 2020

19. Cross-scale Attention Model for Acoustic Event Classification

Author: Lu, Xugang, Shen, Peng, Li, Sheng, Tsao, Yu, and Kawai, Hisashi
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: A major advantage of a deep convolutional neural network (CNN) is that the focused receptive field size is increased by stacking multiple convolutional layers. Accordingly, the model can explore the long-range dependency of features from the top layers. However, a potential limitation of the network is that the discriminative features from the bottom layers (which can model the short-range dependency) are smoothed out in the final representation. This limitation is especially evident in the acoustic event classification (AEC) task, where both short- and long-duration events are involved in an audio clip and needed to be classified. In this paper, we propose a cross-scale attention (CSA) model, which explicitly integrates features from different scales to form the final representation. Moreover, we propose the adoption of the attention mechanism to specify the weights of local and global features based on the spatial and temporal characteristics of acoustic events. Using mathematic formulations, we further reveal that the proposed CSA model can be regarded as a weighted residual CNN (ResCNN) model when the ResCNN is used as a backbone model. We tested the proposed model on two AEC datasets: one is an urban AEC task, and the other is an AEC task in smart car environments. Experimental results show that the proposed CSA model can effectively improve the performance of current state-of-the-art deep learning algorithms.
Published: 2019

20. A Multimodal Target-Source Classifier with Attention Branches to Understand Ambiguous Instructions for Fetching Daily Objects

Author: Magassouba, Aly, Sugiura, Komei, and Kawai, Hisashi
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: In this study, we focus on multimodal language understanding for fetching instructions in the domestic service robots context. This task consists of predicting a target object, as instructed by the user, given an image and an unstructured sentence, such as "Bring me the yellow box (from the wooden cabinet)." This is challenging because of the ambiguity of natural language, i.e., the relevant information may be missing or there might be several candidates. To solve such a task, we propose the multimodal target-source classifier model with attention branches (MTCM-AB), which is an extension of the MTCM. Our methodology uses the attention branch network (ABN) to develop a multimodal attention mechanism based on linguistic and visual inputs. Experimental validation using a standard dataset showed that the MTCM-AB outperformed both state-of-the-art methods and the MTCM. In particular the MTCM-AB accuracy on average was 90.1% while human performance was 90.3% on the PFN-PIC dataset., Comment: 9 pages, 5 figures, accepted for IEEE Robotics and Automation Letters
Published: 2019

21. Multimodal Attention Branch Network for Perspective-Free Sentence Generation

Author: Magassouba, Aly, Sugiura, Komei, and Kawai, Hisashi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics
Abstract: In this paper, we address the automatic sentence generation of fetching instructions for domestic service robots. Typical fetching commands such as "bring me the yellow toy from the upper part of the white shelf" includes referring expressions, i.e., "from the white upper part of the white shelf". To solve this task, we propose a multimodal attention branch network (Multi-ABN) which generates natural sentences in an end-to-end manner. Multi-ABN uses multiple images of the same fixed scene to generate sentences that are not tied to a particular viewpoint. This approach combines a linguistic attention branch mechanism with several attention branch mechanisms. We evaluated our approach, which outperforms the state-of-the-art method on a standard metrics. Our method also allows us to visualize the alignment between the linguistic input and the visual features., Comment: 10 pages, 4 figures. Accepted for CoRL 2019
Published: 2019

22. Understanding Natural Language Instructions for Fetching Daily Objects Using GAN-Based Multimodal Target-Source Classification

Author: Magassouba, Aly, Sugiura, Komei, Quoc, Anh Trinh, and Kawai, Hisashi
Subjects: Computer Science - Robotics
Abstract: In this paper, we address multimodal language understanding for unconstrained fetching instruction in domestic service robots context. A typical fetching instruction such as "Bring me the yellow toy from the white shelf" requires to infer the user intention, that is what object (target) to fetch and from where (source). To solve the task, we propose a Multimodal Target-source Classifier Model (MTCM), which predicts the region-wise likelihood of target and source candidates in the scene. Unlike other methods, MTCM can handle regionwise classification based on linguistic and visual features. We evaluated our approach that outperformed the state-of-the-art method on a standard data set. In addition, we extended MTCM with Generative Adversarial Nets (MTCM-GAN), and enabled simultaneous data augmentation and classification., Comment: 8 pages, 6 figures, 5 tables. Accepted for RA-L with presentation at IROS 2019
Published: 2019

23. Incorporating Symbolic Sequential Modeling for Speech Enhancement

Author: Liao, Chien-Feng, Tsao, Yu, Lu, Xugang, and Kawai, Hisashi
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: In a noisy environment, a lossy speech signal can be automatically restored by a listener if he/she knows the language well. That is, with the built-in knowledge of a "language model", a listener may effectively suppress noise interference and retrieve the target speech signals. Accordingly, we argue that familiarity with the underlying linguistic content of spoken utterances benefits speech enhancement (SE) in noisy environments. In this study, in addition to the conventional modeling for learning the acoustic noisy-clean speech mapping, an abstract symbolic sequential modeling is incorporated into the SE framework. This symbolic sequential modeling can be regarded as a "linguistic constraint" in learning the acoustic noisy-clean speech mapping function. In this study, the symbolic sequences for acoustic signals are obtained as discrete representations with a Vector Quantized Variational Autoencoder algorithm. The obtained symbols are able to capture high-level phoneme-like content from speech signals. The experimental results demonstrate that the proposed framework can obtain notable performance improvement in terms of perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) on the TIMIT dataset., Comment: Accepted to Interspeech 2019
Published: 2019

24. A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions

Author: Magassouba, Aly, Sugiura, Komei, and Kawai, Hisashi
Subjects: Computer Science - Robotics, Computer Science - Computation and Language
Abstract: This paper focuses on a multimodal language understanding method for carry-and-place tasks with domestic service robots. We address the case of ambiguous instructions, that is, when the target area is not specified. For instance "put away the milk and cereal" is a natural instruction where there is ambiguity regarding the target area, considering environments in daily life. Conventionally, this instruction can be disambiguated from a dialogue system, but at the cost of time and cumbersome interaction. Instead, we propose a multimodal approach, in which the instructions are disambiguated using the robot's state and environment context. We develop the Multi-Modal Classifier Generative Adversarial Network (MMC-GAN) to predict the likelihood of different target areas considering the robot's physical limitation and the target clutter. Our approach, MMC-GAN, significantly improves accuracy compared with baseline methods that use instructions only or simple deep neural networks., Comment: 9 pages, 7 figures, accepted for IEEE Robotics and Automation Letters (RA-L)
Published: 2018

25. Grounded Language Understanding for Manipulation Instructions Using GAN-Based Classification

Author: Sugiura, Komei and Kawai, Hisashi
Subjects: Computer Science - Robotics, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction
Abstract: The target task of this study is grounded language understanding for domestic service robots (DSRs). In particular, we focus on instruction understanding for short sentences where verbs are missing. This task is of critical importance to build communicative DSRs because manipulation is essential for DSRs. Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, whether the predicted action is physically executable or not was unclear. In this paper, we present a grounded instruction understanding method to estimate appropriate objects given an instruction and situation. We extend the Generative Adversarial Nets (GAN) and build a GAN-based classifier using latent representations. To quantitatively evaluate the proposed method, we have developed a data set based on the standard data set used for Visual QA. Experimental results have shown that the proposed method gives the better result than baseline methods., Comment: 6 pages, 3 figures, published at IEEE ASRU 2017
Published: 2018

26. End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks

Author: Fu, Szu-Wei, Wang, Tao-Wei, Tsao, Yu, Lu, Xugang, and Kawai, Hisashi
Subjects: Statistics - Machine Learning, Computer Science - Learning, Computer Science - Sound
Abstract: Speech enhancement model is used to map a noisy speech to a clean speech. In the training stage, an objective function is often adopted to optimize the model parameters. However, in most studies, there is an inconsistency between the model optimization criterion and the evaluation criterion on the enhanced speech. For example, in measuring speech intelligibility, most of the evaluation metric is based on a short-time objective intelligibility (STOI) measure, while the frame based minimum mean square error (MMSE) between estimated and clean speech is widely used in optimizing the model. Due to the inconsistency, there is no guarantee that the trained model can provide optimal performance in applications. In this study, we propose an end-to-end utterance-based speech enhancement framework using fully convolutional neural networks (FCN) to reduce the gap between the model optimization and evaluation criterion. Because of the utterance-based optimization, temporal correlation information of long speech segments, or even at the entire utterance level, can be considered when perception-based objective functions are used for the direct optimization. As an example, we implement the proposed FCN enhancement framework to optimize the STOI measure. Experimental results show that the STOI of test speech is better than conventional MMSE-optimized speech due to the consistency between the training and evaluation target. Moreover, by integrating the STOI in model optimization, the intelligibility of human subjects and automatic speech recognition (ASR) system on the enhanced speech is also substantially improved compared to those generated by the MMSE criterion., Comment: Accepted in IEEE Transactions on Audio, Speech and Language Processing (TASLP)
Published: 2017

27. Raw Waveform-based Speech Enhancement by Fully Convolutional Networks

Author: Fu, Szu-Wei, Tsao, Yu, Lu, Xugang, and Kawai, Hisashi
Subjects: Statistics - Machine Learning, Computer Science - Learning, Computer Science - Sound
Abstract: This study proposes a fully convolutional network (FCN) model for raw waveform-based speech enhancement. The proposed system performs speech enhancement in an end-to-end (i.e., waveform-in and waveform-out) manner, which dif-fers from most existing denoising methods that process the magnitude spectrum (e.g., log power spectrum (LPS)) only. Because the fully connected layers, which are involved in deep neural networks (DNN) and convolutional neural networks (CNN), may not accurately characterize the local information of speech signals, particularly with high frequency components, we employed fully convolutional layers to model the waveform. More specifically, FCN consists of only convolutional layers and thus the local temporal structures of speech signals can be efficiently and effectively preserved with relatively few weights. Experimental results show that DNN- and CNN-based models have limited capability to restore high frequency components of waveforms, thus leading to decreased intelligibility of enhanced speech. By contrast, the proposed FCN model can not only effectively recover the waveforms but also outperform the LPS-based DNN baseline in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ). In addition, the number of model parameters in FCN is approximately only 0.2% compared with that in both DNN and CNN.
Published: 2017

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

27 results on '"Kawai, Hisashi"'

1. Temporal Order Preserved Optimal Transport-based Cross-modal Knowledge Transfer Learning for ASR

2. Generative linguistic representation for spoken language identification

3. Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition

4. Neural domain alignment for spoken language recognition based on optimal transport

5. Hierarchical Cross-Modality Knowledge Transfer with Sinkhorn Attention for CTC-based ASR

6. Cross-modal Alignment with Optimal Transport for CTC-based ASR

7. Pronunciation-aware unique character encoding for RNN Transducer-based Mandarin speech recognition

8. Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

9. Transducer-based language embedding for spoken language identification

10. Partial Coupling of Optimal Transport for Spoken Language Identification

11. Siamese Neural Network with Joint Bayesian Model Structure for Speaker Verification

12. CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation

13. Predicting and Attending to Damaging Collisions for Placing Everyday Objects in Photo-Realistic Simulations

14. Coupling a generative model with a discriminative learning framework for speaker verification

15. Unsupervised neural adaptation model based on optimal transport for spoken language identification

16. Quasi-Periodic Parallel WaveGAN: A Non-autoregressive Raw Waveform Generative Model with Pitch-dependent Dilated Convolution Neural Network

17. Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder Network

18. Quasi-Periodic Parallel WaveGAN Vocoder: A Non-autoregressive Pitch-dependent Dilated Convolution Model for Parametric Speech Generation

19. Cross-scale Attention Model for Acoustic Event Classification

20. A Multimodal Target-Source Classifier with Attention Branches to Understand Ambiguous Instructions for Fetching Daily Objects

21. Multimodal Attention Branch Network for Perspective-Free Sentence Generation

22. Understanding Natural Language Instructions for Fetching Daily Objects Using GAN-Based Multimodal Target-Source Classification

23. Incorporating Symbolic Sequential Modeling for Speech Enhancement

24. A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions

25. Grounded Language Understanding for Manipulation Instructions Using GAN-Based Classification

26. End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks

27. Raw Waveform-based Speech Enhancement by Fully Convolutional Networks

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Publication Type

Database

27 results on '"Kawai, Hisashi"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources