Author: "Araki, Shoko" - Searchworks@Jio Institute Digital Library Search Results

Your search for author "Araki, Shoko" returned 383 results.

1. Mamba-based Segmentation Model for Speaker Diarization

Author: Plaquet, Alexis, Tawara, Naohiro, Delcroix, Marc, Horiguchi, Shota, Ando, Atsushi, and Araki, Shoko
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba's stronger processing capabilities allow usage of longer local windows, which significantly improve diarization quality by making the speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both traditional RNN and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.
Comment: 5 pages, 4 figures. Submitted to ICASSP 2025. Code at https://github.com/nttcslab-sp/mamba-diarization
Published: 2024

2. SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model

Author: Hernandez-Olivan, Carlos, Delcroix, Marc, Ochiai, Tsubasa, Niizumi, Daisuke, Tawara, Naohiro, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system requires solving two problems at once, identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound. The duality of the problem and the wide variety of sounds make it challenging to train a powerful TSE system from scratch. In this paper, to tackle this problem, we explore using a pre-trained audio foundation model that can provide rich feature representations of sounds within a TSE system. We chose the masked-modeling duo (M2D) foundation model, which appears especially suited for the TSE task, as it is trained using a dual objective consisting of sound-label predictions and improved masked prediction. These objectives are related to sound identification and the signal extraction problems of TSE. We propose a new TSE system that integrates the feature representation from M2D into SoundBeam, which is a strong TSE system that can exploit both target sound class labels and pre-recorded enrollments (or audio queries) as clues. We show experimentally that using M2D can increase extraction performance, especially when employing enrollment clues.
Published: 2024

3. NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

Author: Kamo, Naoyuki, Tawara, Naohiro, Ando, Atsushi, Kano, Takatomo, Sato, Hiroshi, Ikeshita, Rintaro, Moriya, Takafumi, Horiguchi, Shota, Matsuura, Kohei, Ogawa, Atsunori, Plaquet, Alexis, Ashihara, Takanori, Ochiai, Tsubasa, Mimura, Masato, Delcroix, Marc, Nakatani, Tomohiro, Asami, Taichi, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.3% on the dev set, which is a 57% relative improvement over the baseline.
Comment: 5 pages, 4 figures, CHiME8 challenge
Published: 2024

4. Interaural time difference loss for binaural target sound extraction

Author: Hernandez-Olivan, Carlos, Delcroix, Marc, Ochiai, Tsubasa, Tawara, Naohiro, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. Indeed, for many applications, the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network trained to output only the desired sound given a binaural mixture and an embedding characterizing the desired sound class as inputs. Conventional TSE systems are trained using signal-level losses, which measure the difference between the extracted and reference signals for the left and right channels. In this paper, we propose adding explicit spatial losses to better preserve the spatial cues of the target sound. In particular, we explore losses aiming at preserving the interaural level (ILD), phase (IPD), and time differences (ITD). We show experimentally that adding such spatial losses, particularly our newly proposed ITD loss, helps better preserve spatial cues while maintaining the signal-level metrics.
Comment: Accepted in the International Workshop on Acoustic Signal Enhancement (IWAENC 2024)
Published: 2024
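
Illustrative note (not taken from the paper): the interaural cues named in the abstract can be measured on a pair of left/right signals as sketched below. The function names, the circular cross-correlation, and the hard argmax are assumptions made for this sketch. The argmax is not differentiable, so this measures cue mismatch and is not the ITD training loss the authors propose.

```python
import numpy as np

def ild_db(left, right, eps=1e-12):
    """Interaural level difference in dB: left energy over right energy."""
    return 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))

def itd_samples(left, right, max_lag):
    """Lag (in samples) by which `right` trails `left`.

    Uses circular cross-correlation for simplicity (an assumption of this sketch).
    """
    lags = np.arange(-max_lag, max_lag + 1)
    # np.roll(right, -lag)[n] == right[n + lag], so each entry is
    # sum_n left[n] * right[n + lag].
    corr = [np.dot(left, np.roll(right, -lag)) for lag in lags]
    return int(lags[int(np.argmax(corr))])

def spatial_cue_errors(est_l, est_r, ref_l, ref_r, max_lag=32):
    """Absolute ILD (dB) and ITD (samples) mismatch between estimate and reference."""
    ild_err = abs(ild_db(est_l, est_r) - ild_db(ref_l, ref_r))
    itd_err = abs(itd_samples(est_l, est_r, max_lag) - itd_samples(ref_l, ref_r, max_lag))
    return ild_err, itd_err
```

In a training setup, errors of this kind would be added to the usual signal-level loss with some weighting; the weighting and the differentiable form are left open here.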

5. Rethinking Processing Distortions: Disentangling the Impact of Speech Enhancement Errors on Speech Recognition Performance

Author: Ochiai, Tsubasa, Iwamoto, Kazuma, Delcroix, Marc, Ikeshita, Rintaro, Sato, Hiroshi, Araki, Shoko, and Katagiri, Shigeru
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
Comment: 13 pages, 6 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing
Published: 2024
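
Illustrative note (not taken from the paper): the observation adding (OA) post-processing described in the abstract interpolates the enhanced and observed signals. A minimal sketch follows, assuming time-aligned waveforms of equal length and a single scalar interpolation weight. The function and parameter names are hypothetical.

```python
import numpy as np

def observation_adding(enhanced, observed, weight):
    """Interpolate between the enhanced signal and the noisy observation.

    weight = 0 returns the enhanced signal unchanged; weight = 1 returns the
    observation. Both arrays are assumed to be time-aligned.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    enhanced = np.asarray(enhanced, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if enhanced.shape != observed.shape:
        raise ValueError("signals must have the same shape")
    return (1.0 - weight) * enhanced + weight * observed
```

A larger weight reintroduces more of the original noise while diluting the artifacts added by the front-end, which is the trade-off the abstract refers to.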

6. Probing Self-supervised Learning Models with Target Speech Extraction

Author: Peng, Junyi, Delcroix, Marc, Ochiai, Tsubasa, Plchot, Oldrich, Ashihara, Takanori, Araki, Shoko, and Cernocky, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract its speech from the mixture. Experimental results on the Libri2mix datasets reveal the relevance of the TSE downstream task to probe SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.
Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop
Published: 2024

7. Target Speech Extraction with Pre-trained Self-supervised Learning Models

Author: Peng, Junyi, Delcroix, Marc, Ochiai, Tsubasa, Plchot, Oldrich, Araki, Shoko, and Cernocky, Jan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state-of-the-art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model including the SSL model parameters.
Comment: Accepted to ICASSP 2024
Published: 2024

8. Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

Author: Tammen, Marvin, Ochiai, Tsubasa, Delcroix, Marc, Nakatani, Tomohiro, Araki, Shoko, and Doclo, Simon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to specific microphone arrays, necessitating a different model for varying channel permutations, numbers, or geometries. To improve the robustness of the ASA module against such variations, in this paper we investigate three approaches: training with random channel configurations, employing the transform-average-concatenate method to process multi-channel input features, and utilizing robust input features. Our experiments on the CHiME-3 and DEMAND datasets show that these approaches enable the ASA-augmented beamformer to track moving speakers across different microphone arrays unseen in training.
Comment: Accepted at Interspeech 2024
Published: 2024

9. Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

Author: Ogawa, Atsunori, Tawara, Naohiro, Delcroix, Marc, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with two different random initialization seeds. We combine these NLMs through iterative lattice generation. Since these NLMs work complementarily with each other, by combining them one by one at each rescoring iteration, language scores attached to given lattice arcs can be gradually refined. Consequently, errors of the ASR hypotheses can be gradually reduced. We also investigate the effectiveness of carrying over contextual information (previous rescoring results) across a lattice sequence of a long speech such as a lecture speech. In experiments using a lecture speech corpus, by combining the eight NLMs and using context carry-over, we obtained a 24.4% relative word error rate reduction from the ASR 1-best baseline. For further comparison, we performed simultaneous (i.e., non-iterative) NLM combination and 100-best rescoring using the large ensemble of NLMs, which confirmed the advantage of lattice rescoring with iterative NLM combination.
Comment: Accepted to ICASSP 2022
Published: 2023

10. How does end-to-end speech recognition training impact speech enhancement artifacts?

Author: Iwamoto, Kazuma, Ochiai, Tsubasa, Delcroix, Marc, Ikeshita, Rintaro, Sato, Hiroshi, Araki, Shoko, and Katagiri, Shigeru
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Jointly training a speech enhancement (SE) front-end and an automatic speech recognition (ASR) back-end has been investigated as a way to mitigate the influence of processing distortion generated by single-channel SE on ASR. In this paper, we investigate the effect of such joint training on the signal-level characteristics of the enhanced signals from the viewpoint of the decomposed noise and artifact errors. The experimental analyses provide two novel findings: 1) ASR-level training of the SE front-end reduces the artifact errors while increasing the noise errors, and 2) simply interpolating the enhanced and observed signals, which achieves a similar effect of reducing artifacts and increasing noise, improves ASR performance without jointly modifying the SE and ASR modules, even for a strong ASR back-end using a WavLM feature extractor. Our findings provide a better understanding of the effect of joint training and a novel insight for designing an ASR-agnostic SE front-end.
Comment: 5 pages, 1 figure, 1 table
Published: 2023

11. Neural network-based virtual microphone estimation with virtual microphone and beamformer-level multi-task loss

Author: Segawa, Hanako, Ochiai, Tsubasa, Delcroix, Marc, Nakatani, Tomohiro, Ikeshita, Rintaro, Araki, Shoko, Yamada, Takeshi, and Makino, Shoji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Array processing performance depends on the number of microphones available. Virtual microphone estimation (VME) has been proposed to increase the number of microphone signals artificially. Neural network-based VME (NN-VME) trains an NN with a VM-level loss to predict a signal at a microphone location that is available during training but not at inference. However, this training objective may not be optimal for a specific array processing back-end, such as beamforming. An alternative approach is to use a training objective considering the array-processing back-end, such as a loss on the beamformer output. This approach may generate signals optimal for beamforming but not physically grounded. To combine the advantages of both approaches, this paper proposes a multi-task loss for NN-VME that combines both VM-level and beamformer-level losses. We evaluate the proposed multi-task NN-VME on multi-talker underdetermined conditions and show that it achieves a 33.1% relative WER improvement compared to using only real microphones and 10.8% compared to using a prior NN-VME approach.
Comment: 5 pages, 2 figures, 1 table
Published: 2023

12. DOA-informed switching independent vector extraction and beamforming for speech enhancement in underdetermined situations

Author: Ueda, Tetsuya, Nakatani, Tomohiro, Ikeshita, Rintaro, Araki, Shoko, and Makino, Shoji
Published: 2024

13. Modified Parametric Multichannel Wiener Filter for Low-latency Enhancement of Speech Mixtures with Unknown Number of Speakers

Author: Guo, Ning, Nakatani, Tomohiro, Araki, Shoko, and Moriya, Takehiro
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: This paper introduces a novel low-latency online beamforming (BF) algorithm, named Modified Parametric Multichannel Wiener Filter (Mod-PMWF), for enhancing speech mixtures with an unknown and varying number of speakers. Although conventional BFs such as linearly constrained minimum variance BF (LCMV BF) can enhance a speech mixture, they typically require such attributes of the speech mixture as the number of speakers and the acoustic transfer functions (ATFs) from the speakers to the microphones. When the mixture attributes are unavailable, estimating them by low-latency processing is challenging, hindering the application of the BFs to the problem. In this paper, we overcome this problem by modifying a conventional Parametric Multichannel Wiener Filter (PMWF). The proposed Mod-PMWF can adaptively form a directivity pattern that enhances all the speakers in the mixture without explicitly estimating these attributes. Our experiments show the proposed BF's effectiveness in terms of interference reduction ratios and subjective listening tests.
Published: 2023

14. Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

Author: Delcroix, Marc, Tawara, Naohiro, Diez, Mireia, Landini, Federico, Silnova, Anna, Ogawa, Atsunori, Nakatani, Tomohiro, Burget, Lukas, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC thus generates multiple streams of embeddings, one for each speaker in a chunk. We can cluster these embeddings using constrained agglomerative hierarchical clustering (cAHC), ensuring embeddings from the same chunk belong to different clusters. This paper introduces an alternative clustering approach, a multi-stream extension of the successful Bayesian HMM clustering of x-vectors (VBx), called MS-VBx. Experiments on three datasets demonstrate that MS-VBx outperforms cAHC in diarization and speaker counting performance.
Comment: Accepted at Interspeech 2023
Published: 2023

15. ConceptBeam: Concept Driven Target Speech Extraction

Author: Ohishi, Yasunori, Delcroix, Marc, Ochiai, Tsubasa, Araki, Shoko, Takeuchi, Daiki, Niizumi, Daisuke, Kimura, Akisato, Harada, Noboru, and Kashino, Kunio
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound
Abstract: We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker in a mixture. Typical approaches have been exploiting properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of our scheme, we performed experiments using a set of images associated with spoken captions. That is, we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compare ConceptBeam with two methods: one based on keywords obtained from recognition systems and another based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
Comment: Accepted to ACM Multimedia 2022
Published: 2022

16. Mask-based Neural Beamforming for Moving Speakers with Self-Attention-based Tracking

Author: Ochiai, Tsubasa, Delcroix, Marc, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sources do not move. However, sources often move in practice, which causes performance degradation. In this paper, we address the problem of mask-based beamforming for moving sources. We first review classical approaches to tracking a moving source, which perform online or blockwise computation of the SCMs. We show that these approaches can be interpreted as computing a sum of instantaneous SCMs weighted by attention weights. These weights indicate which time frames of the signal to consider in the SCM computation. Online or blockwise computation assumes a heuristic and deterministic way of computing these attention weights that, although simple, may not result in optimal performance. We thus introduce a learning-based framework that computes optimal attention weights for beamforming. We achieve this using a neural network implemented with self-attention layers. We show experimentally that our proposed framework can greatly improve beamforming performance in moving source situations while maintaining high performance in non-moving situations, thus enabling the development of mask-based beamformers robust to source movements.
Comment: 11 pages, 7 figures, Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing
Published: 2022
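
Illustrative note (not taken from the paper): the abstract's view of SCM estimation as a sum of instantaneous SCMs weighted by attention weights can be sketched as follows, for one frequency bin and without the time-frequency mask. The function names and the fixed blockwise weighting are assumptions of this sketch. In the paper the weights come from a self-attention network, which is not reproduced here.

```python
import numpy as np

def attention_weighted_scms(frames, weights):
    """Time-varying SCMs as weighted sums of instantaneous SCMs.

    frames:  (T, M) complex STFT coefficients of M microphones at one frequency.
    weights: (T, T) real array; weights[t, s] is how much frame s contributes
             to the SCM used at frame t.
    Returns: (T, M, M) array of SCMs.
    """
    # Instantaneous SCM of frame s: outer product x_s x_s^H.
    instantaneous = np.einsum("sm,sn->smn", frames, frames.conj())
    return np.einsum("ts,smn->tmn", weights, instantaneous)

def blockwise_weights(num_frames, block):
    """Heuristic weights: each frame averages uniformly over its own block."""
    weights = np.zeros((num_frames, num_frames))
    for t in range(num_frames):
        start = (t // block) * block
        stop = min(start + block, num_frames)
        weights[t, start:stop] = 1.0 / (stop - start)
    return weights
```

Uniform weights over all frames recover the usual time-invariant SCM, and `blockwise_weights` gives the blockwise heuristic. A learned model would replace the weight matrix while the aggregation step stays the same.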

17. SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning

Author: Delcroix, Marc, Vázquez, Jorge Bennasar, Ochiai, Tsubasa, Kinoshita, Keisuke, Ohishi, Yasunori, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: In many situations, we would like to hear desired sound events (SEs) while being able to ignore interference. Target sound extraction (TSE) tackles this problem by estimating the audio signal of the sounds of target SE classes in a mixture of sounds while suppressing all other sounds. We can achieve this with a neural network that extracts the target SEs by conditioning it on clues representing the target SE classes. Two types of clues have been proposed, i.e., target SE class labels and enrollment audio samples (or audio queries), which are pre-recorded audio samples of sounds from the target SE classes. Systems based on SE class labels can directly optimize embedding vectors representing the SE classes, resulting in high extraction performance. However, extending these systems to extract new SE classes not encountered during training is not easy. Enrollment-based approaches extract SEs by finding sounds in the mixtures that share similar characteristics to the enrollment audio samples. These approaches do not explicitly rely on SE class definitions and can thus handle new SE classes. In this paper, we introduce a TSE framework, SoundBeam, that combines the advantages of both approaches. We also perform an extensive evaluation of the different TSE schemes using synthesized and real mixtures, which shows the potential of SoundBeam.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on Feb. 10th, 2022, and accepted on Oct. 20, 2022
Published: 2022

18. Effective data screening technique for crowdsourced speech intelligibility experiments: Evaluation with IRM-based speech enhancement

Author: Yamamoto, Ayako, Irino, Toshio, Araki, Shoko, Arai, Kenichi, Ogawa, Atsunori, Kinoshita, Keisuke, and Nakatani, Tomohiro
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: It is essential to perform speech intelligibility (SI) experiments with human listeners in order to evaluate objective intelligibility measures for developing effective speech enhancement and noise reduction algorithms. Recently, crowdsourced remote testing has become a popular means for collecting a massive amount and variety of data at a relatively small cost and in a short time. However, careful data screening is essential for attaining reliable SI data. We performed SI experiments on speech enhanced by an "oracle" ideal ratio mask (IRM) in a well-controlled laboratory and in crowdsourced remote environments that could not be controlled directly. We introduced simple tone pip tests, in which participants were asked to report the number of audible tone pips, to estimate their listening levels above audible thresholds. The tone pip tests were very effective for data screening to reduce the variability of crowdsourced remote results so that the laboratory results would become similar. The results also demonstrated the SI of an oracle IRM, giving us the upper limit of the mask-based single-channel speech enhancement.
Comment: This paper was submitted to APSIPA ASC 2022 (https://www.apsipa2022.org). The original title [v1] was "Subjective intelligibility of speech sounds enhanced by ideal ratio mask via crowdsourced remote experiments with effective data screening."
Published: 2022

19. How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

Author: Iwamoto, Kazuma, Ochiai, Tsubasa, Delcroix, Marc, Ikeshita, Rintaro, Sato, Hiroshi, Araki, Shoko, and Katagiri, Shigeru
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with single-channel speech enhancement (SE). In this paper, we investigate the causes of ASR performance degradation by decomposing the SE errors using orthogonal projection-based decomposition (OPD). OPD decomposes the SE errors into noise and artifact components. The artifact component is defined as the SE error signal that cannot be represented as a linear combination of speech and noise sources. We propose manually scaling the error components to analyze their impact on ASR. We experimentally identify the artifact component as the main cause of performance degradation, and we find that mitigating the artifact can greatly improve ASR performance. Furthermore, we demonstrate that the simple observation adding (OA) technique (i.e., adding a scaled version of the observed signal to the enhanced speech) can monotonically increase the signal-to-artifact ratio under a mild condition. Accordingly, we experimentally confirm that OA improves ASR performance for both simulated and real recordings. The findings of this paper provide a better understanding of the influence of SE errors on ASR and open the door to future research on novel approaches for designing effective single-channel SE front-ends for ASR.
Comment: 5 pages, 5 figures, submitted to Interspeech 2022
Published: 2022
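
Illustrative note (not taken from the paper): the artifact definition in the abstract, the part of the enhanced signal that no linear combination of the speech and noise sources can represent, corresponds to a least-squares residual. The sketch below is a simplified, zero-delay version with one speech and one noise source. The function name and this simplification are assumptions, and the paper's OPD may differ in detail.

```python
import numpy as np

def decompose_enhanced_signal(enhanced, speech, noise):
    """Split an enhanced signal into speech, noise, and artifact components.

    The enhanced signal is projected onto the subspace spanned by the clean
    speech and the noise source; whatever lies outside that subspace is
    labelled as artifact.
    """
    basis = np.stack([speech, noise], axis=1)               # shape (T, 2)
    coeffs, *_ = np.linalg.lstsq(basis, enhanced, rcond=None)
    speech_part = coeffs[0] * speech
    noise_part = coeffs[1] * noise
    artifact = enhanced - speech_part - noise_part           # orthogonal residual
    return speech_part, noise_part, artifact
```

Scaling `noise_part` or `artifact` before summing the three components back together gives modified signals of the kind the abstract describes for probing each error type's effect on ASR.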

20. Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms

Author: Nakatani, Tomohiro, Ikeshita, Rintaro, Kinoshita, Keisuke, Sawada, Hiroshi, Kamo, Naoyuki, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate N sources from their sound mixture even with diffuse noise when a sufficiently large number (=M) of microphones are available (i.e., M>>N). However, the estimation accuracy seriously degrades as the number of microphones, or more specifically M-N (>=0), decreases. To overcome this limitation of IVA, we propose switching IVA (swIVA) in this paper. With swIVA, time frames of an observed signal with time-varying characteristics are clustered into several groups, each of which can be well handled by IVA using a small number of microphones, and thus accurate estimation can be achieved by applying IVA individually to each of the groups. Conventionally, a switching mechanism was introduced into a beamformer; however, no blind source separation algorithms with a switching mechanism have been successfully developed until this paper. In order to incorporate dereverberation capability, this paper further extends swIVA to a blind convolutional beamforming algorithm (swCIVA). It integrates swIVA and switching Weighted Prediction Error-based dereverberation (swWPE) in a jointly optimal way. We show that both swIVA and swCIVA can be optimized effectively based on blind signal processing, and that their performance can be further improved using a spatial guide for the initialization. Experiments show that both proposed methods largely outperform conventional IVA and its convolutional beamforming extension (CIVA) in terms of objective signal quality and automatic speech recognition scores when using a relatively small number of microphones.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing on 27 July 2021, accepted on 22 Feb. 2022
Published: 2021

21. Blind and neural network-guided convolutional beamformer for joint denoising, dereverberation, and source separation

Author: Nakatani, Tomohiro, Ikeshita, Rintaro, Kinoshita, Keisuke, Sawada, Hiroshi, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. For making the optimization computationally tractable, we incorporate two techniques into the approach: the Source-Wise Factorization (SW-Fact) of a CBF and the Independent Vector Extraction (IVE). To further improve the performance, we develop a method that integrates a neural network (NN)-based source power spectra estimation with CBF optimization by an inverse-Gamma prior. Experiments using noisy reverberant mixtures reveal that our proposed method with both blind and NN-guided scenarios greatly outperforms the conventional state-of-the-art NN-supported mask-based CBF in terms of the improvement in automatic speech recognition and signal distortion reduction performance., Comment: Accepted by IEEE ICASSP 2021
Published: 2021
Full Text: View/download PDF

22. Few-shot learning of new sound classes for target sound extraction

Author: Delcroix, Marc, Vázquez, Jorge Bennasar, Ochiai, Tsubasa, Kinoshita, Keisuke, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Target sound extraction consists of extracting the sound of a target acoustic event (AE) class from a mixture of AE sounds. It can be realized using a neural network that extracts the target sound conditioned on a 1-hot vector that represents the desired AE class. With this approach, embedding vectors associated with the AE classes are directly optimized for the extraction of sound classes seen during training. However, it is not easy to extend this framework to new AE classes, i.e. unseen during training. Recently, speech, music, or AE sound extraction based on enrollment audio of the desired sound offers the potential of extracting any target sound in a mixture given only a short audio signal of a similar sound. In this work, we propose combining 1-hot- and enrollment-based target sound extraction, allowing optimal performance for seen AE classes and simple extension to new classes. In experiments with synthesized sound mixtures generated with the Freesound Dataset (FSD) datasets, we demonstrate the benefit of the combined framework for both seen and new AE classes. Besides, we also propose adapting the embedding vectors obtained from a few enrollment audio samples (few-shot) to further improve performance on new classes., Comment: To appear in Interspeech 2021
Published: 2021

23. PILOT: Introducing Transformers for Probabilistic Sound Event Localization

Author: Schymura, Christopher, Bönninghoff, Benedikt, Ochiai, Tsubasa, Delcroix, Marc, Kinoshita, Keisuke, Nakatani, Tomohiro, Araki, Shoko, and Kolossa, Dorothea
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, yielding an additional notion of uncertainty, which many previously proposed deep learning-based systems designed for this application do not provide. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets with statistically significant differences in performance., Comment: Accepted at INTERSPEECH 2021
Published: 2021

24. Comparison of remote experiments using crowdsourcing and laboratory experiments on speech intelligibility

Author: Yamamoto, Ayako, Irino, Toshio, Arai, Kenichi, Araki, Shoko, Ogawa, Atsunori, Kinoshita, Keisuke, and Nakatani, Tomohiro
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether the results are entirely reliable. In this study, we compared speech intelligibility scores obtained in remote and laboratory experiments. The results showed that the mean and standard deviation (SD) of the remote experiments' speech reception threshold (SRT) were higher than those of the laboratory experiments. However, the variance in the SRTs across the speech-enhancement conditions revealed similarities, implying that remote testing results may be as useful as laboratory experiments to develop an objective measure. We also show that the practice session scores correlate with the SRT values. This is a priori information before performing the main tests and would be useful for data screening to reduce the variability of the SRT distribution., Comment: This paper was submitted to Interspeech2021
Published: 2021
Full Text: View/download PDF

25. Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization

Author: Schymura, Christopher, Ochiai, Tsubasa, Delcroix, Marc, Kinoshita, Keisuke, Nakatani, Tomohiro, Araki, Shoko, and Kolossa, Dorothea
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a novel approach to sound event localization by utilizing an attention-based sequence-to-sequence model. These types of models have been successfully applied to problems in natural language processing and automatic speech recognition. In this work, a multi-channel audio signal is encoded to a latent representation, which is subsequently decoded to a sequence of estimated directions-of-arrival. Herein, attentions allow for capturing temporal dependencies in the audio signal by focusing on specific frames that are relevant for estimating the activity and direction-of-arrival of sound events at the current time-step. The framework is evaluated on three publicly available datasets for sound event localization. It yields superior localization performance compared to state-of-the-art methods in both anechoic and reverberant conditions., Comment: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 2020
Published: 2021

26. Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain

Author: Wissing, Julio, Boenninghoff, Benedikt, Kolossa, Dorothea, Ochiai, Tsubasa, Delcroix, Marc, Kinoshita, Keisuke, Nakatani, Tomohiro, Araki, Shoko, and Schymura, Christopher
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization. Both applications benefit from a known speaker position when, for instance, applying beamforming or assigning unique speaker identities. Recently, several approaches utilizing acoustic signals augmented with visual data have been proposed for this task. However, both the acoustic and the visual modality may be corrupted in specific spatial regions, for instance due to poor lighting conditions or to the presence of background noise. This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions in the localization space. This fusion is achieved via a neural network, which combines the predictions of individual audio and video trackers based on their time- and location-dependent reliability. A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models., Comment: 4 pages, 6 figures, ICASSP 2021
Published: 2021

27. Multimodal Attention Fusion for Target Speaker Extraction

Author: Sato, Hiroshi, Ochiai, Tsubasa, Kinoshita, Keisuke, Delcroix, Marc, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speaker extraction offers a more stable performance than single modality methods for simulated data, its adaptation towards realistic situations has not been fully explored, nor has it been evaluated on real recorded mixtures. One of the major issues in handling realistic situations is how to make the system robust to clue corruption because in real recordings both clues may not be equally reliable, e.g. visual clues may be affected by occlusions. In this work, we propose a novel attention mechanism for multi-modal fusion and its training methods that make it possible to effectively capture the reliability of the clues and weight the more reliable ones. Our proposals improve signal to distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data. Moreover, we also record an audio-visual dataset of simultaneous speech with realistic visual clue corruption and show that audio-visual target speaker extraction with our proposals successfully works on real data., Comment: 7 pages, 5 figures
Published: 2021

28. Neural Network-based Virtual Microphone Estimator

Author: Ochiai, Tsubasa, Delcroix, Marc, Nakatani, Tomohiro, Ikeshita, Rintaro, Kinoshita, Keisuke, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic conditions. In this paper, as an alternative approach, we propose a neural network-based virtual microphone estimator (NN-VME). The NN-VME estimates virtual microphone signals directly in the time domain, by utilizing the precise estimation capability of the recent time-domain neural networks. We adopt a fully supervised learning framework that uses actual observations at the locations of the virtual microphones at training time. Consequently, the NN-VME can be trained using only multi-channel observations and thus directly on real recordings, avoiding the need for unrealistic physical model-based assumptions. Experiments on the CHiME-4 corpus show that the proposed NN-VME achieves high virtual microphone estimation performance even for real recordings and that a beamformer augmented with the NN-VME improves both the speech enhancement and recognition performance., Comment: 5 pages, 2 figures, submitted to ICASSP 2021
Published: 2021

29. Block Coordinate Descent Algorithms for Auxiliary-Function-Based Independent Vector Extraction

Author: Ikeshita, Rintaro, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Signal Processing, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we address the problem of extracting all super-Gaussian source signals from a linear mixture in which (i) the number of super-Gaussian sources $K$ is less than that of sensors $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that do not need to be extracted. To solve this problem, independent vector extraction (IVE) using a majorization minimization and block coordinate descent (BCD) algorithms has been developed, attaining robust source extraction and low computational cost. We here improve the conventional BCDs for IVE by carefully exploiting the stationarity of the Gaussian noise components. We also newly develop a BCD for a semiblind IVE in which the transfer functions for several super-Gaussian sources are given a priori. Both algorithms consist of a closed-form formula and a generalized eigenvalue decomposition. In a numerical experiment of extracting speech signals from noisy mixtures, we show that when $K = 1$ in a blind case or at least $K - 1$ transfer functions are given in a semiblind case, the convergence of our proposed BCDs is significantly faster than those of the conventional ones., Comment: Accepted by IEEE Transactions on Signal Processing
Published: 2020
Full Text: View/download PDF

30. Listen to What You Want: Neural Network-based Universal Sound Selector

Author: Ochiai, Tsubasa, Delcroix, Marc, Koizumi, Yuma, Ito, Hiroaki, Kinoshita, Keisuke, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Being able to control the acoustic events (AEs) to which we want to listen would allow the development of more controllable hearable devices. This paper addresses the AE sound selection (or removal) problems, that we define as the extraction (or suppression) of all the sounds that belong to one or multiple desired AE classes. Although this problem could be addressed with a combination of source separation followed by AE classification, this is a sub-optimal way of solving the problem. Moreover, source separation usually requires knowing the maximum number of sources, which may not be practical when dealing with AEs. In this paper, we propose instead a universal sound selection neural network that enables to directly select AE sounds from a mixture given user-specified target AE classes. The proposed framework can be explicitly optimized to simultaneously select sounds from multiple desired AE classes, independently of the number of sources in the mixture. We experimentally show that the proposed method achieves promising AE sound selection performance and could be generalized to mixtures with a number of sources that are unseen during training., Comment: 5 pages, 2 figures, submitted to INTERSPEECH 2020
Published: 2020

31. Cognitive-driven convolutional beamforming using EEG-based auditory attention decoding

Author: Aroudi, Ali, Delcroix, Marc, Nakatani, Tomohiro, Kinoshita, Keisuke, Araki, Shoko, and Doclo, Simon
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods allow to identify the target speaker which the listener is attending to from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing interfering speakers, reverberation and ambient noise, in this paper we propose a cognitive-driven multi-microphone speech enhancement system, which combines a neural-network-based mask estimator, weighted minimum power distortionless response convolutional beamformers and AAD. To control the suppression of the interfering speaker, we also propose an extension incorporating an interference suppression constraint. The experimental results show that the proposed system outperforms the state-of-the-art cognitive-driven speech enhancement systems in challenging reverberant and noisy conditions.
Published: 2020

32. Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system

Author: Kinoshita, Keisuke, Delcroix, Marc, Araki, Shoko, and Nakatani, Tomohiro
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Automatic meeting analysis is an essential fundamental technology required to let, e.g. smart devices follow and respond to our conversations. To achieve an optimal automatic meeting analysis, we previously proposed an all-neural approach that jointly solves source separation, speaker diarization and source counting problems in an optimal way (in a sense that all the 3 tasks can be jointly optimized through error back-propagation). It was shown that the method could well handle simulated clean (noiseless and anechoic) dialog-like data, and achieved very good performance in comparison with several conventional methods. However, it was not clear whether such all-neural approach would be successfully generalized to more complicated real meeting data containing more spontaneously-speaking speakers, severe noise and reverberation, and how it performs in comparison with the state-of-the-art systems in such scenarios. In this paper, we first consider practical issues required for improving the robustness of the all-neural approach, and then experimentally show that, even in real meeting scenarios, the all-neural approach can perform effective speech enhancement, and simultaneously outperform state-of-the-art systems., Comment: 8 pages, to appear in ICASSP2020
Published: 2020

33. Overdetermined independent vector analysis

Author: Ikeshita, Rintaro, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: We address the convolutive blind source separation problem for the (over-)determined case where (i) the number of nonstationary target-sources $K$ is less than that of microphones $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that need not be extracted. Independent vector analysis (IVA) can solve the problem by separating into $M$ sources and selecting the top $K$ highly nonstationary signals among them, but this approach suffers from a waste of computation especially when $K \ll M$. Channel reductions in preprocessing of IVA by, e.g., principal component analysis have the risk of removing the target signals. We here extend IVA to resolve these issues. One such extension has been attained by assuming the orthogonality constraint (OC) that the sample correlation between the target and noise signals is to be zero. The proposed IVA, on the other hand, does not rely on OC and exploits only the independence between sources and the stationarity of the noises. This enables us to develop several efficient algorithms based on block coordinate descent methods with a problem-specific acceleration. We clarify that one such algorithm exactly coincides with the conventional IVA with OC, and also explain that the other newly developed algorithms are faster than it. Experimental results show the improved computational load of the new algorithms compared to the conventional methods. In particular, a new algorithm specialized for $K = 1$ outperforms the others., Comment: To appear at the 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)
Published: 2020
Full Text: View/download PDF

34. Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

Author: Delcroix, Marc, Ochiai, Tsubasa, Zmolikova, Katerina, Kinoshita, Keisuke, Tawara, Naohiro, Nakatani, Tomohiro, and Araki, Shoko
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound
Abstract: Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents a practical alternative to speech separation as it enables tracking speech of a target speaker across utterances, and achieves promising speech extraction performance. However, it sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures, because it is difficult to discriminate the target speaker from the interfering speakers. In this paper, we investigate strategies for improving the speaker discrimination capability of SpeakerBeam. First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation. Besides, we investigate (1) the use of spatial features to better discriminate speakers when microphone array recordings are available, (2) adding an auxiliary speaker identification loss for helping to learn more discriminative voice characteristics. We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures, and outperform TasNet in terms of target speech extraction., Comment: 5 pages, 3 figures. Submitted to ICASSP 2020
Published: 2020

35. GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech

Author: Yamamoto, Katsuhiko, Irino, Toshio, Araki, Shoko, Kinoshita, Keisuke, and Nakatani, Tomohiro
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the domain of a temporal envelope extracted by the gammachirp auditory filterbank and modulation filterbank. We also extend GEDI with multi-resolution analysis (mr-GEDI) to predict the speech intelligibility of sounds under non-stationary noise conditions. We evaluate GEDI in terms of speech intelligibility predictions of speech sounds enhanced by a classic spectral subtraction and a Wiener filtering method. The predictions are compared with human results for various signal-to-noise ratio conditions with additive pink and babble noises. The results showed that mr-GEDI predicted the intelligibility curves better than short-time objective intelligibility (STOI) measure, extended-STOI (ESTOI) measure, and hearing-aid speech perception index (HASPI) under pink-noise conditions, and better than HASPI under babble-noise conditions. The mr-GEDI method does not present an overestimation tendency and is considered a more conservative approach than STOI and ESTOI. Therefore, the evaluation with mr-GEDI may provide additional information in the development of speech enhancement algorithms., Comment: Preprint, 37 pages, 6 tables, 9 figures
Published: 2019
Full Text: View/download PDF

36. All-neural online source separation, counting, and diarization for meeting analysis

Author: von Neumann, Thilo, Kinoshita, Keisuke, Delcroix, Marc, Araki, Shoko, Nakatani, Tomohiro, and Haeb-Umbach, Reinhold
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation. The NN-based estimator operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources. The neural network is recurrent over time as well as over the number of sources. The simulation experiments show that state of the art separation performance is achieved, while at the same time delivering good diarization and source counting results. It even generalizes well to an unseen large number of blocks., Comment: 5 pages, to appear in ICASSP2019
Published: 2019

37. FastFCA: A Joint Diagonalization Based Fast Algorithm for Audio Source Separation Using A Full-Rank Spatial Covariance Model

Author: Ito, Nobutaka, Araki, Shoko, and Nakatani, Tomohiro
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: A source separation method using a full-rank spatial covariance model has been proposed by Duong et al. ["Under-determined Reverberant Audio Source Separation Using a Full-rank Spatial Covariance Model," IEEE Trans. ASLP, vol. 18, no. 7, pp. 1830-1840, Sep. 2010], which is referred to as full-rank spatial covariance analysis (FCA) in this paper. Here we propose a fast algorithm for estimating the model parameters of the FCA, which is named Fast-FCA, and applicable to the two-source case. Though quite effective in source separation, the conventional FCA has a major drawback of expensive computation. Indeed, the conventional algorithm for estimating the model parameters of the FCA requires frame-wise matrix inversion and matrix multiplication. Therefore, the conventional FCA may be infeasible in applications with restricted computational resources. In contrast, the proposed FastFCA bypasses matrix inversion and matrix multiplication owing to joint diagonalization based on the generalized eigenvalue problem. Furthermore, the FastFCA is strictly equivalent to the conventional algorithm. An experiment has shown that the FastFCA was over 250 times faster than the conventional algorithm with virtually the same source separation performance.
Published: 2018

38. How Does End-To-End Speech Recognition Training Impact Speech Enhancement Artifacts?

Author: Iwamoto, Kazuma, primary, Ochiai, Tsubasa, additional, Delcroix, Marc, additional, Ikeshita, Rintaro, additional, Sato, Hiroshi, additional, Araki, Shoko, additional, and Katagiri, Shigeru, additional
Published: 2024
Full Text: View/download PDF

39. Target Speech Extraction with Pre-Trained Self-Supervised Learning Models

Author: Peng, Junyi, primary, Delcroix, Marc, additional, Ochiai, Tsubasa, additional, Plchot, Oldřich, additional, Araki, Shoko, additional, and Černocký, Jan, additional
Published: 2024
Full Text: View/download PDF

40. Online Target Sound Extraction with Knowledge Distillation from Partially Non-Causal Teacher

Author: Wakayama, Keigo, primary, Ochiai, Tsubasa, additional, Delcroix, Marc, additional, Yasuda, Masahiro, additional, Saito, Shoichiro, additional, Araki, Shoko, additional, and Nakayama, Akira, additional
Published: 2024
Full Text: View/download PDF

41. Neural Network-Based Virtual Microphone Estimation with Virtual Microphone and Beamformer-Level Multi-Task Loss

Author: Segawa, Hanako, primary, Ochiai, Tsubasa, additional, Delcroix, Marc, additional, Nakatani, Tomohiro, additional, Ikeshita, Rintaro, additional, Araki, Shoko, additional, Yamada, Takeshi, additional, and Makino, Shoji, additional
Published: 2024
Full Text: View/download PDF

42. GEDI: Gammachirp envelope distortion index for predicting intelligibility of enhanced speech

Author: Yamamoto, Katsuhiko, Irino, Toshio, Araki, Shoko, Kinoshita, Keisuke, and Nakatani, Tomohiro
Published: 2020
Full Text: View/download PDF

43. A STUDY ON THE BASES OF TOWNS AND VILLAGES IN SIX TOWNS AND ONE VILLAGE AMALGAMATED LOCAL GOVERNMENTS

Author: MIKURINO, Suzuna, primary, MURAKAMI, Yoshiaki, additional, ARAKI, Shoko, additional, and AKITA, Noriko, additional
Published: 2024
Full Text: View/download PDF

44. Blind and Spatially-Regularized Online Joint Optimization of Source Separation, Dereverberation, and Noise Reduction

Author: Ueda, Tetsuya, primary, Nakatani, Tomohiro, additional, Ikeshita, Rintaro, additional, Kinoshita, Keisuke, additional, Araki, Shoko, additional, and Makino, Shoji, additional
Published: 2024
Full Text: View/download PDF

45. Recent Advances in Multichannel Source Separation and Denoising Based on Source Sparseness

Author: Ito, Nobutaka, Araki, Shoko, Nakatani, Tomohiro, and Makino, Shoji, editor
Published: 2018
Full Text: View/download PDF

46. Research on the introduction process and actual conditions of group migration projects before the establishment of the Disaster Prevention and Group Relocation Promotion Project

Author: Magake, Kurumi, primary, Araki, Shoko, additional, Kimura, Reo, additional, and Akita, Noriko, additional
Published: 2023
Full Text: View/download PDF

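47. 防災集団移転促進事業の創設経緯とその理念 [Background and Philosophy of the Establishment of the Disaster Prevention and Group Relocation Promotion Project]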

Author: Araki, Shoko, primary, Magake, Kurumi, additional, Kimura, Reo, additional, and Akita, Noriko, additional
Published: 2023
Full Text: View/download PDF

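48. 認定中心市街地活性化基本計画終了後の自治体における中心市街地活性化施策の展開に関する研究 [A Study on the Development of City Center Revitalization Measures in Municipalities after the Completion of Certified Basic Plans for City Center Revitalization]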

Author: Sumiya, Shogo, primary, Kariya, Tomohiro, additional, Araki, Shoko, additional, and Ubaura, Michio, additional
Published: 2023
Full Text: View/download PDF

49. A Study on the Background of the Introduction of Specific Usage Limitation Area and its Potential as a Method of Residential Promotion in Non Area-Divided Cities

Author: Yoshida, Moeka, primary, Ubaura, Michio, additional, and Araki, Shoko, additional
Published: 2023
Full Text: View/download PDF

50. Multichannel Speech Enhancement Approaches to DNN-Based Far-Field Speech Recognition

Author: Delcroix, Marc, Yoshioka, Takuya, Ito, Nobutaka, Ogawa, Atsunori, Kinoshita, Keisuke, Fujimoto, Masakiyo, Higuchi, Takuya, Araki, Shoko, Nakatani, Tomohiro, Watanabe, Shinji, editor, Delcroix, Marc, editor, Metze, Florian, editor, and Hershey, John R., editor
Published: 2017
Full Text: View/download PDF
