13 results for "Shanfa Ke"
Search Results
2. Multi-speaker DoA Estimation Using Audio and Visual Modality
- Authors: Yulin Wu, Ruimin Hu, Xiaochen Wang, and Shanfa Ke
- Subjects: Artificial Intelligence, Computer Networks and Communications, General Neuroscience, Software
- Published: 2023
3. Distortion Reduction via CAE and DenseNet Mixture Network for Low Bitrate Spatial Audio Object Coding
- Authors: Yulin Wu, Ruimin Hu, Xiaochen Wang, Chenhao Hu, and Shanfa Ke
- Subjects: Hardware and Architecture, Signal Processing, Media Technology, Software, Computer Science Applications
- Published: 2022
4. Intelligibility Enhancement Via Normal-to-Lombard Speech Conversion With Long Short-Term Memory Network and Bayesian Gaussian Mixture Model
- Authors: Xiaochen Wang, Huyin Zhang, Gang Li, Ruimin Hu, and Shanfa Ke
- Subjects: Computer science, Speech recognition, Intelligibility (communication), Mixture model, Lombard effect, Expression (mathematics), Computer Science Applications, Speech enhancement, Noise, Signal Processing, Media Technology, Active listening, Electrical and Electronic Engineering, Environmental noise
- Abstract:
Speech communication and interaction frequently occur in a variety of environments, and environmental noise significantly degrades speech intelligibility for both the speaker and the listener. Especially in the listening stage, even if the multimedia terminal outputs clean speech, it is still difficult for listeners to obtain the information. Intelligibility enhancement (IENH) of speech is a technique for overcoming environmental noise in the listening stage: it applies a perceptual enhancement to non-noisy speech. This study focuses on IENH via normal-to-Lombard speech conversion, inspired by a well-known acoustic mechanism called the Lombard effect. Our method combines a long short-term memory (LSTM) network and a Bayesian Gaussian mixture model (BGMM) to build the conversion architecture. Compared with baselines, it has three main advantages: 1) an LSTM network is used for spectral tilt mapping, fully exploiting short-term correlations and high-dimensional expressive power; 2) the aperiodicity (AP) is mapped together with the fundamental frequency ($F_0$) by a BGMM, which accounts for their mutual constraints and the importance of APs; 3) gender-dependent mappings are used for $F_0$ and APs to account for distribution differences between genders. Experiments indicate that our method achieves better performance in both objective and subjective tests. (A code sketch of the LSTM mapping follows this entry.)
- Published: 2021
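As a rough illustration of the conversion architecture above, the following minimal PyTorch sketch maps per-frame spectral-envelope features of normal speech to Lombard-style features with an LSTM. The feature dimension, layer sizes, and stand-in input are our own assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of LSTM-based spectral tilt mapping.
import torch
import torch.nn as nn

class SpectralTiltMapper(nn.Module):
    def __init__(self, feat_dim=40, hidden=128):
        super().__init__()
        # stacked LSTM layers capture short-term correlations across frames
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):            # x: (batch, frames, feat_dim), normal speech
        h, _ = self.lstm(x)
        return self.out(h)           # predicted Lombard-style features per frame

model = SpectralTiltMapper()
normal_feats = torch.randn(1, 200, 40)   # stand-in feature sequence
lombard_feats = model(normal_feats)      # same shape, converted style
```

In the paper's pipeline, $F_0$ and aperiodicity are mapped by the BGMM rather than by this network; a conditional-mean sketch of that kind of mapping appears under entry 7 below.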
5. Single Channel multi-speaker speech Separation based on quantized ratio mask and residual network
- Authors: Zhongyuan Wang, Gang Li, Shanfa Ke, Tingzhao Wu, Ruimin Hu, and Xiaochen Wang
- Subjects: Ideal (set theory), Artificial neural network, Computer Networks and Communications, Computer science, TIMIT, Residual, Hardware and Architecture, Aliasing, Media Technology, Network performance, Cluster analysis, Algorithm, Software
- Abstract:
The recently proposed deep clustering-based algorithms represent a fundamental advance on the single-channel multi-speaker speech separation problem. These methods use an ideal binary mask (IBM) to construct the objective function and the K-means clustering method to estimate it. However, when the sources belong to the same class or the number of sources is large, the assumption that each time-frequency unit of the mixture is dominated by only one source becomes weak, and IBM-based separation causes spectral holes or aliasing. In our work, we instead propose the quantized ideal ratio mask: the ideal ratio mask is quantized so that the output of the neural network takes a limited number of possible values. The quantized ideal ratio mask is then used to construct the objective function for the case of multi-source domination, improving network performance. Furthermore, a network framework that combines a residual network, a recurrent network, and a fully connected network is used to exploit correlation information across frequency. We evaluated our system on the TIMIT dataset and show a 1.6 dB SDR improvement over the previous state-of-the-art methods. (A sketch of mask quantization follows this entry.)
- Published: 2020
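The central idea, quantizing the ideal ratio mask so that the network output takes a limited number of values, fits in a few lines. The sketch below is our illustration under assumed uniform bins and NumPy magnitude spectra, not the paper's code.

```python
# Sketch of the quantized ideal ratio mask (uniform bins are our assumption).
import numpy as np

def quantized_irm(source_mags, n_levels=4, eps=1e-8):
    """source_mags: (n_sources, n_freq, n_frames) magnitude spectra.
    Returns integer bin indices in [0, n_levels) per time-frequency unit,
    so a network can predict the mask as a classification target."""
    irm = source_mags / (source_mags.sum(axis=0, keepdims=True) + eps)
    return np.clip(np.floor(irm * n_levels), 0, n_levels - 1).astype(np.int64)

def dequantize(q, n_levels=4):
    """Map bin indices back to bin-center mask values for reconstruction."""
    return (q + 0.5) / n_levels
```

Because each time-frequency unit now carries one of n_levels labels per source rather than a hard 0/1 assignment, units dominated by several sources no longer force the spectral holes or aliasing that an ideal binary mask produces.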
6. Audio object coding based on optimal parameter frequency resolution
- Authors: Shanfa Ke, Tingzhao Wu, Ruimin Hu, and Xiaochen Wang
- Subjects: Computer Networks and Communications, Computer science, Hardware and Architecture, Aliasing, Frequency resolution, Distortion, Media Technology, Computer vision, Artificial intelligence, Sound quality, Coding (social sciences)
- Abstract:
Object-based audio content is becoming the main form of audio content because it is more interactive and flexible than traditional channel-based content. The Spatial Audio Object Coding (SAOC) method was proposed to encode multiple audio objects at a low bitrate. However, SAOC extracts only a few parameters for each signal frame, which leads to low parameter frequency resolution, so the decoded signals suffer serious aliasing distortion that degrades sound quality. In this paper, we present a novel audio object coding method. We are the first to analyze how signal distortion varies with parameter frequency resolution and to determine the optimal resolution for reducing aliasing distortion. In addition, we achieve a low coding bitrate through a dimensionality-reduction algorithm. Both objective and subjective experiments confirm that the proposed method provides higher output sound quality than state-of-the-art methods at an equivalent bitrate. (A sketch of band-wise parameter extraction follows this entry.)
- Published: 2019
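To make "parameter frequency resolution" concrete, this sketch extracts per-band object power ratios from object STFTs at a chosen band count: more bands mean higher resolution and bitrate, fewer bands invite the aliasing distortion described above. The uniform band layout and array shapes are our assumptions, not the paper's or the SAOC spec's.

```python
import numpy as np

def object_band_params(obj_stfts, n_bands):
    """Per-band power ratios of each object at a chosen parameter frequency
    resolution (a sketch in the spirit of SAOC-style parameters, not the spec).
    obj_stfts: complex array of shape (n_objects, n_freq, n_frames)."""
    n_obj, n_freq, n_frames = obj_stfts.shape
    edges = np.linspace(0, n_freq, n_bands + 1).astype(int)  # uniform bands (assumption)
    power = np.abs(obj_stfts) ** 2
    params = np.empty((n_obj, n_bands, n_frames))
    for b in range(n_bands):
        band = power[:, edges[b]:edges[b + 1], :].sum(axis=1)
        params[:, b, :] = band / (band.sum(axis=0, keepdims=True) + 1e-12)
    return params

rng = np.random.default_rng(0)
objs = rng.normal(size=(3, 513, 100)) + 1j * rng.normal(size=(3, 513, 100))
coarse = object_band_params(objs, n_bands=8)   # low resolution: fewer bits, more aliasing
fine = object_band_params(objs, n_bands=64)    # higher resolution: more bits, less aliasing
```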
7. Normal-To-Lombard Speech Conversion by LSTM Network and BGMM for Intelligibility Enhancement of Telephone Speech
- Authors: Xiaochen Wang, Ruimin Hu, Huyin Zhang, Gang Li, and Shanfa Ke
- Subjects: Background noise, Speech enhancement, Noise measurement, Computer science, Speech recognition, Intelligibility (communication), Mixture model
- Abstract:
Noise in the environment significantly decreases the speech intelligibility of telephone conversations. Even when the device outputs clean speech, it is still hard for the listener to get the information. This study focuses on intelligibility enhancement (IENH) of telephone speech under near-end background noise, based on normal-to-Lombard speech conversion. The proposed approach uses a long short-term memory (LSTM) network and a Bayesian Gaussian mixture model (BGMM) to build the speech mapping model. Compared with previous studies, we fully consider the short-term correlations of speech and implement feature mappings with higher-dimensional features and more feature types. Evaluations indicate that the proposed approach achieves better results in both objective and subjective tests. (A sketch of GMM-based feature mapping follows this entry.)
- Published: 2020
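The BGMM half of the mapping can be sketched as a joint mixture model over (normal, Lombard) feature pairs, converted through the responsibility-weighted conditional mean, which is the standard GMM mapping rule. The toy one-dimensional features below are stand-ins for the paper's $F_0$ and aperiodicity features, not its actual data or code.

```python
# Sketch of GMM-based feature mapping (our illustration, not the authors' code).
import numpy as np
from scipy.stats import norm
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                              # stand-in normal-speech feature
y = 1.3 * x + 0.5 + rng.normal(scale=0.1, size=1000)   # stand-in Lombard feature
gmm = BayesianGaussianMixture(n_components=8, covariance_type="full")
gmm.fit(np.column_stack([x, y]))

def conditional_mean(x_new):
    """E[y | x] under the joint GMM: responsibility-weighted per-component
    linear regression from the component means and covariances."""
    w, mu, cov = gmm.weights_, gmm.means_, gmm.covariances_
    lik = w * norm.pdf(x_new, mu[:, 0], np.sqrt(cov[:, 0, 0]))
    r = lik / lik.sum()                                   # component responsibilities
    y_k = mu[:, 1] + cov[:, 0, 1] / cov[:, 0, 0] * (x_new - mu[:, 0])
    return float(r @ y_k)

print(conditional_mean(0.2))   # converted feature value
```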
8. High quality audio object coding framework based on non-negative matrix factorization
- Authors: Tingzhao Wu, Xiaochen Wang, Jinshan Wang, Shanfa Ke, and Ruimin Hu
- Subjects: Computer Networks and Communications, Computer science, Speech coding, Coding tree unit, Sub-band coding, Matrix decomposition, Non-negative matrix factorization, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, Sound quality, Decoding methods, Coding (social sciences)
- Abstract:
Object-based audio coding is the main technique of audio scene coding. It can effectively reconstruct each object trajectory and provides sufficient flexibility for personalized audio scene reconstruction, so it has attracted more and more attention. However, existing object-based techniques deliver poor sound quality because of low parameter resolution in the frequency domain. To achieve high-quality audio object coding, we propose a new coding framework that introduces the non-negative matrix factorization (NMF) method. We extract object parameters at high resolution to improve sound quality, and apply NMF to parameter coding to reduce the high bitrate caused by that resolution. Experimental results show that the proposed framework improves coding quality by 25%, providing a more flexible and higher-quality way to encode audio scenes. (A sketch of NMF-based parameter compression follows this entry.)
- Published: 2017
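The bitrate-reduction step can be illustrated with scikit-learn's NMF: factor the non-negative high-resolution parameter matrix into a small basis and activation matrix and transmit those instead. The matrix sizes and component count below are illustrative assumptions.

```python
# Sketch of NMF-based parameter compression (sizes are made up for illustration).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
P = rng.random((512, 200))              # stand-in high-resolution parameters (bins x frames)

model = NMF(n_components=20, init="nndsvda", max_iter=400)
W = model.fit_transform(P)              # basis, 512 x 20
H = model.components_                   # activations, 20 x 200
P_hat = W @ H                           # decoder-side reconstruction of the parameters
```

Here 20 x (512 + 200) = 14,240 values stand in for 512 x 200 = 102,400, roughly a 7x reduction before any entropy coding.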
9. Multi-speakers Speech Separation Based on Modified Attractor Points Estimation and GMM Clustering
- Authors: Tingzhao Wu, Xiaochen Wang, Ruimin Hu, Zhongyuan Wang, Shanfa Ke, and Gang Li
- Subjects: Channel (digital image), Noise measurement, Computer science, Attractor, Embedding, Pattern recognition, Point (geometry), Artificial intelligence, Mixture model, Cluster analysis
- Abstract:
In this paper, a new attractor point estimation method is proposed for the DANet algorithm used in single-channel multi-speaker speech separation. A prerequisite is that the mixture must contain isolated segments of each source; this condition is met in practice because the sources do not overlap at every instant. With this prerequisite, an isolated segment of each source extracted from the mixture is converted into the embedding space. From the embeddings of isolated source segments, a more accurate attractor point is created for each source, because they contain no components of the other sources. In addition, a Gaussian mixture model (GMM) clustering method is used at run time instead of K-means. Experiments demonstrate that the proposed method achieves better separation performance than the state-of-the-art method, by up to 1.04 dB in SDR. (A sketch of attractor estimation and GMM clustering follows this entry.)
- Published: 2019
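A compact way to picture both modifications: take each source's attractor as the mean embedding of its isolated segments, then cluster all time-frequency embeddings with a GMM seeded at those attractors. The embedding dimension and synthetic data below are our assumptions, not the paper's setup.

```python
# Sketch of attractor estimation from isolated segments plus GMM clustering.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
emb_dim, n_src = 20, 2
emb_all = rng.normal(size=(5000, emb_dim))      # embeddings of all T-F units (stand-in)
iso_segments = [rng.normal(loc=m, size=(300, emb_dim))  # isolated-segment embeddings
                for m in (-1.0, 1.0)]

# attractor of each source: mean embedding over its isolated segments,
# uncontaminated by the other sources
attractors = np.stack([seg.mean(axis=0) for seg in iso_segments])

# GMM clustering (instead of K-means) initialized at the attractors
gmm = GaussianMixture(n_components=n_src, means_init=attractors).fit(emb_all)
masks = gmm.predict_proba(emb_all)              # soft per-source assignment of T-F units
```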
10. The Analysis for Binaural Signal’s Characteristics of a Real Source and Corresponding Virtual Sound Image
- Authors: Shanfa Ke, Jun Chen, Weiping Tu, Tingzhao Wu, Xiaochen Wang, and Jinshan Wang
- Subjects: Geography, Ideal (set theory), Sound transmission class, Computer science, Acoustics, Process (computing), Signal, Perception, Loudspeaker, Binaural recording, Sound (geography)
- Abstract:
3D audio systems can rebuild more realistic and immersive sound effects. Existing 3D audio reconstruction methods mainly consider the physical characteristics of the sound field and rarely take the head's effect on the sound transmission process into account. However, when a human is located in the sound field, there is an obvious deviation between the perceived sound image and the reconstructed image. Some researchers have considered the head's effects on sound field reconstruction, but they used only a simple head model to reproduce the sound field under certain loudspeaker configurations. To reconstruct an ideal sound field, we therefore need to analyze the head's effects on its reconstruction in detail. In this paper, we analyze and compare the characteristics of binaural signals under different loudspeaker configurations and gains. The results may act as a primary reference for further research on sound field reconstruction. (A sketch of binaural signal synthesis follows this entry.)
- Published: 2018
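For reference, the binaural signals under a given loudspeaker configuration can be synthesized by convolving each loudspeaker feed with left and right head-related impulse responses and summing at each ear; signals of this kind are what the study analyzes. The random HRIRs and gains below are placeholders for measured data.

```python
# Sketch of binaural signal synthesis (random stand-ins for measured HRIRs).
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
fs = 48_000
feeds = [rng.normal(size=fs) * g for g in (0.7, 0.3)]    # loudspeaker feeds with gains
hrirs = [(rng.normal(size=256), rng.normal(size=256))    # (left, right) HRIR per speaker
         for _ in feeds]

left = sum(fftconvolve(s, hl) for s, (hl, hr) in zip(feeds, hrirs))
right = sum(fftconvolve(s, hr) for s, (hl, hr) in zip(feeds, hrirs))
# 'left'/'right' are the ear signals whose characteristics (e.g. level
# differences) can then be compared across configurations and gains.
```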
11. Analysis and Comparison of Inter-Channel Level Difference and Interaural Level Difference
- Authors: Xiaochen Wang, Li Gao, Tingzhao Wu, Shanfa Ke, and Ruimin Hu
- Subjects: Computer science, Speech recognition, Attenuation, Interaural time difference, Horizontal plane, Imaging phantom, Perception, Loudspeaker, Binaural recording, Coding (social sciences)
- Abstract:
The directional perception of the human ear for sound in the horizontal plane mainly depends on the binaural cues: Interaural Level Difference (ILD), Interaural Time Difference (ITD), and Interaural Correlation (IC). ILD plays the leading role in localizing sounds above 1.5 kHz. In spatial audio applications, the Inter-Channel Level Difference (ICLD) between loudspeaker signals is used to represent the location of phantom sources generated by two loudspeakers. For headphone applications, ILD and ICLD are approximately equal, so the perceptual characteristics of ILD can be used in place of those of ICLD. But owing to attenuation along the transfer path from loudspeakers to the ears, the ICLD between loudspeaker signals is no longer the same as the ILD between the signals arriving at the two ears, and this difference is usually ignored in current spatial audio applications such as the perceptual coding of spatial parameters. In this paper we therefore analyze and compare ICLD and ILD, from their formation to their values under different loudspeaker configurations. Experimental results show that the difference between ILD and ICLD can be up to 55 dB, and this work may serve as a reference for further research on spatial audio applications such as coding and reconstruction. (A sketch of the two level-difference measures follows this entry.)
- Published: 2016
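Both quantities have simple energy definitions: ICLD is computed between the loudspeaker feeds, ILD between the signals at the two ears after head filtering. The mixing gains below stand in for real HRTF filtering and are invented for illustration.

```python
# Sketch contrasting ICLD (between feeds) and ILD (between ear signals).
import numpy as np

def level_diff_db(a, b, eps=1e-12):
    """Level difference between two signals in dB, from their energies."""
    return 10 * np.log10((np.sum(a**2) + eps) / (np.sum(b**2) + eps))

rng = np.random.default_rng(0)
s1 = 0.8 * rng.normal(size=4800)     # loudspeaker feed 1
s2 = 0.2 * rng.normal(size=4800)     # loudspeaker feed 2
icld = level_diff_db(s1, s2)

# ear signals: each feed attenuated by head shadowing (stand-in gains here;
# in practice, HRTF filtering per loudspeaker direction)
left = 0.9 * s1 + 0.5 * s2
right = 0.4 * s1 + 0.95 * s2
ild = level_diff_db(left, right)
print(f"ICLD = {icld:.1f} dB, ILD = {ild:.1f} dB")   # the two generally differ
```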
12. Head Related Transfer Function Interpolation Based on Aligning Operation
- Authors: Li Gao, Xiaochen Wang, Shanfa Ke, Tingzhao Wu, and Ruimin Hu
- Subjects: Computer science, Phase (waves), Head-related transfer function, Time difference, Computer vision, Artificial intelligence, Binaural recording, Sound image, Interpolation
- Abstract:
The head-related transfer function (HRTF) is the main technique of binaural synthesis, which is used to reconstruct a spatial sound image, and HRTF data can only be obtained by measurement. A high-resolution HRTF database contains so many HRTFs that the measurement workload is too large to complete. As a solution, in order to calculate new HRTFs from measured ones, many researchers have concentrated on HRTF interpolation. Before interpolating, however, HRTFs should be aligned, because there are time delays between different HRTFs. Some researchers have tried to align based on phase, but this is not appropriate because of the periodicity of phase. Another idea is to align HRTFs by detecting the delay, but the time difference is too small to detect exactly. Neither approach provides good, stable performance. In this paper, we propose a new method to align HRTFs based on correlation. Experiments show that the proposed method improves the SDR accuracy index by up to 18.5 dB and improves accuracy for all positions. (A sketch of correlation-based alignment follows this entry.)
- Published: 2016
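Correlation-based alignment in the spirit of the abstract: find the lag that maximizes cross-correlation against a reference HRIR and remove it before interpolating. The toy impulse responses are our own; real HRIRs would come from a measured database.

```python
# Sketch of correlation-based HRIR alignment (toy data, not the paper's code).
import numpy as np

def align_to_reference(hrir, hrir_ref):
    """Shift hrir by the lag that maximizes its cross-correlation with the
    reference, removing the inter-HRIR time delay before interpolation."""
    corr = np.correlate(hrir, hrir_ref, mode="full")
    lag = int(np.argmax(corr)) - (len(hrir_ref) - 1)   # delay relative to reference
    return np.roll(hrir, -lag), lag

ref = np.zeros(256); ref[10] = 1.0        # toy reference impulse response
delayed = np.roll(ref, 7)                 # same response, delayed by 7 samples
aligned, lag = align_to_reference(delayed, ref)
assert lag == 7 and np.allclose(aligned, ref)
```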
13. Physical Properties of Sound Field Based Estimation of Phantom Source in 3D
- Authors: Li Gao, Tingzhao Wu, Yuhong Yang, Shanfa Ke, and Xiaochen Wang
- Subjects: Auditory event, Computer science, Acoustics, Sound energy, Amplitude panning, Loudspeaker, Particle velocity, Sound pressure, Signal, Imaging phantom
- Abstract:
3D spatial sound effects can be achieved by amplitude panning across several loudspeakers, which can produce the auditory event of a phantom source at an arbitrary location using loudspeakers at arbitrary positions in 3D space. Estimating the phantom source means estimating the signal and location of a sound source that produces the same auditory perception as the phantom source produced by the loudspeakers. Several estimation methods have been proposed, but they do not ensure conservation of sound energy at the listening point, which includes kinetic energy (particle velocity) and potential energy (sound pressure), so they introduce estimation errors. We propose a new method to estimate the phantom source signal and position based on the physical properties (particle velocity and sound pressure) at the listening point in the loudspeaker-produced sound field. Moreover, the proposed method is also applicable to arbitrarily, asymmetrically arranged loudspeakers. Experimental results show that, compared with current methods, the proposed method clearly reduces the estimated distortion of both the phantom source location and the superposed loudspeaker signal. (A sketch of a pressure/velocity estimate follows this entry.)
- Published: 2015
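Under a plane-wave simplification of our own (not the paper's derivation), pressure and particle velocity at the listening point reduce to a gain sum and a gain-weighted direction sum, which is enough to sketch the style of estimate described above.

```python
# Sketch of a pressure/velocity-based phantom source estimate (plane-wave
# simplification; our illustration, not the paper's method).
import numpy as np

def estimate_phantom(gains, speaker_dirs):
    """Estimate phantom-source level and direction from loudspeaker gains,
    conserving sound pressure (gain sum) and particle velocity (gain-weighted
    direction sum) at the listening point. speaker_dirs: (n, 3) unit vectors."""
    gains = np.asarray(gains, dtype=float)
    p = gains.sum()                                    # ~ sound pressure
    v = (gains[:, None] * speaker_dirs).sum(axis=0)    # ~ particle velocity
    direction = v / (np.linalg.norm(v) + 1e-12)
    return p, direction

# works for an asymmetric layout, e.g. three unevenly placed loudspeakers
dirs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]])
print(estimate_phantom([0.5, 0.3, 0.2], dirs))
```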