10,292 results for "Speech coding"
Search Results
52. Implementation of Audio Data Packet Encryption Synchronization Circuit
- Author
-
Ma, Hongbin, Wang, Yingli, Li, Gaoling, Kacprzyk, Janusz, Series editor, Pan, Jeng-Shyang, editor, Snasel, Vaclav, editor, Corchado, Emilio S., editor, Abraham, Ajith, editor, and Wang, Shyue-Liang, editor
- Published
- 2014
- Full Text
- View/download PDF
53. Summary and Conclusions
- Author
-
Rao, K. Sreenivasa, Vuppala, Anil Kumar, Neustein, Amy, Series editor, Rao, K. Sreenivasa, and Vuppala, Anil Kumar
- Published
- 2014
- Full Text
- View/download PDF
54. A Secure AMR Fixed Codebook Steganographic Scheme Based on Pulse Distribution Model.
- Author
-
Ren, Yanzhen, Yang, Hanyi, Wu, Hongxia, Tu, Weiping, and Wang, Lina
- Abstract
Adaptive multi-rate (AMR), a popular audio compression standard, is widely used in mobile communication and mobile Internet applications and has become a novel carrier for hiding information. To improve the statistical security, this paper presents a steganographic scheme in the AMR fixed codebook (FCB) domain based on the pulse distribution model (PDM-AFS), which is obtained from the distribution characteristics of the FCB value in the cover audio. The pulse positions in stego audio are controlled by message encoding and random masking to make the statistical distribution of the FCB parameters close to that of the cover audio. The experimental results show that the statistical security of the proposed scheme is better than that of the existing schemes. Furthermore, the hiding capacity is maintained compared with the existing schemes. The average hiding capacity can reach 2.06 kbps at an audio compression rate of 12.2 kbps, and the auditory concealment is good. To the best of our knowledge, this is the first secure AMR FCB steganographic scheme that improves the statistical security based on the distribution model of the cover audio. This scheme can be extended to other audio compression codecs under the principle of algebraic code excited linear prediction (ACELP), such as G.723.1 and G.729. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
55. An efficient two-digit adaptive delta modulation for Laplacian source coding.
- Author
-
Peric, Zoran, Denic, Bojan, and Despotovic, Vladimir
- Subjects
-
ADAPTIVE modulation, SOURCE code, SOUND reverberation, DATA quality, MINIMAL design
- Abstract
Delta Modulation (DM) is a simple waveform coding algorithm used mostly when timely data delivery is more important than the transmitted data quality. While the implementation of DM is fairly simple and inexpensive, it suffers from several limitations, such as slope overload and granular noise, which can be overcome using Adaptive Delta Modulation (ADM). This paper presents a novel 2-digit ADM with six-level quantization using variable-length coding for encoding time-varying signals modelled by the Laplacian distribution. Two quantizer variants are employed: a distortion-constrained quantizer, optimally designed for minimal mean-squared error (MSE), and a rate-constrained quantizer, which is suboptimal in the minimal-MSE sense but enables minimal loss in SQNR for the target bit rate. Experimental results using a real speech signal indicate that the proposed configuration outperforms the baseline ADM algorithms, including Constant Factor Delta Modulation (CFDM), Continuously Variable Slope Delta Modulation (CVSDM), and 2-digit and 2-bit ADM, and operates over a much wider dynamic range. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
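The delta modulation entries in this list all hinge on step-size adaptation. As an illustrative sketch only (a 1-bit ADM with constant-factor step adaptation, not the paper's six-level 2-digit scheme), the mechanism is: grow the step on consecutive equal bits (fighting slope overload), shrink it on alternation (fighting granular noise). All parameter values here are arbitrary demonstration choices.

```python
import numpy as np

def adm_encode(x, step0=0.1, k=1.5, step_min=1e-3, step_max=1.0):
    """1-bit adaptive delta modulation (CFDM-style step adaptation)."""
    bits = np.empty(len(x), dtype=np.int8)
    est, step, prev_bit = 0.0, step0, 1
    for n, sample in enumerate(x):
        bit = 1 if sample >= est else -1
        # Constant-factor adaptation: grow the step on consecutive equal
        # bits (slope overload), shrink it on alternation (granular noise).
        step = float(np.clip(step * (k if bit == prev_bit else 1.0 / k),
                             step_min, step_max))
        est += bit * step
        bits[n], prev_bit = bit, bit
    return bits

def adm_decode(bits, step0=0.1, k=1.5, step_min=1e-3, step_max=1.0):
    """Decoder mirrors the encoder's step adaptation exactly."""
    out = np.empty(len(bits))
    est, step, prev_bit = 0.0, step0, 1
    for n, bit in enumerate(bits):
        step = float(np.clip(step * (k if bit == prev_bit else 1.0 / k),
                             step_min, step_max))
        est += bit * step
        out[n], prev_bit = est, bit
    return out
```

Because encoder and decoder apply the identical adaptation rule to the same bit stream, no step-size side information needs to be transmitted.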
56. Simple Speech Transform Coding Scheme Using Forward Adaptive Quantization for Discrete Input Signal.
- Author
-
Perić, Zoran, Tančić, Milan, Simić, Nikola, and Despotović, Vladimir
- Subjects
AUTOMATIC speech recognition, IMAGE compression, SIGNAL reconstruction, SIGNAL processing, SPEECH
- Abstract
A speech coding scheme based on simple transform coding and forward adaptive quantization for discrete input signals is proposed in this paper. A quasi-logarithmic quantizer is applied to discretize the continuous input signal, i.e. to prepare the discrete input. Forward adaptation based on the input signal variance provides more efficient bandwidth usage, whereas transform coding yields sub-sequences with more predictable signal characteristics, ensuring higher-quality signal reconstruction at the receiving end. To provide additional compression, transform coding precedes adaptive quantization. The signal-to-quantization-noise ratio is used as an objective measure of system performance. System performance is discussed for two typical cases: in the first, the variance of the continuous signal is assumed to be available; in the second, only the variance of the discretized signal is available, which implies a loss of input signal information. The main goal of this comparison for the proposed speech signal coding model is to explore how objective the performance estimate remains when information about the continuous source is absent, a common situation in digital systems. The advantages of the proposed coding scheme are demonstrated by comparing the performance of the reconstructed signal with that of similar existing speech signal coding systems. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
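The "quasi-logarithmic quantizer" named in this abstract is conventionally realized as mu-law companding followed by uniform quantization. A minimal sketch; mu, bit depth, and overload point are arbitrary choices here, and the paper's forward-adaptive variance estimation is not reproduced:

```python
import numpy as np

def mu_law_quantize(x, n_bits=8, mu=255.0, x_max=1.0):
    """Quasi-logarithmic (mu-law) scalar quantization: compress, quantize
    uniformly, expand. Returns the reconstructed signal."""
    x = np.clip(x, -x_max, x_max)
    # Compressor: c(x) = x_max * ln(1 + mu|x|/x_max) / ln(1 + mu) * sgn(x)
    c = x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu) * np.sign(x)
    # Uniform midrise quantization of the compressed signal
    step = 2 * x_max / 2 ** n_bits
    q = (np.floor(c / step) + 0.5) * step
    # Expander inverts the compressor
    return x_max / mu * np.expm1(np.abs(q) * np.log1p(mu) / x_max) * np.sign(q)
```

The companding makes the effective step size roughly proportional to signal magnitude, which is what keeps the SQNR nearly constant over a wide input variance range.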
57. Effect of cochlear implant n-of-m strategy on signal-to-noise ratio below which noise hinders speech recognition.
- Author
-
Stam, Lucas, Goverts, S. Theo, and Smits, Cas
- Subjects
-
COCHLEAR implants, SIGNAL-to-noise ratio, SPEECH perception, INTELLIGIBILITY of speech, SPEECH coding, HIGHPASS electric filters
- Abstract
Speech recognition was measured in 24 normal-hearing subjects for unprocessed speech and for speech processed by a cochlear implant Advanced Combination Encoder (ACE) coding strategy, in quiet and at various signal-to-noise ratios (SNRs). All signals were low- or high-pass filtered to avoid ceiling effects. Surprisingly, speech recognition performance plateaus at approximately 22 dB SNR for both speech types, implying that ACE processing has no effect on the upper limit of the effective SNR range. Speech recognition improved significantly above 15 dB SNR, suggesting that the upper limit used in the Speech Intelligibility Index should be reconsidered. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
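Experiments like this one require mixing speech and noise at exact SNRs. A small helper, assuming a plain power-based SNR over the whole signals (the speech-level weighting used in formal test specifications is omitted):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    then return the mixture. Assumes the noise is not all-zero."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + gain * noise
```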
58. F0-induced formant measurement errors result in biased variabilities.
- Author
-
Chen, Wei-Rong, Whalen, D. H., and Shadle, Christine H.
- Subjects
-
FORMANTS (Speech), VOWEL reduction, ACOUSTIC measurements, LINEAR predictive coding, SPEECH coding, SPEECH perception
- Abstract
Many developmental studies attribute reduction of acoustic variability to increasing motor control. However, linear prediction-based formant measurements are known to be biased toward the nearest harmonic of F0, especially at high F0s. Thus, the amount of reported formant variability generated by changes in F0 is unknown. Here, 470,000 vowels were synthesized, mimicking statistics reported in four developmental studies, to estimate the proportion of formant variability that can be attributed to F0 bias, as well as to other formant measurement errors. Results showed that the F0-induced formant measurement errors are large and systematic, and cannot be eliminated by a large sample size. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
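The bias discussed here arises because linear-prediction formant trackers read resonances off the LPC model. A bare-bones sketch of autocorrelation-method LPC followed by root-angle formant estimation; real trackers add pre-emphasis, windowing, and bandwidth thresholds, all omitted here:

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC: returns polynomial [1, -a1, ..., -ap]."""
    x = x - np.mean(x)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    return np.concatenate(([1.0], -a))

def formant_freqs(x, order, fs):
    """Resonance frequencies (Hz) from the angles of the LPC polynomial roots."""
    roots = np.roots(lpc(x, order))
    roots = roots[np.imag(roots) > 0]   # keep one root of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2 * np.pi))
```

On a synthetic two-pole resonator the root angle recovers the pole frequency; on real voiced speech with sparse harmonics, the estimate is pulled toward the nearest harmonic of F0, which is exactly the bias the abstract quantifies.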
59. Front End to Back End Speech Scrambler.
- Author
-
Alsabbagh, Taha A.
- Subjects
SPEECH coding, DATA encryption, FREQUENCY curves, INTELLIGIBILITY of speech
- Abstract
In this paper, a modified approach for frequency permutation of speech signals is designed, implemented, and verified in MATLAB Simulink. Separate permutations of the real and imaginary parts of the frequency components are used. The Short-Time Objective Intelligibility (STOI) measure is used to check the performance and residual intelligibility of the method. The results show a positive indication compared with classical permutation. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
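The scrambler described above permutes real and imaginary spectral parts separately. A numpy sketch of the idea, with details that are my own assumptions (DC and Nyquist bins stay in place so the permuted half-spectrum still resynthesizes to a real signal of even length, and the seeded permutations stand in for a key):

```python
import numpy as np

def scramble(x, key):
    """Permute real and imaginary parts of the positive-frequency bins with
    two key-derived permutations, then resynthesize via the inverse real FFT."""
    X = np.fft.rfft(x)
    rng = np.random.default_rng(key)
    idx = np.arange(1, len(X) - 1)        # interior bins; DC/Nyquist fixed
    p_re = rng.permutation(idx)
    p_im = rng.permutation(idx)
    Y = X.copy()
    Y[idx] = X.real[p_re] + 1j * X.imag[p_im]
    return np.fft.irfft(Y, n=len(x)), (idx, p_re, p_im)

def descramble(y, perms):
    """Invert the two permutations and resynthesize the original signal."""
    idx, p_re, p_im = perms
    Y = np.fft.rfft(y)
    re, im = Y.real.copy(), Y.imag.copy()
    re[p_re] = Y.real[idx]
    im[p_im] = Y.imag[idx]
    return np.fft.irfft(re + 1j * im, n=len(y))
```

Keeping the DC and Nyquist bins unpermuted is what makes the scramble exactly invertible: those two bins must remain real for the half-spectrum to correspond to a real time signal.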
60. Optimal Fractional Linear Prediction With Restricted Memory.
- Author
-
Skovranek, Tomas, Despotovic, Vladimir, and Peric, Zoran
- Subjects
MEMORY, LINEAR orderings, DEFINITIONS, FREQUENCY-domain analysis, TRANSFER matrix, FRACTIONAL calculus
- Abstract
Linear prediction is extensively used in the modeling, compression, coding, and generation of speech signals. Various formulations of linear prediction are available, both in the time and frequency domains, which start from different assumptions but result in the same solution. In this letter, we propose a novel, generalized formulation of optimal low-order linear prediction using fractional (non-integer) derivatives. The proposed fractional derivative formulation allows for the definition of a predictor with versatile behavior based on the order of the fractional derivative. We derive closed-form expressions for the optimal fractional linear predictor with restricted memory, and prove that the optimal first-order and optimal second-order linear predictors are only its special cases. Furthermore, we empirically show that the optimal order of the fractional derivative can be approximated by the inverse of the predictor memory, and is thus known a priori. Therefore, complexity is reduced by optimizing and transferring only one predictor coefficient, i.e., one parameter fewer in comparison to the second-order linear predictor, at the same level of performance. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
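For context, the integer-order predictors that the letter recovers as special cases come from the classical normal equations. A textbook sketch of that baseline (the fractional-derivative formulation proposed in the letter is not reproduced here):

```python
import numpy as np

def autocorr(x, max_lag):
    """Biased sample autocorrelation at lags 0..max_lag."""
    x = x - np.mean(x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(max_lag + 1)])

def optimal_predictor(x, order):
    """Solve the normal equations R a = r for the optimal linear predictor
    x_hat[n] = sum_k a_k * x[n-k] (the integer-order special case)."""
    r = autocorr(x, order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])
```

For order 1 this reduces to the familiar a1 = R(1)/R(0); the letter's contribution is a single fractional parameter that interpolates between such fixed-order solutions.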
61. An AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech.
- Author
-
Ren, Yanzhen, Liu, Dengkai, Yang, Jing, and Wang, Lina
- Subjects
CRYPTOGRAPHY, COMMUNICATION & technology, INFORMATION services, SPEECH coding, MOBILE communication systems
- Abstract
In this paper, a novel AMR adaptive steganographic scheme based on the pitch delay of unvoiced speech (PDU-AAS) is proposed. The existing AMR steganographic schemes based on pitch delay destroy the short-time relative stability of the pitch delay in voiced speech segments, which makes them easier to detect by existing steganalysis schemes. The pitch delay distributions of AMR voiced and unvoiced speech segments are analyzed in detail; based on the observation that the pitch delay sequence of AMR unvoiced speech does not exhibit short-time relative stability, we propose an AMR adaptive steganographic scheme that adaptively selects the embedding positions in unvoiced speech segments, determined by the distribution characteristics of the adjacent pitch delays, and embeds the secret message by modifying the pitch delay without destroying its short-time stability. The experimental results show that the scheme has good concealment and hiding capacity. Most importantly, comparative experiments show that the scheme is secure against detection by existing steganalysis algorithms. The principle of the scheme can be applied to other steganographic schemes based on the pitch delay of a speech codec, such as G.723.1 and G.729. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
62. A scalable wideband speech codec using the wavelet packet transform based on the internet low bitrate codec.
- Author
-
Seto, Koji and Ogunfunmi, Tokunbo
- Subjects
-
BROADBAND communication systems, WAVELET transforms, SPEECH codecs, DISCRETE cosine transforms, SPEECH codes theory, INTERNET, LINEAR network coding
- Abstract
Highlights
• A packet-loss robust scalable wideband speech codec is proposed.
• Performance is improved using the wavelet transform instead of the MDCT.
• The proposed codec outperforms the state-of-the-art codec in objective tests.
• High performance of the proposed codec is confirmed in subjective tests.
Abstract
Most recent speech codecs employ code excited linear prediction (CELP) and transmit side information to improve speech quality under packet loss. Another approach to achieve high robustness to packet loss is to use the frame independent coding scheme based on the internet low bitrate codec (iLBC). The scalable wideband speech codec based on the iLBC was previously presented and outperformed G.729.1 at most bit rates according to the objective quality. This paper presents improvements to the previous work. Specifically, we employ the wavelet packet transform (WPT) instead of the modified discrete cosine transform (MDCT) to enhance the quality, and evaluate the proposed codec based on both objective and subjective quality measures. The objective quality evaluation results show that clear improvement is achieved and that the proposed codec outperforms G.729.1 at bit rates of 18 kbps or higher under clean channel conditions and has higher robustness to packet loss than G.729.1. The informal subjective test results also show similar trends. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
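The codec above swaps the MDCT for a wavelet packet transform. A toy full-tree WPT with the Haar filter illustrates the analysis/synthesis structure; real codecs use longer filters (e.g. Daubechies) and prune the tree, and this sketch assumes the input length is divisible by 2^levels:

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: approximation and detail half-bands."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_inv(a, d):
    """Invert one Haar analysis step."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def wpt(x, levels):
    """Full wavelet packet analysis: split every subband at every level."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        bands = [half for b in bands for half in haar_step(b)]
    return bands

def iwpt(bands):
    """Wavelet packet synthesis: merge neighbouring subbands pairwise."""
    bands = list(bands)
    while len(bands) > 1:
        bands = [haar_inv(bands[i], bands[i + 1])
                 for i in range(0, len(bands), 2)]
    return bands[0]
```

Unlike the MDCT's fixed uniform frequency grid, the packet tree lets a codec allocate time-frequency resolution per subband, which is the flexibility the paper exploits.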
63. Novel Two-Bit Adaptive Delta Modulation Algorithms.
- Author
-
PERIC, Zoran, DENIC, Bojan, and DESPOTOVIC, Vladimir
- Subjects
-
ADAPTIVE modulation, HILBERT-Huang transform, SOUND reverberation, LAPLACE distribution, LAPLACE transformation, ALGORITHMS, SIGNAL-to-noise ratio
- Abstract
This paper introduces two novel algorithms for 2-bit adaptive delta modulation, namely 2-bit hybrid adaptive delta modulation and 2-bit optimal adaptive delta modulation. In 2-bit hybrid adaptive delta modulation, adaptation is performed both at the frame level and the sample level, with the estimated variance used to determine the initial quantization step size. In the optimal algorithm, the estimated variance is used to scale a quantizer codebook optimally designed assuming a Laplace distribution of the input signal. The algorithms are tested using a speech signal and compared to constant factor delta modulation, continuously variable slope delta modulation, and instantaneously adaptive 2-bit delta modulation, showing that the proposed algorithms offer higher performance and a significantly wider dynamic range. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
64. An investigation on the degradation of different features extracted from the compressed American English speech using narrowband and wideband codecs.
- Author
-
Sankar, M. S. Arun and Sathidevi, P. S.
- Subjects
SPEECH perception, SPEECH processing systems, SIGNAL processing, BROADBAND communication systems, SPEECH disorders
- Abstract
Speech coding facilitates speech compression without perceptual loss, but it results in the elimination or deterioration of both speech- and speaker-specific features used in a wide range of applications such as automatic speaker and speech recognition, biometric authentication, and prosody evaluation. The present work investigates the effect of speech coding on the quality of features extracted from codec-compressed speech, including Mel Frequency Cepstral Coefficients, Gammatone Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Perceptual Linear Prediction Cepstral Coefficients, Rasta-Perceptual Linear Prediction Cepstral Coefficients, Residue Cepstrum Coefficients, and Linear Predictive Coding-derived cepstral coefficients. The codecs selected for this study are G.711, G.729, G.722.2, Enhanced Voice Services, Mixed Excitation Linear Prediction, and three codecs based on the compressive sensing framework. The analysis also covers the variation in the quality of extracted features across the various bit rates supported by Enhanced Voice Services, G.722.2, and the compressive sensing codecs. The quality of epochs, fundamental frequency, and formants estimated from codec-compressed speech is also analyzed. Among the selected codecs, the Mixed Excitation Linear Prediction codec introduces the least variation in the extracted features, owing to its unique representation of excitation. For the compressive sensing based codecs, there is a drastic improvement in the quality of extracted features as the bit rate increases, due to the waveform-type coding they employ. For the popular Code Excited Linear Prediction codec, based on the analysis-by-synthesis coding paradigm, the impact of the Linear Predictive Coding order on feature extraction is investigated. 
The quality of extracted features improves with the order of linear prediction, and optimum performance is obtained for a Linear Predictive Coding order between 20 and 30; this varies with gender and the statistical characteristics of speech. Although the basic purpose of a codec is to compress a single voice source, the performance of the codecs in a multi-speaker environment, the most common environment in the majority of speech processing applications, is also studied. Here, a multi-speaker environment with two speakers is considered, and the quality of the individual speech signals improves with increasing diversity of the mixtures passed through the codecs. The perceptual quality of the individual speech signals extracted from the codec-compressed speech is almost the same for the Mixed Excitation Linear Prediction and Enhanced Voice Services codecs, but regarding the preservation of features, the Mixed Excitation Linear Prediction codec shows superior performance over the Enhanced Voice Services codec. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
65. Low bit-rate speech coding based on multicomponent AFM signal model.
- Author
-
Bansal, Mohan and Sircar, Pradip
- Subjects
SPEECH processing systems, SPEECH perception, FOURIER analysis, AUTOMATIC speech recognition, PHONEME (Linguistics)
- Abstract
In this paper, we propose a novel multicomponent amplitude and frequency modulated (AFM) signal model for parametric representation of speech phonemes. An efficient technique is developed for parameter estimation of the proposed model. The Fourier-Bessel series expansion is used to separate a multicomponent speech signal into a set of individual components. The discrete energy separation algorithm is used to extract the amplitude envelope (AE) and the instantaneous frequency (IF) of each component of the speech signal. Then, the parameter estimation of the proposed AFM signal model is carried out by analysing the AE and IF parts of the signal component. The developed model is found to be suitable for representation of an entire speech phoneme (voiced or unvoiced) irrespective of its time duration, and the model is shown to be applicable for low bit-rate speech coding. The symmetric Itakura-Saito and the root-mean-square log-spectral distance measures are used for comparison of the original and reconstructed speech signals. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
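The AE/IF extraction step mentioned in this abstract rests on the Teager-Kaiser energy operator. A sketch of one standard energy separation variant (DESA-2); the paper's Fourier-Bessel component separation, and which DESA variant it actually uses, are not reproduced here:

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy operator: Psi(x)[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x):
    """DESA-2 energy separation: estimate the instantaneous frequency
    (rad/sample) and amplitude envelope of a near-monocomponent AM-FM signal."""
    psi_x = teager(x)                 # valid at samples 1 .. N-2
    y = x[2:] - x[:-2]                # symmetric difference, centred at n
    psi_y = teager(y)                 # valid at samples 2 .. N-3
    psi_x = psi_x[1:-1]               # align both supports to 2 .. N-3
    omega = 0.5 * np.arccos(np.clip(1 - psi_y / (2 * psi_x), -1.0, 1.0))
    amp = 2 * psi_x / np.sqrt(psi_y)
    return omega, amp
```

For a pure sinusoid A*cos(omega*n + phi), both estimates are exact up to numerical precision, which is the sanity check below.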
66. Multi-band excitation based vocoders and their real-time implementation
- Author
-
Ma, Wei
- Subjects
621.3822, Speech coding, Digital signal processing
- Abstract
Vocoders compress speech signals into a very efficient representation - vocal parameters, resulting in very low bit rate speech communications. However, the reproduced speech quality has remained low for a long time, since vocoders traditionally employ very simple speech production models. As speech coding research is focusing on bit rates below 4 kb/s after CELP based hybrid coders achieved a remarkable breakthrough at medium to low rates, 16-4.8 kb/s in the last decade, the vocoder techniques have regained their importance. The most attractive vocoder developed in recent years is the Multi-Band Excitation (MBE) vocoder, which uses multiple V/UV decisions in the frequency domain. The MBE vocoder can produce high speech quality at rates around 4 kb/s. However, problems associated with this newly emerged scheme include: high implementation complexity, rather higher rate than the normal 2.4 kb/s of vocoder, and low acoustic robustness. This thesis reports on a study of this new vocoder scheme. The first part of the thesis aims to examine 3 key aspects of vocoders: vocal tract filter, pitch determination and V/UV decisions. First, a comprehensive review is given for each of the above topics, covering formulation and classifications. Various improvement attempts are then discussed for each of these essential functions. The relationship between different vocal tract filter descriptions reveals the possibility of design of a new type of MBE based vocoder, MBE-LPC vocoder. A pitch synchronized sinusoidal synthesizer demonstrates a very efficient way of using sinusoidal-based vocal tract filters. A new pitch determination method is designed for the MBE vocoder in order to correct the bias of low pitch preference. In networking operations, vocoders face not only the speech signals but also network control signals, such as tones. Two essential functions: silence and DTMF detection have been designed together with the consideration of the V/UV decisions. 
In recent years, with speech coding algorithms employing more and more complicated techniques, real-time implementation has become a critical problem in speech coding research, and research progress is frequently delayed by it. Therefore, the second part of this thesis investigates the MBE based vocoders from a real-time implementation and system application point of view. A general study on real-time implementation is included, which analyses the problems associated with modern DSP solutions. Fast implementation and efficient computation methods are proposed in consideration of vocoding and DSP peculiarities. Two complete MBE based vocoders are then examined along with objective verification and performance enhancements in real-time operation. The efficiency and correctness of these real-time implementations are reported; the real-time implementation of the 4.15 kb/s MBE vocoder was the first to pass the objective type approval test required by the INMARSAT-M system. A proposed 2.7 kb/s MBE-LPC vocoder achieves slightly better quality than the above standard MBE vocoder.
- Published
- 1994
67. The Effect of Narrow-Band Transmission on Recognition of Paralinguistic Information From Human Vocalizations
- Author
-
Sascha Frühholz, Erik Marchi, and Björn Schuller
- Subjects
Speech analysis, speech coding, emotion recognition, computational paralinguistics, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Practically no knowledge exists on the effects of speech coding and recognition for narrow-band transmission of speech signals within certain frequency ranges, especially in relation to the recognition of paralinguistic cues in speech. We thus investigated the impact of narrow-band standard speech coders on the machine-based classification of affective vocalizations and clinical vocal recordings. In addition, we analyzed the effect of speech low-pass filtering with a set of different cut-off frequencies, either chosen as static values in the 0.5-5 kHz range or given dynamically by different upper limits from the first five speech formants (F1-F5). Speech coding and recognition were tested, first, on short-term speaker states using the affective vocalizations in the Geneva Multimodal Emotion Portrayals. Second, in relation to long-term speaker traits, we tested vocal recordings from clinical populations involving speech impairments, as found in the Child Pathological Speech Database. We employed a large acoustic feature space derived from the Interspeech Computational Paralinguistics Challenge. Besides analyzing the sheer corruption outcome, we analyzed the potential of matched and multicondition training as opposed to the mismatched condition. In the results, first, multicondition and matched-condition training significantly increase performance as opposed to the mismatched condition. Second, downgrades in classification accuracy occur, however, only at comparably severe levels of low-pass filtering; the downgrades especially appear for multi-categorical rather than for binary decisions. These can be dealt with reasonably by the aforementioned strategies.
- Published
- 2016
- Full Text
- View/download PDF
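The static low-pass filtering conditions in this study can be reproduced with a plain windowed-sinc FIR filter. A numpy-only sketch; the cutoff, tap count, and window below are illustrative choices, not the study's actual filter design:

```python
import numpy as np

def lowpass_fir(x, cutoff_hz, fs, numtaps=101):
    """Windowed-sinc FIR low-pass filter (Hamming window, zero-phase
    alignment via centred convolution)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    fc = cutoff_hz / fs                      # normalized cutoff (cycles/sample)
    h = 2 * fc * np.sinc(2 * fc * n)         # ideal low-pass impulse response
    h *= np.hamming(numtaps)                 # taper to control ripple
    h /= h.sum()                             # unity gain at DC
    return np.convolve(x, h, mode='same')
```

With an odd, symmetric tap set, `mode='same'` keeps the output time-aligned with the input, so filtered and original features can be compared frame by frame as the study requires.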
68. MASS: Microphone Array Speech Simulator in Room Acoustic Environment for Multi-Channel Speech Coding and Enhancement
- Author
-
Rui Cheng, Changchun Bao, and Zihao Cui
- Subjects
microphone array, speech dataset, room acoustics, simulation, speech coding, speech enhancement, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999
- Abstract
Multi-channel speech coding and enhancement are indispensable technologies in speech communication. To verify the effectiveness of multi-channel speech coding and enhancement methods during research and development, a microphone array speech simulator (MASS) for room acoustic environments is proposed. The proposed MASS improves and extends the existing multi-channel speech simulator. It aims to simulate clean speech, noisy speech, clean speech with reverberation, noisy speech with reverberation, and noise signals captured by a microphone array, for use in multi-channel speech coding and enhancement in room acoustic environments. Experimental results on multi-channel speech coding and enhancement show that MASS can closely simulate the signals encountered in real room acoustic environments and can be applied to research in the related fields.
- Published
- 2020
- Full Text
- View/download PDF
69. Steganographic Pulse-Based Recovery for Robust ACELP Transmission over Erasure Channels
- Author
-
López-Oller, Domingo, Gomez, Angel M., Córdoba, José Luis Pérez, Geiser, Bernd, Vary, Peter, Torre Toledano, Doroteo, editor, Ortega Giménez, Alfonso, editor, Teixeira, António, editor, González Rodríguez, Joaquín, editor, Hernández Gómez, Luis, editor, San Segundo Hernández, Rubén, editor, and Ramos Castro, Daniel, editor
- Published
- 2012
- Full Text
- View/download PDF
70. Early Development of Neural Speech Encoding Depends on Age but Not Native Language Status: Evidence From Lexical Tone
- Author
-
Peggy Hiu Ying Chan, Patrick C. M. Wong, Ching Man Lai, Ting Fan Leung, Akshay R. Maggu, Hugh Simon Lam, Nikolay Novitskiy, Kay H. Y. Wong, and Tak Yeung Leung
- Subjects
Speech recognition, First language, Speech coding, Psychology, Tone (literature)
- Abstract
We investigated the development of early-latency and long-latency brain responses to native and non-native speech to shed light on the neurophysiological underpinnings of perceptual narrowing and early language development. Specifically, we postulated a two-level process to explain the decrease in sensitivity to non-native phonemes towards the end of infancy. Neurons at the earlier stages of the ascending auditory pathway mature rapidly during infancy facilitating the encoding of both native and non-native sounds. This growth enables neurons at the later stages of the auditory pathway to assign phonological status to speech according to the infant’s native language environment. To test this hypothesis, we collected early-latency and long-latency neural responses to native and non-native lexical tones from 85 Cantonese-learning children aged between 23 days and 24 months, 16 days. As expected, a broad range of presumably subcortical early-latency neural encoding measures grew rapidly and substantially during the first two years for both native and non-native tones. By contrast, long-latency cortical electrophysiological changes occurred on a much slower scale and showed sensitivity to nativeness at around six months. Our study provided a comprehensive understanding of early language development by revealing the complementary roles of earlier and later stages of speech processing in the developing brain.
- Published
- 2021
71. An Adaptive Bitrate Switching Algorithm for Speech Applications in Context of WebRTC
- Author
-
Pocta, Peter, Alahmadi, Mohannad, and Melvin, Hugh
- Subjects
Web browser, Multimedia, Computer Networks and Communications, Computer science, Speech coding, Switching algorithm, Context (language use), Speech applications, computer.software_genre, WebRTC, Set (abstract data type), Hardware and Architecture, Data exchange, computer
- Abstract
Web Real-Time Communication (WebRTC) combines a set of standards and technologies to enable high-quality audio, video, and auxiliary data exchange in web browsers and mobile applications. It enables peer-to-peer multimedia sessions over IP networks without the need for additional plugins. The Opus codec, deployed as the default audio codec for speech and music streaming in WebRTC, supports a wide range of bitrates covering narrowband, wideband, and super-wideband up to fullband bandwidths. Users of IP-based telephony always demand high-quality audio. Beyond users' expectations, their emotional state, the content type, and many other psychological factors, together with network quality of service and distortions introduced at the end terminals, determine their quality of experience. To measure the quality experienced by the end user of a voice transmission service, the E-model standardized in ITU-T Rec. G.107 (the narrowband version), ITU-T Rec. G.107.1 (the wideband version), and the most recent ITU-T Rec. G.107.2 extension for super-wideband can be used. In this work, we present a quality-of-experience model built on the E-model to measure the impact of coding and packet loss on the quality perceived by the end user in WebRTC speech applications. Based on the computed Mean Opinion Score, a real-time adaptive codec parameter switching mechanism is used to switch to the optimum codec bitrate under the present network conditions. We present evaluation results showing the effectiveness of the proposed approach compared with the default codec configuration in WebRTC.
- Published
- 2021
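In its simplified narrowband form, the E-model computation this entry builds on reduces to a handful of ITU-T G.107 formulas: an effective equipment impairment Ie,eff under packet loss, and the mapping from the rating factor R to MOS. A sketch that ignores delay and other impairments; codec-specific Ie and Bpl values must come from the G.113 appendices, and the numbers used in the check below are illustrative:

```python
def ie_eff(ie, bpl, ppl, burst_r=1.0):
    """Effective equipment impairment under packet loss (ITU-T G.107):
    Ie,eff = Ie + (95 - Ie) * Ppl / (Ppl / BurstR + Bpl)."""
    return ie + (95.0 - ie) * ppl / (ppl / burst_r + bpl)

def r_to_mos(r):
    """Map the E-model rating factor R to a Mean Opinion Score (G.107)."""
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

def simplified_mos(ie, bpl, ppl):
    """Simplified narrowband E-model: start from the default maximum rating
    (R = 93.2) and subtract only the packet-loss/codec impairment."""
    return r_to_mos(93.2 - ie_eff(ie, bpl, ppl))
```

A switching mechanism like the one described can evaluate `simplified_mos` per candidate bitrate from measured packet loss and pick the configuration with the highest predicted MOS.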
72. Neural Speech Encoding in Infancy Predicts Future Language and Communication Difficulties
- Author
-
Patrick C. M. Wong, Gangyi Feng, Ting Fan Leung, Hugh Simon Lam, Akshay R. Maggu, Nikolay Novitskiy, Ching Man Lai, and Peggy Hiu Ying Chan
- Subjects
Linguistics and Language, Gestures, Computer science, Communication, Speech recognition, Speech coding, Infant, Language Development, Vocabulary, Speech and Hearing, Otorhinolaryngology, Developmental and Educational Psychology, Humans, Speech, Child, Construct (philosophy), Language
- Abstract
Purpose: This study aimed to construct an objective and cost-effective prognostic tool to forecast the future language and communication abilities of individual infants.
Method: Speech-evoked electroencephalography (EEG) data were collected from 118 infants during the first year of life during exposure to speech stimuli that differed principally in fundamental frequency. Language and communication outcomes, namely four subtests of the MacArthur–Bates Communicative Development Inventories (MCDI), Chinese version, were collected between 3 and 16 months after initial EEG testing. In the two-way classification, children were classified into those with future MCDI scores below the 25th percentile for their age group and those above the same percentile, while the three-way classification classified them into < 25th, 25th-75th, and > 75th percentile groups. Machine learning (support vector machine classification) with cross validation was used for model construction. Statistical significance was assessed.
Results: Across the four MCDI measures of early gestures, later gestures, vocabulary comprehension, and vocabulary production, the areas under the receiver-operating characteristic curve of the predictive models were respectively .92 ± .031, .91 ± .028, .90 ± .035, and .89 ± .039 for the two-way classification, and .88 ± .041, .89 ± .033, .85 ± .047, and .85 ± .050 for the three-way classification (p < .01 for all models).
Conclusions: Future language and communication variability can be predicted by an objective EEG method that indicates the function of the auditory neural pathway foundational to spoken language development, with precision sufficient for individual predictions. Longer-term research is needed to assess predictability of categorical diagnostic status. Supplemental Material: https://doi.org/10.23641/asha.15138546
- Published
- 2021
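The pipeline this abstract describes (an SVM classifier with cross-validation, evaluated by ROC AUC) can be sketched as follows. The synthetic features are only a stand-in for the speech-evoked EEG measures; the feature dimensionality, kernel, and fold count are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for EEG-derived features of 118 infants (the study's n);
# dimensionality and class balance are illustrative.
X, y = make_classification(n_samples=118, n_features=20, n_informative=8,
                           random_state=0)

clf = make_pipeline(StandardScaler(),
                    SVC(kernel="rbf", probability=True, random_state=0))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probability scores give an honest cross-validated AUC.
scores = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
auc = roc_auc_score(y, scores)
print(f"cross-validated AUC: {auc:.3f}")
```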
73. Research on Low Delay 11.2kbps Speech Coding Algorithm
- Author
-
Zhao, Zhefeng, Zhang, Gang, Wang, Yiping, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Deng, Hepu, editor, Miao, Duoqian, editor, Lei, Jingsheng, editor, and Wang, Fu Lee, editor
- Published
- 2011
- Full Text
- View/download PDF
74. Speaker Recognition from Coded Speech Using Support Vector Machines
- Author
-
Janicki, Artur, Staroszczyk, Tomasz, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Nierstrasz, Oscar, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Sudan, Madhu, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Vardi, Moshe Y., Series editor, Weikum, Gerhard, Series editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Habernal, Ivan, editor, and Matoušek, Václav, editor
- Published
- 2011
- Full Text
- View/download PDF
75. Audio Signal Processing Using Time-Frequency Approaches: Coding, Classification, Fingerprinting, and Watermarking
- Author
-
Sridhar Krishnan, Behnaz Ghoraani, and K. Umapathy
- Subjects
Audio mining ,Audio signal ,Audio electronics ,Computer science ,business.industry ,Speech recognition ,lcsh:Electronics ,Speech coding ,lcsh:TK7800-8360 ,computer.software_genre ,Anti-aliasing ,lcsh:Telecommunication ,Sub-band coding ,Audio watermark ,lcsh:TK5101-6720 ,business ,Audio signal processing ,Digital watermarking ,computer ,Digital signal processing ,Digital audio - Abstract
Audio signals are information-rich, nonstationary signals that play an important role in our day-to-day communication, perception of the environment, and entertainment. Due to their non-stationary nature, time-only or frequency-only approaches are inadequate for analyzing these signals; a joint time-frequency (TF) approach is a better choice for processing them efficiently. In this digital era, compression, intelligent indexing for content-based retrieval, classification, and protection of digital audio content are a few of the areas that encapsulate the majority of audio signal processing applications. In this paper, we present a comprehensive array of TF methodologies that successfully address applications in all of the above-mentioned areas. A TF-based audio coding scheme with a novel psychoacoustics model, music classification, audio classification of environmental sounds, audio fingerprinting, and audio watermarking will be presented to demonstrate the advantages of using time-frequency approaches in analyzing and extracting information from audio signals.
- Published
- 2022
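The joint time-frequency representation these methods build on is, at its core, the short-time Fourier transform: window the signal into overlapping frames and take a DFT of each. A minimal NumPy sketch (frame length, hop, and test tone are illustrative choices):

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Short-time Fourier transform: the basic joint time-frequency
    representation underlying TF coding and classification."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i*hop : i*hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)      # shape: (time, frequency)

fs = 8000
t = np.arange(fs) / fs                      # 1 s test tone at 440 Hz
x = np.sin(2 * np.pi * 440 * t)

S = stft(x)
peak_bin = np.abs(S).mean(axis=0).argmax()  # strongest frequency bin
peak_hz = peak_bin * fs / 256
print(f"dominant frequency ~ {peak_hz:.0f} Hz")
```

With a 256-sample frame at 8 kHz the bin spacing is 31.25 Hz, so the detected peak lands within one bin of the true 440 Hz tone.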
76. A Packet Loss Concealment Algorithm Robust to Burst Packet Loss Using Multiple Codebooks and Comfort Noise for CELP-Type Speech Coders
- Author
-
Park, Nam In, Kim, Hong Kook, Jung, Min A., Lee, Seong Ro, Choi, Seung Ho, Kim, Tai-hoon, editor, Vasilakos, Thanos, editor, Sakurai, Kouichi, editor, Xiao, Yang, editor, Zhao, Gansen, editor, and Ślęzak, Dominik, editor
- Published
- 2010
- Full Text
- View/download PDF
77. Design of Switched Quantizers and Speech Coding Based on Quasi-Logarithmic Compandor.
- Author
-
Vucic, Nikola J., Peric, Zoran H., and Petkovic, Goran M.
- Subjects
SPEECH coding ,PROBABILITY density function ,SIGNAL-to-noise ratio ,NUMERICAL functions ,MATHEMATICAL analysis - Abstract
This article investigates switched quantizers for speech signals modeled by a Gaussian probability density function (PDF); the Gaussian PDF is a better fit for the short frame lengths considered here. The companding technique yields a nearly constant signal-to-quantization-noise ratio (SQNR). Two approaches are presented: the quasi-logarithmic (QL) and the piecewise uniform (PU) compandor. A simpler compandor lowers the complexity and cost of the hardware realization but also degrades performance, so a sensible trade-off has to be made. The switched technique improves performance: the quality of quantization is increased by dividing the dynamic range of variances into multiple subranges, and for each subrange a separate quantizer is designed with its own support-region amplitude. The optimal amplitude is determined numerically, with maximal SQNR as the single criterion. The bit rates of these quantizers do not depend on the signal variance, as fixed-length codes are used. The performance of the proposed quantizers is demonstrated on real speech signals from a reliable database, and the obtained results are compared with other recent solutions to show the efficiency of this model. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
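The quasi-logarithmic compandor the article analyzes follows the classic compress-quantize-expand structure; μ-law (as in G.711) is the standard quasi-logarithmic choice. A minimal sketch measuring SQNR on a Gaussian source, matching the paper's PDF model (bit depth, μ, and signal variance are illustrative):

```python
import numpy as np

def mu_law_quantize(x, n_bits=8, mu=255.0, xmax=1.0):
    """Quasi-logarithmic companding quantizer: mu-law compressor,
    uniform mid-rise quantizer, then expander."""
    # compressor: logarithmic-like mapping of [-xmax, xmax] onto [-1, 1]
    y = np.sign(x) * np.log1p(mu * np.abs(x) / xmax) / np.log1p(mu)
    # uniform quantizer with 2**n_bits levels on [-1, 1]
    levels = 2 ** n_bits
    q = (np.floor((y + 1) / 2 * levels).clip(0, levels - 1) + 0.5) \
        / levels * 2 - 1
    # expander: exact inverse of the compressor
    return np.sign(q) * xmax * np.expm1(np.abs(q) * np.log1p(mu)) / mu

rng = np.random.default_rng(0)
x = rng.normal(0, 0.1, 100_000).clip(-1, 1)   # Gaussian "speech" source
xq = mu_law_quantize(x)
sqnr_db = 10 * np.log10(np.mean(x**2) / np.mean((x - xq)**2))
print(f"SQNR: {sqnr_db:.1f} dB")
```

The near-constant SQNR over a wide range of input variances is exactly the companding property the article's switched design builds on.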
78. Quality aspects of music used as a background noise in speech communication over mobile network.
- Author
-
Počta, Peter and Isabelle, Scott
- Subjects
- *
ACOUSTIC transients , *NOISE control , *SOUND measurement , *TRAFFIC noise , *ABSORPTION of sound - Abstract
This paper compares the effect of send-side music and environmental noise as background noise in telephone communication. The study focuses on the quality experienced by the end user in the context of NB, WB and SWB mobile speech communication. The subjective test procedure defined in ITU-T Rec. P.835 is followed in this study. The results show that music as background noise in a telephone conversation deteriorates the overall quality experienced by the end user. Moreover, the impact of music background noise on quality is similar to that of environmental noise. Furthermore, music background noise appears to be slightly less intrusive than environmental noise, especially at lower SNRs. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
79. Cognitive Speech Coding: Examining the Impact of Cognitive Speech Processing on Speech Compression.
- Author
-
Cernak, Milos, Asaei, Afsaneh, and Hyafil, Alexandre
- Abstract
Speech coding is a field in which compression paradigms have not changed in the last 30 years. Speech signals are most commonly encoded with compression methods that have roots in linear predictive theory dating back to the early 1940s. This article bridges this influential theory with recent cognitive studies applicable in speech communication engineering. It reviews the mechanisms of speech perception that have led to perceptual speech coding. The focus is on human speech communication and machine learning and the application of cognitive speech processing in speech compression that presents a paradigm shift from perceptual (auditory) speech processing toward cognitive (auditory plus cortical) speech processing. [ABSTRACT FROM PUBLISHER]
- Published
- 2018
- Full Text
- View/download PDF
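The linear predictive theory this article revisits estimates each sample as a weighted sum of past samples, with the weights obtained from the autocorrelation normal equations. A minimal sketch (the test signal, order, and sampling rate are illustrative; a small noise floor keeps the normal equations well conditioned):

```python
import numpy as np

def lpc(x, order):
    """Predictor coefficients a_k for x[n] ~ sum_k a_k * x[n-k],
    via the autocorrelation method's normal equations R a = r."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)]
                  for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(2048) / fs                     # voiced-like two-formant tone
x = (np.sin(2*np.pi*300*t) + 0.5*np.sin(2*np.pi*900*t)
     + 0.001 * rng.normal(size=2048))

order = 8
a = lpc(x, order)
pred = np.convolve(x, np.r_[0.0, a])[:len(x)]   # predicted samples
gain_db = 10 * np.log10(np.mean(x[order:]**2)
                        / np.mean((x[order:] - pred[order:])**2))
print(f"prediction gain: {gain_db:.1f} dB")
```

A large prediction gain is what lets a speech coder transmit the small residual instead of the waveform itself.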
80. Combined speech compression and encryption using chaotic compressive sensing with large key size.
- Author
-
Al‐Azawi, Maher K. Mahmood and Gaze, Ali M.
- Abstract
This study introduces a new method for speech signal encryption and compression in a single step. The combined compression/encryption procedures are accomplished using compressive sensing (CS). The contourlet transform is used to increase the sparsity of the signal required by CS. Due to its randomness properties and very high sensitivity to initial conditions, a chaotic system is used to generate the sensing matrix of CS. This greatly increases the encryption key size, to 10^135 when the logistic map is used. A spectral segmental signal-to-noise ratio of −36.813 dB is obtained as a measure of encryption strength. The quality of the reconstructed speech is given by means of signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ). For a 60% compression ratio the proposed method gives 48.203 dB SNR and 4.437 PESQ for voiced speech segments. However, for continuous speech (voiced and unvoiced), it gives 41.097 dB SNR and 4.321 PESQ. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
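The chaotic sensing-matrix idea can be sketched with a logistic map: the initial condition acts as the secret key, and a tiny key change yields an essentially uncorrelated matrix. This illustrates the key-sensitivity claim only, not the authors' exact construction (matrix dimensions and the parameter r are assumptions):

```python
import numpy as np

def logistic_matrix(m, n, x0, r=3.99):
    """Sensing matrix filled from a logistic-map orbit
    x_{k+1} = r * x_k * (1 - x_k); x0 is the secret key."""
    x = x0
    seq = np.empty(m * n)
    for k in range(m * n):
        x = r * x * (1 - x)
        seq[k] = x
    return (seq.reshape(m, n) - 0.5) * 2     # roughly zero-mean entries

A1 = logistic_matrix(64, 256, x0=0.3141592653)
A2 = logistic_matrix(64, 256, x0=0.3141592654)  # key differs in the 10th digit

# Chaotic divergence: the two matrices are essentially uncorrelated,
# so decryption with a slightly wrong key fails completely.
corr = np.corrcoef(A1.ravel(), A2.ravel())[0, 1]
print(f"correlation between matrices: {corr:.3f}")
```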
81. A Speech Privacy Protection Method Based on Sound Masking and Speech Corpus.
- Author
-
Qi, Ding, Longmei, Nan, and jinfu, Xu
- Subjects
AUDITORY masking ,SPEECH processing systems ,SPEECH coding ,ALGORITHMS ,COMPRESSED speech ,DATA security - Abstract
In this paper, a new speech privacy protection method based on sound masking and a speech corpus is proposed. Frames in the speech corpus are selected according to secret keys and the pitch of the destination speech frame; the selected masking speech frames must have the same pitch period as the destination speech frame. The selected frames are used to mask the original speech frames and protect the privacy of the speech, and the masked speech can be transmitted safely over a communication system. By removing the effect of the corresponding masking frames from the masked speech with the same corpus, the original speech can be recovered. Experimental results show that the proposed method achieves good privacy protection and recovered-speech quality. It is robust to speech compression by waveform coding algorithms, and helpful for systems using parametric coding algorithms for compression. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
82. A Novel AMR-WB Speech Steganography Based on Diameter-Neighbor Codebook Partition.
- Author
-
He, Junhui, Chen, Junxi, Xiao, Shichang, Tang, Shaohua, and Huang, Xiaoyu
- Subjects
CRYPTOGRAPHY ,SPEECH coding ,SPEECH codecs ,CRYPTOGRAPHY research ,CRYPTOGRAPHIC equipment - Abstract
Steganography is a means of covert communication that conceals both the occurrence and the real purpose of communication. Adaptive multirate wideband (AMR-WB) is a widely adopted format in mobile handsets and is also the recommended speech codec for VoLTE. In this paper, a novel AMR-WB speech steganography is proposed based on a diameter-neighbor codebook partition algorithm. Different embedding capacities may be achieved by adjusting the iterative parameters during codebook division. The experimental results prove that the presented AMR-WB steganography provides higher and more flexible embedding capacity without inducing perceptible distortion compared with the state-of-the-art methods. With 48 iterations of cluster merging, twice the embedding capacity of the complementary-neighbor-vertices-based embedding method may be obtained with a decrease of only around 2% in speech quality and much the same undetectability. Moreover, both the quality of the stego speech and the security regarding statistical steganalysis are better than those of the recent speech steganography based on neighbor-index-division codebook partition. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
83. Multilevel Delta Modulation with Switched First-Order Prediction for Wideband Speech Coding.
- Author
-
Peric, Zoran, Denic, Bojan, and Despotovic, Vladimir
- Subjects
DELTA modulation ,BROADBAND communication systems ,SPEECH coding ,SIGNAL processing ,COMPANDING - Abstract
In this paper, a delta modulation speech coding scheme based on the ITU-T G.711 standard and a switched first-order predictor is presented. A forward adaptive scheme is used, where adaptation to the signal variance is performed on a frame-by-frame basis. Frames are classified as weakly or highly correlated based on the correlation coefficient calculated for each frame, providing a basis for choosing the appropriate predictor coefficient. The obtained results indicate that the proposed model significantly outperforms the scalar companding system based on the G.711 standard, and the experimental results were verified against the theoretical model over a wide dynamic range of the input variance. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
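The effect of switching the first-order predictor coefficient can be illustrated by running a 1-bit delta modulator with two coefficients: a coefficient near 1 suits a highly correlated frame, while a mismatched coefficient degrades reconstruction. A sketch under illustrative parameters (step size, frame content, and the specific coefficients are assumptions, not the paper's design):

```python
import numpy as np

def delta_modulate(x, step, a):
    """1-bit delta modulation with first-order prediction: transmit the
    sign of the prediction error; the decoder mirrors the predictor."""
    xhat, bits, rec = 0.0, [], []
    for s in x:
        b = 1 if s - a * xhat >= 0 else 0
        xhat = a * xhat + (step if b else -step)
        bits.append(b)
        rec.append(xhat)
    return np.array(bits), np.array(rec)

fs = 8000
t = np.arange(fs // 4) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t)        # highly correlated frame

snr = lambda ref, est: 10 * np.log10(np.mean(ref**2)
                                     / np.mean((ref - est)**2))
_, rec_hi = delta_modulate(x, step=0.05, a=1.0)  # matched predictor
_, rec_lo = delta_modulate(x, step=0.05, a=0.5)  # mismatched predictor
snr_hi, snr_lo = snr(x, rec_hi), snr(x, rec_lo)
print(f"a=1.0: {snr_hi:.1f} dB   a=0.5: {snr_lo:.1f} dB")
```

Choosing the coefficient per frame from the measured correlation, as the paper does, keeps the predictor matched to the local signal statistics.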
84. Speaker Recognition for Hindi Speech Signal using MFCC-GMM Approach.
- Author
-
Maurya, Ankur, Kumar, Divya, and Agarwal, R.K.
- Subjects
SPEECH coding ,LINEAR predictive coding ,WIRELESS communications ,DATA transmission systems ,COMPUTER networks - Abstract
Speaker recognition across different languages is still a big challenge for researchers, and the identification rate (IR) suffers when the speech sample is short. This paper implements speaker recognition for Hindi speech samples using Mel frequency cepstral coefficient–vector quantization (MFCC-VQ) and Mel frequency cepstral coefficient–Gaussian mixture model (MFCC-GMM) approaches for text-dependent and text-independent phrases. The accuracy of text-independent recognition by MFCC-VQ and MFCC-GMM for Hindi speech samples is 77.64% and 86.27%, respectively. However, the accuracy increases significantly for text-dependent recognition, where the corresponding accuracies are 85.49% and 94.12%. We tested 15 speakers, consisting of 10 male and 5 female speakers, with 17 trials per speaker. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
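The scoring core of the MFCC-GMM approach trains one Gaussian mixture per speaker and identifies a test utterance by maximum average log-likelihood. A sketch with synthetic MFCC-like frames standing in for real Hindi speech features (speaker names, dimensions, and mixture sizes are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 13-dimensional "MFCC" frames for two speakers; in practice
# these would be extracted from enrollment recordings.
rng = np.random.default_rng(0)
train = {
    "spk1": rng.normal(0.0, 1.0, (500, 13)),
    "spk2": rng.normal(1.5, 1.2, (500, 13)),
}
models = {spk: GaussianMixture(n_components=4, random_state=0).fit(feats)
          for spk, feats in train.items()}

def identify(frames):
    """Pick the speaker whose GMM gives the highest mean log-likelihood."""
    return max(models, key=lambda spk: models[spk].score(frames))

test_utt = rng.normal(1.5, 1.2, (200, 13))   # new utterance from spk2
print(identify(test_utt))                     # → spk2
```

The VQ variant replaces the GMM with a codebook and scores by quantization distortion; the decision rule is otherwise the same.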
85. Spoken character classification using abductive network.
- Author
-
Lawal, Isah
- Subjects
SPEECH perception ,GMDH algorithms ,SPEECH coding ,SPEECH synthesis ,ACOUSTIC signal processing ,LANGUAGE & languages - Abstract
In this paper, we address the problem of learning a classifier for the classification of spoken character. We present a solution based on Group Method of Data Handling (GMDH) learning paradigm for the development of a robust abductive network classifier. We improve the reliability of the classification process by introducing the concept of multiple abductive network classifier system. We evaluate the performance of the proposed classifier using three different speech datasets including spoken Arabic digit, spoken English letter, and spoken Pashto digit. The performance of the proposed classifier surpasses that reported in the literature for other classification techniques on the same speech datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
86. High payload multi-channel dual audio watermarking algorithm based on discrete wavelet transform and singular value decomposition.
- Author
-
Elshazly, A., Nasr, M., Fouad, M., and Abdel-Samie, F.
- Subjects
SPEECH coding ,SPEECH synthesis ,LANGUAGE & languages ,SINGULAR value decomposition ,WAVELET transforms ,SIGNAL quantization ,ACOUSTIC signal processing - Abstract
Digital watermarking technology addresses copyright protection, data authentication, content identification, distribution, and duplication of digital media, problems made pressing by the great developments in computers and Internet technology. Recently, the protection of digital audio signals has attracted the attention of researchers. This paper proposes a new audio watermarking scheme based on the discrete wavelet transform (DWT), singular value decomposition (SVD), and quantization index modulation (QIM), with a synchronization code and two encrypted watermark images or logos embedded into a stereo audio signal. In this algorithm, the original audio signal is split into blocks, each block is decomposed with a two-level DWT, and the approximate low-frequency sub-band coefficients are then decomposed by SVD to obtain a diagonal matrix. The prepared watermark and synchronization code bit stream is embedded into the diagonal matrix using QIM. After that, inverse singular value decomposition (ISVD) and inverse discrete wavelet transform (IDWT) are performed to obtain the watermarked audio signal. The watermark can be blindly extracted without knowledge of the original audio signal. Experimental results show that the transparency and imperceptibility of the proposed algorithm are satisfactory, and that robustness is strong against popular audio signal processing attacks. A high watermarking payload is achieved through the proposed scheme. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
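The QIM step of such a scheme quantizes a coefficient onto one of two interleaved lattices selected by the watermark bit, so extraction is blind: the decoder only checks which lattice the received value is closer to. A sketch that embeds one bit in the largest singular value of a block and recovers it after reconstruction (the DWT stage is omitted; the block contents and quantization step delta are illustrative):

```python
import numpy as np

def qim_embed(v, bit, delta=0.2):
    """Quantize v onto the lattice selected by the bit: two interleaved
    uniform lattices offset from each other by delta/2."""
    off = delta / 2 if bit else 0.0
    return np.round((v - off) / delta) * delta + off

def qim_extract(v, delta=0.2):
    d0 = abs(v - qim_embed(v, 0, delta))     # distance to lattice 0
    d1 = abs(v - qim_embed(v, 1, delta))     # distance to lattice 1
    return 0 if d0 <= d1 else 1

rng = np.random.default_rng(1)
# Stand-in for a DWT low-frequency sub-band block, built with well-separated
# singular values so the QIM perturbation (at most delta/2) cannot reorder them.
block = (np.diag([6.0, 5.0, 4.0, 3.0, 2.0, 1.5, 1.0, 0.5])
         + 0.01 * rng.normal(size=(8, 8)))
U, s, Vt = np.linalg.svd(block)

s_marked = s.copy()
s_marked[0] = qim_embed(s[0], bit=1)         # embed one watermark bit
watermarked = U @ np.diag(s_marked) @ Vt     # inverse SVD

s_recv = np.linalg.svd(watermarked, compute_uv=False)
extracted = qim_extract(s_recv[0])
print(extracted)                              # → 1
```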
87. Deep neural network training for whispered speech recognition using small databases and generative model sampling.
- Author
-
Ghaffarzadegan, Shabnam, Bořil, Hynek, and Hansen, John
- Subjects
SPEECH perception ,NEURAL circuitry ,HIDDEN Markov models ,SPEECH coding ,SPEECH synthesis ,ACOUSTIC signal processing ,LANGUAGE & languages - Abstract
State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. It has been shown that DNN-HMM hybrid systems outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications. This improvement is mainly attributed to the ability of DNNs to model more complex data structures. However, having sufficient data samples is one key point in training a high-accuracy DNN as a discriminative model. This barrier makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method to produce an excessive amount of pseudo-samples that requires availability of only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distributions. Next, random sampling is used to generate a large amount of pseudo-samples from the UBM. Frame-shuffling is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences to better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN-HMM acoustic model of a speech recognizer. The proposed method is evaluated on small-sized sets of neutral and whisper data drawn from the UT-Vocal Effort II corpus. It is shown that phoneme error rates (PERs) of a DNN-HMM based speech recognizer are considerably reduced when incorporating the generated pseudo-samples in the training process, with 79.0% and 45.6% relative PER improvements for the neutral-neutral and whisper-whisper training/test scenarios, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
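The pseudo-sample generation step can be sketched with scikit-learn: fit a GMM (standing in for the UBM) to the limited target-domain data, draw a large number of samples from it, and pool them with the real data. The frame-shuffling smoothing of temporal trajectories is omitted here; the sizes and distributions are illustrative stand-ins for whispered-speech cepstra:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Small "target domain" feature set standing in for scarce whisper data.
rng = np.random.default_rng(0)
real = np.vstack([rng.normal(-1, 0.5, (150, 6)),
                  rng.normal(2, 0.8, (150, 6))])

# 1) Fit a UBM-style GMM as a parametric estimate of the data distribution.
ubm = GaussianMixture(n_components=2, random_state=0).fit(real)

# 2) Draw a large number of pseudo-samples from the fitted distribution.
pseudo, _ = ubm.sample(10_000)

# 3) Augment: combine real and pseudo-samples for acoustic-model training.
augmented = np.vstack([real, pseudo])
print(augmented.shape)                        # → (10300, 6)
```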
88. Clean speech/speech with background music classification using HNGD spectrum.
- Author
-
Khonglah, Banriskhem and Prasanna, S.
- Subjects
SPEECH coding ,ORAL communication ,LANGUAGE & languages ,DISCRETE Fourier transforms ,VOCAL tract ,ENVIRONMENTAL music - Abstract
This work explores the characteristics of speech in terms of the spectral characteristics of the vocal tract system, for deriving features effective in classifying clean speech versus speech with background music. A representation of the spectral characteristics of the vocal tract system in the form of the Hilbert envelope of the numerator of group delay (HNGD) spectrum is explored for the task. This representation complements existing methods of computing the spectral characteristics in terms of temporal resolution. The HNGD spectrum has an additive and high-resolution property which gives a better representation of the formants, especially the higher ones. A feature known as the spectral contrast across sub-bands is extracted from the HNGD spectrum; it essentially represents the relative spectral characteristics of the vocal tract system. The vocal tract system is also represented approximately in terms of mel frequency cepstral coefficients (MFCCs), which capture the average spectral characteristics. The MFCCs and the sum of the spectral contrast on HNGD can thus be used as features representing the average and relative spectral characteristics of the vocal tract system, respectively. These features complement each other and can be combined in a multidimensional framework to provide good discrimination between clean speech and speech with background music segments. The spectral contrast on the HNGD spectrum is compared to the spectral contrast on the discrete Fourier transform (DFT) spectrum, which also represents the relative spectral characteristics of the vocal tract system, and better performance is achieved on the HNGD spectrum than the DFT spectrum. The features are classified using classifiers such as Gaussian mixture models and support vector machines. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
89. A waveform concatenation technique for text-to-speech synthesis.
- Author
-
Panda, Soumya and Nayak, Ajit
- Subjects
SPEECH synthesis ,SPEECH coding ,ORAL communication ,LANGUAGE & languages ,ACOUSTIC signal processing ,SPEECH processing systems - Abstract
Designing text-to-speech systems capable of producing natural-sounding speech in different Indian languages is a challenging and ongoing problem. Because of the large number of possible pronunciations in different Indian languages, many speech segments need to be stored in the speech database when a concatenative speech synthesis technique is used to achieve highly natural speech. However, the large database makes such systems unusable for small handheld devices or human-computer interactive systems with limited storage. In this paper, we propose a fraction-based waveform concatenation technique to produce intelligible speech segments from a small-footprint speech database. The results of all the experiments performed show the effectiveness of the proposed technique in producing intelligible speech in different Indian languages with far less storage and computation overhead than the existing syllable-based technique. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
90. Research on English machine translation system based on the internet.
- Author
-
Zhang, Yu
- Subjects
MACHINE translating ,COMPUTATIONAL linguistics ,SPEECH coding ,ORAL communication ,LANGUAGE & languages - Abstract
With the continuous development of internet technology, its application in the machine translation field is becoming more and more popular. Combining information technology with machine translation technology, an internet-based machine translation system accesses English-Chinese bilingual parallel information via the internet and completes translation with human assistance, reducing the reliance on a standalone machine. To verify the system, this paper studies an internet-based English machine translation system and finds that it can complete rapid and effective translation with higher quality and efficiency than traditional machine translation methods. The system is therefore worth promoting and has good development prospects. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
91. Dravidian language classification from speech signal using spectral and prosodic features.
- Author
-
Koolagudi, Shashidhar, Bharadwaj, Akash, Srinivasa Murthy, Y., Reddy, Nishaanth, and Rao, Priya
- Subjects
PROSODIC analysis (Linguistics) ,DURATION (Phonetics) ,DRAVIDIAN languages ,SPEECH coding ,ORAL communication - Abstract
An interesting aspect of the Dravidian languages is their commonality through a shared script, similar vocabulary, and a common root language. In this work, an attempt has been made to classify four complex Dravidian languages using cepstral coefficients and prosodic features. Speech in the Dravidian languages was recorded in various environments to build a database. It is demonstrated that while cepstral coefficients can indeed identify the language correctly with a fair degree of accuracy, prosodic features are added to the cepstral coefficients to improve language identification performance. Legendre polynomial fitting and principal component analysis (PCA) are applied to the feature vectors to reduce dimensionality, which also reduces time complexity. In the experiments conducted, it is found that using both cepstral coefficients and prosodic features, a language identification rate of around 87% is obtained, about 18% above the baseline system using Mel-frequency cepstral coefficients (MFCCs). The results show that temporal variations and prosody are important factors to consider for language identification. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
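The dimensionality-reduction step can be sketched with scikit-learn's PCA, keeping however many components explain a chosen fraction of the variance. The feature matrix here is a synthetic stand-in for the MFCC-plus-prosody vectors; the 95% threshold and dimensions are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in feature vectors (e.g., cepstral + prosodic stats per utterance);
# multiplying by a mixing matrix makes the dimensions correlated, so PCA
# has redundancy to remove.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 40)) @ rng.normal(size=(40, 40))

pca = PCA(n_components=0.95)                 # keep 95% of the variance
reduced = pca.fit_transform(feats)
print(feats.shape[1], "->", reduced.shape[1])
```

Passing a float in (0, 1) to `n_components` lets PCA pick the component count automatically from the explained-variance ratio.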
92. A voice command detection system for aerospace applications.
- Author
-
Tabibian, Shima
- Subjects
ORAL communication ,SPEECH coding ,HIDDEN Markov models ,MARKOV processes ,COMPRESSED sensing - Abstract
Nowadays, given ever-increasing volumes of audio content, audio processing is a vital need. In the aerospace field, voice commands could be used instead of data commands to speed up command transmission, help crewmembers complete their tasks by allowing hands-free control of supplemental equipment, and serve as a redundant system that increases the reliability of command transmission. In this paper, a voice command detection (VCD) framework is proposed for aerospace applications, which decodes voice commands into comprehensible and executable commands at an acceptable speed with a low false alarm rate. The framework is mainly based on a keyword spotting method, which extracts pre-defined target keywords from the input voice commands. These keywords are input arguments to the proposed rule-based language model (LM), which decodes the voice commands based on the keywords and their locations. Two keyword spotters are trained and used in the VCD system. The phone-based keyword spotter is trained on the TIMIT database; speaker adaptation methods are then exploited to modify the parameters of the trained models using non-native speaker utterances. The word-based keyword spotter is trained on a database prepared specifically for aerospace applications. The experimental results show that the word-based VCD system decodes voice commands with a true detection rate of 88% and a false alarm rate of 12% on average. Additionally, using speaker adaptation methods in the phone-based VCD system improves the true detection and false alarm rates by about 21% each. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
93. Single-channel speech separation using combined EMD and speech-specific information.
- Author
-
Prasanna Kumar, M. and Kumaraswamy, R.
- Subjects
BLIND source separation ,SIGNAL separation ,COMPRESSED sensing ,ORAL communication ,LINEAR predictive coding ,SPEECH coding - Abstract
Multi-channel blind source separation (BSS) methods use more than one microphone, so there is a need for speech separation algorithms that work in the single-microphone scenario. In this paper we propose a method for single-channel speech separation (SCSS) that combines empirical mode decomposition (EMD) with speech-specific information derived in the form of source-filter features. Source features are obtained from multi-pitch information, and filter information is estimated using formant analysis. To track multi-pitch information in the mixed signal, we apply simplified inverse filter tracking (SIFT) and histogram-based pitch estimation to the excitation source information; formant estimation is done using linear predictive (LP) analysis. Pitch and formant estimation are performed with and without EMD decomposition for better extraction of the individual speakers in the mixture. Combining EMD with speech-specific information provides encouraging results for single-channel speech separation. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
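The pitch-tracking stage can be sketched as autocorrelation peak picking over the physiologically plausible lag range, which is the core of SIFT-style trackers (the inverse-filtering pre-stage is omitted; the sampling rate, frame, and search range are illustrative):

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60, fmax=400):
    """Pitch estimate via the strongest autocorrelation peak within
    the lag range corresponding to [fmin, fmax] Hz."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs           # one 40 ms analysis frame
frame = (np.sin(2 * np.pi * 125 * t)         # fundamental at 125 Hz
         + 0.3 * np.sin(2 * np.pi * 250 * t))  # plus its second harmonic

pitch = autocorr_pitch(frame, fs)
print(f"estimated pitch: {pitch:.0f} Hz")
```

Restricting the lag search to the plausible pitch range is what keeps the harmonic at 250 Hz from being picked instead of the fundamental.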
94. Analysis-by-Synthesis Speech Coding
- Author
-
Chen, Juin-Hwey, Thyssen, Jes, Benesty, Jacob, editor, Sondhi, M. Mohan, editor, and Huang, Yiteng Arden, editor
- Published
- 2008
- Full Text
- View/download PDF
95. Low-Bit-Rate Speech Coding
- Author
-
McCree, Alan V., Benesty, Jacob, editor, Sondhi, M. Mohan, editor, and Huang, Yiteng Arden, editor
- Published
- 2008
- Full Text
- View/download PDF
96. Principles of Speech Coding
- Author
-
Kleijn, W. Bastiaan, Benesty, Jacob, editor, Sondhi, M. Mohan, editor, and Huang, Yiteng Arden, editor
- Published
- 2008
- Full Text
- View/download PDF
97. A Hybrid Method of User Identification with Use Independent Speech and Facial Asymmetry
- Author
-
Kubanek, Mariusz, Rydzek, Szymon, Carbonell, Jaime G., editor, Siekmann, J\'org, editor, Rutkowski, Leszek, editor, Tadeusiewicz, Ryszard, editor, Zadeh, Lotfi A., editor, and Zurada, Jacek M., editor
- Published
- 2008
- Full Text
- View/download PDF
98. Time–Frequency Analysis of Vietnamese Speech Inspired on Chirp Auditory Selectivity
- Author
-
Nguyen, Ha, Weruaga, Luis, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Ho, Tu-Bao, editor, and Zhou, Zhi-Hua, editor
- Published
- 2008
- Full Text
- View/download PDF
99. Paralinguistic Speech Processing: An Overview
- Author
-
Awder Mohammed Ahmed, Dindar Mikaeel Ahmed, Hajar Maseeh Yasin, Ibrahim Mahmood Ibrahim, Shakir Fattah Kak, Zryan Najat Rashid, Naaman Omar, Azar Abid Salih, Rozin Majeed Abdullah, and Siddeeq Y. Ameen
- Subjects
Computer science ,Speech recognition ,Speech coding ,General Medicine ,Speech processing ,Paralanguage - Abstract
With the aid of technology, users can effectively control computers and create documents by speaking. Speech recognition enables documents to be produced more efficiently because the program typically generates words as quickly as they are spoken, which is usually much faster than a person can type. Speech recognition is a technology that automatically finds the words and phrases that best match human speech input. Its most common application is correspondence, where it can be used to produce letters, messages, and other documents. The aims of speech processing are: to understand speech as a medium of communication; to represent speech for transmission and reproduction; to analyze speech for automated information detection and extraction; and to discover physiological characteristics of the speaker. Speech processing has two major functions: the transcription of voice to text, and the conversion of text into a human voice.
- Published
- 2021
100. Speaker Verification from Codec-Distorted Speech Through Combination of Affine Transform and Feature Switching
- Author
-
M. S. Athulya and P. S. Sathidevi
- Subjects
Feature (computer vision) ,Computer science ,Applied Mathematics ,Speech recognition ,Feature vector ,Signal Processing ,Speech coding ,Word error rate ,Feature selection ,TIMIT ,Mel-frequency cepstrum ,Affine transformation - Abstract
A high-performance speaker verification system for codec-distorted speech is developed and implemented in this paper, utilizing a priori knowledge of the type of speech codec. A code-excited linear prediction (CELP)-based codec, one of the most commonly used codecs in mobile communications, is assumed here. A novel method is developed by applying the concepts of feature switching and affine transforms to the design and implementation of the proposed speaker verification system. In this system, the best feature set for each speaker is identified during the training phase from affine-transformed speech features to make feature selection more robust. Mel frequency cepstral coefficients and modified power normalized cepstral coefficients are the features used for feature switching, which is done directly at the feature level and indirectly in the i-vector framework. During the testing phase, the best feature set of the claimed speaker is extracted from the codec-distorted speech and an affine transform is applied to reflect the feature space seen during training; speaker verification is performed using this affine-transformed feature set. Classifiers based on the Gaussian mixture model-universal background model and i-vectors are used for verification. The performance of the proposed system is tested on two databases, TIMIT and VoxCeleb1. For both databases and both classifiers, we achieve a very low equal error rate compared with the other competitive methods in the literature. Hence, the proposed system is a very good candidate for critical applications such as forensic speaker verification.
- Published
- 2021