69 results for "Audio coding"
Search Results
2. MPEG-1 psychoacoustic model emulation using multiscale convolutional neural networks.
- Author
-
Kemper, Guillermo, Sanchez, Alonso, and Serpa, Sergio
- Abstract
The Moving Picture Experts Group - 1 (MPEG-1) perceptual audio compression scheme is a successful family of audio codecs described in standard ISO/IEC 11172-3. Currently, there is no general framework to emulate either the MPEG-1 psychoacoustic model or any other psychoacoustic model, which is a core piece of many perceptual codecs. This work presents a successful implementation of a convolutional neural network which emulates psychoacoustic model 1 from the MPEG-1 standard, termed "MCNN-PM" (Multiscale Convolutional Neural Network – Psychoacoustic Model). It is then implemented as part of the MPEG-1, Layer I codec. Using the objective difference grade (ODG) to evaluate audio quality, the MCNN-PM MPEG-1, Layer I codec outperforms the original MPEG-1, Layer I codec by up to 17% at 96 kbps and 14% at 128 kbps, and performs almost equally at 192 kbps. This work shows that convolutional neural networks are a viable alternative to standard psychoacoustic models and can be used successfully as part of perceptual audio codecs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
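The entry above describes replacing a psychoacoustic model with a multiscale convolutional network. As a rough illustration of that general idea (not the authors' MCNN-PM architecture), the following PyTorch sketch applies 1-D convolutions with several kernel widths to a magnitude-spectrum frame and regresses per-band masking thresholds; the layer sizes, kernel widths, and 512-bin input are assumptions.
```python
# Illustrative sketch only: a multiscale 1-D CNN mapping a magnitude-spectrum
# frame to per-band masking thresholds, loosely in the spirit of emulating a
# psychoacoustic model with convolutions at several scales. All sizes assumed.
import torch
import torch.nn as nn

class MultiscaleMaskingNet(nn.Module):
    def __init__(self, n_bins=512, n_bands=32):
        super().__init__()
        # Parallel branches with different receptive fields ("multiscale").
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 8, kernel_size=k, padding=k // 2), nn.ReLU())
            for k in (3, 9, 27)
        ])
        self.head = nn.Sequential(
            nn.Conv1d(24, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(n_bands),           # one value per critical band
            nn.Conv1d(16, 1, kernel_size=1),
        )

    def forward(self, spectrum):                      # spectrum: (batch, n_bins)
        x = spectrum.unsqueeze(1)                     # -> (batch, 1, n_bins)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.head(feats).squeeze(1)            # (batch, n_bands) thresholds

model = MultiscaleMaskingNet()
demo = torch.rand(4, 512)                             # 4 random spectrum frames
print(model(demo).shape)                              # torch.Size([4, 32])
```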
3. Deep convolutional neural networks for double compressed AMR audio detection
- Author
-
Aykut Büker and Cemal Hanilçi
- Subjects
audio coding, audio recording, audio signal processing, data compression, feature extraction, signal classification, Telecommunication, TK5101-6720 - Abstract
Detection of double compressed (DC) adaptive multi-rate (AMR) audio recordings is a challenging audio forensic problem that has received great attention in recent years. Here, the authors propose to use convolutional neural networks (CNN) for DC AMR audio detection. The CNN is used as (i) an end-to-end DC AMR audio detection system and (ii) a feature extractor. The end-to-end system receives the audio spectrogram as input and returns a decision on whether the input audio is single compressed (SC) or DC. As a feature extractor, in turn, it is used to extract discriminative features, which are then modelled using a support vector machine (SVM) classifier. Our extensive analysis conducted on four different datasets shows the success of the proposed system and provides new findings related to the problem. Firstly, double compression has a considerable impact on the high-frequency components of the signal. Secondly, the proposed system yields great performance independent of the recording device or environment. Thirdly, when previously altered files are used in the experiments, a 97.41% detection rate is obtained with the CNN system. Finally, the cross-dataset evaluation experiments show that the proposed system is very effective in case of a mismatch between training and test datasets.
- Published
- 2021
- Full Text
- View/download PDF
4. Highly Efficient Audio Coding With Blind Spectral Recovery Based on Machine Learning.
- Author
-
Kim, Jae-Won, Beack, Seung Kwon, Lim, Wootaek, and Park, Hochong
- Subjects
MACHINE learning, VIDEO coding, MUSIC conducting, AUDIO equipment, INFORMATION processing - Abstract
This letter proposes a new method for audio coding that utilizes blind spectral recovery to improve the coding efficiency without compromising performance. The proposed method transmits only a fraction of the spectral coefficients, thereby reducing the coding bit rate. Then, it recovers the remaining coefficients in the decoder using the transmitted coefficients as input. The proposed method is differentiated from conventional spectral recovery in that the coefficients to be recovered are interleaved with the transmitted coefficients to obtain the most data correlation. Further, it enhances the transmitted coefficients, which are degraded by quantization errors, to deliver better information to the recovery process. The spectral recovery is conducted recursively on a band basis such that information recovered in one band is used for the recovery in subsequent bands. An improved level correction for the recovered coefficients and a new sign coding are also developed. A subjective performance evaluation confirms that the proposed method at 40 kbps provides statistically equivalent sound quality to a state-of-the-art coding method at 48 kbps for speech and music categories. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
5. MPEG Standards for Compressed Representation of Immersive Audio.
- Author
-
Quackenbush, Schuyler R. and Herre, Jurgen
- Subjects
MPEG (Video coding standard), MULTI-degree of freedom, SINGLE-degree-of-freedom systems, HEADPHONES, LOUDSPEAKERS, AUGMENTED reality - Abstract
The term “immersive audio” is frequently used to describe an audio experience that provides the listener the sensation of being fully immersed or “present” in a sound scene. This can be achieved via different presentation modes, such as surround sound (several loudspeakers horizontally arranged around the listener), 3D audio (with loudspeakers at, above, and below listener ear level), and binaural audio to headphones. This article provides an overview of two recent standards that support the bitrate-efficient carriage of high-quality immersive sound. The first is MPEG-H 3D audio, which is a versatile standard that supports multiple immersive sound signal formats (channels, objects, and higher order ambisonics) and is now being adopted in broadcast and streaming applications. The second is MPEG-I immersive audio, an extension of 3D audio, currently under development, which is targeted for virtual and augmented reality applications. This will support rendering of fully user-interactive immersive sound for three degrees of user movement [three degrees of freedom (3DoF)], i.e., yaw, pitch, and roll head movement, and for six degrees of user movement [six degrees of freedom (6DoF)], i.e., 3DoF plus translational x, y, and z user position movements. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
6. Speech Coding, Synthesis, and Compression
- Author
-
Farouk, Mohamed Hesham, Gan, Woon-Seng, Series editor, Kuo, C.-C. Jay, Series editor, Zheng, Thomas Fang, Series editor, Barni, Mauro, Series editor, and Farouk, Mohamed Hesham
- Published
- 2018
- Full Text
- View/download PDF
7. Digital audio signal watermarking using minimum‐energy scaling optimisation in the wavelet domain.
- Author
-
Hsu, Chih‐Yu, Tu, Shu‐Yi, Yang, Chao‐Tung, Chang, Ching‐Lung, and Chen, Shuo‐Tsung
- Abstract
This work's contributions include three innovative concepts: an improved model, a two-stage Lagrange principle, and minimum-energy scaling optimisation for quantisation audio watermarking in the wavelet domain. First, discrete wavelet transform (DWT) multi-coefficient quantisation, based on arbitrary scaling of the lowest DWT coefficients, is connected in a model with the group-based signal-to-noise ratio (SNR) of these coefficients. Then, the two-stage Lagrange principle and the minimum-energy approach play two essential roles in obtaining the optimal scaling factors. With the proposed scheme, the best fidelity and robustness of the embedded audio can be attained, and a perceptual evaluation of audio quality (PEAQ) test, with an illustration of the relationship between SNR and PEAQ, is also performed. Simulation results show that each audio file watermarked by the proposed method attains a high SNR, good PEAQ, and a low bit error rate (BER). The SNR of most watermarked audio files is above 35 dB or even above 40 dB, and the corresponding subjective difference grade of PEAQ is close to 0. In terms of BER, most values are as low as 2% or less, indicating better robustness against many attacks, such as re-sampling, amplitude scaling, and MP3 compression. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
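The entry above concerns quantisation watermarking of low-frequency DWT coefficients. As a minimal sketch of that family of methods (not the paper's minimum-energy scaling optimisation), the following embeds bits by forcing the parity of quantised approximation coefficients; the wavelet, level, mode, and step size are assumptions.
```python
# Illustrative QIM-style watermark embedding on low-frequency DWT
# coefficients; it does not implement the paper's minimum-energy scaling
# optimisation. Wavelet, level, mode and step size are assumptions.
import numpy as np
import pywt

WAVELET, LEVEL, MODE, STEP = "db4", 3, "periodization", 0.05

def embed_bits(audio, bits):
    coeffs = pywt.wavedec(audio, WAVELET, mode=MODE, level=LEVEL)
    approx = coeffs[0].copy()                    # lowest-frequency coefficients
    for i, bit in enumerate(bits):
        q = int(np.round(approx[i] / STEP))
        if q % 2 != bit:                         # force parity to carry the bit
            q += 1
        approx[i] = q * STEP
    coeffs[0] = approx
    return pywt.waverec(coeffs, WAVELET, mode=MODE)

def extract_bits(watermarked, n_bits):
    approx = pywt.wavedec(watermarked, WAVELET, mode=MODE, level=LEVEL)[0]
    return [int(np.round(approx[i] / STEP)) % 2 for i in range(n_bits)]

rng = np.random.default_rng(0)
audio = 0.1 * rng.standard_normal(4096)
bits = [1, 0, 1, 1, 0, 0, 1, 0]
print(extract_bits(embed_bits(audio, bits), len(bits)) == bits)   # True
```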
8. A new source‐filter model audio bandwidth extension using high frequency perception feature for IoT communications.
- Author
-
Jiang, Lin, Yu, Shaoqian, Wang, Xiaochen, Wang, Chao, and Wang, Tonghan
- Subjects
BANDWIDTHS, SENSORY perception - Abstract
Summary: Audio coding is a generic technology in IoT application systems. Audio bandwidth extension is a standard technique within contemporary audio codecs to efficiently code audio signals at low bitrates. In existing methods, the high-frequency signal is generated by duplicating the corresponding low frequency together with some high-frequency parameters. However, the perceptual quality of coding degrades significantly if the correlation between high and low frequencies becomes weak. In this paper, we propose a new source-filter model audio bandwidth extension method. In our method, a perceptual feature of the high-frequency signal is extracted to restore the perceptual quality of coding. On the decoder side, the crest factor and noise level are obtained under the constraint of the high-frequency perception parameter. The performance shown in our experimental results is superior to that of classic methods. Compared with the state-of-the-art method, the proposed method is also competitive because the coding bitrate is significantly reduced while a close perceptual coding quality is maintained. This paper also provides a new solution for IoT communications that require low bitrates and high coding quality, such as the Internet of Vehicles. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
9. Advancement of 22.2 Multichannel Sound Broadcasting Based on MPEG-H 3D Audio.
- Author
-
Sugimoto, Takehiro, Aoki, Shuichi, Hasegawa, Tomomi, and Komori, Tomoyasu
- Subjects
BROADCASTING industry, SOUND systems, AUDIO equipment, COMPUTER graphics, LOUDSPEAKERS - Abstract
This study proposes improvements to 22.2 multichannel (22.2 ch) sound broadcasting service. 22.2 ch sound is currently used in the 8K satellite broadcasting in Japan. In this study, the audio system is migrated from channel-based audio to object-based audio. The object-based audio equips 22.2 ch sound with alternative and adaptive functionalities: the alternative functionality is related to dialogue controls such as multilingual services, while the adaptive functionality enables 22.2 ch sound to be adapted to the audio format of the playback equipment. Moving Picture Experts Group (MPEG)-H 3D Audio (3DA), which is the latest audio coding standard, is used as the audio coding scheme. A real-time encoder and decoder based on 3DA was developed to verify the practicability of the proposed system. The encoded audio data is packetized and transmitted by MPEG-H MPEG Media Transport (MMT) to be multiplexed with video data. A transmission experiment with 8K video was carried out in which the proposed system was proved to operate as designed in this study. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
10. Secure echo‐hiding audio watermarking method based on improved PN sequence and robust principal component analysis.
- Author
-
Wang, Shengbei, Wang, Chao, Yuan, Weitao, Wang, Lin, and Wang, Jianming
- Abstract
Echo-hiding has been widely studied for audio watermarking. This study proposes a more secure echo-hiding method based on a modified pseudo-noise (PN) sequence and robust principal component analysis (RPCA). In the proposed method, the RPCA is used to decompose the original audio signal into low-rank and sparse parts, and then a pair of opposite modified PN sequences is employed to embed watermarks. The modified PN sequence improves the robustness of watermark detection by providing additional correlation peaks. Meanwhile, benefiting from the RPCA and the opposite PN sequences, the security of the proposed method is improved, since watermarks cannot be detected from the whole signal even if the PN sequence is known, which is an obvious improvement over previous PN-based echo-hiding methods. In the watermark detection process, the authors make use of the low-rank and sparse characteristics of the watermarked signal to detect watermarks from the low-rank and sparse parts, respectively. Based on this basic framework, they also propose a multi-bit embedding scheme, which obtains a doubled embedding capacity compared with previous PN-based echo-hiding methods. The proposed method was evaluated with respect to inaudibility, security, and robustness. The experimental results verified the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
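For readers unfamiliar with the echo-hiding principle underlying the entry above, here is a minimal single-bit sketch: a bit is embedded by adding a faint echo at one of two delays and detected from the real cepstrum. The paper's modified PN sequences and RPCA decomposition are not reproduced; the delays and echo gain are assumptions.
```python
# Minimal echo-hiding sketch: a bit selects one of two echo delays; the
# detector looks for the corresponding real-cepstrum peak. The entry's
# PN-sequence and RPCA refinements are omitted; values below are assumed.
import numpy as np

FS = 16000
DELAY0, DELAY1 = int(0.001 * FS), int(0.0015 * FS)   # delays for bit 0 / bit 1
ALPHA = 0.3                                          # echo amplitude

def embed(signal, bit):
    d = DELAY1 if bit else DELAY0
    out = signal.copy()
    out[d:] += ALPHA * signal[:-d]                   # add a delayed, scaled copy
    return out

def detect(signal):
    spectrum = np.fft.rfft(signal)
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))
    return 1 if cepstrum[DELAY1] > cepstrum[DELAY0] else 0

rng = np.random.default_rng(1)
host = rng.standard_normal(FS)                       # 1 s of noise-like audio
print(detect(embed(host, 0)), detect(embed(host, 1)))   # expect: 0 1
```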
11. AMBIQUAL: Towards a Quality Metric for Headphone Rendered Compressed Ambisonic Spatial Audio.
- Author
-
Narbutt, Miroslaw, Skoglund, Jan, Allen, Andrew, Chinen, Michael, Barry, Dan, and Hines, Andrew
- Subjects
HEADPHONES, BIT rate, STREAMING audio, VIRTUAL reality, STREAMING media, QUALITY of service, IMAGE compression, VIRTUAL reality software - Abstract
Featured Application: Streaming spatial audio for immersive audio and virtual reality applications will require compression algorithms that maintain the localization accuracy and irrationality attributes of sound sources as well as a high-fidelity quality of experience. Models to evaluate quality will be important for media content streaming application such as YouTube as well as VR gaming and other immersive multimedia experiences. Spatial audio is essential for creating a sense of immersion in virtual environments. Efficient encoding methods are required to deliver spatial audio over networks without compromising Quality of Service (QoS). Streaming service providers such as YouTube typically transcode content into various bit rates and need a perceptually relevant audio quality metric to monitor users' perceived quality and spatial localization accuracy. The aim of the paper is two-fold. First, it is to investigate the effect of Opus codec compression on the quality of spatial audio as perceived by listeners using subjective listening tests. Secondly, it is to introduce AMBIQUAL, a full reference objective metric for spatial audio quality, which derives both listening quality and localization accuracy metrics directly from the B-format Ambisonic audio. We compare AMBIQUAL quality predictions with subjective quality assessments across a variety of audio samples which have been compressed using the Opus 1.2 codec at various bit rates. Listening quality and localization accuracy of first and third-order Ambisonics were evaluated. Several fixed and dynamic audio sources (single and multiple) were used to evaluate localization accuracy. Results show good correlation regarding listening quality and localization accuracy between objective quality scores using AMBIQUAL and subjective scores obtained during listening tests. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
12. On the Consumption of Multimedia Content Using Mobile Devices: a Year to Year User Case Study.
- Author
-
FALKOWSKI-GILSKI, Przemysław
- Subjects
MULTIMEDIA systems, DIGITAL music players, CONCERT halls, CASE studies, MUSIC videos - Abstract
In the early days, consumption of multimedia content related to audio signals was only possible in a stationary manner. The music player was located at home, with a necessary physical drive. An alternative way for an individual was to attend a live performance at a concert hall or host a private concert at home. To sum up, audio-visual effects were reserved for a narrow group of recipients. Today, thanks to portable players, vision and sound are at last available to everyone. Finally, thanks to multimedia streaming platforms, every music piece or video, e.g. from one's favourite artist or band, can be viewed anytime and anywhere. The background or status of an individual is no longer an issue. Each person who is connected to the global network can have access to the same resources. This paper is focused on the consumption of multimedia content using mobile devices. It describes a year-to-year user case study carried out between 2015 and 2019 and outlines the development of current trends related to the expectations of modern users. The goal of this study is to aid policymakers, as well as providers, when it comes to designing and evaluating systems and services. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
13. Detection of Frequency-Scale Modification Using Robust Audio Watermarking Based on Amplitude Modulation
- Author
-
Nishimura, Akira, Hutchison, David, Series editor, Kanade, Takeo, Series editor, Kittler, Josef, Series editor, Kleinberg, Jon M., Series editor, Mattern, Friedemann, Series editor, Mitchell, John C., Series editor, Naor, Moni, Series editor, Pandu Rangan, C., Series editor, Steffen, Bernhard, Series editor, Terzopoulos, Demetri, Series editor, Tygar, Doug, Series editor, Weikum, Gerhard, Series editor, Shi, Yun-Qing, editor, Kim, Hyoung Joong, editor, Pérez-González, Fernando, editor, and Echizen, Isao, editor
- Published
- 2016
- Full Text
- View/download PDF
14. Reviews on Technology and Standard of Spatial Audio Coding
- Author
-
Ikhwana Elfitri and Amirul Luthfi
- Subjects
spatial audio, audio coding, multi-channel audio signals, MPEG standard, object-based audio, Electrical engineering. Electronics. Nuclear engineering, TK1-9971 - Abstract
Market demand for more impressive entertainment media has motivated the delivery of three-dimensional (3D) audio content to home consumers through Ultra High Definition TV (UHDTV), the next generation of TV broadcasting, in which spatial audio coding plays a fundamental role. This paper reviews the fundamental concepts of spatial audio coding, including technology, standards, and applications. The basic principles of object-based audio reproduction systems are also elaborated and compared with the traditional channel-based system, to provide a good understanding of this popular interactive audio reproduction approach, which gives end users the flexibility to render their own preferred audio composition.
- Published
- 2017
- Full Text
- View/download PDF
15. An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction
- Author
-
Cobos, Maximo, Ahrens, Jens, Kowalczyk, Konrad, and Politis, Archontis
- Published
- 2022
- Full Text
- View/download PDF
16. Time domain synchronisation estimation algorithm for FBMC vector signal analysis in 5G system
- Author
-
Yunzhi Ling, Yu Zhang, and Lantian Xu
- Subjects
synchronisation, time-domain analysis, mobile communication, channel bank filters, ofdm modulation, acoustic signal detection, audio coding, signal quality changes, low signal-to-noise ratio, good synchronisation, frequency domain, power synchronisation, fifth-generation mobile communication network system, filter bank multicarrier test applications, fbmc vector signal analysis, time domain synchronisation estimation algorithm, 5g test applications, fbmc signal vector analysis, fbmc signal vector analysers, Engineering (General). Civil engineering (General), TA1-2040 - Abstract
To address the challenges of filter bank multi-carrier (FBMC) test applications in the fifth-generation mobile communication network (5G) system, a time-domain synchronisation estimation algorithm based on power synchronisation in the frequency domain is presented. Through MATLAB simulation, the proposed algorithm is verified to achieve good synchronisation at low signal-to-noise ratio (SNR) and large frequency offset without changes in signal quality. As a result, it can be applied in FBMC signal vector analysers, promoting FBMC signal vector analysis functions used in 5G test applications.
- Published
- 2019
- Full Text
- View/download PDF
17. Stethoscope with digital frequency translation for improved audibility
- Author
-
Herbert M. Aumann and Nuri W. Emanetoglu
- Subjects
hilbert transforms, audio coding, analogue-digital conversion, bioacoustics, cardiology, biomedical equipment, medical signal processing, microcontrollers, modulation, intestinal sounds, analog frequency translator, digital frequency translation, acoustic stethoscope, heart sounds, chest sounds, hilbert transformer, single sideband suppressed carrier modulation, hearing impaired physicians, microcontroller, time delay, audio coder-decoder, frequency 200.0 hz, frequency 72.0 mhz, Medical technology, R855-855.5 - Abstract
The performance of an acoustic stethoscope is improved by translating, without loss of fidelity, heart sounds, chest sounds, and intestinal sounds below 50 Hz into a frequency range of 200 Hz, which is easily detectable by the human ear. Such a frequency translation will be of significant benefit to hearing-impaired physicians, and it will improve stethoscope performance in a noisy environment. The technique is based on single sideband suppressed carrier modulation. Stability and bias problems commonly associated with an analog frequency translator are avoided by an all-digital implementation. Real-time audio processing is made possible by approximating a Hilbert transformer with a time delay. The performance of the digital frequency translator was verified with a 16-bit 44.1 kS/s audio coder/decoder and a 32-bit 72 MHz microcontroller.
- Published
- 2019
- Full Text
- View/download PDF
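The entry above relies on single-sideband suppressed-carrier modulation to shift low-frequency body sounds upward. The sketch below shows that operation using SciPy's exact Hilbert transform for clarity (the paper approximates the Hilbert transformer with a time delay for real-time use on a microcontroller); the sample rate and shift amount are assumptions.
```python
# Sketch of upward frequency translation by single-sideband (SSB) modulation:
# the analytic signal is multiplied by a complex carrier, shifting every
# component up by F_SHIFT without creating a mirrored sideband. The paper
# approximates the Hilbert transformer with a time delay; here scipy's exact
# transform is used for clarity. Sample rate and shift amount are assumed.
import numpy as np
from scipy.signal import hilbert

FS = 8000          # sample rate (Hz)
F_SHIFT = 150      # upward shift (Hz), e.g. moving ~50 Hz sounds toward 200 Hz

def translate_up(x, fs=FS, shift=F_SHIFT):
    analytic = hilbert(x)                                    # x + j*H{x}
    t = np.arange(len(x)) / fs
    return np.real(analytic * np.exp(2j * np.pi * shift * t))

# A 40 Hz test tone ends up at 190 Hz after translation.
t = np.arange(FS) / FS
tone = np.sin(2 * np.pi * 40 * t)
shifted = translate_up(tone)
peak_bin = np.argmax(np.abs(np.fft.rfft(shifted)))
print(peak_bin * FS / len(shifted))                           # ~190.0
```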
18. Time domain synchronisation estimation algorithm for FBMC vector signal analysis in 5G system.
- Author
-
Ling, Yunzhi, Zhang, Yu, and Xu, Lantian
- Subjects
SYNCHRONIZATION, FILTER banks, SIGNAL-to-noise ratio, 5G networks, MOBILE communication systems - Abstract
To address the challenges of filter bank multi-carrier (FBMC) test applications in the fifth-generation mobile communication network (5G) system, a time-domain synchronisation estimation algorithm based on power synchronisation in the frequency domain is presented. Through MATLAB simulation, the proposed algorithm is verified to achieve good synchronisation at low signal-to-noise ratio (SNR) and large frequency offset without changes in signal quality. As a result, it can be applied in FBMC signal vector analysers, promoting FBMC signal vector analysis functions used in 5G test applications. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
19. An effective hybrid low delay packet loss concealment algorithm for MDCT-based audio codec.
- Author
-
Lin, Zhibin, Lu, Jing, and Qiu, Xiaojun
- Subjects
AUDIO codec, AUDITORY perception, SECRECY, MUSIC & language, ALGORITHMS - Abstract
This paper proposes a hybrid packet loss concealment (PLC) algorithm for MDCT-based audio codecs, with different PLC strategies for tone-dominant source signals and noise-like signals. A key finding is that, for a stationary source signal with dominant tonal components, the phase angle of the MDCT-MDST coefficients decreases linearly with increasing frame index while the amplitude remains unchanged. Therefore, an efficient frame interpolation method is designed to accurately estimate the phase angle and the magnitude of the MDCT-MDST coefficients of the lost frame. For noise-like signals without overwhelming tonal components, a modified shaped-noise insertion is proposed to improve the audio perception. Both objective and subjective test results show that the proposed algorithm provides better performance than existing ones for both music and voiced speech signals. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
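The concealment observation in the entry above (linear phase progression and constant magnitude of MDCT-MDST coefficients for stationary tonal frames) can be illustrated with a few lines of NumPy; the sketch only shows the per-bin extrapolation step and assumes the complex MDCT-MDST coefficients of the two preceding frames are already available.
```python
# Sketch of the concealment idea stated in the abstract: for a stationary
# tonal signal, the complex MDCT-MDST coefficient phase changes linearly
# across frames while the magnitude stays constant, so a lost frame can be
# predicted from the two preceding frames. Computing the MDCT/MDST pairs
# from a real bitstream is outside this sketch.
import numpy as np

def predict_lost_frame(frame_m2, frame_m1):
    """Given complex MDCT-MDST coefficients of frames n-2 and n-1,
    extrapolate frame n by repeating the per-bin phase increment."""
    magnitude = np.abs(frame_m1)                      # magnitude assumed constant
    phase_step = np.angle(frame_m1) - np.angle(frame_m2)
    predicted_phase = np.angle(frame_m1) + phase_step
    return magnitude * np.exp(1j * predicted_phase)

# Toy check with a single "tonal" bin whose phase rotates by a fixed step.
step = -0.7                                           # radians per frame
frames = [2.0 * np.exp(1j * k * step) for k in range(3)]
estimate = predict_lost_frame(frames[0], frames[1])
print(np.allclose(estimate, frames[2]))               # True
```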
20. BPL-PLC Voice Communication System for the Oil and Mining Industry
- Author
-
Grzegorz Debita, Przemysław Falkowski-Gilski, Marcin Habrych, Grzegorz Wiśniewski, Bogdan Miedziński, Przemysław Jedlikowski, Agnieszka Waniewska, Jan Wandzio, and Bartosz Polnik
- Subjects
audio coding, digital systems, electrical engineering, ICT, Industry 4.0, IoT, Technology - Abstract
Application of a high-efficiency voice communication system based on broadband over power line-power line communication (BPL-PLC) technology in medium voltage networks, including hazardous areas (like the oil and mining industry), as a redundant means of wired communication (apart from traditional fiber optics and electrical wires) can be beneficial. Due to the possibility of utilizing existing electrical infrastructure, it can significantly reduce deployment costs. Additionally, it can be applied under difficult conditions, thanks to battery-powered devices. During an emergency situation (e.g., after a coal dust explosion), the medium voltage cables are resistant to mechanical damage, providing a potentially life-saving communication link between the supervisor, rescue team, paramedics, and the trapped personnel. The assessment of such a system requires a comprehensive and accurate examination, including a number of factors. Therefore, various models were tested, considering different transmission paths, types of coupling (inductive and capacitive), and various lengths of transmitted data packets. Next, a subjective quality evaluation study was carried out, considering speech signals in a number of languages (English, German, and Polish). Based on the obtained results, including both simulations and measurements, appropriate practical conclusions were formulated. The results confirmed the applicability of BPL-PLC technology as an efficient voice communication system for the oil and mining industry.
- Published
- 2020
- Full Text
- View/download PDF
21. AMBIQUAL: Towards a Quality Metric for Headphone Rendered Compressed Ambisonic Spatial Audio
- Author
-
Miroslaw Narbutt, Jan Skoglund, Andrew Allen, Michael Chinen, Dan Barry, and Andrew Hines
- Subjects
virtual reality, spatial audio, Ambisonics, audio coding, audio compression, Opus codec, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999 - Abstract
Spatial audio is essential for creating a sense of immersion in virtual environments. Efficient encoding methods are required to deliver spatial audio over networks without compromising Quality of Service (QoS). Streaming service providers such as YouTube typically transcode content into various bit rates and need a perceptually relevant audio quality metric to monitor users’ perceived quality and spatial localization accuracy. The aim of the paper is two-fold. First, it is to investigate the effect of Opus codec compression on the quality of spatial audio as perceived by listeners using subjective listening tests. Secondly, it is to introduce AMBIQUAL, a full reference objective metric for spatial audio quality, which derives both listening quality and localization accuracy metrics directly from the B-format Ambisonic audio. We compare AMBIQUAL quality predictions with subjective quality assessments across a variety of audio samples which have been compressed using the Opus 1.2 codec at various bit rates. Listening quality and localization accuracy of first and third-order Ambisonics were evaluated. Several fixed and dynamic audio sources (single and multiple) were used to evaluate localization accuracy. Results show good correlation regarding listening quality and localization accuracy between objective quality scores using AMBIQUAL and subjective scores obtained during listening tests.
- Published
- 2020
- Full Text
- View/download PDF
22. Uniform Transition Tree-Structured Critically Sampled Filterbanks
- Author
-
Wixen, Ryan
- Subjects
Signal processing, Filterbanks, Audio coding, Electrical engineering - Abstract
Critically sampled filterbanks are useful in applications like audio coding that involve processing the time-varying spectral characteristics of signals. Critically sampled filterbanks can be implemented with a tree-structure. The odd-numbered subbands of a critically sampled filterbank exhibit frequency inversion, causing subbands to be unordered in frequency past the first layer. We show how subbands can be swapped at each layer to maintain their ordering. Self-similar filterbanks, using the same filters at each layer of the tree, have nonuniform transition widths and long impulse responses. We explore using filters with wider transition widths at deeper layers, and we demonstrate that this technique can be used to implement an efficient uniform transition filterbank. We compare this uniform transition filterbank with a self-similar filterbank in an audio codec, showing that the uniform transition filterbank achieves a smaller maximum transition width at a lower impulse response length.
- Published
- 2023
- Full Text
- View/download PDF
23. Modelling Timbral Hardness.
- Author
-
Pearce, Andy, Brookes, Tim, and Mason, Russell
- Subjects
METADATA, HARDNESS, ACOUSTIC filters, SOUNDS, LOUDNESS, PSYCHOACOUSTICS, REGRESSION analysis - Abstract
Featured Application: The model of timbral hardness described in this study is expected to be used for the searching and filtering of sound effects. Hardness is the most commonly searched timbral attribute within freesound.org, a commonly used online sound effect repository. A perceptual model of hardness was developed to enable the automatic generation of metadata to facilitate hardness-based filtering or sorting of search results. A training dataset was collected of 202 stimuli with 32 sound source types, and perceived hardness was assessed by a panel of listeners. A multilinear regression model was developed on six features: maximum bandwidth, attack centroid, midband level, percussive-to-harmonic ratio, onset strength, and log attack time. This model predicted the hardness of the training data with R² = 0.76. It predicted hardness within a new dataset with R² = 0.57, and predicted the rank order of individual sources perfectly, after accounting for the subjective variance of the ratings. Its performance exceeded that of human listeners. [ABSTRACT FROM AUTHOR]
- Published
- 2019
- Full Text
- View/download PDF
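The hardness model in the entry above is a multilinear regression on six acoustic features. A sketch of fitting such a model with scikit-learn follows; the feature values and hardness ratings are synthetic placeholders, not the study's dataset.
```python
# Sketch of a multilinear timbral-hardness model of the kind the abstract
# describes: ordinary least-squares regression on six acoustic features.
# The feature values and ratings below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

FEATURES = ["max_bandwidth", "attack_centroid", "midband_level",
            "percussive_to_harmonic", "onset_strength", "log_attack_time"]

rng = np.random.default_rng(42)
X = rng.standard_normal((202, len(FEATURES)))          # 202 stimuli, 6 features
true_weights = np.array([0.8, 0.5, 0.3, 0.4, 0.6, -0.7])
y = X @ true_weights + 0.3 * rng.standard_normal(202)  # synthetic hardness ratings

model = LinearRegression().fit(X, y)
print(dict(zip(FEATURES, np.round(model.coef_, 2))))
print("R^2 on training data:", round(r2_score(y, model.predict(X)), 2))
```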
24. 22.2 ch Audio Encoding/Decoding Hardware System Based on MPEG-4 AAC.
- Author
-
Sugimoto, Takehiro, Nakayama, Yasushige, and Komori, Tomoyasu
- Subjects
MPEG (Video coding standard), MULTIMEDIA communications, SOUND systems, DECODERS (Electronics), BIT rate - Abstract
A 22.2 multichannel (22.2 ch) sound system has been adopted as an audio system for 8K Super Hi-Vision (8K). The 22.2 ch sound system is an advanced sound system composed of 24 channels three-dimensionally located in a space to envelop listeners in an immersive sound field. NHK has been working on standardizing and developing an 8K broadcasting system via a broadcasting satellite in time for test broadcasting in 2016. For an audio coding scheme, NHK developed a world-first 22.2 ch audio encoding/decoding hardware system (22.2 ch audio codec) capable of real time encoding/decoding. The fabricated 22.2 ch audio codec is based on MPEG-4 AAC and was assembled into the 8K codec together with the 8K video codec and the multiplexer. The audio quality of the fabricated 22.2 ch audio codec was assessed in an objective evaluation, and the evaluation results revealed the operational bit rates of the fabricated codec. An 8K satellite broadcasting experiment was carried out as a final verification test of the 8K broadcasting system, and 22.2 ch audio codec was found to be valid. [ABSTRACT FROM PUBLISHER]
- Published
- 2017
- Full Text
- View/download PDF
25. Delivering Scalable Audio Experiences using AC-4.
- Author
-
Riedmiller, Jeffrey, Kjorling, Kristofer, Roden, Jonas, Wolters, Martin, Biswas, Arijit, Boon, Prinyar, Carroll, Tim, Ekstrand, Per, Groschel, Alexander, Hedelin, Per, Hirvonen, Toni, Horich, Holger, Klejsa, Janusz, Koppens, Jeroen, Krauss, Kurt, Lehtonen, Heidi-Maria, Linzmeier, Karsten, Mehta, Sripal, Muesch, Hannes, and Mundt, Harald
- Subjects
AUDIO codec, DVB-H (Standard), INTERNET access control, VIDEO recording, IMAGE quality analysis - Abstract
AC-4 is a state-of-the-art audio codec standardized in ETSI (TS 103 190 and TS 103 190-2) and included in the DVB toolbox (TS 101 154 V2.2.1 and DVB BlueBook A157) and, at the time of writing, is a candidate standard for ATSC 3.0 as per A/342 part 2. AC-4 is an audio codec designed to address the current and future needs of video and audio entertainment services, including broadcast and Internet streaming. As such, it incorporates a number of features beyond the traditional audio coding algorithms, such as capabilities to support immersive and personalized audio, support for advanced loudness management, video-frame synchronous coding, dialog enhancement, etc. This paper will outline the thinking behind the design of the AC-4 codec, explain the different coding tools used and the systemic features included, and give an overview of performance and applications. It further outlines metadata aspects (immersive and personalized, essential for broadcast), metadata carriage, aspects of interchange of immersive programming, as well as immersive playback and rendering. [ABSTRACT FROM AUTHOR]
- Published
- 2017
- Full Text
- View/download PDF
26. Modelling Timbral Hardness
- Author
-
Andy Pearce, Tim Brookes, and Russell Mason
- Subjects
audio coding, artificial intelligence, sound recording, sound quality, psychoacoustics, timbre, modelling, perception, music information retrieval, Technology, Engineering (General). Civil engineering (General), TA1-2040, Biology (General), QH301-705.5, Physics, QC1-999, Chemistry, QD1-999 - Abstract
Hardness is the most commonly searched timbral attribute within freesound.org, a commonly used online sound effect repository. A perceptual model of hardness was developed to enable the automatic generation of metadata to facilitate hardness-based filtering or sorting of search results. A training dataset was collected of 202 stimuli with 32 sound source types, and perceived hardness was assessed by a panel of listeners. A multilinear regression model was developed on six features: maximum bandwidth, attack centroid, midband level, percussive-to-harmonic ratio, onset strength, and log attack time. This model predicted the hardness of the training data with R² = 0.76. It predicted hardness within a new dataset with R² = 0.57, and predicted the rank order of individual sources perfectly, after accounting for the subjective variance of the ratings. Its performance exceeded that of human listeners.
- Published
- 2019
- Full Text
- View/download PDF
27. MPEG-H 3D Audio: Immersive Audio Coding
- Author
-
Herre, Jürgen and Quackenbush, S.R.
- Subjects
Audio data reduction, Audio coding, Audio compression, Immersive audio, MPEG, 3D audio, MPEG-H - Abstract
The term "Immersive Audio" is frequently used to describe an audio experience that provides to the listener the sensation of being fully immersed or "present" in a sound scene. This can be achieved via different presentation modes, such as surround sound (several loudspeakers horizontally arranged around the listener), 3D audio (with loudspeakers at, above and below listener ear level) and binaural audio to headphones. This article provides an overview of the recent MPEG standard, MPEG-H 3D Audio, which is a versatile standard that supports multiple immersive sound signal formats (channels, objects, higher order ambisonics), and is now being adopted in broadcast and streaming applications.
- Published
- 2022
28. Audio Coding Using Overlap and Kernel Adaptation.
- Author
-
Helmrich, Christian R. and Edler, Bernd
- Subjects
AUDIO codec, KERNEL functions, ENCODING, QUANTIZATION (Physics), ELECTRIC switchgear - Abstract
Perceptual audio coding schemes typically apply the modified discrete cosine transform (MDCT) with different lengths and windows, and utilize signal-adaptive switching between these on a per-frame basis for best subjective performance. In previous papers, the authors demonstrated that further quality gains can be achieved for some input signals using additional transform kernels such as the modified discrete sine transform (MDST) or greater inter-transform overlap by means of a modified extended lapped transform (MELT). This work discusses the algorithmic procedures and codec modifications necessary to combine all of the above features—transform length, window shape, transform kernel, and overlap ratio switching—into a flexible input-adaptive coding system. It is shown that, due to full time-domain aliasing cancelation, this system supports perfect signal reconstruction in the absence of quantization and, thanks to fast realizations of all transforms, increases the codec complexity only negligibly. The results of a 5.1 multichannel listening test are also reported. [ABSTRACT FROM AUTHOR]
- Published
- 2016
- Full Text
- View/download PDF
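For reference alongside the entry above, the two transform kernels it switches between can be written directly; the sketch below gives the textbook (non-fast) definitions of the MDCT and its sine counterpart, the MDST, for one 2N-sample block. Real codecs use fast O(N log N) realizations rather than these O(N²) loops.
```python
# Direct (non-fast) reference definitions of the two lapped-transform kernels
# the abstract mentions: the MDCT and the MDST. Both map a 2N-sample windowed
# block to N coefficients.
import numpy as np

def mdct(block):
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)[:, None]
    basis = np.cos(np.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
    return basis @ block

def mdst(block):
    two_n = len(block)
    n_half = two_n // 2
    n = np.arange(two_n)
    k = np.arange(n_half)[:, None]
    basis = np.sin(np.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))
    return basis @ block

x = np.random.default_rng(3).standard_normal(16)   # one 2N = 16 sample block
print(mdct(x).shape, mdst(x).shape)                # (8,) (8,)
```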
29. Gaussian channel transmission of images and audio files using cryptcoding
- Author
-
Aleksandra Popovska-Mitrovikj, Vesna Dimitrova, Vladimir Ilievski, Daniela Mechkaroska, Verica Bakeva, and Boro Jakimovski
- Subjects
packet error rate, RCBQ performance, decoding, error-correcting codes, information security, Computer science, channel coding, quasigroups, audio files, image files, cut-decoding algorithm, decoding speed, random codes, Electrical and Electronic Engineering, image coding, cryptography, Gaussian channel transmission, cryptcoding, 4-sets-cut-decoding algorithms, decoded image, Computer Science Applications, audio coding, Gaussian channels, bit-error rate, coding algorithm, error statistics, Decoding methods - Abstract
Random codes based on quasigroups (RCBQ) are cryptcodes, i.e. they are error-correcting codes, which provide information security. Cut-Decoding and 4-Sets-Cut-Decoding algorithms for these codes are defined elsewhere. Also, the performance of these codes for the transmission of text messages is investigated elsewhere. In this study, the authors investigate the RCBQ's performance with Cut-Decoding and 4-Sets-Cut-Decoding algorithms for transmission of images and audio files through a Gaussian channel. They compare experimental results for both coding/decoding algorithms and for different values of signal-to-noise ratio. In all experiments, the differences between the transmitted and decoded image or audio file are considered. Experimentally obtained values for bit-error rate and packet error rate and the decoding speed of both algorithms are compared. Also, two filters for enhancing the quality of the images decoded using RCBQ are proposed.
- Published
- 2019
- Full Text
- View/download PDF
30. Audio security through compressive sampling and cellular automata.
- Author
-
George, Sudhish, Augustine, Nishanth, and Pattathil, Deepthi
- Subjects
COMPRESSED sensing, SOUND recording & reproducing, DATA encryption, CELLULAR automata, DATA compression, RANDOM matrices - Abstract
In this paper, a new approach for scrambling compressive sensed (CS) audio data using two-dimensional cellular automata is presented. In order to improve the security, a linear feedback shift register (LFSR) based secure measurement matrix for compressive sensing is used. The basic idea is to select the different states of the LFSR as the entries of a random matrix and orthonormalize these values to generate a Gaussian random measurement matrix. It is proposed to generate the initial state matrix of the cellular automata using an LFSR-based random bitstream generator. In order to improve the security and key space of the proposed cryptosystem, piecewise linear chaotic map (PWLCM) based generation of initial seeds for the LFSRs is used. In the proposed approach, the initial value, parameter value, and number of iterations of the PWLCM are kept secret to provide security. The proposed audio encryption method for CS audio data is validated with different compressive sensing reconstruction approaches. Experimental and analytical verification shows that the proposed encryption system gives good reconstruction performance, robustness to noise, a high level of scrambling, and good security against several forms of attack. Moreover, since the measurement matrix used for the CS operation and the initial state matrix used for the 2D cellular automata are generated using the secret key, the storage/transmission requirement for these can be avoided. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
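The entry above derives a compressive-sensing measurement matrix from LFSR states followed by orthonormalization. The sketch below illustrates that idea in a simplified form: LFSR output words are arranged into a matrix whose rows are orthonormalized via QR. The tap polynomial, word size, matrix shape, and QR step are assumptions, and the paper's PWLCM-based seeding is omitted.
```python
# Simplified sketch of building a key-dependent compressive-sensing
# measurement matrix from an LFSR, in the spirit of the abstract. Taps,
# word size and orthonormalization method are assumptions; the chaotic-map
# (PWLCM) seeding described in the paper is omitted.
import numpy as np

def lfsr_bits(seed, taps, n_bits, width=16):
    state = seed
    for _ in range(n_bits):
        bit = 0
        for t in taps:                        # XOR of tapped positions
            bit ^= (state >> t) & 1
        state = ((state << 1) | bit) & ((1 << width) - 1)
        yield state & 1

def measurement_matrix(m, n, seed=0xACE1, taps=(15, 13, 12, 10)):
    bits = np.fromiter(lfsr_bits(seed, taps, m * n * 8), dtype=np.uint8)
    words = bits.reshape(m * n, 8) @ (1 << np.arange(8))   # pack bits into bytes
    raw = words.reshape(m, n).astype(float)
    raw -= raw.mean()
    q, _ = np.linalg.qr(raw.T)                # orthonormal columns -> rows of Phi
    return q[:, :m].T / np.sqrt(n)

phi = measurement_matrix(64, 256)             # 64 measurements of a length-256 signal
print(phi.shape, np.allclose(phi @ phi.T, np.eye(64) / 256))
```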
31. An enhanced direct chord transformation for music retrieval in the AAC transform domain with window switching.
- Author
-
Chang, Tai-Ming, Hsieh, Chia-Bin, and Chang, Pao-Chi
- Subjects
AUDIO codec, DIGITAL image processing, DIGITIZATION, INFORMATION processing, INFORMATION retrieval - Abstract
With the explosive growth in the number of music albums produced, retrieving music information has become a critical aspect of managing music data. Extracting frequency parameters directly from the compressed files to represent music greatly benefits processing speed when working on a large database. In this study, we focused on advanced audio coding (AAC) files and analyzed the disparity in frequency expression between discrete Fourier transform and discrete cosine transform, considered the frequency resolution to select the appropriate frequency range, and developed a direct chroma feature-transformation method in the AAC transform domain. An added challenge to using AAC files directly is long/short window switching, ignoring which may result in inaccurate frequency mapping and inefficient information retrieval. For a short window in particular, we propose a peak-competition method to enhance the pitch information that does not include ambiguous frequency components when combining eight subframes. Moreover, for chroma feature segmentation, we propose a simple dynamic-segmentation method to replace the complex computation of beat tracking. Our experimental results show that the proposed method increased the accuracy rate by approximately 7 % in Top-1 search results over transform-domain methods described previously and performed nearly as effectively as state-of-the-art waveform-domain approaches did. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
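The entry above builds chroma features directly from transform-domain coefficients. As a generic illustration of the chroma idea (operating on FFT magnitudes rather than AAC MDCT coefficients, and ignoring long/short window switching), each bin's magnitude is accumulated into one of 12 pitch classes according to its centre frequency.
```python
# Generic chroma-feature sketch: each spectral bin's magnitude is added to
# one of 12 pitch classes based on its centre frequency. The paper applies
# this mapping directly to AAC MDCT coefficients and handles window
# switching, which is not reproduced here.
import numpy as np

def chroma_from_magnitudes(mags, fs, f_ref=440.0, f_min=55.0):
    n_bins = len(mags)
    freqs = np.arange(n_bins) * fs / (2 * (n_bins - 1))    # rFFT bin centres
    chroma = np.zeros(12)
    for f, m in zip(freqs, mags):
        if f < f_min:
            continue                                        # skip DC / rumble
        pitch_class = int(np.round(12 * np.log2(f / f_ref))) % 12   # 0 = A
        chroma[pitch_class] += m
    return chroma / (chroma.sum() + 1e-12)

fs = 44100
t = np.arange(4096) / fs
tone = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 659.26 * t)  # A + E
mags = np.abs(np.fft.rfft(tone))
print(np.round(chroma_from_magnitudes(mags, fs), 2))        # energy mainly in A and E
```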
32. Subjective and Objective Assessment of Perceived Audio Quality of Current Digital Audio Broadcasting Systems and Web-Casting Applications.
- Author
-
Pocta, Peter and Beerends, John G.
- Subjects
AUDIO codec, DIGITAL audio, BROADCAST channels, IMAGE quality analysis, STREAMING media - Abstract
This paper investigates the impact of different audio codecs typically deployed in current digital audio broadcasting (DAB) systems and web-casting applications, which represent a main source of quality impairment in these systems and applications, on the quality perceived by the end user. Both subjective and objective assessments are used. Two different audio quality prediction models, namely Perceptual Evaluation of Audio Quality (PEAQ) and Perceptual Objective Listening Quality Assessment (POLQA) Music, are evaluated by comparing the predictions with subjectively obtained grades. The results show that the degradations introduced by the typical lossy audio codecs deployed in current DAB systems and web-casting applications operating at the lowest bit rate typically used in these distribution systems and applications seriously impact the subjective audio quality perceived by the end user. Furthermore, it is shown that a retrained POLQA Music provides the best overall correlations between predicted objective measurements and subjective scores, allowing the final perceived quality to be predicted with good accuracy when scores are averaged over a small set of musical fragments (R = 0.95). [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
33. MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio.
- Author
-
Herre, Jurgen, Hilpert, Johannes, Kuntz, Achim, and Plogsties, Jan
- Abstract
The science and art of Spatial Audio is concerned with the capture, production, transmission, and reproduction of an immersive sound experience. Recently, a new generation of spatial audio technology has been introduced that employs elevated and lowered loudspeakers and thus surpasses previous ‘surround sound’ technology without such speakers in terms of listener immersion and potential for spatial realism. In this context, the ISO/MPEG standardization group has started the MPEG-H 3D Audio development effort to facilitate high-quality bitrate-efficient production, transmission and reproduction of such immersive audio material. The underlying format is designed to provide universal means for carriage of channel-based, object-based and Higher Order Ambisonics based input. High quality reproduction is provided for many output formats from 22.2 and beyond down to 5.1, stereo and binaural reproduction—independently of the original encoding format, thus overcoming the incompatibility between various 3D formats. This paper provides an overview of the MPEG-H 3D Audio project and technology and an assessment of the system capabilities and performance. [ABSTRACT FROM PUBLISHER]
- Published
- 2015
- Full Text
- View/download PDF
34. Low Delay Robust Audio Coding by Noise Shaping, Fractional Sampling, and Source Prediction
- Author
-
Jan Ostergaard, Bilgin, Ali, Marcellin, Michael W., Serra-Sagrista, Joan, and Storer, James A.
- Subjects
PEAQ, Audio signal, Computer science, Network packet, low delay, fractional sampling, Noise shaping, audio coding, Packet loss, source predictions, Oversampling, Multiple descriptions, Algorithm, Decoding methods, Data compression
It was recently shown that the combination of source prediction, two-times oversampling, and noise shaping can be used to obtain a robust (multiple-description) audio coding framework for networks with packet loss probabilities less than 10%. Specifically, it was shown that audio signals could be encoded into two descriptions (packets), which were separately sent over a communication channel. Each description yields a desired performance by itself, and when they are combined, the performance is improved. This paper extends the previous work to an arbitrary number of descriptions (packets) by using fractional oversampling and a new decoding principle. We demonstrate that, due to source aliasing, existing MSE-optimized reconstruction rules from noisy sampled data perform poorly from a perceptual point of view. A simple reconstruction rule is proposed that improves the PEAQ objective difference grades (ODG) by more than 2 points. The proposed audio coder enables low-delay high-quality audio streaming on networks with late packet arrivals or packet losses. With a coding delay of 2.5 ms and a total bitrate of 300 kbps, it is demonstrated that mean PEAQ ODGs around -0.65 can be obtained for 48 kHz (mono) music (pop & rock) and packet loss probabilities of 20%.
- Published
- 2021
- Full Text
- View/download PDF
35. sc3nb: a Python-SuperCollider Interface for Auditory Data Science
- Author
-
Thomas Hermann and Dennis Reinsch
- Subjects
interactive programming, Interface (Java), Computer science, NumPy, Audification, SuperCollider, Python (programming language), Data science, audio coding, Sonification, auditory data science, Coding (social sciences), Python
This paper introduces sc3nb, a Python package for audio coding and interactive control of the SuperCollider programming environment. sc3nb supports Jupyter notebooks, enables flexible means for sound and music computing such as sound synthesis and analysis, and is particularly tailored for sonification. We present the main concepts and interfaces and illustrate how to use sc3nb by means of selected code examples for basic sonification approaches, such as audification and parameter-mapping sonification. Finally, we introduce TimedQueues, which enable coordinated audiovisual displays, e.g. to synchronize matplotlib data and sc3nb-based sound rendition. sc3nb enables interactive sound applications right in the center of the pandas/numpy/scipy data science ecosystem. The open source package is hosted at GitHub and available via the Python Package Index PyPI.
- Published
- 2021
36. ITU-T SG16 Sapporo Meeting Report.
- Author
-
Kiyoshi Tanaka, Shohei Matsuo, Hitoshi Ohmuro, and Seiichi Sakaya
- Subjects
STREAMING video & television, MULTIMEDIA systems, VIDEO coding, TELECOMMUNICATION conferences, CONFERENCES & conventions - Abstract
The article discusses the highlights of the International Telecommunication Union-Telecommunication Standardization Sector Study Group 16 (ITU-T SG16) meeting held from June 30 to July 11, 2014 at the Sapporo Convention Center in Hokkaido, Japan. Topics tackled at the event include Internet protocol television (IPTV), accessibility of multimedia services and systems, video coding, and robust transmission technology.
- Published
- 2015
- Full Text
- View/download PDF
37. Finite‐state entropy‐constrained vector quantiser for audio modified discrete cosine transform coefficients uniform quantisation.
- Author
-
Jiang, Sumxin, Yin, Rendong, and Liu, Peilin
- Abstract
In this paper, an entropy-constrained vector quantiser (ECVQ) scheme with finite memory, called finite-state ECVQ (FS-ECVQ), is presented. This scheme consists of a finite-state vector quantiser (FSVQ) and multiple component ECVQs. By utilising the FSVQ, the inter-frame dependencies within the source sequence can be effectively exploited and no side information needs to be transmitted. By employing the ECVQs, the total memory requirements of the FS-ECVQ can be efficiently decreased while the coding performance is improved. An FS-ECVQ, designed for coding modified discrete cosine transform coefficients, was implemented and evaluated based on the unified speech and audio coding (USAC) scheme. Results showed that the FS-ECVQ achieved a reduction of the total memory requirements of 92.3% compared with the encoder in USAC working draft 6 (WD6), and of over 10% compared with the encoder in the USAC final version (FINAL), while maintaining coding performance similar to FINAL, which was about 4% better than that of WD6. [ABSTRACT FROM AUTHOR]
- Published
- 2015
- Full Text
- View/download PDF
38. Audio coding via EMD
- Author
-
Thierry Chonavel, Kais Khaldi, Mounia Turki Hadj-Alouane, Ali Komaty, and Abdel-Ouahab Boudraa
- Subjects
Computer science, Audio coding, Sub-band coding, Hilbert-Huang transform, Wavelet, Artificial Intelligence, Codec, Psychoacoustics, Electrical and Electronic Engineering, Empirical mode decomposition, Audio signal, Applied Mathematics, Stationarity index, Maxima and minima, Computational Theory and Mathematics, Signal Processing, Empirical mode compression, Computer Vision and Pattern Recognition, Statistics, Probability and Uncertainty, Algorithm, Psychoacoustic model, Coding (social sciences) - Abstract
In this paper, an audio coding scheme based on the empirical mode decomposition (EMD) in association with a psychoacoustic model is presented. The principle of the method consists in adaptively breaking down the audio signal into intrinsic oscillatory components, called Intrinsic Mode Functions (IMFs), that are fully described by their local extrema. These extrema are encoded. The coding is carried out frame by frame and no assumption is made about the signal to be coded. The number of allocated bits varies from mode to mode and obeys the coding-error inaudibility constraint. Due to the symmetry of an IMF, only the extrema (maxima or minima) of one of its interpolating envelopes are perceptually coded. In addition, to deal with rapidly changing audio signals, a stationarity index is used, and when a transient is detected, the frame is split into two overlapping sub-frames. At the decoder side, the IMFs are recovered using the associated coded maxima, and the original signal is reconstructed by IMF summation. The performance of the proposed coding is analysed and compared to that of the MP3 and AAC codecs and a wavelet-based coding approach. Based on the analysed mono audio signals, the obtained results show that the proposed coding scheme outperforms the MP3 and wavelet-based coding methods and performs slightly better than the AAC codec, showing the potential of the EMD for data-driven audio coding.
- Published
- 2020
- Full Text
- View/download PDF
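The principle in the entry above (represent each IMF by its extrema only and rebuild it by envelope interpolation) can be illustrated with a much simplified sketch. It keeps both maxima and minima of every IMF, rebuilds each by spline interpolation, and has no psychoacoustic bit allocation; PyEMD (the EMD-signal package) is assumed to be installed for the decomposition.
```python
# Much simplified sketch of EMD-based coding: the signal is split into IMFs,
# each IMF is represented only by its local extrema, and the decoder rebuilds
# every IMF by interpolation through those extrema before summing. PyEMD is
# assumed available; the paper's single-envelope coding and psychoacoustic
# bit allocation are not reproduced.
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema
from PyEMD import EMD

def encode(signal):
    imfs = EMD().emd(signal)                       # intrinsic mode functions
    encoded = []
    for imf in imfs:
        idx = np.unique(np.concatenate((
            argrelextrema(imf, np.greater)[0],
            argrelextrema(imf, np.less)[0],
            [0, len(imf) - 1])))                   # keep end points as anchors
        encoded.append((idx, imf[idx]))            # positions + values only
    return encoded, len(signal)

def decode(encoded, length):
    t = np.arange(length)
    out = np.zeros(length)
    for idx, vals in encoded:
        if len(idx) >= 4:
            out += CubicSpline(idx, vals)(t)
        else:                                      # trend/residue: linear is enough
            out += np.interp(t, idx, vals)
    return out

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 7 * t)
approx = decode(*encode(x))
print("SNR (dB):", round(10 * np.log10(np.sum(x**2) / np.sum((x - approx)**2)), 1))
```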
39. Fast Randomization for Distributed Low-Bitrate Coding of Speech and Audio
- Author
-
Johannes Fischer and Tom Bäckström
- Subjects
Acoustics and Ultrasonics, Computer science, speech coding, Speech recognition, superfast algorithm, randomization, distributed coding, orthonormal matrix, Audio codec, Computer Science (miscellaneous), Speech, Electrical and Electronic Engineering, Quantization (signal processing), Voice activity detection, Complexity theory, Linear predictive coding, Speech processing, Sub-band coding, Codecs, Computational Mathematics, audio coding, Adaptive Multi-Rate audio codec
Efficient coding of speech and audio in a distributed system requires that quantization errors across nodes are uncorrelated. Yet, with conventional methods at low bitrates, quantization levels become increasingly sparse, which does not correspond to the distribution of the input signal and, importantly, also reduces coding efficiency in a distributed system. We have recently proposed a distributed speech and audio codec design, which applies quantization in a randomized domain such that quantization errors are randomly rotated in the output domain. Similar to dithering, this ensures that quantization errors across nodes are uncorrelated and coding efficiency is retained. In this paper, we improve this approach by proposing faster randomization methods, with a computational complexity of O(N log N). The presented experiments demonstrate that the proposed randomizations yield uncorrelated signals, that perceptual quality is competitive, and that the complexity of the proposed methods is feasible for practical applications.
- Published
- 2018
- Full Text
- View/download PDF
40. Effective utilisation of JND for spatial parameters quantisation in 3D multichannel audio.
- Author
-
Gao, Li, Hu, Ruimin, Wang, Xiaochen, Li, Gang, and Yang, Yuhong
- Abstract
The features of just noticeable difference (JND) in human spatial perception are often used to remove the perceptual redundancy of spatial parameters in 3D multichannel audio. However, in previous spatial parameter quantisation schemes, JND data are not effectively utilised, resulting in either perceptual distortion of spatial images or wasted coding bitrate. It is proposed to effectively utilise the azimuthal JND to design the quantisation codebook for spatial azimuth parameters: the full 360° circle is divided into adjacent quantisation intervals, and the endpoints of each interval are set according to the JND of its quantisation value. With the presented scheme, the quantisation codebook size decreases by 13% and 57% compared with MPEG Surround and SLQP, respectively, while the induced perceptual distortion remains negligible. [ABSTRACT FROM AUTHOR]
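As a rough sketch of the stated principle (the endpoints of each quantisation interval lie within one JND of its quantisation value), assuming a placeholder JND curve in place of the measured data used by the authors:

```python
import numpy as np

def jnd(azimuth_deg):
    # Placeholder JND model (degrees): finer in front, coarser to the sides.
    # The measured JND data used in the paper would replace this.
    return 2.0 + 4.0 * abs(np.sin(np.radians(azimuth_deg)))

def build_codebook(start=0.0, full_circle=360.0):
    """Greedily place quantization values so that each interval's endpoints
    stay within one JND of its value, tiling the full circle."""
    codebook, edges = [], [start]
    pos = start
    while pos < start + full_circle:
        value = pos + jnd(pos)        # value roughly one JND above the lower edge
        upper = value + jnd(value)    # upper edge exactly one JND above the value
        codebook.append(value % 360.0)
        edges.append(min(upper, start + full_circle))
        pos = upper
    return np.array(codebook), np.array(edges)

codebook, edges = build_codebook()
print(f"{len(codebook)} azimuth quantization values")
```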
- Published
- 2016
- Full Text
- View/download PDF
41. First-order ambisonic coding with quaternion-based interpolation of PCA rotation matrices
- Author
-
Mahé, Pierre, Ragot, Stephane, Marchand, Sylvain, Orange Labs [Lannion], France Télécom, Laboratoire Informatique, Image et Interaction - EA 2118 (L3I), and Université de La Rochelle (ULR)
- Subjects
PCA ,quaternion ,audio coding ,0202 electrical engineering, electronic engineering, information engineering ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,020207 software engineering ,020201 artificial intelligence & image processing ,ambisonic ,[INFO]Computer Science [cs] ,02 engineering and technology ,[PHYS.MECA.ACOU]Physics [physics]/Mechanics [physics]/Acoustics [physics.class-ph] - Abstract
Conversational applications such as telephony are mostly restricted to mono. With the emergence of VR/XR applications and new products with spatial audio, there is a need to extend traditional voice and audio codecs to enable immersive communication.

The present work is motivated by recent activities in 3GPP standardization around the development of a new codec called immersive voice and audio services (IVAS). The IVAS codec will address a wide variety of use cases, e.g. immersive telephony, spatial audio conferencing, and live content sharing. There are two main design goals for IVAS. One is the versatility of the codec in terms of input (scene-based, channel-based, object-based audio…) and output (mono, stereo, binaural, various multichannel loudspeaker setups). The other is to re-use as much as possible, and extend, the enhanced voice services (EVS) mono codec.

In this work, we focus on the first-order ambisonic (FOA) format, which is a good candidate for the internal representation in an immersive audio codec at low bit rates, due to the flexibility of the underlying sound field decomposition. We propose a new coding method that can extend existing core codecs such as EVS. The proposed method adaptively pre-processes the ambisonic components prior to multi-mono coding by a core codec.

The first part of this work investigates the basic multi-mono coding approach for FOA, which is for instance used in the Opus codec (in the so-called channel mapping family 2). In this approach the ambisonic components are coded separately with different instances of the (mono) core codec. We present the results of a subjective test (MUSHRA), which show that this direct approach is not satisfactory for low-bitrate coding: the signal structure is degraded, which produces many spatial artifacts (e.g. wrong panning, ghost sources...).

In the second part of this work, we propose a new method to exploit the correlation of the ambisonic components. The pre-processing (prior to multi-mono coding) operates in the time domain, to allow maximum compatibility with many codecs, especially low-bitrate codecs such as EVS and Opus, and to minimize extra delay. The proposed method applies Principal Component Analysis (PCA) on a 20 ms frame basis. For each frame, eigenvectors are computed and the eigenvector matrix is treated as a 4D rotation matrix. For complex sound scenes (with many audio sources, sudden changes…), the rotation parameters may change dramatically between consecutive frames and audio sources may move from one principal component to another, which may cause discontinuities or other artifacts. Solutions such as interpolating the eigenvectors (after inter-frame realignment) are not optimal. In the proposed method, smooth transitions between inter-frame PCA rotations are ensured by two complementary mechanisms. The first is a matching algorithm for eigenvectors between the current and the previous frame, which avoids signal inversion and permutation across frames. The second is an interpolation of the 4D rotation matrices in the quaternion domain: we use the Cayley factorization of 4D rotation matrices into a double quaternion for the current and previous frames and apply quaternion spherical linear interpolation (QSLERP) on a subframe basis. The interpolated rotation matrices are then applied to the ambisonic components, and the decorrelated components are coded with a multi-mono approach.

We present the results of a subjective evaluation (MUSHRA) showing that the proposed method brings significant improvements over the naive multi-mono method, especially in terms of spatial quality.
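The Cayley factorization into double quaternions and the QSLERP interpolation are not reproduced here. As a rough illustration of just the inter-frame matching step mentioned above, the sketch below (a greedy absolute-dot-product matching rule, an assumption rather than the authors' exact algorithm) reorders and sign-aligns the current frame's PCA eigenvectors against the previous frame's so that components keep a consistent order and polarity:

```python
import numpy as np

def frame_pca(foa_frame):
    """Eigenvectors of the 4x4 covariance of one FOA frame (W, X, Y, Z)."""
    cov = np.cov(foa_frame)                  # foa_frame: shape (4, n_samples)
    eigvals, eigvecs = np.linalg.eigh(cov)   # returned in ascending order
    order = np.argsort(eigvals)[::-1]        # strongest component first
    return eigvecs[:, order]

def match_eigenvectors(prev, cur):
    """Greedily reorder and sign-flip the columns of `cur` so each one lines up
    with the most similar unused column of `prev`."""
    matched = np.zeros_like(cur)
    used = set()
    for j in range(prev.shape[1]):
        sims = np.abs(prev[:, j] @ cur)      # |dot| with every current column
        for k in used:
            sims[k] = -1.0                   # skip already-assigned columns
        best = int(np.argmax(sims))
        used.add(best)
        sign = 1.0 if prev[:, j] @ cur[:, best] >= 0 else -1.0
        matched[:, j] = sign * cur[:, best]
    return matched

# Toy demo on two consecutive 20 ms frames of random 4-channel audio (48 kHz).
rng = np.random.default_rng(0)
prev_vecs = frame_pca(rng.standard_normal((4, 960)))
cur_vecs = frame_pca(rng.standard_normal((4, 960)))
aligned = match_eigenvectors(prev_vecs, cur_vecs)
```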
- Published
- 2019
- Full Text
- View/download PDF
42. AMBIQUAL - a full reference objective quality metric for ambisonic spatial audio
- Author
-
Andrew Hines, Andrew Allen, Miroslaw Narbutt, Michael Chinen, Jan Skoglund, Google LLC, and AMBIQUAL
- Subjects
Computer science ,Speech recognition ,02 engineering and technology ,MUSHRA ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Raw audio format ,0202 electrical engineering, electronic engineering, information engineering ,Codec ,Quality of experience ,Sound quality ,Digital Communications and Networking ,ambisonics ,opus codec ,Ambisonics ,020206 networking & telecommunications ,audio compression ,audio coding ,Acoustics, Dynamics, and Controls ,Signal Processing ,Metric (mathematics) ,virtual reality ,Dynamic range compression ,0305 other medical science ,spatial audio - Abstract
Streaming spatial audio over networks requires efficient encoding techniques that compress the raw audio content without compromising quality of experience. Streaming service providers such as YouTube need a perceptually relevant objective audio quality metric to monitor users' perceived quality and spatial localization accuracy. In this paper we introduce AMBIQUAL, a full reference objective spatial audio quality metric that assesses both Listening Quality and Localization Accuracy. In our solution both metrics are derived directly from the B-format Ambisonic audio. The metric extends and adapts the algorithm used in ViSQOLAudio, a full reference objective metric designed for assessing speech and audio quality. In particular, Listening Quality is derived from the omnidirectional channel and Localization Accuracy is derived from a weighted sum of similarities over the B-format directional channels. This paper evaluates whether the proposed AMBIQUAL metric can predict these two factors by comparing its predictions with results from MUSHRA subjective listening tests. In particular, we evaluated the Listening Quality and Localization Accuracy of First- and Third-Order Ambisonic audio compressed with the OPUS 1.2 codec at various bitrates (i.e. 32, 128 and 256, 512 kbps, respectively). The sample set for the tests comprised both recorded and synthetic audio clips with a wide range of time-frequency characteristics. To evaluate the Localization Accuracy of compressed audio, a number of fixed and dynamic (moving vertically and horizontally) source positions were selected for the test samples. Results showed a strong correlation between objective quality scores derived from the B-format Ambisonic audio using AMBIQUAL and subjective scores obtained during MUSHRA listening tests (PCC=0.919, Spearman=0.882 for Listening Quality; PCC=0.854, Spearman=0.842 for Localization Accuracy). AMBIQUAL shows very promising quality assessment predictions for spatial audio. Future work will optimise the algorithm to generalise and validate it for any Higher-Order Ambisonic format.
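AMBIQUAL's actual similarity measure is not reproduced here. The sketch below only mirrors the structure described in the abstract, listening quality from the omnidirectional W channel and localization accuracy from a weighted sum over the directional channels, using a placeholder spectral-correlation similarity and assumed weights:

```python
import numpy as np

def channel_similarity(ref, deg):
    """Placeholder per-channel similarity in [0, 1] (normalized correlation of
    magnitude spectra); AMBIQUAL's actual similarity measure differs."""
    r = np.abs(np.fft.rfft(ref))
    d = np.abs(np.fft.rfft(deg))
    return float(np.dot(r, d) / (np.linalg.norm(r) * np.linalg.norm(d) + 1e-12))

def ambisonic_quality(ref_b, deg_b, weights=(0.4, 0.3, 0.3)):
    """Toy structure following the abstract: listening quality from the
    omnidirectional channel (W), localization accuracy from a weighted sum
    over the directional channels (X, Y, Z). The weights are assumptions."""
    listening_quality = channel_similarity(ref_b[0], deg_b[0])
    directional = [channel_similarity(ref_b[i], deg_b[i]) for i in (1, 2, 3)]
    localization_accuracy = float(np.dot(weights, directional))
    return listening_quality, localization_accuracy

rng = np.random.default_rng(0)
ref = rng.standard_normal((4, 48000))               # 1 s of B-format audio (W, X, Y, Z)
deg = ref + 0.05 * rng.standard_normal(ref.shape)   # mildly degraded copy
print(ambisonic_quality(ref, deg))
```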
- Published
- 2018
- Full Text
- View/download PDF
43. GMM-Based Iterative Entropy Coding for Spectral Envelopes of Speech and Audio
- Author
-
Guillaume Fuchs, Srikanth Korse, Tom Bäckström, Fraunhofer Institute for Integrated Circuits, Dept Signal Process and Acoust, Aalto-yliopisto, and Aalto University
- Subjects
ta113 ,Speech Coding ,Audio Coding ,Computer science ,Envelope Modelling ,Speech coding ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Vector quantization ,Data_CODINGANDINFORMATIONTHEORY ,Mixture model ,Spectral envelope ,Bit rate ,Entropy Coding ,Entropy (information theory) ,Probability distribution ,Codec ,Gaussian mixture models ,Entropy encoding ,Algorithm - Abstract
Spectral envelope modelling is a central part of speech and audio codecs and is traditionally based on either vector quantization or scalar quantization followed by entropy coding. To bridge the coding performance of vector quantization with the low complexity of the scalar case, we propose an iterative approach for entropy coding the spectral envelope parameters. For each parameter, a univariate probability distribution is derived from a Gaussian mixture model of the joint distribution, with the previously quantized parameters used as a priori information. Parameters are then iteratively and individually scalar quantized and entropy coded. Unlike vector quantization, the complexity of the proposed method does not increase exponentially with dimension and bitrate. Moreover, the coding resolution and dimension can be modified adaptively without retraining the model. Experimental results show that these important advantages do not impair coding efficiency compared to a state-of-the-art vector quantization scheme.
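As a minimal sketch of the generic operation described above, assuming illustrative mixture parameters rather than the authors' trained model, the following conditions a joint GMM on already-quantized parameters to obtain a univariate mixture for the next parameter; the probability mass of each quantization cell under that mixture is what an entropy coder would consume:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def conditional_mixture(weights, means, covs, observed, obs_idx, target_idx):
    """Condition a joint GMM on already-quantized parameters (indices obs_idx)
    and return the univariate mixture (weights, means, stds) for target_idx."""
    new_w, new_mu, new_sd = [], [], []
    for w, mu, S in zip(weights, means, covs):
        S_oo = S[np.ix_(obs_idx, obs_idx)]
        S_to = S[np.ix_([target_idx], obs_idx)]
        gain = S_to @ np.linalg.inv(S_oo)
        cond_mu = mu[target_idx] + (gain @ (observed - mu[obs_idx]))[0]
        cond_var = S[target_idx, target_idx] - (gain @ S_to.T)[0, 0]
        lik = multivariate_normal(mu[obs_idx], S_oo).pdf(observed)
        new_w.append(w * lik)
        new_mu.append(cond_mu)
        new_sd.append(np.sqrt(cond_var))
    new_w = np.array(new_w) / np.sum(new_w)
    return new_w, np.array(new_mu), np.array(new_sd)

def interval_probability(w, mu, sd, lo, hi):
    """Probability mass of one quantization cell under the 1-D mixture,
    which an entropy coder would use as the symbol probability."""
    return float(np.sum(w * (norm.cdf(hi, mu, sd) - norm.cdf(lo, mu, sd))))

# Toy 3-parameter model with two mixture components (illustrative numbers).
weights = [0.6, 0.4]
means = [np.array([0.0, 1.0, -1.0]), np.array([2.0, -1.0, 0.5])]
covs = [np.eye(3) + 0.3, np.eye(3) * 0.5 + 0.1]
w, mu, sd = conditional_mixture(weights, means, covs,
                                observed=np.array([0.2, 0.8]),
                                obs_idx=[0, 1], target_idx=2)
print(interval_probability(w, mu, sd, lo=-0.25, hi=0.25))
```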
- Published
- 2018
- Full Text
- View/download PDF
44. BPL-PLC Voice Communication System for the Oil and Mining Industry.
- Author
-
Debita, Grzegorz, Falkowski-Gilski, Przemysław, Habrych, Marcin, Wiśniewski, Grzegorz, Miedziński, Bogdan, Jedlikowski, Przemysław, Waniewska, Agnieszka, Wandzio, Jan, and Polnik, Bartosz
- Subjects
TELECOMMUNICATION systems ,DATA packeting ,MINERAL industries ,INTERNET telephony ,PETROLEUM industry ,BROADBAND communication systems ,CARRIER transmission on electric lines ,CONVEYOR belts - Abstract
Application of high-efficiency voice communication systems based on broadband over power line-power line communication (BPL-PLC) technology in medium-voltage networks, including hazardous areas (such as the oil and mining industry), as a redundant means of wired communication (alongside traditional fiber optics and electrical wires) can be beneficial. Because the existing electrical infrastructure can be utilized, deployment costs can be reduced significantly. Additionally, the system can operate under difficult conditions thanks to battery-powered devices. During an emergency (e.g., after a coal dust explosion), the medium-voltage cables are resistant to mechanical damage, providing a potentially life-saving communication link between the supervisor, rescue team, paramedics, and trapped personnel. The assessment of such a system requires a comprehensive and accurate examination covering a number of factors. Therefore, various models were tested, considering different transmission paths, types of coupling (inductive and capacitive), and lengths of transmitted data packets. Next, a subjective quality evaluation study was carried out using speech signals in several languages (English, German, and Polish). Based on the obtained results, including both simulations and measurements, practical conclusions were formulated. The results confirmed the applicability of BPL-PLC technology as an efficient voice communication system for the oil and mining industry. [ABSTRACT FROM AUTHOR]
- Published
- 2020
- Full Text
- View/download PDF
45. Audio Signal Broadcasting
- Author
-
Grego, Vigor and Jambrošić, Kristian
- Subjects
loudness standards ,television broadcasting ,TEHNIČKE ZNANOSTI. Elektrotehnika ,digitalna televizija ,standardi glasnoće ,tehnološki lanac ,digital television ,televizijska radiodifuzija ,audio signal ,technological chain ,audio coding ,TECHNICAL SCIENCES. Electrical Engineering ,MPEG-2 ,Laudato TV ,audio kodiranje - Abstract
This paper describes the technological chain of the audio signal in television broadcasting, with emphasis on digital television. The basic characteristics of the recording, storage, and broadcasting equipment related to the audio signal are presented. Audio coding for digital television is described, with the MPEG-2 format discussed in more detail. An overview of broadcast loudness standards is given, along with the way particular loudness parameters are measured in television broadcasting. The basic acoustic requirements for television studio spaces are described, and a small television house, Laudato TV, is given as an example.
- Published
- 2017
46. Bitrate classification of twice-encoded audio using objective quality features
- Author
-
Damien Kelly, Andrew Hines, Colm Sloan, Anil Kokaram, Naomi Harte, Google, Inc. and Science Foundation Ireland (SFI), and CONNECT research centre
- Subjects
Multimedia ,Electrical and Electronics ,Computer science ,020206 networking & telecommunications ,02 engineering and technology ,Lossy compression ,computer.software_genre ,Variable bitrate ,Metadata ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Upload ,audio coding ,signal classification ,support vector machines ,ViSQOLAudio ,bitrate classification ,full reference objective audio quality metric ,low bitrate codecs ,metadata ,multiclass support vector machine classifier ,music streaming service ,twice-encoded audio ,Bit rate ,Codecs ,Digital audio players ,Encoding ,Mobile communication ,Support vector machines ,Encoding (memory) ,Signal Processing ,Data_FILES ,0202 electrical engineering, electronic engineering, information engineering ,Bandwidth (computing) ,Codec ,Computer Engineering ,Sound quality ,0305 other medical science ,computer - Abstract
When a user uploads audio files to a music streaming service, these files are subsequently re-encoded to lower bitrates to target different devices, e.g. low bitrate for mobile. To save time and bandwidth uploading files, some users encode their original files using a lossy codec. The metadata for these files cannot always be trusted as users might have encoded their files more than once. Determining the lowest bitrate of the files allows the streaming service to skip the process of encoding the files to bitrates higher than that of the uploaded files, saving on processing and storage space. This paper presents a model that uses quality predictions from ViSQOLAudio, a full reference objective audio quality metric, as features in combination with a multi-class support vector machine classifier. An experiment on twice-encoded files found that low bitrate codecs could be classified using audio quality features. The experiment also provides insights into the implications of multiple transcodes from a quality perspective.
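As a minimal sketch of the described pipeline, with made-up feature vectors standing in for the ViSQOLAudio quality predictions, a multi-class SVM can be trained to predict the lowest bitrate in a file's encoding history:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder training data: each row holds quality-metric features for one
# twice-encoded file (e.g. per-band similarity scores), and the label is the
# lowest bitrate in the file's encoding history (kbps).
rng = np.random.default_rng(0)
X_train = rng.random((120, 8))
y_train = rng.choice([64, 96, 128, 192], size=120)

# Multi-class SVM (one-vs-one by default in scikit-learn).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)

X_new = rng.random((3, 8))
print("predicted lowest bitrates:", clf.predict(X_new))
```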
- Published
- 2016
- Full Text
- View/download PDF
47. Audio coding via EMD.
- Author
-
Boudraa, Abdel-Ouahab, Khaldi, Kais, Chonavel, Thierry, Turki Hadj-Alouane, Mounia, and Komaty, Ali
- Subjects
HILBERT-Huang transform, DIGITAL music players, MAXIMA & minima, ACOUSTIC transducers - Abstract
In this paper an audio coding scheme based on the empirical mode decomposition (EMD), in association with a psychoacoustic model, is presented. The principle of the method is to adaptively break the audio signal down into intrinsic oscillatory components, called Intrinsic Mode Functions (IMFs), which are fully described by their local extrema. These extrema are encoded. The coding is carried out frame by frame, and no assumption is made about the signal to be coded. The number of allocated bits varies from mode to mode and obeys the constraint that the coding error remain inaudible. Owing to the symmetry of an IMF, only the extrema (maxima or minima) of one of its interpolating envelopes are perceptually coded. In addition, to deal with rapidly changing audio signals, a stationarity index is used, and when a transient is detected the frame is split into two overlapping sub-frames. At the decoder side, the IMFs are recovered using the associated coded maxima, and the original signal is reconstructed by summing the IMFs. Performance of the proposed coding is analyzed and compared with that of the MP3 and AAC codecs and a wavelet-based coding approach. On the analyzed mono audio signals, the proposed scheme outperforms the MP3 and wavelet-based coding methods and performs slightly better than the AAC codec, showing the potential of the EMD for data-driven audio coding. [ABSTRACT FROM AUTHOR]
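The psychoacoustic bit allocation and the envelope coding details are not reproduced here. The sketch below, a toy example with assumptions of its own (cubic-spline reconstruction, a crude uniform quantizer), only illustrates the underlying premise that an IMF is largely characterized by its local extrema:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def extrema_of(imf):
    """Indices of the local maxima and minima of one IMF."""
    maxima = argrelextrema(imf, np.greater)[0]
    minima = argrelextrema(imf, np.less)[0]
    return np.sort(np.concatenate([maxima, minima]))

def reconstruct_from_extrema(idx, values, length):
    """Approximate an IMF by cubic-spline interpolation through its
    (possibly quantized) extrema; the endpoints are pinned to zero."""
    idx = np.concatenate([[0], idx, [length - 1]])
    values = np.concatenate([[0.0], values, [0.0]])
    return CubicSpline(idx, values)(np.arange(length))

# Toy IMF: an amplitude-modulated oscillation.
n = 2000
t = np.arange(n) / 16000.0
imf = np.sin(2 * np.pi * 440 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
idx = extrema_of(imf)
coarse = np.round(imf[idx] * 64) / 64        # stand-in for perceptual quantization
approx = reconstruct_from_extrema(idx, coarse, n)
print("SNR (dB):", 10 * np.log10(np.sum(imf**2) / np.sum((imf - approx)**2)))
```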
- Published
- 2020
- Full Text
- View/download PDF
48. A quasi-orthogonal, invertible, and perceptually relevant time-frequency transform for audio coding
- Author
-
Olivier Derrien, Thibaud Necciarf, Peter Balazs, Université de Toulon (UTLN), Laboratoire de Mécanique et d'Acoustique [Marseille] (LMA ), Aix Marseille Université (AMU)-École Centrale de Marseille (ECM)-Centre National de la Recherche Scientifique (CNRS), Sons, Acoustics Research Institute (ARI), Austrian Academy of Sciences (OeAW)-Austrian Academy of Sciences (OeAW), ANR-13-IS03-0004,POTION,Optimisation perceptive des représentations audio temps-fréquence et codage(2013), Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-École Centrale de Marseille (ECM), Derrien, Olivier, and Blanc – Accords bilatéraux 2013 - Optimisation perceptive des représentations audio temps-fréquence et codage - - POTION2013 - ANR-13-IS03-0004 - Blanc – Accords bilatéraux 2013 - VALID
- Subjects
Non-stationary time-frequency transforms ,Audio coding ,Speech recognition ,Speech coding ,MDCT ,020206 networking & telecommunications ,02 engineering and technology ,Time frequency transform ,Sub-band coding ,law.invention ,Time–frequency analysis ,030507 speech-language pathology & audiology ,03 medical and health sciences ,Invertible matrix ,law ,0202 electrical engineering, electronic engineering, information engineering ,0305 other medical science ,ERB filters ,Algorithm ,[SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing ,[SPI.SIGNAL] Engineering Sciences [physics]/Signal and Image processing ,Mathematics ,Coding (social sciences) - Abstract
We describe ERB-MDCT, an invertible real-valued time-frequency transform based on the MDCT, which is widely used in audio coding (e.g. MP3 and AAC). ERB-MDCT is designed in the same spirit as ERBLet, a recent invertible transform whose resolution evolves across frequency to match the perceptual ERB frequency scale, whereas the frequency scale of most invertible transforms (e.g. the MDCT) is uniform. ERB-MDCT has mostly the same frequency scale as ERBLet, but its main improvement is that the atoms are quasi-orthogonal, i.e. its redundancy is close to 1. Furthermore, the energy is sparser in the time-frequency plane. It is thus more suitable for audio coding than ERBLet.
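The ERB-MDCT construction itself is not shown here; for reference, the snippet below computes the standard Glasberg and Moore ERB bandwidth and ERB-rate scale that such a perceptually motivated transform is designed to follow (textbook formulas, not code from the paper):

```python
import numpy as np

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth (Hz) at centre frequency f_hz
    (Glasberg & Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_rate(f_hz):
    """ERB-rate scale: number of ERBs below frequency f_hz."""
    return 21.4 * np.log10(4.37 * f_hz / 1000.0 + 1.0)

# Frequency resolution a perceptually motivated transform would aim for:
for f in (100, 500, 1000, 4000, 16000):
    print(f"{f:>6} Hz  ->  ERB {erb_bandwidth(f):7.1f} Hz, "
          f"ERB-rate {erb_rate(f):5.2f}")
```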
- Published
- 2015
- Full Text
- View/download PDF
49. Amélioration de codecs audio standardisés avec maintien de l'interopérabilité
- Author
-
Lefebvre, Roch, Lapierre, Jimmy, Lefebvre, Roch, and Lapierre, Jimmy
- Abstract
Digital audio applications have grown phenomenally over the last decades, in good part because of the establishment of international standards. However, imposing such norms necessarily introduces a certain rigidity that can impede the improvement of technologies already deployed and push towards a proliferation of new standards. This thesis shows that existing codecs can be exploited further by improving their quality or their bitrate, even within the rigid constraints posed by established standards. Three aspects are studied: enhancement at the encoder, at the decoder, and at the bitstream level. In every case, compatibility with the existing elements of the codec is maintained. It is thus shown that the audio signal can be improved at the decoder without transmitting new information, that an encoder can produce an improved signal without modifying its decoder, and that a bitstream can be better optimized for a new application. In particular, this thesis demonstrates that even a standard deployed for decades, such as G.711, has the potential to be significantly improved after the fact; this contribution even served as the core of a new embedded (layered) coding standard that had to preserve this compatibility. The work also shows that the subjective and even objective quality of an AAC (Advanced Audio Coding) decoder can be improved, without any additional information from the encoder, by better exploiting knowledge of the limitations of the coding model. These results open the way to further research on processing that exploits knowledge of the limits of the coding models employed. Finally, this thesis establishes that the fixed-rate bit stream of AMR-WB+ (Extended Adaptive Multi-Rate Wideband) can be compressed further for variable-bitrate applications, demonstrating that it is profitable to adapt a codec to the context in which it is used.
- Published
- 2016
50. Reviews on Technology and Standard of Spatial Audio Coding
- Author
-
Amirul Luthfi and Ikhwana Elfitri
- Subjects
object-based audio ,Audio signal ,Multimedia ,Computer science ,End user ,computer.software_genre ,Interactive audio ,TK1-9971 ,Entertainment ,audio coding ,multi-channel audio signals ,MPEG standard ,Electrical engineering. Electronics. Nuclear engineering ,computer ,spatial audio ,Coding (social sciences) - Abstract
Market demand for more impressive entertainment media has motivated the delivery of three-dimensional (3D) audio content to home consumers through Ultra High Definition TV (UHDTV), the next generation of TV broadcasting, in which spatial audio coding plays a fundamental role. This paper reviews the fundamental concepts of spatial audio coding, including technology, standards, and applications. The basic principle of object-based audio reproduction is also elaborated and compared with the traditional channel-based approach, to provide a good understanding of this popular interactive audio reproduction system, which gives end users the flexibility to render their own preferred audio composition. Keywords: spatial audio, audio coding, multi-channel audio signals, MPEG standard, object-based audio
- Published
- 2017
- Full Text
- View/download PDF