Author: "Chen, Jingdong" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chen, Jingdong"' showing total 2,088 results

1. Try-On-Adapter: A Simple and Flexible Try-On Paradigm

Author: Guo, Hanzhong, Zhang, Jianfeng, Zou, Cheng, Li, Jun, Wang, Meng, Wen, Ruxue, Tang, Pingzhong, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image-based virtual try-on, widely used in online shopping, aims to generate images of a naturally dressed person conditioned on certain garments, providing significant research and commercial potential. A key challenge of try-on is to generate realistic images of the model wearing the garments while preserving the details of the garments. Previous methods focus on masking certain parts of the original model's standing image and then inpainting the masked areas to generate realistic images of the model wearing the corresponding reference garments, thereby treating try-on as an inpainting task. However, such implementations require the user to provide a complete, high-quality standing image, which is user-unfriendly in practical applications. In this paper, we propose Try-On-Adapter (TOA), an outpainting paradigm that differs from the existing inpainting paradigm. Our TOA can preserve the given face and garment, naturally imagine the remaining parts of the image, and provide flexible control with various conditions, e.g., garment properties and human pose. In the experiments, TOA shows excellent performance on the virtual try-on task even given relatively low-quality face and garment images in qualitative comparisons. Additionally, TOA achieves state-of-the-art FID scores of 5.56 and 7.23 for the paired and unpaired settings on the VITON-HD dataset in quantitative comparisons.
Comment: Image virtual try-on, 7 pages, 3 figures
Published: 2024

2. HomoMatcher: Dense Feature Matching Results with Semi-Dense Efficiency by Homography Estimation

Author: Wang, Xiaolong, Yu, Lei, Zhang, Yingying, Lao, Jiangwei, Ru, Lixiang, Zhong, Liheng, Chen, Jingdong, Zhang, Yu, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Feature matching between image pairs is a fundamental problem in computer vision that drives many applications, such as SLAM. Recently, semi-dense matching approaches have achieved substantial performance enhancements and established a widely-accepted coarse-to-fine paradigm. However, the majority of existing methods focus on improving coarse feature representation rather than the fine-matching module. Prior fine-matching techniques, which rely on point-to-patch matching probability expectation or direct regression, often lack precision and do not guarantee the continuity of feature points across sequential images. To address this limitation, this paper concentrates on enhancing the fine-matching module in the semi-dense matching framework. We employ a lightweight and efficient homography estimation network to generate the perspective mapping between patches obtained from coarse matching. This patch-to-patch approach achieves the overall alignment of two patches, resulting in a higher sub-pixel accuracy by incorporating additional constraints. By leveraging the homography estimation between patches, we can achieve a dense matching result with low computational cost. Extensive experiments demonstrate that our method achieves higher accuracy compared to previous semi-dense matchers. Meanwhile, our dense matching results exhibit similar end-point-error accuracy compared to previous dense matchers while maintaining semi-dense efficiency.
Comment: 10 pages, 5 figures, conference under review
Published: 2024
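
Note on entry 2: the patch-to-patch idea rests on the standard fact that a 3x3 homography maps every point of one patch into the other through homogeneous coordinates. The sketch below illustrates only that mapping step in plain Python. The function name, the 8x8 patch size, and the hand-picked matrix (standing in for a network's prediction) are illustrative assumptions, not details taken from the paper.

```python
def apply_homography(H, points):
    """Map 2D points through a 3x3 homography H (row-major nested lists)."""
    mapped = []
    for x, y in points:
        # Lift to homogeneous coordinates (x, y, 1) and multiply by H.
        u = H[0][0] * x + H[0][1] * y + H[0][2]
        v = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if w == 0:
            raise ValueError("point maps to infinity under this homography")
        # Divide by w to return to pixel coordinates.
        mapped.append((u / w, v / w))
    return mapped


# Hypothetical homography: a pure sub-pixel translation of (0.25, -0.5).
H_shift = [[1.0, 0.0, 0.25],
           [0.0, 1.0, -0.5],
           [0.0, 0.0, 1.0]]

# Assumed 8x8 patch: every pixel receives a correspondence from one matrix,
# which is why a single patch-level estimate can yield dense matches.
patch_pixels = [(float(x), float(y)) for y in range(8) for x in range(8)]
dense_matches = apply_homography(H_shift, patch_pixels)
```

Because a homography is defined only up to scale, multiplying all nine entries by the same nonzero constant leaves the mapped points unchanged.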

3. LumiSculpt: A Consistency Lighting Control Network for Video Generation

Author: Zhang, Yuxin, Zheng, Dandan, Gong, Biao, Chen, Jingdong, Yang, Ming, Dong, Weiming, and Xu, Changsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Lighting plays a pivotal role in ensuring the naturalness of video generation, significantly influencing the aesthetic quality of the generated content. However, due to the deep coupling between lighting and the temporal features of videos, it remains challenging to disentangle and model independent and coherent lighting attributes, limiting the ability to control lighting in video generation. In this paper, inspired by the established controllable T2I models, we propose LumiSculpt, which, for the first time, enables precise and consistent lighting control in T2V generation models. LumiSculpt equips the video generation with strong interactive capabilities, allowing the input of custom lighting reference image sequences. Furthermore, the core learnable plug-and-play module of LumiSculpt facilitates remarkable control over lighting intensity, position, and trajectory in latent video diffusion models based on the advanced DiT backbone. Additionally, to effectively train LumiSculpt and address the issue of insufficient lighting data, we construct LumiHuman, a new lightweight and flexible dataset for portrait lighting of images and videos. Experimental results demonstrate that LumiSculpt achieves precise and high-quality lighting control in video generation.
Published: 2024

4. Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Author: Tan, Shuai, Gong, Biao, Wang, Xiang, Zhang, Shiwei, Zheng, Dandan, Zheng, Ruobing, Zheng, Kecheng, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods only apply to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Our in-depth analysis attributes this limitation to their insufficient modeling of motion, which fails to comprehend the movement pattern of the driving video and thus imposes a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X, a universal animation framework based on LDM for various character types (collectively named X), including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both implicit and explicit manners. The former leverages CLIP visual features of a driving video to extract its gist of motion, like the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of LDM by simulating possible inputs in advance that may arise during inference. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A^2Bench) to evaluate the performance of Animate-X on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X compared to state-of-the-art methods.
Comment: 25 pages, 15 figures, conference
Published: 2024

5. StyleTokenizer: Defining Image Style by a Single Instance for Controlling Diffusion Models

Author: Li, Wen, Fang, Muyuan, Zou, Cheng, Gong, Biao, Zheng, Ruobing, Wang, Meng, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the burst of innovative methods for controlling the diffusion process, effectively controlling image styles in text-to-image generation remains a challenging task. Many adapter-based methods impose image representation conditions on the denoising process to accomplish image control. However, these conditions are not aligned with the word embedding space, leading to interference between image and text control conditions and the potential loss of semantic information from the text prompt. Addressing this issue involves two key challenges: first, how to inject the style representation without compromising the effectiveness of text representation in control; second, how to obtain an accurate style representation from a single reference image. To tackle these challenges, we introduce StyleTokenizer, a zero-shot style control image generation method that aligns style representation with text representation using a style tokenizer. This alignment effectively minimizes the impact on the effectiveness of text prompts. Furthermore, we collect a well-labeled style dataset named Style30k to train a style feature extractor capable of accurately representing style while excluding other content information. Experimental results demonstrate that our method fully grasps the style characteristics of the reference image, generating appealing images that are consistent with both the target image style and text prompt. The code and dataset are available at https://github.com/alipay/style-tokenizer.
Comment: Accepted by ECCV2024
Published: 2024

6. POA: Pre-training Once for Models of All Sizes

Author: Zhang, Yingying, Guo, Xin, Lao, Jiangwei, Yu, Lei, Ru, Lixiang, Wang, Jian, Ye, Guo, He, Huimei, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition, 68T07
Abstract: Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation, and assessments on multiple downstream tasks, demonstrate the effectiveness and advantages of our POA. It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones, producing around a hundred models with different sizes through a single pre-training session. The code is available at: https://github.com/Qichuzyy/POA.
Comment: Accepted by ECCV2024
Published: 2024
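
Note on entry 6: the abstract says that a sub-network is randomly sampled from the student at each pre-training step. The sketch below shows only what such per-step sampling could look like. The search space (width and depth choices) and all names are hypothetical, since the abstract does not specify them, and no training is performed here.

```python
import random

# Hypothetical search space; the abstract does not give the real one.
FULL_WIDTH = 768
FULL_DEPTH = 12
WIDTH_CHOICES = [384, 512, 640, 768]
DEPTH_CHOICES = [6, 8, 10, 12]


def sample_elastic_config(rng):
    """Pick one sub-network configuration for the current training step."""
    return {
        "width": rng.choice(WIDTH_CHOICES),
        "depth": rng.choice(DEPTH_CHOICES),
    }


rng = random.Random(0)  # fixed seed so the run is repeatable
configs = [sample_elastic_config(rng) for _ in range(1000)]

# Distinct (width, depth) pairs visited over 1000 simulated steps.
distinct = {(c["width"], c["depth"]) for c in configs}
```

With 4 widths and 4 depths there are at most 16 distinct sub-networks in this toy space. A larger space, which the abstract's "around a hundred models" suggests, would need more choices per dimension or additional dimensions.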

7. Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Author: Huang, Ziyuan, Ji, Kaixiang, Gong, Biao, Qing, Zhiwu, Zhang, Qinglong, Zheng, Kecheng, Wang, Jian, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches in a series of benchmarks.
Published: 2024

8. ViTime: A Visual Intelligence-Based Foundation Model for Time Series Forecasting

Author: Yang, Luoxiao, Wang, Yun, Fan, Xinqi, Cohen, Israel, Chen, Jingdong, Zhao, Yue, and Zhang, Zijun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: The success of large pretrained models in natural language processing (NLP) and computer vision (CV) has opened new avenues for constructing foundation models for time series forecasting (TSF). Traditional TSF foundation models rely heavily on numerical data fitting. In contrast, the human brain is inherently skilled at processing visual information and prefers to predict future trends by observing visualized sequences. From a biomimetic perspective, utilizing models to directly process numerical sequences might not be the most effective route to achieving Artificial General Intelligence (AGI). This paper proposes ViTime, a novel Visual Intelligence-based foundation model for TSF. ViTime overcomes the limitations of numerical time series data fitting by utilizing visual data processing paradigms and employs an innovative data synthesis method during training, called Real Time Series (RealTS). Experiments on a diverse set of previously unseen forecasting datasets demonstrate that ViTime achieves state-of-the-art zero-shot performance, even surpassing the best individually trained supervised models in some situations. These findings suggest that visual intelligence can significantly enhance time series analysis and forecasting, paving the way for more advanced and versatile models in the field. The code for our framework is accessible at https://github.com/IkeYang/ViTime.
Published: 2024

9. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

Author: Luo, Junwei, Pang, Zhen, Zhang, Yongjun, Wang, Tingzhu, Wang, Linlin, Dang, Bo, Lao, Jiangwei, Wang, Jian, Chen, Jingdong, Tan, Yihua, and Li, Yansheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs' complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at https://github.com/Luo-Z13/SkySenseGPT.
Comment: 30 pages, 5 figures, 19 tables, dataset and code see https://github.com/Luo-Z13/SkySenseGPT
Published: 2024

10. Low algorithmic delay implementation of convolutional beamformer for online joint source separation and dereverberation

Author: Mo, Kaien, Wang, Xianrui, Yang, Yichen, Makino, Shoji, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Blind-audio-source-separation (BASS) techniques, particularly those with low latency, play an important role in a wide range of real-time systems, e.g., hearing aids, in-car hands-free voice communication, real-time human-machine interaction, etc. Most existing BASS algorithms are derived to run in batch mode, and therefore large latency is unavoidable. Recently, some online algorithms were developed, which achieve separation on a frame-by-frame basis in the short-time-Fourier-transform (STFT) domain, and their latency is significantly reduced compared to that of batch methods. However, the latency of these algorithms may still be too long for many real-time systems to bear. To further reduce latency while achieving good separation performance, we propose in this work to integrate a weighted prediction error (WPE) module into a non-causal sample-truncating-based independent vector analysis (NST-IVA). The resulting algorithm can maintain the same algorithmic delay as NST-IVA if the delay with WPE is appropriately controlled, while achieving significantly better performance, which is validated by simulations.
Comment: 4 pages, 4 figures. Accepted by EUSIPCO 2024
Published: 2024

11. Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation

Author: Zhang, Yingying, Shi, Chuangji, Guo, Xin, Lao, Jiangwei, Wang, Jian, Wang, Jiaotuan, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The design of the query is crucial for the performance of DETR and its variants. Each query consists of two components: a content part and a positional one. Traditionally, the content query is initialized with a zero or learnable embedding, lacking essential content information and resulting in sub-optimal performance. In this paper, we introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ), to address this limitation. The SACQ module utilizes features from the transformer encoder to generate content queries via self-attention pooling. This allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects. However, this improved concentration poses a challenge for the training process that utilizes Hungarian matching, which selects only a single candidate and suppresses other similar ones. To overcome this, we propose a query aggregation strategy to cooperate with SACQ. It merges similar predicted candidates from different queries, easing the optimization. Our extensive experiments on the COCO dataset demonstrate the effectiveness of our proposed approaches across six different DETR variants with multiple configurations, achieving an average improvement of over 1.0 AP.
Comment: 11 pages, 7 figures
Published: 2024

12. Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Author: Zhang, Zicheng, Zheng, Ruobing, Liu, Ziwen, Han, Congying, Li, Tianqi, Wang, Meng, Guo, Tiande, Chen, Jingdong, Li, Bonan, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent works in implicit representations, such as Neural Radiance Fields (NeRF), have advanced the generation of realistic and animatable head avatars from video sequences. These implicit methods are still confronted by visual artifacts and jitters, since the lack of explicit geometric constraints poses a fundamental challenge in accurately modeling complex facial deformations. In this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid representation that encodes explicit dynamic meshes by neural networks to ensure geometric consistency across various motions and viewpoints. DynTet is parameterized by coordinate-based networks that learn signed distance, deformation, and material texture, anchoring the training data into a predefined tetrahedra grid. Leveraging Marching Tetrahedra, DynTet efficiently decodes textured meshes with a consistent topology, enabling fast rendering through a differentiable rasterizer and supervision via a pixel loss. To enhance training efficiency, we incorporate classical 3D Morphable Models to facilitate geometry learning and define a canonical space for simplifying texture learning. These advantages are readily achievable owing to the effective geometric representation employed in DynTet. Compared with prior works, DynTet demonstrates significant improvements in fidelity, lip synchronization, and real-time performance according to various metrics. Beyond producing stable and visually appealing synthesis videos, our method also outputs the dynamic meshes, which are promising for enabling many emerging applications.
Comment: CVPR 2024
Published: 2024

13. M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Author: Guo, Qingpei, Xu, Furong, Zhang, Hanxiao, Ren, Wang, Ma, Ziping, Ju, Lin, Wang, Jian, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which reduces the communication overhead and GPU memory demands significantly, facilitating a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with an enhanced fine-grained understanding ability on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model has achieved top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
Published: 2024

14. Independent low-rank matrix analysis based on the Sinkhorn divergence source model for blind source separation

Author: Wang, Jianyu, Guan, Shanzheng, Chen, Jingdong, and Benesty, Jacob
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The so-called independent low-rank matrix analysis (ILRMA) has demonstrated great potential for dealing with the problem of determined blind source separation (BSS) for audio and speech signals. This method assumes that the spectra from different frequency bands are independent and the spectral coefficients in any frequency band are Gaussian distributed. The Itakura-Saito divergence is then employed to estimate the source model related parameters. In reality, however, the spectral coefficients from different frequency bands may be dependent, which is not considered in the existing ILRMA algorithm. This paper presents an improved version of ILRMA, which considers the dependency between the spectral coefficients from different frequency bands. The Sinkhorn divergence is then exploited to optimize the source model parameters. As a result of using the cross-band information, the BSS performance is improved. However, the number of parameters to be estimated also increases significantly, and so does the computational complexity. To reduce the algorithm complexity, we apply the Kronecker product to decompose the modeling matrix into the product of a number of matrices of much smaller dimensionality. An efficient algorithm is then developed to implement the Sinkhorn divergence based BSS algorithm and the complexity is reduced by an order of magnitude.
Published: 2024
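
Note on entry 14: the complexity argument relies on a general property of the Kronecker product, namely that a large matrix written as a product of small factors needs far fewer free parameters. The sketch below illustrates that property only. The matrix sizes and function names are illustrative assumptions and do not reproduce the paper's modeling matrix or algorithm.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    rows_b, cols_b = len(B), len(B[0])
    return [
        [A[i // rows_b][j // cols_b] * B[i % rows_b][j % cols_b]
         for j in range(len(A[0]) * cols_b)]
        for i in range(len(A) * rows_b)
    ]


def param_counts(factor_shapes):
    """Return (dense parameter count, factored parameter count)."""
    total_rows = total_cols = 1
    factored = 0
    for rows, cols in factor_shapes:
        total_rows *= rows
        total_cols *= cols
        factored += rows * cols
    return total_rows * total_cols, factored


# Small worked example: two 2x2 factors produce one 4x4 matrix.
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
K = kron(A, B)

# Assumed sizes: a 64x64 matrix modeled as the product of two 8x8 factors.
dense, factored = param_counts([(8, 8), (8, 8)])
```

In this assumed setting the dense matrix has 4096 entries while the two factors hold 128 in total, a 32-fold reduction in parameters to estimate.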

15. SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Author: Guo, Xin, Lao, Jiangwei, Dang, Bo, Zhang, Yingying, Yu, Lei, Ru, Lixiang, Zhong, Liheng, Huang, Ziyuan, Wu, Kang, Hu, Dingxiang, He, Huimei, Wang, Jian, Chen, Jingdong, Yang, Ming, Zhang, Yongjun, and Li, Yansheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
Comment: Accepted by CVPR2024
Published: 2023

16. A computationally efficient semi-blind source separation based approach for nonlinear echo cancellation based on an element-wise iterative source steering

Author: Lu, Kunxing, Wang, Xianrui, Ueda, Tetsuya, Makino, Shoji, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: While the semi-blind source separation-based acoustic echo cancellation (SBSS-AEC) has received much research attention due to its promising performance during double-talk compared to the traditional adaptive algorithms, it suffers from system latency and nonlinear distortions. To circumvent these drawbacks, the recently developed ideas on convolutive transfer function (CTF) approximation and nonlinear expansion have been used in the iterative projection (IP)-based semi-blind source separation (SBSS) algorithm. However, because of the introduction of CTF approximation and nonlinear expansion, this algorithm becomes computationally very expensive, which makes it difficult to implement in embedded systems. Thus, we attempt in this paper to improve this IP-based algorithm, thereby developing an element-wise iterative source steering (EISS) algorithm. In comparison with the IP-based SBSS algorithm, the proposed algorithm is computationally much more efficient, especially when the nonlinear expansion order is high and the length of the CTF filter is long. Meanwhile, its AEC performance is as good as that of IP-based SBSS.
Published: 2023

17. Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup

Author: Wang, Maolin, Zhao, Yao, Liu, Jiajia, Chen, Jingdong, Zhuang, Chenyi, Gu, Jinjie, Guo, Ruocheng, and Zhao, Xiangyu
Subjects: Computer Science - Artificial Intelligence
Abstract: The deployment of Large Multimodal Models (LMMs) within AntGroup has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, the deployment of such sizable models introduces challenges, particularly in increased latency and carbon emissions, which are antithetical to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LLM, AntGMM. Our methodology pivots on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months from September 2023. Notably, our approach achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight performance decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to green AI initiatives. We will publicly release our code and the MAAD dataset after some reviews (https://github.com/MorinW/AntGMM_Pruning).
Published: 2023
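
Note on entry 17: the reported latency figures (700ms before compression, 90ms after) can be restated as a speedup factor and a percentage reduction. The short calculation below uses only those two numbers from the abstract. The function name is a hypothetical helper, not part of the released code.

```python
def latency_summary(before_ms, after_ms):
    """Return (speedup factor, percentage reduction) for two latencies."""
    speedup = before_ms / after_ms
    reduction_pct = 100.0 * (before_ms - after_ms) / before_ms
    return speedup, reduction_pct


# Figures quoted in the abstract: 700 ms before, 90 ms after.
speedup, reduction_pct = latency_summary(700, 90)
print(f"speedup: {speedup:.2f}x, reduction: {reduction_pct:.1f}%")
```

Under these figures the compressed model responds roughly 7.8 times faster, which corresponds to a latency reduction of about 87%.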

18. POA: Pre-training Once for Models of All Sizes

Author: Zhang, Yingying, Guo, Xin, Lao, Jiangwei, Yu, Lei, Ru, Lixiang, Wang, Jian, Ye, Guo, He, Huimei, Chen, Jingdong, and Yang, Ming
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Members: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2025

19. Training Object Detectors from Scratch: An Empirical Study in the Era of Vision Transformer

Author: Hong, Weixiang, Ren, Wang, Lao, Jiangwei, Xie, Lele, Zhong, Liheng, Wang, Jian, Chen, Jingdong, Liu, Honghai, and Chu, Wei
Published: 2024

20. LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

Author: Xu, Weidi, Wang, Jingwei, Xie, Lele, He, Jianshan, Zhou, Hongting, Wang, Taifeng, Wan, Xiaopei, Chen, Jingdong, Qu, Chao, and Chu, Wei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Symbolic Computation
Abstract: Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, whose layers perform mean-field variational inference over an MLN. It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency. By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations effectively mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations. Empirical results in three kinds of tasks over graphs, images, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.
Comment: 28 pages, 14 figures, 12 tables
Published: 2023

21. The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Author: Wu, Shilong, Wang, Chenxi, Chen, Hang, Dai, Yusheng, Zhang, Chenyue, Wang, Ruoyu, Lan, Hongbo, Du, Jun, Lee, Chin-Hui, Chen, Jingdong, Watanabe, Shinji, Siniscalchi, Sabato Marco, Scharenborg, Odette, Wang, Zhong-Qiu, Pan, Jia, and Gao, Jianqing
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward., Comment: 5 pages, 4 figures
Published: 2023

22. Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match vs. Mismatch Classification

Author: Yang, Yiqian, Zhao, Zhengqiao, Wang, Qian, Yang, Yan, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computational Engineering, Finance, and Science
Abstract: Existing approaches to modeling associations between visual stimuli and brain responses are facing difficulties in handling between-subject variance and model generalization. Inspired by the recent progress in modeling speech-brain response, we propose in this work a "match-vs-mismatch" deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals and learn associations between the visual content and corresponding neural recordings. Using an exclusive experimental dataset, we demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects as compared to other baseline models. Furthermore, we analyze the inter-subject noise using a subject-level silhouette score in the embedding space and show that the developed model is able to mitigate inter-subject noise and significantly reduce the silhouette score. Moreover, we examine the Grad-CAM activation score and show that the brain regions associated with language processing contribute most to the model predictions, followed by regions associated with visual processing. These results have the potential to facilitate the development of neural recording-based video reconstruction and its related applications.
Published: 2023

23. An Anchor-Point Based Image-Model for Room Impulse Response Simulation with Directional Source Radiation and Sensor Directivity Patterns

Author: Pan, Chao, Zhang, Lei, Lu, Yilong, Jin, Jilu, Qiu, Lin, Chen, Jingdong, and Benesty, Jacob
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The image model method has been widely used to simulate room impulse responses and the endeavor to adapt this method to different applications has also piqued great interest over the last few decades. This paper attempts to extend the image model method and develops an anchor-point-image-model (APIM) approach as a solution for simulating impulse responses by including both the source radiation and sensor directivity patterns. To determine the orientations of all the virtual sources, anchor points are introduced to real sources, which subsequently lead to the determination of the orientations of the virtual sources. An algorithm is developed to generate room impulse responses with APIM by taking into account the directional pattern functions, fractional time delays, as well as the computational complexity. The developed model and algorithms can be used in various acoustic problems to simulate room acoustics and improve and evaluate processing algorithms., Comment: 19 pages, 8 figures
Published: 2023

24. Distortionless Beamforming

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

25. Adaptive Noise Cancellation

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

26. Binaural Beamforming

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

27. Low-Rank Beamforming

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

28. Principal Component Analysis in Noise Reduction and Beamforming

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

29. Fundamentals of Microphone Array Processing

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

30. Large Array Beamforming

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

31. Introduction

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

32. Limitations of Single Microphone Processing

Author: Benesty, Jacob, Huang, Gongping, Chen, Jingdong, Pan, Ningning, Benesty, Jacob, Series Editor, Kellermann, Walter, Series Editor, Huang, Gongping, Chen, Jingdong, and Pan, Ningning
Published: 2024
Full Text: View/download PDF

33. Identification and expression analysis of the Xyloglucan transglycosylase/hydrolase (XTH) gene family under abiotic stress in oilseed (Brassica napus L.)

Author: Chen, Jingdong, Wan, Heping, Zhao, Huixia, Dai, Xigang, Wu, Wanjin, Liu, Jin, Xu, Jinsong, Yang, Rui, Xu, Benbo, Zeng, Changli, and Zhang, Xuekun
Published: 2024
Full Text: View/download PDF

34. The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Author: Wang, Zhe, Wu, Shilong, Chen, Hang, He, Mao-Kui, Du, Jun, Lee, Chin-Hui, Chen, Jingdong, Watanabe, Shinji, Siniscalchi, Sabato, Scharenborg, Odette, Liu, Diyuan, Yin, Baocai, Pan, Jia, Gao, Jianqing, and Liu, Cong
Subjects: Computer Science - Multimedia
Abstract: The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" with audio-visual speaker diarization results. Both tracks focus on the Chinese language, and use far-field audio and video in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers., Comment: 5 pages, 4 figures, to be published in ICASSP2023
Published: 2023

35. Dynamic control of the directional scattering of single Mie particle by laser induced metal insulator transitions

Author: Zhu Yanlin, Li Shulei, Zhang Yang, Meng Jinjing, Tan Xu, Chen Jingdong, Panmai Mingcheng, and Xiang Jin
Subjects: vanadium dioxide, mie resonance, insulator-metal transition, all-optical modulator, Physics, QC1-999
Abstract: Interference between the electric and magnetic dipole-induced in Mie nanostructures has been widely demonstrated to tailor the scattering field, which was commonly used in optical nano-antennas, filters, and routers. The dynamic control of scattering fields based on dielectric nanostructures is interesting for fundamental research and important for practical applications. Here, it is shown theoretically that the amplitude of the electric and magnetic dipoles induced in a vanadium dioxide nanosphere can be manipulated by using laser-induced metal-insulator transitions, and it is experimentally demonstrated that the directional scattering can be controlled by simply varying the irradiances of the excitation laser. As a straightforward application, we demonstrate a high-performance optical modulator in the visible band with high modulation depth, fast modulation speed, and high reproducibility arising from a backscattering setup with the quasi-first Kerker condition. Our method indicates the potential applications in developing nanoscale optical antennas and optical modulation devices.
Published: 2024
Full Text: View/download PDF

36. The Influence of Brand Culture on Consumer Purchasing Behavior Intention

Author: Tan, Tiantian, Chen, JingDong, Chen, Mo, Xhafa, Fatos, Series Editor, Xu, Jiuping, editor, Binti Ismail, Noor Azina, editor, Dabo-Niang, Sophie, editor, Ali Hassan, Mohamed Hag, editor, and Hajiyev, Asaf, editor
Published: 2024
Full Text: View/download PDF

37. Crossing the Digital Divide in Older People: Analysis of Influencing Factors on the Willingness of the Elderly to Use Mobile Payment

Author: Hou, JiaLe, Chen, Jingdong, Xhafa, Fatos, Series Editor, Xu, Jiuping, editor, Binti Ismail, Noor Azina, editor, Dabo-Niang, Sophie, editor, Ali Hassan, Mohamed Hag, editor, and Hajiyev, Asaf, editor
Published: 2024
Full Text: View/download PDF

38. Robust Manifold Nonnegative Tucker Factorization for Tensor Data Representation

Author: Wang, Jianyu, Tang, Linruize, Chen, Jie, and Chen, Jingdong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Nonnegative Tucker Factorization (NTF) minimizes the Euclidean distance or Kullback-Leibler divergence between the original data and its low-rank approximation, which often suffers from gross corruptions or outliers and from the neglect of manifold structures of data. In particular, NTF suffers from rotational ambiguity: solutions with and without rotation transformations are equal in the sense of yielding the maximum likelihood. In this paper, we propose three Robust Manifold NTF algorithms to handle outliers by incorporating structural knowledge about the outliers. They first apply a half-quadratic optimization algorithm to transform the problem into a general weighted NTF where the weights are influenced by the outliers. Then, we introduce the correntropy induced metric, Huber function and Cauchy function for weights respectively, to handle the outliers. Finally, we introduce a manifold regularization to overcome the rotational ambiguity of NTF. We have compared the proposed method with a number of representative references covering major branches of NTF on a variety of real-world image databases. Experimental results illustrate the effectiveness of the proposed method under two evaluation metrics (accuracy and NMI).
Published: 2022

39. Experimental study for effects of tube spacing and tube material on falling film flow transition mode between tubes

Author: Chen, Jingdong, Liu, Xia, Yan, Longfei, and Yang, Guobin
Published: 2024
Full Text: View/download PDF

40. Microphone Arrays

Author: Benesty, Jacob, primary, Huang, Gongping, additional, Chen, Jingdong, additional, and Pan, Ningning, additional
Published: 2024
Full Text: View/download PDF

41. SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization

Author: Luo, Canjie, Jin, Lianwen, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently self-supervised representation learning has drawn considerable attention from the scene text recognition community. Different from previous studies using contrastive learning, we tackle the issue from an alternative perspective, i.e., by formulating the representation learning scheme in a generative manner. Typically, the neighboring image patches among one text line tend to have similar styles, including the strokes, textures, colors, etc. Motivated by this common sense, we augment one image patch and use its neighboring patch as guidance to recover itself. Specifically, we propose a Similarity-Aware Normalization (SimAN) module to identify the different patterns and align the corresponding styles from the guiding patch. In this way, the network gains representation capability for distinguishing complex patterns such as messy strokes and cluttered backgrounds. Experiments show that the proposed SimAN significantly improves the representation quality and achieves promising performance. Moreover, we surprisingly find that our self-supervised generative network has impressive potential for data synthesis, text image editing, and font interpolation, which suggests that the proposed SimAN has a wide range of practical applications., Comment: Accepted to appear in CVPR 2022
Published: 2022

42. Hierarchical Memory Learning for Fine-Grained Scene Graph Generation

Author: Deng, Youming, Li, Yansheng, Zhang, Yongjun, Xiang, Xiang, Wang, Jian, Chen, Jingdong, and Ma, Jiayi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In Scene Graph Generation (SGG), coarse and fine predicates mix in the dataset due to the crowd-sourced labeling, and the long-tail problem is also pronounced. Given this tricky situation, many existing SGG methods treat the predicates equally and learn the model under the supervision of mixed-granularity predicates in one stage, leading to relatively coarse predictions. In order to alleviate the negative impact of the suboptimal mixed-granularity annotation and long-tail effect problems, this paper proposes a novel Hierarchical Memory Learning (HML) framework to learn the model from simple to complex, which is similar to the human beings' hierarchical memory learning process. After the autonomous partition of coarse and fine predicates, the model is first trained on the coarse predicates and then learns the fine predicates. In order to realize this hierarchical learning pattern, this paper, for the first time, formulates the HML framework using the new Concept Reconstruction (CR) and Model Reconstruction (MR) constraints. It is worth noticing that the HML framework can be taken as one general optimization strategy to improve various SGG models, and significant improvement can be achieved on the SGG benchmark (i.e., Visual Genome)., Comment: ECCV 2022
Published: 2022
Full Text: View/download PDF

43. Training Protocol Matters: Towards Accurate Scene Text Recognition via Training Protocol Searching

Author: Chu, Xiaojie, Wang, Yongtao, Shen, Chunhua, Chen, Jingdong, and Chu, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The development of scene text recognition (STR) in the era of deep learning has been mainly focused on novel architectures of STR models. However, training protocol (i.e., settings of the hyper-parameters involved in the training of STR models), which plays an equally important role in successfully training a good STR model, is under-explored for scene text recognition. In this work, we attempt to improve the accuracy of existing STR models by searching for optimal training protocol. Specifically, we develop a training protocol search algorithm, based on a newly designed search space and an efficient search algorithm using evolutionary optimization and proxy tasks. Experimental results show that our searched training protocol can improve the recognition accuracy of mainstream STR models by 2.7%~3.9%. In particular, with the searched training protocol, TRBA-Net achieves 2.1% higher accuracy than the state-of-the-art STR model (i.e., EFIFSTR), while the inference speed is 2.3x and 3.7x faster on CPU and GPU respectively. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method and the generalization ability of the training protocol found by our search method. Code is available at https://github.com/VDIGPKU/STR_TPSearch.
Published: 2022

44. CBNet: A Composite Backbone Network Architecture for Object Detection

Author: Liang, Tingting, Chu, Xiaojie, Liu, Yudong, Wang, Yongtao, Tang, Zhi, Chu, Wei, Chen, Jingdong, and Ling, Haibin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-sourced pre-trained backbones under the pre-training fine-tuning paradigm. In particular, CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to more efficiently perform object detection. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free-based). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 introduces a more efficient, effective, and resource-friendly way to build high-performance backbone networks. Particularly, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6×. With multi-scale testing, we push the current best single model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2., Comment: IEEE Transactions on Image Processing (TIP) camera ready
Published: 2021
Full Text: View/download PDF

45. MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

Author: Tang, Guozhi, Xie, Lele, Jin, Lianwen, Wang, Jiapeng, Chen, Jingdong, Xu, Zhen, Wang, Qianying, Wu, Yaqiang, and Li, Hui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling problem or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features, such as font, color, layout. But simply introducing multimodal features couldn't work well when faced with numeric semantic categories or some ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognitions to various semantics, and simply focuses on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values and it is a good complement to the existing methods., Comment: accepted by IJCAI 2021
Published: 2021

46. On intrusive speech quality measures and a global SNR based metric

Author: Pan, Chao, Chen, Jingdong, and Benesty, Jacob
Published: 2024
Full Text: View/download PDF

47. Genome-wide identification and expression analysis of the chlorophyll a/b binding protein gene family in oilseed (Brassica napus L.) under salt stress conditions

Author: Xue, Tianyuan, Wan, Heping, Chen, Jingdong, He, Sixiao, Lujin, Chunzi, Xia, Mang, Wang, Shanshan, Dai, Xigang, and Zeng, Changli
Published: 2024
Full Text: View/download PDF

48. CMUA-Watermark: A Cross-Model Universal Adversarial Watermark for Combating Deepfakes

Author: Huang, Hao, Wang, Yongtao, Chen, Zhaoyu, Zhang, Yuze, Li, Yuheng, Tang, Zhi, Chu, Wei, Chen, Jingdong, Lin, Weisi, and Ma, Kai-Kuang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Malicious applications of deepfakes (i.e., technologies generating target facial attributes or entire faces from facial images) have posed a huge threat to individuals' reputation and security. To mitigate these threats, recent studies have proposed adversarial watermarks to combat deepfake models, leading them to generate distorted outputs. Despite achieving impressive results, these adversarial watermarks have low image-level and model-level transferability, meaning that they can protect only one facial image from one specific deepfake model. To address these issues, we propose a novel solution that can generate a Cross-Model Universal Adversarial Watermark (CMUA-Watermark), protecting a large number of facial images from multiple deepfake models. Specifically, we begin by proposing a cross-model universal attack pipeline that attacks multiple deepfake models iteratively. Then, we design a two-level perturbation fusion strategy to alleviate the conflict between the adversarial watermarks generated by different facial images and models. Moreover, we address the key problem in cross-model optimization with a heuristic approach to automatically find the suitable attack step sizes for different models, further weakening the model-level conflict. Finally, we introduce a more reasonable and comprehensive evaluation method to fully test the proposed method and compare it with existing ones. Extensive experimental results demonstrate that the proposed CMUA-Watermark can effectively distort the fake facial images generated by multiple deepfake models while achieving a better performance than existing methods., Comment: 9 pages, 7 figures, Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI22
Published: 2021

49. AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Author: Fu, Yihui, Cheng, Luyao, Lv, Shubo, Jv, Yukai, Kong, Yuxiang, Chen, Zhuo, Hu, Yanxin, Xie, Lei, Wu, Jian, Bu, Hui, Xu, Xin, Du, Jun, and Chen, Jingdong
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by 8-channel circular microphone array for speech processing in conference scenario. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given most open source dataset for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in speech community. We also release a PyTorch-based training and evaluation framework as baseline system to promote reproducible research in this field., Comment: Accepted by Interspeech 2021
Published: 2021

50. Large Array Beamforming

Author: Benesty, Jacob, primary, Huang, Gongping, additional, Chen, Jingdong, additional, and Pan, Ningning, additional
Published: 2023
Full Text: View/download PDF
