Author: "Zhang, Baochang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Baochang"' showing total 2,277 results

1. P4Q: Learning to Prompt for Quantization in Visual-language Models

Author: Sun, Huixin, Wang, Runqi, Li, Yanjing, Cao, Xianbin, Jiang, Xiaolong, Hu, Yao, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Large-scale pre-trained Vision-Language Models (VLMs) have gained prominence in various visual and multimodal tasks, yet the deployment of VLMs on downstream application platforms remains challenging due to their prohibitive requirements of training samples and computing resources. Fine-tuning and quantization of VLMs can substantially reduce the sample and computation costs, which are urgently needed. There are two prevailing paradigms in quantization: Quantization-Aware Training (QAT) can effectively quantize large-scale VLMs but incurs a huge training cost, while low-bit Post-Training Quantization (PTQ) suffers from a notable performance drop. We propose a method that balances fine-tuning and quantization named "Prompt for Quantization" (P4Q), in which we design a lightweight architecture to leverage contrastive loss supervision to enhance the recognition performance of a PTQ model. Our method can effectively reduce the gap between image features and text features caused by low-bit quantization, based on learnable prompts to reorganize textual representations and a low-bit adapter to realign the distributions of image and text features. We also introduce a distillation loss based on cosine similarity predictions to distill the quantized model using a full-precision teacher. Extensive experimental results demonstrate that our P4Q method outperforms prior arts, even achieving comparable results to its full-precision counterparts. For instance, our 8-bit P4Q can theoretically compress the CLIP-ViT/B-32 by 4× while achieving 66.94% Top-1 accuracy, outperforming the learnable prompt fine-tuned full-precision model by 2.24% with negligible additional parameters on the ImageNet dataset.
Published: 2024

2. Bilateral Sharpness-Aware Minimization for Flatter Minima

Author: Deng, Jiaxin, Pang, Junbiao, Zhang, Baochang, and Huang, Qingming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Sharpness-Aware Minimization (SAM) enhances generalization by reducing a Max-Sharpness (MaxS). Despite the practical success, we empirically found that the MaxS behind SAM's generalization enhancements faces the "Flatness Indicator Problem" (FIP), where SAM only considers the flatness in the direction of gradient ascent, resulting in a next minimization region that is not sufficiently flat. A better Flatness Indicator (FI) would bring better generalization of neural networks, because SAM is a greedy search method in nature. In this paper, we propose to utilize the difference between the training loss and the minimum loss over the neighborhood surrounding the current weight, which we denote as Min-Sharpness (MinS). By merging MaxS and MinS, we created a better FI that indicates a flatter direction during the optimization. Specifically, we combine this FI with SAM into the proposed Bilateral SAM (BSAM), which finds a flatter minimum than that of SAM. The theoretical analysis proves that BSAM converges to local minima. Extensive experiments demonstrate that BSAM offers superior generalization performance and robustness compared to vanilla SAM across various tasks, i.e., classification, transfer learning, human pose estimation, and network quantization. Code is publicly available at: https://github.com/ajiaaa/BSAM.
Published: 2024

3. DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

Author: Liu, Xuhui, Qiao, Zhi, Liu, Runkun, Li, Hong, Zhang, Juan, Zhen, Xiantong, Qian, Zhen, and Zhang, Baochang
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Computed tomography (CT) is widely utilized in clinical settings because it delivers detailed 3D images of the human body. However, performing CT scans is not always feasible due to radiation exposure and limitations in certain surgical environments. As an alternative, reconstructing CT images from ultra-sparse X-rays offers a valuable solution and has gained significant interest in scientific research and medical applications. However, it presents great challenges as it is inherently an ill-posed problem, often compromised by artifacts resulting from overlapping structures in X-ray images. In this paper, we propose DiffuX2CT, which models CT reconstruction from orthogonal biplanar X-rays as a conditional diffusion process. DiffuX2CT is established with a 3D global coherence denoising model with a new, implicit conditioning mechanism. We realize the conditioning mechanism by a newly designed tri-plane decoupling generator and an implicit neural decoder. By doing so, DiffuX2CT achieves structure-controllable reconstruction, which enables 3D structural information to be recovered from 2D X-rays, therefore producing faithful textures in CT images. As an extra contribution, we collect a real-world lumbar CT dataset, called LumbarV, as a new benchmark to verify the clinical significance and performance of CT reconstruction from X-rays. Extensive experiments on this dataset and three more publicly available datasets demonstrate the effectiveness of our proposal.
Published: 2024

4. Asymptotic Unbiased Sample Sampling to Speed Up Sharpness-Aware Minimization

Author: Deng, Jiaxin, Pang, Junbiao, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Sharpness-Aware Minimization (SAM) has emerged as a promising approach for effectively reducing the generalization error. However, SAM incurs twice the computational cost compared to the base optimizer (e.g., SGD). We propose Asymptotic Unbiased Sampling with respect to iterations to accelerate SAM (AUSAM), which maintains the model's generalization capacity while significantly enhancing computational efficiency. Concretely, we probabilistically sample a subset of data points beneficial for SAM optimization based on a theoretically guaranteed criterion, i.e., the Gradient Norm of each Sample (GNS). We further approximate the GNS by the difference in loss values before and after perturbation in SAM. As a plug-and-play, architecture-agnostic method, our approach consistently accelerates SAM across a range of tasks and networks, i.e., classification, human pose estimation and network quantization. On CIFAR10/100 and Tiny-ImageNet, AUSAM achieves results comparable to SAM while providing a speedup of over 70%. Compared to recent dynamic data pruning methods, AUSAM is better suited for SAM and excels in maintaining performance. Additionally, AUSAM accelerates optimization in human pose estimation and model quantization without sacrificing performance, demonstrating its broad practicality.
Published: 2024

5. DecomCAM: Advancing Beyond Saliency Maps through Decomposition and Integration

Author: Yang, Yuguang, Guo, Runtang, Wu, Sheng, Wang, Yimi, Yang, Linlin, Fan, Bo, Zhong, Jilong, Zhang, Juan, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Interpreting complex deep networks, notably pre-trained vision-language models (VLMs), is a formidable challenge. Current Class Activation Map (CAM) methods highlight regions revealing the model's decision-making basis but lack clear saliency maps and detailed interpretability. To bridge this gap, we propose DecomCAM, a novel decomposition-and-integration method that distills shared patterns from channel activation maps. Utilizing singular value decomposition, DecomCAM decomposes class-discriminative activation maps into orthogonal sub-saliency maps (OSSMs), which are then integrated based on their contribution to the target concept. Extensive experiments on six benchmarks reveal that DecomCAM not only excels in locating accuracy but also achieves an optimal balance between interpretability and computational efficiency. Further analysis unveils that OSSMs correlate with discernible object components, facilitating a granular understanding of the model's reasoning. This positions DecomCAM as a potential tool for fine-grained interpretation of advanced deep learning models. The code is available at https://github.com/CapricornGuang/DecomCAM.
Comment: Accepted by Neurocomputing journal
Published: 2024

6. An AI-Enabled Framework Within Reach for Enhancing Healthcare Sustainability and Fairness

Author: Huang, Bin, Zhao, Changchen, Liu, Zimeng, Hong, Shenda, Zhang, Baochang, Lu, Hao, Liu, Zhijun, Wang, Wenjin, and Liu, Hui
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Good health and well-being is among the key issues in the United Nations 2030 Sustainable Development Goals. The rising prevalence of large-scale infectious diseases and the accelerated aging of the global population are driving the transformation of healthcare technologies. In this context, establishing large-scale public health datasets, developing medical models, and creating decision-making systems with a human-centric approach are of strategic significance. Recently, by leveraging the extraordinary number of accessible cameras, groundbreaking advancements have emerged in AI methods for physiological signal monitoring and disease diagnosis using camera sensors. These approaches, requiring no specialized medical equipment, offer convenient ways of collecting large-scale medical data in response to public health events. Therefore, we outline a prospective framework and heuristic vision for a camera-based public health (CBPH) framework utilizing visual physiological monitoring technology. The CBPH can be considered as a convenient and universal framework for public health, advancing the United Nations Sustainable Development Goals, particularly in promoting the universality, sustainability, and equity of healthcare in low- and middle-income countries or regions. Furthermore, CBPH provides a comprehensive solution for building a large-scale and human-centric medical database, and a multi-task large medical model for public health and medical scientific discoveries. It has a significant potential to revolutionize personal monitoring technologies, digital medicine, telemedicine, and primary health care in public health. Therefore, it can be deemed that the outcomes of this paper will contribute to the establishment of a sustainable and fair framework for public health, which serves as a crucial bridge for advancing scientific discoveries in the realm of AI for medicine (AI4Medicine).
Comment: 16 pages, 5 figures
Published: 2024

7. Fusion-Mamba for Cross-modality Object Detection

Author: Dong, Wenhao, Zhu, Haodong, Lin, Shaohui, Luo, Xiaoyan, Shen, Yunhang, Liu, Xuhui, Zhang, Juan, Guo, Guodong, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Cross-modality fusion of complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hard to fuse. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on mAP by 5.9% on M³FD and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
Published: 2024

8. Real-time guidewire tracking and segmentation in intraoperative x-ray

Author: Zhang, Baochang, Bui, Mai, Wang, Cheng, Bourier, Felix, Schunkert, Heribert, and Navab, Nassir
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: During endovascular interventions, physicians have to perform accurate and immediate operations based on the available real-time information, such as the shape and position of guidewires observed on the fluoroscopic images, haptic information and the patients' physiological signals. For this purpose, real-time and accurate guidewire segmentation and tracking can enhance the visualization of guidewires and provide visual feedback for physicians during the intervention as well as for robot-assisted interventions. Nevertheless, this task often comes with the challenge of elongated deformable structures that present themselves with low contrast in the noisy fluoroscopic image sequences. To address these issues, a two-stage deep learning framework for real-time guidewire segmentation and tracking is proposed. In the first stage, a YOLOv5s detector is trained, using the original X-ray images as well as synthetic ones, which is employed to output the bounding boxes of possible target guidewires. More importantly, a refinement module based on spatiotemporal constraints is incorporated to robustly localize the guidewire and remove false detections. In the second stage, a novel and efficient network is proposed to segment the guidewire in each detected bounding box. The network contains two major modules, namely a Hessian-based enhancement embedding module and a dual self-attention module. Quantitative and qualitative evaluations on clinical intra-operative images demonstrate that the proposed approach significantly outperforms our baselines as well as the current state of the art and, in comparison, shows higher robustness to low quality images.
Published: 2024

9. A General and Efficient Training for Transformer via Token Expansion

Author: Huang, Wenxuan, Shen, Yunhang, Xie, Jiao, Zhang, Baochang, He, Gaoqi, Li, Ke, Sun, Xing, and Lin, Shaohui
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality and suffer accuracy drops. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme, Token Expansion (termed ToE), to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers, preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but is also effective for efficient training frameworks (e.g., EfficientTrain), without twisting the original training hyper-parameters or architecture, or introducing additional training strategies. Extensive experiments demonstrate that ToE makes the training of ViTs about 1.3x faster in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion.
Comment: Accepted to CVPR 2024. Code is available at https://github.com/Osilly/TokenExpansion
Published: 2024

10. A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification

Author: Wu, Jiaqi, Pang, Junbiao, Zhang, Baochang, and Huang, Qingming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Semi-supervised learning (SSL) is a practical challenge in computer vision. Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, obtain state-of-the-art (SOTA) performance in SSL. These approaches employ a threshold-to-pseudo-label (T2L) process to generate PLs by truncating the confidence scores of unlabeled data predicted by the self-training method. However, self-trained models typically yield biased and high-variance predictions, especially in scenarios where little labeled data is supplied. To address this issue, we propose a lightweight channel-based ensemble method to effectively consolidate multiple inferior PLs into a theoretically guaranteed unbiased and low-variance one. Importantly, our approach can be readily extended to any SSL framework, such as FixMatch or FreeMatch. Experimental results demonstrate that our method significantly outperforms state-of-the-art techniques on CIFAR10/100 in terms of effectiveness and efficiency.
Published: 2024

11. F²Depth: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis

Author: Guo, Xiaotong, Zhao, Huijie, Shao, Shuwei, Li, Xudong, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised monocular depth estimation methods have received increasing attention because they do not require large, labelled datasets. Such self-supervised methods require high-quality salient features and consequently suffer from a severe performance drop in indoor scenes, where the dominant low-textured regions are almost indiscriminative. To address the issue, we propose a self-supervised indoor monocular depth estimation framework called F²Depth. A self-supervised optical flow estimation network is introduced to supervise depth learning. To improve optical flow estimation performance in low-textured areas, only some patches of points with more discriminative features are adopted for finetuning based on our well-designed patch-based photometric loss. The finetuned optical flow estimation network generates high-accuracy optical flow as a supervisory signal for depth estimation. Correspondingly, an optical flow consistency loss is designed. Multi-scale feature maps produced by the finetuned optical flow estimation network are warped to compute a feature map synthesis loss as another supervisory signal for depth learning. Experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of the framework and our proposed losses. To evaluate the generalization ability of our F²Depth, we collect a Campus Indoor depth dataset composed of approximately 1500 points selected from 99 images in 18 scenes. Zero-shot generalization experiments on the 7-Scenes dataset and Campus Indoor achieve δ₁ accuracy of 75.8% and 76.0% respectively. The accuracy results show that our model can generalize well to monocular images captured in unknown indoor scenes.
Published: 2024

12. RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model

Author: Wang, Mingze, Su, Lili, Yan, Cilin, Xu, Sheng, Yuan, Pengcheng, Jiang, Xiaolong, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The intelligent interpretation of buildings plays a significant role in urban planning and management, macroeconomic analysis, population dynamics, etc. Remote sensing image building interpretation primarily encompasses building extraction and change detection. However, current methodologies often treat these two tasks as separate entities, thereby failing to leverage shared knowledge. Moreover, the complexity and diversity of remote sensing image scenes pose additional challenges, as most algorithms are designed to model individual small datasets, thus lacking cross-scene generalization. In this paper, we propose a comprehensive remote sensing image building understanding model, termed RSBuilding, developed from the perspective of the foundation model. RSBuilding is designed to enhance cross-scene generalization and task universality. Specifically, we extract image features based on the prior knowledge of the foundation model and devise a multi-level feature sampler to augment scale information. To unify task representation and integrate image spatiotemporal clues, we introduce a cross-attention decoder with task prompts. Addressing the current shortage of datasets that incorporate annotations for both tasks, we have developed a federated training strategy to facilitate smooth model convergence even when supervision for some tasks is missing, thereby bolstering the complementarity of different tasks. Our model was trained on a dataset comprising up to 245,000 images and validated on multiple building extraction and change detection datasets. The experimental results substantiate that RSBuilding can concurrently handle two structurally distinct tasks and exhibits robust zero-shot generalization capabilities.
Published: 2024

13. Learning Accurate Low-bit Quantization towards Efficient Computational Imaging

Author: Xu, Sheng, Li, Yanjing, Liu, Chuanjian, and Zhang, Baochang
Published: 2024

14. Effective Gradient Sample Size via Variation Estimation for Accelerating Sharpness aware Minimization

Author: Deng, Jiaxin, Pang, Junbiao, Zhang, Baochang, and Wang, Tian
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Sharpness-aware Minimization (SAM) has been proposed recently to improve model generalization ability. However, SAM calculates the gradient twice in each optimization step, thereby doubling the computation costs compared to stochastic gradient descent (SGD). In this paper, we propose a simple yet efficient sampling method to significantly accelerate SAM. Concretely, we discover that the gradient of SAM is a combination of the gradient of SGD and the Projection of the Second-order gradient matrix onto the First-order gradient (PSF). PSF exhibits a gradually increasing frequency of change during the training process. To leverage this observation, we propose an adaptive sampling method based on the variation of PSF, and we reuse the sampled PSF for non-sampling iterations. Extensive empirical results illustrate that the proposed method achieved state-of-the-art accuracies comparable to SAM on diverse network architectures.
Published: 2024

15. Push Quantization-Aware Training Toward Full Precision Performances via Consistency Regularization

Author: Pang, Junbiao, Cai, Tianyang, Zhang, Baochang, Wu, Jiaqi, and Tao, Ye
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing Quantization-Aware Training (QAT) methods intensively depend on the complete labeled dataset or knowledge distillation to guarantee the performances toward Full Precision (FP) accuracies. However, empirical results show that QAT still has inferior results compared to its FP counterpart. One question is how to push QAT toward or even surpass FP performances. In this paper, we address this issue from a new perspective by injecting the vicinal data distribution information to improve the generalization performances of QAT effectively. We present a simple, novel, yet powerful method introducing a Consistency Regularization (CR) for QAT. Concretely, CR assumes that augmented samples should be consistent in the latent feature space. Our method generalizes well to different network architectures and various QAT methods. Extensive experiments demonstrate that our approach significantly outperforms the current state-of-the-art QAT methods and even FP counterparts.
Comment: 11 pages, 5 figures
Published: 2024

16. ZONE: Zero-Shot Instruction-Guided Local Editing

Author: Li, Shanglin, Zeng, Bohan, Feng, Yutang, Gao, Sicheng, Liu, Xuhui, Liu, Jiaming, Lin, Li, Tang, Xu, Hu, Yao, Liu, Jianzhuang, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: First, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g., "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segment model. We further develop an edge smoother based on FFT for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE.
Comment: Accepted at CVPR 2024
Published: 2023

17. Federated Learning via Input-Output Collaborative Distillation

Author: Gong, Xuan, Li, Shanglin, Bao, Yuxiang, Yao, Barry, Huang, Yawen, Wu, Ziyan, Zhang, Baochang, Zheng, Yefeng, and Doermann, David
Subjects: Computer Science - Machine Learning
Abstract: Federated learning (FL) is a machine learning paradigm in which distributed local nodes collaboratively train a central model without sharing individually held private data. Existing FL methods either iteratively share local model parameters or deploy co-distillation. However, the former is highly susceptible to private data leakage, and the latter design relies on the prerequisites of task-relevant real data. Instead, we propose a data-free FL framework based on local-to-central collaborative distillation with direct input and output space exploitation. Our design eliminates any requirement of recursive local parameter exchange or auxiliary task-relevant data to transfer knowledge, thereby giving direct privacy control to local users. In particular, to cope with the inherent data heterogeneity across locals, our technique learns to distill input on which each local model produces consensual yet unique results to represent each expertise. Our proposed FL framework achieves notable privacy-utility trade-offs with extensive experiments on image classification and segmentation tasks under various real-world heterogeneous federated learning settings on both natural and medical images.
Comment: Accepted at AAAI 2024
Published: 2023

18. Tuning-Free Inversion-Enhanced Control for Consistent Image Editing

Author: Duan, Xiaoyue, Cui, Shuhao, Kang, Guoliang, Zhang, Baochang, Fei, Zhengcong, Fan, Mingyuan, and Huang, Junshi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Consistent editing of real images is a challenging task, as it requires performing non-rigid edits (e.g., changing postures) to the main objects in the input image without changing their identity or attributes. To guarantee consistent attributes, some existing methods fine-tune the entire model or the textual embedding for structural consistency, but they are time-consuming and fail to perform non-rigid edits. Other works are tuning-free, but their performances are weakened by the quality of Denoising Diffusion Implicit Model (DDIM) reconstruction, which often fails in real-world scenarios. In this paper, we present a novel approach called Tuning-free Inversion-enhanced Control (TIC), which directly correlates features from the inversion process with those from the sampling process to mitigate the inconsistency in DDIM reconstruction. Specifically, our method effectively obtains inversion features from the key and value features in the self-attention layers, and enhances the sampling process by these inversion features, thus achieving accurate reconstruction and content-consistent editing. To extend the applicability of our method to general editing scenarios, we also propose a mask-guided attention concatenation strategy that combines contents from both the inversion and the naive DDIM editing processes. Experiments show that the proposed method outperforms previous works in reconstruction and consistent editing, and produces impressive results in various settings.
Published: 2023

19. DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays

Author: Liu, Xuhui, Qiao, Zhi, Liu, Runkun, Li, Hong, Zhang, Juan, Zhen, Xiantong, Qian, Zhen, and Zhang, Baochang; Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Members: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti; Editors: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Published: 2025

20. LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Author: Bao, Yuxiang, Qiu, Di, Kang, Guoliang, Zhang, Baochang, Jin, Bo, Wang, Kaiye, and Yan, Pengfei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across video frames generated by image diffusion models. Previous methods typically adopt cross-frame attention, i.e., sharing the key and value tokens across attentions of different frames, to encourage temporal consistency. However, in those works, the temporal inconsistency issue may not be thoroughly solved, limiting the fidelity of generated videos. In this paper, we find the bottleneck lies in the unconstrained query tokens and propose a new zero-shot video-to-video translation framework, named LatentWarp. Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space. Specifically, based on the optical flow obtained from the original video, we warp the generated latent features of the last frame to align with the current frame during the denoising process. As a result, the corresponding regions across the adjacent frames can share closely-related query tokens and attention outputs, which can further improve latent-level consistency to enhance visual temporal coherence of generated videos. Extensive experimental results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
Published: 2023

21. IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts

Author: Zeng, Bohan, Li, Shanglin, Feng, Yutang, Yang, Ling, Li, Hong, Gao, Sicheng, Liu, Jiaming, He, Conghui, Zhang, Wentao, Liu, Jianzhuang, Zhang, Baochang, and Yan, Shuicheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in 3D generation have been remarkable, with methods such as DreamFusion leveraging large-scale text-to-image diffusion-based models to guide 3D object generation. These methods enable the synthesis of detailed and photorealistic textured objects. However, the appearance of 3D objects produced by such text-to-3D models is often unpredictable, and it is hard for single-image-to-3D methods to deal with images lacking a clear subject, complicating the generation of appearance-controllable 3D objects from complex images. To address these challenges, we present IPDreamer, a novel method that captures intricate appearance features from complex Image Prompts and aligns the synthesized 3D object with these extracted features, enabling high-fidelity, appearance-controllable 3D object generation. Our experiments demonstrate that IPDreamer consistently generates high-quality 3D objects that align with both the textual and complex image prompts, highlighting its promising capability in appearance-controlled, complex 3D object generation. Our code is available at https://github.com/zengbohan0217/IPDreamer.
Comment: 20 pages, 12 figures
Published: 2023

22. Heterogeneous Generative Knowledge Distillation with Masked Image Modeling

Author: Wang, Ziming, Han, Shumin, Wang, Xiaodi, Hao, Jing, Cao, Xianbin, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Small CNN-based models usually require transferring knowledge from a large model before they are deployed on computationally resource-limited edge devices. Masked image modeling (MIM) methods achieve great success in various visual tasks but remain largely unexplored in knowledge distillation for heterogeneous deep models. This is mainly due to the significant discrepancy between the Transformer-based large model and the CNN-based small network. In this paper, we develop the first Heterogeneous Generative Knowledge Distillation (H-GKD) based on MIM, which can efficiently transfer knowledge from large Transformer models to small CNN-based models in a generative self-supervised fashion. Our method builds a bridge between Transformer-based models and CNNs by training a UNet-style student with sparse convolution, which can effectively mimic the visual representation inferred by a teacher over masked modeling. Our method is a simple yet effective learning paradigm to learn the visual representation and distribution of data from heterogeneous teacher models, which can be pre-trained using advanced generative methods. Extensive experiments show that it adapts well to various models and sizes, consistently achieving state-of-the-art performance in image classification, object detection, and semantic segmentation tasks. For example, on the ImageNet-1K dataset, H-GKD improves the accuracy of ResNet-50 (sparse) from 76.98% to 80.01%.
Published: 2023

23. Machine learning and visual perception.

Author: Zhang, Baochang
Subjects: Computer Signals, Machine learning, Visual basic
Abstract: Summary: Machine Learning and Visual Perception provides an up-to-date overview of the topic, including the PAC model, decision trees, Bayesian learning, support vector machines, AdaBoost, and compressive sensing, among others. Both classic and novel algorithms are introduced in classifier design, face recognition, deep learning, time series recognition, image classification, and object detection.
Published: 2019

24. Representation Disparity-aware Distillation for 3D Object Detection

Author: Li, Yanjing, Xu, Sheng, Lin, Mingbao, Yin, Jihao, Zhang, Baochang, and Cao, Xianbin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we focus on developing knowledge distillation (KD) for compact 3D detectors. We observe that off-the-shelf KD methods manifest their efficacy only when the teacher model and student counterpart share similar intermediate feature representations. This might explain why they are less effective in building extremely compact 3D detectors, where significant representation disparity arises primarily from the intrinsic sparsity and irregularity of 3D point clouds. This paper presents a novel representation disparity-aware distillation (RDD) method to address the representation disparity issue and reduce the performance gap between compact students and over-parameterized teachers. This is accomplished by building our RDD from an innovative perspective of information bottleneck (IB), which can effectively minimize the disparity of proposal region pairs from student and teacher in features and logits. Extensive experiments are performed to demonstrate the superiority of our RDD over existing KD methods. For example, our RDD increases the mAP of CP-Voxel-S to 57.1% on the nuScenes dataset, which even surpasses teacher performance while taking up only 42% of the FLOPs.
Comment: Accepted by ICCV2023. arXiv admin note: text overlap with arXiv:2205.15156 by other authors
Published: 2023

25. Q-YOLO: Efficient Inference for Real-time Object Detection

Author: Wang, Mingze, Sun, Huixin, Shi, Jun, Liu, Xuhui, Zhang, Baochang, and Cao, Xianbin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-time object detection plays a vital role in various computer vision applications. However, deploying real-time object detectors on resource-constrained platforms poses challenges due to high computational and memory requirements. This paper describes a low-bit quantization method to build a highly efficient one-stage detector, dubbed Q-YOLO, which can effectively address the performance degradation problem caused by activation distribution imbalance in traditional quantized YOLO models. Q-YOLO introduces a fully end-to-end Post-Training Quantization (PTQ) pipeline with a well-designed Unilateral Histogram-based (UH) activation quantization scheme, which determines the maximum truncation values through histogram analysis by minimizing the Mean Squared Error (MSE) of quantization. Extensive experiments on the COCO dataset demonstrate the effectiveness of Q-YOLO, outperforming other PTQ methods while achieving a more favorable balance between accuracy and computational cost. This research contributes to advancing the efficient deployment of object detection models on resource-limited edge devices, enabling real-time detection with reduced computational and memory overhead.
Published: 2023

26. Filter Pruning for Efficient CNNs via Knowledge-driven Differential Filter Sampler

Author: Lin, Shaohui, Huang, Wenxuan, Xie, Jiao, Zhang, Baochang, Shen, Yunhang, Yu, Zhou, Han, Jungong, and Doermann, David
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Filter pruning simultaneously accelerates the computation and reduces the memory overhead of CNNs, which can be effectively applied to edge devices and cloud services. In this paper, we propose a novel Knowledge-driven Differential Filter Sampler (KDFS) with Masked Filter Modeling (MFM) framework for filter pruning, which globally prunes the redundant filters based on the prior knowledge of a pre-trained model in a differential and non-alternative optimization. Specifically, we design a differential sampler with learnable sampling parameters to build a binary mask vector for each layer, determining whether the corresponding filters are redundant. To learn the mask, we introduce masked filter modeling to construct PCA-like knowledge by aligning the intermediate features from the pre-trained teacher model and the outputs of the student decoder taking sampling features as the input. The mask and sampler are directly optimized by the Gumbel-Softmax Straight-Through Gradient Estimator in an end-to-end manner in combination with global pruning constraint, MFM reconstruction error, and dark knowledge. Extensive experiments demonstrate the proposed KDFS's effectiveness in compressing the base models on various datasets. For instance, the pruned ResNet-50 on ImageNet achieves 55.36% computation reduction and 42.86% parameter reduction, while only dropping 0.35% Top-1 accuracy, significantly outperforming the state-of-the-art methods. The code is available at https://github.com/Osilly/KDFS.
Published: 2023

27. DCP-NAS: Discrepant Child-Parent Neural Architecture Search for 1-bit CNNs

Author: Li, Yanjing, Xu, Sheng, Cao, Xianbin, Zhuo, Li'an, Zhang, Baochang, Wang, Tian, and Guo, Guodong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Neural architecture search (NAS) proves to be among the effective approaches for many tasks by generating an application-adaptive neural architecture, which is still challenged by high computational cost and memory consumption. At the same time, 1-bit convolutional neural networks (CNNs) with binary weights and activations show their potential for resource-limited embedded devices. One natural approach is to use 1-bit CNNs to reduce the computation and memory cost of NAS by taking advantage of the strengths of each in a unified framework, while searching the 1-bit CNNs is more challenging due to the more complicated processes involved. In this paper, we introduce Discrepant Child-Parent Neural Architecture Search (DCP-NAS) to efficiently search 1-bit CNNs, based on a new framework of searching the 1-bit model (Child) under the supervision of a real-valued model (Parent). Particularly, we first utilize a Parent model to calculate a tangent direction, based on which the tangent propagation method is introduced to search the optimized 1-bit Child. We further observe a coupling relationship between the weights and architecture parameters existing in such differentiable frameworks. To address the issue, we propose a decoupled optimization method to search an optimized architecture. Extensive experiments demonstrate that our DCP-NAS achieves much better results than prior arts on both CIFAR-10 and ImageNet datasets. In particular, the backbones achieved by our DCP-NAS achieve strong generalization performance on person re-identification and object detection.
Comment: Accepted by International Journal of Computer Vision
Published: 2023

28. Self-Enhancement Improves Text-Image Retrieval in Foundation Visual-Language Models

Author: Yang, Yuguang, Wang, Yiming, Geng, Shupeng, Wang, Runqi, Wang, Yimi, Wu, Sheng, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The emergence of cross-modal foundation models has introduced numerous approaches grounded in text-image retrieval. However, on some domain-specific retrieval tasks, these models fail to focus on the key attributes required. To address this issue, we propose a self-enhancement framework, A³R, based on the CLIP-ViT/G-14, one of the largest cross-modal models. First, we perform an Attribute Augmentation strategy to enrich the textual description for fine-grained representation before model learning. Then, we propose an Adaption Re-ranking method to unify the representation space of textual query and candidate images and re-rank candidate images relying on the adapted query after model learning. The proposed framework is validated to achieve a salient improvement over the baseline and other teams' solutions in the cross-modal image retrieval track of the 1st foundation model challenge without introducing any additional samples. The code is available at https://github.com/CapricornGuang/A3R.
Comment: Accepted by CVPR 2023 Workshop
Published: 2023

29. Applications

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

30. Quantization of Neural Networks

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

31. Binary Neural Architecture Search

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

32. Introduction

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

33. Network Pruning

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

34. Binary Neural Networks

Author: Zhang, Baochang, Wang, Tiancheng, Xu, Sheng, and Doermann, David; Bandyopadhyay, Sanghamitra (Founding Editor), Maulik, Ujjwal (Founding Editor), and Siarry, Patrick (Series Editor)
Published: 2024
Full Text: View/download PDF

35. Flexible TAM requirement of TnpB enables efficient single-nucleotide editing with expanded targeting scope

Author: Feng, Xu, Xu, Ruyi, Liao, Jianglan, Zhao, Jingyu, Zhang, Baochang, Xu, Xiaoxiao, Zhao, Pengpeng, Wang, Xiaoning, Yao, Jianyun, Wang, Pengxia, Wang, Xiaoxue, Han, Wenyuan, and She, Qunxin
Published: 2024
Full Text: View/download PDF

36. Decom-CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map

Author: Yang, Yuguang, Guo, Runtang, Wu, Sheng, Wang, Yimi, Zhang, Juan, Gong, Xuan, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Interpretation of deep learning remains a very challenging problem. Although the Class Activation Map (CAM) is widely used to interpret deep model predictions by highlighting object location, it fails to provide insight into the salient features used by the model to make decisions. Furthermore, existing evaluation protocols often overlook the correlation between interpretability performance and the model's decision quality, which presents a more fundamental issue. This paper proposes a new two-stage interpretability method called the Decomposition Class Activation Map (Decom-CAM), which offers a feature-level interpretation of the model's prediction. Decom-CAM decomposes intermediate activation maps into orthogonal features using singular value decomposition and generates saliency maps by integrating them. The orthogonality of features enables CAM to capture local features and can be used to pinpoint semantic components such as eyes, noses, and faces in the input image, making it more beneficial for deep model interpretation. To ensure a comprehensive comparison, we introduce a new evaluation protocol by dividing the dataset into subsets based on classification accuracy results and evaluating the interpretability performance on each subset separately. Our experiments demonstrate that the proposed Decom-CAM outperforms current state-of-the-art methods significantly by generating more precise saliency maps across all levels of classification accuracy. Combined with our feature-level interpretability approach, this paper could pave the way for a new direction for understanding the decision-making process of deep neural networks.
Comment: This version has not included sufficient evidence for its claims
Published: 2023

37. Bi-ViT: Pushing the Limit of Vision Transformer Quantization

Author: Li, Yanjing, Xu, Sheng, Lin, Mingbao, Cao, Xianbin, Liu, Chuanjian, Sun, Xiao, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision transformer (ViT) quantization offers a promising prospect to facilitate deploying large pre-trained networks on resource-limited devices. Fully-binarized ViTs (Bi-ViT), which push the quantization of ViTs to its limit, remain largely unexplored and very challenging due to their unacceptable performance. Through extensive empirical analyses, we identify that the severe drop in ViT binarization is caused by attention distortion in self-attention, which technically stems from gradient vanishing and ranking disorder. To address these issues, we first introduce a learnable scaling factor to reactivate the vanished gradients and illustrate its effectiveness through theoretical and experimental analyses. We then propose a ranking-aware distillation method to rectify the disordered ranking in a teacher-student framework. Bi-ViT achieves significant improvements over popular DeiT and Swin backbones in terms of Top-1 accuracy and FLOPs. For example, with DeiT-Tiny and Swin-Tiny, our method significantly outperforms baselines by 22.1% and 21.4% respectively, while achieving 61.5x and 56.1x theoretical acceleration in terms of FLOPs compared with real-valued counterparts on ImageNet.
Published: 2023

38. AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

Author: Wang, Runqi, Duan, Xiaoyue, Kang, Guoliang, Liu, Jianzhuang, Lin, Shaohui, Xu, Songcen, Lv, Jinhu, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Continual learning aims to enable a model to incrementally learn knowledge from sequentially arriving data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arriving tasks or classes, but one specific group of classifier weights corresponding to each new class must be incrementally added. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier contains all previously seen classes, a certain amount of memory is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed to extract features from both images and text. Text consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute word bank and serve as attributes. As we compute the visual and textual similarity for classification, AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate catastrophic forgetting and avoid constructing a replay memory. We evaluate our AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain-shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-arts. The implementation code is available at https://github.com/bhrqw/AttriCLIP.
Published: 2023

39. Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment

Author: Wang, Runqi, Zheng, Hao, Duan, Xiaoyue, Liu, Jianzhuang, Lu, Yuning, Wang, Tian, Xu, Songcen, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, there exist two crucial problems: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) the alignment between the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide the attacks on class-irrelevant image areas. By messing up these areas, the critical features are captured and the visual distributions of image features are calibrated. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, the upper bound of EMD is derived. In addition, we propose an augmentation strategy to increase the diversity of the images and the text prompts, which can reduce overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior arts in few-shot learning. The implementation code will be available at https://github.com/bhrqw/SADA.
Published: 2023

40. Controllable Mind Visual Diffusion Model

Author: Zeng, Bohan, Li, Shanglin, Liu, Xuhui, Gao, Sicheng, Jiang, Xiaolong, Tang, Xu, Hu, Yao, Liu, Jianzhuang, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Brain signal visualization has emerged as an active research area, serving as a critical interface between the human visual system and computer vision models. Although diffusion models have shown promise in analyzing functional magnetic resonance imaging (fMRI) data, including reconstructing high-quality images consistent with original visual stimuli, their accuracy in extracting semantic and silhouette information from brain signals remains limited. In this regard, we propose a novel approach, referred to as Controllable Mind Visual Diffusion Model (CMVDM). CMVDM extracts semantic and silhouette information from fMRI data using attribute alignment and assistant networks. Additionally, a residual block is incorporated to capture information beyond semantic and silhouette features. We then leverage a control model to fully exploit the extracted information for image synthesis, resulting in generated images that closely resemble the visual stimuli in terms of semantics and silhouette. Through extensive experimentation, we demonstrate that CMVDM outperforms existing state-of-the-art methods both qualitatively and quantitatively.
Comment: 16 pages, 11 figures
Published: 2023

41. MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation

Author: Guo, Jie, Wang, Qimeng, Gao, Yan, Jiang, Xiaolong, Tang, Xu, Hu, Yao, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: CLIP (Contrastive Language-Image Pretraining) is well-developed for open-vocabulary zero-shot image-level recognition, while its applications in pixel-level tasks are less investigated, where most efforts directly adopt CLIP features without deliberate adaptation. In this work, we first demonstrate the necessity of image-pixel CLIP feature adaptation, then provide Multi-View Prompt learning (MVP-SEG) as an effective solution to achieve image-pixel adaptation and to solve open-vocabulary semantic segmentation. Concretely, MVP-SEG deliberately learns multiple prompts trained by our Orthogonal Constraint Loss (OCLoss), by which each prompt is supervised to exploit CLIP features on different object parts, and collaborative segmentation masks generated by all prompts promote better segmentation. Moreover, MVP-SEG introduces Global Prompt Refining (GPR) to further eliminate class-wise segmentation noise. Experiments show that the multi-view prompts learned from seen categories have strong generalization to unseen categories, and MVP-SEG+, which combines the knowledge transfer stage, significantly outperforms previous methods on several benchmarks. Moreover, qualitative results justify that MVP-SEG does lead to better focus on different local parts.
Published: 2023

42. Face Animation with an Attribute-Guided Diffusion Model

Author: Zeng, Bohan, Liu, Xuhui, Gao, Sicheng, Liu, Boyu, Li, Hong, Liu, Jianzhuang, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Face animation has achieved much progress in computer vision. However, prevailing GAN-based methods suffer from unnatural distortions and artifacts due to sophisticated motion deformation. In this paper, we propose a Face Animation framework with an attribute-guided Diffusion Model (FADM), which is the first work to exploit the superior modeling capacity of diffusion models for photo-realistic talking-head generation. To mitigate the uncontrollable synthesis effect of the diffusion model, we design an Attribute-Guided Conditioning Network (AGCN) to adaptively combine the coarse animation features and 3D face reconstruction results, which can incorporate appearance and motion conditions into the diffusion process. These specific designs help FADM rectify unnatural artifacts and distortions, and also enrich high-fidelity facial details through iterative diffusion refinements with accurate animation attributes. FADM can flexibly and effectively improve existing animation videos. Extensive experiments on widely used talking-head benchmarks validate the effectiveness of FADM over prior arts.
Comment: 8 pages, 6 figures
Published: 2023

43. Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

Author: Xu, Sheng, Li, Yanjing, Lin, Mingbao, Gao, Peng, Guo, Guodong, Lu, Jinhu, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent detection transformer (DETR) has advanced object detection, but its application on resource-constrained devices requires massive computation and memory resources. Quantization stands out as a solution by representing the network in low-bit parameters and operations. However, there is a significant performance drop when performing low-bit quantized DETR (Q-DETR) with existing quantization methods. Through our empirical analyses, we find that the bottleneck of Q-DETR comes from query information distortion. This paper addresses this problem based on a distribution rectification distillation (DRD). We formulate our DRD as a bi-level optimization problem, which can be derived by generalizing the information bottleneck (IB) principle to the learning of Q-DETR. At the inner level, we conduct a distribution alignment for the queries to maximize the self-information entropy. At the upper level, we introduce a new foreground-aware query matching scheme to effectively transfer the teacher information to distillation-desired features to minimize the conditional information entropy. Extensive experimental results show that our method performs much better than prior arts. For example, the 4-bit Q-DETR can theoretically accelerate DETR with ResNet-50 backbone by 6.6x and achieve 39.4% AP, with only a 2.6% performance gap from its real-valued counterpart on the COCO dataset.
Published: 2023

44. Implicit Diffusion Models for Continuous Super-Resolution

Author: Gao, Sicheng, Liu, Xuhui, Zeng, Bohan, Xu, Sheng, Li, Yanjing, Luo, Xiaoyan, Liu, Jianzhuang, Zhen, Xiantong, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image super-resolution (SR) has attracted increasing attention due to its wide applications. However, current SR methods generally suffer from over-smoothing and artifacts, and most work only with fixed magnifications. This paper introduces an Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution. IDM integrates an implicit neural representation and a denoising diffusion model in a unified end-to-end framework, where the implicit neural representation is adopted in the decoding process to learn continuous-resolution representation. Furthermore, we design a scale-controllable conditioning mechanism that consists of a low-resolution (LR) conditioning network and a scaling factor. The scaling factor regulates the resolution and accordingly modulates the proportion of the LR information and generated features in the final output, which enables the model to accommodate the continuous-resolution requirement. Extensive experiments validate the effectiveness of our IDM and demonstrate its superior performance over prior arts.
Comment: 8 pages, 9 figures, published to CVPR2023
Published: 2023

45. Confidence-driven Bounding Box Localization for Small Object Detection

Author: Sun, Huixin, Zhang, Baochang, Li, Yanjing, and Cao, Xianbin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite advancements in generic object detection, there remains a performance gap in detecting small objects compared to normal-scale objects. We observe for the first time that existing bounding box regression methods tend to produce distorted gradients for small objects and result in less accurate localization. To address this issue, we present a novel Confidence-driven Bounding Box Localization (C-BBL) method to rectify the gradients. C-BBL quantizes continuous labels into grids and formulates two-hot ground truth labels. In prediction, the bounding box head generates a confidence distribution over the grids. Unlike the bounding box regression paradigms in conventional detectors, we introduce a classification-based localization objective through cross entropy between the ground truth and the predicted confidence distribution, generating confidence-driven gradients. Additionally, C-BBL describes an uncertainty loss based on distribution entropy in labels and predictions to further reduce the uncertainty in small object localization. The method is evaluated on multiple detectors using three object detection benchmarks and consistently improves baseline detectors, achieving state-of-the-art performance. We also demonstrate the generalizability of C-BBL to different label systems and its effectiveness for high-resolution detection, which validates its prospect as a general solution.
Published: 2023

46. Resilient Binary Neural Network

Author: Xu, Sheng, Li, Yanjing, Ma, Teli, Lin, Mingbao, Dong, Hao, Zhang, Baochang, Gao, Peng, and Lv, Jinhu
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Binary neural networks (BNNs) have received ever-increasing popularity for their great capability of reducing storage burden as well as quickening inference time. However, there is a severe performance drop compared with real-valued networks, due to the frequent weight oscillation intrinsic to their training. In this paper, we introduce a Resilient Binary Neural Network (ReBNN) to mitigate the frequent oscillation for better BNN training. We identify that the weight oscillation mainly stems from the non-parametric scaling factor. To address this issue, we propose to parameterize the scaling factor and introduce a weighted reconstruction loss to build an adaptive training objective. For the first time, we show that the weight oscillation is controlled by the balanced parameter attached to the reconstruction loss, which provides a theoretical foundation to parameterize it in back propagation. Based on this, we learn our ReBNN by calculating the balanced parameter based on its maximum magnitude, which can effectively mitigate the weight oscillation with a resilient training process. Extensive experiments are conducted upon various network models, such as ResNet and Faster-RCNN for computer vision, as well as BERT for natural language processing. The results demonstrate the overwhelming performance of our ReBNN over prior arts. For example, our ReBNN achieves 66.9% Top-1 accuracy with ResNet-18 backbone on the ImageNet dataset, surpassing existing state-of-the-arts by a significant margin. Our code is open-sourced at https://github.com/SteveTsui/ReBNN.
Comment: AAAI 2023 Oral
Published: 2023
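
Note: The abstract above contrasts a non-parametric scaling factor with a learnable one trained through a weighted reconstruction loss. The sketch below illustrates that contrast under stated assumptions and is not taken from the authors' repository. It assumes the common binarization form w ≈ alpha · sign(w), a closed-form scale equal to the mean absolute weight (as in XNOR-Net-style methods), and a squared-error reconstruction loss weighted by a hypothetical parameter `gamma` standing in for the "balanced parameter" named in the abstract.

```python
def sign(x):
    """Binarization function; zero is mapped to +1 by convention here."""
    return 1.0 if x >= 0 else -1.0

def closed_form_scale(weights):
    """Non-parametric scaling factor: mean absolute value of the weights."""
    return sum(abs(w) for w in weights) / len(weights)

def reconstruction_loss(weights, alpha, gamma):
    """Weighted squared error between real weights and alpha * sign(w).

    gamma is a hypothetical name for the weighting ("balanced") parameter.
    """
    return 0.5 * gamma * sum((w - alpha * sign(w)) ** 2 for w in weights)

def grad_alpha(weights, alpha, gamma):
    """Gradient of the loss above with respect to a learnable alpha."""
    return -gamma * sum(sign(w) * (w - alpha * sign(w)) for w in weights)

if __name__ == "__main__":
    weights = [0.5, -0.25, 0.75, -0.5]
    alpha, gamma, lr = 0.3, 2.0, 0.1
    # One gradient step moves the learnable scale toward the closed-form value.
    alpha -= lr * grad_alpha(weights, alpha, gamma)
    print(alpha, closed_form_scale(weights))
```

In this simplified setting the closed-form scale is exactly the minimizer of the reconstruction loss, while `gamma` only changes how strongly each gradient step pulls a learnable scale toward it. How the paper relates that weighting to weight oscillation is not specified in the abstract.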

47. Feature Calibration Network for Occluded Pedestrian Detection

Author: Zhang, Tianliang, Ye, Qixiang, Zhang, Baochang, Liu, Jianzhuang, Zhang, Xiaopeng, and Tian, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pedestrian detection in the wild remains a challenging problem, especially for scenes containing serious occlusion. In this paper, we propose a novel feature learning method in the deep learning framework, referred to as Feature Calibration Network (FC-Net), to adaptively detect pedestrians under various occlusions. FC-Net is based on the observation that the visible parts of pedestrians are selective and decisive for detection, and is implemented as a self-paced feature learning framework with a self-activation (SA) module and a feature calibration (FC) module. In a new self-activated manner, FC-Net learns features which highlight the visible parts and suppress the occluded parts of pedestrians. The SA module estimates pedestrian activation maps by reusing classifier weights, without any additional parameters involved, therefore resulting in an extremely parsimonious model to reinforce the semantics of features, while the FC module calibrates the convolutional features for adaptive pedestrian representation in both pixel-wise and region-based ways. Experiments on the CityPersons and Caltech datasets demonstrate that FC-Net improves detection performance on occluded pedestrians by up to 10% while maintaining excellent performance on non-occluded instances. Comment: Accepted by IEEE Transactions on Intelligent Transportation Systems (TITS)
Published: 2022
Full Text: View/download PDF

48. CircleNet: Reciprocating Feature Adaptation for Robust Pedestrian Detection

Author: Zhang, Tianliang, Han, Zhenjun, Xu, Huijuan, Zhang, Baochang, and Ye, Qixiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pedestrian detection in the wild remains a challenging problem, especially when the scene contains significant occlusion and/or low resolution of the pedestrians to be detected. Existing methods are unable to adapt to these difficult cases while maintaining acceptable performance. In this paper we propose a novel feature learning model, referred to as CircleNet, to achieve feature adaptation by mimicking how humans look at low-resolution and occluded objects: focusing on the object again, at a finer scale, if it cannot be identified clearly the first time. CircleNet is implemented as a set of feature pyramids and uses weight-sharing path augmentation for better feature fusion. It targets reciprocating feature adaptation and iterative object detection using multiple top-down and bottom-up pathways. To take full advantage of the feature adaptation capability in CircleNet, we design an instance decomposition training strategy to focus on detecting pedestrian instances of various resolutions and different occlusion levels in each cycle. Specifically, CircleNet implements feature ensemble with the idea of hard negative boosting in an end-to-end manner. Experiments on two pedestrian detection datasets, Caltech and CityPersons, show that CircleNet improves the performance on occluded and low-resolution pedestrians by significant margins while maintaining good performance on normal instances. Comment: Accepted by Transactions on Intelligent Transportation Systems (TITS)
Published: 2022
Full Text: View/download PDF

49. XA-Sim2Real: Adaptive Representation Learning for Vessel Segmentation in X-Ray Angiography

Author: Zhang, Baochang, Zhang, Zichen, Liu, Shuting, Faghihroohi, Shahrooz, Schunkert, Heribert, Navab, Nassir, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF

50. Multi-modal Data Fusion with Missing Data Handling for Mild Cognitive Impairment Progression Prediction

Author: Liu, Shuting, Zhang, Baochang, Zimmer, Veronika A., Rueckert, Daniel, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
Published: 2024
Full Text: View/download PDF
