Author: "Wang, Gaoang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wang, Gaoang"' showing total 207 results

Start Over Author "Wang, Gaoang"

207 results on '"Wang, Gaoang"'

1. Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Author: Li, Bozheng, Liu, Mushui, Wang, Gaoang, and Yu, Yunlong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a novel Temporal Sequence-Aware Model (TSAM) for few-shot action recognition (FSAR), which incorporates a sequential perceiver adapter into the pre-training framework, to integrate both the spatial information and the sequential temporal dynamics into the feature embeddings. Different from the existing fine-tuning approaches that capture temporal information by exploring the relationships among all the frames, our perceiver-based adapter recurrently captures the sequential dynamics alongside the timeline, which could perceive the order change. To obtain the discriminative representations for each class, we extend a textual corpus for each class derived from the large language models (LLMs) and enrich the visual prototypes by integrating the contextual semantic information. Besides, We introduce an unbalanced optimal transport strategy for feature matching that mitigates the impact of class-unrelated features, thereby facilitating more effective decision-making. Experimental results on five FSAR datasets demonstrate that our method set a new benchmark, beating the second-best competitors with large margins., Comment: 9 pages, 6 figures
Published: 2024

2. STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft

Author: Zhao, Zhonghan, Chai, Wenhao, Wang, Xuan, Ma, Ke, Chen, Kewei, Guo, Dongxu, Ye, Tian, Zhang, Yanting, Wang, Hongwei, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world., Comment: CVPR 2024 Embodied AI Workshop
Published: 2024

3. BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

Author: Wang, Wenjie, Lu, Yehao, Zheng, Guangcong, Zhan, Shuigen, Ye, Xiaoqing, Tan, Zichang, Wang, Jingdong, Wang, Gaoang, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range. While previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight, we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically, instead of bringing the image features contained in a frustum point to a single BEV grid, BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance, a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration, BEVSpread achieves comparable inference time as the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that, as a plug-in, BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP in vehicle, pedestrian and cyclist.
Published: 2024

4. CityCraft: A Real Crafter for 3D City Generation

Author: Deng, Jie, Chai, Wenhao, Huang, Junsheng, Zhao, Zhonghan, Huang, Qixuan, Gao, Mingyan, Guo, Jianshu, Hao, Shengyu, Hu, Wenhao, Hwang, Jenq-Neng, Li, Xi, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: City scene generation has gained significant attention in autonomous driving, smart city development, and traffic simulation. It helps enhance infrastructure planning and monitoring solutions. Existing methods have employed a two-stage process involving city layout generation, typically using Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), or Transformers, followed by neural rendering. These techniques often exhibit limited diversity and noticeable artifacts in the rendered city scenes. The rendered scenes lack variety, resembling the training images, resulting in monotonous styles. Additionally, these methods lack planning capabilities, leading to less realistic generated scenes. In this paper, we introduce CityCraft, an innovative framework designed to enhance both the diversity and quality of urban scene generation. Our approach integrates three key stages: initially, a diffusion transformer (DiT) model is deployed to generate diverse and controllable 2D city layouts. Subsequently, a Large Language Model(LLM) is utilized to strategically make land-use plans within these layouts based on user prompts and language guidelines. Based on the generated layout and city plan, we utilize the asset retrieval module and Blender for precise asset placement and scene construction. Furthermore, we contribute two new datasets to the field: 1)CityCraft-OSM dataset including 2D semantic layouts of urban areas, corresponding satellite images, and detailed annotations. 2) CityCraft-Buildings dataset, featuring thousands of diverse, high-quality 3D building assets. CityCraft achieves state-of-the-art performance in generating realistic 3D cities., Comment: 20 pages, 9 figures
Published: 2024

5. S4Fusion: Saliency-aware Selective State Space Model for Infrared Visible Image Fusion

Author: Ma, Haolong, Li, Hui, Cheng, Chunyang, Wang, Gaoang, Song, Xiaoning, and Wu, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: As one of the tasks in Image Fusion, Infrared and Visible Image Fusion aims to integrate complementary information captured by sensors of different modalities into a single image. The Selective State Space Model (SSSM), known for its ability to capture long-range dependencies, has demonstrated its potential in the field of computer vision. However, in image fusion, current methods underestimate the potential of SSSM in capturing the global spatial information of both modalities. This limitation prevents the simultaneous consideration of the global spatial information from both modalities during interaction, leading to a lack of comprehensive perception of salient targets. Consequently, the fusion results tend to bias towards one modality instead of adaptively preserving salient targets. To address this issue, we propose the Saliency-aware Selective State Space Fusion Model (S4Fusion). In our S4Fusion, the designed Cross-Modal Spatial Awareness Module (CMSA) can simultaneously focus on global spatial information from both modalities while facilitating their interaction, thereby comprehensively capturing complementary information. Additionally, S4Fusion leverages a pre-trained network to perceive uncertainty in the fused images. By minimizing this uncertainty, S4Fusion adaptively highlights salient targets from both images. Extensive experiments demonstrate that our approach produces high-quality images and enhances performance in downstream tasks.
Published: 2024

6. FlexiFilm: Long Video Generation with Flexible Conditions

Author: Ouyang, Yichen, Yuan, jianhao, Zhao, Hao, Wang, Gaoang, and zhao, Bo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy-originally designed for image generation-cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. Thus, in this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/, Comment: 9 pages, 9 figures
Published: 2024

7. MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

Author: Song, Enxin, Chai, Wenhao, Ye, Tian, Hwang, Jenq-Neng, Li, Xi, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing methods either employ complex spatial-temporal modules or rely heavily on additional perception models to extract temporal features for video understanding, and they only perform well on short videos. For long videos, the computational complexity and memory costs associated with long-term temporal connections are significantly increased, posing additional challenges.Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges. We lift pre-trained multi-modal large language models for understanding long videos without incorporating additional trainable temporal modules, employing a zero-shot approach. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long video, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method. The code along with the dataset can be accessed via the following https://github.com/rese1f/MovieChat.
Published: 2024

8. Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model

Author: Zhao, Zhonghan, Ma, Ke, Chai, Wenhao, Wang, Xuan, Chen, Kewei, Guo, Dongxu, Zhang, Yanting, Wang, Hongwei, and Wang, Gaoang
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models~(MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate independently by agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4 \times$ - $7.3 \times$ in performance., Comment: arXiv admin note: text overlap with arXiv:2403.08282
Published: 2024

9. VersaT2I: Improving Text-to-Image Models with Versatile Reward

Author: Guo, Jianshu, Chai, Wenhao, Deng, Jie, Huang, Hsiang-Wei, Ye, Tian, Xu, Yichen, Zhang, Jiawei, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any T2I model. We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc. Then, for every quality aspect, we select high-quality images in this aspect generated by the model as the training set to finetune the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which can avoid conflicts between different quality aspects. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria.
Published: 2024

10. Hierarchical Auto-Organizing System for Open-Ended Multi-Agent Navigation

Author: Zhao, Zhonghan, Chen, Kewei, Guo, Dongxu, Chai, Wenhao, Ye, Tian, Zhang, Yanting, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the dynamic and unpredictable open-world setting, navigating complex environments in Minecraft poses significant challenges for multi-agent systems. Agents must interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, crucial for effective multi-agent navigation. Furthermore, processing and integrating multi-modal information (such as visual, textual, and auditory data) is essential for agents to comprehend their goals and navigate the environment successfully and fully. To address this issue, we design the HAS framework to auto-organize groups of LLM-based agents to complete navigation tasks. In our approach, we devise a hierarchical auto-organizing navigation system, which is characterized by 1) a hierarchical system for multi-agent organization, ensuring centralized planning and decentralized execution; 2) an auto-organizing and intra-communication mechanism, enabling dynamic group adjustment under subtasks; 3) a multi-modal information platform, facilitating multi-modal perception to perform the three navigation tasks with one system. To assess organizational behavior, we design a series of navigation tasks in the Minecraft environment, which includes searching and exploring. We aim to develop embodied organizations that push the boundaries of embodied AI, moving it towards a more human-like organizational structure., Comment: ICLR 2024 Workshop on LLM Agents
Published: 2024

11. MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant

Author: Zhan, Chenlu, Lin, Yu, Wang, Gaoang, Wang, Hongwei, and Wu, Jian
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Medical generative models, acknowledged for their high-quality sample generation ability, have accelerated the fast growth of medical applications. However, recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge, constraining medical comprehensive diagnosis. In this paper, we propose MedM2G, a Medical Multi-Modal Generative framework, with the key innovation to align, extract, and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities, we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly, our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal, thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework, our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image, image-to-text, and unified generation of medical modalities (CT, MRI, X-ray). It performs 5 medical generation tasks across 10 datasets, consistently outperforming various state-of-the-art works., Comment: Accepted by CVPR2024
Published: 2024

12. UniDCP: Unifying Multiple Medical Vision-language Tasks via Dynamic Cross-modal Learnable Prompts

Author: Zhan, Chenlu, Zhang, Yufei, Lin, Yu, Wang, Gaoang, and Wang, Hongwei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Medical vision-language pre-training (Med-VLP) models have recently accelerated the fast-growing medical diagnostics application. However, most Med-VLP models learn task-specific representations independently from scratch, thereby leading to great inflexibility when they work across multiple fine-tuning tasks. In this work, we propose UniDCP, a Unified medical vision-language model with Dynamic Cross-modal learnable Prompts, which can be plastically applied to multiple medical vision-language tasks. Specifically, we explicitly construct a unified framework to harmonize diverse inputs from multiple pretraining tasks by leveraging cross-modal prompts for unification, which accordingly can accommodate heterogeneous medical fine-tuning tasks. Furthermore, we conceive a dynamic cross-modal prompt optimizing strategy that optimizes the prompts within the shareable space for implicitly processing the shareable clinic knowledge. UniDCP is the first Med-VLP model capable of performing all 8 medical uni-modal and cross-modal tasks over 14 corresponding datasets, consistently yielding superior results over diverse state-of-the-art methods.
Published: 2023

13. User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning

Author: Wang, Xuan, Wang, Guanhong, Chai, Wenhao, Zhou, Jiayu, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image captioning bridges the gap between vision and language by automatically generating natural language descriptions for images. Traditional image captioning methods often overlook the preferences and characteristics of users. Personalized image captioning solves this problem by incorporating user prior knowledge into the model, such as writing styles and preferred vocabularies. Most existing methods emphasize the user context fusion process by memory networks or transformers. However, these methods ignore the distinct domains of each dataset. Therefore, they need to update the entire caption model parameters when meeting new samples, which is time-consuming and calculation-intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to consider personality factors. Additionally, our framework utilizes the prefix-tuning paradigm to extract knowledge from a frozen large language model, reducing the gap between different language domains. Specifically, we employ CLIP to extract the visual features of an image and align the semantic space using a query-guided mapping network. By incorporating the transformer layer, we merge the visual features with the user's contextual prior knowledge to generate informative prefixes. Moreover, we employ GPT-2 as the frozen large language model. With a small number of parameters to be trained, our model performs efficiently and effectively. Our model outperforms existing baseline models on Instagram and YFCC100M datasets across five evaluation metrics, demonstrating its superiority, including twofold improvements in metrics such as BLEU-4 and CIDEr.
Published: 2023

14. CityGen: Infinite and Controllable 3D City Layout Generation

Author: Deng, Jie, Chai, Wenhao, Guo, Jianshu, Huang, Qixuan, Hu, Wenhao, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: City layout generation has recently gained significant attention. The goal of this task is to automatically generate the layout of a city scene, including elements such as roads, buildings, vegetation, as well as other urban infrastructures. Previous methods using VAEs or GANs for 3D city layout generation offer limited diversity and constrained interactivity, only allowing users to selectively regenerate parts of the layout, which greatly limits customization. In this paper, we propose CityGen, a novel end-to-end framework for infinite, diverse and controllable 3D city layout generation.First, we propose an outpainting pipeline to extend the local layout to an infinite city layout. Then, we utilize a multi-scale diffusion model to generate diverse and controllable local semantic layout patches. The extensive experiments show that CityGen achieves state-of-the-art (SOTA) performance under FID and KID in generating an infinite and controllable 3D city layout. CityGen demonstrates promising applicability in fields like smart cities, urban planning, and digital simulation., Comment: 12 pages, 9 figures
Published: 2023

15. See and Think: Embodied Agent in Virtual Environment

Author: Zhao, Zhonghan, Chai, Wenhao, Wang, Xuan, Boyi, Li, Hao, Shengyu, Cao, Shidong, Ye, Tian, and Wang, Gaoang
Subjects: Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have achieved impressive pro-gress on several open-world tasks. Recently, using LLMs to build embodied agents has been a hotspot. This paper proposes STEVE, a comprehensive and visionary embodied agent in the Minecraft virtual environment. STEVE comprises three key components: vision perception, language instruction, and code action. Vision perception involves interpreting visual information in the environment, which is then integrated into the LLMs component with agent state and task instruction. Language instruction is responsible for iterative reasoning and decomposing complex tasks into manageable guidelines. Code action generates executable skill actions based on retrieval in skill database, enabling the agent to interact effectively within the Minecraft environment. We also collect STEVE-21K dataset, which includes 600+ vision-environment pairs, 20K knowledge question-answering pairs, and 200+ skill-code pairs. We conduct continuous block search, knowledge question and answering, and tech tree mastery to evaluate the performance. Extensive experiments show that STEVE achieves at most 1.5x faster unlocking key tech trees and 2.5x quicker in block search tasks., Comment: ECCV 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/STEVE/
Published: 2023

16. Vision meets mmWave Radar: 3D Object Perception Benchmark for Autonomous Driving

Author: Wang, Yizhou, Cheng, Jen-Hao, Huang, Jui-Te, Kuan, Sheng-Yao, Fu, Qiqian, Ni, Chiming, Hao, Shengyu, Wang, Gaoang, Xing, Guanbin, Liu, Hui, and Hwang, Jenq-Neng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Signal Processing
Abstract: Sensor fusion is crucial for an accurate and robust perception system on autonomous vehicles. Most existing datasets and perception solutions focus on fusing cameras and LiDAR. However, the collaboration between camera and radar is significantly under-exploited. The incorporation of rich semantic information from the camera, and reliable 3D information from the radar can potentially achieve an efficient, cheap, and portable solution for 3D object perception tasks. It can also be robust to different lighting or all-weather driving scenarios due to the capability of mmWave radars. In this paper, we introduce the CRUW3D dataset, including 66K synchronized and well-calibrated camera, radar, and LiDAR frames in various driving scenarios. Unlike other large-scale autonomous driving datasets, our radar data is in the format of radio frequency (RF) tensors that contain not only 3D location information but also spatio-temporal semantic information. This kind of radar format can enable machine learning models to generate more reliable object perception results after interacting and fusing the information or features between the camera and radar.
Published: 2023

17. Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning

Author: Zhang, Zhenyu, Wang, Benlu, Liang, Weijie, Li, Yizhi, Guo, Xuechen, Wang, Guanhong, Li, Shiyan, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: With the development of multimodality and large language models, the deep learning-based technique for medical image captioning holds the potential to offer valuable diagnostic recommendations. However, current generic text and image pre-trained models do not yield satisfactory results when it comes to describing intricate details within medical images. In this paper, we present a novel medical image captioning method guided by the segment anything model (SAM) to enable enhanced encoding with both general and detailed feature extraction. In addition, our approach employs a distinctive pre-training strategy with mixed semantic learning to simultaneously capture both the overall information and finer details within medical images. We demonstrate the effectiveness of this approach, as it outperforms the pre-trained BLIP2 model on various evaluation metrics for generating descriptions of medical images.
Published: 2023

18. DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes

Author: Hao, Shengyu, Liu, Peiyuan, Zhan, Yibing, Jin, Kaixun, Liu, Zuozhu, Song, Mingli, Hwang, Jenq-Neng, and Wang, Gaoang
Published: 2024
Full Text: View/download PDF

19. Devil in the Number: Towards Robust Multi-modality Data Filter

Author: Xu, Yichen, Xu, Zihan, Chai, Wenhao, Zhao, Zhonghan, Song, Enxin, and Wang, Gaoang
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs. For instance, LAION papers employs the CLIP score filter to select data with CLIP scores surpassing a certain threshold. On the other hand, T-MARS achieves high-quality data filtering by detecting and masking text within images and then filtering by CLIP score. Through analyzing the dataset, we observe a significant proportion of redundant information, such as numbers, present in the textual content. Our experiments on a subset of the data unveil the profound impact of these redundant elements on the CLIP scores. A logical approach would involve reevaluating the CLIP scores after eliminating these influences. Experimentally, our text-based CLIP filter outperforms the top-ranked method on the ``small scale" of DataComp (a data filtering benchmark) on ImageNet distribution shifts, achieving a 3.6% performance improvement. The results also demonstrate that our proposed text-masked filter outperforms the original CLIP score filter when selecting the top 40% of the data. The impact of numbers on CLIP and their handling provide valuable insights for improving the effectiveness of CLIP training, including language rewrite techniques., Comment: ICCV 2023 Workshop: TNGCV-DataComp
Published: 2023

20. FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector

Author: Fu, Qiqian, Wang, Guanhong, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present frame reconstruction model: FrameRS. It consists self-supervised video frame reconstructor and key frame selector. The frame reconstructor, FrameMAE, is developed by adapting the principles of the Masked Autoencoder for Images (MAE) for video context. The key frame selector, Frame Selector, is built on CNN architecture. By taking the high-level semantic information from the encoder of FrameMAE as its input, it can predicted the key frames with low computation costs. Integrated with our bespoke Frame Selector, FrameMAE can effectively compress a video clip by retaining approximately 30% of its pivotal frames. Performance-wise, our model showcases computational efficiency and competitive accuracy, marking a notable improvement over traditional Key Frame Extract algorithms. The implementation is available on Github
Published: 2023

21. Chasing Consistency in Text-to-3D Generation from a Single Image

Author: Ouyang, Yichen, Chai, Wenhao, Ye, Jiayi, Tao, Dapeng, Zhan, Yibing, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-3D generation from a single-view image is a popular but challenging task in 3D vision. Although numerous methods have been proposed, existing works still suffer from the inconsistency issues, including 1) semantic inconsistency, 2) geometric inconsistency, and 3) saturation inconsistency, resulting in distorted, overfitted, and over-saturated generations. In light of the above issues, we present Consist3D, a three-stage framework Chasing for semantic-, geometric-, and saturation-Consistent Text-to-3D generation from a single image, in which the first two stages aim to learn parameterized consistency tokens, and the last stage is for optimization. Specifically, the semantic encoding stage learns a token independent of views and estimations, promoting semantic consistency and robustness. Meanwhile, the geometric encoding stage learns another token with comprehensive geometry and reconstruction constraints under novel-view estimations, reducing overfitting and encouraging geometric consistency. Finally, the optimization stage benefits from the semantic and geometric tokens, allowing a low classifier-free guidance scale and therefore preventing oversaturation. Experimental results demonstrate that Consist3D produces more consistent, faithful, and photo-realistic 3D assets compared to previous state-of-the-art methods. Furthermore, Consist3D also allows background and object editing through text prompts., Comment: 9 pages, 11 figures
Published: 2023

22. Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection

Author: Yang, Longrong, Zhou, Xianpan, Li, Xuewei, Qiao, Liang, Li, Zheyang, Yang, Ziwei, Wang, Gaoang, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Knowledge distillation (KD) has shown potential for learning compact models in dense object detection. However, the commonly used softmax-based distillation ignores the absolute classification scores for individual categories. Thus, the optimum of the distillation loss does not necessarily lead to the optimal student classification scores for dense object detectors. This cross-task protocol inconsistency is critical, especially for dense object detectors, since the foreground categories are extremely imbalanced. To address the issue of protocol differences between distillation and classification, we propose a novel distillation method with cross-task consistent protocols, tailored for the dense object detection. For classification distillation, we address the cross-task protocol inconsistency problem by formulating the classification logit maps in both teacher and student models as multiple binary-classification maps and applying a binary-classification distillation loss to each map. For localization distillation, we design an IoU-based Localization Distillation Loss that is free from specific network structures and can be compared with existing localization distillation losses. Our proposed method is simple but effective, and experimental results demonstrate its superiority over existing methods. Code is available at https://github.com/TinyTigerPan/BCKD., Comment: Accepted by ICCV 2023
Published: 2023

23. UniAP: Towards Universal Animal Perception in Vision via Few-shot Learning

Author: Sun, Meiqi, Zhao, Zhonghan, Chai, Wenhao, Luo, Hanjun, Cao, Shidong, Zhang, Yanting, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Animal visual perception is an important technique for automatically monitoring animal health, understanding animal behaviors, and assisting animal-related research. However, it is challenging to design a deep learning-based perception model that can freely adapt to different animals across various perception tasks, due to the varying poses of a large diversity of animals, lacking data on rare species, and the semantic inconsistency of different tasks. We introduce UniAP, a novel Universal Animal Perception model that leverages few-shot learning to enable cross-species perception among various visual tasks. Our proposed model takes support images and labels as prompt guidance for a query image. Images and labels are processed through a Transformer-based encoder and a lightweight label encoder, respectively. Then a matching module is designed for aggregating information between prompt guidance and the query image, followed by a multi-head label decoder to generate outputs for various tasks. By capitalizing on the shared visual characteristics among different animals and tasks, UniAP enables the transfer of knowledge from well-studied species to those with limited labeled data or even unseen species. We demonstrate the effectiveness of UniAP through comprehensive experiments in pose estimation, segmentation, and classification tasks on diverse animal species, showcasing its ability to generalize and adapt to new classes with minimal labeled examples.
Published: 2023

24. PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

Author: Liu, Hanbing, He, Jun-Yan, Cheng, Zhi-Qi, Xiang, Wangmeng, Yang, Qize, Chai, Wenhao, Wang, Gaoang, Bao, Xu, Luo, Bin, Geng, Yifeng, and Xie, Xuansong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Computer Science - Robotics
Abstract: Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to bridge this data disparity gap in target domain. Typically, PoSynDA uses a diffusion-inspired structure to simulate 3D pose distribution in the target domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse pose hypotheses and aligns them with the target domain. To do this, it first utilizes target-specific source augmentation to obtain the target domain distribution data from the source domain by decoupling the scale and position parameters. The process is then further refined through the teacher-student paradigm and low-rank adaptation. With extensive comparison of benchmarks such as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance, even comparable to the target-trained MixSTE model\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation in unseen domains. The code is available at https://github.com/hbing-l/PoSynDA., Comment: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA
Published: 2023

25. StableVideo: Text-driven Consistency-aware Diffusion Video Editing

Author: Chai, Wenhao, Guo, Xun, Wang, Gaoang, and Lu, Yan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time. This prevents diffusion models from being applied to natural video editing in practical scenarios. In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. Our code is available at \href{https://github.com/rese1f/StableVideo}{this https URL}., Comment: ICCV 2023
Published: 2023

26. MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Author: Song, Enxin, Chai, Wenhao, Wang, Guanhong, Zhang, Yucheng, Zhou, Haoyang, Wu, Feiyang, Chi, Haozhe, Guo, Xun, Ye, Tian, Zhang, Yanting, Lu, Yan, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet, existing systems can only handle videos with very few frames. For long videos, the computation complexity, memory cost, and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism, we propose the MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long video and 14K manual annotations for validation of the effectiveness of our method., Comment: CVPR 2024. First three authors contribute equally to this work. Project Website https://rese1f.github.io/MovieChat/
Published: 2023

27. A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

Author: Zhao, Zhonghan, Chai, Wenhao, Hao, Shengyu, Hu, Wenhao, Wang, Guanhong, Cao, Shidong, Song, Mingli, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. Firstly, we discuss the hierarchical structure of deep learning algorithms in sports performance which includes perception, comprehension and decision while comparing their strengths and weaknesses. Secondly, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.
Published: 2023

28. MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Author: Zhang, Zhenyu, Chai, Wenhao, Jiang, Zhongyu, Ye, Tian, Song, Mingli, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Estimating 3D human poses only from a 2D human pose sequence is thoroughly explored in recent years. Yet, prior to this, no such work has attempted to unify 2D and 3D pose representations in the shared feature space. In this paper, we propose \mpm, a unified 2D-3D human pose representation framework via masked pose modeling. We treat 2D and 3D poses as two different modalities like vision and language and build a single-stream transformer-based architecture. We apply two pretext tasks, which are masked 2D pose modeling, and masked 3D pose modeling to pre-train our network and use full-supervision to perform further fine-tuning. A high masking ratio of $71.8~\%$ in total with a spatio-temporal mask sampling strategy leads to better relation modeling both in spatial and temporal domains. \mpm~can handle multiple tasks including 3D human pose estimation, 3D pose estimation from occluded 2D pose, and 3D pose completion in a \textbf{single} framework. We conduct extensive experiments and ablation studies on several widely used human pose datasets and achieve state-of-the-art performance on MPI-INF-3DHP., Comment: Accepted by PRCV2024
Published: 2023

29. Language Adaptive Weight Generation for Multi-task Visual Grounding

Author: Su, Wei, Miao, Peihan, Dou, Huanzhang, Wang, Gaoang, Qiao, Liang, Li, Zheyang, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although the impressive performance in visual grounding, the prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights without expression-related hints. The passive perception may lead to mismatches (e.g., redundant and missing), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features since the expressions already provide the blueprint of desired visual features. The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted from the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can be competent in referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance., Comment: Accepted by CVPR2023
Published: 2023

30. SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

Author: Li, Xuewei, Wu, Tao, Qi, Zhongang, Wang, Gaoang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of original $360^{\circ}$ data. Therefore, their performance will drop a lot when inputting panoramic images with the 3D disturbance. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which takes input images with 3D disturbance into account, adds a spherical geometry-aware constraint on the existing deformable patch embedding, and indicates the pixel density of original $360^{\circ}$ data, respectively. Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS., Comment: Accepted by IJCAI 2023
Published: 2023

31. User-Aware Prefix-Tuning Is a Good Learner for Personalized Image Captioning

Author: Wang, Xuan, Wang, Guanhong, Chai, Wenhao, Zhou, Jiayu, Wang, Gaoang, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Qingshan, editor, Wang, Hanzi, editor, Ma, Zhanyu, editor, Zheng, Weishi, editor, Zha, Hongbin, editor, Chen, Xilin, editor, Wang, Liang, editor, and Ji, Rongrong, editor
Published: 2024
Full Text: View/download PDF

32. Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation

Author: Chai, Wenhao, Jiang, Zhongyu, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: When applying a pre-trained 2D-to-3D human pose lifting model to a target unseen dataset, large performance degradation is commonly encountered due to domain shift issues. We observe that the degradation is caused by two factors: 1) the large distribution gap over global positions of poses between the source and target datasets due to variant camera parameters and settings, and 2) the deficient diversity of local structures of poses in training. To this end, we combine \textbf{global adaptation} and \textbf{local generalization} in \textit{PoseDA}, a simple yet effective framework of unsupervised domain adaptation for 3D human pose estimation. Specifically, global adaptation aims to align global positions of poses from the source domain to the target domain with a proposed global position alignment (GPA) module. And local generalization is designed to enhance the diversity of 2D-3D pose mapping with a local pose augmentation (LPA) module. These modules bring significant performance improvement without introducing additional learnable parameters. In addition, we propose local pose augmentation (LPA) to enhance the diversity of 3D poses following an adversarial training scheme consisting of 1) a augmentation generator that generates the parameters of pre-defined pose transformations and 2) an anchor discriminator to ensure the reality and quality of the augmented data. Our approach can be applicable to almost all 2D-3D lifting models. \textit{PoseDA} achieves 61.3 mm of MPJPE on MPI-INF-3DHP under a cross-dataset evaluation setup, improving upon the previous state-of-the-art method by 10.2\%., Comment: ICCV 2023
Published: 2023

33. DDMM-Synth: A Denoising Diffusion Model for Cross-modal Medical Image Synthesis with Sparse-view Measurement Embedding

Author: Li, Xiaoyue, Shang, Kai, Wang, Gaoang, and Butala, Mark D.
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Physics - Medical Physics
Abstract: Reducing the radiation dose in computed tomography (CT) is important to mitigate radiation-induced risks. One option is to employ a well-trained model to compensate for incomplete information and map sparse-view measurements to the CT reconstruction. However, reconstruction from sparsely sampled measurements is insufficient to uniquely characterize an object in CT, and a learned prior model may be inadequate for unencountered cases. Medical modal translation from magnetic resonance imaging (MRI) to CT is an alternative but may introduce incorrect information into the synthesized CT images in addition to the fact that there exists no explicit transformation describing their relationship. To address these issues, we propose a novel framework called the denoising diffusion model for medical image synthesis (DDMM-Synth) to close the performance gaps described above. This framework combines an MRI-guided diffusion model with a new CT measurement embedding reverse sampling scheme. Specifically, the null-space content of the one-step denoising result is refined by the MRI-guided data distribution prior, and its range-space component derived from an explicit operator matrix and the sparse-view CT measurements is directly integrated into the inference stage. DDMM-Synth can adjust the projection number of CT a posteriori for a particular clinical application and its modified version can even improve the results significantly for noisy cases. Our results show that DDMM-Synth outperforms other state-of-the-art supervised-learning-based baselines under fair experimental conditions., Comment: llncs.cls v2.20,12 pages with 6 figures
Published: 2023

34. Blind Inpainting with Object-aware Discrimination for Artificial Marker Removal

Author: Guo, Xuechen, Hu, Wenhao, Ni, Chiming, Chai, Wenhao, Li, Shiyan, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Medical images often contain artificial markers added by doctors, which can negatively affect the accuracy of AI-based diagnosis. To address this issue and recover the missing visual contents, inpainting techniques are highly needed. However, existing inpainting methods require manual mask input, limiting their application scenarios. In this paper, we introduce a novel blind inpainting method that automatically completes visual contents without specifying masks for target areas in an image. Our proposed model includes a mask-free reconstruction network and an object-aware discriminator. The reconstruction network consists of two branches that predict the corrupted regions with artificial markers and simultaneously recover the missing visual contents. The object-aware discriminator relies on the powerful recognition capabilities of the dense object detector to ensure that the markers of reconstructed images cannot be detected in any local regions. As a result, the reconstructed image can be close to the clean one as much as possible. Our proposed method is evaluated on different medical image datasets, covering multiple imaging modalities such as ultrasound (US), magnetic resonance imaging (MRI), and electron microscopy (EM), demonstrating that our method is effective and robust against various unknown missing region patterns.
Published: 2023

35. Deep Learning Methods for Small Molecule Drug Discovery: A Survey

Author: Hu, Wenhao, Liu, Yingying, Chen, Xuanyu, Chai, Wenhao, Chen, Hangyue, Wang, Hongwei, and Wang, Gaoang
Subjects: Computer Science - Machine Learning, Quantitative Biology - Biomolecules
Abstract: With the development of computer-assisted techniques, research communities including biochemistry and deep learning have been devoted into the drug discovery field for over a decade. Various applications of deep learning have drawn great attention in drug discovery, such as molecule generation, molecular property prediction, retrosynthesis prediction, and reaction prediction. While most existing surveys only focus on one of the applications, limiting the view of researchers in the community. In this paper, we present a comprehensive review on the aforementioned four aspects, and discuss the relationships among different applications. The latest literature and classical benchmarks are presented for better understanding the development of variety of approaches. We commence by summarizing the molecule representation format in these works, followed by an introduction of recent proposed approaches for each of the four tasks. Furthermore, we review a variety of commonly used datasets and evaluation metrics and compare the performance of deep learning-based models. Finally, we conclude by identifying remaining challenges and discussing the future trend for deep learning methods in drug discovery.
Published: 2023

36. DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes

Author: Hao, Shenghao, Liu, Peiyuan, Zhan, Yibing, Jin, Kaixun, Liu, Zuozhu, Song, Mingli, Hwang, Jenq-Neng, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Cross-view multi-object tracking aims to link objects between frames and camera views with substantial overlaps. Although cross-view multi-object tracking has received increased attention in recent years, existing datasets still have several issues, including 1) missing real-world scenarios, 2) lacking diverse scenes, 3) owning a limited number of tracks, 4) comprising only static cameras, and 5) lacking standard benchmarks, which hinder the investigation and comparison of cross-view tracking methods. To solve the aforementioned issues, we introduce DIVOTrack: a new cross-view multi-object tracking dataset for DIVerse Open scenes with dense tracking pedestrians in realistic and non-experimental environments. Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available. Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT, which learns object detection, single-view association, and cross-view matching with an all-in-one embedding model. Finally, we present a summary of current methodologies and a set of standard benchmarks with our DIVOTrack to provide a fair comparison and conduct a comprehensive analysis of current approaches and our proposed CrossMOT. The dataset and code are available at https://github.com/shengyuhao/DIVOTrack., Comment: Accepted to IJCV 2023
Published: 2023

37. DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models

Author: Cao, Shidong, Chai, Wenhao, Hao, Shengyu, Zhang, Yanting, Chen, Hangyue, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image-based fashion design with AI techniques has attracted increasing attention in recent years. We focus on a new fashion design task, where we aim to transfer a reference appearance image onto a clothing image while preserving the structure of the clothing image. It is a challenging task since there are no reference images available for the newly designed output fashion images. Although diffusion-based image translation or neural style transfer (NST) has enabled flexible style transfer, it is often difficult to maintain the original structure of the image realistically during the reverse diffusion, especially when the referenced appearance image greatly differs from the common clothing appearance. To tackle this issue, we present a novel diffusion model-based unsupervised structure-aware transfer method to semantically generate new clothes from a given clothing image and a reference appearance image. In specific, we decouple the foreground clothing with automatically generated semantic masks by conditioned labels. And the mask is further used as guidance in the denoising process to preserve the structure information. Moreover, we use the pre-trained vision Transformer (ViT) for both appearance and structure guidance. Our experimental results show that the proposed method outperforms state-of-the-art baseline models, generating more realistic images in the fashion design task. Code and demo can be found at https://github.com/Rem105-210/DiffFashion.
Published: 2023

38. STSC-SNN: Spatio-Temporal Synaptic Connection with Temporal Convolution and Attention for Spiking Neural Networks

Author: Yu, Chengting, Gu, Zheming, Li, Da, Wang, Gaoang, Wang, Aili, and Li, Erping
Subjects: Computer Science - Neural and Evolutionary Computing, Quantitative Biology - Neurons and Cognition, Statistics - Machine Learning
Abstract: Spiking Neural Networks (SNNs), as one of the algorithmic models in neuromorphic computing, have gained a great deal of research attention owing to temporal information processing capability, low power consumption, and high biological plausibility. The potential to efficiently extract spatio-temporal features makes it suitable for processing the event streams. However, existing synaptic structures in SNNs are almost full-connections or spatial 2D convolution, neither of which can extract temporal dependencies adequately. In this work, we take inspiration from biological synapses and propose a spatio-temporal synaptic connection SNN (STSC-SNN) model, to enhance the spatio-temporal receptive fields of synaptic connections, thereby establishing temporal dependencies across layers. Concretely, we incorporate temporal convolution and attention mechanisms to implement synaptic filtering and gating functions. We show that endowing synaptic models with temporal dependencies can improve the performance of SNNs on classification tasks. In addition, we investigate the impact of performance vias varied spatial-temporal receptive fields and reevaluate the temporal modules in SNNs. Our approach is tested on neuromorphic datasets, including DVS128 Gesture (gesture recognition), N-MNIST, CIFAR10-DVS (image classification), and SHD (speech digit recognition). The results show that the proposed model outperforms the state-of-the-art accuracy on nearly all datasets.
Published: 2022
Full Text: View/download PDF

39. Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality

Author: Chi, Haozhe, Yang, Minghua, Zhu, Junhao, Wang, Guanhong, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal sentiment analysis (MSA) is an important way of observing mental activities with the help of data captured from multiple modalities. However, due to the recording or transmission error, some modalities may include incomplete data. Most existing works that address missing modalities usually assume a particular modality is completely missing and seldom consider a mixture of missing across multiple modalities. In this paper, we propose a simple yet effective meta-sampling approach for multimodal sentiment analysis with missing modalities, namely Missing Modality-based Meta Sampling (M3S). To be specific, M3S formulates a missing modality sampling strategy into the modal agnostic meta-learning (MAML) framework. M3S can be treated as an efficient add-on training component on existing models and significantly improve their performances on multimodal data with a mixture of missing modalities. We conduct experiments on IEMOCAP, SIMS and CMU-MOSI datasets, and superior performance is achieved compared with recent state-of-the-art methods.
Published: 2022

40. Hierarchical Semi-Supervised Contrastive Learning for Contamination-Resistant Anomaly Detection

Author: Wang, Gaoang, Zhan, Yibing, Wang, Xinchao, Song, Mingli, and Nahrstedt, Klara
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Anomaly detection aims at identifying deviant samples from the normal data distribution. Contrastive learning has provided a successful way to sample representation that enables effective discrimination on anomalies. However, when contaminated with unlabeled abnormal samples in training set under semi-supervised settings, current contrastive-based methods generally 1) ignore the comprehensive relation between training data, leading to suboptimal performance, and 2) require fine-tuning, resulting in low efficiency. To address the above two issues, in this paper, we propose a novel hierarchical semi-supervised contrastive learning (HSCL) framework, for contamination-resistant anomaly detection. Specifically, HSCL hierarchically regulates three complementary relations: sample-to-sample, sample-to-prototype, and normal-to-abnormal relations, enlarging the discrimination between normal and abnormal samples with a comprehensive exploration of the contaminated data. Besides, HSCL is an end-to-end learning approach that can efficiently learn discriminative representations without fine-tuning. HSCL achieves state-of-the-art performance in multiple scenarios, such as one-class classification and cross-dataset detection. Extensive ablation studies further verify the effectiveness of each considered relation. The code is available at https://github.com/GaoangW/HSCL.
Published: 2022

41. Handwritten Chinese signature detection with simple Copy–Paste augmentation on power plants technical documents

Author: Zhang, Ying, Zhang, Jian, Yan, Kaihong, Wang, Hongwei, and Wang, Gaoang
Published: 2023
Full Text: View/download PDF

42. Recent Advances in Embedding Methods for Multi-Object Tracking: A Survey

Author: Wang, Gaoang, Song, Mingli, and Hwang, Jenq-Neng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-object tracking (MOT) aims to associate target objects across video frames in order to obtain entire moving trajectories. With the advancement of deep neural networks and the increasing demand for intelligent video analysis, MOT has gained significantly increased interest in the computer vision community. Embedding methods play an essential role in object location estimation and temporal identity association in MOT. Unlike other computer vision tasks, such as image classification, object detection, re-identification, and segmentation, embedding methods in MOT have large variations, and they have never been systematically analyzed and summarized. In this survey, we first conduct a comprehensive overview with in-depth analysis for embedding methods in MOT from seven different perspectives, including patch-level embedding, single-frame embedding, cross-frame joint embedding, correlation embedding, sequential embedding, tracklet embedding, and cross-track relational embedding. We further summarize the existing widely used MOT datasets and analyze the advantages of existing state-of-the-art methods according to their embedding strategies. Finally, some critical yet under-investigated areas and future research directions are discussed.
Published: 2022

43. Preserve Pre-trained Knowledge: Transfer Learning With Self-Distillation For Action Recognition

Author: Zhou, Yang, He, Zhanhao, Lu, Keyu, Wang, Guanhong, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Video-based action recognition is one of the most popular topics in computer vision. With recent advances of selfsupervised video representation learning approaches, action recognition usually follows a two-stage training framework, i.e., self-supervised pre-training on large-scale unlabeled sets and transfer learning on a downstream labeled set. However, catastrophic forgetting of the pre-trained knowledge becomes the main issue in the downstream transfer learning of action recognition, resulting in a sub-optimal solution. In this paper, to alleviate the above issue, we propose a novel transfer learning approach that combines self-distillation in fine-tuning to preserve knowledge from the pre-trained model learned from the large-scale dataset. Specifically, we fix the encoder from the last epoch as the teacher model to guide the training of the encoder from the current epoch in the transfer learning. With such a simple yet effective learning strategy, we outperform state-of-the-art methods on widely used UCF101 and HMDB51 datasets in action recognition task.
Published: 2022

44. Human-Centered Prior-Guided and Task-Dependent Multi-Task Representation Learning for Action Recognition Pre-Training

Author: Wang, Guanhong, Lu, Keyu, Zhou, Yang, He, Zhanhao, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recently, much progress has been made for self-supervised action recognition. Most existing approaches emphasize the contrastive relations among videos, including appearance and motion consistency. However, two main issues remain for existing pre-training methods: 1) the learned representation is neutral and not informative for a specific task; 2) multi-task learning-based pre-training sometimes leads to sub-optimal solutions due to inconsistent domains of different tasks. To address the above issues, we propose a novel action recognition pre-training framework, which exploits human-centered prior knowledge that generates more informative representation, and avoids the conflict between multiple tasks by using task-dependent representations. Specifically, we distill knowledge from a human parsing model to enrich the semantic capability of representation. In addition, we combine knowledge distillation with contrastive learning to constitute a task-dependent multi-task framework. We achieve state-of-the-art performance on two popular benchmarks for action recognition task, i.e., UCF101 and HMDB51, verifying the effectiveness of our method., Comment: This paper has been accepted by ICME 2022
Published: 2022

45. Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension

Author: Miao, Peihan, Su, Wei, Wang, Gaoang, Li, Xuewei, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: As an important and challenging problem in vision-language tasks, referring expression comprehension (REC) generally requires a large amount of multi-grained information of visual and linguistic modalities to realize accurate reasoning. In addition, due to the diversity of visual scenes and the variation of linguistic expressions, some hard examples have much more abundant multi-grained information than others. How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task. To address aforementioned challenges, in this paper, we propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability through innovations in network structure and learning mechanism. Concretely, we design a transformer-based multi-grained cross-modal attention, which effectively utilizes the inherent multi-grained information in visual and linguistic encoders. Furthermore, considering the large variance of samples, we propose a self-paced sample informativeness learning to adaptively enhance the network learning for samples containing abundant multi-grained information. The proposed framework significantly outperforms state-of-the-art methods on widely used datasets, such as RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame datasets, demonstrating the effectiveness of our method., Comment: Accepted by TIP
Published: 2022

46. MAP-SNN: Mapping Spike Activities with Multiplicity, Adaptability, and Plasticity into Bio-Plausible Spiking Neural Networks

Author: Yu, Chengting, Du, Yangkai, Chen, Mufeng, Wang, Aili, Wang, Gaoang, and Li, Erping
Subjects: Computer Science - Neural and Evolutionary Computing, Quantitative Biology - Neurons and Cognition
Abstract: Spiking Neural Network (SNN) is considered more biologically realistic and power-efficient as it imitates the fundamental mechanism of the human brain. Recently, backpropagation (BP) based SNN learning algorithms that utilize deep learning frameworks have achieved good performance. However, bio-interpretability is partially neglected in those BP-based algorithms. Toward bio-plausible BP-based SNNs, we consider three properties in modeling spike activities: Multiplicity, Adaptability, and Plasticity (MAP). In terms of multiplicity, we propose a Multiple-Spike Pattern (MSP) with multiple spike transmission to strengthen model robustness in discrete time-iteration. To realize adaptability, we adopt Spike Frequency Adaption (SFA) under MSP to decrease spike activities for improved efficiency. For plasticity, we propose a trainable convolutional synapse that models spike response current to enhance the diversity of spiking neurons for temporal feature extraction. The proposed SNN model achieves competitive performances on neuromorphic datasets: N-MNIST and SHD. Furthermore, experimental results demonstrate that the proposed three aspects are significant to iterative robustness, spike efficiency, and temporal feature extraction capability of spike activities. In summary, this work proposes a feasible scheme for bio-inspired spike activities with MAP, offering a new neuromorphic perspective to embed biological characteristics into spiking neural networks.
Published: 2022
Full Text: View/download PDF

47. User-Aware Prefix-Tuning Is a Good Learner for Personalized Image Captioning

Author: Wang, Xuan, primary, Wang, Guanhong, additional, Chai, Wenhao, additional, Zhou, Jiayu, additional, and Wang, Gaoang, additional
Published: 2023
Full Text: View/download PDF

48. Disjoint Contrastive Regression Learning for Multi-Sourced Annotations

Author: Ruan, Xiaoqian and Wang, Gaoang
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale datasets are important for the development of deep learning models. Such datasets usually require a heavy workload of annotations, which are extremely time-consuming and expensive. To accelerate the annotation procedure, multiple annotators may be employed to label different subsets of the data. However, the inconsistency and bias among different annotators are harmful to the model training, especially for qualitative and subjective tasks.To address this challenge, in this paper, we propose a novel contrastive regression framework to address the disjoint annotations problem, where each sample is labeled by only one annotator and multiple annotators work on disjoint subsets of the data. To take account of both the intra-annotator consistency and inter-annotator inconsistency, two strategies are employed.Firstly, a contrastive-based loss is applied to learn the relative ranking among different samples of the same annotator, with the assumption that the ranking of samples from the same annotator is unanimous. Secondly, we apply the gradient reversal layer to learn robust representations that are invariant to different annotators. Experiments on the facial expression prediction task, as well as the image quality assessment task, verify the effectiveness of our proposed framework.
Published: 2021

49. ActiveMatch: End-to-end Semi-supervised Active Representation Learning

Author: Yuan, Xinkai, Li, Zilinghan, and Wang, Gaoang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semi-supervised learning (SSL) is an efficient framework that can train models with both labeled and unlabeled data, but may generate ambiguous and non-distinguishable representations when lacking adequate labeled samples. With human-in-the-loop, active learning can iteratively select informative unlabeled samples for labeling and training to improve the performance in the SSL framework. However, most existing active learning approaches rely on pre-trained features, which is not suitable for end-to-end learning. To deal with the drawbacks of SSL, in this paper, we propose a novel end-to-end representation learning method, namely ActiveMatch, which combines SSL with contrastive learning and active learning to fully leverage the limited labels. Starting from a small amount of labeled data with unsupervised contrastive learning as a warm-up, ActiveMatch then combines SSL and supervised contrastive learning, and actively selects the most representative samples for labeling during the training, resulting in better representations towards the classification. Compared with MixMatch and FixMatch with the same amount of labeled data, we show that ActiveMatch achieves the state-of-the-art performance, with 89.24% accuracy on CIFAR-10 with 100 collected labels, and 92.20% accuracy with 200 collected labels., Comment: 5 pages, 2 figures, accepted by ICIP 2022
Published: 2021

50. Track without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking

Author: Wang, Gaoang, Gu, Renshu, Liu, Zuozhu, Hu, Weijie, Song, Mingli, and Hwang, Jenq-Neng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vehicle tracking is an essential task in the multi-object tracking (MOT) field. A distinct characteristic in vehicle tracking is that the trajectories of vehicles are fairly smooth in both the world coordinate and the image coordinate. Hence, models that capture motion consistencies are of high necessity. However, tracking with the standalone motion-based trackers is quite challenging because targets could get lost easily due to limited information, detection error and occlusion. Leveraging appearance information to assist object re-identification could resolve this challenge to some extent. However, doing so requires extra computation while appearance information is sensitive to occlusion as well. In this paper, we try to explore the significance of motion patterns for vehicle tracking without appearance information. We propose a novel approach that tackles the association issue for long-term tracking with the exclusive fully-exploited motion information. We address the tracklet embedding issue with the proposed reconstruct-to-embed strategy based on deep graph convolutional neural networks (GCN). Comprehensive experiments on the KITTI-car tracking dataset and UA-Detrac dataset show that the proposed method, though without appearance information, could achieve competitive performance with the state-of-the-art (SOTA) trackers. The source code will be available at https://github.com/GaoangW/LGMTracker.
Published: 2021

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

207 results on '"Wang, Gaoang"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources