Author: "Chang, Xiaojun" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Chang, Xiaojun"' showing total 849 results

1. Medical Report Generation Is A Multi-label Classification Problem

Author: Fan, Yijian, Yang, Zhenbang, Liu, Rui, Li, Mingjie, and Chang, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Medical report generation is a critical task in healthcare that involves the automatic creation of detailed and accurate descriptions from medical images. Traditionally, this task has been approached as a sequence generation problem, relying on vision-and-language techniques to generate coherent and contextually relevant reports. However, in this paper, we propose a novel perspective: rethinking medical report generation as a multi-label classification problem. By framing the task this way, we leverage the radiology nodes from the commonly used knowledge graph, which can be better captured through classification techniques. To verify our argument, we introduce a novel report generation framework based on BLIP integrated with classified key nodes, which allows for effective report generation with accurate classification of multiple key aspects within the medical images. This approach not only simplifies the report generation process but also significantly enhances performance metrics. Our extensive experiments demonstrate that leveraging key nodes can achieve state-of-the-art (SOTA) performance, surpassing existing approaches across two benchmark datasets. The results underscore the potential of re-envisioning traditional tasks with innovative methodologies, paving the way for more efficient and accurate medical report generation.
Comment: Accepted to 2024 IEEE International Conference on Medical Artificial Intelligence
Published: 2024

2. Normalized solutions of $L^2$-supercritical Kirchhoff equations in bounded domains

Author: Wang, Qun and Chang, Xiaojun
Subjects: Mathematics - Analysis of PDEs, 35J60, 35B09, 35B40, 47J30
Abstract: In this paper, we investigate the existence of normalized solutions for the following nonlinear Kirchhoff type problem \begin{equation*} \begin{cases} -(a+b\int_{\Omega}\vert\nabla u\vert^2dx)\Delta u+\lambda u=\vert u\vert^{p-2}u & \text{ in }\Omega,\\ u=0 & \text{ on }\partial\Omega \end{cases} \end{equation*} subject to the constraint $\int_{\Omega}\vert u\vert^2dx=c$. Here, $a$ and $b$ are positive constants, $\Omega$ is a smooth bounded domain in $\mathbb{R}^N$ with $1\leq N\leq3$, $c>0$ is a prescribed value, and $\lambda\in \mathbb{R}$ is a Lagrange multiplier. In the $L^2$-supercritical regime $2+\frac{8}{N}
Published: 2024

3. RealCustom++: Representing Images as Real-Word for Real-Time Customization

Author: Mao, Zhendong, Huang, Mengqi, Ding, Fei, Liu, Mingcong, He, Qian, Chang, Xiaojun, and Zhang, Yongdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image customization, which takes given texts and images depicting given subjects as inputs, aims to synthesize new images that align with both text semantics and subject appearance. This task provides precise control over details that text alone cannot capture and is fundamental for various real-world applications, garnering significant interest from academia and industry. Existing works follow the pseudo-word paradigm, which involves representing given subjects as pseudo-words and combining them with given texts to collectively guide the generation. However, the inherent conflict and entanglement between the pseudo-words and texts result in a dual-optimum paradox, where subject similarity and text controllability cannot be optimal simultaneously. We propose a novel real-words paradigm termed RealCustom++ that instead represents subjects as non-conflict real words, thereby disentangling subject similarity from text controllability and allowing both to be optimized simultaneously. Specifically, RealCustom++ introduces a novel "train-inference" decoupled framework: (1) During training, RealCustom++ learns the alignment between vision conditions and all real words in the text, ensuring high subject-similarity generation in open domains. This is achieved by the cross-layer cross-scale projector to robustly and finely extract subject features, and a curriculum training recipe that adapts the generated subject to diverse poses and sizes. (2) During inference, leveraging the learned general alignment, an adaptive mask guidance is proposed to only customize the generation of the specific target real word, keeping other subject-irrelevant regions uncontaminated to ensure high text-controllability in real-time.
Comment: 23 pages
Published: 2024

4. Disentangled Noisy Correspondence Learning

Author: Dang, Zhuohang, Luo, Minnan, Wang, Jihong, Jia, Chengyou, Han, Haochen, Wan, Herun, Dai, Guang, Chang, Xiaojun, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Cross-modal retrieval is crucial in understanding latent correspondences across modalities. However, existing methods implicitly assume well-matched training data, which is impractical as real-world data inevitably involves imperfect alignments, i.e., noisy correspondences. Although some works explore similarity-based strategies to address such noise, they suffer from sub-optimal similarity predictions influenced by modality-exclusive information (MEI), e.g., background noise in images and abstract definitions in texts. This issue arises as MEI is not shared across modalities, thus aligning it in training can markedly mislead similarity predictions. Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy. Inspired by the robustness of information bottlenecks against noise, we introduce DisNCL, a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning, to adaptively balance the extraction of MII and MEI with certifiable optimal cross-modal disentanglement efficacy. DisNCL then enhances similarity predictions in modality-invariant subspace, thereby greatly boosting similarity-based alleviation strategy for noisy correspondences. Furthermore, DisNCL introduces soft matching targets to model noisy many-to-many relationships inherent in multi-modal input for noise-robust and accurate cross-modal alignment. Extensive experiments confirm DisNCL's efficacy by 2% average recall improvement. Mutual information estimation and visualization results show that DisNCL learns meaningful MII/MEI subspaces, validating our theoretical analyses.
Published: 2024

5. Contrastive Learning with Counterfactual Explanations for Radiology Report Generation

Author: Li, Mingjie, Lin, Haokun, Qiu, Liang, Liang, Xiaodan, Chen, Ling, Elsaddik, Abdulmotaleb, and Chang, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the common content of anatomy, radiology images with their corresponding reports exhibit high similarity. Such inherent data bias can predispose automatic report generation models to learn entangled and spurious representations resulting in misdiagnostic reports. To tackle these issues, we propose a novel CounterFactual Explanations-based framework (CoFE) for radiology report generation. Counterfactual explanations serve as a potent tool for understanding how decisions made by algorithms can be changed by asking "what if" scenarios. By leveraging this concept, CoFE can learn non-spurious visual representations by contrasting the representations between factual and counterfactual images. Specifically, we derive counterfactual images by swapping a patch between positive and negative samples until a predicted diagnosis shift occurs. Here, positive and negative samples are the most semantically similar but have different diagnosis labels. Additionally, CoFE employs a learnable prompt to efficiently fine-tune the pre-trained large language model, encapsulating both factual and counterfactual content to provide a more generalizable prompt representation. Extensive experiments on two benchmarks demonstrate that leveraging the counterfactual explanations enables CoFE to generate semantically coherent and factually complete reports and outperform in terms of language generation and clinical efficacy metrics.
Comment: ECCV 2024
Published: 2024

6. Label-anticipated Event Disentanglement for Audio-Visual Video Parsing

Author: Zhou, Jinxing, Guo, Dan, Mao, Yuxin, Zhong, Yiran, Chang, Xiaojun, and Wang, Meng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The Audio-Visual Video Parsing (AVVP) task aims to detect and temporally locate events within audio and visual modalities. Multiple events can overlap in the timeline, making identification challenging. While traditional methods usually focus on improving the early audio-visual encoders to embed more effective features, the decoding phase, crucial for final event classification, often receives less attention. We aim to advance the decoding phase and improve its interpretability. Specifically, we introduce a new decoding paradigm, label semantic-based projection (LEAP), that employs label texts of event categories, each bearing distinct and explicit semantics, for parsing potentially overlapping events. LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings. This process, enriched by modeling cross-modal (audio/visual-label) interactions, gradually disentangles event semantics within video segments to refine relevant label embeddings, guaranteeing a more discriminative and interpretable decoding process. To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function. This function leverages the Intersection over Union of audio and visual events (EIoU) as a novel metric to calibrate audio-visual similarities at the feature level, accommodating the varied event densities across modalities. Extensive experiments demonstrate the superiority of our method, achieving new state-of-the-art performance for AVVP and also enhancing the relevant audio-visual event localization task.
Comment: Accepted by ECCV2024
Published: 2024

7. Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

Author: Yi, Junfei, Mao, Jianxu, Liu, Tengfei, Li, Mingjie, Gu, Hanyu, Zhang, Hui, Chang, Xiaojun, and Wang, Yaonan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Knowledge distillation (KD) is a widely adopted and effective method for compressing models in object detection tasks. Particularly, feature-based distillation methods have shown remarkable performance. Existing approaches often ignore the uncertainty in the teacher model's knowledge, which stems from data noise and imperfect training. This limits the student model's ability to learn latent knowledge, as it may overly rely on the teacher's imperfect guidance. In this paper, we propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection, termed "Uncertainty Estimation-Discriminative Knowledge Extraction-Knowledge Transfer (UET)", which can seamlessly integrate with existing distillation methods. By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model, facilitating deeper exploration of latent knowledge. Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources. Extensive experiments validate the effectiveness of our proposed approach across various distillation strategies, detectors, and backbone architectures. Specifically, following our proposed paradigm, the existing FGD method achieves state-of-the-art (SoTA) performance, with ResNet50-based GFL achieving 44.1% mAP on the COCO dataset, surpassing the baselines by 3.9%.
Published: 2024

8. Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification

Author: Huang, Gexin, Wu, Chenfei, Li, Mingjie, Chang, Xiaojun, Chen, Ling, Sun, Ying, Zhao, Shen, Liang, Xiaodan, and Lin, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Predicting genetic mutations from whole slide images is indispensable for cancer diagnosis. However, existing work training multiple binary classification models faces two challenges: (a) Training multiple binary classifiers is inefficient and would inevitably lead to a class imbalance problem. (b) The biological relationships among genes are overlooked, which limits the prediction performance. To tackle these challenges, we innovatively design a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules: (a) A gene graph whose node features are the genes' linguistic descriptions and the cancer phenotype, with edges modeled by genes' pathway associations and mutation consistencies. (b) A knowledge association module that fuses linguistic and biomedical knowledge into gene priors by transformer-based graph representation learning, capturing the intrinsic relationships between different genes' mutations. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules: (a) A modality fusion module that firstly fuses the gene priors with critical regions in WSIs and obtains gene-wise mutation logits. (b) A comparative multi-label loss that emphasizes the inherent comparisons among mutation status to enhance the discrimination capabilities. Sufficient experiments on The Cancer Genome Atlas benchmark demonstrate that BPGT outperforms the state-of-the-art.
Comment: 16 pages, 8 figures, and 3 tables
Published: 2024

9. MLP Can Be A Good Transformer Learner

Author: Lin, Sihao, Lyu, Pumeng, Liu, Dongrui, Tang, Tao, Liang, Xiaodan, Song, Andy, and Chang, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The self-attention mechanism is key to the Transformer but is often criticized for its computational demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and incur the same memory costs. This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers, guided by entropy considerations. We identify that, for the attention layers in bottom blocks, their subsequent MLP layers, i.e., two feed-forward layers, can elicit the same entropy quantity. Meanwhile, the accompanying MLPs are under-exploited since they exhibit smaller feature entropy compared to those MLPs in the top blocks. Therefore, we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into an identity mapping, yielding only MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% of the attention layers of DeiT-B, improving throughput and memory bound without performance compromise. Code is available at https://github.com/sihaoevery/lambda_vit.
Comment: efficient transformer
Published: 2024

10. LongVLM: Efficient Long Video Understanding via Large Language Models

Author: Weng, Yuetian, Han, Mingfei, He, Haoyu, Chang, Xiaojun, and Zhuang, Bohan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.
Comment: Accepted by ECCV 2024
Published: 2024

11. Self-Supervised Multi-Frame Neural Scene Flow

Author: Liu, Dongrui, Liu, Daqi, Li, Xueqian, Lin, Sihao, Xie, Hongwei, Wang, Bing, Chang, Xiaojun, and Chu, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Neural Scene Flow Prior (NSFP) and Fast Neural Scene Flow (FNSF) have shown remarkable adaptability in the context of large out-of-distribution autonomous driving. Despite their success, the underlying reasons for their astonishing generalization capabilities remain unclear. Our research addresses this gap by examining the generalization capabilities of NSFP through the lens of uniform stability, revealing that its performance is inversely proportional to the number of input point clouds. This finding sheds light on NSFP's effectiveness in handling large-scale point cloud scene flow estimation tasks. Motivated by such theoretical insights, we further explore the improvement of scene flow estimation by leveraging historical point clouds across multiple frames, which inherently increases the number of point clouds. Consequently, we propose a simple and effective method for multi-frame point cloud scene flow estimation, along with a theoretical evaluation of its generalization abilities. Our analysis confirms that the proposed method maintains a limited generalization error, suggesting that adding multiple frames to the scene flow optimization process does not detract from its generalizability. Extensive experimental results on large-scale autonomous driving Waymo Open and Argoverse lidar datasets demonstrate that the proposed method achieves state-of-the-art performance.
Published: 2024

12. Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding

Author: Hu, Jingjing, Guo, Dan, Li, Kun, Si, Zhan, Yang, Xun, Chang, Xiaojun, and Wang, Meng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by the activity-silent and persistent activity mechanisms in human visual perception biology, we design a Unified Static and Dynamic Network (UniSDNet), to learn the semantic association between the video and text/audio queries in a cross-modal environment for efficient video grounding. For static modeling, we devise a novel residual structure (ResMLP) to boost the global comprehensive interaction between the video segments and queries, achieving more effective semantic enhancement/supplement. For dynamic modeling, we effectively exploit three characteristics of the persistent activity mechanism in our network design for a better video context comprehension. Specifically, we construct a diffusely connected video clip graph on the basis of 2D sparse temporal masking to reflect the "short-term effect" relationship. We innovatively consider the temporal distance and relevance as the joint "auxiliary evidence clues" and design a multi-kernel Temporal Gaussian Filter to expand the context clue into high-dimensional space, simulating the "complex visual perception", and then conduct element level filtering convolution operations on neighbour clip nodes in message passing stage for finally generating and ranking the candidate proposals. Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks. Our UniSDNet achieves SOTA performance on three widely used datasets for NLVG, as well as three datasets for SLVG, e.g., reporting new records at 38.88% R@1,IoU@0.7 on ActivityNet Captions and 40.26% R@1,IoU@0.5 on TACoS. To facilitate this field, we collect two new datasets (Charades-STA Speech and TACoS Speech) for SLVG task. Meanwhile, the inference speed of our UniSDNet is 1.56$\times$ faster than the strong multi-query benchmark. Code is available at: https://github.com/xian-sh/UniSDNet.
Published: 2024
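
Note on entry 12: the figures quoted there (for example R@1, IoU@0.7) use the standard temporal-grounding convention: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. The following is a minimal sketch of that metric in plain Python. It is not the UniSDNet code; the function names and the sample segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """IoU of two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Hypothetical example: two queries, one good prediction and one miss.
preds = [(0.0, 10.0), (2.0, 8.0)]
truth = [(0.0, 9.0), (20.0, 30.0)]
score = recall_at_1(preds, truth, threshold=0.7)
```

Here the first pair overlaps with IoU 0.9 and the second does not overlap at all, so half of the queries count as hits at the 0.7 threshold.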

13. NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning

Author: Lin, Bingqian, Nie, Yunshuang, Wei, Ziming, Chen, Jiaqi, Ma, Shikui, Han, Jianhua, Xu, Hang, Chang, Xiaojun, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Robotics
Abstract: Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions. Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability. However, their predominant use in an offline manner usually suffers from substantial domain gap between the VLN task and the LLM training corpus. This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision, leading to a significant mitigation of the domain gap in a cost-effective manner. Specifically, at each timestep, the LLM is prompted to forecast the navigational chain-of-thought by: 1) acting as a world model to imagine the next observation according to the instruction, 2) selecting the candidate observation that best aligns with the imagination, and 3) determining the action based on the reasoning from the prior steps. Through constructing formalized labels for training, the LLM can learn to generate desired and reasonable chain-of-thought outputs for improving the action decision. Experimental results across various training settings and popular VLN benchmarks (e.g., Room-to-Room (R2R), Room-across-Room (RxR), Room-for-Room (R4R)) show the significant superiority of NavCoT over the direct action prediction variants. Through simple parameter-efficient finetuning, our NavCoT outperforms a recent GPT4-based approach with ~7% relative improvement on the R2R dataset. We believe that NavCoT will help unlock more task-adaptive and scalable LLM-based embodied agents, which are helpful for developing real-world robotics applications. Code is available at https://github.com/expectorlin/NavCoT.
Published: 2024

14. SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS

Author: Peng, Yameng, Song, Andy, Fayek, Haytham M., Ciesielski, Vic, and Chang, Xiaojun
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Neural and Evolutionary Computing
Abstract: Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its derivative, SWAP-Score, a novel high-performance training-free metric. It measures the expressivity of networks over a batch of input samples. The SWAP-Score is strongly correlated with ground-truth performance across various search spaces and tasks, outperforming 15 existing training-free metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be further enhanced by regularisation, which leads to even higher correlations in cell-based search space and enables model size control during the search. For example, Spearman's rank correlation coefficient between regularised SWAP-Score and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90, significantly higher than 0.80 from the second-best metric, NWOT. When integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and 9 minutes of GPU time respectively.
Comment: ICLR2024 Spotlight
Published: 2024
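
Note on entry 14: the abstract evaluates a training-free metric by Spearman's rank correlation between the metric's scores and validation accuracies. The following is a minimal sketch of how that coefficient can be computed in plain Python: it is the Pearson correlation of the two rank vectors, with tied values sharing an average rank. It is not the SWAP-NAS code; the function names and the sample numbers are illustrative assumptions.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical proxy scores and accuracies for four architectures.
proxy_scores = [0.2, 0.5, 0.1, 0.9]
accuracies = [70.0, 75.0, 72.0, 90.0]
rho = spearman(proxy_scores, accuracies)  # 0.8 for these made-up numbers
```

A value close to 1 means the proxy orders architectures almost the same way their measured accuracies do, which is the property a zero-cost proxy needs for search.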

15. DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions

Author: Wang, Guangrun, Li, Changlin, Yuan, Liuchun, Peng, Jiefeng, Xian, Xiaoyu, Liang, Xiaodan, Chang, Xiaojun, and Lin, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Neural Architecture Search (NAS), aiming at automatically designing neural architectures by machines, has been considered a key step toward automatic machine learning. One notable NAS branch is the weight-sharing NAS, which significantly improves search efficiency and allows NAS algorithms to run on ordinary computers. Despite receiving high expectations, this category of methods suffers from low search effectiveness. By employing a generalization boundedness tool, we demonstrate that the devil behind this drawback is the untrustworthy architecture rating with the oversized search space of the possible architectures. Addressing this problem, we modularize a large search space into blocks with small search spaces and develop a family of models with the distilling neural architecture (DNA) techniques. These proposed models, namely a DNA family, are capable of resolving multiple dilemmas of the weight-sharing NAS, such as scalability, efficiency, and multi-modal compatibility. Our proposed DNA models can rate all architecture candidates, as opposed to previous works that can only access a subsearch space using heuristic algorithms. Moreover, under a certain computational complexity constraint, our method can seek architectures with different depths and widths. Extensive experimental evaluations show that our models achieve state-of-the-art top-1 accuracy of 78.9% and 83.6% on ImageNet for a mobile convolutional network and a small vision transformer, respectively. Additionally, we provide in-depth empirical analysis and insights into neural architecture ratings. Code is available at https://github.com/changlin31/DNA.
Comment: T-PAMI
Published: 2024

16. MatchNAS: Optimizing Edge AI in Sparse-Label Data Contexts via Automating Deep Neural Network Porting for Mobile Deployment

Author: Huang, Hongtao, Chang, Xiaojun, Hu, Wen, and Yao, Lina
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Recent years have seen the explosion of edge intelligence with powerful Deep Neural Networks (DNNs). One popular scheme is training DNNs on powerful cloud servers and subsequently porting them to mobile devices after making them lightweight. Conventional approaches manually specialize DNNs for various edge platforms and retrain them with real-world data. However, as the number of platforms increases, these approaches become labour-intensive and computationally prohibitive. Additionally, real-world data tends to be sparsely labelled, further increasing the difficulty of training lightweight models. In this paper, we propose MatchNAS, a novel scheme for porting DNNs to mobile devices. Specifically, we simultaneously optimise a large network family using both labelled and unlabelled data and then automatically search for tailored networks for different hardware platforms. MatchNAS acts as an intermediary that bridges the gap between cloud-based DNNs and edge-based DNNs.
Published: 2024

17. Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Author: Dang, Zhuohang, Luo, Minnan, Jia, Chengyou, Dai, Guang, Chang, Xiaojun, and Wang, Jingdong
Subjects: Computer Science - Machine Learning
Abstract: Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. Recently, to alleviate expensive data collection, co-occurring pairs from the Internet are automatically harvested for training. However, it inevitably includes mismatched pairs, i.e., noisy correspondences, undermining supervision reliability and degrading performance. Current methods leverage deep neural networks' memorization effect to address noisy correspondences, which overconfidently focus on similarity-guided training with hard negatives and suffer from self-reinforcing errors. In light of the above, we introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM). Specifically, by viewing sample matching as classification tasks within the batch, we generate classification logits for the given sample. Instead of a single similarity score, we refine sample filtration through energy uncertainty and estimate the model's sensitivity of selected clean samples using swapped classification entropy, in view of the overall prediction distribution. Additionally, we propose cross-modal biased complementary learning to leverage negative matches overlooked in hard-negative training, further improving model optimization stability and curbing self-reinforcing errors. Extensive experiments on challenging benchmarks affirm the efficacy and efficiency of SREM.
Published: 2023

18. Video Recognition in Portrait Mode

Author: Han, Mingfei, Yang, Linjie, Jin, Xiaojie, Feng, Jiashi, Chang, Xiaojun, and Wang, Heng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The creation of new datasets often presents new challenges for video recognition and can inspire novel ideas while addressing these challenges. While existing datasets mainly comprise landscape mode videos, our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format. With the growing popularity of smartphones and social media applications, recognizing portrait mode videos is becoming increasingly important. To this end, we have developed the first dataset dedicated to portrait mode video recognition, namely PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a data-driven manner, comprising 400 fine-grained categories, and rigorous quality assurance was implemented to ensure the accuracy of human annotations. In addition to the new dataset, we conducted a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats. Furthermore, we designed extensive experiments to explore key aspects of portrait mode video recognition, including the choice of data augmentation, evaluation procedure, the importance of temporal information, and the role of audio modality. Building on the insights from our experimental results and the introduction of PortraitMode-400, our paper aims to inspire further research efforts in this emerging research area.
Comment: See mingfei.info/PMV for data and code information
Published: 2023

19. Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Author: Han, Mingfei, Yang, Linjie, Chang, Xiaojun, and Wang, Heng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A short video clip may contain a progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the shots with one another to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks including single-shot video and narration captioning, multi-shot video summarization, and video retrieval with shot descriptions. Preliminary experiments reveal challenges in generating a long and comprehensive video summary. Nevertheless, the generated imperfect summaries can already significantly boost the performance of existing video understanding tasks such as video question-answering, promoting an under-explored setting of video understanding with detailed summaries., Comment: See https://mingfei.info/shot2story for updates and more information
Published: 2023

20. Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition

Author: Jia, Chengyou, Luo, Minnan, Chang, Xiaojun, Dang, Zhuohang, Han, Mingfei, Wang, Mengmeng, Dai, Guang, Dang, Sizhe, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Exploring open-vocabulary video action recognition is a promising venture, which aims to recognize previously unseen actions within any arbitrary set of categories. Existing methods typically adapt pretrained image-text models to the video domain, capitalizing on their inherent strengths in generalization. A common thread among such methods is the augmentation of visual embeddings with temporal information to improve the recognition of seen actions. Yet, they compromise with standard less-informative action descriptions, thus faltering when confronted with novel actions. Drawing inspiration from human cognitive processes, we argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition. To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts. Specifically, we harness the knowledge in LLMs to produce a set of descriptive sentences that contain distinctive features for identifying given actions. Building upon this foundation, we further introduce a multi-modal action knowledge alignment mechanism to align concepts in video and textual knowledge encapsulated within the prompts. Extensive experiments on various video benchmarks, including zero-shot, few-shot, and base-to-novel generalization settings, demonstrate that our method not only sets new SOTA performance but also possesses excellent interpretability.
Published: 2023

21. Disentangled Representation Learning with Transmitted Information Bottleneck

Author: Dang, Zhuohang, Luo, Minnan, Jia, Chengyou, Dai, Guang, Wang, Jihong, Chang, Xiaojun, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models. Although significant advances have been made by regularizing the information in representations with information theory, two major challenges remain: 1) the representation compression inevitably leads to a performance drop; 2) the disentanglement constraints on representations are difficult to optimize. To address these issues, we introduce Bayesian networks with transmitted information to formulate the interaction among input and representations during disentanglement. Building upon this framework, we propose DisTIB (Transmitted Information Bottleneck for Disentangled representation learning), a novel objective that navigates the balance between information compression and preservation. We employ variational inference to derive a tractable estimation for DisTIB. This estimation can be simply optimized via standard gradient descent with a reparameterization trick. Moreover, we theoretically prove that DisTIB can achieve optimal disentanglement, underscoring its superior efficacy. To solidify our claims, we conduct extensive experiments on various downstream tasks to demonstrate the appealing efficacy of DisTIB and validate our theoretical analyses.
Published: 2023

22. Mask Propagation for Efficient Video Semantic Segmentation

Author: Weng, Yuetian, Han, Mingfei, He, Haoyu, Li, Mingjie, Yao, Lina, Chang, Xiaojun, and Zhuang, Bohan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence. Prior work in this field has demonstrated promising results by extending image semantic segmentation models to exploit temporal relationships across video frames; however, these approaches often incur significant computational costs. In this paper, we propose an efficient mask propagation framework for VSS, called MPVSS. Our approach first employs a strong query-based image segmentor on sparse key frames to generate accurate binary masks and class predictions. We then design a flow estimation module utilizing the learned queries to generate a set of segment-aware flow maps, each associated with a mask prediction from the key frame. Finally, the mask-flow pairs are warped to serve as the mask predictions for the non-key frames. By reusing predictions from key frames, we circumvent the need to process a large volume of video frames individually with resource-intensive segmentors, alleviating temporal redundancy and significantly reducing computational costs. Extensive experiments on VSPW and Cityscapes demonstrate that our mask propagation framework achieves SOTA accuracy and efficiency trade-offs. For instance, our best model with Swin-L backbone outperforms the SOTA MRCFA using MiT-B5 by 4.0% mIoU, requiring only 26% FLOPs on the VSPW dataset. Moreover, our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former baseline with only up to 2% mIoU degradation on the Cityscapes validation set. Code is available at https://github.com/ziplab/MPVSS., Comment: NeurIPS 2023
Published: 2023

23. No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

Author: Xu, Xuwei, Li, Changlin, Chen, Yudong, Chang, Xiaojun, Liu, Jiajun, and Wang, Sen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks, yet their high computational complexity prevents their deployment in computing resource-constrained environments. Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs by dynamically dropping image tokens. However, some undesirable pruning at early stages may result in permanent loss of image information in subsequent layers, consequently hindering model performance. To address this problem, we propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency. Specifically, in each layer, IdleViT selects a subset of the image tokens to participate in computations while keeping the rest of the tokens idle and directly passing them to this layer's output. By allowing the idle tokens to be re-selected in the following layers, IdleViT mitigates the negative impact of improper pruning in the early stages. Furthermore, inspired by the normalized graph cut, we devise a token cut loss on the attention map as regularization to improve IdleViT's token selection ability. Our method is simple yet effective and can be extended to pyramid ViTs since no token is completely dropped. Extensive experimental results on various ViT architectures have shown that IdleViT can diminish the complexity of pretrained ViTs by up to 33% with no more than 0.2% accuracy decrease on ImageNet, after finetuning for only 30 epochs. Notably, when the keep ratio is 0.5, IdleViT outperforms the state-of-the-art EViT on DeiT-S by 0.5% higher accuracy and even faster inference speed. The source code is available in the supplementary material., Comment: Accepted to AJCAI2023
Published: 2023

24. Visual Out-of-Distribution Detection in Open-Set Noisy Environments

Author: He, Rundong, Han, Zhongyi, Nie, Xiushan, Yin, Yilong, and Chang, Xiaojun
Published: 2024
Full Text: View/download PDF

25. Self-supervised discriminative model prediction for visual tracking

Author: Yuan, Di, Geng, Gu, Shu, Xiu, Liu, Qiao, Chang, Xiaojun, He, Zhenyu, and Shi, Guangming
Published: 2024
Full Text: View/download PDF

26. PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

Author: Jia, Chengyou, Luo, Minnan, Dang, Zhuohang, Dai, Guang, Chang, Xiaojun, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., pedestrian detection and Re-IDentification (ReID). Despite significant progress, current methods face two primary challenges: 1) the pedestrian candidates learned within detectors are suboptimal for the ReID task; 2) the potential for collaboration between the two sub-tasks is overlooked. To address these issues, we present a novel Person Search framework based on the Diffusion model, PSDiff. PSDiff formulates the person search as a dual denoising process from noisy boxes and ReID embeddings to ground truths. Distinct from the conventional Detection-to-ReID approach, our denoising paradigm discards prior pedestrian candidates generated by detectors, thereby avoiding the local optimum problem of the ReID task. Following the new paradigm, we further design a new Collaborative Denoising Layer (CDL) to optimize detection and ReID sub-tasks in an iterative and collaborative way, which makes the two sub-tasks mutually beneficial. Extensive experiments on the standard benchmarks show that PSDiff achieves state-of-the-art performance with fewer parameters and elastic computing overhead.
Published: 2023

27. Normalized solutions for Sobolev critical Schrödinger-Bopp-Podolsky systems

Author: Li, Yuxin, Chang, Xiaojun, and Feng, Zhaosheng
Subjects: Mathematics - Analysis of PDEs, 35K92, 35B44, 35B40, 35R02
Abstract: We study the Sobolev critical Schrödinger-Bopp-Podolsky system \begin{gather*} -\Delta u+\phi u=\lambda u+\mu|u|^{p-2}u+|u|^4u\quad \text{in }\mathbb{R}^3, -\Delta\phi+\Delta^2\phi=4\pi u^2\quad \text{in } \mathbb{R}^3, \end{gather*} under the mass constraint \[ \int_{\mathbb{R}^3}u^2\,dx=c \] for some prescribed $c>0$, where $20$ is a parameter, and $\lambda\in\mathbb{R}$ is a Lagrange multiplier. By developing a constraint minimizing approach, we show that the above system admits a local minimizer. Furthermore, we establish the existence of normalized ground state solutions., Comment: 19 pages
Published: 2023
Full Text: View/download PDF

28. ProAgent: Building Proactive Cooperative Agents with Large Language Models

Author: Zhang, Ceyao, Yang, Kaijie, Hu, Siyi, Wang, Zihao, Li, Guanghe, Sun, Yihang, Zhang, Cheng, Zhang, Zhaowei, Liu, Anji, Zhu, Song-Chun, Chang, Xiaojun, Zhang, Junge, Yin, Feng, Liang, Yitao, and Yang, Yaodong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: Building agents with adaptive behavior in cooperative tasks stands as a paramount goal in the realm of multi-agent systems. Current approaches to developing cooperative agents rely primarily on learning-based methods, whose policy generalization depends heavily on the diversity of teammates they interact with during the training phase. Such reliance, however, constrains the agents' capacity for strategic adaptation when cooperating with unfamiliar teammates, which becomes a significant challenge in zero-shot coordination scenarios. To address this challenge, we propose ProAgent, a novel framework that harnesses large language models (LLMs) to create proactive agents capable of dynamically adapting their behavior to enhance cooperation with teammates. ProAgent can analyze the present state and infer the intentions of teammates from observations. It then updates its beliefs in alignment with the teammates' subsequent actual behaviors. Moreover, ProAgent exhibits a high degree of modularity and interpretability, making it easy to integrate into various coordination scenarios. Experimental evaluations conducted within the Overcooked-AI environment show the superior performance of ProAgent, which outperforms five methods based on self-play and population-based training when cooperating with AI agents. Furthermore, when partnered with human proxy models, its performance exhibits an average improvement exceeding 10% compared to the current state-of-the-art method. For more information about our project, please visit https://pku-proagent.github.io., Comment: v3 is the AAAI'24 camera ready version, which polished abstract and introduction based on the reviewers' comments, and enriched related works. 7 pages of main content, 2 pages of references, 2 figures and 1 table
Published: 2023

29. SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

Author: Jia, Chengyou, Luo, Minnan, Dang, Zhuohang, Dai, Guang, Chang, Xiaojun, Wang, Mengmeng, and Wang, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls. In contrast, Layout-to-Image (L2I) generation, aiming to generate realistic and complex scene images from user-specified layouts, has risen to prominence. However, existing methods transform layout information into tokens or RGB images for conditional control in the generative process, leading to insufficient spatial and semantic controllability of individual instances. To address these limitations, we propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance. Owing to rich spatial and semantic information encapsulated in well-designed feature maps, SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works. Additionally, we propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms. The former aims to model the relationships among multiple objects within scenes while the latter is designed to heighten the model's sensitivity to the spatial information embedded in the guidance. Extensive experiments demonstrate that SSMG achieves highly promising results, setting a new state-of-the-art across a range of metrics encompassing fidelity, diversity, and controllability., Comment: Accepted to AAAI 2024
Published: 2023

30. FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration

Author: Huang, Zhijian, Lin, Sihao, Liu, Guiyu, Luo, Mukun, Ye, Chaoqiang, Xu, Hang, Chang, Xiaojun, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-modality fusion and multi-task learning are becoming trendy in 3D autonomous driving scenarios, considering robust prediction and computation budget. However, naively extending the existing framework to the domain of multi-modality multi-task learning remains ineffective and even harmful due to the notorious modality bias and task conflict. Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optima. To mitigate the issue, we propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization. Specifically, the gradients, produced by the task heads and used to update the shared backbone, will be calibrated at the backbone's last layer to alleviate the task conflict. Before the calibrated gradients are further propagated to the modality branches of the backbone, their magnitudes will be calibrated again to the same level, ensuring the downstream tasks pay balanced attention to different modalities. Experiments on the large-scale benchmark nuScenes demonstrate the effectiveness of the proposed method, e.g., an absolute 14.4% mIoU improvement on map segmentation and 1.4% mAP improvement on 3D detection, advancing the application of 3D autonomous driving in the domain of multi-modality fusion and multi-task learning. We also discuss the links between modalities and tasks.
Published: 2023

31. Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

Author: Liu, Yao, Cui, Gangfeng, Luo, Jiahui, Chang, Xiaojun, and Yao, Lina
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition serves as the foundation for many smart applications, with a strong focus on personal privacy. However, recognizing two-person interactions poses more challenges due to increased body occlusion and overlap compared to single-person actions. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a designed frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.
Published: 2023

32. Convergence of least energy sign-changing solutions for logarithmic Schrödinger equations on locally finite graphs

Author: Chang, Xiaojun, Rădulescu, Vicenţiu D., Wang, Ru, and Yan, Duokui
Subjects: Mathematics - Analysis of PDEs, Mathematics - Functional Analysis, 35A15, 35R02, 35Q55, 39A12
Abstract: In this paper, we study the following logarithmic Schrödinger equation \[ -\Delta u+\lambda a(x)u=u\log u^2\ \ \ \ \mbox{ in }V \] on a connected locally finite graph $G=(V,E)$, where $\Delta$ denotes the graph Laplacian, $\lambda > 0$ is a constant, and $a(x) \geq 0$ represents the potential. Using variational techniques in combination with the Nehari manifold method based on directional derivative, we prove that there exists a constant $\lambda_0>0$ such that for all $\lambda\geq\lambda_0$, the above problem admits a least energy sign-changing solution $u_{\lambda}$. Moreover, as $\lambda\to+\infty$, we prove that the solution $u_{\lambda}$ converges to a least energy sign-changing solution of the following Dirichlet problem \[\begin{cases} -\Delta u=u\log u^2~~~&\mbox{ in }\Omega,\\ u(x)=0~~~&\mbox{ on }\partial\Omega, \end{cases}\] where $\Omega=\{x\in V: a(x)=0\}$ is the potential well., Comment: Submitted to CNSNS
Published: 2023
Full Text: View/download PDF

33. Maximum Entropy Heterogeneous-Agent Reinforcement Learning

Author: Liu, Jiarong, Zhong, Yifan, Hu, Siyi, Fu, Haobo, Fu, Qiang, Chang, Xiaojun, and Yang, Yaodong
Subjects: Computer Science - Multiagent Systems, Computer Science - Machine Learning
Abstract: Multi-agent reinforcement learning (MARL) has been shown to be effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash Equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to quantal response equilibrium (QRE) properties of HASAC. Furthermore, we generalize a unified template for MaxEnt algorithmic design named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration., Comment: ICLR 2024 spotlight
Published: 2023

34. Toward the Automated Construction of Probabilistic Knowledge Graphs for the Maritime Domain

Author: Shiri, Fatemeh, Wang, Teresa, Pan, Shirui, Chang, Xiaojun, Li, Yuan-Fang, Haffari, Reza, Nguyen, Van, and Yu, Shuang
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: International maritime crime is becoming increasingly sophisticated, often associated with wider criminal networks. Detecting maritime threats by means of fusing data purely related to physical movement (i.e., those generated by physical sensors, or hard data) is not sufficient. This has led to research and development efforts aimed at combining hard data with other types of data (especially human-generated or soft data). Existing work often assumes that input soft data is available in a structured format, or is focused on extracting certain relevant entities or concepts to accompany or annotate hard data. Much less attention has been given to extracting the rich knowledge about the situations of interest implicitly embedded in the large amount of soft data existing in unstructured formats (such as intelligence reports and news articles). In order to exploit the potentially useful and rich information from such sources, it is necessary to extract not only the relevant entities and concepts but also their semantic relations, together with the uncertainty associated with the extracted knowledge (i.e., in the form of probabilistic knowledge graphs). This will increase the accuracy of, and confidence in, the extracted knowledge and facilitate subsequent reasoning and learning. To this end, we propose Maritime DeepDive, an initial prototype for the automated construction of probabilistic knowledge graphs from natural language data for the maritime domain. In this paper, we report on the current implementation of Maritime DeepDive, together with preliminary results on extracting probabilistic events from maritime piracy incidents. This pipeline was evaluated on a manually crafted gold standard, yielding promising results.
Published: 2023
Full Text: View/download PDF

35. Existence and instability of standing waves for the biharmonic nonlinear Schroedinger equation with combined nonlinearities

Author: Chang, Xiaojun, Hajaiej, Hichem, Ma, Zhouji, and Song, Linjie
Subjects: Mathematics - Analysis of PDEs, 35J60
Abstract: We prove the existence of normalized ground state solutions for the biharmonic Schrödinger equation with combined nonlinearities and show that all ground states correspond to the local minima of the associated energy functional restricted to the appropriate set. Moreover, we prove that the standing waves are strongly unstable by blowup. In particular, our results cover the critical case. Our method is novel and innovative as previous ideas cannot be used in many cases under this study., Comment: The authors most welcome any comments
Published: 2023

36. Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining

Author: Lin, Bingqian, Chen, Zicong, Li, Mingjie, Lin, Haokun, Xu, Hang, Zhu, Yi, Liu, Jianzhuang, Cai, Wenjia, Yang, Lei, Zhao, Shen, Wu, Chenfei, Chen, Ling, Chang, Xiaojun, Yang, Yi, Xing, Lei, and Liang, Xiaodan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Medical artificial general intelligence (MAGI) enables one foundation model to solve different medical tasks, which is very practical in the medical domain. It can significantly reduce the requirement of large amounts of task-specific data by sufficiently sharing medical knowledge among different tasks. However, due to the challenges of designing strongly generalizable models with limited and complex medical data, most existing approaches tend to develop task-specific models. To take a step towards MAGI, we propose a new paradigm called Medical-knOwledge-enhanced mulTimOdal pretRaining (MOTOR). In MOTOR, we combine two kinds of basic medical knowledge, i.e., general and specific knowledge, in a complementary manner to boost the general pretraining process. As a result, the foundation model with comprehensive basic knowledge can learn compact representations from pretraining radiographic data for better cross-modal alignment. MOTOR unifies the understanding and generation, which are two kinds of core intelligence of an AI system, into a single medical foundation model, to flexibly handle more diverse medical tasks. To enable a comprehensive evaluation and facilitate further research, we construct a medical multimodal benchmark including a wide range of downstream tasks, such as chest x-ray report generation and medical visual question answering. Extensive experiments on our benchmark show that MOTOR obtains promising results through simple task-oriented adaptation. The visualization shows that the injected knowledge successfully highlights key information in the medical data, demonstrating the excellent interpretability of MOTOR. Our MOTOR successfully mimics the human practice of fulfilling a "medical student" to accelerate the process of becoming a "specialist". We believe that our work makes a significant stride in realizing MAGI., Comment: Project page: https://github.com/chenzcv7/MOTOR
Published: 2023

37. A Benchmark for Cycling Close Pass Near Miss Event Detection from Video Streams

Author: Li, Mingjie, Rathnayake, Tharindu, Beck, Ben, Meng, Lingheng, Chen, Zijue, Cosgun, Akansel, Chang, Xiaojun, and Kulić, Dana
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Cycling is a healthy and sustainable mode of transport. However, interactions with motor vehicles remain a key barrier to increased cycling participation. The ability to detect potentially dangerous interactions from on-bike sensing could provide important information to riders and policy makers. Thus, automated detection of conflict between cyclists and drivers has attracted researchers from both computer vision and road safety communities. In this paper, we introduce a novel benchmark, called Cyc-CP, towards cycling close pass near miss event detection from video streams. We first divide this task into scene-level and instance-level problems. Scene-level detection asks an algorithm to predict whether there is a close pass near miss event in the input video clip. Instance-level detection aims to detect which vehicle in the scene gives rise to a close pass near miss. We propose two benchmark models based on deep learning techniques for these two problems. For training and testing those models, we construct a synthetic dataset and also collect a real-world dataset. Our models can achieve 88.13% and 84.60% accuracy on the real-world dataset, respectively. We envision this benchmark as a test-bed to accelerate cycling close pass near miss detection and facilitate interaction between the fields of road safety, intelligent transportation systems and artificial intelligence. Both the benchmark datasets and detection models will be available at https://github.com/SustainableMobility/cyc-cp to facilitate experimental reproducibility and encourage more in-depth research in the field., Comment: 15 pages, 19 figures and 2 tables
Published: 2023

38. No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling

Author: Xu, Xuwei, Li, Changlin, Chen, Yudong, Chang, Xiaojun, Liu, Jiajun, Wang, Sen, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Liu, Tongliang, editor, Webb, Geoff, editor, Yue, Lin, editor, and Wang, Dadong, editor
Published: 2024
Full Text: View/download PDF

39. Origin and evolution of the triploid cultivated banana genome

Author: Li, Xiuxiu, Yu, Sheng, Cheng, Zhihao, Chang, Xiaojun, Yun, Yingzi, Jiang, Mengwei, Chen, Xuequn, Wen, Xiaohui, Li, Hua, Zhu, Wenjun, Xu, Shiyao, Xu, Yanbing, Wang, Xianjun, Zhang, Chen, Wu, Qiong, Hu, Jin, Lin, Zhenguo, Aury, Jean-Marc, Van de Peer, Yves, Wang, Zonghua, Zhou, Xiaofan, Wang, Jihua, Lü, Peitao, and Zhang, Liangsheng
Published: 2024
Full Text: View/download PDF

40. Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation

Author: Li, Mingjie, Lin, Bingqian, Chen, Zicong, Lin, Haokun, Liang, Xiaodan, and Chang, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Automatic radiology reporting has great clinical potential to relieve radiologists from heavy workloads and improve diagnosis interpretation. Recently, researchers have enhanced data-driven neural networks with medical knowledge graphs to eliminate the severe visual and textual bias in this task. The structures of such graphs are exploited by using the clinical dependencies formed by the disease topic tags via general knowledge and usually do not update during the training process. Consequently, the fixed graphs cannot guarantee the most appropriate scope of knowledge and limit the effectiveness. To address the limitation, we propose a knowledge graph with Dynamic structure and nodes to facilitate medical report generation with Contrastive Learning, named DCL. In detail, the fundamental structure of our graph is pre-constructed from general knowledge. Then we explore specific knowledge extracted from the retrieved reports to add additional nodes or redefine their relations in a bottom-up manner. Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation. Finally, this paper introduces Image-Report Contrastive and Image-Report Matching losses to better represent visual features and textual information. Evaluated on IU-Xray and MIMIC-CXR datasets, our DCL outperforms previous state-of-the-art models on these two benchmarks., Comment: Accepted by CVPR 2023. Project page: https://github.com/mlii0117/DCL
Published: 2023

41. Guided Image-to-Image Translation by Discriminator-Generator Communication

Author: Cao, Yuanjiang, Yao, Lina, Pan, Le, Sheng, Quan Z., and Chang, Xiaojun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: The goal of Image-to-image (I2I) translation is to transfer an image from a source domain to a target domain, which has recently drawn increasing attention. One major branch of this research is to formulate I2I translation based on Generative Adversarial Network (GAN). As a zero-sum game, GAN can be reformulated as a Partially-observed Markov Decision Process (POMDP) for generators, where generators cannot access full state information of their environments. This formulation illustrates the information insufficiency in the GAN training. To mitigate this problem, we propose to add a communication channel between discriminators and generators. We explore multiple architecture designs to integrate the communication mechanism into the I2I translation framework. To validate the performance of the proposed approach, we have conducted extensive experiments on various benchmark datasets. The experimental results confirm the superiority of our proposed method.
Published: 2023

42. ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency

Author: Ren, Pengzhen, Li, Changlin, Xu, Hang, Zhu, Yi, Wang, Guangrun, Liu, Jianzhuang, Chang, Xiaojun, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recently, great success has been made in learning visual representations from text supervision, facilitating the emergence of text-supervised semantic segmentation. However, existing works focus on pixel grouping and cross-modal semantic alignment, while ignoring the correspondence among multiple augmented views of the same image. To overcome this limitation, we propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation. Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image. Additionally, we propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision by contrasting the segment features of Siamese visual encoders. The text-to-views consistency benefits the dense assignment of the visual features by encouraging different crops to align with the same text, while the cross-view segmentation consistency modeling provides additional self-supervision, overcoming the limitation of ambiguous text supervision for segmentation masks. Trained with large-scale image-text data, our model can directly segment objects of arbitrary categories in a zero-shot manner. Extensive experiments show that ViewCo outperforms state-of-the-art methods on average by up to 2.9%, 1.6%, and 2.4% mIoU on PASCAL VOC2012, PASCAL Context, and COCO, respectively.
Published: 2023

43. Normalized solutions of $L^2$-supercritical NLS equations on noncompact metric graphs with localized nonlinearities

Author: Borthwick, Jack, Chang, Xiaojun, Jeanjean, Louis, and Soave, Nicola
Subjects: Mathematics - Analysis of PDEs
Abstract: In this paper we are concerned with the existence of normalized solutions for nonlinear Schrödinger equations on noncompact metric graphs with localized nonlinearities. In an $L^2$-supercritical regime, we obtain the existence of solutions for any prescribed mass. This result is obtained through an approach which could prove successful to treat more general equations on noncompact graphs. Comment: arXiv admin note: text overlap with arXiv:2204.01043
Published: 2022

44. 3D-TOGO: Towards Text-Guided Cross-Category 3D Object Generation

Author: Jiang, Zutao, Lu, Guansong, Liang, Xiaodan, Zhu, Jihua, Zhang, Wei, Chang, Xiaojun, and Xu, Hang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Text-guided 3D object generation aims to generate 3D objects described by user-defined captions, which paves a flexible way to visualize what we imagined. Although some works have been devoted to solving this challenging task, these works either utilize some explicit 3D representations (e.g., mesh), which lack texture and require post-processing for rendering photo-realistic views; or require individual time-consuming optimization for every single case. Here, we make the first attempt to achieve generic text-guided cross-category 3D object generation via a new 3D-TOGO model, which integrates a text-to-views generation module and a views-to-3D generation module. The text-to-views generation module is designed to generate different views of the target 3D object given an input caption. Prior-guidance, caption-guidance and view contrastive learning are proposed for achieving better view-consistency and caption similarity. Meanwhile, a pixelNeRF model is adopted for the views-to-3D generation module to obtain the implicit 3D neural representation from the previously-generated views. Our 3D-TOGO model generates 3D objects in the form of the neural radiance field with good texture and requires no time-cost optimization for every single caption. Besides, 3D-TOGO can control the category, color and shape of generated 3D objects with the input caption. Extensive experiments on the largest 3D object dataset (i.e., ABO) are conducted to verify that 3D-TOGO can better generate high-quality 3D objects according to the input captions across 98 different categories, in terms of PSNR, SSIM, LPIPS and CLIP-score, compared with text-NeRF and Dreamfields.
Published: 2022

45. Ground states for logarithmic Schrödinger equations on locally finite graphs

Author: Chang, Xiaojun, Wang, Ru, and Yan, Duokui
Subjects: Mathematics - Analysis of PDEs, 35A15, 35R02, 35Q55, 39A12
Abstract: In this paper, we study the following logarithmic Schrödinger equation \[ -\Delta u+a(x)u=u\log u^2\ \ \ \ \mbox{in }V, \] where $\Delta$ is the graph Laplacian, $G=(V,E)$ is a connected locally finite graph, the potential $a: V\to \mathbb{R}$ is bounded from below and may change sign. We first establish two Sobolev compact embedding theorems in the case when different assumptions are imposed on $a(x)$. It leads to two kinds of associated energy functionals, one of which is not well-defined under the logarithmic nonlinearity, while the other is $C^1$. The existence of ground state solutions is then obtained by using the Nehari manifold method and the mountain pass theorem, respectively. Comment: 25 pages
Published: 2022

46. Simple Primitives with Feasibility- and Contextuality-Dependence for Open-World Compositional Zero-shot Learning

Author: Liu, Zhe, Li, Yun, Yao, Lina, Chang, Xiaojun, Fang, Wei, Wu, Xiaojun, and Yang, Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The task of Compositional Zero-Shot Learning (CZSL) is to recognize images of novel state-object compositions that are absent during the training stage. Previous methods of learning compositional embedding have shown effectiveness in closed-world CZSL. However, in Open-World CZSL (OW-CZSL), their performance tends to degrade significantly due to the large cardinality of possible compositions. Some recent works separately predict simple primitives (i.e., states and objects) to reduce cardinality. However, they consider simple primitives as independent probability distributions, ignoring the heavy dependence between states, objects, and compositions. In this paper, we model the dependence of compositions via feasibility and contextuality. Feasibility-dependence refers to the unequal feasibility relations between simple primitives, e.g., "hairy" is more feasible with "cat" than with "building" in the real world. Contextuality-dependence represents the contextual variance in images, e.g., "cat" shows diverse appearances under the state of "dry" and "wet". We design Semantic Attention (SA) and generative Knowledge Disentanglement (KD) to learn the dependence of feasibility and contextuality, respectively. SA captures semantics in compositions to alleviate impossible predictions, driven by the visual similarity between simple primitives. KD disentangles images into unbiased feature representations, easing contextual bias in predictions. Moreover, we complement the current compositional probability model with feasibility and contextuality in a compatible format. Finally, we conduct comprehensive experiments to analyze and validate the superior or competitive performance of our model, Semantic Attention and knowledge Disentanglement guided Simple Primitives (SAD-SP), on three widely-used benchmark OW-CZSL datasets.
Published: 2022

47. Bounded Palais-Smale sequences with Morse type information for some constrained functionals

Author: Borthwick, Jack, Chang, Xiaojun, Jeanjean, Louis, and Soave, Nicola
Subjects: Mathematics - Analysis of PDEs, 35J60, 47J30
Abstract: In this paper, we study, for functionals having a mountain pass geometry on a constraint, the existence of bounded Palais-Smale sequences carrying Morse index type information. Comment: This version is the final one, corresponding to the paper now published in Transactions of the American Mathematical Society
Published: 2022

48. Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers

Author: Tang, Tao, Li, Changlin, Wang, Guangrun, Yu, Kaicheng, Chang, Xiaojun, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Automatic data augmentation (AutoAugment) strategies are indispensable in supervised data-efficient training protocols of vision transformers, and have led to state-of-the-art results in supervised learning. Despite the success, its development and application on self-supervised vision transformers have been hindered by several barriers, including the high search cost, the lack of supervision, and the unsuitable search space. In this work, we propose AutoView, a self-regularized adversarial AutoAugment method, to learn views for self-supervised vision transformers, by addressing the above barriers. First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously in a single forward-backward step, minimizing and maximizing the mutual information among different augmented views, respectively. Then, to avoid information collapse caused by the lack of label supervision, we propose a self-regularized loss term to guarantee the information propagation. Additionally, we present a curated augmentation policy search space for self-supervised learning, by modifying the generally used search space designed for supervised learning. On ImageNet, our AutoView achieves remarkable improvement over the RandAug baseline (+10.2% k-NN accuracy), and consistently outperforms the state-of-the-art manually tuned view policy by a clear margin (up to +1.3% k-NN accuracy). Extensive experiments show that AutoView pretraining also benefits downstream tasks (+1.2% mAcc on ADE20K Semantic Segmentation and +2.8% mAP on revisited Oxford Image Retrieval benchmark) and improves model robustness (+2.3% Top-1 Acc on ImageNet-A and +1.0% AUPR on ImageNet-O). Code and models will be available at https://github.com/Trent-tangtao/AutoView.
Published: 2022

49. PAR: Political Actor Representation Learning with Social Context and Expert Knowledge

Author: Feng, Shangbin, Tan, Zhaoxuan, Chen, Zilong, Wang, Ningnan, Yu, Peisheng, Zheng, Qinghua, Chang, Xiaojun, and Luo, Minnan
Subjects: Computer Science - Computation and Language
Abstract: Modeling the ideological perspectives of political actors is an essential task in computational political science with applications in many downstream tasks. Existing approaches are generally limited to textual data and voting records, while they neglect the rich social context and valuable expert knowledge for holistic ideological analysis. In this paper, we propose PAR, a Political Actor Representation learning framework that jointly leverages social context and expert knowledge. Specifically, we retrieve and extract factual statements about legislators to leverage social context information. We then construct a heterogeneous information network to incorporate social context and use relational graph neural networks to learn legislator representations. Finally, we train PAR with three objectives to align representation learning with expert knowledge, model ideological stance consistency, and simulate the echo chamber phenomenon. Extensive experiments demonstrate that PAR is better at augmenting political text understanding and successfully advances the state-of-the-art in political perspective detection and roll call vote prediction. Further analysis proves that PAR learns representations that reflect the political reality and provide new insights into political behavior. Comment: EMNLP 2022
Published: 2022

50. ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities

Author: Zhuo, Terry Yue, Liao, Yaqing, Lei, Yuecheng, Qu, Lizhen, de Melo, Gerard, Chang, Xiaojun, Ren, Yazhou, and Xu, Zenglin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities. Comment: Accepted at EACL2023 (Findings)
Published: 2022
