Author: "He, Xuehai" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"He, Xuehai"' showing total 43 results



1. Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Author: Wang, Yiping, He, Xuehai, Wang, Kuan, Ma, Luyao, Yang, Jianwei, Wang, Shuohang, Du, Simon Shaolei, and Shen, Yelong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Graphics
Abstract: The current state-of-the-art video generative models can produce commercial-grade videos with highly realistic details. However, they still struggle to coherently present multiple sequential events in the stories specified by the prompts, which is foreseeably an essential capability for future long video generation scenarios. For example, top T2V generative models still fail to generate a video of the short simple story 'how to put an elephant into a refrigerator.' While existing detail-oriented benchmarks primarily focus on fine-grained metrics like aesthetic quality and spatial-temporal consistency, they fall short of evaluating models' abilities to handle event-level story presentation. To address this gap, we introduce StoryEval, a story-oriented benchmark specifically designed to assess text-to-video (T2V) models' story-completion capabilities. StoryEval features 423 prompts spanning 7 classes, each representing short stories composed of 2-4 consecutive events. We employ advanced vision-language models, such as GPT-4V and LLaVA-OV-Chat-72B, to verify the completion of each event in the generated videos, applying a unanimous voting method to enhance reliability. Our methods ensure high alignment with human evaluations, and the evaluation of 11 models reveals how challenging the benchmark is, with none exceeding an average story-completion rate of 50%. StoryEval provides a new benchmark for advancing T2V models and highlights the challenges and opportunities in developing next-generation solutions for coherent story-driven video generation. Comment: benchmark paper, project page: https://ypwang61.github.io/project/StoryEval
Published: 2024

2. Mojito: Motion Trajectory and Intensity Control for Video Generation

Author: He, Xuehai, Wang, Shuohang, Yang, Jianwei, Wu, Xiaoxia, Wang, Yiping, Wang, Kuan, Zhan, Zheng, Ruwase, Olatunji, Shen, Yelong, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training video diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. To tackle these challenges, this paper introduces Mojito, a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation. Specifically, Mojito features a Directional Motion Control (DMC) module that leverages cross-attention to efficiently direct the generated object's motion without training, alongside a Motion Intensity Modulator (MIM) that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.
Published: 2024

3. EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing

Author: Zheng, Kaizhi, Chen, Xiaotong, He, Xuehai, Gu, Jing, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, Jianfeng, Wang, Lijuan, and Wang, Xin Eric
Subjects: Computer Science - Graphics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction
Abstract: Given the steep learning curve of professional 3D software and the time-consuming process of managing large 3D assets, language-guided 3D scene editing has significant potential in fields such as virtual reality, augmented reality, and gaming. However, recent approaches to language-guided 3D scene editing either require manual interventions or focus only on appearance modifications without supporting comprehensive scene layout changes. In response, we propose EditRoom, a unified framework capable of executing a variety of layout edits through natural language commands, without requiring manual intervention. Specifically, EditRoom leverages Large Language Models (LLMs) for command planning and generates target scenes using a diffusion-based method, enabling six types of edits: rotate, translate, scale, replace, add, and remove. To address the lack of data for language-guided 3D scene editing, we have developed an automatic pipeline to augment existing 3D scene synthesis datasets and introduced EditRoom-DB, a large-scale dataset with 83k editing pairs, for training and evaluation. Our experiments demonstrate that our approach consistently outperforms other baselines across all metrics, indicating higher accuracy and coherence in language-guided scene layout editing.
Published: 2024

4. MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Author: He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.
Published: 2024

5. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA

Author: Yan, Qianqi, He, Xuehai, Yue, Xiang, and Wang, Xin Eric
Subjects: Computer Science - Artificial Intelligence
Abstract: Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing original questions with negation questions with hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Besides, models like LLaVA-Med struggle even with more general questions, and results from CheXagent demonstrate the transferability of expertise across different modalities of the same organ, showing that specialized domain knowledge is still crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis, and current LMMs are still far from applicable to those fields.
Published: 2024

6. FlexEControl: Flexible and Efficient Multimodal Control for Text-to-Image Generation

Author: He, Xuehai, Zheng, Jian, Fang, Jacob Zhiyuan, Piramuthu, Robinson, Bansal, Mohit, Ordonez, Vicente, Sigurdsson, Gunnar A, Peng, Nanyun, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Controllable text-to-image (T2I) diffusion models generate images conditioned on both text prompts and semantic inputs of other modalities like edge maps. Nevertheless, current controllable T2I methods commonly face challenges related to efficiency and faithfulness, especially when conditioning on multiple inputs from either the same or diverse modalities. In this paper, we propose a novel Flexible and Efficient method, FlexEControl, for controllable T2I generation. At the core of FlexEControl is a unique weight decomposition strategy, which allows for streamlined integration of various input types. This approach not only enhances the faithfulness of the generated image to the control, but also significantly reduces the computational overhead typically associated with multimodal conditioning. Our approach achieves a reduction of 41% in trainable parameters and 30% in memory usage compared with Uni-ControlNet. Moreover, it doubles data efficiency and can flexibly generate images under the guidance of multiple input conditions of various modalities.
Published: 2024

7. Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning

Author: Li, Jiachen, Gao, Qiaozi, Johnston, Michael, Gao, Xiaofeng, He, Xuehai, Shakiah, Suhaila, Shi, Hangjie, Ghanadan, Reza, and Wang, William Yang
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Prompt-based learning has been demonstrated as a compelling paradigm contributing to the tremendous success of large language models (LLMs). Inspired by their success in language tasks, existing research has leveraged LLMs in embodied instruction following and task planning. In this work, we tackle the problem of training a robot to understand multimodal prompts, interleaving vision signals with text descriptions. This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals. In this work, we introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts from multi-task expert trajectories. Our method consists of a two-stage training pipeline that performs inverse dynamics pretraining and multi-task finetuning. To facilitate multimodal understanding, we design our multimodal prompt encoder by augmenting a pretrained LM with a residual connection to the visual input and model the dependencies among action dimensions. Empirically, we evaluate the efficacy of our method on the VIMA-BENCH and establish a new state-of-the-art (10% improvement in success rate). Moreover, we demonstrate that our model exhibits remarkable in-context learning ability. Project page: https://midas-icml.github.io/. Comment: Accepted by ICML 2024.
Published: 2023

8. MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Author: Zheng, Kaizhi, He, Xuehai, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model in more than 56% of cases for multimodal generation, highlighting its efficacy across diverse benchmarks. Comment: 23 pages, 10 figures
Published: 2023

9. LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Author: Feng, Weixi, Zhu, Wanrong, Fu, Tsu-jui, Jampani, Varun, Akula, Arjun, He, Xuehai, Basu, Sugato, Wang, Xin Eric, and Wang, William Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves performance comparable to that of human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains. Comment: NeurIPS 2023
Published: 2023

10. Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

Author: He, Xuehai, Feng, Weixi, Fu, Tsu-Jui, Jampani, Varun, Akula, Arjun, Narayana, Pradyumna, Basu, Sugato, Wang, William Yang, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tune the model via efficient attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks with superior results on few-shot image-text matching.
Published: 2023

11. Multimodal Graph Transformer for Multimodal Question Answering

Author: He, Xuehai and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
Published: 2023

12. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Author: Feng, Weixi, He, Xuehai, Fu, Tsu-Jui, Jampani, Varun, Akula, Arjun, Narayana, Pradyumna, Basu, Sugato, Wang, Xin Eric, and Wang, William Yang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute binding and compositional capabilities are still considered major challenging issues, especially when involving multiple objects. In this work, we improve the compositional skills of T2I models, specifically more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures with the diffusion guidance process based on the controllable properties of manipulating cross-attention layers in diffusion-based T2I models. We observe that keys and values in cross-attention layers have strong semantic meanings associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process. Comment: ICLR 2023 Camera Ready version
Published: 2022

13. ComCLIP: Training-Free Compositional Image and Text Matching

Author: Jiang, Kenan, He, Xuehai, Xu, Ruize, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching -- a more challenging image and text matching task requiring the model to understand compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action sub-images and composes CLIP's vision encoder and text encoder to perform evolving matching over compositional text embedding and sub-image embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets: SVO, ComVG, Winoground, and VL-checklist, and two general image-text retrieval datasets: Flickr30K and MSCOCO demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our codes can be found at https://github.com/eric-ai-lab/ComCLIP.
Published: 2022

14. CPL: Counterfactual Prompt Learning for Vision and Language Models

Author: He, Xuehai, Yang, Diji, Feng, Weixi, Fu, Tsu-Jui, Akula, Arjun, Jampani, Varun, Narayana, Pradyumna, Basu, Sugato, Wang, William Yang, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Prompt tuning is a new few-shot transfer learning technique that only tunes the learnable prompt for pre-trained vision and language models such as CLIP. However, existing prompt tuning methods tend to learn spurious or entangled representations, which leads to poor generalization to unseen concepts. Towards non-spurious and efficient prompt learning from limited examples, this paper presents a novel Counterfactual Prompt Learning (CPL) method for vision and language models, which simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. Particularly, CPL constructs counterfactuals by identifying the minimal non-spurious feature change between semantically similar positive and negative samples that causes concept change, and learns more generalizable prompt representation from both factual and counterfactual examples via contrastive learning. Extensive experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks than previous prompt tuning methods on CLIP. On image classification, we achieve 3.55% average relative improvement on unseen classes across seven datasets; on image-text retrieval and visual question answering, we gain up to 4.09% and 25.08% relative improvements across three few-shot scenarios on unseen test sets respectively.
Published: 2022

15. JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

Author: Zheng, Kaizhi, Zhou, Kaiwen, Gu, Jing, Fan, Yue, Wang, Jialu, Di, Zonglin, He, Xuehai, and Wang, Xin Eric
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge. Comment: 20 pages
Published: 2022

16. Improve the performance of CT-based pneumonia classification via source data reweighting

Author: Xie, Pengtao, Zhao, Xingchen, and He, Xuehai
Subjects: Information and Computing Sciences, Human-Centred Computing, Infectious Diseases, Bioengineering, Networking and Information Technology R&D (NITRD), Pneumonia, Biomedical Imaging, Machine Learning and Artificial Intelligence, Pneumonia & Influenza, Lung, Humans, Tomography, X-Ray Computed, Deep Learning, Computers
Abstract: Pneumonia is a life-threatening disease. Computed tomography (CT) imaging is broadly used for diagnosing pneumonia. To assist radiologists in accurately and efficiently detecting pneumonia from CT scans, many deep learning methods have been developed. These methods require large amounts of annotated CT scans, which are difficult to obtain due to privacy concerns and high annotation costs. To address this problem, we develop a three-level optimization based method which leverages CT data from a source domain to mitigate the lack of labeled CT scans in a target domain. Our method automatically identifies and downweights low-quality source CT data examples which are noisy or have large domain discrepancy with target data, by minimizing the validation loss of a target model trained on reweighted source data. On a target dataset with 2218 CT scans and a source dataset with 349 CT images, our method achieves an F1 score of 91.8% in detecting pneumonia and an F1 score of 92.4% in detecting other types of pneumonia, which are significantly better than those achieved by state-of-the-art baseline methods.
Published: 2023

17. Parameter-efficient Model Adaptation for Vision Transformers

Author: He, Xuehai, Li, Chunyuan, Zhang, Pengchuan, Yang, Jianwei, and Wang, Xin Eric
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In computer vision, adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks has achieved great transfer learning performance. Common approaches for model adaptation either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient model adaptation strategies for vision transformers on the image classification task. We formulate efficient model adaptation as a subspace training problem and perform a comprehensive benchmarking over different efficient adaptation methods. We conduct an empirical study on each efficient model adaptation method focusing on its performance alongside parameter cost. Furthermore, we propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into subspace for further decomposition via a novel Kronecker Adaptation (KAdaptation) method. We analyze and compare our method with a diverse set of baseline model adaptation methods (including state-of-the-art methods for pretrained language models). Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across 20 image classification datasets under the few-shot setting and 7 image classification datasets under the full-shot setting.
Published: 2022

18. Learning by Ignoring, with Application to Domain Adaptation

Author: Zhao, Xingchen, He, Xuehai, and Xie, Pengtao
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Learning by ignoring, which identifies less important things and excludes them from the learning process, is broadly practiced in human learning and has shown ubiquitous effectiveness. There have been psychological studies showing that learning to ignore certain things is a powerful tool for helping people focus. In this paper, we explore whether this useful human learning methodology can be borrowed to improve machine learning. We propose a novel machine learning framework referred to as learning by ignoring (LBI). Our framework automatically identifies pretraining data examples that have large domain shift from the target distribution by learning an ignoring variable for each example and excludes them from the pretraining process. We formulate LBI as a three-level optimization framework where three learning stages are involved: pretraining by minimizing the losses weighted by ignoring variables; finetuning; updating the ignoring variables by minimizing the validation loss. A gradient-based algorithm is developed to efficiently solve the three-level optimization problem in LBI. Experiments on various datasets demonstrate the effectiveness of our framework.
Published: 2020

19. Pathological Visual Question Answering

Author: He, Xuehai, Cai, Zhuo, Wei, Wenlan, Zhang, Yichen, Mou, Luntian, Xing, Eric, and Xie, Pengtao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology (ABP)? To build such a system, three challenges need to be addressed. First, we need to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Due to privacy concerns, pathology images are usually not publicly available. Besides, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. The second challenge is: since it is difficult to hire highly experienced pathologists to create pathology visual questions and answers, the resulting pathology VQA dataset may contain errors. Training pathology VQA models using these noisy or even erroneous data will lead to problematic models that cannot generalize well on unseen images. The third challenge is: the medical concepts and knowledge covered in pathology question-answer (QA) pairs are very diverse while the number of QA pairs available for model training is limited. How to learn effective representations of diverse medical concepts based on limited data is technically demanding. In this paper, we aim to address these three challenges. To the best of our knowledge, our work is the first to address the pathology VQA problem. To deal with the lack of a publicly available pathology VQA dataset, we create the PathVQA dataset. To address the second challenge, we propose a learning-by-ignoring approach. To address the third challenge, we propose to use cross-modal self-supervised learning. We perform experiments on our created PathVQA dataset and the results demonstrate the effectiveness of our proposed learning-by-ignoring method and cross-modal self-supervised learning methods. Comment: arXiv admin note: text overlap with arXiv:2003.10286
Published: 2020

20. Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms

Author: Yang, Xingyi, He, Xuehai, Liang, Yuxiao, Yang, Yue, Zhang, Shanghang, and Xie, Pengtao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Pretraining has become a standard technique in computer vision and natural language processing, which usually helps to improve performance substantially. Previously, the most dominant pretraining method was transfer learning (TL), which uses labeled data to learn a good representation network. Recently, a new pretraining approach -- self-supervised learning (SSL) -- has demonstrated promising results on a wide range of applications. SSL does not require annotated labels. It is conducted purely on input data by solving auxiliary tasks defined on the input data examples. Currently reported results show that SSL outperforms TL in certain applications, and the other way around in others. There is not yet a clear understanding of what properties of data and tasks make one approach outperform the other. Without an informed guideline, ML researchers have to try both methods to find out empirically which one is better, which is usually time-consuming. In this work, we aim to address this problem. We perform a comprehensive comparative study between SSL and TL regarding which one works better under different properties of data and tasks, including domain difference between source and target tasks, the amount of pretraining data, class imbalance in source data, and usage of target data for additional pretraining. The insights distilled from our comparative studies can help ML researchers decide which method to use based on the properties of their applications.
Published: 2020

21. On the Generation of Medical Dialogues for COVID-19

Author: Yang, Wenmian, Zeng, Guangtao, Tan, Bowen, Ju, Zeqian, Chakravorty, Subrato, He, Xuehai, Chen, Shu, Yang, Xingyi, Wu, Qingyang, Yu, Zhou, Xing, Eric, and Xie, Pengtao
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: During the COVID-19 pandemic, people experiencing COVID-19-related symptoms or exposed to risk factors have a pressing need to consult doctors. Due to hospital closures, many consulting services have been moved online. Because of the shortage of medical professionals, many people cannot receive online consultations in a timely manner. To address this problem, we aim to develop a medical dialogue system that can provide COVID-19-related consultations. We collected two dialogue datasets -- CovidDialog -- (in English and Chinese, respectively) containing conversations between doctors and patients about COVID-19. On these two datasets, we train several dialogue generation models based on Transformer, GPT, and BERT-GPT. Since the two COVID-19 dialogue datasets are small and thus carry a high risk of overfitting, we leverage transfer learning to mitigate data deficiency. Specifically, we take Transformer, GPT, and BERT-GPT models pretrained on dialogue datasets and other large-scale texts, then finetune them on our CovidDialog tasks. We perform both automatic and human evaluation of responses generated by these models. The results show that the generated responses are promising in being doctor-like, relevant to the conversation history, and clinically informative. The data and code are available at https://github.com/UCSD-AI4H/COVID-Dialogue.
Published: 2020

22. MedDialog: Two Large-scale Medical Dialogue Datasets

Author: He, Xuehai, Chen, Shu, Ju, Zeqian, Dong, Xiangyu, Fang, Hongchao, Wang, Sicheng, Yang, Yue, Zeng, Jiaqi, Zhang, Ruisi, Zhang, Ruoyu, Zhou, Meng, Zhu, Penghui, and Xie, Pengtao
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: Medical dialogue systems are promising in assisting in telemedicine to increase access to healthcare services, improve the quality of patient care, and reduce medical costs. To facilitate the research and development of medical dialogue systems, we build two large-scale medical dialogue datasets: MedDialog-EN and MedDialog-CN. MedDialog-EN is an English dataset containing 0.3 million conversations between patients and doctors and 0.5 million utterances. MedDialog-CN is a Chinese dataset containing 1.1 million conversations and 4 million utterances. To the best of our knowledge, MedDialog-(EN,CN) are the largest medical dialogue datasets to date. The dataset is available at https://github.com/UCSD-AI4H/Medical-Dialogue-System
Published: 2020

23. COVID-CT-Dataset: A CT Scan Dataset about COVID-19

Author: Yang, Xingyi, He, Xuehai, Zhao, Jinyu, Zhang, Yichen, Zhang, Shanghang, and Xie, Pengtao
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing, Statistics - Machine Learning
Abstract: During the COVID-19 outbreak, computed tomography (CT) is a useful means of diagnosing COVID-19 patients. Due to privacy issues, publicly available COVID-19 CT datasets are highly difficult to obtain, which hinders the research and development of AI-powered diagnosis methods of COVID-19 based on CTs. To address this issue, we build an open-sourced dataset -- COVID-CT, which contains 349 COVID-19 CT images from 216 patients and 463 non-COVID-19 CTs. The utility of this dataset is confirmed by a senior radiologist who has been diagnosing and treating COVID-19 patients since the outbreak of this pandemic. We also perform experimental studies which further demonstrate that this dataset is useful for developing AI-based diagnosis models of COVID-19. Using this dataset, we develop diagnosis methods based on multi-task learning and self-supervised learning that achieve an F1 of 0.90, an AUC of 0.98, and an accuracy of 0.89. According to the senior radiologist, models with such performance are good enough for clinical usage. The data and code are available at https://github.com/UCSD-AI4H/COVID-CT
Published: 2020

24. PathVQA: 30000+ Questions for Medical Visual Question Answering

Author: He, Xuehai, Zhang, Yichen, Mou, Luntian, Xing, Eric, and Xie, Pengtao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Is it possible to develop an "AI Pathologist" to pass the board-certified examination of the American Board of Pathology? To achieve this goal, the first step is to create a visual question answering (VQA) dataset where the AI agent is presented with a pathology image together with a question and is asked to give the correct answer. Our work makes the first attempt to build such a dataset. Unlike creating general-domain VQA datasets, where the images are widely accessible and many crowdsourcing workers are available and capable of generating question-answer pairs, developing a medical VQA dataset is much more challenging. First, due to privacy concerns, pathology images are usually not publicly available. Second, only well-trained pathologists can understand pathology images, but they barely have time to help create datasets for AI research. To address these challenges, we resort to pathology textbooks and online digital libraries. We develop a semi-automated pipeline to extract pathology images and captions from textbooks and generate question-answer pairs from captions using natural language processing. We collect 32,799 open-ended questions from 4,998 pathology images, where each question is manually checked to ensure correctness. To the best of our knowledge, this is the first dataset for pathology VQA. Our dataset will be released publicly to promote research in medical VQA.
Published: 2020

25. Spatio-temporal Evolution of Interactive Coercing Relationship between Tourism Resource Development and Landscape Ecological Security: A Case Study of the Guizhou Section of Chishui River Basin.

Author: HE Xuehai, LONG Maoxing, and HUANG Dongmei
Subjects: ENVIRONMENTAL security, SPATIOTEMPORAL processes, WATERSHEDS, LANDSCAPE protection, TOURISM, SUSTAINABLE tourism
Abstract: There is an obvious interactive relationship between tourism resource development and landscape ecological security, and clarifying the specific mechanism between the two is of great significance for building a solid ecological security barrier and achieving high-quality tourism development. Taking the Guizhou section of the Chishui River Basin as an example, this paper first constructed two grid-scale models to analyze the intensity of tourism resource development and landscape ecological security, and then used a dual index model to explore the spatio-temporal evolution of the interactive coercing relationship between the two from 2010 to 2020. The results are as follows: (1) The development intensity of tourism resources in the study area shows a continuous increase. As areas with high-level development of tourism resources expand rapidly, the development pattern changes from "punctiform development" to "zonal development", gradually forming a spatial pattern with the core scenic area as the growth pole and the northeast-southwest and northwest-southeast axes as the two axes. (2) The overall level of landscape ecological security in the study area is high and stable, and areas with medium and high levels continue to increase, forming two landscape ecological security axes (northeast-southwest and northwest-southeast), but with a fluctuating development trend. (3) The overall collaborative development of tourism resource development and landscape ecological security in the study area is improving, forming positive collaborative areas with the northwest and southwest areas as the focuses and the dual axis of landscape ecology as the link. However, there are differences in the specific regional collaboration paths: the southwest and the concentrated contiguous areas with weak ecological backgrounds face greater pressure on ecological landscape protection before realizing positive synergy, while central areas on the dual axis, where tourism resource development is at the primary and intermediate levels, are more likely to be affected by the rate of tourism resource development. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

26. Parameter-Efficient Model Adaptation for Vision Transformers

Author: He, Xuehai, primary, Li, Chunyuan, additional, Zhang, Pengchuan, additional, Yang, Jianwei, additional, and Wang, Xin Eric, additional
Published: 2023
Full Text: View/download PDF

27. Discriminative Diffusion Models as Few-shot Vision and Language Learners

Author: He, Xuehai, Feng, Weixi, Fu, Tsu-Jui, Jampani, Varun, Akula, Arjun, Narayana, Pradyumna, Basu, Sugato, Wang, William Yang, and Wang, Xin Eric
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models, such as Stable Diffusion, have shown incredible performance on text-to-image generation. Since text-to-image generation often requires models to generate visual concepts with fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information and fine-tunes the model via attention-based prompt learning to perform image-text matching. By comparing DSD with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of using pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
Published: 2023

28. Multimodal Graph Transformer for Multimodal Question Answering

Author: He, Xuehai, primary and Wang, Xin, additional
Published: 2023
Full Text: View/download PDF

29. Self-supervised learning for macromolecular structure classification based on cryo-electron tomograms

Author: Gupta, Tarun, primary, He, Xuehai, additional, Uddin, Mostofa Rafid, additional, Zeng, Xiangrui, additional, Zhou, Andrew, additional, Zhang, Jing, additional, Freyberg, Zachary, additional, and Xu, Min, additional
Published: 2022
Full Text: View/download PDF

30. Molecular Assembly of Biomimetic Systems

Author: Junbai Li, Qiang He, Xuehai Yan
Published: 2010

31. CPL: Counterfactual Prompt Learning for Vision and Language Models

Author: He, Xuehai, primary, Yang, Diji, additional, Feng, Weixi, additional, Fu, Tsu-Jui, additional, Akula, Arjun, additional, Jampani, Varun, additional, Narayana, Pradyumna, additional, Basu, Sugato, additional, Wang, William Yang, additional, and Wang, Xin, additional
Published: 2022
Full Text: View/download PDF

32. Learning by Ignoring, with Application to Domain Adaptation

Author: Zhao, Xingchen, primary, He, Xuehai, primary, and Xie, Pengtao, primary
Published: 2021
Full Text: View/download PDF

33. On the Generation of Medical Dialogs for COVID-19

Author: Zhou, Meng, primary, Li, Zechen, additional, Tan, Bowen, additional, Zeng, Guangtao, additional, Yang, Wenmian, additional, He, Xuehai, additional, Ju, Zeqian, additional, Chakravorty, Subrato, additional, Chen, Shu, additional, Yang, Xingyi, additional, Zhang, Yichen, additional, Wu, Qingyang, additional, Yu, Zhou, additional, Xu, Kun, additional, Xing, Eric, additional, and Xie, Pengtao, additional
Published: 2021
Full Text: View/download PDF

34. Towards Visual Question Answering on Pathology Images

Author: He, Xuehai, primary, Cai, Zhuo, additional, Wei, Wenlan, additional, Zhang, Yichen, additional, Mou, Luntian, additional, Xing, Eric, additional, and Xie, Pengtao, additional
Published: 2021
Full Text: View/download PDF

35. Pathological Visual Question Answering

Author: He, Xuehai, primary, Cai, Zhuo, primary, Wei, Wenlan, primary, Zhang, Yichen, primary, Mou, Luntian, primary, Xing, Eric, primary, and Xie, Pengtao, primary
Published: 2020
Full Text: View/download PDF

36. Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms

Author: Yang, Xingyi, primary, He, Xuehai, primary, Liang, Yuxiao, primary, Yang, Yue, primary, Zhang, Shanghang, primary, and Xie, Pengtao, primary
Published: 2020
Full Text: View/download PDF

37. On the Generation of Medical Dialogues for COVID-19

Author: Yang, Wenmian, primary, Zeng, Guangtao, additional, Tan, Bowen, additional, Ju, Zeqian, additional, Chakravorty, Subrato, additional, He, Xuehai, additional, Chen, Shu, additional, Yang, Xingyi, additional, Wu, Qingyang, additional, Yu, Zhou, additional, Xing, Eric, additional, and Xie, Pengtao, additional
Published: 2020
Full Text: View/download PDF

38. On the Generation of Medical Dialogues for COVID19

Author: Yang, Wenmian, primary, Zeng, Guangtao, primary, Tan, Bowen, primary, Ju, Zeqian, primary, Chakravorty, Subrato, primary, He, Xuehai, primary, Chen, Shu, primary, Yang, Xingyi, primary, Wu, Qingyang, primary, Yu, Zhou, primary, Xing, Eric, primary, and Xie, Pengtao, primary
Published: 2020
Full Text: View/download PDF

39. Sample-Efficient Deep Learning for COVID-19 Diagnosis Based on CT Scans

Author: He, Xuehai, primary, Yang, Xingyi, additional, Zhang, Shanghang, additional, Zhao, Jinyu, additional, Zhang, Yichen, additional, Xing, Eric, additional, and Xie, Pengtao, additional
Published: 2020
Full Text: View/download PDF

40. Learned Turbo-type Affine Rank Minimization

Author: He, Xuehai, primary, Yuan, Xiaojun, additional, and Xue, Zhipeng, additional
Published: 2019
Full Text: View/download PDF

41. Learned Turbo Message Passing for Affine Rank Minimization and Compressed Robust Principal Component Analysis

Author: He, Xuehai, primary, Xue, Zhipeng, additional, and Yuan, Xiaojun, additional
Published: 2019
Full Text: View/download PDF

42. Comprehensive model for managing water resources in the Baotou City, Inner Mongolia

Author: Wei Jiahua, Shao Jingli, He Xuehai, and Shen Xioke
Subjects: Water resources, Geography, Water resource management, Inner mongolia
Abstract: Based on the analysis of water resources and water supply conditions of Baotou City, a multi-objective and comprehensive model for managing water resources is established through the selection of decision variables, analysis of the balance of water resources demand and supply, and the set-up of the managing districts. The optimal distribution of groundwater is strongly emphasised, and a response matrix is used to express the water-level constraint in the model. The model is solved using linear objective programming. The analytical results indicate that the optimised scheme has excellent social, economic, and environmental benefits, and can be used as a reference for the development and planning of Baotou City.
Published: 2000
Full Text: View/download PDF

43. Characteristics of two-directional univariate wavelet packets

Author: Chen, Qingjiang, primary and He, Xuehai, additional
Published: 2010
Full Text: View/download PDF
