Author: "Han, Xu" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Han, Xu"' showing total 13,602 results

1. Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Author: Luo, Yuqi, Song, Chenyang, Han, Xu, Chen, Yingfa, Xiao, Chaojun, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning, I.2.7
Abstract: Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-$p\%$ sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., $1-\mathrm{sparsity\ ratio}$) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
Comment: 23 pages, 13 figures, 6 tables
Published: 2024

2. Real-Time Text Detection with Similar Mask in Traffic, Industrial, and Natural Scenes

Author: Han, Xu, Gao, Junyu, Yang, Chuang, Yuan, Yuan, and Wang, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text in intelligent transportation scenes carries a large amount of information. Fully harnessing this information is one of the critical drivers for advancing intelligent transportation. Unlike in general scenes, detecting text in transportation scenes imposes extra demands, such as fast inference speed, in addition to high accuracy. Most existing real-time text detection methods are based on the shrink mask, which loses some geometric semantic information and needs complex post-processing. In addition, previous methods usually focus on correct output, ignoring feature correction and lacking guidance during the intermediate process. To this end, we propose an efficient multi-scene text detector that contains an effective text representation, the similar mask (SM), and a feature correction module (FCM). Unlike previous methods, the former aims to preserve the geometric information of the instances as much as possible. Its post-processing saves 50% of the time, accurately and efficiently reconstructing text contours. The latter encourages false positive features to move away from the positive feature center, optimizing the predictions from the feature level. Some ablation studies demonstrate the efficiency of the SM and the effectiveness of the FCM. Moreover, the deficiency of existing traffic datasets (such as low-quality annotation or the unavailability of closed-source data) motivated us to collect and annotate a traffic text dataset, which introduces motion blur. In addition, to validate the scene robustness of the SM-Net, we conduct experiments on traffic, industrial, and natural scene datasets. Extensive experiments verify that it achieves state-of-the-art (SOTA) performance on several benchmarks. The code and dataset are available at: https://github.com/fengmulin/SMNet
Published: 2024

3. Progressive Compositionality In Text-to-Image Generative Models

Author: Han, Xu, Jin, Linghao, Liu, Xiaofeng, and Liang, Paul Pu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges by optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios and harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, ConPair, consisting of 15k pairs of high-quality contrastive images. These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose EvoGen, a new multi-stage curriculum for contrastive learning of diffusion models. Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks.
Published: 2024

4. The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion

Author: Han, Xu, Sun, Yuancheng, Chen, Kai, Liu, Kang, and Ye, Qiwei
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Coarse-grained (CG) molecular dynamics simulations offer computational efficiency for exploring protein conformational ensembles and thermodynamic properties. Though coarse representations enable large-scale simulations across extended temporal and spatial ranges, the sacrifice of atomic-level details limits their utility in tasks such as ligand docking and protein-protein interaction prediction. Backmapping, the process of reconstructing all-atom structures from coarse-grained representations, is crucial for recovering these fine details. While recent machine learning methods have made strides in protein structure generation, challenges persist in reconstructing diverse atomistic conformations that maintain geometric accuracy and chemical validity. In this paper, we present Latent Diffusion Backmapping (LDB), a novel approach leveraging denoising diffusion within latent space to address these challenges. By combining discrete latent encoding with diffusion, LDB bypasses the need for equivariant and internal coordinate manipulation, significantly simplifying the training and sampling processes as well as facilitating better and wider exploration in configuration space. We evaluate LDB's state-of-the-art performance on three distinct protein datasets, demonstrating its ability to efficiently reconstruct structures with high structural accuracy and chemical validity. Moreover, LDB shows exceptional versatility in capturing diverse protein ensembles, highlighting its capability to explore intricate conformational spaces. Our results position LDB as a powerful and scalable approach for backmapping, effectively bridging the gap between CG simulations and atomic-level analyses in computational biology.
Comment: Paper under review
Published: 2024

5. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Author: Yu, Shi, Tang, Chaoyue, Xu, Bokai, Cui, Junbo, Ran, Junhao, Yan, Yukun, Liu, Zhenghao, Wang, Shuo, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25-39% end-to-end performance gain over traditional text-based RAG pipelines. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag
Published: 2024

6. LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models

Author: Zhou, Zihan, Li, Chong, Chen, Xinyi, Wang, Shuo, Chao, Yu, Li, Zhili, Wang, Haoyu, An, Rongqiao, Shi, Qi, Tan, Zhixing, Han, Xu, Shi, Xiaodong, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Enlarging the context window of large language models (LLMs) has become a crucial research area, particularly for applications involving extremely long texts. In this work, we propose a novel training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding. The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output. The main challenge for divide-and-conquer long text processing frameworks lies in the risk of losing essential long-range information when splitting the document, which can lead the model to produce incomplete or incorrect answers based on the segmented texts. Disrupted long-range information can be classified into two categories: inter-chunk dependency and inter-chunk conflict. We design a structured information protocol to better cope with inter-chunk dependency and an in-context confidence calibration mechanism to resolve inter-chunk conflicts. Experimental results demonstrate that LLM$\times$MapReduce can outperform representative open-source and commercial long-context LLMs, and is applicable to several different models.
Comment: Work in Progress. Code: https://github.com/thunlp/LLMxMapReduce
Published: 2024

7. Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation

Author: Wang, Ruobing, Zha, Daren, Yu, Shi, Zhao, Qingfei, Chen, Yuxuan, Wang, Yixuan, Wang, Shuo, Yan, Yukun, Liu, Zhenghao, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Retrieval-Augmented Generation (RAG) mitigates issues of the factual errors and hallucinated outputs generated by Large Language Models (LLMs) in open-domain question-answering tasks (OpenQA) via introducing external knowledge. For complex QA, however, existing RAG methods use LLMs to actively predict retrieval timing and directly use the retrieved information for generation, regardless of whether the retrieval timing accurately reflects the actual information needs, or sufficiently considers prior retrieved knowledge, which may result in insufficient information gathering and interaction, yielding low-quality answers. To address these, we propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks, which includes the iterative information collector, adaptive memory reviewer, and task-oriented generator, while following a new Retriever-and-Memory paradigm. Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes and updating them into the existing optimal knowledge structure, enhancing high-quality knowledge interactions. In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration. We conduct extensive experiments on five complex QA datasets, and the results demonstrate the superiority and effectiveness of our method and its components. The code and data are at https://github.com/thunlp/Adaptive-Note
Comment: 15 pages, 2 figures
Published: 2024

8. Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling

Author: Chen, Yingfa, Zhang, Xinrong, Hu, Shengding, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: One essential advantage of recurrent neural networks (RNNs) over transformer-based language models is their linear computational complexity with respect to the sequence length, which makes them much faster in handling long sequences during inference. However, most publicly available RNNs (e.g., Mamba and RWKV) are trained on sequences with fewer than 10K tokens, and their effectiveness in longer contexts remains largely unsatisfactory so far. In this paper, we study the cause of the inability to process long context for RNNs and suggest critical mitigations. We examine two practical concerns when applying state-of-the-art RNNs to long contexts: (1) the inability to extrapolate to inputs longer than the training length and (2) the upper bound of memory capacity. Addressing the first concern, we first investigate *state collapse* (SC), a phenomenon that causes severe performance degradation on sequence lengths not encountered during training. With controlled experiments, we attribute this to overfitting due to the recurrent state being overparameterized for the training length. For the second concern, we train a series of Mamba-2 models on long documents to empirically estimate the recurrent state capacity in language modeling and passkey retrieval. Then, three SC mitigation methods are proposed to improve Mamba-2's length generalizability, allowing the model to process more than 1M tokens without SC. We also find that the recurrent state capacity in passkey retrieval scales exponentially with the state size, and we empirically train a Mamba-2 370M with near-perfect passkey retrieval accuracy on 256K context length. This suggests a promising future for RNN-based long-context modeling.
Comment: 21 pages, 18 figures
Published: 2024

9. Exploring the Benefit of Activation Sparsity in Pre-training

Author: Zhang, Zhengyan, Xiao, Chaojun, Qin, Qiujieli, Lin, Yankai, Zeng, Zhiyuan, Han, Xu, Liu, Zhiyuan, Xie, Ruobing, Sun, Maosong, and Zhou, Jie
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between the Mixtures-of-Experts (MoE) based sparse training and the conventional dense training during the pre-training process, leveraging the efficiency of sparse training and avoiding the static activation correlation of sparse training. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, the models trained with SSD can be directly used as MoE models for sparse inference and achieve the same performance as dense models with up to $2\times$ faster inference speed. Codes are available at https://github.com/thunlp/moefication
Comment: ICML 2024
Published: 2024

10. Autoregressive Moving-average Attention Mechanism for Time Series Forecasting

Author: Lu, Jiecheng, Han, Xu, Sun, Yan, and Yang, Shihao
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
Published: 2024

11. Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

Author: Huang, Yuxiang, Yuan, Binhang, Han, Xu, Xiao, Chaojun, and Liu, Zhiyuan
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling the generation inference of LLMs to such long contexts incurs significant additional computation load, and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static-sized caches, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units along with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret, where the experimental results show that Locret outperforms the recent competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated contents -- Locret achieves over a 20x and 8x KV cache compression ratio compared to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality, and requiring little additional system optimizations.
Comment: Preprints
Published: 2024

12. Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

Author: Han, Xu, Gao, Junyu, Yang, Chuang, Yuan, Yuan, and Wang, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, the prediction of pixels is isolated without introducing pixel-feature interaction, which also influences the detection performance. To alleviate these problems, we propose a multi-information level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts to reduce the influence of noises. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes the scale information, enabling the model to distinguish varying scale texts effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, which perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling different scale texts and confirm the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show the proposed model outperforms existing state-of-the-art approaches on four public datasets.
Published: 2024

13. Spotlight Text Detector: Spotlight on Candidate Regions Like a Camera

Author: Han, Xu, Gao, Junyu, Yang, Chuang, Yuan, Yuan, and Wang, Qi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of geographically close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to restructure texts. However, the text kernel is an artificial object with incomplete semantic features that are prone to incorrect or missing detection. In addition, unlike general objects, the geometric features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes it difficult to detect them accurately. To address the above problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates efforts on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features for scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.
Published: 2024

14. Enabling Real-Time Conversations with Minimal Training Costs

Author: Xu, Wang, Wang, Shuo, Zhao, Weilin, Han, Xu, Yan, Yukun, Zhang, Yudi, Tao, Zhe, Liu, Zhiyuan, and Che, Wanxiang
Subjects: Computer Science - Computation and Language
Abstract: Large language models (LLMs) have demonstrated the ability to improve human efficiency through conversational interactions. Conventional LLM-powered dialogue systems, operating on a turn-based paradigm, preclude real-time interaction during response generation. To address this limitation, researchers have proposed duplex models. These models can dynamically adapt to user input, facilitating real-time interactive feedback. However, these methods typically require substantial computational resources to acquire the ability. To reduce overhead, this paper presents a new duplex decoding approach that enhances LLMs with duplex ability, requiring minimal additional training. Specifically, our method employs parallel decoding of queries and responses in conversations, effectively implementing a channel-division-multiplexing decoding strategy. Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
Comment: 7 pages, 6 figures, 1 table
Published: 2024

15. From Words to Wheels: Automated Style-Customized Policy Generation for Autonomous Driving

Author: Han, Xu, Chen, Xianda, Cai, Zhenghan, Cai, Pinlong, Zhu, Meixin, and Chu, Xiaowen
Subjects: Computer Science - Robotics
Abstract: Autonomous driving technology has witnessed rapid advancements, with foundation models improving interactivity and user experiences. However, current autonomous vehicles (AVs) face significant limitations in delivering command-based driving styles. Most existing methods either rely on predefined driving styles that require expert input or use data-driven techniques like Inverse Reinforcement Learning to extract styles from driving data. These approaches, though effective in some cases, face challenges: difficulty obtaining specific driving data for style matching (e.g., in Robotaxis), inability to align driving style metrics with user preferences, and limitations to pre-existing styles, restricting customization and generalization to new commands. This paper introduces Words2Wheels, a framework that automatically generates customized driving policies based on natural language user commands. Words2Wheels employs a Style-Customized Reward Function to generate a Style-Customized Driving Policy without relying on prior driving data. By leveraging large language models and a Driving Style Database, the framework efficiently retrieves, adapts, and generalizes driving styles. A Statistical Evaluation module ensures alignment with user preferences. Experimental results demonstrate that Words2Wheels outperforms existing methods in accuracy, generalization, and adaptability, offering a novel solution for customized AV driving behavior. Code and demo available at https://yokhon.github.io/Words2Wheels/
Comment: 6 pages, 7 figures
Published: 2024

16. From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents

Author: Yu, Jifan, Zhang, Zheyuan, Zhang-li, Daniel, Tu, Shangqing, Hao, Zhanxin, Li, Rui Miao, Li, Haoxuan, Wang, Yuanchun, Li, Hanming, Gong, Linlu, Cao, Jie, Lin, Jiayin, Zhou, Jinchang, Qin, Fei, Wang, Haohua, Jiang, Jianxiao, Deng, Lijun, Zhan, Yisi, Xiao, Chaojun, Dai, Xusheng, Yan, Xuan, Lin, Nianyi, Zhang, Nan, Ni, Ruixin, Dang, Yang, Hou, Lei, Zhang, Yu, Han, Xu, Li, Manli, Li, Juanzi, Liu, Zhiyuan, Liu, Huiqin, and Sun, Maosong
Subjects: Computer Science - Computers and Society, Computer Science - Computation and Language
Abstract: Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education.
Published: 2024

17. Configurable Foundation Models: Building LLMs from a Modular Perspective

Author: Xiao, Chaojun, Zhang, Zhengyan, Song, Chenyang, Jiang, Dazhi, Yao, Feng, Han, Xu, Wang, Xiaozhi, Wang, Shuo, Huang, Yufei, Lin, Guanyu, Chen, Yingfa, Zhao, Weilin, Tu, Yuge, Zhong, Zexuan, Zhang, Ao, Si, Chenglei, Moo, Khai Hao, Zhao, Chenyang, Chen, Huimin, Lin, Yankai, Liu, Zhiyuan, Shang, Jingbo, and Sun, Maosong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
Published: 2024

18. Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts

Author: Chen, Yingfa, Hu, Chenlong, Feng, Cong, Song, Chenyang, Yu, Shi, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: This study presents a multi-modal multi-granularity tokenizer specifically designed for analyzing ancient Chinese scripts, focusing on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Considering the complex hierarchical structure of ancient Chinese scripts, where a single character may be a combination of multiple sub-characters, our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. Moreover, to support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans. On the part-of-speech tagging task built on our dataset, using our tokenizer gives a 5.5% relative improvement in F1-score compared to mainstream sub-word tokenizers. Our work not only aids in further investigations of the specific script but also has the potential to advance research on other forms of ancient Chinese scripts., Comment: 12 pages, 3 figures
Published: 2024

19. RIDE: Boosting 3D Object Detection for LiDAR Point Clouds via Rotation-Invariant Analysis

Author: Wang, Zhaoxuan, Han, Xu, Liu, Hongxin, and Li, Xianzhi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The rotation robustness property has drawn much attention to point cloud analysis, whereas it still poses a critical challenge in 3D object detection. When subjected to arbitrary rotation, most existing detectors fail to produce expected outputs due to the poor rotation robustness. In this paper, we present RIDE, a pioneering exploration of Rotation-Invariance for the 3D LiDAR-point-based object DEtector, with the key idea of designing rotation-invariant features from LiDAR scenes and then effectively incorporating them into existing 3D detectors. Specifically, we design a bi-feature extractor that extracts (i) object-aware features, which are sensitive to rotation but preserve geometry well, and (ii) rotation-invariant features, which lose geometric information to a certain extent but are robust to rotation. These two kinds of features complement each other to decode 3D proposals that are robust to arbitrary rotations. Particularly, our RIDE is compatible and easy to plug into the existing one-stage and two-stage 3D detectors, and boosts both detection performance and rotation robustness. Extensive experiments on the standard benchmarks showcase that the mean average precision (mAP) and rotation robustness can be significantly boosted by integrating with our RIDE, with +5.6% mAP and 53% rotation robustness improvement on KITTI, +5.1% and 28% improvement correspondingly on nuScenes. The code will be available soon.
Published: 2024

20. More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Author: Tang, Yuan, Han, Xu, Li, Xianzhi, Yu, Qiao, Xu, Jinfeng, Hao, Yixue, Hu, Long, and Chen, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping allows us to seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.
Published: 2024

21. Data-Driven Parametrization of Molecular Mechanics Force Fields for Expansive Chemical Space Coverage

Author: Zheng, Tianze, Wang, Ailun, Han, Xu, Xia, Yu, Xu, Xingyuan, Zhan, Jiawei, Liu, Yu, Chen, Yang, Wang, Zhi, Wu, Xiaojie, Gong, Sheng, and Yan, Wen
Subjects: Computer Science - Machine Learning, Physics - Chemical Physics
Abstract: A force field is a critical component in molecular dynamics simulations for computational drug discovery. It must achieve high accuracy within the constraints of molecular mechanics' (MM) limited functional forms, which offer high computational efficiency. With the rapid expansion of synthetically accessible chemical space, traditional look-up table approaches face significant challenges. In this study, we address this issue using a modern data-driven approach, developing ByteFF, an Amber-compatible force field for drug-like molecules. To create ByteFF, we generated an expansive and highly diverse molecular dataset at the B3LYP-D3(BJ)/DZVP level of theory. This dataset includes 2.4 million optimized molecular fragment geometries with analytical Hessian matrices, along with 3.2 million torsion profiles. We then trained an edge-augmented, symmetry-preserving molecular graph neural network (GNN) on this dataset, employing a carefully optimized training strategy. Our model predicts all bonded and non-bonded MM force field parameters for drug-like molecules simultaneously across a broad chemical space. ByteFF demonstrates state-of-the-art performance on various benchmark datasets, excelling in predicting relaxed geometries, torsional energy profiles, and conformational energies and forces. Its exceptional accuracy and expansive chemical space coverage make ByteFF a valuable tool for multiple stages of computational drug discovery., Comment: ByteFF, a machine learning parametrized MMFF. Code available at https://github.com/bytedance/byteff
Published: 2024

22. Baby Bear: Seeking a Just Right Rating Scale for Scalar Annotations

Author: Han, Xu, Yu, Felix, Sedoc, Joao, and Van Durme, Benjamin
Subjects: Computer Science - Machine Learning, Computer Science - Human-Computer Interaction
Abstract: Our goal is a mechanism for efficiently assigning scalar ratings to each of a large set of elements. For example, "what percent positive or negative is this product review?" When sample sizes are small, prior work has advocated for methods such as Best Worst Scaling (BWS) as being more robust than direct ordinal annotation ("Likert scales"). Here we first introduce IBWS, which iteratively collects annotations through Best-Worst Scaling, resulting in robustly ranked crowd-sourced data. While effective, IBWS is too expensive for large-scale tasks. Using the results of IBWS as a best-desired outcome, we evaluate various direct assessment methods to determine what is both cost-efficient and best correlating to a large scale BWS annotation strategy. Finally, we illustrate in the domains of dialogue and sentiment how these annotations can support robust learning-to-rank models.
Published: 2024

23. FastFiD: Improve Inference Efficiency of Open Domain Question Answering via Sentence Selection

Author: Huang, Yufei, Han, Xu, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Open Domain Question Answering (ODQA) has been advancing rapidly in recent times, driven by significant developments in dense passage retrieval and pretrained language models. Current models typically incorporate the FiD framework, which is composed of a neural retriever alongside an encoder-decoder neural reader. In the answer generation process, the retriever will retrieve numerous passages (around 100 for instance), each of which is then individually encoded by the encoder. Subsequently, the decoder makes predictions based on these encoded passages. Nevertheless, this framework can be relatively time-consuming, particularly due to the extensive length of the gathered passages. To address this, we introduce FastFiD in this paper, a novel approach that executes sentence selection on the encoded passages. This aids in retaining valuable sentences while reducing the context length required for generating answers. Experiments on three commonly used datasets (Natural Questions, TriviaQA and ASQA) demonstrate that our method can enhance the inference speed by 2.3X-5.7X, while simultaneously maintaining the model's performance. Moreover, an in-depth analysis of the model's attention reveals that the selected sentences indeed hold a substantial contribution towards the final answer. The code is publicly available at https://github.com/thunlp/FastFiD., Comment: ACL 2024 Main Conference
Published: 2024

24. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Author: Yao, Yuan, Yu, Tianyu, Zhang, Ao, Wang, Chongyi, Cui, Junbo, Zhu, Hongji, Cai, Tianchi, Li, Haoyu, Zhao, Weilin, He, Zhihui, Chen, Qianyu, Zhou, Huarong, Zou, Zhensheng, Zhang, Haoye, Hu, Shengding, Zheng, Zhi, Zhou, Jie, Cai, Jie, Han, Xu, Zeng, Guoyang, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future., Comment: preprint
Published: 2024

25. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Author: Zhu, Kunlun, Luo, Yifan, Xu, Dingling, Wang, Ruobing, Yu, Shi, Wang, Shuo, Yan, Yukun, Liu, Zhenghao, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics, Completeness, Hallucination, and Irrelevance, to rigorously evaluate LLM-generated responses. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications., Comment: https://github.com/OpenBMB/RAGEval
Published: 2024

26. LIDIA: Precise Liver Tumor Diagnosis on Multi-Phase Contrast-Enhanced CT via Iterative Fusion and Asymmetric Contrastive Learning

Author: Huang, Wei, Liu, Wei, Zhang, Xiaoming, Yin, Xiaoli, Han, Xu, Li, Chunli, Gao, Yuan, Shi, Yu, Lu, Le, Zhang, Ling, Zhang, Lei, and Yan, Ke
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The early detection and precise diagnosis of liver tumors are tasks of critical clinical value, yet they pose significant challenges due to the high heterogeneity and variability of liver tumors. In this work, a precise LIver tumor DIAgnosis network on multi-phase contrast-enhanced CT, named LIDIA, is proposed for real-world scenarios. To fully utilize all available phases in contrast-enhanced CT, LIDIA first employs the iterative fusion module to aggregate variable numbers of image phases, thereby capturing the features of lesions at different phases for better tumor diagnosis. To effectively mitigate the high heterogeneity problem of liver tumors, LIDIA incorporates asymmetric contrastive learning to enhance the discriminability between different classes. To evaluate our method, we constructed a large-scale dataset comprising 1,921 patients and 8,138 lesions. LIDIA has achieved an average AUC of 93.6% across eight different types of lesions, demonstrating its effectiveness. Besides, LIDIA also demonstrated strong generalizability with an average AUC of 89.3% when tested on an external cohort of 828 patients., Comment: Accepted to MICCAI 2024
Published: 2024

27. Continual Learning for Adaptable Car-Following in Dynamic Traffic Environments

Author: Chen, Xianda, Tiu, PakHin, Han, Xu, Chen, Junjie, Wu, Yuanfei, Zheng, Xinhu, and Zhu, Meixin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The continual evolution of autonomous driving technology requires car-following models that can adapt to diverse and dynamic traffic environments. Traditional learning-based models often suffer from performance degradation when encountering unseen traffic patterns due to a lack of continual learning capabilities. This paper proposes a novel car-following model based on continual learning that addresses this limitation. Our framework incorporates Elastic Weight Consolidation (EWC) and Memory Aware Synapses (MAS) techniques to mitigate catastrophic forgetting and enable the model to learn incrementally from new traffic data streams. We evaluate the performance of the proposed model on the Waymo and Lyft datasets which encompass various traffic scenarios. The results demonstrate that the continual learning techniques significantly outperform the baseline model, achieving 0% collision rates across all traffic conditions. This research contributes to the advancement of autonomous driving technology by fostering the development of more robust and adaptable car-following models.
Published: 2024

28. The white-light superflares from cool stars in GWAC triggers

Author: Li, Guang-Wei, Wang, Liang, Yuan, Hai-Long, Xin, Li-Ping, Wang, Jing, Wu, Chao, Li, Hua-Li, Haerken, Hasitieer, Wang, Wei-Hua, Cai, Hong-Bo, Han, Xu-Hui, Xu, Yang, Huang, Lei, Lu, Xiao-Meng, Bai, Jian-Ying, Wang, Xiang-Yu, Dai, Zi-Gao, Liang, En-Wei, and Wei, Jian-Yan
Subjects: Astrophysics - Solar and Stellar Astrophysics
Abstract: M-type stars are the ones that flare most frequently, but how large their maximum flare energy can be is still unknown. We present 163 flares from 162 individual M2 through L1-type stars that triggered the GWAC, with flare energies ranging from $10^{32.2}$ to $10^{36.4}$ erg. The flare amplitudes range from $\triangle G = 0.84$ to $\sim 10$ mag. Flare energy increases with stellar surface temperature ($T_{\rm eff}$) but both $\triangle G$ and equivalent duration $\log_{10}(ED)$ seem to be independent of $T_{\rm eff}$. Combining periods detected from light curves of TESS and K2, spectra from LAMOST, SDSS and the 2.16 m Telescope, and the Gaia DR3 data, we found that these GWAC flare stars are young. For the stars that have spectra, we found that these stars are in or very near to the saturation region, and $\log_{10}(L_{\rm H\alpha}/L_{\rm bol})$ is lower for M7-L1 stars than for M2-M6 stars. We also studied the relation between GWAC flare bolometric energy $E_{\rm bol}$ and stellar hemispherical area $S$, and found that $\log_{10}E_{\rm bol}$ (in erg) increases with increasing $S$ (in cm$^2$), and the maximum flare energy $\log_{10}E_{\rm bol, max} \geqslant \log_{10}S + 14.25$. For M7-L1 stars, there seem to be other factors limiting their maximum flare energies in addition to stellar hemispherical area., Comment: 18 pages, 11 figures, 4 tables
Published: 2024

29. Light-weight Fine-tuning Method for Defending Adversarial Noise in Pre-trained Medical Vision-Language Models

Author: Han, Xu, Jin, Linghao, Ma, Xuezhe, and Liu, Xiaofeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Fine-tuning pre-trained Vision-Language Models (VLMs) has shown remarkable capabilities in medical image and textual depiction synergy. Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose the rectify adversarial noise (RAN) framework, a recipe designed to effectively defend against adversarial attacks and rectify the influence of upstream noise during fine-tuning.
Published: 2024

30. Hong-Ou-Mandel Interference with a Coexisting Clock using Transceivers for Synchronization over Deployed Fiber

Author: Ramesh, Anirudh, Reilly, Daniel R., Lee, Kim Fook, Moraw, Paul M., Chung, Joaquin, Islam, Md Shariful, Peña, Cristián, Han, Xu, Kettimuthu, Rajkumar, Kumar, Prem, and Kanter, Gregory
Subjects: Quantum Physics, Physics - Optics
Abstract: Interference between independently generated photons is a key step towards distributing entanglement over long distances, but it requires synchronization between the distantly-located photon sources. Synchronizing the clocks of such photon sources using coexisting two-way classical optical communications over the same fiber that transports the quantum photonic signals is a promising approach for achieving photon-photon interference over long distances, enabling entanglement distribution for quantum networking using the deployed fiber infrastructure. Here, we demonstrate photon-photon interference by observing the Hong-Ou-Mandel dip between two distantly-located sources: a weak coherent state source obtained by attenuating the output of a laser and a heralded single-photon source. We achieve a maximum dip visibility of $0.58 \pm 0.04$ when the two sources are connected via $4.3$ km of deployed fiber. Dip visibilities $>0.5$ are nonclassical and a first step towards achieving teleportation over the deployed fiber infrastructure. In our experiment, the classical optical communication is achieved with $-21$ dBm of optical signal launch power, which is used to synchronize the clocks in the two independent, distantly-located photon sources. The impact of spontaneous Raman scattering from the classical optical signals is mitigated by appropriate choice of the quantum and classical channel wavelengths. All equipment used in our experiment (the photon sources and the synchronization setup) is commercially available. Finally, our experiment represents a scalable approach to enabling practical quantum networking with commercial equipment and coexistence with classical communications in optical fiber.
Published: 2024

31. EditFollower: Tunable Car Following Models for Customizable Adaptive Cruise Control Systems

Author: Chen, Xianda, Han, Xu, Zhu, Meixin, Chu, Xiaowen, Tiu, PakHin, Zheng, Xinhu, and Wang, Yinhai
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: In the realm of driving technologies, fully autonomous vehicles have not been widely adopted yet, making advanced driver assistance systems (ADAS) crucial for enhancing driving experiences. Adaptive Cruise Control (ACC) emerges as a pivotal component of ADAS. However, current ACC systems often employ fixed settings, failing to intuitively capture drivers' social preferences and leading to potential function disengagement. To overcome these limitations, we propose the Editable Behavior Generation (EBG) model, a data-driven car-following model that allows for adjusting driving discourtesy levels. The framework integrates diverse courtesy calculation methods into long short-term memory (LSTM) and Transformer architectures, offering a comprehensive approach to capture nuanced driving dynamics. By integrating various discourtesy values during the training process, our model generates realistic agent trajectories with different levels of courtesy in car-following behavior. Experimental results on the HighD and Waymo datasets showcase a reduction in Mean Squared Error (MSE) of spacing and MSE of speed compared to baselines, establishing style controllability. To the best of our knowledge, this work represents the first data-driven car-following model capable of dynamically adjusting discourtesy levels. Our model provides valuable insights for the development of ACC systems that take into account drivers' social preferences.
Published: 2024

32. Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models

Author: Zhang, Xinrong, Chen, Yingfa, Hu, Shengding, Han, Xu, Xu, Zihang, Xu, Yuanwei, Zhao, Weilin, Sun, Maosong, and Liu, Zhiyuan
Subjects: Computer Science - Computation and Language
Abstract: As large language models (LLMs) increasingly permeate daily lives, there is a growing demand for real-time interactions that mirror human conversations. Traditional turn-based chat systems driven by LLMs prevent users from verbally interacting with the system while it is generating responses. To overcome these limitations, we adapt existing LLMs to duplex models so that these LLMs can listen for users while generating output and dynamically adjust themselves to provide users with instant feedback. Specifically, we divide the queries and responses of conversations into several time slices and then adopt a time-division-multiplexing (TDM) encoding-decoding strategy to pseudo-simultaneously process these slices. Furthermore, to make LLMs proficient enough to handle real-time conversations, we build a fine-tuning dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions. Our experiments show that although the queries and responses of conversations are segmented into incomplete slices for processing, LLMs can preserve their original performance on standard benchmarks with a few fine-tuning steps on our dataset. Automatic and human evaluation indicate that duplex models make user-AI interactions more natural and human-like, and greatly improve user satisfaction compared to vanilla LLMs. Our duplex model and dataset will be released.
Published: 2024

33. Fair Text to Medical Image Diffusion Model with Subgroup Distribution Aligned Tuning

Author: Han, Xu, Fan, Fangfang, Rong, Jingzhao, and Liu, Xiaofeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The text to medical image (T2MedI) with latent diffusion model has great potential to alleviate the scarcity of medical imaging data and explore the underlying appearance distribution of lesions in a specific patient status description. However, as with text-to-natural-image models, we show that the T2MedI model can also be biased toward some subgroups and overlook the minority ones in the training set. In this work, we first build a T2MedI model based on the pre-trained Imagen model, which has the fixed contrastive language-image pre-training (CLIP) text encoder, while its decoder has been fine-tuned on medical images from the Radiology Objects in COntext (ROCO) dataset. Its gender bias is analyzed qualitatively and quantitatively. To address this issue, we propose to fine-tune the T2MedI toward the target application dataset to align their sensitive subgroups distribution probability. Specifically, the alignment loss for fine-tuning is guided by an off-the-shelf sensitivity-subgroup classifier to match the classification probability between the generated images and the expected target dataset. In addition, the image quality is maintained by a CLIP-consistency regularization term following a knowledge distillation scheme. For evaluation, we set the target dataset to be enhanced as the BraTS18 dataset, and trained a brain magnetic resonance (MR) slice-based gender classifier from it. With our method, the generated MR image can markedly reduce the inconsistency with the gender proportion in the BraTS18 dataset.
Published: 2024

34. Generating and Evolving Reward Functions for Highway Driving with Large Language Models

Author: Han, Xu, Yang, Qiannan, Chen, Xianda, Chu, Xiaowen, and Zhu, Meixin
Subjects: Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Computer Science - Robotics
Abstract: Reinforcement Learning (RL) plays a crucial role in advancing autonomous driving technologies by maximizing reward functions to achieve the optimal policy. However, crafting these reward functions has been a complex, manual process in many practices. To reduce this complexity, we introduce a novel framework that integrates Large Language Models (LLMs) with RL to improve reward function design in autonomous driving. This framework utilizes the coding capabilities of LLMs, proven in other areas, to generate and evolve reward functions for highway scenarios. The framework starts with instructing LLMs to create an initial reward function code based on the driving environment and task descriptions. This code is then refined through iterative cycles involving RL training and LLMs' reflection, which benefits from their ability to review and improve the output. We have also developed a specific prompt template to improve LLMs' understanding of complex driving simulations, ensuring the generation of effective and error-free code. Our experiments in a highway driving simulator across three traffic configurations show that our method surpasses expert handcrafted reward functions, achieving a 22% higher average success rate. This not only indicates safer driving but also suggests significant gains in development productivity., Comment: 7 pages, 6 figures
Published: 2024

35. Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Author: Ping, Bowen, Wang, Shuo, Wang, Hanqing, Han, Xu, Xu, Yuzhuang, Yan, Yukun, Chen, Yun, Chang, Baobao, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability., Comment: 12 pages
Published: 2024

36. Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

Author: Sun, Ao, Zhao, Weilin, Han, Xu, Yang, Cheng, Zhang, Xinrong, Liu, Zhiyuan, Shi, Chuan, and Sun, Maosong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The emergence of large language models (LLMs) relies heavily on distributed training strategies, among which pipeline parallelism plays a crucial role. As LLMs' training sequence length extends to 32k or even 128k, the current pipeline parallel methods face severe bottlenecks, including high memory footprints and substantial pipeline bubbles, greatly hindering model scalability and training throughput. To enhance memory efficiency and training throughput, in this work, we introduce an efficient sequence-level one-forward-one-backward (1F1B) pipeline scheduling method tailored for training LLMs on long sequences named Seq1F1B. Seq1F1B decomposes batch-level schedulable units into finer sequence-level units, reducing bubble size and memory footprint. Considering that Seq1F1B may produce slight extra bubbles if sequences are split evenly, we design a computation-wise strategy to partition input sequences and mitigate this side effect. Compared to competitive pipeline baseline methods such as Megatron 1F1B pipeline parallelism, our method achieves higher training throughput with less memory footprint. Notably, Seq1F1B efficiently trains an LLM with 30B parameters on sequences up to 64k using 64 NVIDIA A100 GPUs without recomputation strategies, a feat unachievable with existing methods. Our source code is based on Megatron-LM, and is now available at: https://github.com/MayDomine/Seq1F1B.git., Comment: 12 pages, 4 figures, 6 tables
Published: 2024

37. Identifying User Profile by Incorporating Self-Attention Mechanism based on CSDN Data Set

Author: Lu, Junru, Chen, Le, Meng, Kongming, Wang, Fengyi, Xiang, Jun, Chen, Nuo, Han, Xu, and Li, Binyang
Subjects: Information technology, T58.5-58.64
Abstract: With the popularity of social media, there has been increasing interest in user profiling and its applications. This paper presents our system, UIR-SIST, for the User Profiling Technology Evaluation Campaign in SMP CUP 2017. UIR-SIST aims to complete three tasks: keyword extraction from blogs, user interest labeling, and user growth value prediction. To this end, we first extract keywords from a user's blog, including the blog itself, blogs on the same topic, and other blogs published by the same user. Then a unified neural network model is constructed based on a convolutional neural network (CNN) for user interest tagging. Finally, we adopt a stacking model for predicting user growth value. We ultimately placed sixth, with evaluation scores of 0.563, 0.378 and 0.751 on the three tasks, respectively.
Published: 2019
Full Text: View/download PDF

38. Evidence for Multiferroicity in Single-Layer CuCrSe$_2$

Author: Sun, Zhenyu, Su, Yueqi, Zhi, Aomiao, Gao, Zhicheng, Han, Xu, Wu, Kang, Bao, Lihong, Huang, Yuan, Shi, Youguo, Bai, Xuedong, Cheng, Peng, Chen, Lan, Wu, Kehui, Tian, Xuezeng, Wu, Changzheng, and Feng, Baojie
Subjects: Condensed Matter - Materials Science
Abstract: Multiferroic materials, which simultaneously exhibit ferroelectricity and magnetism, have attracted substantial attention due to their fascinating physical properties and potential technological applications. With the trend towards device miniaturization, there is an increasing demand for the persistence of multiferroicity in single-layer materials at elevated temperatures. Here, we report high-temperature multiferroicity in single-layer CuCrSe$_2$, which hosts room-temperature ferroelectricity and 120 K ferromagnetism. Notably, the ferromagnetic coupling in single-layer CuCrSe$_2$ is enhanced by the ferroelectricity-induced orbital shift of Cr atoms, which is distinct from both types I and II multiferroicity. These findings are supported by a combination of second-harmonic generation, piezo-response force microscopy, scanning transmission electron microscopy, magnetic, and Hall measurements. Our research not only provides an exemplary platform for delving into intrinsic magnetoelectric interactions at the single-layer limit but also sheds light on the potential development of electronic and spintronic devices utilizing two-dimensional multiferroics.
Published: 2024
Full Text: View/download PDF

39. MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors

Author: Tang, Yuan, Han, Xu, Li, Xianzhi, Yu, Qiao, Hao, Yixue, Hu, Long, and Chen, Min
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLM requires expensive training costs, typically in hundreds of GPU-hours on A100, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which can leverage the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, which is up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks, with significantly cheaper training costs. Notably, MiniGPT-3D gains an 8.12 increase on GPT-4 evaluation score for the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800. We are the first to explore the efficient 3D-LLM, offering new insights to the community. Code and weights are available at https://github.com/TangYuan96/MiniGPT-3D., Comment: 17 pages, 9 figures
Published: 2024

40. Mamba3D: Enhancing Local Features for 3D Point Cloud Analysis via State Space Model

Author: Han, Xu, Tang, Yuan, Wang, Zhaoxuan, and Li, Xianzhi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity. Our code and weights are available at https://github.com/xhanxu/Mamba3D., Comment: ACM MM 2024. Code and weights are available at https://github.com/xhanxu/Mamba3D
Published: 2024

41. UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs

Author: He, Chaoqun, Luo, Renjie, Hu, Shengding, Zhao, Yuanqian, Zhou, Jie, Wu, Hanghao, Zhang, Jiajie, Han, Xu, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language
Abstract: Evaluation is pivotal for refining Large Language Models (LLMs), pinpointing their capabilities, and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. However, considering various implementation details, developing a comprehensive evaluation platform is never easy. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into research workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by its lightweight nature, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, benchmarks, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models owing to a unified HTTP service and provides sufficient inference acceleration. UltraEval is now available for researchers publicly., Comment: Accepted by ACL 2024 System Demonstration Track, update
Published: 2024

42. MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Author: Hu, Shengding, Tu, Yuge, Han, Xu, He, Chaoqun, Cui, Ganqu, Long, Xiang, Zheng, Zhi, Fang, Yewei, Huang, Yuxiang, Zhao, Weilin, Zhang, Xinrong, Thai, Zheng Leng, Zhang, Kaihuo, Wang, Chongyi, Yao, Yuan, Zhao, Chenyang, Zhou, Jie, Cai, Jie, Zhai, Zhongwu, Ding, Ning, Jia, Chao, Zeng, Guoyang, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The burgeoning interest in developing Large Language Models (LLMs) with up to a trillion parameters has been met with concerns regarding resource efficiency and practical expense, particularly given the immense cost of experimentation. This scenario underscores the importance of exploring the potential of Small Language Models (SLMs) as a resource-efficient alternative. In this context, we introduce MiniCPM, specifically the 1.2B and 2.4B non-embedding parameter variants, which not only excel in their respective categories but also demonstrate capabilities on par with 7B-13B LLMs. While focusing on SLMs, our approach exhibits scalability in both model and data dimensions for future LLM research. Regarding model scaling, we employ extensive model wind tunnel experiments for stable and optimal scaling. For data scaling, we introduce a Warmup-Stable-Decay (WSD) learning rate scheduler (LRS), conducive to continuous training and domain adaptation. We present an in-depth analysis of the intriguing training dynamics that occurred in the WSD LRS. With WSD LRS, we are now able to efficiently study the data-model scaling law without extensive retraining experiments on both axes of model and data, from which we derive a much higher compute-optimal data-model ratio than Chinchilla Optimal. Additionally, we introduce the MiniCPM family, including MiniCPM-DPO, MiniCPM-MoE and MiniCPM-128K, whose excellent performance further cements MiniCPM's foundation in diverse SLM applications. MiniCPM models are available publicly at https://github.com/OpenBMB/MiniCPM ., Comment: revise according to peer review
Published: 2024

43. Robust and Scalable Model Editing for Large Language Models

Author: Chen, Yingfa, Zhang, Zhengyan, Han, Xu, Xiao, Chaojun, Liu, Zhiyuan, Chen, Chen, Li, Kuai, Yang, Tao, and Sun, Maosong
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large language models (LLMs) can make predictions using parametric knowledge--knowledge encoded in the model weights--or contextual knowledge--knowledge presented in the context. In many scenarios, a desirable behavior is that LLMs give precedence to contextual knowledge when it conflicts with the parametric knowledge, and fall back to using their parametric knowledge when the context is irrelevant. This enables updating and correcting the model's knowledge by in-context editing instead of retraining. Previous works have shown that LLMs are inclined to ignore contextual knowledge and fail to reliably fall back to parametric knowledge when presented with irrelevant context. In this work, we discover that, with proper prompting methods, instruction-finetuned LLMs can be highly controllable by contextual knowledge and robust to irrelevant context. Utilizing this feature, we propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing. To better evaluate the robustness of model editors, we collect a new dataset, that contains irrelevant questions that are more challenging than the ones in existing datasets. Empirical results show that our method outperforms current state-of-the-art methods by a large margin. Unlike existing techniques, it can integrate knowledge from multiple edits, and correctly respond to syntactically similar but semantically unrelated inputs (and vice versa). The source code can be found at https://github.com/thunlp/EREN., Comment: LREC-COLING 2024 paper, 16 pages, 4 figures
Published: 2024

44. V2X-Real: a Large-Scale Dataset for Vehicle-to-Everything Cooperative Perception

Author: Xiang, Hao, Zheng, Zhaoliang, Xia, Xin, Xu, Runsheng, Gao, Letian, Zhou, Zewei, Han, Xu, Ji, Xinkai, Li, Mingxi, Meng, Zonglin, Jin, Li, Lei, Mingyue, Ma, Zhaoyang, He, Zihang, Ma, Haoxuan, Yuan, Yunshuang, Zhao, Yingqian, and Ma, Jiaqi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in Vehicle-to-Everything (V2X) technologies have enabled autonomous vehicles to share sensing information to see through occlusions, greatly boosting the perception capability. However, there are no real-world datasets to facilitate the real V2X cooperative perception research -- existing datasets either only support Vehicle-to-Infrastructure cooperation or Vehicle-to-Vehicle cooperation. In this paper, we propose a dataset that has a mixture of multiple vehicles and smart infrastructure simultaneously to facilitate the V2X cooperative perception development with multi-modality sensing data. Our V2X-Real is collected using two connected automated vehicles and two smart infrastructures, which are all equipped with multi-modal sensors including LiDAR sensors and multi-view cameras. The whole dataset contains 33K LiDAR frames and 171K camera data with over 1.2M annotated bounding boxes of 10 categories in very challenging urban scenarios. According to the collaboration mode and ego perspective, we derive four types of datasets for Vehicle-Centric, Infrastructure-Centric, Vehicle-to-Vehicle, and Infrastructure-to-Infrastructure cooperative perception. Comprehensive multi-class multi-agent benchmarks of SOTA cooperative perception methods are provided. The V2X-Real dataset and benchmark codes will be released.
Published: 2024

45. BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Author: Sun, Ao, Zhao, Weilin, Han, Xu, Yang, Cheng, Liu, Zhiyuan, Shi, Chuan, and Sun, Maosong
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning
Abstract: Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long sequences. One potential solution for the long sequence problem is to utilize distributed clusters to parallelize the computation of attention modules across multiple devices (e.g., GPUs). However, adopting a distributed approach inevitably introduces extra memory overheads to store local attention results and incurs additional communication costs to aggregate local results into global ones. In this paper, we propose a distributed attention framework named "BurstAttention" to optimize memory access and communication operations at both the global cluster and local device levels. In our experiments, we compare BurstAttention with other competitive distributed attention solutions for long sequence processing. The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences compared with these competitive baselines, reducing communication overheads by 40% and achieving a 1.37x speedup when training with a 128K sequence length on 32 A100 GPUs., Comment: 13 pages, 7 figures
Published: 2024

46. CATS: Enhancing Multivariate Time Series Forecasting by Constructing Auxiliary Time Series as Exogenous Variables

Author: Lu, Jiecheng, Han, Xu, Sun, Yan, and Yang, Shihao
Subjects: Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: For Multivariate Time Series Forecasting (MTSF), recent deep learning applications show that univariate models frequently outperform multivariate ones. To address this deficiency in multivariate models, we introduce a method to Construct Auxiliary Time Series (CATS) that functions like a 2D temporal-contextual attention mechanism, which generates Auxiliary Time Series (ATS) from Original Time Series (OTS) to effectively represent and incorporate inter-series relationships for forecasting. Key principles of ATS - continuity, sparsity, and variability - are identified and implemented through different modules. Even with a basic 2-layer MLP as the core predictor, CATS achieves state-of-the-art results while significantly reducing complexity and parameters compared to previous multivariate models, making it an efficient and transferable MTSF solution.
Published: 2024

47. Significant Mobility Enhancement by Semicrystalline Polymers Additive for Crystallization and Charge Transport in Organic Field-effect Transistor

Author: Bi, Sheng, Yao, Zehui, Han, Xu, Bi, Congjie, Wang, Xiaolong, Chen, Qiangqiang, Wang, Yao, Wang, Rongyi, Asare-Yeboah, Kyeiwaa, He, Zhengran, and Song, Ruonan
Published: 2024
Full Text: View/download PDF

48. Light-emitting diodes based on intercalated transition metal dichalcogenides with suppressed efficiency roll-off at high generation rates

Author: Wang, Shixuan, Fu, Qiang, Zheng, Ting, Han, Xu, Wang, Hao, Zhou, Tao, Liu, Jing, Liu, Tianqi, Zhang, Yuwei, Chen, Kaiqi, Wang, Qixing, Duan, Zhexing, Zhou, Xin, Watanabe, Kenji, Taniguchi, Takashi, Yan, Jiaxu, Huang, Yuan, Xiong, Yuwei, Yang, Joel K. W., Hu, Zhenliang, Xu, Tao, Sun, Litao, Hong, Jinhua, Zheng, Yujie, You, Yumeng, Zhang, Qi, Lu, Junpeng, and Ni, Zhenhua
Published: 2024
Full Text: View/download PDF

49. MALAT1/miR-582-5p/GALNT1/MUC1 axis modulates progression of AML leukemia stem cells by regulating JAK2/STAT3 pathway

Author: Li, Si, Gao, Rui, Han, Xu, Wang, Kai, Kang, Bingyu, and Ma, Xiaolu
Published: 2024
Full Text: View/download PDF

50. Off-policy asymptotic and adaptive maximum entropy deep reinforcement learning

Author: Zhang, Huihui and Han, Xu
Published: 2024
Full Text: View/download PDF
