70,331 results for "Li, Bo"
Search Results
2. Self-Branding through NFL Team Fanship: Fans’ Desired Self-Image and Its Implications for Branding Practices
- Author
- Wang, Jerred Junqi, Braunstein-Minkove, Jessica R., Baker, Thomas A., Li, Bo, and Zhang, James J.
- Published
- 2024
3. Does Star Power Boost Soccer Match Attendance? Empirical Evidence from the Chinese Soccer League
- Author
- Li, Bo, Liu, Yuanyang, Wang, Jerred Junqi, Scott, Olan K.M., and Stokowski, Sarah
- Published
- 2024
4. Developing a Fair Online Recruitment Framework Based on Job-seekers' Fairness Concerns
- Author
- He, Changyang, Deng, Yue, Fabris, Alessandro, Li, Bo, and Biega, Asia
- Subjects
- Computer Science - Human-Computer Interaction
- Abstract
The susceptibility to biases and discrimination is a pressing issue in today's labor markets. Though digital recruitment systems play an increasingly significant role in human resources management, thus far we lack a systematic understanding of human-centered design principles for fair online hiring. This work proposes a fair recruitment framework based on job-seekers' fairness concerns shared in an online forum. Through qualitative analysis, we uncover four overarching themes of job-seekers' fairness concerns, including discrimination against sensitive attributes, interaction biases, improper interpretations of qualifications, and power imbalance. Based on these findings, we derive design implications for algorithms and interfaces in recruitment systems, integrating them into a fair recruitment framework spanning different hiring stages and fairness considerations.
- Comment: 24 pages, 3 figures
- Published
- 2025
5. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
- Author
- Hu, Kairui, Wu, Penghao, Pu, Fanyi, Xiao, Wang, Zhang, Yuanhan, Yue, Xiang, Li, Bo, and Liu, Ziwei
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
- Abstract
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities in Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
- Published
- 2025
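One plausible reading of the Δknowledge metric in the Video-MMMU abstract above is a gain normalized by the headroom left before viewing; the exact formula below is an assumption for illustration, not quoted from the paper:

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Normalized performance gain after video viewing (accuracies in %).

    A plausible form of a "knowledge gain" metric: improvement divided by
    the remaining headroom before viewing. The exact definition is an
    assumption here, not taken from the abstract.
    """
    return (acc_after - acc_before) / (100.0 - acc_before)

# e.g. a hypothetical model at 40% accuracy before the video and 55% after
print(knowledge_gain(40.0, 55.0))  # → 0.25
```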
6. Deep Reinforcement Learning with Hybrid Intrinsic Reward Model
- Author
- Yuan, Mingqi, Li, Bo, Jin, Xin, and Zeng, Wenjun
- Subjects
- Computer Science - Machine Learning
- Abstract
Intrinsic reward shaping has emerged as a prevalent approach to solving hard-exploration and sparse-reward environments in reinforcement learning (RL). While single intrinsic rewards, such as curiosity-driven or novelty-based methods, have shown effectiveness, they often limit the diversity and efficiency of exploration. Moreover, the potential and principles of combining multiple intrinsic rewards remain insufficiently explored. To address this gap, we introduce HIRE (Hybrid Intrinsic REward), a flexible and elegant framework for creating hybrid intrinsic rewards through deliberate fusion strategies. With HIRE, we conduct a systematic analysis of the application of hybrid intrinsic rewards in both general and unsupervised RL across multiple benchmarks. Extensive experiments demonstrate that HIRE can significantly enhance exploration efficiency and diversity, as well as skill acquisition in complex and dynamic settings.
- Comment: 18 pages, 14 figures
- Published
- 2025
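The fusion idea described in the HIRE abstract above can be sketched generically; the fusion strategies and weighting below are illustrative assumptions, not the paper's actual method:

```python
def hybrid_intrinsic_reward(curiosity, novelty, beta=0.5, fusion="weighted_sum"):
    """Fuse two per-step intrinsic reward signals into one hybrid reward.

    The three fusion strategies here are illustrative guesses at what a
    "deliberate fusion strategy" might look like, not HIRE's actual design.
    """
    if fusion == "weighted_sum":
        return [beta * c + (1.0 - beta) * n for c, n in zip(curiosity, novelty)]
    if fusion == "product":
        return [c * n for c, n in zip(curiosity, novelty)]
    if fusion == "max":
        return [max(c, n) for c, n in zip(curiosity, novelty)]
    raise ValueError(f"unknown fusion strategy: {fusion}")

# Hypothetical per-step rewards from a curiosity module and a novelty module
curiosity = [0.2, 0.8, 0.1]
novelty = [0.5, 0.1, 0.9]
print(hybrid_intrinsic_reward(curiosity, novelty))  # → [0.35, 0.45, 0.5]
```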
7. Adaptive Data Exploitation in Deep Reinforcement Learning
- Author
- Yuan, Mingqi, Li, Bo, Jin, Xin, and Zeng, Wenjun
- Subjects
- Computer Science - Machine Learning, Computer Science - Artificial Intelligence
- Abstract
We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful framework to enhance the data efficiency and generalization in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms, optimizing data utilization while mitigating overfitting. Moreover, ADEPT can significantly reduce the computational overhead and accelerate a wide range of RL algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet. Extensive simulation demonstrates that ADEPT can achieve superior performance with remarkable computational efficiency, offering a practical solution to data-efficient RL. Our code is available at https://github.com/yuanmingqi/ADEPT.
- Comment: 40 pages, 37 figures
- Published
- 2025
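The MAB idea in the ADEPT abstract above can be illustrated with a generic UCB1 bandit choosing how often to reuse each sampled batch; the arm meanings and reward signal below are hypothetical stand-ins, not ADEPT's actual algorithm:

```python
import math

class ReuseBandit:
    """UCB1 bandit over candidate data-reuse counts (illustrative only)."""

    def __init__(self, arms=(1, 2, 4, 8), c=2.0):
        self.arms = list(arms)        # e.g. gradient updates per sampled batch
        self.c = c                    # exploration strength
        self.counts = [0] * len(self.arms)
        self.values = [0.0] * len(self.arms)
        self.t = 0

    def select(self):
        self.t += 1
        for i, n in enumerate(self.counts):
            if n == 0:                # play every arm once first
                return i
        scores = [v + self.c * math.sqrt(math.log(self.t) / n)
                  for v, n in zip(self.values, self.counts)]
        return scores.index(max(scores))

    def update(self, arm, reward):
        # incremental mean of the observed learning-progress reward
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = ReuseBandit()
for _ in range(500):
    arm = bandit.select()
    # stand-in reward: pretend reusing each batch twice (arm index 1) works best
    reward = 1.0 if arm == 1 else 0.2
    bandit.update(arm, reward)
best = bandit.counts.index(max(bandit.counts))
print(bandit.arms[best])  # → 2
```

The bandit converges on the reuse count with the highest observed reward while still logarithmically exploring the alternatives.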
8. 'Auntie, Please Don't Fall for Those Smooth Talkers': How Chinese Younger Family Members Safeguard Seniors from Online Fraud
- Author
- Deng, Yue, He, Changyang, Zou, Yixin, and Li, Bo
- Subjects
- Computer Science - Human-Computer Interaction
- Abstract
Online fraud substantially harms individuals, and seniors are disproportionately targeted. While family is crucial for seniors, little research has empirically examined how they protect seniors against fraud. To address this gap, we employed an inductive thematic analysis of 124 posts and 16,872 comments on RedNote (Xiaohongshu), exploring the family support ecosystem for senior-targeted online fraud in China. We develop a taxonomy of senior-targeted online fraud from a familial perspective, revealing younger members often spot frauds hard for seniors to detect, such as unusual charges. Younger family members fulfill multiple safeguarding roles, including preventative measures, fraud identification, fraud persuasion, loss recovery, and education. They also encounter numerous challenges, such as seniors' refusal of help and considerable mental and financial stress. Drawing on these, we develop a conceptual framework to characterize family support in senior-targeted fraud, and outline implications for researchers and practitioners to consider the broader stakeholder ecosystem and cultural aspects.
- Comment: 27 pages, 3 figures. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI '25), April 26-May 1, 2025, Yokohama, Japan
- Published
- 2025
9. FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
- Author
- Pan, Xinglin, Lin, Wenxiang, Zhang, Lin, Shi, Shaohuai, Tang, Zhenheng, Wang, Rui, Li, Bo, and Chu, Xiaowen
- Subjects
- Computer Science - Machine Learning
- Abstract
Recent large language models (LLMs) have tended to leverage sparsity to reduce computations, employing the sparsely activated mixture-of-experts (MoE) technique. MoE introduces four modules, including token routing, token communication, expert computation, and expert parallelism, that impact model quality and training efficiency. To enable versatile usage of MoE models, we introduce FSMoE, a flexible training system optimizing task scheduling with three novel techniques: 1) Unified abstraction and online profiling of MoE modules for task scheduling across various MoE implementations. 2) Co-scheduling intra-node and inter-node communications with computations to minimize communication overheads. 3) To support near-optimal task scheduling, we design an adaptive gradient partitioning method for gradient aggregation and a schedule to adaptively pipeline communications and computations. We conduct extensive experiments with configured MoE layers and real-world MoE models on two GPU clusters. Experimental results show that 1) our FSMoE supports four popular types of MoE routing functions and is more efficient than existing implementations (with up to a 1.42× speedup), and 2) FSMoE outperforms the state-of-the-art MoE training systems (DeepSpeed-MoE and Tutel) by 1.18×-1.22× on 1458 MoE layers and 1.19×-3.01× on real-world MoE models based on GPT-2 and Mixtral using a popular routing function.
- Published
- 2025
10. Quantifying the imaginarity via different distance measures
- Author
- Guo, Meng-Li, Huang, Si-Yin, Li, Bo, and Fei, Shao-Ming
- Subjects
- Quantum Physics
- Abstract
The recently introduced resource theory of imaginarity facilitates a systematic investigation into the role of complex numbers in quantum mechanics and quantum information theory. In this work, we propose well-defined measures of imaginarity using various distance metrics, drawing inspiration from recent advancements in quantum entanglement and coherence. Specifically, we focus on quantitatively evaluating imaginarity through measures such as the Tsallis relative α-entropy, the sandwiched Rényi relative entropy, and the Tsallis relative operator entropy. Additionally, we analyze the decay rates of these measures. Our findings reveal that the Tsallis relative α-entropy of imaginarity exhibits a higher decay rate under quantum channels compared to other measures. Finally, we examine the ordering of single-qubit states under these imaginarity measures, demonstrating that the order remains invariant under the bit-flip channel for specific parameter ranges. This study enhances our understanding of imaginarity as a quantum resource and its potential applications in quantum information theory.
- Published
- 2025
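For background on the entry above, the standard definition of the Tsallis relative α-entropy is given below; how the paper turns it into an imaginarity measure is not specified in the abstract, so treat this as general reference only:

```latex
% Standard Tsallis relative alpha-entropy of rho with respect to sigma
\[
D_{\alpha}^{T}(\rho \,\|\, \sigma)
  = \frac{\operatorname{Tr}\!\bigl(\rho^{\alpha}\sigma^{1-\alpha}\bigr) - 1}{\alpha - 1},
  \qquad \alpha \in (0,1) \cup (1,2].
\]
```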
11. Automated Heterogeneous Network learning with Non-Recursive Message Passing
- Author
- Li, Zhaoqing, Jiang, Maiqi, Chen, Shengyuan, Li, Bo, Chen, Guorong, and Huang, Xiao
- Subjects
- Computer Science - Machine Learning
- Abstract
Heterogeneous information networks (HINs) can be used to model various real-world systems. As HINs consist of multiple types of nodes, edges, and node features, it is nontrivial to directly apply graph neural network (GNN) techniques in heterogeneous cases. There are two remaining major challenges. First, homogeneous message passing in a recursive manner neglects the distinct types of nodes and edges in different hops, leading to unnecessary information mixing. This often results in the incorporation of "noise" from uncorrelated intermediate neighbors, thereby degrading performance. Second, feature learning should be handled differently for different types, which is challenging especially when the type sizes are large. To bridge this gap, we develop a novel framework - AutoGNR, to directly utilize and automatically extract effective heterogeneous information. Instead of recursive homogeneous message passing, we introduce a non-recursive message passing mechanism for GNN to mitigate noise from uncorrelated node types in HINs. Furthermore, under the non-recursive framework, we manage to efficiently perform neural architecture search for an optimal GNN structure in a differentiable way, which can automatically define the heterogeneous paths for aggregation. Our tailored search space encompasses more effective candidates while maintaining a tractable size. Experiments show that AutoGNR consistently outperforms state-of-the-art methods on both normal and large-scale real-world HIN datasets.
- Published
- 2025
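The contrast the abstract above draws between recursive and non-recursive message passing can be sketched generically; the aggregation functions and the per-path layout below are illustrative assumptions, not AutoGNR's actual operators:

```python
def mean(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recursive_passing(features, adj, hops):
    """Homogeneous GNN-style passing: repeatedly mix each node with all of
    its neighbors, so hop-2 information arrives pre-blended with hop-1
    (the "noise" mixing the abstract criticizes)."""
    h = dict(features)
    for _ in range(hops):
        h = {v: mean([h[v]] + [h[u] for u in adj[v]]) for v in adj}
    return h

def non_recursive_passing(features, typed_neighbors):
    """Non-recursive passing: aggregate each typed neighbor set (e.g. one per
    heterogeneous path) directly from the raw features, then concatenate the
    per-path summaries so types stay separated."""
    out = {}
    for v, paths in typed_neighbors.items():
        summary = features[v][:]
        for nodes in paths:               # one list of node ids per path type
            summary += mean([features[u] for u in nodes])
        out[v] = summary
    return out

# Tiny toy graph: 0 - 1 - 2, one scalar feature per node
feats = {0: [1.0], 1: [3.0], 2: [5.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
print(recursive_passing(feats, adj, hops=1)[0])          # → [2.0]
print(non_recursive_passing(feats, {0: [[1], [2]]})[0])  # → [1.0, 3.0, 5.0]
```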
12. Identity-aware Feature Decoupling Learning for Clothing-change Person Re-identification
- Author
- Xu, Haoxuan, Li, Bo, and Niu, Guanglin
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
Clothing-change person re-identification (CC Re-ID) has attracted increasing attention in recent years due to its application prospect. Most existing works struggle to adequately extract the ID-related information from the original RGB images. In this paper, we propose an Identity-aware Feature Decoupling (IFD) learning framework to mine identity-related features. Particularly, IFD exploits a dual stream architecture that consists of a main stream and an attention stream. The attention stream takes the clothing-masked images as inputs and derives the identity attention weights for effectively transferring the spatial knowledge to the main stream and highlighting the regions with abundant identity-related information. To eliminate the semantic gap between the inputs of two streams, we propose a clothing bias diminishing module specific to the main stream to regularize the features of clothing-relevant regions. Extensive experimental results demonstrate that our framework outperforms other baseline models on several widely-used CC Re-ID datasets.
- Comment: Accepted by ICASSP 2025
- Published
- 2025
13. TAPFed: Threshold Secure Aggregation for Privacy-Preserving Federated Learning
- Author
- Xu, Runhua, Li, Bo, Li, Chao, Joshi, James B. D., Ma, Shuai, and Li, Jianxin
- Subjects
- Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence
- Abstract
Federated learning is a computing paradigm that enhances privacy by enabling multiple parties to collaboratively train a machine learning model without revealing personal data. However, current research indicates that traditional federated learning platforms are unable to ensure privacy due to privacy leaks caused by the interchange of gradients. To achieve privacy-preserving federated learning, integrating secure aggregation mechanisms is essential. Unfortunately, existing solutions are vulnerable to recently demonstrated inference attacks such as the disaggregation attack. This paper proposes TAPFed, an approach for achieving privacy-preserving federated learning in the context of multiple decentralized aggregators with malicious actors. TAPFed uses a proposed threshold functional encryption scheme and allows for a certain number of malicious aggregators while maintaining security and privacy. We provide formal security and privacy analyses of TAPFed and compare it to various baselines through experimental evaluation. Our results show that TAPFed offers equivalent performance in terms of model quality compared to state-of-the-art approaches while reducing transmission overhead by 29%-45% across different model training scenarios. Most importantly, TAPFed can defend against recently demonstrated inference attacks caused by curious aggregators, which the majority of existing approaches are susceptible to.
- Comment: The paper has been published in IEEE TDSC
- Published
- 2025
14. PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
- Author
- Yuan, Lingzhi, Li, Xinfeng, Xu, Chejian, Tao, Guanhong, Jia, Xiaojun, Huang, Yihao, Dong, Wei, Liu, Yang, Wang, XiaoFeng, and Li, Bo
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
- Abstract
Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. Extensive experiments across three datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard runs 7.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.
- Comment: 16 pages, 8 figures, 10 tables
- Published
- 2025
15. DepthMaster: Taming Diffusion Models for Monocular Depth Estimation
- Author
- Song, Ziyang, Wang, Zerong, Li, Bo, Zhang, Hao, Zhu, Ruijie, Liu, Li, Jiang, Peng-Tao, and Zhang, Tianzhu
- Subjects
- Computer Science - Computer Vision and Pattern Recognition
- Abstract
Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
- Comment: 11 pages, 6 figures, 6 tables
- Published
- 2025
16. The (Exact) Price of Cardinality for Indivisible Goods: A Parametric Perspective
- Author
- Lam, Alexander, Li, Bo, and Sun, Ankang
- Subjects
- Computer Science - Computer Science and Game Theory
- Abstract
We adopt a parametric approach to analyze the worst-case degradation in social welfare when the allocation of indivisible goods is constrained to be fair. Specifically, we are concerned with cardinality-constrained allocations, which require that each agent has at most k items in their allocated bundle. We propose the notion of the price of cardinality, which captures the worst-case multiplicative loss of utilitarian or egalitarian social welfare resulting from imposing the cardinality constraint. We then characterize tight or almost-tight bounds on the price of cardinality as exact functions of the instance parameters, demonstrating how the social welfare improves as k is increased. In particular, one of our main results refines and generalizes the existing asymptotic bound on the price of balancedness, as studied by Bei et al. [BLMS21]. We also further extend our analysis to the problem where the items are partitioned into disjoint categories, and each category has its own cardinality constraint. Through a parametric study of the price of cardinality, we provide a framework which aids decision makers in choosing an ideal level of cardinality-based fairness, using their knowledge of the potential loss of utilitarian and egalitarian social welfare.
- Comment: To appear in the proceedings of AAAI 2025
- Published
- 2025
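A toy instance (constructed here for illustration, not taken from the paper) shows why utilitarian welfare can degrade under a cardinality constraint:

```latex
% Two agents and 2k identical items: agent 1 values each item at 1,
% agent 2 values every item at 0.
\[
\mathrm{SW}^{*} = 2k \quad\text{(unconstrained: give all items to agent 1)},
\qquad
\mathrm{SW}_{k} = k \quad\text{(at most $k$ items per agent)},
\]
\[
\text{so } \frac{\mathrm{SW}^{*}}{\mathrm{SW}_{k}} = 2,
\text{ i.e. the utilitarian price of cardinality is at least } 2
\text{ on instances of this shape.}
\]
```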
17. Boosting Adversarial Transferability with Spatial Adversarial Alignment
- Author
- Chen, Zhaoyu, Guo, Haijing, Jiang, Kaixun, Fu, Jiyuan, Zhou, Xinyu, Yang, Dingkang, Tang, Hao, Li, Bo, and Zhang, Wenqiang
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
- Abstract
Deep neural networks are vulnerable to adversarial examples that exhibit transferability across various models. Numerous approaches are proposed to enhance the transferability of adversarial examples, including advanced optimization, data augmentation, and model modifications. However, these methods still show limited transferability, particularly in cross-architecture scenarios, such as from CNN to ViT. To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model. Specifically, SAA consists of two key parts: spatial-aware alignment and adversarial-aware alignment. First, we minimize the divergences of features between the two models in both global and local regions, facilitating spatial alignment. Second, we introduce a self-adversarial strategy that leverages adversarial examples to impose further constraints, aligning features from an adversarial perspective. Through this alignment, the surrogate model is trained to concentrate on the common features extracted by the witness model. This facilitates adversarial attacks on these shared features, thereby yielding perturbations that exhibit enhanced transferability. Extensive experiments across various architectures on ImageNet show that surrogate models aligned with SAA yield more transferable adversarial examples, especially in cross-architecture attacks.
- Published
- 2025
18. SatFlow: Scalable Network Planning for LEO Mega-Constellations
- Author
- Cen, Sheng, Pan, Qiying, Zhu, Yifei, and Li, Bo
- Subjects
- Computer Science - Networking and Internet Architecture
- Abstract
Low-earth-orbit (LEO) satellite communication networks have evolved into mega-constellations with hundreds to thousands of satellites inter-connecting with inter-satellite links (ISLs). Network planning, which plans for network resources and architecture to improve the network performance and save operational costs, is crucial for satellite network management. However, due to the large scale of mega-constellations, high dynamics of satellites, and complex distribution of real-world traffic, it is extremely challenging to conduct scalable network planning on mega-constellations with high performance. In this paper, we propose SatFlow, a distributed and hierarchical network planning framework to plan for the network topology, traffic allocation, and fine-grained ISL terminal power allocation for mega-constellations. To tackle the hardness of the original problem, we decompose the grand problem into two hierarchical sub-problems, tackled by two-tier modules. A multi-agent reinforcement learning approach is proposed for the upper-level module so that the overall laser energy consumption and ISL operational costs can be minimized; A distributed alternating step algorithm is proposed for the lower-level module so that the laser energy consumption could be minimized with low time complexity for a given topology. Extensive simulations on various mega-constellations validate SatFlow's scalability on the constellation size, reducing the flow violation ratio by up to 21.0% and reducing the total costs by up to 89.4%, compared with various state-of-the-art benchmarks.
- Comment: Accepted by IEEE International Conference on Network Protocols (ICNP'24)
- Published
- 2024
19. XRAG: eXamining the Core -- Benchmarking Foundational Components in Advanced Retrieval-Augmented Generation
- Author
- Mao, Qianren, Luo, Yangyifei, Zhang, Jinlong, Hao, Hanwen, Cao, Zhilong, Wang, Xiaolong, Guan, Xiao, Huang, Zhenting, Jiang, Weifeng, Guo, Shuyu, Han, Zhentao, Zhang, Qili, Tao, Siyuan, Liu, Yujie, Liu, Junnan, Tan, Zhixing, Sun, Jie, Li, Bo, Liu, Xudong, Zhang, Richong, and Li, Jianxin
- Subjects
- Computer Science - Computation and Language, Computer Science - Artificial Intelligence
- Abstract
Retrieval-augmented generation (RAG) synergizes the retrieval of pertinent data with the generative capabilities of Large Language Models (LLMs), ensuring that the generated output is not only contextually relevant but also accurate and current. We introduce XRAG, an open-source, modular codebase that facilitates exhaustive evaluation of the performance of foundational components of advanced RAG modules. These components are systematically categorized into four core phases: pre-retrieval, retrieval, post-retrieval, and generation. We systematically analyse them across reconfigured datasets, providing a comprehensive benchmark for their effectiveness. As the complexity of RAG systems continues to escalate, we underscore the critical need to identify potential failure points in RAG systems. We formulate a suite of experimental methodologies and diagnostic testing protocols to dissect the failure points inherent in RAG engineering. Subsequently, we proffer bespoke solutions aimed at bolstering the overall performance of these modules. Our work thoroughly evaluates the performance of advanced core components in RAG systems, providing insights into optimizations for prevalent failure points.
- Published
- 2024
20. LLMs Lost in Translation: M-ALERT uncovers Cross-Linguistic Safety Gaps
- Author
- Friedrich, Felix, Tedeschi, Simone, Schramowski, Patrick, Brack, Manuel, Navigli, Roberto, Nguyen, Huu, Li, Bo, and Kersting, Kristian
- Subjects
- Computer Science - Computation and Language
- Abstract
Building safe Large Language Models (LLMs) across multiple languages is essential in ensuring both safe access and linguistic diversity. To this end, we introduce M-ALERT, a multilingual benchmark that evaluates the safety of LLMs in five languages: English, French, German, Italian, and Spanish. M-ALERT includes 15k high-quality prompts per language, totaling 75k, following the detailed ALERT taxonomy. Our extensive experiments on 10 state-of-the-art LLMs highlight the importance of language-specific safety analysis, revealing that models often exhibit significant inconsistencies in safety across languages and categories. For instance, Llama3.2 shows high unsafety in the category crime_tax for Italian but remains safe in other languages. Similar differences can be observed across all models. In contrast, certain categories, such as substance_cannabis and crime_propaganda, consistently trigger unsafe responses across models and languages. These findings underscore the need for robust multilingual safety practices in LLMs to ensure safe and responsible usage across diverse user communities.
- Published
- 2024
21. Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
- Author
- Xu, Zhengfei, Zhao, Sijia, Hao, Yanchao, Liu, Xiaolong, Li, Lili, Yin, Yuyang, Li, Bo, Chen, Xi, and Xin, Xin
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval, Computer Science - Multimedia
- Abstract
Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding, matching objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs like clicks or bounding boxes offer a more convenient alternative. Therefore, we propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks from visual inputs to refer to objects, supplementing reference methods for VEL. To facilitate research on this task, we have constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. This dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, advancing visual understanding toward finer granularity. Moreover, as pixel masks correspond to semantic regions in an image, we enhance previous patch-interacted attention with region-interacted attention by a visual semantic tokenization approach. Manual evaluation results indicate that the reverse annotation framework achieved a 94.8% annotation success rate. Experimental results show that models trained on this dataset improved accuracy by 18 points compared to zero-shot models. Additionally, the semantic tokenization method achieved a 5-point accuracy improvement over the trained baseline.
- Comment: AAAI 2025; dataset released at https://github.com/NP-NET-research/PL-VEL
- Published
- 2024
22. BlenderLLM: Training Large Language Models for Computer-Aided Design with Self-improvement
- Author
- Du, Yuhao, Chen, Shunian, Zan, Wenbo, Li, Peizhao, Wang, Mingxuan, Song, Dingjie, Li, Bo, Hu, Yan, and Wang, Benyou
- Subjects
- Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
- Abstract
The application of Large Language Models (LLMs) in Computer-Aided Design (CAD) remains an underexplored area, despite their remarkable advancements in other domains. In this paper, we present BlenderLLM, a novel framework for training LLMs specifically for CAD tasks leveraging a self-improvement methodology. To support this, we developed a bespoke training dataset, BlendNet, and introduced a comprehensive evaluation suite, CADBench. Our results reveal that existing models demonstrate significant limitations in generating accurate CAD scripts. However, through minimal instruction-based fine-tuning and iterative self-improvement, BlenderLLM significantly surpasses these models in both functionality and accuracy of CAD script generation. This research establishes a strong foundation for the application of LLMs in CAD while demonstrating the transformative potential of self-improving models in advancing CAD automation. We encourage further exploration and adoption of these methodologies to drive innovation in the field. The dataset, model, benchmark, and source code are publicly available at https://github.com/FreedomIntelligence/BlenderLLM
- Published
- 2024
23. Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning
- Author
- Liu, Yuti, Liu, Shice, Gao, Junyuan, Jiang, Pengtao, Zhang, Hao, Chen, Jinwei, and Li, Bo
- Subjects
- Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
- Abstract
Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that, accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.
- Comment: Accepted by AAAI 2025
- Published
- 2024
24. Intention Knowledge Graph Construction for User Intention Relation Modeling
- Author
-
Bai, Jiaxin, Wang, Zhaobo, Cheng, Junfei, Yu, Dan, Huang, Zerui, Wang, Weiqi, Liu, Xin, Luo, Chen, He, Qi, Zhu, Yanming, Li, Bo, and Song, Yangqiu
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Understanding user intentions is challenging for online platforms. Recent work on intention knowledge graphs addresses this but often lacks focus on connecting intentions, which is crucial for modeling user behavior and predicting future actions. This paper introduces a framework to automatically generate an intention knowledge graph, capturing connections between user intentions. Using the Amazon m2 dataset, we construct an intention graph with 351 million edges, demonstrating high plausibility and acceptance. Our model effectively predicts new session intentions and enhances product recommendations, outperforming previous state-of-the-art methods and showcasing the approach's practical utility.
- Published
- 2024
25. Boundaries of the sets of quantum realizable values of arbitrary order Bargmann invariants
- Author
-
Zhang, Lin, Xie, Bing, and Li, Bo
- Subjects
Quantum Physics ,Mathematical Physics - Abstract
In the latest developments within the field of quantum information science, Bargmann invariants have emerged as fundamental quantities, uniquely characterizing tuples of quantum states while remaining invariant under unitary transformations. However, determining the boundaries of quantum-realizable values for Bargmann invariants of arbitrary order remains a significant theoretical challenge. In this work, we completely solve this problem by deriving a unified boundary formulation for these values. Through rigorous mathematical analysis and numerical simulations, we explore the constraints imposed by quantum mechanics to delineate the achievable ranges of these invariants. We demonstrate that the boundaries depend on the specific properties of quantum states and the order of the Bargmann invariants, illustrated by a family of single-parameter qubit pure states. Our findings uncover intricate connections between Bargmann invariants and quantum imaginarity, offering a unified perspective on the associated boundary curves. These results enhance our understanding of the physical limits within quantum mechanics and may lead to novel applications of Bargmann invariants in quantum information processing., Comment: 12 pages, 3 figures
- Published
- 2024
26. AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
- Author
-
Kang, Mintong, Xu, Chejian, and Li, Bo
- Subjects
Computer Science - Sound ,Computer Science - Artificial Intelligence ,Computer Science - Cryptography and Security ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Recent advancements in large audio-language models (LALMs) have enabled speech-based user interactions, significantly enhancing user experience and accelerating the deployment of LALMs in real-world applications. However, ensuring the safety of LALMs is crucial to prevent risky outputs that may raise societal concerns or violate AI regulations. Despite the importance of this issue, research on jailbreaking LALMs remains limited due to their recent emergence and the additional technical challenges they present compared to attacks on DNN-based audio models. Specifically, the audio encoders in LALMs, which involve discretization operations, often lead to gradient shattering, hindering the effectiveness of attacks relying on gradient-based optimizations. The behavioral variability of LALMs further complicates the identification of effective (adversarial) optimization targets. Moreover, enforcing stealthiness constraints on adversarial audio waveforms introduces a reduced, non-convex feasible solution space, further intensifying the challenges of the optimization process. To overcome these challenges, we develop AdvWave, the first jailbreak framework against LALMs. We propose a dual-phase optimization method that addresses gradient shattering, enabling effective end-to-end gradient-based optimization. Additionally, we develop an adaptive adversarial target search algorithm that dynamically adjusts the adversarial optimization target based on the response patterns of LALMs for specific queries. To ensure that adversarial audio remains perceptually natural to human listeners, we design a classifier-guided optimization approach that generates adversarial noise resembling common urban sounds. Extensive evaluations on multiple advanced LALMs demonstrate that AdvWave outperforms baseline methods, achieving a 40% higher average jailbreak attack success rate.
- Published
- 2024
27. Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning
- Author
-
Wei, Rongzhe, Li, Mufei, Ghassemi, Mohsen, Kreačić, Eleonora, Li, Yifan, Yue, Xiang, Li, Bo, Potluru, Vamsi K., Li, Pan, and Chien, Eli
- Subjects
Computer Science - Machine Learning - Abstract
Large Language Models are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective. Prior work shows that data outliers may exhibit higher memorization effects. Intuitively, they are harder to unlearn, and thus the privacy risk of unlearning them is underestimated in the current evaluation. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable assessments of LLM unlearning efficacy.
- Published
- 2024
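The worst-case argument in the abstract above can be illustrated with a toy membership inference comparison. Everything here is synthetic and hypothetical (score distributions, group sizes, the assumption that minority outliers are more memorized); it is only a sketch of why averaging over randomly selected removals can hide minority leakage:

```python
import numpy as np

def mia_auc(member_scores, nonmember_scores):
    # AUC of a score-based membership inference attack: the probability
    # that a randomly drawn member outranks a randomly drawn non-member.
    m = np.asarray(member_scores)[:, None]
    n = np.asarray(nonmember_scores)[None, :]
    return float((m > n).mean() + 0.5 * (m == n).mean())

rng = np.random.default_rng(0)
# Hypothetical per-example attack scores (e.g. negative loss); higher = "member".
majority = rng.normal(0.3, 1.0, 2000)    # typical data: mildly memorized
minority = rng.normal(1.0, 1.0, 200)     # outliers/minority: more memorized
nonmembers = rng.normal(0.0, 1.0, 2000)

print(mia_auc(majority, nonmembers))     # moderate leakage
print(mia_auc(minority, nonmembers))     # noticeably higher leakage
```

A random-removal evaluation is dominated by the majority scores, so the higher minority AUC, i.e. the worst case, goes unreported.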
28. Comateformer: Combined Attention Transformer for Semantic Sentence Matching
- Author
-
Li, Bo, Liang, Di, and Zhang, Zixin
- Subjects
Computer Science - Computation and Language - Abstract
Transformer-based models have made significant strides in semantic matching tasks by capturing connections between phrase pairs. However, to assess the relevance of sentence pairs, it is insufficient to just examine the general similarity between the sentences. It is crucial to also consider the tiny subtleties that differentiate them from each other. Regrettably, attention softmax operations in transformers tend to miss these subtle differences. To this end, in this work, we propose a novel semantic sentence matching model named Combined Attention Network based on Transformer model (Comateformer). In the Comateformer model, we design a novel transformer-based quasi-attention mechanism with compositional properties. Unlike traditional attention mechanisms that merely adjust the weights of input tokens, our proposed method learns how to combine, subtract, or resize specific vectors when building a representation. Moreover, our proposed approach builds on the intuition of similarity and dissimilarity (negative affinity) when calculating dual affinity scores. This allows for a more meaningful representation of relationships between sentences. To evaluate the performance of our proposed model, we conducted extensive experiments on ten public real-world datasets and robustness testing. Experimental results show that our method achieves consistent improvements., Comment: This paper is accepted by the 27th EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE (ECAI 2024)
- Published
- 2024
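The idea of combining, subtracting, or resizing value vectors, rather than only convexly mixing them, can be sketched with signed attention weights. This is an illustrative stand-in, not the paper's exact quasi-attention formulation: here `tanh` of the scaled scores yields weights in (-1, 1), so a value vector can be added, subtracted, or shrunk, which a softmax cannot express.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: non-negative weights that can only mix value vectors.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)) @ V

def signed_attention(Q, K, V):
    # Illustrative quasi-attention: tanh keeps weights in (-1, 1), so value
    # vectors can be combined, subtracted, or resized, not merely averaged.
    W = np.tanh(Q @ K.T / np.sqrt(Q.shape[-1]))
    return W @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = signed_attention(Q, K, V)
W = np.tanh(Q @ K.T / np.sqrt(8))
assert (W < 0).any()   # some tokens are subtracted, impossible under softmax
```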
29. SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations
- Author
-
Chen, Zhaorun, Pinto, Francesco, Pan, Minzhou, and Li, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
With the rise of generative AI and rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited unsafe categories, which lack detailed explanations, or prompting multimodal large language models (MLLMs) with long safety guidelines, which are inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch uniquely encodes each policy chunk in parallel and eliminates their position bias such that all policies are attended simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrailing with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories and over 30 tasks, ensuring comprehensive coverage of all potential safety scenarios. SafeWatch outperforms SOTA by 28.2% on SafeWatch-Bench, 13.6% on benchmarks, cuts costs by 10%, and delivers top-tier explanations validated by LLM and human reviews., Comment: 43 pages, 20 figures
- Published
- 2024
30. MIT-10M: A Large Scale Parallel Corpus of Multilingual Image Translation
- Author
-
Li, Bo, Zhu, Shaolin, and Wen, Lijie
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Image Translation (IT) holds immense potential across diverse domains, enabling the translation of textual content within images into various languages. However, existing datasets often suffer from limitations in scale, diversity, and quality, hindering the development and evaluation of IT models. To address this issue, we introduce MIT-10M, a large-scale parallel corpus of multilingual image translation with over 10M image-text pairs derived from real-world data, which has undergone extensive data cleaning and multilingual translation validation. It contains 840K images in three sizes, 28 categories, tasks at three levels of difficulty, and image-text pairs in 14 languages, which is a considerable improvement on existing datasets. We conduct extensive experiments to evaluate and train models on MIT-10M. The experimental results clearly indicate that our dataset has higher adaptability when it comes to evaluating the performance of the models in tackling challenging and complex image translation tasks in the real world. Moreover, the performance of the model fine-tuned with MIT-10M has tripled compared to the baseline model, further confirming its superiority., Comment: Accepted in COLING 2025
- Published
- 2024
31. Data Free Backdoor Attacks
- Author
-
Cao, Bochuan, Jia, Jinyuan, Hu, Chuxuan, Guo, Wenbo, Xiang, Zhen, Chen, Jinghui, Li, Bo, and Song, Dawn
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Backdoor attacks aim to inject a backdoor into a classifier such that it predicts any input with an attacker-chosen backdoor trigger as an attacker-chosen target class. Existing backdoor attacks require either retraining the classifier with some clean data or modifying the model's architecture. As a result, they are 1) not applicable when clean data is unavailable, 2) less efficient when the model is large, and 3) less stealthy due to architecture changes. In this work, we propose DFBA, a novel retraining-free and data-free backdoor attack without changing the model architecture. Technically, our proposed method modifies a few parameters of a classifier to inject a backdoor. Through theoretical analysis, we verify that our injected backdoor is provably undetectable and unremovable by various state-of-the-art defenses under mild assumptions. Our evaluation on multiple datasets further demonstrates that our injected backdoor: 1) incurs negligible classification loss, 2) achieves 100% attack success rates, and 3) bypasses six existing state-of-the-art defenses. Moreover, our comparison with a state-of-the-art non-data-free backdoor attack shows our attack is more stealthy and effective against various defenses while incurring less classification accuracy loss., Comment: 24 pages, 8 figures, accepted by NeurIPS 2024
- Published
- 2024
32. MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
- Author
-
Guo, Jarvis, Zheng, Tuney, Bai, Yuelin, Li, Bo, Wang, Yubo, Zhu, King, Li, Yizhi, Neubig, Graham, Chen, Wenhu, and Yue, Xiang
- Subjects
Computer Science - Computation and Language ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
- Published
- 2024
33. Composition of Experts: A Modular Compound AI System Leveraging Large Language Models
- Author
-
Jain, Swayambhoo, Raju, Ravi, Li, Bo, Csaki, Zoltan, Li, Jonathan, Liang, Kaizhao, Feng, Guoyao, Thakkar, Urmish, Sampat, Anand, Prabhakar, Raghu, and Jairath, Sumati
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Statistics - Machine Learning - Abstract
Large Language Models (LLMs) have achieved remarkable advancements, but their monolithic nature presents challenges in terms of scalability, cost, and customization. This paper introduces the Composition of Experts (CoE), a modular compound AI system leveraging multiple expert LLMs. CoE leverages a router to dynamically select the most appropriate expert for a given input, enabling efficient utilization of resources and improved performance. We formulate the general problem of training a CoE and discuss inherent complexities associated with it. To address these complexities, we propose a two-step routing approach that first uses a router to classify the input into distinct categories, followed by a category-to-expert mapping to obtain the desired experts. CoE offers a flexible and cost-effective solution to build compound AI systems. Our empirical evaluation demonstrates the effectiveness of CoE in achieving superior performance with reduced computational overhead. Given that CoE comprises many expert LLMs, it has unique system requirements for cost-effective serving. We present an efficient implementation of CoE leveraging the SambaNova SN40L RDU's unique three-tiered memory architecture. CoEs obtained using the open-weight LLMs Qwen/Qwen2-7B-Instruct, google/gemma-2-9b-it, google/gemma-2-27b-it, meta-llama/Llama-3.1-70B-Instruct and Qwen/Qwen2-72B-Instruct achieve a score of $59.4$ with merely $31$ billion average active parameters on Arena-Hard and a score of $9.06$ with $54$ billion average active parameters on MT-Bench.
- Published
- 2024
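The two-step routing described in the abstract above (input, then category, then expert) can be sketched as follows. The keyword classifier is a hypothetical stand-in for the learned first-stage router, and the category assignments are invented for illustration; only the expert model names come from the abstract:

```python
# Hypothetical category-to-expert mapping over the open-weight models above.
CATEGORY_TO_EXPERT = {
    "math": "Qwen/Qwen2-72B-Instruct",
    "code": "meta-llama/Llama-3.1-70B-Instruct",
    "chat": "google/gemma-2-27b-it",
}

def classify(prompt: str) -> str:
    # Stand-in for the learned router that classifies inputs into categories.
    p = prompt.lower()
    if any(k in p for k in ("integral", "prove", "equation")):
        return "math"
    if any(k in p for k in ("def ", "function", "compile")):
        return "code"
    return "chat"

def route(prompt: str) -> str:
    # Second stage: a fixed category-to-expert lookup.
    return CATEGORY_TO_EXPERT[classify(prompt)]

print(route("Prove that the equation has no real roots."))
# -> Qwen/Qwen2-72B-Instruct
```

Only the selected expert is activated per input, which is how the system keeps the average active parameter count far below the sum of the expert sizes.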
34. Learning Adaptive Lighting via Channel-Aware Guidance
- Author
-
Yang, Qirui, Jiang, Peng-Tao, Zhang, Hao, Chen, Jinwei, Li, Bo, Yue, Huanjing, and Yang, Jingyu
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Electrical Engineering and Systems Science - Image and Video Processing - Abstract
Learning lighting adaption is a key step in obtaining a good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction) and previous studies have mainly investigated these tasks individually. However, we observe that the light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the channel differences reflected in the time and frequency domains are different. Based on the common light property guidance, we propose a Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce the color-separated features that emphasize the light difference of different color channels and combine them with the traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across channels. We introduce dual domain channel modulation to generate color-separated features and a wavelet transform followed by a vision state space module to generate color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources. We provide an anonymous online demo at https://xxxxxx2025.github.io/LALNet/.
- Published
- 2024
35. CPA: Camera-pose-awareness Diffusion Transformer for Video Generation
- Author
-
Wang, Yuelei, Zhang, Jian, Jiang, Pengtao, Zhang, Hao, Chen, Jinwei, and Li, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Despite the significant advancements made by Diffusion Transformer (DiT)-based methods in video generation, there remains a notable gap in controllable camera-pose perspectives. Existing works such as OpenSora do NOT adhere precisely to anticipated trajectories and physical interactions, thereby limiting the flexibility in downstream applications. To alleviate this issue, we introduce CPA, a unified camera-pose-awareness text-to-video generation approach that elaborates the camera movement and integrates the textual, visual, and spatial conditions. Specifically, we deploy the Sparse Motion Encoding (SME) module to transform camera pose information into a spatial-temporal embedding and activate the Temporal Attention Injection (TAI) module to inject motion patches into each ST-DiT block. Our plug-in architecture accommodates the original DiT parameters, facilitating diverse types of camera poses and flexible object movement. Extensive qualitative and quantitative experiments demonstrate that our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.
- Published
- 2024
36. Practical Performative Policy Learning with Strategic Agents
- Author
-
Chen, Qianyi, Chen, Ying, and Li, Bo
- Subjects
Computer Science - Machine Learning ,Computer Science - Computer Science and Game Theory ,Statistics - Methodology ,Statistics - Machine Learning - Abstract
This paper studies the performative policy learning problem, where agents adjust their features in response to a released policy to improve their potential outcomes, inducing an endogenous distribution shift. There has been growing interest in training machine learning models in strategic environments, including strategic classification and performative prediction. However, existing approaches often rely on restrictive parametric assumptions: micro-level utility models in strategic classification and macro-level data distribution maps in performative prediction, severely limiting scalability and generalizability. We approach this problem as a complex causal inference task, relaxing parametric assumptions on both micro-level agent behavior and macro-level data distribution. Leveraging bounded rationality, we uncover a practical low-dimensional structure in distribution shifts and construct an effective mediator in the causal path from the deployed model to the shifted data. We then propose a gradient-based policy optimization algorithm with a differentiable classifier as a substitute for the high-dimensional distribution map. Our algorithm efficiently utilizes batch feedback and limited manipulation patterns. Our approach achieves high sample efficiency compared to methods reliant on bandit feedback or zero-order optimization. We also provide theoretical guarantees for algorithmic convergence. Extensive and challenging experiments on high-dimensional settings demonstrate our method's practical efficacy.
- Published
- 2024
37. FullStack Bench: Evaluating LLMs as Full Stack Coders
- Author
-
Bytedance-Seed-Foundation-Code-Team, Cheng, Yao, Chen, Jianfeng, Chen, Jie, Chen, Li, Chen, Liyu, Chen, Wentao, Chen, Zhengyu, Geng, Shijie, Li, Aoyan, Li, Bo, Li, Bowen, Li, Linyi, Liu, Boyi, Liu, Jerry, Liu, Kaibo, Liu, Qi, Liu, Shukai, Liu, Siyao, Liu, Tianyi, Liu, Tingkai, Liu, Yongfei, Long, Rui, Mai, Jing, Ning, Guanghan, Peng, Z. Y., Shen, Kai, Su, Jiahao, Su, Jing, Sun, Tao, Sun, Yifan, Tao, Yunzhe, Wang, Guoyin, Wang, Siwei, Wang, Xuwu, Wang, Yite, Wang, Zihan, Xia, Jinxiang, Xiang, Liang, Xiao, Xia, Xiao, Yongsheng, Xi, Chenguang, Xin, Shulin, Xu, Jingjing, Xu, Shikun, Yang, Hongxia, Yang, Jack, Yang, Yingxiang, Yuan, Jianbo, Zhang, Jun, Zhang, Yufeng, Zhang, Yuyu, Zheng, Shen, Zhu, He, and Zhu, Ming
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Software Engineering - Abstract
As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, we design in FullStack Bench real-world instructions and corresponding unit test cases from 16 widely used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion., Comment: 26 pages
- Published
- 2024
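The pass/fail unit-test evaluation that a sandbox tool like SandboxFusion performs can be sketched with a bare subprocess. This is a minimal, Python-only illustration under assumed simplifications; a real sandbox adds isolation, resource limits, and multi-language support:

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(solution: str, unit_tests: str, timeout: float = 5.0) -> bool:
    # Run candidate code plus its unit tests in a child interpreter and
    # treat a zero exit code as "all tests passed".
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")` returns `True`, while a wrong solution fails its assertion and returns `False`.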
38. Universal non-Hermitian transport in disordered systems
- Author
-
Li, Bo, Chen, Chuan, and Wang, Zhong
- Subjects
Quantum Physics ,Condensed Matter - Disordered Systems and Neural Networks ,Condensed Matter - Statistical Mechanics ,Physics - Optics - Abstract
In disordered Hermitian systems, localization of energy eigenstates prohibits wave propagation. In non-Hermitian systems, however, wave propagation is possible even when the eigenstates of the Hamiltonian are exponentially localized by disorder. We find in this regime that non-Hermitian wave propagation exhibits novel universal scaling behaviors without a Hermitian counterpart. Furthermore, our theory demonstrates how the tail of the imaginary-part density of states dictates wave propagation in the long-time limit. Specifically, for the three typical classes, namely the Gaussian, the uniform, and the linear imaginary-part density of states, we obtain logarithmically suppressed sub-ballistic transport, and two types of subdiffusion with exponents that depend only on spatial dimensions, respectively. Our work highlights the fundamental differences between Hermitian and non-Hermitian Anderson localization, and uncovers unique universality in non-Hermitian wave propagation., Comment: 5+10 pages, 3+2 figures
- Published
- 2024
39. DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams
- Author
-
Zhang, Xinyu, Zhang, Lingling, Wu, Yanrui, Huang, Muye, Wu, Wenjun, Li, Bo, Wang, Shaowei, and Liu, Jun
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Visual Question Generation (VQG) has gained significant attention due to its potential in educational applications. However, VQG research mainly focuses on natural images, neglecting diagrams in educational materials used to assess students' conceptual understanding. To address this gap, we introduce DiagramQG, a dataset containing 8,372 diagrams and 19,475 questions across various subjects. DiagramQG introduces concept and target text constraints, guiding the model to generate concept-focused questions for educational purposes. Meanwhile, we present the Hierarchical Knowledge Integration framework for Diagram Question Generation (HKI-DQG) as a strong baseline. This framework obtains multi-scale patches of diagrams and acquires knowledge using a visual language model with frozen parameters. It then integrates knowledge, text constraints and patches to generate concept-focused questions. We evaluate the performance of existing VQG models, open-source and closed-source vision-language models, and HKI-DQG on the DiagramQG dataset. Our HKI-DQG outperforms existing methods, demonstrating that it serves as a strong baseline. Furthermore, to assess its generalizability, we apply HKI-DQG to two other VQG datasets of natural images, namely VQG-COCO and K-VQG, achieving state-of-the-art performance. The dataset and code are available at https://dxzxy12138.github.io/diagramqg-home.
- Published
- 2024
40. MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- Author
-
Fu, Chaoyou, Zhang, Yi-Fan, Yin, Shukang, Li, Bo, Fang, Xinyu, Zhao, Sirui, Duan, Haodong, Sun, Xing, Liu, Ziwei, Wang, Liang, Shan, Caifeng, and He, Ran
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops multimodal perception and reasoning capabilities that are impressive, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarized benchmark types, divided by evaluation capabilities, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research., Comment: Produced by MME+MMBench+LLaVA Teams. Project Page: https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Benchmarks
- Published
- 2024
41. Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
- Author
-
Zhang, Kaichen, Shen, Yifei, Li, Bo, and Liu, Ziwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Computation and Language - Abstract
Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework to interpret the open-semantic features learned by the SAE by the LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
- Published
- 2024
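Step 1) of the framework above, disentangling activations with a sparse autoencoder, can be sketched as below. The dictionary size, ReLU encoder, and random weights are illustrative assumptions, and training (reconstruction loss plus an L1 sparsity penalty) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 64, 512                # overcomplete dictionary of features

W_enc = rng.normal(0.0, 0.1, (d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(0.0, 0.1, (d_dict, d_model))

def sae_encode(h):
    # ReLU yields sparse, non-negative feature activations.
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(f):
    # Reconstruct the activation as a weighted sum of dictionary directions.
    return f @ W_dec

h = rng.normal(size=(8, d_model))        # stand-in for LMM hidden activations
f = sae_encode(h)                        # each column is one candidate feature
recon = sae_decode(f)                    # would approximate h after training
```

Each dictionary direction then becomes a candidate interpretable feature: one can collect the inputs that activate it most strongly and ask a larger model to label them, which is the spirit of step 2).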
42. Text-guided Zero-Shot Object Localization
- Author
-
Wang, Jingjing, Piao, Xinglin, Gao, Zongzhi, Li, Bo, Zhang, Yong, and Yin, Baocai
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Object localization, which aims to identify and determine the precise location of specific objects in an image or video, is a prominent topic in computer vision. Most existing object localization methods heavily rely on extensive labeled data, which are costly to annotate and constrain their applicability. Therefore, we propose a new Zero-Shot Object Localization (ZSOL) framework for addressing the aforementioned challenges. In the proposed framework, we introduce the Contrastive Language Image Pre-training (CLIP) module, which could integrate visual and linguistic information effectively. Furthermore, we design a Text Self-Similarity Matching (TSSM) module, which could improve the localization accuracy by enhancing the representation of text features extracted by the CLIP module. Hence, the proposed framework can be guided by prompt words to identify and locate specific objects in an image in the absence of labeled samples. The results of extensive experiments demonstrate that the proposed method improves localization performance significantly and establishes an effective benchmark for further research.
- Published
- 2024
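The core matching idea in the abstract above, comparing a text embedding against per-patch image embeddings to localize an object without labels, can be sketched as follows. This is an illustrative stand-in only: random vectors replace real CLIP outputs, and the grid size, embedding width, and planted target are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed shapes: a 7x7 grid of 512-d patch embeddings and one 512-d text
# embedding, standing in for CLIP encoder outputs (no real model is loaded).
H = W = 7
D = 512
patch_emb = rng.normal(size=(H, W, D))
text_emb = rng.normal(size=D)

# Plant one "target" patch whose embedding aligns with the text embedding.
patch_emb[3, 5] = text_emb * 2.0

def localize(patches, text):
    """Return the grid cell whose patch embedding is most similar to the text."""
    p = patches / np.linalg.norm(patches, axis=-1, keepdims=True)
    t = text / np.linalg.norm(text)
    sim = p @ t  # cosine-similarity heatmap, shape (H, W)
    return np.unravel_index(np.argmax(sim), sim.shape), sim

loc, sim = localize(patch_emb, text_emb)
```

In the actual framework, the TSSM module would additionally refine the text representation before this similarity step; the sketch shows only the zero-shot matching principle.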
43. SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach
- Author
-
Sun, Ruoxi, Chang, Jiamin, Pearce, Hammond, Xiao, Chaowei, Li, Bo, Wu, Qi, Nepal, Surya, and Xue, Minhui
- Subjects
Computer Science - Cryptography and Security - Abstract
Multimodal foundation models (MFMs) represent a significant advancement in artificial intelligence, combining diverse data modalities to enhance learning and understanding across a wide range of applications. However, this integration also brings unique safety and security challenges. In this paper, we conceptualize cybersafety and cybersecurity in the context of multimodal learning and present a comprehensive Systematization of Knowledge (SoK) to unify these concepts in MFMs, identifying key threats to these models. We propose a taxonomy framework grounded in information theory, evaluating and categorizing threats through the concepts of channel capacity, signal, noise, and bandwidth. This approach provides a novel framework that unifies model safety and system security in MFMs, offering a more comprehensive and actionable understanding of the risks involved. We use this framework to examine existing defense mechanisms and identify gaps in current research, particularly the lack of protection for alignment between modalities and the need for more systematic defense methods. Our work contributes to a deeper understanding of the security and safety landscape in MFMs, providing researchers and practitioners with valuable insights for improving the robustness and reliability of these models.
- Published
- 2024
44. Temporal evolution of axially standing kink motions in solar coronal slabs: An eigenfunction expansion approach
- Author
-
Gao, Yuhong, Li, Bo, Shi, Mijie, Chen, Shaoxia, and Yu, Hui
- Subjects
Astrophysics - Solar and Stellar Astrophysics - Abstract
We aim to provide more insights into the applicability to solar coronal seismology of the much-studied discrete leaky modes (DLMs) in classic analyses. Under linear ideal pressureless MHD, we examine two-dimensional (2D) axial fundamental kink motions that arise when localized velocity exciters impact some symmetric slab equilibria. Continuous structuring is allowed for. A 1D initial value problem (IVP) is formulated in conjunction with an eigenvalue problem (EVP) for laterally open systems, with no strict boundary conditions (BCs) at infinity. The IVP is solved by eigenfunction expansion, allowing a clear distinction between the contributions from proper eigenmodes and improper continuum eigenmodes. Example solutions are offered for parameters typical of active region loops. Our solutions show that the system evolves towards long periodicities due to proper eigenmodes (of order the axial Alfven time), whereas the interference of the improper continuum may lead to short periodicities initially (of order the lateral Alfven time). Specializing to the slab axis, we demonstrate that the proper contribution strengthens with the density contrast, but may occasionally be stronger for less steep density profiles. Short periodicities are not guaranteed in the improper contribution, the details of the initial exciter being key. When identifiable, these periodicities tend to agree with the oscillation frequencies expected for DLMs, despite the differences in the BCs between our EVP and classic analyses. The eigenfunction expansion approach enables all qualitative features to be interpreted as the interplay between the initial exciter and some response function, the latter solely determined by the equilibria. Classic theories for DLMs can find seismological applications, with time-dependent studies offering additional ways for constraining initial exciters., Comment: accepted for publication in A&A
- Published
- 2024
45. UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation
- Author
-
Zhang, Chengyuan, Zhang, Yilin, Zhu, Lei, Liu, Deyin, Wu, Lin, Li, Bo, Zhang, Shichao, Bennamoun, Mohammed, and Boussaid, Farid
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This paper introduces a novel framework for unified incremental few-shot object detection (iFSOD) and instance segmentation (iFSIS) using the Transformer architecture. Our goal is to create an optimal solution for situations where only a few examples of novel object classes are available, with no access to training data for base or old classes, while maintaining high performance across both base and novel classes. To achieve this, we extend Mask-DINO into a two-stage incremental learning framework. Stage 1 focuses on optimizing the model on the base dataset, while Stage 2 involves fine-tuning it on novel classes. In addition, we incorporate a classifier selection strategy that assigns appropriate classifiers to the encoder and decoder according to their distinct functions. Empirical evidence indicates that this approach effectively mitigates overfitting when learning novel classes. Furthermore, we implement knowledge distillation to prevent catastrophic forgetting of base classes. Comprehensive evaluations on the COCO and LVIS datasets for both iFSIS and iFSOD tasks demonstrate that our method significantly outperforms state-of-the-art approaches., Comment: 11 pages, 3 figures
- Published
- 2024
46. RedCode: Risky Code Execution and Generation Benchmark for Code Agents
- Author
-
Guo, Chengquan, Liu, Xun, Xie, Chulin, Zhou, Andy, Zeng, Yi, Lin, Zinan, Song, Dawn, and Li, Bo
- Subjects
Computer Science - Software Engineering ,Computer Science - Artificial Intelligence - Abstract
With the rapidly increasing capabilities and adoption of code agents for AI-assisted coding, safety concerns, such as generating or executing risky code, have become significant barriers to the real-world deployment of these agents. To provide comprehensive and practical evaluations of the safety of code agents, we propose RedCode, a benchmark for risky code execution and generation: (1) RedCode-Exec provides challenging prompts that could lead to risky code execution, aiming to evaluate code agents' ability to recognize and handle unsafe code. We provide a total of 4,050 risky test cases in Python and Bash tasks with diverse input formats, including code snippets and natural text. They cover 25 types of critical vulnerabilities spanning 8 domains (e.g., websites, file systems). We provide Docker environments and design corresponding evaluation metrics to assess execution results. (2) RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions to generate harmful code or software. Our empirical findings, derived from evaluating three agent frameworks based on 19 LLMs, provide insights into code agents' vulnerabilities. For instance, evaluations on RedCode-Exec show that agents are more likely to reject executing risky operations on the operating system, but are less likely to reject executing technically buggy code, indicating high risks. Risky operations described in natural text lead to a lower rejection rate than those in code format. Additionally, evaluations on RedCode-Gen show that more capable base models and agents with stronger overall coding abilities, such as GPT4, tend to produce more sophisticated and effective harmful software. Our findings highlight the need for stringent safety evaluations for diverse code agents. Our dataset and code are available at https://github.com/AI-secure/RedCode., Comment: Accepted by NeurIPS 2024 Datasets and Benchmarks Track
- Published
- 2024
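The rejection-rate comparison mentioned in the abstract above can be illustrated with a toy metric. This is a hedged sketch only: the refusal markers and agent responses below are invented for illustration and are not RedCode's actual evaluation code.

```python
# Hypothetical refusal markers; a real evaluator would use a far more
# robust classifier of agent refusals than substring matching.
REFUSAL_MARKERS = ("i cannot", "i can't", "refuse", "not able to assist")

def rejection_rate(responses):
    """Fraction of agent responses that refuse the risky request."""
    refused = sum(
        any(marker in r.lower() for marker in REFUSAL_MARKERS) for r in responses
    )
    return refused / len(responses)

# Invented example responses: two refusals, two compliant executions.
responses = [
    "I cannot help with deleting system files.",
    "Sure, running the script now.",
    "I refuse to execute code that exfiltrates credentials.",
    "Executed: output written to a temp file.",
]
rate = rejection_rate(responses)  # 2 of 4 responses refuse, so 0.5
```

Comparing such rates across prompt formats (code snippet vs. natural text) is what surfaces the finding that natural-language risky requests are rejected less often.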
47. 3D Focusing-and-Matching Network for Multi-Instance Point Cloud Registration
- Author
-
Zhang, Liyuan, Hui, Le, Liu, Qi, Li, Bo, and Dai, Yuchao
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Multi-instance point cloud registration aims to estimate the poses of all instances of a model point cloud in a whole scene. Existing methods all adopt the strategy of first obtaining global correspondences and then clustering to obtain the pose of each instance. However, due to cluttered and occluded objects in the scene, it is difficult to obtain accurate correspondences between the model point cloud and all instances in the scene. To this end, we propose a simple yet powerful 3D focusing-and-matching network for multi-instance point cloud registration that learns multiple pair-wise point cloud registrations. Specifically, we first present a 3D multi-object focusing module to locate the center of each object and generate object proposals. By using self-attention and cross-attention to associate the model point cloud with structurally similar objects, we can locate potential matching instances by regressing object centers. Then, we propose a 3D dual-masking instance matching module to estimate the pose between the model point cloud and each object proposal. It applies instance masks and overlap masks to accurately predict the pair-wise correspondences. Extensive experiments on two public benchmarks, Scan2CAD and ROBI, show that our method achieves new state-of-the-art performance on the multi-instance point cloud registration task. Code is available at https://github.com/zlynpu/3DFMNet., Comment: Accepted to NeurIPS 2024
- Published
- 2024
48. The Limits of Differential Privacy in Online Learning
- Author
-
Li, Bo, Wang, Wei, and Ye, Peng
- Subjects
Computer Science - Machine Learning - Abstract
Differential privacy (DP) is a formal notion that restricts the privacy leakage of an algorithm running on sensitive data, and the privacy-utility trade-off is one of the central problems in private data analysis. In this work, we investigate the fundamental limits of differential privacy in online learning algorithms and present evidence that separates three types of constraints: no DP, pure DP, and approximate DP. We first describe a hypothesis class that is online learnable under approximate DP but not under pure DP in the adaptive adversarial setting. This indicates that approximate DP must be adopted when dealing with adaptive adversaries. We then prove that any private online learner must make an infinite number of mistakes for almost all hypothesis classes. This essentially generalizes previous results and shows a strong separation between private and non-private settings, since a finite mistake bound is always attainable (as long as the class is online learnable) when there is no privacy requirement.
- Published
- 2024
49. Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
- Author
-
Ma, Yingzi, Wang, Jiongxiao, Wang, Fei, Ma, Siyuan, Li, Jiazhao, Li, Xiujun, Huang, Furong, Sun, Lichao, Li, Bo, Choi, Yejin, Chen, Muhao, and Xiao, Chaowei
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Machine unlearning has emerged as an effective strategy for forgetting specific information in the training data. However, with the increasing integration of visual data, privacy concerns in Vision Language Models (VLMs) remain underexplored. To address this, we introduce the Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms under the Right to be Forgotten setting. Specifically, we formulate the VLM unlearning task by constructing the Fictitious Facial Identity VQA dataset and apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. For evaluation, since VLMs support various ways of asking questions with the same semantic meaning, we also provide robust evaluation metrics, including membership inference attacks and carefully designed adversarial privacy attacks, to evaluate the performance of algorithms. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance, with significant trade-offs between model utility and forget quality. Furthermore, our findings also highlight the importance of privacy attacks for robust evaluations. We hope FIUBench will drive progress in developing more effective VLM unlearning algorithms.
- Published
- 2024
50. Minder: Faulty Machine Detection for Large-scale Distributed Model Training
- Author
-
Deng, Yangtao, Shi, Xiang, Jiang, Zhuo, Zhang, Xingjian, Zhang, Lei, Zhang, Zhang, Li, Bo, Song, Zuquan, Zhu, Hang, Liu, Gaohong, Li, Fuliang, Wang, Shuguang, Lin, Haibin, Ye, Jianxi, and Yu, Minlan
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Machine Learning - Abstract
Large-scale distributed model training requires simultaneous training on up to thousands of machines. Faulty machine detection is critical when an unexpected fault occurs in a machine. From our experience, a training task can encounter two faults per day on average, possibly leading to a halt for hours. To address the drawbacks of time-consuming and labor-intensive manual scrutiny, we propose Minder, an automatic faulty machine detector for distributed training tasks. The key idea of Minder is to automatically and efficiently detect the distinctive monitoring-metric patterns of faulty machines, which can persist for a period before the entire training task comes to a halt. Minder has been deployed in our production environment for over one year, monitoring daily distributed training tasks, each involving up to thousands of machines. In our real-world fault detection scenarios, Minder can accurately and efficiently react to faults within 3.6 seconds on average, with a precision of 0.904 and an F1-score of 0.893.
- Published
- 2024
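The idea of flagging a machine whose monitoring metrics deviate from the rest of the fleet, as described in the abstract above, can be sketched with a robust outlier test. This is an illustrative toy under stated assumptions (synthetic metrics, one simulated fault, a robust z-score with an assumed threshold), not Minder's actual detection algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup: per-machine time series of one monitoring metric
# (e.g. throughput), 32 machines x 100 steps; machine 17 degrades halfway.
metrics = rng.normal(loc=100.0, scale=1.0, size=(32, 100))
metrics[17, 50:] -= 20.0  # simulated fault: sustained throughput drop

def flag_faulty(metrics, z_thresh=8.0):
    """Flag machines whose recent average deviates strongly from the fleet's."""
    recent = metrics[:, -20:].mean(axis=1)        # per-machine recent average
    med = np.median(recent)                       # fleet-wide robust center
    mad = np.median(np.abs(recent - med)) + 1e-9  # robust spread estimate
    z = (recent - med) / (1.4826 * mad)           # MAD-based robust z-score
    return np.flatnonzero(np.abs(z) > z_thresh)

faulty = flag_faulty(metrics)  # indices of machines flagged as faulty
```

Using the median and MAD rather than mean and standard deviation keeps the baseline itself from being skewed by the faulty machine, which matters when only a handful of machines misbehave.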