Author: "An, Shaohui" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"An, Shaohui"' showing total 19,192 results


1. BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

Author: Zhao, Junqi, Fang, Zhijin, Li, Shu, Yang, Shaohui, and He, Shichao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69\%}$ under the same memory limit, where full cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with a $\log{n}$ time complexity. The code is available at https://github.com/JunqiZhao888/buzz-llm.
Published: 2024

2. ActiveSplat: High-Fidelity Scene Reconstruction through Active Gaussian Splatting

Author: Li, Yuetao, Kuang, Zijia, Li, Ting, Zhou, Guyue, Zhang, Shaohui, and Yan, Zike
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose ActiveSplat, an autonomous high-fidelity reconstruction system leveraging Gaussian splatting. Taking advantage of efficient and realistic rendering, the system establishes a unified framework for online mapping, viewpoint selection, and path planning. The key to ActiveSplat is a hybrid map representation that integrates both dense information about the environment and a sparse abstraction of the workspace. Therefore, the system leverages sparse topology for efficient viewpoint sampling and path planning, while exploiting view-dependent dense prediction for viewpoint selection, facilitating efficient decision-making with promising accuracy and completeness. A hierarchical planning strategy based on the topological map is adopted to mitigate repetitive trajectories and improve local granularity given limited budgets, ensuring high-fidelity reconstruction with photorealistic view synthesis. Extensive experiments and ablation studies validate the efficacy of the proposed method in terms of reconstruction accuracy, data coverage, and exploration efficiency. Project page: https://li-yuetao.github.io/ActiveSplat/.
Published: 2024

3. Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution

Author: Qiao, Junbo, Liao, Jincheng, Li, Wei, Zhang, Yulun, Guo, Yong, Wen, Yi, Qiu, Zhangxizi, Xie, Jiao, Hu, Jie, and Lin, Shaohui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: State Space Models (SSM), such as Mamba, have shown strong representation ability in modeling long-range dependency with linear complexity, achieving successful applications from high-level to low-level vision tasks. However, SSM's sequential nature necessitates multiple scans in different directions to compensate for the loss of spatial dependency when unfolding the image into a 1D sequence. This multi-direction scanning strategy significantly increases the computation overhead and is unbearable for high-resolution image processing. To address this problem, we propose a novel Hierarchical Mamba network, namely, Hi-Mamba, for image super-resolution (SR). Hi-Mamba consists of two key designs: (1) The Hierarchical Mamba Block (HMB) assembled by a Local SSM (L-SSM) and a Region SSM (R-SSM) both with the single-direction scanning, aggregates multi-scale representations to enhance the context modeling ability. (2) The Direction Alternation Hierarchical Mamba Group (DA-HMG) allocates the isomeric single-direction scanning into cascading HMBs to enrich the spatial relationship modeling. Extensive experiments demonstrate the superiority of Hi-Mamba across five benchmark datasets for efficient SR. For example, Hi-Mamba achieves a significant PSNR improvement of 0.29 dB on Manga109 for $\times3$ SR, compared to the strong lightweight MambaIR.
Published: 2024

4. Robust Incremental Structure-from-Motion with Hybrid Features

Author: Liu, Shaohui, Gao, Yidan, Zhang, Tianyi, Pautrat, Rémi, Schönberger, Johannes L., Larsson, Viktor, and Pollefeys, Marc
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Structure-from-Motion (SfM) has become a ubiquitous tool for camera calibration and scene reconstruction with many downstream applications in computer vision and beyond. While the state-of-the-art SfM pipelines have reached a high level of maturity in well-textured and well-configured scenes over the last decades, they still fall short of robustly solving the SfM problem in challenging scenarios. In particular, weakly textured scenes and poorly constrained configurations oftentimes cause catastrophic failures or large errors for the primarily keypoint-based pipelines. In these scenarios, line segments are often abundant and can offer complementary geometric constraints. Their large spatial extent and typically structured configurations lead to stronger geometric constraints as compared to traditional keypoint-based methods. In this work, we introduce an incremental SfM system that, in addition to points, leverages lines and their structured geometric relations. Our technical contributions span the entire pipeline (mapping, triangulation, registration) and we integrate these into a comprehensive end-to-end SfM system that we share as an open-source software with the community. We also present the first analytical method to propagate uncertainties for 3D optimized lines via sensitivity analysis. Experiments show that our system is consistently more robust and accurate compared to the widely used point-based state of the art in SfM -- achieving richer maps and more precise camera registrations, especially under challenging conditions. In addition, our uncertainty-aware localization module alone is able to consistently improve over the state of the art under both point-alone and hybrid setups.
Comment: 40 pages, 16 figures, 9 tables. To appear in ECCV 2024
Published: 2024

5. A Unified Approach for Learning the Dynamics of Power System Generators and Inverter-based Resources

Author: Liu, Shaohui, Cai, Weiqian, Zhu, Hao, and Johnson, Brian
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Machine Learning
Abstract: The growing prevalence of inverter-based resources (IBRs) for renewable energy integration and electrification greatly challenges power system dynamic analysis. To account for both synchronous generators (SGs) and IBRs, this work presents an approach for learning the model of an individual dynamic component. The recurrent neural network (RNN) model is used to match the recursive structure in predicting the key dynamical states of a component from its terminal bus voltage and set-point input. To deal with the fast transients especially due to IBRs, we develop a Stable Integral (SI-)RNN to mimic high-order integral methods that can enhance the stability and accuracy for the dynamic learning task. We demonstrate that the proposed SI-RNN model not only can successfully predict the component's dynamic behaviors, but also offers the possibility of efficiently computing the dynamic sensitivity relative to a set-point change. These capabilities have been numerically validated based on full-order Electromagnetic Transient (EMT) simulations on a small test system with both SGs and IBRs, particularly for predicting the dynamics of grid-forming inverters.
Published: 2024

6. An Efficient Projection-Based Next-best-view Planning Framework for Reconstruction of Unknown Objects

Author: Jia, Zhizhou, Zhang, Shaohui, and Hao, Qun
Subjects: Computer Science - Robotics
Abstract: Efficiently and completely capturing the three-dimensional data of an object is a fundamental problem in industrial and robotic applications. The task of next-best-view (NBV) planning is to infer the pose of the next viewpoint based on the current data, and gradually realize the complete three-dimensional reconstruction. Many existing algorithms, however, suffer from a large computational burden due to the use of ray-casting. To address this, this paper proposes a projection-based NBV planning framework. It can select the next best view at an extremely fast speed while ensuring the complete scanning of the object. Specifically, this framework refits different types of voxel clusters into ellipsoids based on the voxel structure. Then, the next best view is selected from the candidate views using a projection-based viewpoint quality evaluation function in conjunction with a global partitioning strategy. This process replaces the ray-casting in voxel structures, significantly improving the computational efficiency. Comparative experiments with other algorithms in a simulation environment show that the proposed framework can achieve a tenfold efficiency improvement while capturing roughly the same coverage. The real-world experimental results also prove the efficiency and feasibility of the framework.
Published: 2024

7. Optical intensity-gradient torque due to chiral multipole interplay

Author: Wen, Jiquan, Chen, Huajin, Zheng, Hongxia, Xu, Xiaohao, Yan, Shaohui, Yao, Baoli, and Lin, Zhifang
Subjects: Physics - Optics
Abstract: Owing to the ubiquity and easy-to-shape property of optical intensity, the intensity gradient force of light has been most spectacularly exploited in optical manipulation of small particles. Manifesting the intensity gradient as an optical torque to spin particles is of great fascination on both fundamental and practical sides but remains elusive. Here, we uncover the existence of the optical intensity-gradient torque in the interaction of light with chiral particles. Such a new type of torque derives from the interplay between chirality induced multipoles, which switches its direction for particles with opposite chirality. We show that this torque can be directly detected by a simple standing wave field, created with the interference of two counterpropagating plane-like waves. Our work offers a unique route to achieve rotational control of matter by tailoring the field intensity of Maxwell waves. It also establishes a framework that maps a remarkable connection among the optical forces and torques, across chiral to nonchiral.
Published: 2024

8. Distance Measurement for UAVs in Deep Hazardous Tunnels

Author: Choudhary, Vishal, Gupta, Shashi Kant, Foong, Shaohui, and Lim, Hock Beng
Subjects: Computer Science - Robotics
Abstract: The localization of unmanned aerial vehicles (UAVs) in deep tunnels is extremely challenging due to their inaccessibility and hazardous environment. Conventional outdoor localization techniques (such as using GPS) and indoor localization techniques (such as those based on WiFi, Infrared (IR), Ultra-Wideband, etc.) do not work in deep tunnels. We are developing a UAV-based system for the inspection of defects in the Deep Tunnel Sewerage System (DTSS) in Singapore. To enable UAV localization in the DTSS, we have developed a distance measurement module based on the optical flow technique. However, the standard optical flow technique does not work well in tunnels with poor lighting and a lack of features. Thus, we have developed an enhanced optical flow algorithm with prediction, to improve the distance measurement for UAVs in deep hazardous tunnels.
Published: 2024

9. Batch-FPM: Random batch-update multi-parameter physical Fourier ptychography neural network

Author: Sun, Ruiqing, Yang, Delong, Su, Yiyan, Zhang, Shaohui, and Hao, Qun
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Physics - Optics
Abstract: Fourier Ptychographic Microscopy (FPM) is a computational imaging technique that enables high-resolution imaging over a large field of view. However, its application in the biomedical field has been limited due to the long image reconstruction time and poor noise robustness. In this paper, we propose a fast and robust FPM reconstruction method based on physical neural networks with a batch-update stochastic gradient descent (SGD) optimization strategy, capable of achieving attractive results at a low signal-to-noise ratio and correcting multiple system parameters simultaneously. Our method leverages a random batch optimization approach, breaks away from the fixed sequential iterative order, and gives greater attention to high-frequency information. The proposed method has better convergence performance even for low signal-to-noise ratio data sets, such as low exposure time dark-field images. As a result, it can greatly increase the image recording and result reconstruction speed without any additional hardware modifications. By utilizing advanced deep learning optimizers and a parallel computational scheme, our method enhances GPU computational efficiency, significantly reducing reconstruction costs. Experimental results demonstrate that our method achieves near real-time digital refocusing of a 1024 x 1024 pixels region of interest on consumer-grade GPUs. This approach significantly improves temporal resolution (by reducing the exposure time of dark-field images), noise resistance, and reconstruction speed, and therefore can efficiently promote the practical application of FPM in clinical diagnostics, digital pathology, and biomedical research, etc. In addition, we believe our algorithm scheme can help researchers quickly validate and implement FPM-related ideas. We invite requests for the full code via email.
Published: 2024

10. Attack Anything: Blind DNNs via Universal Background Adversarial Attack

Author: Lian, Jiawei, Mei, Shaohui, Wang, Xiaofei, Wang, Yi, Wang, Lefan, Lu, Yingjie, Ma, Mingyang, and Chau, Lap-Pui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: It has been widely substantiated that deep neural networks (DNNs) are susceptible and vulnerable to adversarial perturbations. Existing studies mainly focus on performing attacks by corrupting targeted objects (physical attack) or images (digital attack), which is intuitively acceptable and understandable in terms of the attack's effectiveness. In contrast, our focus lies in conducting background adversarial attacks in both digital and physical domains, without causing any disruptions to the targeted objects themselves. Specifically, an effective background adversarial attack framework is proposed to attack anything, by which the attack efficacy generalizes well between diverse objects, models, and tasks. Technically, we approach the background adversarial attack as an iterative optimization problem, analogous to the process of DNN learning. Besides, we offer a theoretical demonstration of its convergence under a set of mild but sufficient conditions. To strengthen the attack efficacy and transferability, we propose a new ensemble strategy tailored for adversarial perturbations and introduce an improved smooth constraint for the seamless connection of integrated perturbations. We conduct comprehensive and rigorous experiments in both digital and physical domains across various objects, models, and tasks, demonstrating the effectiveness of attacking anything of the proposed method. The findings of this research substantiate the significant discrepancy between human and machine vision on the value of background variations, which play a far more critical role than previously recognized, necessitating a reevaluation of the robustness and reliability of DNNs. The code will be publicly available at https://github.com/JiaweiLian/Attack_Anything
Published: 2024

11. PADetBench: Towards Benchmarking Physical Attacks against Object Detection

Author: Lian, Jiawei, Pan, Jianhong, Wang, Lefan, Wang, Yi, Chau, Lap-Pui, and Mei, Shaohui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Physical attacks against object detection have gained increasing attention due to their significant practical implications. However, conducting physical experiments is extremely time-consuming and labor-intensive. Moreover, physical dynamics and cross-domain transformation are challenging to strictly regulate in the real world, leading to unaligned evaluation and comparison, severely hindering the development of physically robust models. To accommodate these challenges, we explore utilizing realistic simulation to thoroughly and rigorously benchmark physical attacks with fairness under controlled physical dynamics and cross-domain transformation. This resolves the problem of capturing identical adversarial images that cannot be achieved in the real world. Our benchmark includes 20 physical attack methods, 48 object detectors, comprehensive physical dynamics, and evaluation metrics. We also provide end-to-end pipelines for dataset generation, detection, evaluation, and further analysis. In addition, we perform 8064 groups of evaluation based on our benchmark, which includes both overall evaluation and further detailed ablation studies for controlled physical dynamics. Through these experiments, we provide in-depth analyses of physical attack performance and physical adversarial robustness, draw valuable observations, and discuss potential directions for future research. Codebase: https://github.com/JiaweiLian/Benchmarking_Physical_Attack
Published: 2024

12. Ex3: Automatic Novel Writing by Extracting, Excelsior and Expanding

Author: Huang, Lei, Guo, Jiaming, He, Guanhua, Zhang, Xishan, Zhang, Rui, Peng, Shaohui, Liu, Shaoli, and Chen, Tianshi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Generating long-term texts such as novels using artificial intelligence has always been a challenge. A common approach is to use large language models (LLMs) to construct a hierarchical framework that first plans and then writes. Despite the fact that the generated novels reach a sufficient length, they exhibit poor logical coherence and appeal in their plots and deficiencies in character and event depiction, ultimately compromising the overall narrative quality. In this paper, we propose a method named Extracting, Excelsior and Expanding (Ex3). Ex3 initially extracts structure information from raw novel data. By combining this structure information with the novel data, an instruction-following dataset is meticulously crafted. This dataset is then utilized to fine-tune the LLM, aiming for excelsior generation performance. In the final stage, a tree-like expansion method is deployed to facilitate the generation of arbitrarily long novels. Evaluation against previous methods showcases Ex3's ability to produce higher-quality long-form novels.
Published: 2024

13. Latent Linear Quadratic Regulator for Robotic Control Tasks

Author: Zhang, Yuan, Yang, Shaohui, Ohtsuka, Toshiyuki, Jones, Colin, and Boedecker, Joschka
Subjects: Computer Science - Robotics, Computer Science - Machine Learning
Abstract: Model predictive control (MPC) has played an increasingly crucial role in various robotic control tasks, but its high computational requirements are concerning, especially for nonlinear dynamical models. This paper presents a $\textbf{la}$tent $\textbf{l}$inear $\textbf{q}$uadratic $\textbf{r}$egulator (LaLQR) that maps the state space into a latent space, on which the dynamical model is linear and the cost function is quadratic, allowing the efficient application of LQR. We jointly learn this alternative system by imitating the original MPC. Experiments show LaLQR's superior efficiency and generalization compared to other baselines.
Comment: Accepted at RSS 2024 workshop on Koopman Operators in Robotics
Published: 2024

14. Image Compression for Machine and Human Vision with Spatial-Frequency Adaptation

Author: Li, Han, Li, Shaohui, Ding, Shuangrui, Dai, Wenrui, Cao, Maida, Li, Chenglin, Zou, Junni, and Xiong, Hongkai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image compression for machine and human vision (ICMH) has gained increasing attention in recent years. Existing ICMH methods are limited by high training and storage overheads due to heavy design of task-specific networks. To address this issue, in this paper, we develop a novel lightweight adapter-based tuning framework for ICMH, named Adapt-ICMH, that better balances task performance and bitrates with reduced overheads. We propose a spatial-frequency modulation adapter (SFMA) that simultaneously eliminates non-semantic redundancy with a spatial modulation adapter, and enhances task-relevant frequency components and suppresses task-irrelevant frequency components with a frequency modulation adapter. The proposed adapter is plug-and-play and compatible with almost all existing learned image compression models without compromising the performance of pre-trained models. Experiments demonstrate that Adapt-ICMH consistently outperforms existing ICMH frameworks on various machine vision tasks with fewer fine-tuned parameters and reduced computational complexity. Code will be released at https://github.com/qingshi9974/ECCV2024-AdpatICMH.
Comment: Accepted by ECCV2024, project: https://github.com/qingshi9974/ECCV2024-AdpatICMH
Published: 2024

15. HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

Author: Cao, Liujuan, Lin, Jianghang, Hong, Zebo, Shen, Yunhang, Lin, Shaohui, Chen, Chao, and Ji, Rongrong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most WSOD methods rely on traditional object proposals to generate candidate regions and are confronted with unstable training, which easily gets stuck in a poor local optimum. In this paper, we introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD, which utilizes a comprehensive self-training framework without needing external modules or additional supervision. HUWSOD innovatively incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate resampling pyramid to replace traditional object proposals, enabling end-to-end WSOD training and inference. Additionally, we implement a holistic self-training scheme that refines detection scores and coordinates through step-wise entropy minimization and consistency-constraint regularization, ensuring consistent predictions across stochastic augmentations of the same image. Extensive experiments on PASCAL VOC and MS COCO demonstrate that HUWSOD competes with state-of-the-art WSOD methods, eliminating the need for offline proposals and additional data. The peak performance of HUWSOD approaches that of fully-supervised Faster R-CNN. Our findings also indicate that randomly initialized boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
Published: 2024

16. Prompt-based Visual Alignment for Zero-shot Policy Transfer

Author: Gao, Haihan, Zhang, Rui, Yi, Qi, Yao, Hantao, Li, Haochen, Guo, Jiaming, Peng, Shaohui, Gao, Yunkai, Wang, QiCheng, Hu, Xing, Wen, Yuanbo, Zhang, Zihao, Du, Zidong, Li, Ling, Guo, Qi, and Chen, Yunji
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Overfitting has become one of the main obstacles to applications of reinforcement learning (RL). Existing methods do not provide an explicit semantic constraint for the feature extractor, hindering the agent from learning a unified cross-domain representation and resulting in performance degradation on unseen domains. Besides, abundant data from multiple domains are needed. To address these issues, in this work, we propose prompt-based visual alignment (PVA), a robust framework to mitigate the detrimental domain bias in the image for zero-shot policy transfer. Inspired by the observation that a Visual-Language Model (VLM) can serve as a bridge to connect both text space and image space, we leverage the semantic information contained in a text sequence as an explicit constraint to train a visual aligner. Thus, the visual aligner can map images from multiple domains to a unified domain and achieve good generalization performance. To better depict semantic information, prompt tuning is applied to learn a sequence of learnable tokens. With explicit constraints of semantic information, PVA can learn a unified cross-domain representation under limited access to cross-domain data and achieves great zero-shot generalization ability in unseen domains. We verify PVA on a vision-based autonomous driving task with the CARLA simulator. Experiments show that the agent generalizes well on unseen domains under limited access to multi-domain data.
Comment: This paper has been accepted by ICML2024
Published: 2024

17. Scientists in the Textbook: Development and Validation of an Analytical Framework for Analyzing Scientists' Portrayals in an American Chemistry Textbook

Author: Shaohui Chi, Zuhao Wang, and Li Qian
Abstract: Enabling students to learn about science is essential for science education. Students are expected to not only gain scientific knowledge but also need to develop a deep understanding of science. One approach to equipping students with a sense of science is to present science as a living collective human enterprise. As essential educational resources, science textbooks are powerful supportive tools for helping students be aware of the tentative, historical, and humanistic features of science. This study aims to develop and validate a comprehensive analytical framework for examining how a science textbook enables students to understand science and scientists through scientists' portrayals. The final analytical framework comprises five themes and 13 dimensions concerning scientists and their work, including the textbook's representation method (i.e., format and role of representation), scientists' background (i.e., personal and social background), scientists' work-related features (i.e., motivation for doing research, research methods, and way of working), scientists' achievements (i.e., type, evaluation, and influence), and educational values of scientists and their work (i.e., scientific thinking, scientific attitudes, and social responsibility). The analysis results of an American high school science textbook indicate that the framework developed is feasible to cover all the desired scientist-related elements and evaluate the extent of scientists' portrayals presented in the textbooks. In addition, the results also revealed that this textbook is inadequate in providing students with a comprehensive understanding of science and scientists via its portrayals of scientists.
Published: 2024

18. Instructor's Low Guided Gaze Duration Improves Learning Performance for Students with Low Prior Knowledge in Video Lectures

Author: Yawen Shi, Zengzhao Chen, Mengke Wang, Shaohui Chen, and Jianwen Sun
Abstract: Background: Guided gaze is the instructor's gaze towards teaching materials to guide students' attention, and it plays a vital role in enhancing video-based education. The duration of guided gaze, indicating how long instructors focus on teaching materials, varies based on the lecture design. Nevertheless, the impact of varying durations of guided gaze, especially concerning students' prior knowledge, remains inadequately understood. Objectives: This study investigates the influence of the instructor's guided gaze duration and students' prior knowledge on learning performance and affective experiences in video lectures. Methods: 145 fifth-grade students participated and were divided into high and low prior knowledge groups based on a pre-test. Within each group, students were randomly assigned to view one of three video lectures with different guided gaze durations (high vs. medium vs. low). Learning performance and affective experiences (learning experience, satisfaction, and emotions) were measured as dependent variables. Results and Conclusion: The results revealed that low guided gaze duration significantly improves learning performance for students with low prior knowledge. Conversely, high guided gaze duration negatively impacts learning experience, satisfaction, and positive emotions. Additionally, students with high prior knowledge reported higher learning experience and satisfaction. These findings highlight the interaction between guided gaze duration and prior knowledge in students' learning performance. Implications: Our findings provide valuable implications for the design of guided gaze duration in video lectures based on students' prior knowledge. By adjusting guided gaze duration appropriately, instructors can optimise students' learning performance and affective experiences.
Published: 2024

19. Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Author: Fu, Chaoyou, Dai, Yuhan, Luo, Yongdong, Li, Lei, Ren, Shuhuai, Zhang, Renrui, Wang, Zihan, Zhou, Chenyu, Shen, Yunhang, Zhang, Mengdan, Chen, Peixian, Li, Yanwei, Lin, Shaohui, Zhao, Sirui, Li, Ke, Xu, Tong, Zheng, Xiawu, Chen, Enhong, Ji, Rongrong, and Sun, Xing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 254 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
Published: 2024

20. 3D Neural Edge Reconstruction

Author: Li, Lei, Peng, Songyou, Yu, Zehao, Liu, Shaohui, Pautrat, Rémi, Yin, Xiaochuan, and Pollefeys, Marc
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-world objects and environments are predominantly composed of edge features, including straight lines and curves. Such edges are crucial elements for various applications, such as CAD modeling, surface meshing, lane mapping, etc. However, existing traditional methods only prioritize lines over curves for simplicity in geometric modeling. To this end, we introduce EMAP, a new method for learning 3D edge representations with a focus on both lines and curves. Our method implicitly encodes 3D edge distance and direction in Unsigned Distance Functions (UDF) from multi-view edge maps. On top of this neural representation, we propose an edge extraction algorithm that robustly abstracts parametric 3D edges from the inferred edge points and their directions. Comprehensive evaluations demonstrate that our method achieves better 3D edge reconstruction on multiple challenging datasets. We further show that our learned UDF field enhances neural surface reconstruction by capturing more details.
Comment: Project page: https://neural-edge-map.github.io
Published: 2024

21. GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane

Author: Qu, Yansong, Dai, Shaohui, Li, Xinyang, Lin, Jianghang, Cao, Liujuan, Zhang, Shengchuan, and Ji, Rongrong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: 3D open-vocabulary scene understanding, crucial for advancing augmented reality and robotic applications, involves interpreting and locating specific regions within a 3D space as directed by natural language instructions. To this end, we introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) and identifies 3D Gaussians of Interest using an Optimizable Semantic-space Hyperplane. Our approach includes an efficient compression method that utilizes scene priors to condense noisy high-dimensional semantic features into compact low-dimensional vectors, which are subsequently embedded in 3DGS. During the open-vocabulary querying process, we adopt a distinct approach compared to existing methods, which depend on a manually set fixed empirical threshold to select regions based on their semantic feature distance to the query text embedding. This traditional approach often lacks universal accuracy, leading to challenges in precisely identifying specific target areas. Instead, our method treats the feature selection process as a hyperplane division within the feature space, retaining only those features that are highly relevant to the query. We leverage off-the-shelf 2D Referring Expression Segmentation (RES) models to fine-tune the semantic-space hyperplane, enabling a more precise distinction between target regions and others. This fine-tuning substantially improves the accuracy of open-vocabulary queries, ensuring the precise localization of pertinent 3D Gaussians. Extensive experiments demonstrate GOI's superiority over previous state-of-the-art methods. Our project page is available at https://quyans.github.io/GOI-Hyperplane/
Published: 2024

22. Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification

Author: Guo, Yuxuan, Peng, Shaohui, Guo, Jiaming, Huang, Di, Zhang, Xishan, Zhang, Rui, Hao, Yifan, Li, Ling, Tian, Zikang, Gao, Mingju, Li, Yutai, Gan, Yiming, Liang, Shuai, Zhang, Zihao, Du, Zidong, Guo, Qi, Hu, Xing, and Chen, Yunji
Subjects: Computer Science - Artificial Intelligence
Abstract: Building open agents has always been the ultimate goal in AI research, and creative agents are the more enticing. Existing LLM agents excel at long-horizon tasks with well-defined goals (e.g., 'mine diamonds' in Minecraft). However, they encounter difficulties on creative tasks with open goals and abstract criteria due to the inability to bridge the gap between them, thus lacking feedback for self-improvement in solving the task. In this work, we introduce autonomous embodied verification techniques for agents to fill the gap, laying the groundwork for creative tasks. Specifically, we propose the Luban agent, targeting creative building tasks in Minecraft, which is equipped with two-level autonomous embodied verification inspired by human design practices: (1) visual verification of 3D structural speculates, which comes from agent synthesized CAD modeling programs; (2) pragmatic verification of the creation by generating and verifying environment-relevant functionality programs based on the abstract criteria. Extensive multi-dimensional human studies and Elo ratings show that the Luban completes diverse creative building tasks in our proposed benchmark and outperforms other baselines (33% to 100%) in both visualization and pragmatism. Additional demos on the real-world robotic arm show the creation potential of the Luban in the physical world.
Published: 2024

23. NeRF in Robotics: A Survey

Author: Wang, Guangming, Pan, Lei, Peng, Songyou, Liu, Shaohui, Xu, Chenfeng, Miao, Yanzi, Zhan, Wei, Tomizuka, Masayoshi, Pollefeys, Marc, and Wang, Hesheng
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Meticulous 3D environment representations have been a longstanding goal in computer vision and robotics fields. The recent emergence of neural implicit representations has introduced radical innovation to this field as implicit representations enable numerous capabilities. Among these, the Neural Radiance Field (NeRF) has sparked a trend because of the huge representational advantages, such as simplified mathematical models, compact environment storage, and continuous scene representations. Apart from computer vision, NeRF has also shown tremendous potential in the field of robotics. Thus, we create this survey to provide a comprehensive understanding of NeRF in the field of robotics. By exploring the advantages and limitations of NeRF, as well as its current applications and future potential, we hope to shed light on this promising area of research. Our survey is divided into two main sections: "The Application of NeRF in Robotics" and "The Advance of NeRF in Robotics", from the perspective of how NeRF enters the field of robotics. In the first section, we introduce and analyze some works that have been or could be used in the field of robotics from the perception and interaction perspectives. In the second section, we show some works related to improving NeRF's own properties, which are essential for deploying NeRF in the field of robotics. In the discussion section of the review, we summarize the existing challenges and provide some valuable future research directions for reference.
Comment: 21 pages, 19 figures
Published: 2024

24. SC-HVPPNet: Spatial and Channel Hybrid-Attention Video Post-Processing Network with CNN and Transformer

Author: Zhang, Tong, Cui, Wenxue, Liu, Shaohui, and Jiang, Feng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Convolutional Neural Network (CNN) and Transformer have attracted much attention recently for video post-processing (VPP). However, the interaction between CNN and Transformer in existing VPP methods is not fully explored, leading to inefficient communication between the local and global extracted features. In this paper, we explore the interaction between CNN and Transformer in the task of VPP, and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which can cooperatively exploit the image priors in both spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which can blend the deep representations at the channel dimension dynamically. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for Y, U, and V components in the VTM-11.0-NNVC RA configuration.
Published: 2024

25. The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report

Author: Ren, Bin, Li, Yawei, Mehta, Nancy, Timofte, Radu, Yu, Hongyuan, Wan, Cheng, Hong, Yuxin, Han, Bingnan, Wu, Zhuoyuan, Zou, Yajun, Liu, Yuqing, Li, Jizhe, He, Keji, Fan, Chao, Zhang, Heng, Zhang, Xiaolin, Yin, Xuanwu, Zuo, Kunlong, Liao, Bohao, Xia, Peizhe, Peng, Long, Du, Zhibo, Di, Xin, Li, Wangkai, Wang, Yang, Zhai, Wei, Pei, Renjing, Guo, Jiaming, Xu, Songcen, Cao, Yang, Zha, Zhengjun, Wang, Yan, Liu, Yi, Wang, Qing, Zhang, Gang, Zhang, Liou, Zhao, Shijie, Sun, Long, Pan, Jinshan, Dong, Jiangxin, Tang, Jinhui, Liu, Xin, Yan, Min, Wang, Qian, Zhou, Menghan, Yan, Yiqiang, Liu, Yixuan, Chan, Wensong, Tang, Dehua, Zhou, Dong, Wang, Li, Tian, Lu, Emad, Barsoum, Jia, Bohan, Qiao, Junbo, Zhou, Yunshuai, Zhang, Yun, Li, Wei, Lin, Shaohui, Zhou, Shenglong, Chen, Binbin, Liao, Jincheng, Zhao, Suiyi, Zhang, Zhao, Wang, Bo, Luo, Yan, Wei, Yanyan, Li, Feng, Wang, Mingshen, Guan, Jinhan, Hu, Dehua, Yu, Jiawei, Xu, Qisheng, Sun, Tao, Lan, Long, Xu, Kele, Lin, Xin, Yue, Jingtong, Yang, Lehan, Du, Shiyi, Qi, Lu, Ren, Chao, Han, Zeyu, Wang, Yuhan, Chen, Chaolin, Li, Haobo, Zheng, Mingjun, Yang, Zhongbao, Song, Lianhong, Yan, Xingzhuo, Fu, Minghan, Zhang, Jingyi, Li, Baiang, Zhu, Qi, Xu, Xiaogang, Guo, Dan, Guo, Chunle, Chen, Jiadi, Long, Huanhuan, Duanmu, Chunjiang, Lei, Xiaoyan, Liu, Jie, Jia, Weilin, Cao, Weifeng, Zhang, Wenlong, Mao, Yanyu, Guo, Ruilong, Zhang, Nihao, Pandey, Manoj, Chernozhukov, Maksym, Le, Giang, Cheng, Shuli, Wang, Hongyuan, Wei, Ziyan, Tang, Qingting, Wang, Liejun, Li, Yongming, Guo, Yanhui, Xu, Hao, Khatami-Rizi, Akram, Mahmoudi-Aznaveh, Ahmad, Hsu, Chih-Chung, Lee, Chia-Ming, Chou, Yi-Shiuan, Joshi, Amogh, Akalwadi, Nikhil, Malagi, Sampada, Yashaswini, Palani, Desai, Chaitra, Tabib, Ramesh Ashok, Patil, Ujwala, and Mudenagudi, Uma
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
Comment: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024
Published: 2024

26. Fusion-Mamba for Cross-modality Object Detection

Author: Dong, Wenhao, Zhu, Haodong, Lin, Shaohui, Luo, Xiaoyan, Shen, Yunhang, Liu, Xuhui, Zhang, Juan, Guo, Guodong, and Zhang, Baochang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Cross-modality fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborated neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance, as different modalities with different camera focal lengths, placements, and angles are hardly fused. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) enables deep fusion in a hidden state space. Through extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods on $m$AP with 5.9% on $M^3FD$ and 4.9% on FLIR-Aligned datasets, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and establish a new baseline for cross-modality object detection.
Published: 2024

27. LIPT: Latency-aware Image Processing Transformer

Author: Qiao, Junbo, Li, Wei, Xie, Haizhen, Chen, Hanting, Zhou, Yunshuai, Tu, Zhijun, Hu, Jie, and Lin, Shaohui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Transformer is leading a trend in the field of image processing. Despite the great success that existing lightweight image processing transformers have achieved, they are tailored to FLOPs or parameters reduction, rather than practical inference acceleration. In this paper, we present a latency-aware image processing transformer, termed LIPT. We devise the low-latency proportion LIPT block that substitutes memory-intensive operators with the combination of self-attention and convolutions to achieve practical speedup. Specifically, we propose a novel non-volatile sparse masking self-attention (NVSM-SA) that utilizes a pre-computing sparse mask to capture contextual information from a larger window with no extra computation overload. Besides, a high-frequency reparameterization module (HRM) is proposed to make LIPT block reparameterization friendly, which improves the model's detail reconstruction capability. Extensive experiments on multiple image processing tasks (e.g., image super-resolution (SR), JPEG artifact reduction, and image denoising) demonstrate the superiority of LIPT on both latency and PSNR. LIPT achieves real-time GPU inference with state-of-the-art performance on multiple image SR benchmarks.
Published: 2024

28. Knowledge Distillation with Multi-granularity Mixture of Priors for Image Super-Resolution

Author: Li, Simiao, Zhang, Yun, Li, Wei, Chen, Hanting, Wang, Wenjia, Jing, Bingyi, Lin, Shaohui, and Hu, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Knowledge distillation (KD) is a promising yet challenging model compression technique that transfers rich learning representations from a well-performing but cumbersome teacher model to a compact student model. Previous methods for image super-resolution (SR) mostly compare the feature maps directly or after standardizing the dimensions with basic algebraic operations (e.g. average, dot-product). However, the intrinsic semantic differences among feature maps are overlooked, which are caused by the disparate expressive capacity between the networks. This work presents MiPKD, a multi-granularity mixture of prior KD framework, to facilitate efficient SR model through the feature mixture in a unified latent space and stochastic network block mixture. Extensive experiments demonstrate the effectiveness of the proposed MiPKD method.
Published: 2024

29. A General and Efficient Training for Transformer via Token Expansion

Author: Huang, Wenxuan, Shen, Yunhang, Xie, Jiao, Zhang, Baochang, He, Gaoqi, Li, Ke, Sun, Xing, and Lin, Shaohui
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality with accuracy dropping. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers, preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g., DeiT and LV-ViT), but is also effective for efficient training frameworks (e.g., EfficientTrain), without twisting the original training hyper-parameters or architecture, or introducing additional training strategies. Extensive experiments demonstrate that ToE achieves about 1.3x faster training of ViTs in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion.
Comment: Accepted to CVPR 2024. Code is available at https://github.com/Osilly/TokenExpansion
Published: 2024

30. Polarized Charge Dynamics of a Novel Charge Density Wave in Kagome FeGe

Author: Yi, Shaohui, Liao, Zhiyu, Wang, Qi, Ma, Haiyang, Liu, Jianpeng, Teng, Xiaokun, Dai, Pengcheng, Dai, Yaomin, Zhao, Jianzhou, Qi, Yanpeng, Xu, Bing, and Qiu, Xianggang
Subjects: Condensed Matter - Strongly Correlated Electrons, Condensed Matter - Materials Science
Abstract: We report on the charge dynamics of kagome FeGe, an antiferromagnet with a charge density wave (CDW) transition at $T_{\mathrm{CDW}} \simeq 105$ K, using polarized infrared spectroscopy and band structure calculations. We reveal a pronounced optical anisotropy, various excitations associated with flat bands and van Hove singularities (VHSs), and a moderate level of electronic correlations. Notably, there are two types of remarkable spectral weight (SW) redistributions for above and below $T_{\mathrm{CDW}}$. The former involves a transfer between incoherent and coherent excitations driven by the magnetic splitting-induced elevation of flat bands. The latter manifests itself as a sudden change of SW from low to high energies for both $a$ and $c$ directions, suggesting a first-order transition and the three-dimensional nature of CDW. These anomalies in SW significantly differ from those observed in other kagome metals like CsV$_3$Sb$_5$, where the nesting of VHSs results in a pronounced CDW gap feature. Instead, our findings can be accounted for by the jump of VHSs relative to the Fermi energy via a first-order structural transition involving large partial Ge1-dimerization. Our study thus unveils a complex interplay among structure, magnetism, electronic correlations, and charge order in FeGe, offering valuable insights for a comprehensive understanding of CDW order in kagome systems.
Comment: 7 pages, 3 figures
Published: 2024

31. Deep learning prediction of ribosome profiling with Translatomer reveals translational regulation and interprets disease variants

Author: He, Jialin, Xiong, Lei, Shi, Shaohui, Li, Chengyu, Chen, Kexuan, Fang, Qianchen, Nan, Jiuhong, Ding, Ke, Mao, Yuanhui, Boix, Carles A., Hu, Xinyang, Kellis, Manolis, Li, Jingyun, and Xiong, Xushen
Published: 2024

32. Health co-benefits of post-COVID-19 low-carbon recovery in Chinese cities

Author: Lu, Chenxi, Huang, Yingjian, Yu, Ying, Hu, Jiawei, Mo, Huibin, Li, Yun, Huo, Da, Song, Xuanren, Huang, Xiaoting, Sun, Yun, Liu, Kai, Zhang, Shaohui, Morrissey, Karyn, Hong, Jinpyo, Deng, Zhu, Du, Zhuanjia, Creutzig, Felix, and Liu, Zhu
Published: 2024

33. FDI, new development philosophy and China’s high-quality economic development

Author: Zhang, Shaohui, Han, Zhongxian, and Guo, Mingwei
Published: 2024

34. Real-time defect detection for FFF 3D printing using lightweight model deployment

Author: Hu, WenJing, Chen, Chang, Su, Shaohui, Zhang, Jian, and Zhu, An
Published: 2024

35. A nomogram predicting intraoperative adverse events during minimally invasive radical nephrectomy and thrombectomy

Author: Chen, Kewei, Yu, Le, Ge, Liyuan, Deng, Shaohui, Zhang, Fan, Wang, Guoliang, Tian, Xiaojun, Zhang, Hongxian, and Zhang, Shudong
Published: 2024

36. Photoinduced interface activation strategy for enhancing photocatalytic hydrogen production performance of plasmonic nano Bi/Ni based metal-organic framework

Author: Zhang, Baichao, Cao, Xuchuan, Suo, Chao, Cui, Jing, Duan, Xiaochuan, Guo, Shaohui, and Zhang, Xian-Ming
Published: 2024

37. Predicting the impacts of climate change on the geographic distribution of moso bamboo in China based on biomod2 model

Author: Gu, Rui, Wei, Songpo, Li, Jiarui, Zheng, Shihui, Li, Zhiteng, Liu, Guanglu, and Fan, Shaohui
Published: 2024

38. Thyroid dysfunction in nonvalvular atrial fibrillation and clinical outcomes

Author: Chen, Zeni, Wan, Huaibin, Min, Tingting, Su, Shaohui, and Yang, De-Guang
Published: 2024

39. Targeting HSP90 in Gynecologic Cancer: Molecular Mechanisms and Therapeutic Approaches

Author: Min, Lu, Li, Xuewei, Liang, Lily, Ruan, Zheng, and Yu, Shaohui
Published: 2024

40. Apoptosis induced by cationic liposome based on the mitochondrial signaling pathway in vitro

Author: Du, Sang, Wang, Yueying, Li, Min, Zhao, Yinan, Zhi, Defu, Cui, Shaohui, and Zhang, Shubiao
Published: 2024

41. An Improved Ensemble Learning Method for Protein Content Analysis of Corn with Small Sample by Near-Infrared Spectroscopy

Author: Liu, Jing and Yu, Shaohui
Published: 2024

42. Deep Network for Image Compressed Sensing Coding Using Local Structural Sampling

Author: Cui, Wenxue, Wang, Xingtao, Fan, Xiaopeng, Liu, Shaohui, Gao, Xinwei, and Zhao, Debin
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing image compressed sensing (CS) coding frameworks usually solve an inverse problem based on measurement coding and optimization-based image reconstruction, which still face the following two challenges: 1) The widely used random sampling matrix, such as the Gaussian Random Matrix (GRM), usually leads to low measurement coding efficiency. 2) The optimization-based reconstruction methods generally maintain a much higher computational complexity. In this paper, we propose a new CNN-based image CS coding framework using local structural sampling (dubbed CSCNet) that includes three functional modules: local structural sampling, measurement coding and Laplacian pyramid reconstruction. In the proposed framework, instead of GRM, a new local structural sampling matrix is first developed, which is able to enhance the correlation between the measurements through a local perceptual sampling strategy. Besides, the designed local structural sampling matrix can be jointly optimized with the other functional modules during the training process. After sampling, the measurements with high correlations are produced, which are then coded into final bitstreams by the third-party image codec. At last, a Laplacian pyramid reconstruction network is proposed to efficiently recover the target image from the measurement domain to the image domain. Extensive experimental results demonstrate that the proposed scheme outperforms the existing state-of-the-art CS coding methods, while maintaining fast computational speed.
Comment: Accepted by ACM Transactions on Multimedia Computing Communications and Applications (TOMM)
Published: 2024

43. Spec-Gaussian: Anisotropic View-Dependent Appearance for 3D Gaussian Splatting

Author: Yang, Ziyi, Gao, Xinyu, Sun, Yangtian, Huang, Yihua, Lyu, Xiaoyang, Zhou, Wen, Jiao, Shaohui, Qi, Xiaojuan, and Jin, Xiaogang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent advancements in 3D Gaussian splatting (3D-GS) have not only facilitated real-time rendering through modern GPU rasterization pipelines but have also attained state-of-the-art rendering quality. Nevertheless, despite its exceptional rendering quality and performance on standard datasets, 3D-GS frequently encounters difficulties in accurately modeling specular and anisotropic components. This issue stems from the limited ability of spherical harmonics (SH) to represent high-frequency information. To overcome this challenge, we introduce Spec-Gaussian, an approach that utilizes an anisotropic spherical Gaussian (ASG) appearance field instead of SH for modeling the view-dependent appearance of each 3D Gaussian. Additionally, we have developed a coarse-to-fine training strategy to improve learning efficiency and eliminate floaters caused by overfitting in real-world scenes. Our experimental results demonstrate that our method surpasses existing approaches in terms of rendering quality. Thanks to ASG, we have significantly improved the ability of 3D-GS to model scenes with specular and anisotropic components without increasing the number of 3D Gaussians. This improvement extends the applicability of 3D-GS to handle intricate scenarios with specular and anisotropic surfaces. Project page is https://ingra14m.github.io/Spec-Gaussian-website/.
Comment: Accepted by NeurIPS 2024
Published: 2024

44. Assessing and Understanding Creativity in Large Language Models

Author: Zhao, Yunpu, Zhang, Rui, Li, Wenyi, Huang, Di, Guo, Jiaming, Peng, Shaohui, Hao, Yifan, Wen, Yuanbo, Hu, Xing, Du, Zidong, Guo, Qi, Li, Ling, and Chen, Yunji
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In the field of natural language processing, the rapid development of large language models (LLMs) has attracted more and more attention. LLMs have shown a high level of creativity in various tasks, but the methods for assessing such creativity are inadequate. The assessment of LLM creativity needs to consider differences from humans, requiring multi-dimensional measurement while balancing accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting the modified Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria including Fluency, Flexibility, Originality, and Elaboration. In this context, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. In addition, this study presents a novel analysis of LLMs' responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration. Besides, the use of prompts and the role-play settings of the model significantly influence creativity. Additionally, the experimental results also indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs' creativity and potential applications.
Published: 2024

45. Rethinking Centered Kernel Alignment in Knowledge Distillation

Author: Zhou, Zikai, Shen, Yunhang, Shao, Shitong, Gong, Linrui, and Lin, Shaohui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Knowledge distillation has emerged as a highly effective method for bridging the representation discrepancy between large-scale models and lightweight models. Prevalent approaches involve leveraging appropriate metrics to minimize the divergence or distance between the knowledge extracted from the teacher model and the knowledge learned by the student model. Centered Kernel Alignment (CKA) is widely used to measure representation similarity and has been applied in several knowledge distillation methods. However, these methods are complex and fail to uncover the essence of CKA, thus not answering the question of how to use CKA to achieve simple and effective distillation properly. This paper first provides a theoretical perspective to illustrate the effectiveness of CKA, which decouples CKA to the upper bound of Maximum Mean Discrepancy (MMD) and a constant term. Drawing from this, we propose a novel Relation-Centered Kernel Alignment (RCKA) framework, which practically establishes a connection between CKA and MMD. Furthermore, we dynamically customize the application of CKA based on the characteristics of each task, with fewer computational resources yet performance comparable to the previous methods. The extensive experiments on CIFAR-100, ImageNet-1k, and MS-COCO demonstrate that our method achieves state-of-the-art performance on almost all teacher-student pairs for image classification and object detection, validating the effectiveness of our approaches. Our code is available at https://github.com/Klayand/PCKA
Published: 2024

46. Hybrid deep learning and physics-based neural network for programmable illumination computational microscopy

Author: Sun, Ruiqing, Yang, Delong, Zhang, Shaohui, and Hao, Qun
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Physics - Biological Physics, Physics - Optics
Abstract: Relying on either deep models or physical models are two mainstream approaches for solving inverse sample reconstruction problems in programmable illumination computational microscopy. Solutions based on physical models possess strong generalization capabilities while struggling with global optimization of inverse problems due to a lack of sufficient physical constraints. In contrast, deep learning methods have strong problem-solving abilities, but their generalization ability is often questioned because of the unclear physical principles. Besides, conventional deep models are difficult to apply to some specific scenes because of the difficulty in acquiring high-quality training data and their limited capacity to generalize across different scenarios. In this paper, to combine the advantages of deep models and physical models, we propose a hybrid framework consisting of three sub-neural networks (two deep learning networks and one physics-based network). We first obtain a result with rich semantic information through a light deep learning neural network and then use it as the initial value of the physical network to make its output comply with physical process constraints. These two results are then used as the input of a fusion deep learning neural network which utilizes the paired features between the reconstruction results of two different models to further enhance imaging quality. The final result integrates the advantages of both deep models and physical models and can quickly solve the computational reconstruction inverse problem in programmable illumination computational microscopy and achieve better results. We verified the feasibility and effectiveness of the proposed hybrid framework with theoretical analysis and actual experiments on resolution targets and biological samples.
Published: 2024

47. A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

Author: Fu, Chaoyou, Zhang, Renrui, Wang, Zihan, Huang, Yubo, Zhang, Zhengye, Qiu, Longtian, Ye, Gaoxiang, Shen, Yunhang, Zhang, Mengdan, Chen, Peixian, Zhao, Sirui, Lin, Shaohui, Jiang, Deqiang, Yin, Di, Gao, Peng, Li, Ke, Li, Hongsheng, and Sun, Xing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The surge of interest towards Multi-modal Large Language Models (MLLMs), e.g., GPT-4V(ision) from OpenAI, has marked a significant trend in both academia and industry. They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks. Very recently, Google released Gemini, its newest and most capable MLLM built from the ground up for multi-modality. In light of the superior reasoning capabilities, can Gemini challenge GPT-4V's leading position in multi-modal learning? In this paper, we present a preliminary exploration of Gemini Pro's visual understanding proficiency, which comprehensively covers four domains: fundamental perception, advanced cognition, challenging vision tasks, and various expert capacities. We compare Gemini Pro with the state-of-the-art GPT-4V to evaluate its upper limits, along with the latest open-sourced MLLM, Sphinx, which reveals the gap between manual efforts and black-box systems. The qualitative samples indicate that, while GPT-4V and Gemini showcase different answering styles and preferences, they can exhibit comparable visual reasoning capabilities, and Sphinx still trails behind them concerning domain generalizability. Specifically, GPT-4V tends to elaborate detailed explanations and intermediate steps, and Gemini prefers to output a direct and concise answer. The quantitative evaluation on the popular MME benchmark also demonstrates the potential of Gemini to be a strong challenger to GPT-4V. Our early investigation of Gemini also observes some common issues of MLLMs, indicating that there still remains a considerable distance towards artificial general intelligence. Our project for tracking the progress of MLLM is released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.
Comment: Total 120 pages. See our project at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models
Published: 2023

48. Weakly Supervised Open-Vocabulary Object Detection

Author: Lin, Jianghang, Shen, Yunhang, Wang, Bingquan, Lin, Shaohui, Li, Ke, and Cao, Liujuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves a new state of the art compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).
Comment: Accepted by AAAI2024
Published: 2023

49. SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

Author: Li, Yunchen, Yu, Zhou, He, Gaoqi, Shen, Yunhang, Li, Ke, Sun, Xing, and Lin, Shaohui
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Symmetric positive definite (SPD) matrices have shown important value and applications in statistics and machine learning, such as fMRI analysis and traffic prediction. Previous works on SPD matrices mostly focus on discriminative models, where predictions are made directly on $E(X|y)$, where $y$ is a vector and $X$ is an SPD matrix. However, these methods struggle to handle large-scale data, as they need to access and process the whole data. In this paper, inspired by the denoising diffusion probabilistic model (DDPM), we propose a novel generative model, termed SPD-DDPM, by introducing Gaussian distribution in the SPD space to estimate $E(X|y)$. Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly without giving $y$. On the one hand, the model conditionally learns $p(X|y)$ and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the other hand, the model unconditionally learns the probability distribution of the data $p(X)$ and generates samples that conform to this distribution. Furthermore, we propose a new SPD net which is much deeper than the previous networks and allows for the inclusion of conditional factors. Experiment results on toy data and real taxi data demonstrate that our models effectively fit the data distribution both conditionally and unconditionally and provide accurate predictions.
Comment: AAAI2024
Published: 2023

50. Aligning and Prompting Everything All at Once for Universal Visual Perception

Author: Shen, Yunhang, Fu, Chaoyou, Chen, Peixian, Zhang, Mengdan, Li, Ke, Sun, Xing, Wu, Yunsheng, Lin, Shaohui, and Ji, Rongrong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff, and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods, we present APE, a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks, i.e., detection, segmentation, and grounding, as an instance-level sentence-object matching paradigm. Specifically, APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection, which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks, APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that, with only one suite of weights, APE outperforms (or is on par with) the state-of-the-art models, proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE.
Published: 2023
