Author: "Hu, Di" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Hu, Di"' showing total 2,192 results

1. Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection

Author: Pang, Xincheng, Xia, Wenke, Wang, Zhigang, Zhao, Bin, Hu, Di, Wang, Dong, and Li, Xuelong
Subjects: Computer Science - Robotics
Abstract: 3D perception ability is crucial for generalizable robotic manipulation. While recent foundation models have made significant strides in perception and decision-making with RGB-based input, their lack of 3D perception limits their effectiveness in fine-grained robotic manipulation tasks. To address these limitations, we propose a Depth Information Injection ($\mathbf{DI}^{\mathbf{2}}$) framework that leverages the RGB-Depth modality for policy fine-tuning, while relying solely on RGB images for robust and efficient deployment. Concretely, we introduce the Depth Completion Module (DCM) to extract the spatial prior knowledge related to depth information and generate virtual depth information from RGB inputs to aid policy deployment. Further, we propose the Depth-Aware Codebook (DAC) to eliminate noise and reduce the cumulative error from the depth prediction. In the inference phase, this framework employs RGB inputs and accurately predicted depth data to generate the manipulation action. We conduct experiments on simulated LIBERO environments and real-world scenarios, and the experiment results prove that our method could effectively enhance the pre-trained RGB-based policy with 3D perception ability for robotic manipulation. The website is released at https://gewu-lab.github.io/DepthHelps-IROS2024. Comment: Accepted by IROS 2024
Published: 2024

2. KOI: Accelerating Online Imitation Learning via Hybrid Key-state Guidance

Author: Lu, Jingxian, Xia, Wenke, Wang, Dong, Wang, Zhigang, Zhao, Bin, Hu, Di, and Li, Xuelong
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Online Imitation Learning methods struggle with the gap between extensive online exploration space and limited expert trajectories, which hinders efficient exploration due to inaccurate task-aware reward estimation. Inspired by the findings from cognitive neuroscience that task decomposition could facilitate cognitive processing for efficient learning, we hypothesize that an agent could estimate precise task-aware imitation rewards for efficient online exploration by decomposing the target task into the objectives of "what to do" and the mechanisms of "how to do". In this work, we introduce the hybrid Key-state guided Online Imitation (KOI) learning approach, which leverages the integration of semantic and motion key states as guidance for task-aware reward estimation. Initially, we utilize visual-language models to segment the expert trajectory into semantic key states, indicating the objectives of "what to do". Within the intervals between semantic key states, optical flow is employed to capture motion key states to understand the process of "how to do". By integrating a thorough grasp of both semantic and motion key states, we refine the trajectory-matching reward computation, encouraging task-aware exploration for efficient online imitation learning. Our experiment results prove that our method is more sample efficient in the Meta-World and LIBERO environments. We also conduct real-world robotic manipulation experiments to validate the efficacy of our method, demonstrating the practical applicability of our KOI method.
Published: 2024

3. Play to the Score: Stage-Guided Dynamic Multi-Sensory Fusion for Robotic Manipulation

Author: Feng, Ruoxuan, Hu, Di, Ma, Wenke, and Li, Xuelong
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Humans possess a remarkable talent for flexibly alternating to different senses when interacting with the environment. Picture a chef skillfully gauging the timing of ingredient additions and controlling the heat according to the colors, sounds, and aromas, seamlessly navigating through every stage of the complex cooking process. This ability is founded upon a thorough comprehension of task stages, as achieving the sub-goal within each stage can necessitate the utilization of different senses. In order to endow robots with similar ability, we incorporate the task stages divided by sub-goals into the imitation learning process to accordingly guide dynamic multi-sensory fusion. We propose MS-Bot, a stage-guided dynamic multi-sensory fusion method with coarse-to-fine stage understanding, which dynamically adjusts the priority of modalities based on the fine-grained state within the predicted current stage. We train a robot system equipped with visual, auditory, and tactile sensors to accomplish challenging robotic manipulation tasks: pouring and peg insertion with keyway. Experimental results indicate that our approach enables more effective and explainable dynamic fusion, aligning more closely with the human fusion process than existing methods.
Published: 2024

4. Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

Author: Li, Guangyao, Du, Henghui, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM. Comment: Accepted by ACM MM 2024
Published: 2024

5. Towards Effective and Efficient Continual Pre-training of Large Language Models

Author: Chen, Jie, Chen, Zhipeng, Wang, Jiapeng, Zhou, Kun, Zhu, Yutao, Jiang, Jinhao, Min, Yingqian, Zhao, Wayne Xin, Dou, Zhicheng, Mao, Jiaxin, Lin, Yankai, Song, Ruihua, Xu, Jun, Chen, Xu, Yan, Rui, Wei, Zhewei, Hu, Di, Huang, Wenbing, and Wen, Ji-Rong
Subjects: Computer Science - Computation and Language, 68T50, I.2.7
Abstract: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining the original abilities, we design specific data mixture and curriculum strategies by utilizing existing datasets and synthesizing high-quality datasets. Specifically, we synthesize multidisciplinary scientific question and answer (QA) pairs based on related web pages, and subsequently incorporate these synthetic data to improve the scientific reasoning ability of Llama-3. We refer to the model after CPT as Llama-3-SynE (Synthetic data Enhanced Llama-3). We also present the tuning experiments with a relatively small model -- TinyLlama, and employ the derived findings to train the backbone model. Extensive experiments on a number of evaluation benchmarks show that our approach can largely improve the performance of the backbone models, including both the general abilities (+8.81 on C-Eval and +6.31 on CMMLU) and the scientific reasoning abilities (+12.00 on MATH and +4.13 on SciEval), without hurting the original capacities. Our model, data, and codes are available at https://github.com/RUC-GSAI/Llama-3-SynE. Comment: 16 pages, 10 figures, 16 tables
Published: 2024

6. Unveiling and Mitigating Bias in Audio Visual Segmentation

Author: Sun, Peiwen, Zhang, Honggang, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic. We attribute this to real-world inherent preferences and distributions as a simpler signal for learning than the complex audio-visual grounding, which leads to the disregard of important modality information. Generally, the anomalous phenomena are often complex and cannot be directly observed systematically. In this study, we made a pioneering effort with the proper synthetic data to categorize and analyze phenomena as two types "audio priming bias" and "visual prior" according to the source of anomalies. For audio priming bias, to enhance audio sensitivity to different intensities and semantics, a perception module specifically for audio perceives the latent semantic information and incorporates information into a limited set of queries, namely active queries. Moreover, the interaction mechanism related to such active queries in the transformer decoder is customized to adapt to the need for interaction regulating among audio semantics. For visual prior, multiple contrastive training strategies are explored to optimize the model by incorporating a biased branch, without even changing the structure of the model. During experiments, observation demonstrates the presence and the impact that has been produced by the biases of the existing model. Finally, through experimental evaluation of AVS benchmarks, we demonstrate the effectiveness of our methods in handling both types of biases, achieving competitive performance across all three subsets. Comment: Accepted by ACM MM 24 (ORAL)
Published: 2024

7. Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation

Author: Ma, Juncheng, Sun, Peiwen, Wang, Yaoting, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires the establishment of audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimization. Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks from localization to semantic understanding, which are fully optimized in each stage to achieve step-by-step global optimization. This training strategy has also proved its generalization and effectiveness on existing methods. To further improve the performance of AVS tasks, we propose a novel framework Adaptive Audio Visual Segmentation, in which we incorporate an adaptive audio query generator and integrate masked attention into the transformer decoder, facilitating the adaptive fusion of visual and audio features. Extensive experiments demonstrate that our methods achieve state-of-the-art results on all three AVS benchmarks. The project homepage can be accessed at https://gewu-lab.github.io/stepping_stones/. Comment: ECCV2024 accepted. Project url: https://gewu-lab.github.io/stepping_stones
Published: 2024

8. Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Author: Wang, Yaoting, Sun, Peiwen, Zhou, Dongzhan, Li, Guangyao, Zhang, Honggang, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Traditional reference segmentation tasks have predominantly focused on silent visual scenes, neglecting the integral role of multimodal perception and interaction in human experiences. In this work, we introduce a novel task called Reference Audio-Visual Segmentation (Ref-AVS), which seeks to segment objects within the visual domain based on expressions containing multimodal cues. Such expressions are articulated in natural language forms but are enriched with multimodal cues, including audio and visual descriptions. To facilitate this research, we construct the first Ref-AVS benchmark, which provides pixel-level annotations for objects described in corresponding multimodal-cue expressions. To tackle the Ref-AVS task, we propose a new method that adequately utilizes multimodal cues to offer precise segmentation guidance. Finally, we conduct quantitative and qualitative experiments on three test subsets to compare our approach with existing methods from related tasks. The results demonstrate the effectiveness of our method, highlighting its capability to precisely segment objects using multimodal-cue expressions. Dataset is available at https://gewu-lab.github.io/Ref-AVS. Comment: Accepted by ECCV2024
Published: 2024

9. Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Author: Wang, Yaoting, Sun, Peiwen, Li, Yuanchao, Zhang, Honggang, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, in this work, it is recognized that previous AVS methods show a heavy reliance on detrimental segmentation preferences related to audible objects, rather than precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, resulting in weak audio guidance over the visual space. Motivated by the fact that text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues. Subsequently, we introduce a novel semantics-driven audio modeling module with a dynamic mask to integrate audio features with text cues, leading to representative sounding object features. These features not only encompass audio cues but also possess vivid semantics, providing clearer guidance in the visual space. Experimental results on AVS benchmarks validate that our method exhibits enhanced sensitivity to audio when aided by text cues, achieving highly competitive performance on all three subsets. Project page: https://github.com/GeWu-Lab/Sounding-Object-Segmentation-Preference. Comment: Accepted by ECCV2024
Published: 2024

10. Diagnosing and Re-learning for Balanced Multimodal Learning

Author: Wei, Yake, Li, Siwei, Feng, Ruoxuan, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: To overcome the imbalanced multimodal learning problem, where models prefer the training of specific modalities, existing methods propose to control the training of uni-modal encoders from different perspectives, taking the inter-modal performance discrepancy as the basis. However, the intrinsic limitation of modality capacity is ignored. The scarcely informative modalities can be recognized as "worse-learnt" ones, which could force the model to memorize more noise, counterproductively affecting the multimodal model ability. Moreover, the current modality modulation methods narrowly concentrate on selected worse-learnt modalities, even suppressing the training of others. Hence, it is essential to consider the intrinsic limitation of modality capacity and take all modalities into account during balancing. To this end, we propose the Diagnosing & Re-learning method. The learning state of each modality is firstly estimated based on the separability of its uni-modal representation space, and then used to softly re-initialize the corresponding uni-modal encoder. In this way, the over-emphasizing of scarcely informative modalities is avoided. In addition, encoders of worse-learnt modalities are enhanced, simultaneously avoiding the over-training of other modalities. Accordingly, multimodal learning is effectively balanced and enhanced. Experiments covering multiple types of modalities and multimodal frameworks demonstrate the superior performance of our simple-yet-effective method for balanced multimodal learning. The source code and dataset are available at https://github.com/GeWu-Lab/Diagnosing_Relearning_ECCV2024. Comment: Accepted by ECCV 2024
Published: 2024

11. YuLan: An Open-source Large Language Model

Author: Zhu, Yutao, Zhou, Kun, Mao, Kelong, Chen, Wentong, Sun, Yiding, Chen, Zhipeng, Cao, Qian, Wu, Yihan, Chen, Yushuo, Wang, Feng, Zhang, Lei, Li, Junyi, Wang, Xiaolei, Wang, Lei, Zhang, Beichen, Dong, Zican, Cheng, Xiaoxue, Chen, Yuhan, Tang, Xinyu, Hou, Yupeng, Ren, Qiangqiang, Pang, Xincheng, Xie, Shufang, Zhao, Wayne Xin, Dou, Zhicheng, Mao, Jiaxin, Lin, Yankai, Song, Ruihua, Xu, Jun, Chen, Xu, Yan, Rui, Wei, Zhewei, Hu, Di, Huang, Wenbing, Gao, Ze-Feng, Chen, Yueguo, Lu, Weizheng, and Wen, Ji-Rong
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with 12 billion parameters. The base model of YuLan is pre-trained on approximately 1.7T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training was finished in January 2024, and the model has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at https://github.com/RUC-GSAI/YuLan-Chat.
Published: 2024

12. Stability impacts from the current and pressure profile modifications within finite sized island

Author: Sun, Yuxiang and Hu, Di
Subjects: Physics - Plasma Physics
Abstract: The stability (or instability) of finite-sized magnetic islands could play a significant role in disruption avoidance or disruption mitigation dynamics. In particular, various current and pressure profile modifications, such as the current drive and heating caused by electron cyclotron waves, or the radiative cooling and current expulsion caused by Shattered Pellet Injection, could be applied within the island to modify its stability and thus change the ensuing dynamics. In this study, we calculate the mode structure modification caused by such profile changes within the island using the perturbed equilibrium approach, thereby obtaining the change of the stability criterion $\Delta'$ and assessing the corresponding quasi-linear island stability. The positive helical current perturbation is found to always stabilize the island, while the negative one is found to do the opposite, in agreement with previous results. The pressure bump or hole within the island has a more complicated stability impact. In the small island regime, its contribution is monotonic, with a pressure bump tending to stabilize the island while a pressure hole destabilizes it. This effect is relatively weak, though, due to the cancellation of the pressure term's odd parity contribution in the second derivatives of the mode structure. In the large island regime, such cancellation is broken due to the island asymmetry, and the pressure contribution to stability is manifested, which is non-monotonic. The stability analysis in this paper helps to more accurately clarify the expected island response in the presence of profile modifications caused by disruption avoidance or mitigation systems.
Published: 2024

13. Learning Manipulation by Predicting Interaction

Author: Zeng, Jia, Bu, Qingwen, Wang, Bangjun, Xia, Wenke, Chen, Li, Dong, Hao, Song, Haoming, Wang, Dong, Hu, Di, Luo, Ping, Cui, Heming, Zhao, Bin, Li, Xuelong, Qiao, Yu, and Li, Hongyang
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: Representation learning approaches for robotic manipulation have boomed in recent years. Due to the scarcity of in-domain robot data, prevailing methodologies tend to leverage large-scale human video datasets to extract generalizable features for visuomotor policy learning. Despite the progress achieved, prior endeavors disregard the interactive dynamics that capture behavior patterns and physical interaction during the manipulation process, resulting in an inadequate understanding of the relationship between objects and the environment. To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively. These two learning objectives achieve superior comprehension towards "how-to-interact" and "where-to-interact". We conduct a comprehensive evaluation of several challenging robotic tasks. The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms as well as simulation environments. Code and checkpoints are publicly shared at https://github.com/OpenDriveLab/MPI. Comment: Accepted to RSS 2024. Project page: https://github.com/OpenDriveLab/MPI
Published: 2024

14. MMPareto: Boosting Multimodal Learning with Innocent Unimodal Assistance

Author: Wei, Yake and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: Multimodal learning methods with targeted unimodal learning objectives have exhibited their superior efficacy in alleviating the imbalanced multimodal learning problem. However, in this paper, we identify the previously ignored gradient conflict between multimodal and unimodal learning objectives, potentially misleading the unimodal encoder optimization. To diminish these conflicts, we observe the discrepancy between multimodal loss and unimodal loss, where both gradient magnitude and covariance of the easier-to-learn multimodal loss are smaller than the unimodal one. With this property, we analyze Pareto integration under our multimodal scenario and propose the MMPareto algorithm, which could ensure a final gradient with direction that is common to all learning objectives and enhanced magnitude to improve generalization, providing innocent unimodal assistance. Finally, experiments across multiple types of modalities and frameworks with dense cross-modal interaction indicate our superior and extendable method performance. Our method is also expected to facilitate multi-task cases with a clear discrepancy in task difficulty, demonstrating its ideal scalability. The source code and dataset are available at https://github.com/GeWu-Lab/MMPareto_ICML2024. Comment: Accepted by ICML2024
Published: 2024

15. Multimodal Fusion on Low-quality Data: A Comprehensive Survey

Author: Zhang, Qingyang, Wei, Yake, Han, Zongbo, Fu, Huazhu, Peng, Xi, Deng, Cheng, Hu, Qinghua, Xu, Cai, Wen, Jie, Hu, Di, and Zhang, Changqing
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Multimodal fusion focuses on integrating information from multiple modalities with the goal of more accurate prediction, which has achieved remarkable progress in a wide range of scenarios, including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion remains largely unexplored, especially under low-quality data settings. This paper surveys the common challenges and recent advances of multimodal fusion in the wild and presents them in a comprehensive taxonomy. From a data-centric view, we identify four main challenges that are faced by multimodal fusion on low-quality data, namely (1) noisy multimodal data that are contaminated with heterogeneous noises, (2) incomplete multimodal data in which some modalities are missing, (3) imbalanced multimodal data in which the qualities or properties of different modalities differ significantly, and (4) quality-varying multimodal data in which the quality of each modality changes dynamically across samples. This new taxonomy will enable researchers to understand the state of the field and identify several potential directions. We also provide discussion for the open problems in this field together with interesting future research directions. Comment: Feel free to comment on our manuscript: qingyangzhang@tju.edu.cn
Published: 2024

16. Highly dispersed Ru nanoparticles anchored on NiAl layered double oxides catalyst for selective hydrodeoxygenation of vanillin

Author: Zeng, Yongjian, Lin, Lu, Hu, Di, Jiang, Zhiwei, Saeed, Shaimaa, Guo, Ruichao, Ashour, Ibrahim, and Yan, Kai
Subjects: Physics - Chemical Physics
Abstract: The hydrodeoxygenation (HDO) of lignin-derived feedstocks into value-added chemicals with high efficiency and selectivity is desirable for the utilization of biomass resources. The complex oxygen-containing groups of lignin-derived substances make it challenging to achieve high selectivity toward the required product. In this work, a catalyst of highly dispersed Ru nanoparticles anchored on Ni3Al1 layered double oxides (LDOs), derived from NiAl layered double hydroxides (LDHs) with flower-shaped morphology, was constructed by a simple deposition-reduction method. The introduction of LDHs-derived support can significantly impact the catalytic activity for the HDO of lignin-derived vanillin (VL) into 2-methoxy-4-methylphenol (MMP). The Ru/Ni3Al1-400 catalyst achieved complete conversion of VL and 94.2% yield of MMP at 130 °C in methanol solvent, much better than the catalysts without LDHs-derived support. The methanol solvent is beneficial for the conversion of the reaction intermediate vanillin alcohol (VA). Detailed characterization reveals that the enhanced metal-support interaction over Ru/Ni3Al1-400 and the easily accessible acid sites facilitate the production of MMP.
Published: 2024

17. Facile synthesis of fine-grained CoFe$_2$O$_4$ anchored on porous carbon for simultaneous removal of tetracycline and arsenite

Author: Chen, Yuwen, Zhu, Ke, Huang, Yizhe, Li, Xin, Zheng, Zhikeng, Jiang, Zhiwei, Hu, Di, Fang, Ping, and Yan, Kai
Subjects: Physics - Applied Physics, Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: The coexistence of tetracycline (TC) and arsenite (As(III)) in livestock wastewater threatens public health, and the heterogeneous Fenton-like system is a practical approach for the simultaneous removal of TC and As(III). In this work, fine CoFe$_2$O$_4$ nanoparticles are facilely anchored on hierarchically porous carbon (CoFe$_2$O$_4$@PC) via a microwave-assisted calcination method and used for eliminating TC and As(III) via peroxymonosulfate (PMS) activation.
Published: 2024

18. SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

Author: Wu, Tao, Li, Xuewei, Qi, Zhongang, Hu, Di, Wang, Xintao, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation. In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of the planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, in virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these specific techniques, experiments on Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and relatively reduces around 35% FID on average. Comment: Accepted by AAAI2024
Published: 2024

19. Quantifying and Enhancing Multi-modal Robustness with Modality Preference

Author: Yang, Zequn, Wei, Yake, Liang, Ce, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Multi-modal models have shown a promising capability to effectively integrate information from various sources, yet meanwhile, they are found vulnerable to pervasive perturbations, such as uni-modal attacks and missing conditions. To counter these perturbations, robust multi-modal representations are highly expected, which are positioned well away from the discriminative multi-modal decision boundary. In this paper, different from conventional empirical studies, we focus on a commonly used joint multi-modal framework and theoretically discover that larger uni-modal representation margins and more reliable integration for modalities are essential components for achieving higher robustness. This discovery can further explain the limitation of multi-modal robustness and the phenomenon that multi-modal models are often vulnerable to attacks on the specific modality. Moreover, our analysis reveals how the widespread issue, that the model has different preferences for modalities, limits the multi-modal robustness by influencing the essential components and could make attacks on the specific modality highly effective. Inspired by our theoretical finding, we introduce a training procedure called Certifiable Robust Multi-modal Training (CRMT), which can alleviate this influence from modality preference and explicitly regulate essential components to significantly improve robustness in a certifiable manner. Our method demonstrates substantial improvements in performance and robustness compared with existing methods. Furthermore, our training procedure can be easily extended to enhance other robust training strategies, highlighting its credibility and flexibility. Comment: Accepted to ICLR 2024
Published: 2024

20. Two-dimensional 5d multiferroic W3Cl8: breathing Kagome lattice and tunable magneto-optical Kerr effect

Author: Hu, Di, Ye, Haoshen, Ding, Ning, Xu, Kaidi, Wang, Shan-Shan, Dong, Shuai, and Yao, Xiaoyan
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Materials Science
Abstract: Owing to the strong spin-orbit coupling and the related fascinating physical properties, heavy 5d transition-metals exhibit desirable application prospects. However, up to now, the 5d magnetic materials are still very limited, especially very rare for tungsten. In this work, we theoretically predict a two-dimensional multiferroic W3Cl8 monolayer. Intrinsic 5d magnetism of tungsten is activated by the W ions' fractional valence in a breathing Kagome lattice of reduced effective dimension. A coplanar Y-type antiferromagnetism composed by ferromagnetic W3 trimers is confirmed as the magnetic ground state. The spontaneous ferroelectric polarization mainly originates from the ion displacement induced by the breathing distortion of Kagome lattice. An intrinsic magneto-optical Kerr effect with sizable Kerr angle can be observed to detect this trimeric Y-type antiferromagnetism, and it depends strongly on the detailed magnetic order. Thereby, we propose a general scheme for realizing more 5d magnetism in two-dimensional multiferroic systems.
Published: 2024
Full Text: View/download PDF

21. Modulation of Nicotine-Associated Behaviour in Rats By μ-Opioid Signals from the Medial Prefrontal Cortex to the Nucleus Accumbens Shell

Author: Zhu, Feng, Kanda, Hirosato, Neyama, Hiroyuki, Wu, Yuping, Kato, Shigeki, Hu, Di, Duan, Shaoqi, Noguchi, Koichi, Watanabe, Yasuyoshi, Kobayashi, Kazuto, Dai, Yi, and Cui, Yilong
Published: 2024
Full Text: View/download PDF

22. Genetically Proxied Therapeutic Effect of Metformin Use, Blood Pressure, and Hypertension’s Risk: a Drug Target-Based Mendelian Randomization Study

Author: Jiang, Junhong, Hu, Di, Zhang, Qi, and Lin, Zenan
Published: 2024
Full Text: View/download PDF

23. Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Author: Xia, Wenke, Wang, Dong, Pang, Xincheng, Wang, Zhigang, Zhao, Bin, Hu, Di, and Li, Xuelong
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: Generalizable articulated object manipulation is essential for home-assistant robots. Recent efforts focus on imitation learning from demonstrations or reinforcement learning in simulation, however, due to the prohibitive costs of real-world data collection and precise object simulation, it still remains challenging for these works to achieve broad adaptability across diverse articulated objects. Recently, many works have tried to utilize the strong in-context learning ability of Large Language Models (LLMs) to achieve generalizable robotic manipulation, but most of these researches focus on high-level task planning, sidelining low-level robotic control. In this work, building on the idea that the kinematic structure of the object determines how we can manipulate it, we propose a kinematic-aware prompting framework that prompts LLMs with kinematic knowledge of objects to generate low-level motion trajectory waypoints, supporting various object manipulation. To effectively prompt LLMs with the kinematic structure of different objects, we design a unified kinematic knowledge parser, which represents various articulated objects as a unified textual description containing kinematic joints and contact location. Building upon this unified description, a kinematic-aware planner model is proposed to generate precise 3D manipulation waypoints via a designed kinematic-aware chain-of-thoughts prompting method. Our evaluation spanned 48 instances across 16 distinct categories, revealing that our framework not only outperforms traditional methods on 8 seen categories but also shows a powerful zero-shot capability for 8 unseen articulated object categories. Moreover, the real-world experiments on 7 different object categories prove our framework's adaptability in practical scenarios. Code is released at https://github.com/GeWu-Lab/LLM_articulated_object_manipulation/tree/main., Comment: Accepted by ICRA 2024
Published: 2023

24. Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer

Author: Wang, Yaoting, Liu, Weisong, Li, Guangyao, Ding, Jian, Hu, Di, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Never having seen an object and heard its sound simultaneously, can the model still accurately localize its visual position from the input audio? In this work, we concentrate on the Audio-Visual Localization and Segmentation tasks but under the demanding zero-shot and few-shot scenarios. To achieve this goal, different from existing approaches that mostly employ the encoder-fusion-decoder paradigm to decode localization information from the fused audio-visual feature, we introduce the encoder-prompt-decoder paradigm, aiming to better fit the data scarcity and varying data distribution dilemmas with the help of abundant knowledge from pre-trained models. Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects, meanwhile, the semantic gap between the visual and audio modalities is also encouraged to shrink. Then, we develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model. By equipping with these means, extensive experiments demonstrate that this new paradigm outperforms other fusion-based methods in both the unseen class and cross-dataset settings. We hope that our work can further promote the generalization study of Audio-Visual Localization and Segmentation in practical application scenarios., Comment: Accepted by AAAI 2024
Published: 2023

25. Enhancing multimodal cooperation via sample-level modality valuation

Author: Wei, Yake, Feng, Ruoxuan, Wang, Zihe, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities. However, most models often suffer from unsatisfactory multimodal cooperation, which cannot jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality, but they often struggle to provide a fine-grained observation of multimodal cooperation at sample-level with theoretical support. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end, we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation, we observe that modality discrepancy indeed could be different at sample-level, beyond the global contribution discrepancy at dataset-level. We further analyze this issue and improve cooperation between modalities at sample-level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our methods reasonably observe the fine-grained uni-modal contribution and achieve considerable improvement. The source code and dataset are available at https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation., Comment: Accepted by CVPR 2024
Published: 2023

26. Tide–Surge Interactions in Lingdingyang Bay, Pearl River Estuary, China: a Case Study from Typhoon Mangkhut, 2018

Author: Zhang, Zhuo, Song, Zhiyao, Zhang, Dong, Hu, Di, Yu, Zhaoyuan, and Yue, Songshan
Published: 2024
Full Text: View/download PDF

27. Two-orbital spin-fermion model study of ferromagnetism in honeycomb lattice

Author: Xu, Kaidi, Hu, Di, Chen, Jun, Ye, Haoshen, Han, Lin, Wang, Shan-Shan, and Dong, Shuai
Subjects: Condensed Matter - Strongly Correlated Electrons, Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: The spin-fermion model was previously successful to describe the complex phase diagrams of colossal magnetoresistive manganites and iron-based superconductors. In recent years, two-dimensional magnets have rapidly raised up as a new attractive branch of quantum materials, which are theoretically described based on classical spin models in most studies. Alternatively, here the two-orbital spin-fermion model is established as a uniform scenario to describe the ferromagnetism in a two-dimensional honeycomb lattice. This model connects the magnetic interactions with the electronic structures. Then the continuous tuning of magnetism in these honeycomb lattices can be predicted, based on a general phase diagram. The electron/hole doping, from the empty $e_{g}$ to half-filled $e_{g}$ limit, is studied as a benchmark. Our Monte Carlo result finds that the ferromagnetic $T_{C}$ reaches the maximum at the quarter-filled case. In other regions, the linear relationship between $T_{C}$ and doping concentration provides a theoretical guideline for the experimental modulations of two-dimensional ferromagnetism tuned by ionic liquid or electrical gating.
Published: 2023

28. Progressive Spatio-temporal Perception for Audio-Visual Question Answering

Author: Li, Guangyao, Hou, Wenxuan, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Audio-Visual Question Answering (AVQA) task aims to answer questions about different visual objects, sounds, and their associations in videos. Such naturally multi-modal videos are composed of rich and complex dynamic audio-visual components, most of which could be unrelated to the given questions, or even play as interference in answering the content of interest. Oppositely, only focusing on the question-aware audio-visual content could get rid of influence, meanwhile enabling the model to answer more efficiently. In this paper, we propose a Progressive Spatio-Temporal Perception Network (PSTP-Net), which contains three modules that progressively identify key spatio-temporal regions w.r.t. questions. Specifically, a temporal segment selection module is first introduced to select the most relevant audio-visual segments related to the given question. Then, a spatial region selection module is utilized to choose the most relevant regions associated with the question from the selected temporal segments. To further refine the selection of features, an audio-guided visual attention module is employed to perceive the association between audio and selected spatial regions. Finally, the spatio-temporal features from these modules are integrated for answering the question. Extensive experimental results on the public MUSIC-AVQA and AVQA datasets provide compelling evidence of the effectiveness and efficiency of PSTP-Net. Code is available at: https://github.com/GeWu-Lab/PSTP-Net, Comment: Accepted by ACM MM 2023
Published: 2023

29. Towards Long Form Audio-visual Video Understanding

Author: Hou, Wenxuan, Li, Guangyao, Tian, Yapeng, and Hu, Di
Subjects: Computer Science - Multimedia
Abstract: We live in a world filled with never-ending streams of multimodal information. As a more natural recording of the real scenario, long form audio-visual videos are expected as an important bridge for better exploring and understanding the world. In this paper, we propose the multisensory temporal event localization task in long form videos and strive to tackle the associated challenges. To facilitate this study, we first collect a large-scale Long Form Audio-visual Video (LFAV) dataset with 5,175 videos and an average video length of 210 seconds. Each of the collected videos is elaborately annotated with diversified modality-aware events, in a long-range temporal sequence. We then propose an event-centric framework for localizing multisensory events as well as understanding their relations in long form videos. It includes three phases in different levels: snippet prediction phase to learn snippet features, event extraction phase to extract event-level features, and event interaction phase to study event relations. Experiments demonstrate that the proposed method, utilizing the new LFAV dataset, exhibits considerable effectiveness in localizing multiple modality-aware events within long form videos. Project website: http://gewu-lab.github.io/LFAV/
Published: 2023

30. Supervised Knowledge May Hurt Novel Class Discovery Performance

Author: Li, Ziyun, Otholt, Jona, Dai, Ben, Hu, Di, Meinel, Christoph, and Yang, Haojin
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset by leveraging prior knowledge of a labeled set comprising disjoint but related classes. Given that most existing literature focuses primarily on utilizing supervised knowledge from a labeled set at the methodology level, this paper considers the question: Is supervised knowledge always helpful at different levels of semantic relevance? To proceed, we first establish a novel metric, so-called transfer flow, to measure the semantic similarity between labeled/unlabeled datasets. To show the validity of the proposed metric, we build up a large-scale benchmark with various degrees of semantic similarities between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. The results based on the proposed benchmark show that the proposed transfer flow is in line with the hierarchical class structure; and that NCD performance is consistent with the semantic similarities (measured by the proposed metric). Next, by using the proposed transfer flow, we conduct various empirical experiments with different levels of semantic similarity, yielding that supervised knowledge may hurt NCD performance. Specifically, using supervised information from a low-similarity labeled set may lead to a suboptimal result as compared to using pure self-supervised knowledge. These results reveal the inadequacy of the existing NCD literature which usually assumes that supervised knowledge is beneficial. Finally, we develop a pseudo-version of the transfer flow as a practical reference to decide if supervised knowledge should be used in NCD. Its effectiveness is supported by our empirical studies, which show that the pseudo transfer flow (with or without supervised knowledge) is consistent with the corresponding accuracy based on various datasets. Code is released at https://github.com/J-L-O/SK-Hurt-NCD, Comment: TMLR 2023 accepted paper. arXiv admin note: substantial text overlap with arXiv:2209.09120
Published: 2023

31. Multi-Scale Attention for Audio Question Answering

Author: Li, Guangyao, Xu, Yixin, and Hu, Di
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio question answering (AQA), acting as a widely used proxy task to explore scene understanding, has got more attention. The AQA is challenging for it requires comprehensive temporal reasoning from different scales' events of an audio scene. However, existing methods mostly extend the structures of visual question answering task to audio ones in a simple pattern but may not perform well when perceiving a fine-grained audio scene. To this end, we present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module. The former is designed to aggregate unimodal and cross-modal temporal contexts, while the latter captures sound events of varying lengths and their temporal dependencies for a more comprehensive understanding. Extensive experiments are conducted to demonstrate that the proposed MWAFM can effectively explore temporal information to facilitate AQA in the fine-grained scene. Code: https://github.com/GeWu-Lab/MWAFM, Comment: Accepted by InterSpeech 2023
Published: 2023

32. Robust Cross-Modal Knowledge Distillation for Unconstrained Videos

Author: Xia, Wenke, Li, Xingjian, Deng, Andong, Xiong, Haoyi, Dou, Dejing, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Cross-modal distillation has been widely used to transfer knowledge across different modalities, enriching the representation of the target unimodal one. Recent studies highly relate the temporal synchronization between vision and sound to the semantic consistency for cross-modal distillation. However, such semantic consistency from the synchronization is hard to guarantee in unconstrained videos, due to the irrelevant modality noise and differentiated semantic correlation. To this end, we first propose a Modality Noise Filter (MNF) module to erase the irrelevant noise in teacher modality with cross-modal context. After this purification, we then design a Contrastive Semantic Calibration (CSC) module to adaptively distill useful knowledge for target modality, by referring to the differentiated sample-wise semantic correlation in a contrastive fashion. Extensive experiments show that our method could bring a performance boost compared with other distillation methods in both visual action recognition and video retrieval task. We also extend to the audio tagging task to prove the generalization of our method. The source code is available at https://github.com/GeWu-Lab/cross-modal-distillation.
Published: 2023

33. Modulation of skyrmionic magnetic textures in two-dimensional vdW materials and their heterostructures

Author: Yao, Xiaoyan, Hu, Di, and Dong, Shuai
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics, Condensed Matter - Materials Science, Condensed Matter - Strongly Correlated Electrons
Abstract: The intrinsic magnetism observed in two-dimensional (2D) van der Waals (vdW) materials provides a unique opportunity for exploring the 2D topological magnetic textures, in particular skyrmionic magnetic textures (SMTs) including skyrmion and its topological equivalents. Since the experimental discovery of skyrmions in the 2D vdW materials and their heterostructures, a critical challenge lies in the control of these SMTs to translate their intriguing features into spintronic applications. Here, we review the recent experimental and theoretical progresses on the modulations of SMTs in 2D vdW monolayer materials and their heterostructures. Besides well-established basic modulation factors including temperature, magnetic field and sample thickness, we present the experimental realization of mobility and transition driven by electric current, and the theoretical prediction of diverse magnetoelectric modulations by electric field. Considering the 2D character of vdW layered materials, strain and stacking style are also efficient approaches to tune the magnetic textures., Comment: 19 pages, 6 figures; a review
Published: 2023
Full Text: View/download PDF

34. MMCosine: Multi-Modal Cosine Loss Towards Balanced Audio-Visual Fine-Grained Learning

Author: Xu, Ruize, Feng, Ruoxuan, Zhang, Shi-Xiong, and Hu, Di
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-visual learning helps to comprehensively understand the world by fusing practical information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a joint-learning model is a bottleneck to enhancing the model's performance. We further find that the up-to-date imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which have a higher demand for distinguishable feature distribution. Fueled by the success of cosine loss that builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes Multi-Modal Cosine loss, MMCosine. It performs a modality-wise $L_2$ normalization to features and weights towards balanced and better multi-modal fine-grained learning. We demonstrate that our method can alleviate the imbalanced optimization from the perspective of weight norm and fully exploit the discriminability of the cosine metric. Extensive experiments prove the effectiveness of our method and the versatility with advanced multi-modal fusion strategies and up-to-date imbalance-mitigating methods.
Published: 2023

35. Drift surface solver for runaway electron current dominant equilibria during the Current Quench

Author: Yuan, Lu and Hu, Di
Subjects: Physics - Plasma Physics
Abstract: Runaway electron current generated during the Current Quench phase of tokamak disruptions could result in severe damage to future high performance devices. To control and mitigate such runaway electron current, it is important to accurately describe the runaway electron current dominated equilibrium, based on which further stability analysis could be carried out. In this paper, we derive a Grad-Shafranov-like equation solving for the axisymmetric drift surfaces of the runaway electrons for the simple case that all runaway electrons share the same parallel momentum. This new equilibrium equation is then numerically solved with a simple rectangular wall with ITER-like and MAST-like geometry parameters. The deviation between the drift surfaces and the flux surfaces is readily obtained, and runaway electrons are found to be well confined even in regions with open field lines. The change of the runaway electron parallel momentum is found to result in a horizontal current center displacement without any changes in the total current or the external field. The runaway current density profile is found to affect the susceptibility of such displacement, with flatter profiles resulting in more displacement by the same momentum change. With up-down asymmetry in the external poloidal field, such displacement is accompanied by a vertical displacement of runaway electron current. It is found that this effect is more pronounced in smaller, compact devices and weaker poloidal field cases. The above results demonstrate the dynamics of current center displacement caused by the momentum space change in the runaway electrons, and pave the way for future, more sophisticated runaway current equilibrium theory with more realistic consideration on the runaway electron momentum distribution. This new equilibrium theory also provides a foundation for future stability analysis of the runaway electron current.
Published: 2023
Full Text: View/download PDF

36. Where to Turn: Road Fork Detection in Sparse 3D Point Cloud

Author: Hu, Di, Zhang, Kai, Zhong, Yipan, Xu, Jiachen, Yuan, Xia, Zhao, Chunxia, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Qu, Yi, editor, Gu, Mancang, editor, Niu, Yifeng, editor, and Fu, Wenxing, editor
Published: 2024
Full Text: View/download PDF

37. Performance Evaluation of Multi-type Energy Storage Power Station Based on AHP and FCE

Author: Sang, Bingyu, Hu, Di, Dong, Cun, Li, Peng, Li, Kecheng, Tao, Yibin, Wang, Shibo, Wang, Jiahao, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Yang, Qingxin, editor, Li, Zewen, editor, and Luo, An, editor
Published: 2024
Full Text: View/download PDF

38. Optimal Allocation Method of Hybrid Energy Storage Capacity to Stabilize Wind Power Fluctuation

Author: Sun, Wu, Li, Peng, Yang, Bo, Tao, Yibin, Li, Kecheng, Bai, Zhenmin, Zhong, Hanming, Hu, Di, Jiang, Lei, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Yang, Qingxin, editor, Li, Zewen, editor, and Luo, An, editor
Published: 2024
Full Text: View/download PDF

39. Outlook for Global Oil Demand in the Post-COVID-19 Era

Author: Hu, Di, Zhao, Ying, China International United Petroleum & Chemicals Co., Ltd., editor, Chinese Academy of Social Sciences, editor, Peking University, editor, and Luo, Jing, Translated by
Published: 2024
Full Text: View/download PDF

40. The Applications of CircRNA in the Diagnosis and Treatment of Alzheimer’s Disease

Author: Wen, Xueyi, Huang, Cheng, Xie, Hesong, Hu, Di, Luo, Juyu, and Li, Keshen
Published: 2024
Full Text: View/download PDF

41. Towards accurate knowledge transfer via target-awareness representation disentanglement

Author: Li, Xingjian, Hu, Di, Li, Xuhong, Xiong, Haoyi, Xu, Chengzhong, and Dou, Dejing
Published: 2024
Full Text: View/download PDF

42. Investigation of Potential Crucial Genes and Key Pathways in Keratoconus: An Analysis of Gene Expression Omnibus Data

Author: Hu, Di, Lin, Zenan, Li, Pan, Zhang, Zhehuan, Jiang, Junhong, and Yang, Chenhao
Published: 2023
Full Text: View/download PDF

43. Balanced Audiovisual Dataset for Imbalance Analysis

Author: Xia, Wenke, Zhao, Xu, Pang, Xincheng, Zhang, Changqing, and Hu, Di
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The imbalance problem is widespread in the field of machine learning, which also exists in multimodal learning areas caused by the intrinsic discrepancy between modalities of samples. Recent works have attempted to solve the modality imbalance problem from algorithm perspective, however, they do not fully analyze the influence of modality bias in datasets. Concretely, existing multimodal datasets are usually collected under specific tasks, where one modality tends to perform better than other ones in most conditions. In this work, to comprehensively explore the influence of modality bias, we first split existing datasets into different subsets by estimating sample-wise modality discrepancy. We surprisingly find that: the multimodal models with existing imbalance algorithms consistently perform worse than the unimodal one on specific subsets, in accordance with the modality bias. To further explore the influence of modality bias and analyze the effectiveness of existing imbalance algorithms, we build a balanced audiovisual dataset, with uniformly distributed modality discrepancy over the whole dataset. We then conduct extensive experiments to re-evaluate existing imbalance algorithms and draw some interesting findings: existing algorithms only provide a compromise between modalities and suffer from the large modality discrepancy of samples. We hope that these findings could facilitate future research on the modality imbalance problem., Comment: website:https://gewu-lab.github.io/Balanced-Audiovisual-Dataset/
Published: 2023

44. Revisiting Pre-training in Audio-Visual Learning

Author: Feng, Ruoxuan, Xia, Wenke, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Pre-training technique has gained tremendous success in enhancing model performance on various tasks, but has been found to perform worse than training from scratch in some uni-modal situations. This inspires us to think: are the pre-trained models always effective in the more complex multi-modal scenario, especially for the heterogeneous modalities such as audio and visual ones? We find that the answer is No. Specifically, we explore the effects of pre-trained models on two audio-visual learning scenarios: cross-modal initialization and multi-modal joint learning. When cross-modal initialization is applied, the phenomenon of "dead channel" caused by abnormal Batchnorm parameters hinders the utilization of model capacity. Thus, we propose Adaptive Batchnorm Re-initialization (ABRi) to better exploit the capacity of pre-trained models for target tasks. In multi-modal joint learning, we find a strong pre-trained uni-modal encoder would bring negative effects on the encoder of another modality. To alleviate this problem, we introduce a two-stage Fusion Tuning strategy, taking better advantage of the pre-trained knowledge while making the uni-modal encoders cooperate with an adaptive masking method. The experiment results show that our methods could further exploit pre-trained models' potential and boost performance in audio-visual learning.
Published: 2023

45. TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

Author: Lin, Hongpeng, Ruan, Ludan, Xia, Wenke, Liu, Peiyu, Wen, Jingyuan, Xu, Yixin, Hu, Di, Song, Ruihua, Zhao, Wayne Xin, Jin, Qin, and Lu, Zhiwu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: To facilitate the research on intelligent and human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty in capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLM) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall. Furthermore, no existing model can solve all the above challenges well. There is still a large room for future improvements, even for LLM with visual extensions. Our dataset is available at https://ruc-aimind.github.io/projects/TikTalk/., Comment: Accepted to ACM Multimedia 2023
Published: 2023

46. A Closer Look at Novel Class Discovery from the Labeled Set

Author: Li, Ziyun, Otholt, Jona, Dai, Ben, Hu, Di, Meinel, Christoph, and Yang, Haojin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Novel class discovery (NCD) aims to infer novel categories in an unlabeled dataset leveraging prior knowledge of a labeled set comprising disjoint but related classes. Existing research focuses primarily on utilizing the labeled set at the methodological level, with less emphasis on the analysis of the labeled set itself. Thus, in this paper, we rethink novel class discovery from the labeled set and focus on two core questions: (i) Given a specific unlabeled set, what kind of labeled set can best support novel class discovery? (ii) A fundamental premise of NCD is that the labeled set must be related to the unlabeled set, but how can we measure this relation? For (i), we propose and substantiate the hypothesis that NCD could benefit more from a labeled set with a large degree of semantic similarity to the unlabeled set. Specifically, we establish an extensive and large-scale benchmark with varying degrees of semantic similarity between labeled/unlabeled datasets on ImageNet by leveraging its hierarchical class structure. As a sharp contrast, the existing NCD benchmarks are developed based on labeled sets with different number of categories and images, and completely ignore the semantic relation. For (ii), we introduce a mathematical definition for quantifying the semantic similarity between labeled and unlabeled sets. In addition, we use this metric to confirm the validity of our proposed benchmark and demonstrate that it highly correlates with NCD performance. Furthermore, without quantitative analysis, previous works commonly believe that label information is always beneficial. However, counterintuitively, our experimental results show that using labels may lead to sub-optimal outcomes in low-similarity settings., Comment: Workshop on Distribution Shifts, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Published: 2022

47. Learning in Audio-visual Context: A Review, Analysis, and New Perspective

Author: Wei, Yake, Hu, Di, Tian, Yapeng, and Li, Xuelong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, I.2.10, I.4.8, I.5
Abstract: Sight and hearing are two senses that play a vital role in human communication and scene understanding. To mimic human perception ability, audio-visual learning, aimed at developing computational approaches to learn from both audio and visual modalities, has been a flourishing field in recent years. A comprehensive survey that can systematically organize and analyze studies of the audio-visual field is expected. Starting from the analysis of audio-visual cognition foundations, we introduce several key findings that have inspired our computational studies. Then, we systematically review the recent audio-visual learning studies and divide them into three categories: audio-visual boosting, cross-modal perception and audio-visual collaboration. Through our analysis, we discover that, the consistency of audio-visual data across semantic, spatial and temporal support the above studies. To revisit the current development of the audio-visual learning field from a more macro view, we further propose a new perspective on audio-visual scene understanding, then discuss and analyze the feasible future direction of the audio-visual learning area. Overall, this survey reviews and outlooks the current audio-visual learning field from different aspects. We hope it can provide researchers with a better understanding of this area. A website including constantly-updated survey is released: https://gewu-lab.github.io/audio-visual-learning/., Comment: https://gewu-lab.github.io/audio-visual-learning/
Published: 2022

48. Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

Author: Fan, Yingzi, Han, Longfei, Zhang, Yue, Cheng, Lechao, Xia, Chen, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Both visual and auditory information are valuable to determine the salient regions in videos. Deep convolution neural networks (CNN) showcase strong capacity in coping with the audio-visual saliency prediction task. Due to various factors such as shooting scenes and weather, there often exists moderate distribution discrepancy between source training data and target testing data. The domain discrepancy induces to performance degradation on target testing data for CNN models. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a specific domain discrimination branch is built up for aligning the auditory feature distributions. Then, those auditory features are fused into the visual features through a cross-modal self-attention module. The other domain discrimination branch is devised to reduce the domain discrepancy of visual features and audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method can relieve the performance degradation caused by domain discrepancy., Comment: Accepted by ACM MM workshop 2022(HCMA2022)
Published: 2022

49. Hot-tail electrons' impact on assimilation and injection penetration of D2 Shattered Pellet Injections

Author: Hu, Di and Liu, Chang
Subjects: Physics - Plasma Physics
Abstract: The fragment ablation rate plays significant roles in the mitigation efficiency of Shattered Pellet Injection (SPI) as a Disruption Mitigation System (DMS). Current mainstream 3D MHD codes modelling SPIs mostly assume instantaneous thermalization between the previously hot ambient electrons and the newly released cold electrons, which results in underestimation of the ablation rate if the hot electron thermalization time is comparable or even longer than the fragment flying time. To resolve this doubt, we hereby investigate the thermalization dynamics and the overall hot-electron impact. The finite-time collisional thermalization of hot-tail electrons in a rapidly cooling plasma, as well as the so-called "self-limiting" effect are considered. The former effect tends to deplete the colder population within a hot-tail species, while the latter is found to preferentially deplete the higher energy population. The combined result is found to cause an almost self-similar decay of the hot electron distribution function, while its shape does not deviate much from that of Maxwellian distribution and the mean energy does not change much during the thermalization process. Based on this observation, axisymmetric JOREK D2 SPI simulations were carried out with additional hot-tail contribution to evaluate their overall impact onto the injection assimilation and penetration. It is found that the hot-tail effect indeed causes enhanced assimilation and shallower penetration, although the overall effect depends on the exact injection configuration, with the slow injection showing negligible hot-tail effect while the fast single non-shattered pellet case shows drastic hot-tail ablation enhancement. For ITER-like SPI parameters, there is no significant deviation in the total assimilation, but some deviation in the injection penetration is observed for the fast injection velocity cases., Comment: 29 pages, 19 figures
Published: 2022

50. Balanced Multimodal Learning via On-the-fly Gradient Modulation

Author: Peng, Xiaokang, Wei, Yake, Deng, Andong, Wang, Dong, and Hu, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Multimodal learning helps to comprehensively understand the world, by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we actually find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, in which uniform objective is designed for all modalities, could remain under-optimized uni-modal representations, caused by another dominated modality in some scenarios, e.g., sound in blowing wind event, vision in drawing picture event, etc. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation to adaptively control the optimization of each modality, via monitoring the discrepancy of their contribution towards the learning objective. Further, an extra Gaussian noise that changes dynamically is introduced to avoid possible generalization drop caused by gradient modulation. As a result, we achieve considerable improvement over common fusion methods on different multimodal tasks, and this simple strategy can also boost existing multimodal methods, which illustrates its efficacy and versatility. The source code is available at https://github.com/GeWu-Lab/OGM-GE_CVPR2022., Comment: Accepted by CVPR 2022 (ORAL)
Published: 2022
