Author: "Kong, A" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kong, A"' showing total 610,608 results

101. Search for Efficient Large Language Models

Author: Shen, Xuan, Zhao, Pu, Gong, Yifan, Kong, Zhenglun, Zhan, Zheng, Wu, Yushu, Lin, Ming, Wu, Chao, Lin, Xue, and Wang, Yanzhi
Subjects: Computer Science - Artificial Intelligence
Abstract: Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Numerous efficient techniques, including weight pruning, quantization, and distillation, have been embraced to compress LLMs, targeting memory reduction and inference acceleration, which underscore the redundancy in LLMs. However, most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures. Besides, traditional architecture search methods, limited by the elevated complexity with extensive parameters, struggle to demonstrate their effectiveness on LLMs. In this paper, we propose a training-free architecture search framework to identify optimal subnets that preserve the fundamental strengths of the original LLMs while achieving inference acceleration. Furthermore, after generating subnets that inherit specific weights from the original LLMs, we introduce a reformation algorithm that utilizes the omitted weights to rectify the inherited weights with a small amount of calibration data. Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration. Code: https://github.com/shawnricecake/search-llm, Comment: Accepted by NeurIPS 2024
Published: 2024

102. World Model-based Perception for Visual Legged Locomotion

Author: Lai, Hang, Cao, Jiahang, Xu, Jiafeng, Wu, Hongtao, Lin, Yunfeng, Kong, Tao, Yu, Yong, and Zhang, Weinan
Subjects: Computer Science - Robotics, Computer Science - Machine Learning
Abstract: Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: https://wmp-loco.github.io/., Comment: under review
Published: 2024

103. Modulating dislocation reactions through preferential hydrogen segregation in bcc metals

Author: Hou, Jie, Peng, Ducheng, Kong, Xiang-Shan, Deng, Huiqiu, Hu, Wangyu, Chen, Cheng, and Song, Jun
Subjects: Condensed Matter - Materials Science
Abstract: The interaction between dislocations is fundamental to plastic deformation, work hardening, and defect accumulation. While extensive research has focused on the impact of solutes on individual dislocations, how solutes affect dislocation-dislocation reactions remains largely unexplored. Here, using atomistic simulations of iron as a model bcc system, we demonstrate that hydrogen solutes enable two <111>/2 screw dislocations to react and form a <001> edge dislocation junction, a process that is otherwise unfavorable in hydrogen-free environments. This phenomenon arises from the preferential segregation of hydrogen around the <001> dislocation, which reduces the energy of the reaction product. The resulting <001> dislocation demonstrates remarkable stability and transforms into a <001> vacancy-type dislocation loop under strain. These vacancy-type dislocation loops can accumulate during continuous deformation and dislocation reactions, serving as precursors for the initiation of structural damage, such as cracking and blistering. Our findings highlight the pivotal role of hydrogen in dislocation reactions, uncover a novel defect accumulation mechanism crucial for interpreting recent experimental observations, and represent a significant advance in understanding hydrogen-induced damage in bcc metals.
Published: 2024

104. Generative Pre-trained Ranking Model with Over-parameterization at Web-Scale (Extended Abstract)

Author: Li, Yuchen, Xiong, Haoyi, Kong, Linghe, Bian, Jiang, Wang, Shuaiqiang, Chen, Guihai, and Yin, Dawei
Subjects: Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Learning to rank (LTR) is widely employed in web searches to prioritize pertinent webpages from retrieved content based on input queries. However, traditional LTR models encounter two principal obstacles that lead to suboptimal performance: (1) the lack of well-annotated query-webpage pairs with ranking scores covering a diverse range of search query popularities, which hampers their ability to address queries across the popularity spectrum, and (2) inadequately trained models that fail to induce generalized representations for LTR, resulting in overfitting. To address these challenges, we propose a Generative Semi-Supervised Pre-trained (GS2P) LTR model. We conduct extensive offline experiments on both a publicly available dataset and a real-world dataset collected from a large-scale search engine. Furthermore, we deploy GS2P in a large-scale web search engine with realistic traffic, where we observe significant improvements in the real-world application.
Published: 2024

105. Pre-trained Graphformer-based Ranking at Web-scale Search (Extended Abstract)

Author: Li, Yuchen, Xiong, Haoyi, Kong, Linghe, Sun, Zeyi, Chen, Hongyang, Wang, Shuaiqiang, and Yin, Dawei
Subjects: Computer Science - Machine Learning, Computer Science - Information Retrieval
Abstract: Both Transformer and Graph Neural Networks (GNNs) have been employed in the domain of learning to rank (LTR). However, these approaches adhere to two distinct yet complementary problem formulations: ranking score regression based on query-webpage pairs, and link prediction within query-webpage bipartite graphs, respectively. While it is possible to pre-train GNNs or Transformers on source datasets and subsequently fine-tune them on sparsely annotated LTR datasets, the distributional shifts between the pair-based and bipartite graph domains present significant challenges in integrating these heterogeneous models into a unified LTR framework at web scale. To address this, we introduce the novel MPGraf model, which leverages a modular and capsule-based pre-training strategy, aiming to cohesively integrate the regression capabilities of Transformers with the link prediction strengths of GNNs. We conduct extensive offline and online experiments to rigorously evaluate the performance of MPGraf.
Published: 2024

106. Advancing Video Quality Assessment for AIGC

Author: Yue, Xinli, Sun, Jianhui, Kong, Han, Yao, Liangchao, Wang, Tianyi, Li, Lei, Rao, Fengyun, Lv, Jing, Xia, Fan, Deng, Yuetang, Wang, Qian, and Zhao, Lingchen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, AI generative models have made remarkable progress across various domains, including text generation, image generation, and video generation. However, assessing the quality of text-to-video generation is still in its infancy, and existing evaluation frameworks fall short when compared to those for natural videos. Current video quality assessment (VQA) methods primarily focus on evaluating the overall quality of natural videos and fail to adequately account for the substantial quality discrepancies between frames in generated videos. To address this issue, we propose a novel loss function that combines mean absolute error with cross-entropy loss to mitigate inter-frame quality inconsistencies. Additionally, we introduce the innovative S2CNet technique to retain critical content, while leveraging adversarial training to enhance the model's generalization capabilities. Experimental results demonstrate that our method outperforms existing VQA techniques on the AIGC Video dataset, surpassing the previous state-of-the-art by 3.1% in terms of PLCC., Comment: 5 pages, 1 figure
Published: 2024

107. LlamaPartialSpoof: An LLM-Driven Fake Speech Dataset Simulating Disinformation Generation

Author: Luong, Hieu-Thi, Li, Haoyang, Zhang, Lin, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous fake speech datasets were constructed from a defender's perspective to develop countermeasure (CM) systems without considering diverse motivations of attackers. To better align with real-life scenarios, we created LlamaPartialSpoof, a 130-hour dataset containing both fully and partially fake speech, using a large language model (LLM) and voice cloning technologies to evaluate the robustness of CMs. By examining information valuable to both attackers and defenders, we identify several key vulnerabilities in current CM systems, which can be exploited to enhance attack success rates, including biases toward certain text-to-speech models or concatenation methods. Our experimental results indicate that current fake speech detection systems struggle to generalize to unseen scenarios, achieving a best performance of 24.44% equal error rate., Comment: 5 pages, submitted to ICASSP 2025
Published: 2024

108. Room Impulse Responses help attackers to evade Deep Fake Detection

Author: Luong, Hieu-Thi, Truong, Duc-Tuan, Lee, Kong Aik, and Chng, Eng Siong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: The ASVspoof 2021 benchmark, a widely used evaluation framework for anti-spoofing, consists of two subsets: Logical Access (LA) and Deepfake (DF), featuring samples with varied coding characteristics and compression artifacts. Notably, the current state-of-the-art (SOTA) system boasts impressive performance, achieving an Equal Error Rate (EER) of 0.87% on the LA subset and 2.58% on the DF. However, benchmark accuracy is no guarantee of robustness in real-world scenarios. This paper investigates the effectiveness of utilizing room impulse responses (RIRs) to enhance fake speech and increase their likelihood of evading fake speech detection systems. Our findings reveal that this simple approach significantly improves the evasion rate, doubling the SOTA system's EER. To counter this type of attack, we augmented training data with a large-scale synthetic/simulated RIR dataset. The results demonstrate significant improvement on both reverberated fake speech and original samples, reducing DF task EER to 2.13%., Comment: 7 pages, to be presented at SLT 2024
Published: 2024
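
The record above does not say how the RIRs are applied to the fake speech. The usual way to simulate reverberation is to convolve the dry waveform with an impulse response, and the sketch below illustrates only that step. The function names `synthetic_rir` and `reverberate` are hypothetical, and the exponentially decaying noise is a toy stand-in for a measured or simulated RIR; none of this is taken from the paper.

```python
import numpy as np

def synthetic_rir(sample_rate=16000, rt60=0.4, seed=0):
    """Toy RIR (assumption): white noise under an exponential decay envelope."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    decay = 10.0 ** (-3.0 * t / rt60)  # amplitude drops by 60 dB after rt60 seconds
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))

def reverberate(speech, rir):
    """Convolve dry speech with an RIR, trim to the input length, peak-normalize."""
    wet = np.convolve(speech, rir)[: len(speech)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Usage: a 1-second 220 Hz tone stands in for a speech waveform.
sr = 16000
dry = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
wet = reverberate(dry, synthetic_rir(sample_rate=sr))
```

The same `reverberate` call could in principle be used for data augmentation during training, with a different RIR drawn per utterance.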

109. Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment

Author: Chen, Yuxiao, Li, Kai, Bao, Wentao, Patel, Deep, Kong, Yu, Min, Martin Renqiang, and Metaxas, Dimitris N.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Learning to localize temporal boundaries of procedure steps in instructional videos is challenging due to the limited availability of annotated large-scale training videos. Recent works focus on learning the cross-modal alignment between video segments and ASR-transcribed narration texts through contrastive learning. However, these methods fail to account for the alignment noise, i.e., irrelevant narrations to the instructional task in videos and unreliable timestamps in narrations. To address these challenges, this work proposes a novel training framework. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps (LLM-steps) from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy. The key idea is to measure alignment between LLM-steps and videos via multiple pathways, including: (1) step-narration-video alignment using narration timestamps, (2) direct step-to-video alignment based on their long-term semantic similarity, and (3) direct step-to-video alignment focusing on short-term fine-grained semantic similarity learned from general video domains. The results from different pathways are fused to generate reliable pseudo step-video matching. We conducted extensive experiments across various tasks and problem settings to evaluate our proposed method. Our approach surpasses state-of-the-art methods in three downstream tasks: procedure step grounding, step localization, and narration grounding by 5.9%, 3.1%, and 2.8%, respectively., Comment: Accepted to ECCV 2024
Published: 2024

110. Anisotropic Diffusion Probabilistic Model for Imbalanced Image Classification

Author: Kong, Jingyu, Guo, Yuan, Wang, Yu, and Duan, Yuping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-world data often has a long-tailed distribution, where the scarcity of tail samples significantly limits the model's generalization ability. Denoising Diffusion Probabilistic Models (DDPM) are generative models based on stochastic differential equation theory and have demonstrated impressive performance in image classification tasks. However, existing diffusion probabilistic models do not perform satisfactorily in classifying tail classes. In this work, we propose the Anisotropic Diffusion Probabilistic Model (ADPM) for imbalanced image classification problems. We utilize the data distribution to control the diffusion speed of different class samples during the forward process, effectively improving the classification accuracy of the denoiser in the reverse process. Specifically, we provide a theoretical strategy for selecting noise levels for different categories in the diffusion process based on error analysis theory to address the imbalanced classification problem. Furthermore, we integrate global and local image priors in the forward process to enhance the model's discriminative ability in the spatial dimension, while incorporating semantic-level contextual information in the reverse process to boost the model's discriminative power and robustness. Through comparisons with state-of-the-art methods on four medical benchmark datasets, we validate the effectiveness of the proposed method in handling long-tail data. Our results confirm that the anisotropic diffusion model significantly improves the classification accuracy of rare classes while maintaining the accuracy of head classes. On the skin lesion datasets, PAD-UFES and HAM10000, the F1-scores of our method improved by 4% and 3%, respectively, compared to the original diffusion probabilistic model.
Published: 2024

111. Lidar Panoptic Segmentation in an Open World

Author: Chakravarthy, Anirudh S, Ganesina, Meghana Reddy, Hu, Peiyun, Leal-Taixe, Laura, Kong, Shu, Ramanan, Deva, and Osep, Aljosa
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Addressing Lidar Panoptic Segmentation (LPS) is crucial for safe deployment of autonomous vehicles. LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes, including thing classes of countable objects (e.g., pedestrians and vehicles) and stuff classes of amorphous regions (e.g., vegetation and road). Importantly, LPS requires segmenting individual thing instances (e.g., every single vehicle). Current LPS methods make an unrealistic assumption that the semantic class vocabulary is fixed in the real open world, but in fact, class ontologies usually evolve over time as robots encounter instances of novel classes that are considered to be unknowns w.r.t. the pre-defined class vocabulary. To address this unrealistic assumption, we study LPS in the Open World (LiPSOW): we train models on a dataset with a pre-defined semantic class vocabulary and study their generalization to a larger dataset where novel instances of thing and stuff classes can appear. This experimental setting leads to interesting conclusions. While prior art trains class-specific instance segmentation methods and obtains state-of-the-art results on known classes, methods based on class-agnostic bottom-up grouping perform favorably on classes outside of the initial class vocabulary (i.e., unknown classes). Unfortunately, these methods do not perform on par with fully data-driven methods on known classes. Our work suggests a middle ground: we perform class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification, akin to a Region Proposal Network [1]. We obtain the final point cloud segmentation by computing a cut in the weighted hierarchical tree of point segments, independently of semantic classification. Remarkably, this unified approach leads to strong performance on both known and unknown classes., Comment: Pre-print. Accepted in the International Journal of Computer Vision, 19 Sept 2024. Code available at https://github.com/g-meghana-reddy/open-world-panoptic-segmentation
Published: 2024

112. Democratising Artificial Intelligence for Pandemic Preparedness and Global Governance in Latin American and Caribbean Countries

Author: de Carvalho, Andre, Bonidia, Robson, Kong, Jude Dzevela, Dauhajre, Mariana, Struchiner, Claudio, Goedert, Guilherme, Stadler, Peter F., Walter, Maria Emilia, Sanches, Danilo, Day, Troy, Castro, Marcia, Edmunds, John, Colome-Hidalgo, Manuel, Morban, Demian Arturo Herrera, Franco, Edian F., Ugarte-Gil, Cesar, Espinoza-Lopez, Patricia, Carrasco-Escobar, Gabriel, and Rocha, Ulisses
Subjects: Computer Science - Artificial Intelligence
Abstract: Infectious diseases, transmitted directly or indirectly, are among the leading causes of epidemics and pandemics. Consequently, several open challenges exist in predicting epidemic outbreaks, detecting variants, tracing contacts, discovering new drugs, and fighting misinformation. Artificial Intelligence (AI) can provide tools to deal with these scenarios, demonstrating promising results in the fight against the COVID-19 pandemic. AI is becoming increasingly integrated into various aspects of society. However, ensuring that AI benefits are distributed equitably and that they are used responsibly is crucial. Multiple countries are creating regulations to address these concerns, but the borderless nature of AI requires global cooperation to define regulatory and guideline consensus. Considering this, The Global South AI for Pandemic & Epidemic Preparedness & Response Network (AI4PEP) has developed an initiative comprising 16 projects across 16 countries in the Global South, seeking to strengthen equitable and responsive public health systems that leverage Southern-led responsible AI solutions to improve prevention, preparedness, and response to emerging and re-emerging infectious disease outbreaks. This opinion introduces our branches in Latin American and Caribbean (LAC) countries and discusses AI governance in LAC in the light of biotechnology. Our network in LAC has high potential to help fight infectious diseases, particularly in low- and middle-income countries, generating opportunities for the widespread use of AI techniques to improve the health and well-being of their communities.
Published: 2024

113. LiDAR-based Quadrotor for Slope Inspection in Dense Vegetation

Author: Liu, Wenyi, Ren, Yunfan, Guo, Rui, Kong, Vickie W. W., Hung, Anthony S. P., Zhu, Fangcheng, Cai, Yixi, Zou, Yuying, and Zhang, Fu
Subjects: Computer Science - Robotics
Abstract: This work presents a LiDAR-based quadrotor system for slope inspection in dense vegetation environments. Cities like Hong Kong are vulnerable to climate hazards, which often result in landslides. To mitigate the landslide risks, the Civil Engineering and Development Department (CEDD) has constructed steel flexible debris-resisting barriers on vulnerable natural catchments to protect residents. However, it is necessary to carry out regular inspections to identify any anomalies, which may affect the proper functioning of the barriers. Traditional manual inspection methods face challenges and high costs due to steep terrain and dense vegetation. Compared to manual inspection, unmanned aerial vehicles (UAVs) equipped with LiDAR sensors and cameras have advantages such as maneuverability in complex terrain, and access to narrow areas and high spots. However, conducting slope inspections using UAVs in dense vegetation poses significant challenges. First, in terms of hardware, the overall design of the UAV must carefully consider its maneuverability in narrow spaces, flight time, and the types of onboard sensors required for effective inspection. Second, regarding software, navigation algorithms need to be designed to enable obstacle avoidance flight in dense vegetation environments. To overcome these challenges, we develop a LiDAR-based quadrotor, accompanied by a comprehensive software system. The goal is to deploy our quadrotor in field environments to achieve efficient slope inspection. To assess the feasibility of our hardware and software system, we conduct functional tests in non-operational scenarios. Subsequently, invited by CEDD, we deploy our quadrotor in six field environments, including five flexible debris-resisting barriers located in dense vegetation and one slope that experienced a landslide. These experiments demonstrated the superiority of our quadrotor in slope inspection., Comment: 36 pages
Published: 2024

114. Preparation for CSST: Star-galaxy Classification using a Rotationally Invariant Supervised Machine Learning Method

Author: Zhang, Shiliang, Fang, Guanwen, Song, Jie, Li, Ran, Gu, Yizhou, Lin, Zesen, Zhou, Chichun, Dai, Yao, and Kong, Xu
Subjects: Astrophysics - Astrophysics of Galaxies
Abstract: Most existing star-galaxy classifiers depend on the reduced information from catalogs, necessitating careful data processing and feature extraction. In this study, we employ a supervised machine learning method (GoogLeNet) to automatically classify stars and galaxies in the COSMOS field. Unlike traditional machine learning methods, we introduce several preprocessing techniques, including noise reduction and the unwrapping of denoised images in polar coordinates, applied to our carefully selected samples of stars and galaxies. By dividing the selected samples into training and validation sets in an 8:2 ratio, we evaluate the performance of the GoogLeNet model in distinguishing between stars and galaxies. The results indicate that the GoogLeNet model is highly effective, achieving accuracies of 99.6% and 99.9% for stars and galaxies, respectively. Furthermore, by comparing the results with and without preprocessing, we find that preprocessing can significantly improve classification accuracy (by approximately 2.0% to 6.0%) when the images are rotated. In preparation for the future launch of the China Space Station Telescope (CSST), we also evaluate the performance of the GoogLeNet model on the CSST simulation data. These results demonstrate a high level of accuracy (approximately 99.8%), indicating that this model can be effectively utilized for future observations with the CSST., Comment: 11 pages, 9 figures, published in Research in Astronomy and Astrophysics, Volume 24, Number 9 (2024)
Published: 2024

115. Medium modifications of heavy-flavor jet angularities in high-energy nuclear collisions

Author: Li, Yao, Chen, Shi-Yong, Kong, Weixi, Wang, Sa, and Zhang, Ben-Wei
Subjects: High Energy Physics - Phenomenology, Nuclear Theory
Abstract: We present the first theoretical study of heavy-flavor jet angularities ($\lambda^{\kappa}_{\alpha}$) in Pb+Pb collisions at $\sqrt{s_{\rm NN}}=$ 5.02 TeV. The initial production of heavy-flavor jets is carried out using the POWHEG+PYTHIA8 prescription, while the jet evolution in the quark-gluon plasma (QGP) is described by the SHELL transport model. In p+p collisions, we observe narrower angularity distributions for the D$^0$-tagged jets compared to inclusive jets, consistent with the ALICE preliminary results. We then demonstrate that jet quenching in the QGP may slightly widen the angularity distributions of both inclusive and D$^0$-tagged jets in Pb+Pb collisions relative to p+p at $10 < p_{\rm T,jet} < 20$ GeV/c. Additionally, by comparing the averaged angularities $\langle \lambda^{\kappa}_{\alpha} \rangle$ of inclusive, D$^0$-tagged and B$^0$-tagged jets with varying $\alpha$ and $\kappa$, we show that the larger the quark mass is, the lower the jet's $\langle \lambda^{\kappa}_{\alpha} \rangle$ values are. As a result of the slenderer initial distribution, we predict that as compared to inclusive jets, the heavy-flavor jets, especially the B$^0$-tagged one, will suffer more distinct modifications of $\langle \lambda^{\kappa}_{\alpha} \rangle$ in Pb+Pb relative to p+p at $10 < p_{\rm T,jet} < 20$ GeV/c. For a larger jet radius, a more significant broadening of jet angularities could be obtained because of the enhanced contributions of the wide-angle particles. It is also noted that the angularity distributions of inclusive and D$^0$-tagged jets become narrower in Pb+Pb collisions relative to p+p at $p_{\rm T,jet} > 20$ GeV/c due to the strong influence of the selection bias., Comment: 8 pages, 6 figures
Published: 2024
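
For reference, the record above does not define the angularity observable. The form below is the convention commonly used for jet angularities, given here as an assumption rather than as a quotation from the paper: each jet constituent $i$ contributes its transverse-momentum fraction $z_i$ raised to the power $\kappa$, times its angular distance from the jet axis, scaled by the jet radius $R$, raised to the power $\alpha$.

```latex
\lambda^{\kappa}_{\alpha} = \sum_{i \in \mathrm{jet}} z_i^{\kappa}\, \theta_i^{\alpha},
\qquad z_i = \frac{p_{\mathrm{T},i}}{p_{\mathrm{T,jet}}},
\qquad \theta_i = \frac{\Delta R_i}{R},
\qquad \Delta R_i = \sqrt{(y_i - y_{\mathrm{jet}})^2 + (\varphi_i - \varphi_{\mathrm{jet}})^2}
```

Under this definition, a larger $\alpha$ gives more weight to wide-angle constituents, so a jet whose energy is spread away from its axis has a larger $\lambda^{\kappa}_{\alpha}$.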

116. Discrimination vs. Generation: The Machine Learning Dichotomy for Dopaminergic Hit Discovery

Author: Sobodu, Temitope, Yusuf, Adeshina, Kiel, Dan, and Kong, Dong
Subjects: Quantitative Biology - Biomolecules
Abstract: Virtual screening plays a pivotal role in early drug discovery, traditionally dominated by physics-based methods. While these approaches offer detailed insights, they are often hindered by high computational costs, limited sampling, and forcefield inaccuracies. Advances in Machine Learning (ML) and Deep Learning (DL) present resource-efficient alternatives, with approaches like predictive geometric ML (EQUIBIND) and generative geometric ML (DIFFDOCK) showing promise in enhancing both efficiency and predictive capability. Here, we compare these two strategies, retrospectively and prospectively, for identifying novel agonists targeting the dopamine D2 receptor. To complement DIFFDOCK's dual functionality in protein-ligand conformer generation and confidence estimation, we adopted a complementary atom-type-based confidence model for EQUIBIND. This pipeline, termed the discriminative model, integrates a featurization step and an XGBoost classifier to differentiate between active and inactive ligands. The top-ranked compounds from both models were evaluated using an ultrafast dopaminergic biosensor assay, dLight. Our results demonstrate that the generative model achieved a higher hit rate, notably leading to the discovery of Compound 1, a nanomolar dopamine D2 receptor agonist with a novel scaffold.
Published: 2024

117. Frequency-Guided Spatial Adaptation for Camouflaged Object Detection

Author: Zhang, Shizhou, Kong, Dexuan, Xing, Yinghui, Lu, Yue, Ran, Lingyan, Liang, Guoqiang, Wang, Hexu, and Zhang, Yanning
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns to the surrounding environment. Recent research has shown that enhancing the feature representation via frequency information can greatly alleviate the ambiguity between the foreground objects and the background. With the emergence of vision foundation models such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module is a novel and promising research direction. Existing adapter modules mainly address feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, the features that are conducive to distinguishing object and background are highlighted, indirectly implying the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets and the proposed method outperforms 26 state-of-the-art methods with large margins. Code will be released., Comment: The paper has been accepted for publication as a regular paper in the IEEE Transactions on Multimedia
Published: 2024

118. Towards Closing the Loop in Robotic Pollination for Indoor Farming via Autonomous Microscopic Inspection

Author: Kong, Chuizheng, Qiu, Alex, Wibowo, Idris, Ren, Marvin, Dhori, Aishik, Ling, Kai-Shu, Hu, Ai-Ping, and Kousik, Shreyas
Subjects: Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: Effective pollination is a key challenge for indoor farming, since bees struggle to navigate without the sun. While a variety of robotic system solutions have been proposed, it remains difficult to autonomously check that a flower has been sufficiently pollinated to produce high-quality fruit, which is especially critical for self-pollinating crops such as strawberries. To this end, this work proposes a novel robotic system for indoor farming. The proposed hardware combines a 7-degree-of-freedom (DOF) manipulator arm with a custom end-effector, comprised of an endoscope camera, a 2-DOF microscope subsystem, and a custom vibrating pollination tool; this is paired with algorithms to detect and estimate the pose of strawberry flowers, navigate to each flower, pollinate using the tool, and inspect with the microscope. The key novelty is vibrating the flower from below while simultaneously inspecting with a microscope from above. Each subsystem is validated via extensive experiments.
Published: 2024

119. M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper

Author: Zhou, Jiaming, Zhao, Shiwan, He, Jiabei, Wang, Hui, Zeng, Wenjia, Chen, Yong, Sun, Haoqin, Kong, Aobo, and Qin, Yong
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: State-of-the-art models like OpenAI's Whisper exhibit strong performance in multilingual automatic speech recognition (ASR), but they still face challenges in accurately recognizing diverse subdialects. In this paper, we propose M2R-Whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. Building on the principles of in-context learning (ICL) and retrieval-augmented techniques, our method employs sentence-level ICL in the pre-processing stage to harness contextual information, while integrating token-level k-Nearest Neighbors (kNN) retrieval as a post-processing step to further refine the final output distribution. By synergistically combining sentence-level and token-level retrieval strategies, M2R-Whisper effectively mitigates various types of recognition errors. Experiments conducted on Mandarin and subdialect datasets, including AISHELL-1 and KeSpeech, demonstrate substantial improvements in ASR accuracy, all achieved without any parameter updates.
Published: 2024

120. Enhancing Complex Formula Recognition with Hierarchical Detail-Focused Network

Author: Wang, Jiale, Yu, Junhui, Liu, Huanyong, and Kong, Chenanran
Subjects: Computer Science - Computation and Language
Abstract: Hierarchical and complex Mathematical Expression Recognition (MER) is challenging due to multiple possible interpretations of a formula, complicating both parsing and evaluation. In this paper, we introduce the Hierarchical Detail-Focused Recognition dataset (HDR), the first dataset specifically designed to address these issues. It consists of a large-scale training set, HDR-100M, offering an unprecedented scale and diversity with one hundred million training instances. The test set, HDR-Test, includes multiple interpretations of complex hierarchical formulas for comprehensive model performance evaluation. Additionally, the parsing of complex formulas often suffers from errors in fine-grained details. To address this, we propose the Hierarchical Detail-Focused Recognition Network (HDNet), an innovative framework that incorporates a hierarchical sub-formula module, focusing on the precise handling of formula details, thereby significantly enhancing MER performance. Experimental results demonstrate that HDNet outperforms existing MER models across various datasets., Comment: Submitted to the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025)
Published: 2024

121. CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

Author: Gao, Jiahui, Pi, Renjie, Han, Tianyang, Wu, Han, Hong, Lanqing, Kong, Lingpeng, Jiang, Xin, and Li, Zhenguo
Subjects: Computer Science - Computation and Language
Abstract: The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). Those MLLMs are typically built based on the LLMs, with an image encoder to process images into the token embedding space of the LLMs. However, the integration of visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do the MLLMs possess safety-awareness against malicious image inputs?" We find that after adding a principle that specifies the safety requirement into the input of the MLLM, the model's safety awareness becomes boosted. This phenomenon verifies the existence of MLLM's safety-awareness against image inputs; it is only weakened by the modality gap. We then introduce a simple yet effective technique termed CoCA, which amplifies the safety-awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks., Comment: 10 pages, COLM-2024
Published: 2024

122. Three Approaches to the Automation of Laser System Alignment and Their Resource Implications: A Case Study

Author: Robb, David A., Risbridger, Donald, Mills, Ben, Rakhmatulin, Ildar, Kong, Xianwen, Erden, Mustafa, Esser, M. J. Daniel, Carter, Richard M., and Chantler, Mike J.
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: The alignment of optical systems is a critical step in their manufacture. Alignment normally requires considerable knowledge and expertise of skilled operators. The automation of such processes has several potential advantages, but requires additional resource and upfront costs. Through a case study of a simple two mirror system we identify and examine three different automation approaches. They are: artificial neural networks; practice-led, which mimics manual alignment practices; and design-led, modelling from first principles. We find that these approaches make use of three different types of knowledge 1) basic system knowledge (of controls, measurements and goals); 2) behavioural skills and expertise, and 3) fundamental system design knowledge. We demonstrate that the different automation approaches vary significantly in human resources, and measurement sampling budgets. This will have implications for practitioners and management considering the automation of such tasks., Comment: Author Accepted Manuscript- 8 pages, The 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE 2024), Aug28-Sep1st 2024, Bari, Italy. Keywords: Automation, optimisation, regression, behaviour analysis, artificial neural networks, optical systems, mathematical model, human factors, sampling cost, cost benefit analysis
Published: 2024

123. Physics-Informed Tailored Finite Point Operator Network for Parametric Interface Problems

Author: Du, Ting, Xu, Xianliang, Kong, Wang, Li, Ye, and Huang, Zhongyi
Subjects: Mathematics - Numerical Analysis
Abstract: Learning operators for parametric partial differential equations (PDEs) using neural networks has gained significant attention in recent years. However, standard approaches like Deep Operator Networks (DeepONets) require extensive labeled data, and physics-informed DeepONets encounter training challenges. In this paper, we introduce a novel physics-informed tailored finite point operator network (PI-TFPONet) method to solve parametric interface problems without the need for labeled data. Our method fully leverages the prior physical information of the problem, eliminating the need to include the PDE residual in the loss function, thereby avoiding training challenges. The PI-TFPONet is specifically designed to address certain properties of the problem, allowing us to naturally obtain an approximate solution that closely matches the exact solution. Our method is theoretically proven to converge if the local mesh size is sufficiently small and the training loss is minimized. Notably, our approach is uniformly convergent for singularly perturbed interface problems. Extensive numerical studies show that our unsupervised PI-TFPONet is comparable to or outperforms existing state-of-the-art supervised deep operator networks in terms of accuracy and versatility.
Published: 2024

124. A cytokine-enhanced viral infection model with CTL immune response, distributed delay and saturation incidence

Author: Cao, Xiaodong, Hou, Songbo, and Kong, Xiaoqing
Subjects: Mathematics - Dynamical Systems, 60H10, 92D30
Abstract: In this paper, we propose a delayed cytokine-enhanced viral infection model incorporating saturation incidence and immune response. We compute the basic reproduction numbers and introduce a convex cone to discuss the impact of non-negative initial data on solutions. By defining appropriate Lyapunov functionals and employing LaSalle's invariance principle, we investigate the stability of three equilibria: the disease-free equilibrium, the immunity-inactivated equilibrium, and the immunity-activated equilibrium. We establish conditions under which these equilibria are globally asymptotically stable. Numerical analyses not only corroborate the theoretical results but also reveal that intervention in virus infection can be achieved by extending the delay period., Comment: 20 pages
Published: 2024

125. Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge

Author: Liu, Shuiyun, Kong, Yuxiang, Guo, Pengcheng, Zhuang, Weiji, Gao, Peng, Wang, Yujun, and Xie, Lei
Subjects: Computer Science - Sound, Computer Science - Human-Computer Interaction, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speech has emerged as a widely embraced user interface across diverse applications. However, for individuals with dysarthria, the inherent variability in their speech poses significant challenges. This paper presents an end-to-end Pretrain-based Dual-filter Dysarthria Wake-up word Spotting (PD-DWS) system for the SLT 2024 Low-Resource Dysarthria Wake-Up Word Spotting Challenge. Specifically, our system improves performance from two key perspectives: audio modeling and dual-filter strategy. For audio modeling, we propose an innovative 2branch-d2v2 model based on the pre-trained data2vec2 (d2v2), which can simultaneously model automatic speech recognition (ASR) and wake-up word spotting (WWS) tasks through a unified multi-task finetuning paradigm. Additionally, a dual-filter strategy is introduced to reduce the false accept rate (FAR) while maintaining the same false reject rate (FRR). Experimental results demonstrate that our PD-DWS system achieves an FAR of 0.00321 and an FRR of 0.005, with a total score of 0.00821 on the test-B eval set, securing first place in the challenge., Comment: 8 pages, Accepted to SLT 2024
Published: 2024

126. Generalized Matrix Factor Model

Author: Kong, Xinbing and Zhang, Tong
Subjects: Statistics - Methodology
Abstract: This article introduces a nonlinear generalized matrix factor model (GMFM) that allows for mixed-type variables, extending the scope of linear matrix factor models (LMFM) that are so far limited to handling continuous variables. We introduce a novel augmented Lagrange multiplier method, equivalent to the constraint maximum likelihood estimation, and carefully tailored to be locally concave around the true factor and loading parameters. This statistically guarantees the local convexity of the negative Hessian matrix around the true parameters of the factors and loadings, which is nontrivial in the matrix factor modeling and leads to feasible central limit theorems of the estimated factors and loadings. We also theoretically establish the convergence rates of the estimated factor and loading matrices for the GMFM under general conditions that allow for correlations across samples, rows, and columns. Moreover, we provide a model selection criterion to determine the numbers of row and column factors consistently. To numerically compute the constraint maximum likelihood estimator, we provide two algorithms: two-stage alternating maximization and minorization maximization. Extensive simulation studies demonstrate GMFM's superiority in handling discrete and mixed-type variables. An empirical data analysis of the company's operating performance shows that GMFM does clustering and reconstruction well in the presence of discontinuous entries in the data matrix.
Published: 2024

127. Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Author: Yang, Yudong, Liu, Zhan, Yu, Wenyi, Sun, Guangzhi, Kong, Qiuqiang, and Zhang, Chao
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound
Abstract: Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may not achieve the same level of fidelity as the discriminative models specifically trained to enhance particular acoustic conditions. In this paper, we propose Ex-Diff, a novel score-based diffusion model that integrates the latent representations produced by a discriminative model to improve speech and vocal enhancement, which combines the strengths of both generative and discriminative models. Experimental results on the widely used MUSDB dataset show relative improvements of 3.7% in SI-SDR and 10.0% in SI-SIR compared to the baseline diffusion model for speech and vocal enhancement tasks, respectively. Additionally, case studies are provided to further illustrate and analyze the complementary nature of generative and discriminative models in this context.
Published: 2024

128. Protecting Copyright of Medical Pre-trained Language Models: Training-Free Backdoor Watermarking

Author: Kong, Cong, Xu, Rui, Chen, Weixi, Chen, Jiawei, and Yin, Zhaoxia
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
Abstract: Pre-training language models followed by fine-tuning on specific tasks is standard in NLP, but traditional models often underperform when applied to the medical domain, leading to the development of specialized medical pre-trained language models (Med-PLMs). These models are valuable assets but are vulnerable to misuse and theft, requiring copyright protection. However, no existing watermarking methods are tailored for Med-PLMs, and adapting general PLMs watermarking techniques to the medical domain faces challenges such as task incompatibility, loss of fidelity, and inefficiency. To address these issues, we propose the first training-free backdoor watermarking method for Med-PLMs. Our method uses rare special symbols as trigger words, which do not impact downstream task performance, embedding watermarks by replacing their original embeddings with those of specific medical terms in the Med-PLMs' word embeddings layer. After fine-tuning the watermarked Med-PLMs on various medical downstream tasks, the final models (FMs) respond to the trigger words in the same way they would to the corresponding medical terms. This property can be utilized to extract the watermark. Experiments demonstrate that our method achieves high fidelity while effectively extracting watermarks across various medical downstream tasks. Additionally, our method demonstrates robustness against various attacks and significantly enhances the efficiency of watermark embedding, reducing the embedding time from 10 hours to 10 seconds., Comment: 9 pages
Published: 2024

129. On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

Author: Li, Junjie, Zhang, Ke, Wang, Shuai, Li, Haizhou, Mak, Man-Wai, and Lee, Kong Aik
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted technique. Unlike typical data augmentation applied to speech mixtures, this work thoroughly investigates the effectiveness of augmenting the enrollment speech space. We found that for both pretrained and jointly optimized speaker encoders, directly augmenting the enrollment speech leads to consistent performance improvement. In addition to conventional methods such as noise and reverberation addition, we propose a novel augmentation method called self-estimated speech augmentation (SSA). Experimental results on the Libri2Mix test set show that our proposed method can achieve an improvement of up to 2.5 dB., Comment: Accepted by SLT2024
Published: 2024

130. Language-Queried Target Sound Extraction Without Parallel Training Data

Author: Ma, Hao, Peng, Zhiyuan, Li, Xu, Li, Yukai, Shao, Mingjie, Kong, Qiuqiang, and Liu, Ju
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Language-queried target sound extraction (TSE) aims to extract specific sounds from mixtures based on language queries. Traditional fully-supervised training schemes require extensively annotated parallel audio-text data, which are labor-intensive. We introduce a language-free training scheme, requiring only unlabelled audio clips for TSE model training by utilizing the multi-modal representation alignment nature of the contrastive language-audio pre-trained model (CLAP). In a vanilla language-free training stage, target audio is encoded using the pre-trained CLAP audio encoder to form a condition embedding for the TSE model, while during inference, user language queries are encoded by CLAP text encoder. This straightforward approach faces challenges due to the modality gap between training and inference queries and information leakage from direct exposure to target audio during training. To address this, we propose a retrieval-augmented strategy. Specifically, we create an embedding cache using audio captions generated by a large language model (LLM). During training, target audio embeddings retrieve text embeddings from this cache to use as condition embeddings, ensuring consistent modalities between training and inference and eliminating information leakage. Extensive experiment results show that our retrieval-augmented approach achieves consistent and notable performance improvements over existing state-of-the-art with better generalizability., Comment: Submitted to ICASSP 2025
Published: 2024

131. Pathfinder for Low-altitude Aircraft with Binary Neural Network

Author: Yin, Kaijie, Gao, Tian, and Kong, Hui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A prior global topological map (e.g., the OpenStreetMap, OSM) can boost the performance of autonomous mapping by a ground mobile robot. However, the prior map is usually incomplete due to lacking labeling in partial paths. To solve this problem, this paper proposes an OSM maker using airborne sensors carried by low-altitude aircraft, where the core of the OSM maker is a novel efficient pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream road segmentation model. Specifically, a multi-scale feature extraction based on the UNet architecture is implemented for images and point clouds. To reduce the effect caused by the sparsity of point cloud, an attention-guided gated block is designed to integrate image and point-cloud features. For enhancing the efficiency of the model, we propose a binarization streamline to each model component, including a variant of vision transformer (ViT) architecture as the encoder of the image branch, and new focal and perception losses to optimize the model training. The experimental results on two datasets demonstrate that our pathfinder method achieves SOTA accuracy with high efficiency in finding paths from the low-level airborne sensors, and we can create complete OSM prior maps based on the segmented road skeletons. Code and data are available at: https://github.com/IMRL/Pathfinder.
Published: 2024

132. Effective Integration of KAN for Keyword Spotting

Author: Xu, Anfeng, Zhang, Biqiao, Kong, Shuyu, Huang, Yiteng, Yang, Zhaojun, Srivastava, Sangeeta, and Sun, Ming
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Keyword spotting (KWS) is an important speech processing component for smart devices with voice assistance capability. In this paper, we investigate if Kolmogorov-Arnold Networks (KAN) can be used to enhance the performance of KWS. We explore various approaches to integrate KAN for a model architecture based on 1D Convolutional Neural Networks (CNN). We find that KAN is effective at modeling high-level features in lower-dimensional spaces, resulting in improved KWS performance when integrated appropriately. The findings shed light on understanding KAN for speech processing tasks and on other modalities for future researchers., Comment: Under review
Published: 2024

133. Exploiting Supervised Poison Vulnerability to Strengthen Self-Supervised Defense

Author: Styborski, Jeremy, Lyu, Mingzhi, Huang, Yi, and Kong, Adams
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Availability poisons exploit supervised learning (SL) algorithms by introducing class-related shortcut features in images such that models trained on poisoned data are useless for real-world datasets. Self-supervised learning (SSL), which utilizes augmentations to learn instance discrimination, is regarded as a strong defense against poisoned data. However, by extending the study of SSL across multiple poisons on the CIFAR-10 and ImageNet-100 datasets, we demonstrate that it often performs poorly, far below that of training on clean data. Leveraging the vulnerability of SL to poison attacks, we introduce adversarial training (AT) on SL to obfuscate poison features and guide robust feature learning for SSL. Our proposed defense, designated VESPR (Vulnerability Exploitation of Supervised Poisoning for Robust SSL), surpasses the performance of six previous defenses across seven popular availability poisons. VESPR displays superior performance over all previous defenses, boosting the minimum and average ImageNet-100 test accuracies of poisoned models by 16% and 9%, respectively. Through analysis and ablation studies, we elucidate the mechanisms by which VESPR learns robust class features., Comment: 28 pages, 5 figures
Published: 2024

134. Customized Mid-Air Gestures for Accessibility: A $B Recognizer for Multi-Dimensional Biosignal Gestures

Author: Yamagami, Momona, Mitchell, Claire L., Portnova-Fahreeva, Alexandra A., Kong, Junhan, Mankoff, Jennifer, and Wobbrock, Jacob O.
Subjects: Computer Science - Human-Computer Interaction
Abstract: Biosignal interfaces, using sensors in, on, or around the body, promise to enhance wearables interaction and improve device accessibility for people with motor disabilities. However, biosignals are multi-modal, multi-dimensional, and noisy, requiring domain expertise to design input features for gesture classifiers. The $B-recognizer enables mid-air gesture recognition without needing expertise in biosignals or algorithms. $B resamples, normalizes, and performs dimensionality reduction to reduce noise and enhance signals relevant to the recognition. We tested $B on a dataset of 26 participants with and 8 participants without upper-body motor disabilities performing personalized ability-based gestures. For two conditions (user-dependent, gesture articulation variability), $B outperformed our comparison algorithms (traditional machine learning with expert features and deep learning), with > 95% recognition rate. For the user-independent condition, $B and deep learning performed comparably for participants with disabilities. Our biosignal dataset is publicly available online. $B highlights the potential and feasibility of accessible biosignal interfaces., Comment: 20 pages, 7 figures, 1 table
Published: 2024

135. Towards Quantifying and Reducing Language Mismatch Effects in Cross-Lingual Speech Anti-Spoofing

Author: Liu, Tianchi, Kukanov, Ivan, Pan, Zihan, Wang, Qiongqiong, Sailor, Hardik B., and Lee, Kong Aik
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Sound
Abstract: The effects of language mismatch impact speech anti-spoofing systems, while investigations and quantification of these effects remain limited. Existing anti-spoofing datasets are mainly in English, and the high cost of acquiring multilingual datasets hinders training language-independent models. We initiate this work by evaluating top-performing speech anti-spoofing systems that are trained on English data but tested on other languages, observing notable performance declines. We propose an innovative approach, Accent-based data expansion via TTS (ACCENT), which introduces diverse linguistic knowledge to monolingual-trained models, improving their cross-lingual capabilities. We conduct experiments on a large-scale dataset consisting of over 3 million samples, including 1.8 million training samples and nearly 1.2 million testing samples across 12 languages. The language mismatch effects are preliminarily quantified and reduced by over 15% by applying the proposed ACCENT. This easily implementable method shows promise for multilingual and low-resource language scenarios., Comment: Accepted to the IEEE Spoken Language Technology Workshop (SLT) 2024
Published: 2024

136. StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

Author: Zhao, Sijie, Hu, Wenbo, Cun, Xiaodong, Zhang, Yong, Li, Xiaoyu, Kong, Zhe, Gao, Xiangjun, Niu, Muyao, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, I.3.0, I.4.0
Abstract: This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experience. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and boosts the performance to ensure the high-fidelity generation required by the display devices. The proposed system consists of two main steps: depth-based video splatting for warping and extracting occlusion mask, and stereo video inpainting. We utilize pre-trained stable video diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input video with varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, a sophisticated data processing pipeline has been developed to reconstruct a large-scale and high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices like Apple Vision Pro and 3D displays. In summary, this work contributes to the field by presenting an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media., Comment: 11 pages, 10 figures
Published: 2024

137. ART: Artifact Removal Transformer for Reconstructing Noise-Free Multichannel Electroencephalographic Signals

Author: Chuang, Chun-Hsiang, Chang, Kong-Yi, Huang, Chih-Sheng, and Bessas, Anne-Mei
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning
Abstract: Artifact removal in electroencephalography (EEG) is a longstanding challenge that significantly impacts neuroscientific analysis and brain-computer interface (BCI) performance. Tackling this problem demands advanced algorithms, extensive noisy-clean training data, and thorough evaluation strategies. This study presents the Artifact Removal Transformer (ART), an innovative EEG denoising model employing transformer architecture to adeptly capture the transient millisecond-scale dynamics characteristic of EEG signals. Our approach offers a holistic, end-to-end denoising solution for diverse artifact types in multichannel EEG data. We enhanced the generation of noisy-clean EEG data pairs using an independent component analysis, thus fortifying the training scenarios critical for effective supervised learning. We performed comprehensive validations using a wide range of open datasets from various BCI applications, employing metrics like mean squared error and signal-to-noise ratio, as well as sophisticated techniques such as source localization and EEG component classification. Our evaluations confirm that ART surpasses other deep-learning-based artifact removal methods, setting a new benchmark in EEG signal processing. This advancement not only boosts the accuracy and reliability of artifact removal but also promises to catalyze further innovations in the field, facilitating the study of brain dynamics in naturalistic environments.
Published: 2024

138. What is the Right Notion of Distance between Predict-then-Optimize Tasks?

Author: Rodriguez-Diaz, Paula, Kong, Lingkai, Wang, Kai, Alvarez-Melis, David, and Tambe, Milind
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Comparing datasets is a fundamental task in machine learning, essential for various learning paradigms; from evaluating train and test datasets for model generalization to using dataset similarity for detecting data drift. While traditional notions of dataset distances offer principled measures of similarity, their utility has largely been assessed through prediction error minimization. However, in Predict-then-Optimize (PtO) frameworks, where predictions serve as inputs for downstream optimization tasks, model performance is measured through decision regret minimization rather than prediction error minimization. In this work, we (i) show that traditional dataset distances, which rely solely on feature and label dimensions, lack informativeness in the PtO context, and (ii) propose a new dataset distance that incorporates the impacts of downstream decisions. Our results show that this decision-aware dataset distance effectively captures adaptation success in PtO contexts, providing a PtO adaptation bound in terms of dataset distance. Empirically, we show that our proposed distance measure accurately predicts transferability across three different PtO tasks from the literature.
Published: 2024

139. Semi-Supervised 3D Object Detection with Channel Augmentation using Transformation Equivariance

Author: Kang, Minju, Kong, Taehun, and Kim, Tae-Kyun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate 3D object detection is crucial for autonomous vehicles and robots to navigate and interact with the environment safely and effectively. Meanwhile, the performance of a 3D detector relies on the amount of data and on annotation, which is expensive. Consequently, the demand for training with limited labeled data is growing. We explore a novel teacher-student framework employing channel augmentation for 3D semi-supervised object detection. Teacher-student SSL typically applies weak augmentation to the teacher and strong augmentation to the student. In this work, we apply multiple channel augmentations to both networks using the transformation equivariance detector (TED). The TED allows us to explore different combinations of augmentation on point clouds and efficiently aggregates multi-channel transformation equivariance features. In principle, by adopting fixed channel augmentations for the teacher network, the student can train stably on reliable pseudo-labels. Adopting strong channel augmentations can enrich the diversity of data, fostering robustness to transformations and enhancing generalization performance of the student network. We use SOTA hierarchical supervision as a baseline and adapt its dual-threshold to TED, which is called channel IoU consistency. We evaluate our method on the KITTI dataset and achieve a significant performance leap, surpassing SOTA 3D semi-supervised object detection models., Comment: Accepted to 2024 IEEE International Conference on Image Processing (ICIP)
Published: 2024

140. Hardware Acceleration of Kolmogorov-Arnold Network (KAN) for Lightweight Edge Inference

Author: Huang, Wei-Hsing, Jia, Jianwei, Kong, Yuyao, Waqar, Faaiq, Wen, Tai-Hao, Chang, Meng-Fan, and Yu, Shimeng
Subjects: Computer Science - Hardware Architecture
Abstract: Recently, a novel model named Kolmogorov-Arnold Networks (KAN) has been proposed with the potential to achieve the functionality of traditional deep neural networks (DNNs) using orders of magnitude fewer parameters by parameterized B-spline functions with trainable coefficients. However, the B-spline functions in KAN present new challenges for hardware acceleration. Evaluating the B-spline functions can be performed by using look-up tables (LUTs) to directly map the B-spline functions, thereby reducing computational resource requirements. However, this method still requires substantial circuit resources (LUTs, MUXs, decoders, etc.). For the first time, this paper employs an algorithm-hardware co-design methodology to accelerate KAN. The proposed algorithm-level techniques include Alignment-Symmetry and PowerGap KAN hardware aware quantization, KAN sparsity aware mapping strategy, and circuit-level techniques include N:1 Time Modulation Dynamic Voltage input generator with analog-CIM (ACIM) circuits. The impact of non-ideal effects, such as partial sum errors caused by the process variations, has been evaluated with the statistics measured from the TSMC 22nm RRAM-ACIM prototype chips. With the best searched hyperparameters of KAN and the optimized circuits implemented in 22 nm node, we can reduce hardware area by 41.78x, energy by 77.97x with 3.03% accuracy boost compared to the traditional DNN hardware., Comment: Accepted at ASP-DAC (Asia and South Pacific Design Automation Conference)
Published: 2024

141. Joint Model Assignment and Resource Allocation for Cost-Effective Mobile Generative Services

Author: Gao, Shuangwei, Yang, Peng, Kong, Yuxin, Lyu, Feng, and Zhang, Ning
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Artificial Intelligence Generated Content (AIGC) services can efficiently satisfy user-specified content creation demands, but the high computational requirements pose various challenges to supporting mobile users at scale. In this paper, we present our design of an edge-enabled AIGC service provisioning system to properly assign computing tasks of generative models to edge servers, thereby improving overall user experience and reducing content generation latency. Specifically, once the edge server receives user requested task prompts, it dynamically assigns appropriate models and allocates computing resources based on features of each category of prompts. The generated contents are then delivered to users. The key to this system is a proposed probabilistic model assignment approach, which estimates the quality score of generated contents for each prompt based on category labels. Next, we introduce a heuristic algorithm that enables adaptive configuration of both generation steps and resource allocation, according to the various task requests received by each generative model on the edge. Simulation results demonstrate that the designed system can effectively enhance the quality of generated content by up to 4.7% while reducing response delay by up to 39.1% compared to benchmarks.
Published: 2024

142. Nuclear transparencies with a two-step process of the $A(e,e'\pi^+)$ reaction

Author: Choi, Tae Keun, Kong, Kook-Jin, and Yu, Byung-Geel
Subjects: Nuclear Theory, High Energy Physics - Phenomenology
Abstract: Nuclear transparency in pion-induced nuclear reactions has been investigated based on Glauber multiple scattering theory considering a two-step process within the framework of vector meson dominance (VMD). In the present context, the application of the quantum diffusion model (QDM) to the Glauber theory plays a role in explaining the dependence of the transparency on the four-momentum transfer squared $Q^2$. The short-range correlation (SRC) considered further gives the contribution to the magnitude of the transparency by a constant amount independent of the $Q^2$ variation, and the results from the QDM and SRC overestimate the experimental data. The inclusion of the two-step process with the $\rho N$ scattering cross section, $\sigma_{\rho N}=3 $ mb has the effect of reducing the transparency and thus leads to a good agreement with the experimental data on the reaction $A(e,e'\pi^+)$ for $^{12}$C, $^{27}$Al, $^{63}$Cu and $^{197}$Au nuclei., Comment: 6 pages, 4 figures, 1 table
Published: 2024

143. A novel standard candle: collapsing axion stars

Author: Di, Haoran, Shao, Lijing, Yi, Zhu, and Kong, Shi-Bei
Subjects: High Energy Physics - Phenomenology, General Relativity and Quantum Cosmology
Abstract: The Hubble constant, $H_0$, is a crucial parameter in cosmology. However, various cosmic observations have produced differing posterior values for $H_0$, resulting in what is referred to as the $H_0$ tension. To resolve this discrepancy, utilizing other cosmological probes to constrain $H_0$ is advantageous. In the quest to identify dark matter candidates, the QCD axion and axionlike particles, collectively referred to as axions, have become leading contenders. These elusive particles can coalesce into dense structures known as axion stars via Bose-Einstein condensation. When these axion stars exceed a critical mass, typically through accretion or merging, they experience a self-induced collapse. This process results in short radio bursts, assuming a decay constant $f_a\lesssim10^{13}{\rm{GeV}}$, with the frequency depending on the axion mass and the luminosity determined by both the axion mass and decay constant. Therefore, we propose that collapsing axion stars could serve as a novel standard candle to constrain $H_0$. Even more interesting is that the radio bursts emitted by collapsing axion stars with specific parameters match the characteristics of observed non-repeating fast radio bursts (FRBs). Thus, FRBs generated by collapsing axion stars have the potential to be used as standard candles to constrain $H_0$.
Published: 2024

144. DeepTTV: Deep Learning Prediction of Hidden Exoplanet From Transit Timing Variations

Author: Chen, Chen, Kong, Lingkai, Li, Gongjie, and Tao, Molei
Subjects: Astrophysics - Earth and Planetary Astrophysics, Astrophysics - Instrumentation and Methods for Astrophysics, Computer Science - Machine Learning
Abstract: Transit timing variation (TTV) provides rich information about the mass and orbital properties of exoplanets, which are often obtained by solving an inverse problem via Markov Chain Monte Carlo (MCMC). In this paper, we design a new data-driven approach, which potentially can be applied to problems that are hard to traditional MCMC methods, such as the case with only one planet transiting. Specifically, we use a deep learning approach to predict the parameters of non-transit companion for the single transit system with transit information (i.e., TTV, and Transit Duration Variation (TDV)) as input. Thanks to a newly constructed \textit{Transformer}-based architecture that can extract long-range interactions from TTV sequential data, this previously difficult task can now be accomplished with high accuracy, with an overall fractional error of $\sim$2\% on mass and eccentricity., Comment: 13 pages, 6 figures and 5 tables submitted to AAS journals, comments welcome
Published: 2024

145. Reprojection Errors as Prompts for Efficient Scene Coordinate Regression

Author: Liu, Ting-Ru, Yang, Hsuan-Kung, Liu, Jou-Min, Huang, Chun-Wei, Chiang, Tsung-Chih, Kong, Quan, Kobori, Norimasa, and Lee, Chun-Yi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization. However, many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas. Utilizing these areas for optimization during training can potentially hamper the overall performance and efficiency of the model. In this study, we first perform an in-depth analysis to validate the adverse impacts of these areas. Drawing inspiration from our analysis, we then introduce an error-guided feature selection (EGFS) mechanism, in tandem with the use of the Segment Anything Model (SAM). This mechanism seeds low reprojection areas as prompts and expands them into error-guided masks, and then utilizes these masks to sample points and filter out problematic areas in an iterative manner. The experiments demonstrate that our method outperforms existing SCR approaches that do not rely on 3D information on the Cambridge Landmarks and Indoor6 datasets., Comment: ECCV2024
Published: 2024

146. NPU-NTU System for Voice Privacy 2024 Challenge

Author: Yao, Jixun, Kuzmin, Nikita, Wang, Qing, Guo, Pengcheng, Ning, Ziqian, Guo, Dake, Lee, Kong Aik, Chng, Eng-Siong, and Xie, Lei
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Speaker anonymization is an effective privacy protection solution that conceals the speaker's identity while preserving the linguistic content and paralinguistic information of the original speech. To establish a fair benchmark and facilitate comparison of speaker anonymization systems, the VoicePrivacy Challenge (VPC) was held in 2020 and 2022, with a new edition planned for 2024. In this paper, we describe our proposed speaker anonymization system for VPC 2024. Our system employs a disentangled neural codec architecture and a serial disentanglement strategy to gradually disentangle the global speaker identity and time-variant linguistic content and paralinguistic information. We introduce multiple distillation methods to disentangle linguistic content, speaker identity, and emotion. These methods include semantic distillation, supervised speaker distillation, and frame-level emotion distillation. Based on these distillations, we anonymize the original speaker identity using a weighted sum of a set of candidate speaker identities and a randomly generated speaker identity. Our system achieves the best trade-off of privacy protection and emotion preservation in VPC 2024., Comment: System description for VPC 2024
Published: 2024

147. Latent Space Energy-based Neural ODEs

Author: Cheng, Sheng, Kong, Deqian, Xie, Jianwen, Lee, Kookjin, Wu, Ying Nian, and Yang, Yezhou
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: This paper introduces a novel family of deep dynamical models designed to represent continuous-time sequence data. This family of models generates each data point in the time series by a neural emission model, which is a non-linear transformation of a latent state vector. The trajectory of the latent states is implicitly described by a neural ordinary differential equation (ODE), with the initial state following an informative prior distribution parameterized by an energy-based model. Furthermore, we can extend this model to disentangle dynamic states from underlying static factors of variation, represented as time-invariant variables in the latent space. We train the model using maximum likelihood estimation with Markov chain Monte Carlo (MCMC) in an end-to-end manner, without requiring additional assisting components such as an inference network. Our experiments on oscillating systems, videos and real-world state sequences (MuJoCo) illustrate that ODEs with the learnable energy-based prior outperform existing counterparts, and can generalize to new dynamic parameterization, enabling long-horizon predictions.
Published: 2024

148. The Giant Radio Array for Neutrino Detection (GRAND) Collaboration -- Contributions to the 10th International Workshop on Acoustic and Radio EeV Neutrino Detection Activities (ARENA 2024)

Author: Batista, Rafael Alves, Benoit-Lévy, Aurélien, Bister, Teresa, Bohacova, Martina, Bustamante, Mauricio, Carvalho, Washington, Chen, Yiren, Cheng, LingMei, Chiche, Simon, Colley, Jean-Marc, Correa, Pablo, Laurenciu, Nicoleta Cucu, Dai, Zigao, de Almeida, Rogerio M., de Errico, Beatriz, de Jong, Sijbrand, Neto, João R. T. de Mello, de Vries, Krijn D, Decoene, Valentin, Denton, Peter B., Duan, Bohao, Duan, Kaikai, Engel, Ralph, Erba, William, Fan, Yizhong, Ferrière, Arsène, Gou, QuanBu, Gu, Junhua, Guelfand, Marion, Guo, Jianhua, Guo, Yiqing, Guépin, Claire, Gülzow, Lukas, Haungs, Andreas, Havelka, Matej, He, Haoning, Hivon, Eric, Hu, Hongbo, Huang, Xiaoyuan, Huang, Yan, Huege, Tim, Jiang, Wen, Koirala, Ramesh, Kong, ChuiZheng, Kotera, Kumiko, Köhler, Jelena, Lago, Bruno L., Lai, Zhisen, Coz, Sandra Le, Legrand, François, Leisos, Antonios, Li, Rui, Li, Xingyu, Li, YiFei, Liu, Cheng, Liu, Ruoyu, Liu, Wei, Ma, Pengxiong, Macias, Oscar, Magnard, Frédéric, Marcowith, Alexandre, Martineau-Huynh, Olivier, McKinley, Thomas, Minodier, Paul, Mitra, Pragati, Mostafá, Miguel, Murase, Kohta, Niess, Valentin, Nonis, Stavros, Ogio, Shoichi, Oikonomou, Foteini, Pan, Hongwei, Papageorgiou, Konstantinos, Pierog, Tanguy, Piotrowski, Lech Wiktor, Prunet, Simon, Qian, Xiangli, Roth, Markus, Sako, Takashi, Schoorlemmer, Harm, Szálas-Motesiczky, Dániel, Sławiński, Szymon, Tian, Xishui, Timmermans, Anne, Timmermans, Charles, Tobiska, Petr, Tsirigotis, Apostolos, Tueros, Matías, Vittakis, George, Wang, Hanrui, Wang, Jiale, Wang, Shen, Wang, Xiangyu, Wang, Xu, Wei, Daming, Wei, Feng, Wu, Xiangping, Wu, Xuefeng, Xu, Xin, Xu, Xing, Yang, Fufu, Yang, Lili, Yang, Xuan, Yuan, Qiang, Zarka, Philippe, Zeng, Houdun, Zhang, Chao, Zhang, Jianli, Zhang, Kewen, Zhang, Pengfei, Zhang, Qingchi, Zhang, Songbo, Zhang, Yi, Zhou, Hao, Wissel, Stephanie, Zeolla, Andrew, Deaconu, Cosmin, Hughes, Kaeli, Martin, Zachary, Mulrey, Katharine, Cummings, Austin, Krömer, Oliver, Plant, Kathryn, and Schroeder, Frank G.
Subjects: Astrophysics - Instrumentation and Methods for Astrophysics, Astrophysics - High Energy Astrophysical Phenomena, High Energy Physics - Experiment, High Energy Physics - Phenomenology
Abstract: This is an index of the contributions by the Giant Radio Array for Neutrino Detection (GRAND) Collaboration to the 10th International Workshop on Acoustic and Radio EeV Neutrino Detection Activities (ARENA 2024, University of Chicago, June 11-14, 2024). The contributions include an overview of GRAND in its present and future incarnations, methods of radio-detection that are being developed for them, and ongoing joint work between the GRAND and BEACON experiments., Comment: Note: To access the list of contributions, please follow the "HTML" link that can be found on the arXiv page
Published: 2024

149. SymPAC: Scalable Symbolic Music Generation With Prompts And Constraints

Author: Chen, Haonan, Smith, Jordan B. L., Spijkervet, Janne, Wang, Ju-Chiang, Zou, Pei, Li, Bochen, Kong, Qiuqiang, and Du, Xingjian
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Progress in the task of symbolic music generation may be lagging behind other tasks like audio and text generation, in part because of the scarcity of symbolic training data. In this paper, we leverage the greater scale of audio music data by applying pre-trained MIR models (for transcription, beat tracking, structure analysis, etc.) to extract symbolic events and encode them into token sequences. To the best of our knowledge, this work is the first to demonstrate the feasibility of training symbolic generation models solely from auto-transcribed audio data. Furthermore, to enhance the controllability of the trained model, we introduce SymPAC (Symbolic Music Language Model with Prompting And Constrained Generation), which is distinguished by using (a) prompt bars in encoding and (b) a technique called Constrained Generation via Finite State Machines (FSMs) during inference time. We show the flexibility and controllability of this approach, which may be critical in making music AI useful to creators and users., Comment: ISMIR 2024
Published: 2024

150. Training on the Benchmark Is Not All You Need

Author: Ni, Shiwen, Kong, Xiangtao, Li, Chengming, Hu, Xiping, Xu, Ruifeng, Zhu, Jia, and Yang, Min
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The success of Large Language Models (LLMs) relies heavily on the huge amount of pre-training data learned in the pre-training phase. The opacity of the pre-training process and the training data causes the results of many benchmark tests to become unreliable. If any model has been trained on a benchmark test set, it can seriously hinder the health of the field. In order to automate and efficiently test the capabilities of large language models, numerous mainstream benchmarks adopt a multiple-choice format. As the swapping of the contents of multiple-choice options does not affect the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log probability distribution over the derived data sets. If there is a maximum and outlier in the set of log probabilities, it indicates that the data is leaked. Our method is able to work under black-box conditions without access to model training data or weights, effectively identifying data leakage from benchmark test sets in model pre-training data, including both normal scenarios and complex scenarios where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark, and we find that the Qwen family of LLMs has the highest degree of data leakage.
Published: 2024
