Author: "Yang, Xi" / Publication Type: Reports - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yang, Xi"' showing total 213 results

1. SA3DIP: Segment Any 3D Instance with Potential 3D Priors

Author: Yang, Xi, Gu, Xu, Yin, Xingyilang, and Gao, Xinbo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The proliferation of 2D foundation models has sparked research into adapting them for open-world 3D instance segmentation. Recent methods introduce a paradigm that leverages superpoints as geometric primitives and incorporates 2D multi-view masks from Segment Anything model (SAM) as merging guidance, achieving outstanding zero-shot instance segmentation results. However, the limited use of 3D priors restricts the segmentation performance. Previous methods calculate the 3D superpoints solely based on estimated normal from spatial coordinates, resulting in under-segmentation for instances with similar geometry. Besides, the heavy reliance on SAM and hand-crafted algorithms in 2D space suffers from over-segmentation due to SAM's inherent part-level segmentation tendency. To address these issues, we propose SA3DIP, a novel method for Segmenting Any 3D Instances via exploiting potential 3D Priors. Specifically, on one hand, we generate complementary 3D primitives based on both geometric and textural priors, which reduces the initial errors that accumulate in subsequent procedures. On the other hand, we introduce supplemental constraints from the 3D space by using a 3D detector to guide a further merging process. Furthermore, we notice a considerable portion of low-quality ground truth annotations in ScanNetV2 benchmark, which affect the fair evaluations. Thus, we present ScanNetV2-INS with complete ground truth labels and supplement additional instances for 3D class-agnostic instance segmentation. Experimental evaluations on various 2D-3D datasets demonstrate the effectiveness and robustness of our approach. Our code and proposed ScanNetV2-INS dataset are available HERE.
Published: 2024

2. Off-Policy Selection for Initiating Human-Centric Experimental Design

Author: Gao, Ge, Yang, Xi, Gao, Qitong, Ju, Song, Pajic, Miroslav, and Chi, Min
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction
Abstract: In human-centric tasks such as healthcare and education, the heterogeneity among patients and students necessitates personalized treatments and instructional interventions. While reinforcement learning (RL) has been utilized in those tasks, off-policy selection (OPS) is pivotal to close the loop by offline evaluating and selecting policies without online interactions, yet current OPS methods often overlook the heterogeneity among participants. Our work is centered on resolving a pivotal challenge in human-centric systems (HCSs): how to select a policy to deploy when a new participant joins the cohort, without having access to any prior offline data collected over the participant? We introduce First-Glance Off-Policy Selection (FPS), a novel approach that systematically addresses participant heterogeneity through sub-group segmentation and tailored OPS criteria to each sub-group. By grouping individuals with similar traits, FPS facilitates personalized policy selection aligned with unique characteristics of each participant or group of participants. FPS is evaluated via two important but challenging applications, intelligent tutoring systems and a healthcare application for sepsis treatment and intervention. FPS presents significant advancement in enhancing learning outcomes of students and in-hospital care outcomes.
Published: 2024

3. Gamification of virtual museum curation: a case study of Chinese bronze wares

Author: Li, Zhaokang, Zhang, Qian, Xu, Jiayue, Li, Chuntao, and Yang, Xi
Subjects: Computer Science - Computers and Society
Abstract: Museums, which are among the most popular science institutions outside schools, are usually used to display and introduce historical culture and cultural relics to tourists. Text and audio explanations are used by traditional museums to popularize historical knowledge and science for tourists, and general interactive systems are based on desktops. This learning method is relatively boring in terms of experience. As a result, tourists have no desire or interest in actively exploring and learning about bronze ware, so they only have a basic understanding about bronze ware. Since most tourists are familiar with games, they are more likely to be attracted by game content and will actively explore and interact with it. In addition, a certain degree of reality is created by virtual reality technology and an immersive experience through head-mounted devices is provided to users. In this paper, we take Chinese bronzes as the research objects. We first use laser scanners to obtain bronze models; then, we build a virtual museum environment, and we finally design a virtual reality curation game based on this bronze digital museum. This game offers visitors an immersive museum roaming and bronze ware interactive experience. Through a combination of text, video learning, and games, visitors' curiosity and desire to explore bronze ware are stimulated, and their understanding and ability to remember bronze ware knowledge can be deepened. In terms of cultural heritage, this game is also conducive to the spread of traditional Chinese bronze culture throughout the world.
Comment: 18 pages, 10 figures
Published: 2024
Full Text: View/download PDF

4. Interpret Your Decision: Logical Reasoning Regularization for Generalization in Visual Classification

Author: Tan, Zhaorui, Yang, Xi, Wang, Qiufeng, Nguyen, Anh, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Vision models excel in image classification but struggle to generalize to unseen data, such as classifying images from unseen domains or discovering novel categories. In this paper, we explore the relationship between logical reasoning and deep learning generalization in visual classification. A logical regularization termed L-Reg is derived which bridges a logical analysis framework to image classification. Our work reveals that L-Reg reduces the complexity of the model in terms of the feature distribution and classifier weights. Specifically, we unveil the interpretability brought by L-Reg, as it enables the model to extract the salient features, such as faces to persons, for classification. Theoretical analysis and experiments demonstrate that L-Reg enhances generalization across various scenarios, including multi-domain generalization and generalized category discovery. In complex real-world scenarios where images span unknown classes and unseen domains, L-Reg consistently improves generalization, highlighting its practical efficacy.
Comment: Accepted by NeurIPS2024 as Spotlight
Published: 2024

5. Sine-transform-based fast solvers for Riesz fractional nonlinear Schrödinger equations with attractive nonlinearities

Author: Chen, Chao, Yang, Xi, and Zhang, Fei-Yan
Subjects: Mathematics - Numerical Analysis
Abstract: This paper presents fast solvers for linear systems arising from the discretization of fractional nonlinear Schrödinger equations with Riesz derivatives and attractive nonlinearities. These systems are characterized by complex symmetry, indefiniteness, and a $d$-level Toeplitz-plus-diagonal structure. We propose a Toeplitz-based anti-symmetric and normal splitting iteration method for the equivalent real block linear systems, ensuring unconditional convergence. The derived optimal parameter is approximately equal to 1. By combining this iteration method with sine-transform-based preconditioning, we introduce a novel preconditioner that enhances the convergence rate of Krylov subspace methods. Both theoretical and numerical analyses demonstrate that the new preconditioner exhibits a parameter-free property (allowing the iteration parameter to be fixed at 1). The eigenvalues of the preconditioned system matrix are nearly clustered in a small neighborhood around 1, and the convergence rate of the corresponding preconditioned GMRES method is independent of the spatial mesh size and the fractional order of the Riesz derivatives.
Published: 2024

6. Recording dynamic facial micro-expressions with a multi-focus camera array

Author: Kreiss, Lucas, Tang, Weiheng, Balla, Ramana, Yang, Xi, Chaware, Amey, Kim, Kanghyun, Cook, Clare B., Begue, Aurelien, Dugo, Clay, Harfouche, Mark, Zhou, Kevin C., and Horstmeyer, Roarke
Subjects: Physics - Optics
Abstract: We present an approach of utilizing a multi-camera array system for capturing dynamic high-resolution videos of the human face, with improved imaging performance as compared to traditional single-camera configurations. Employing an array of 54 individual high-resolution cameras, each with its own 13 megapixel sensor (709 megapixels total), we uniquely focus each camera to a different plane across the curved surface of the human face in order to capture dynamic facial expressions. Post-processing methods then stitch together each synchronized set of 54 images into a composite video frame. Our multi-focus strategy overcomes the resolution and depth-of-field (DOF) limitations for capturing macroscopically curved surfaces such as the human face, while maintaining high lateral resolution. Specifically we demonstrate how our setup achieves a generally uniform lateral resolution of 26.75 +/- 8.8 micrometer across a composite DOF of ~43mm that covers the entire face (85 cm^2 + FOV). Compared to a single-focus configuration this is almost a 10-fold increase in effective DOF. We believe that our new approach for multi-focus camera array video sets the stage for future video capture of a variety of dynamic and macroscopically curved surfaces at microscopic resolution.
Published: 2024

7. Mind the Gap: Promoting Missing Modality Brain Tumor Segmentation with Alignment

Author: Liu, Tianyi, Tan, Zhaorui, Jiang, Haochuan, Yang, Xi, and Huang, Kaizhu
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents an even more difficult scenario. To cope with this challenge, knowledge distillation has emerged as one promising strategy. However, recent efforts typically overlook the modality gaps and thus fail to learn invariant feature representations across different modalities. Such drawback consequently leads to limited performance for both teachers and students. To ameliorate these problems, in this paper, we propose a novel paradigm that aligns latent features of involved modalities to a well-defined distribution anchor. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce a teacher with narrowed modality gaps. This further offers superior guidance for missing modality students, achieving an average improvement of 1.75 on dice score.
Published: 2024

8. Emu3: Next-Token Prediction is All You Need

Author: Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, Zhao, Yingli, Ao, Yulong, Min, Xuebin, Li, Tao, Wu, Boya, Zhao, Bo, Zhang, Bowen, Wang, Liangdong, Liu, Guang, He, Zheqi, Yang, Xi, Liu, Jingjing, Lin, Yonghua, Huang, Tiejun, and Wang, Zhongyuan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.
Comment: Project Page: https://emu.baai.ac.cn
Published: 2024

9. CathAction: A Benchmark for Endovascular Intervention Understanding

Author: Huang, Baoru, Vo, Tuan, Kongtongvattana, Chayun, Dagnino, Giulio, Kundrat, Dennis, Chi, Wenqiang, Abdelaziz, Mohamed, Kwok, Trevor, Jianu, Tudor, Do, Tuong, Le, Hieu, Nguyen, Minh, Nguyen, Hoan, Tjiputra, Erman, Tran, Quang, Xie, Jianyang, Meng, Yanda, Bhattarai, Binod, Tan, Zhaorui, Liu, Hongbin, Gan, Hong Seng, Wang, Wei, Yang, Xi, Wang, Qiufeng, Su, Jionglong, Huang, Kaizhu, Stefanidis, Angelos, Guo, Min, Du, Bo, Tao, Rong, Vu, Minh, Zheng, Guoyan, Zheng, Yalin, Vasconcelos, Francisco, Stoyanov, Danail, Elson, Daniel, Baena, Ferdinando Rodriguez y, and Nguyen, Anh
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-time visual feedback from catheterization analysis is crucial for enhancing surgical safety and efficiency during endovascular interventions. However, existing datasets are often limited to specific tasks, small scale, and lack the comprehensive annotations necessary for broader endovascular intervention understanding. To tackle these limitations, we introduce CathAction, a large-scale dataset for catheterization understanding. Our CathAction dataset encompasses approximately 500,000 annotated frames for catheterization action understanding and collision detection, and 25,000 ground truth masks for catheter and guidewire segmentation. For each task, we benchmark recent related works in the field. We further discuss the challenges of endovascular interventions compared to traditional computer vision tasks and point out open research questions. We hope that CathAction will facilitate the development of endovascular intervention understanding methods that can be applied to real-world applications. The dataset is available at https://airvlab.github.io/cathaction/.
Comment: 10 pages. Webpage: https://airvlab.github.io/cathaction/
Published: 2024

10. MedMAP: Promoting Incomplete Multi-modal Brain Tumor Segmentation with Alignment

Author: Liu, Tianyi, Tan, Zhaorui, Chen, Muyin, Yang, Xi, Jiang, Haochuan, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Brain tumor segmentation is often based on multiple magnetic resonance imaging (MRI). However, in clinical practice, certain modalities of MRI may be missing, which presents a more difficult scenario. To cope with this challenge, Knowledge Distillation, Domain Adaption, and Shared Latent Space have emerged as commonly promising strategies. However, recent efforts typically overlook the modality gaps and thus fail to learn important invariant feature representations across different modalities. Such drawback consequently leads to limited performance for missing modality models. To ameliorate these problems, pre-trained models are used in natural visual segmentation tasks to minimize the gaps. However, promising pre-trained models are often unavailable in medical image segmentation tasks. Along this line, in this paper, we propose a novel paradigm that aligns latent features of involved modalities to a well-defined distribution anchor as the substitution of the pre-trained model. As a major contribution, we prove that our novel training paradigm ensures a tight evidence lower bound, thus theoretically certifying its effectiveness. Extensive experiments on different backbones validate that the proposed paradigm can enable invariant feature representations and produce models with narrowed modality gaps. Models with our alignment paradigm show their superior performance on both BraTS2018 and BraTS2020 datasets.
Published: 2024

11. Dedicated beam position monitor pair for model-independent lattice characterization at NSLS-II

Author: Li, Yongjun, Ha, Kiman, Padrazo, Danny, Kosciuk, Bernard, Bacha, Belkacem, Seegitz, Michael, Rainer, Robert, Mead, Joseph, Yang, Xi, Tian, Yuke, Todd, Robert, Smaluk, Victor, and Cheng, Weixing
Subjects: Physics - Accelerator Physics
Abstract: This paper reports recent lattice characterization results obtained at the National Synchrotron Light Source II (NSLS-II) storage ring, conducted without reliance on a lattice model. A pair of beam position monitors (BPMs) with bunch-by-bunch (B×B) resolution were recently installed in a section of the storage ring free of magnetic fields. The new BPM pair measured the beam, or bunch's transverse Poincaré map precisely after the beam was excited. Linear one-turn-matrices (OTM) were then derived, and from these, the 4-dimensional coupled Twiss parameters were extracted at the locations of the BPM pair. By normalizing beam oscillation amplitudes with the Twiss parameters, the global action-variables were obtained. These action-variables facilitated the measurement of the local Twiss parameters observed by other BPMs independent of a lattice model. This method is general, and particularly useful in certain scenarios such as a round beam mode in a diffraction-limited light source ring. We applied it to assess both weakly and strongly coupled lattices at the NSLS-II ring. Through analysis of the strongly coupled lattice, the quadrupole tilt errors were estimated to be less than 400 μrad. Utilizing the BPMs' B×B resolution, for the first time we observed the variations of the linear lattice along a long bunch-train.
Comment: 28 pages, 16 figures, accepted by Nuclear Inst. and Methods in Physics Research, A
Published: 2024

12. Low-Overhead Channel Estimation via 3D Extrapolation for TDD mmWave Massive MIMO Systems Under High-Mobility Scenarios

Author: Zhou, Binggui, Yang, Xi, Ma, Shaodan, Gao, Feifei, and Yang, Guanghua
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Information Theory
Abstract: In TDD mmWave massive MIMO systems, the downlink CSI can be attained through uplink channel estimation thanks to the uplink-downlink channel reciprocity. However, the channel aging issue is significant under high-mobility scenarios and thus necessitates frequent uplink channel estimation. In addition, large amounts of antennas and subcarriers lead to high-dimensional CSI matrices, aggravating the pilot training overhead. To systematically reduce the pilot overhead, a spatial, frequency, and temporal domain (3D) channel extrapolation framework is proposed in this paper. Considering the marginal effects of pilots in the spatial and frequency domains and the effectiveness of traditional knowledge-driven channel estimation methods, we first propose a knowledge-and-data driven spatial-frequency channel extrapolation network (KDD-SFCEN) for uplink channel estimation by exploiting the least square estimator for coarse channel estimation and joint spatial-frequency channel extrapolation to reduce the spatial-frequency domain pilot overhead. Then, resorting to the uplink-downlink channel reciprocity and temporal domain dependencies of downlink channels, a temporal uplink-downlink channel extrapolation network (TUDCEN) is proposed for slot-level channel extrapolation, aiming to enlarge the pilot signal period and thus reduce the temporal domain pilot overhead under high-mobility scenarios. Specifically, we propose the spatial-frequency sampling embedding module to reduce the representation dimension and consequent computational complexity, and we propose to exploit the autoregressive generative Transformer for generating downlink channels autoregressively. Numerical results demonstrate the superiority of the proposed framework in significantly reducing the pilot training overhead by more than 16 times and improving the system's spectral efficiency under high-mobility scenarios.
Comment: 13 pages, 11 figures, 3 tables. This paper has been submitted to IEEE journal for possible publication
Published: 2024

13. HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation

Author: Luo, Wen, Shen, Tianshu, Li, Wei, Peng, Guangyue, Xuan, Richeng, Wang, Houfeng, and Yang, Xi
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial encompasses both spontaneous and induced hallucination scenarios, covering factuality and faithfulness hallucinations. The benchmark includes 4,094 dialogues with a total of 146,856 samples. Leveraging HalluDial, we conduct a comprehensive meta-evaluation of LLMs' hallucination evaluation capabilities in information-seeking dialogues and introduce a specialized judge language model, HalluJudge. The high data quality of HalluDial enables HalluJudge to achieve superior or competitive performance in hallucination evaluation, facilitating the automatic assessment of dialogue-level hallucinations in LLMs and providing valuable insights into this phenomenon. The dataset and the code are available at https://github.com/FlagOpen/HalluDial.
Published: 2024

14. MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Author: Zhou, Junjie, Shu, Yan, Zhao, Bo, Wu, Boya, Xiao, Shitao, Yang, Xi, Xiong, Yongping, Zhang, Bo, Huang, Tiejun, and Liu, Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's technique, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
Published: 2024

15. A Generalized Apprenticeship Learning Framework for Modeling Heterogeneous Student Pedagogical Strategies

Author: Islam, Md Mirajul, Yang, Xi, Hostetter, John, Saha, Adittya Soukarjya, and Chi, Min
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: A key challenge in e-learning environments like Intelligent Tutoring Systems (ITSs) is to induce effective pedagogical policies efficiently. While Deep Reinforcement Learning (DRL) often suffers from sample inefficiency and reward function design difficulty, Apprenticeship Learning (AL) algorithms can overcome them. However, most AL algorithms cannot handle heterogeneity as they assume all demonstrations are generated with a homogeneous policy driven by a single reward function. Moreover, the AL algorithms that do consider heterogeneity often cannot generalize to large continuous state spaces and only work with discrete states. In this paper, we propose an expectation-maximization (EM)-EDM, a general AL framework to induce effective pedagogical policies from given optimal or near-optimal demonstrations, which are assumed to be driven by heterogeneous reward functions. We compare the effectiveness of the policies induced by our proposed EM-EDM against four AL-based baselines and two policies induced by DRL on two different but related tasks that involve pedagogical action prediction. Our overall results showed that, for both tasks, EM-EDM outperforms the four AL baselines across all performance metrics and the two DRL baselines. This suggests that EM-EDM can effectively model complex student pedagogical decision-making processes through the ability to manage a large, continuous state space and adapt to handle diverse and heterogeneous reward functions with very few given demonstrations.
Published: 2024

16. Revisiting Mutual Information Maximization for Generalized Category Discovery

Author: Tan, Zhaorui, Zhang, Chengrui, Yang, Xi, Sun, Jie, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generalized category discovery presents a challenge in a realistic scenario, which requires the model's generalization ability to recognize unlabeled samples from known and unknown categories. This paper revisits the challenge of generalized category discovery through the lens of information maximization (InfoMax) with a probabilistic parametric classifier. Our findings reveal that ensuring independence between known and unknown classes, while concurrently assuming a uniform probability distribution across all classes, yields an enlarged margin among known and unknown classes that promotes the model's performance. To achieve the aforementioned independence, we propose a novel InfoMax-based method, Regularized Parametric InfoMax (RPIM), which adopts pseudo labels to supervise unlabeled samples during InfoMax, while proposing a regularization to ensure the quality of the pseudo labels. Additionally, we introduce a novel semantic-bias transformation to refine the features from the pre-trained model instead of direct fine-tuning to reduce the computational costs. Extensive experiments on six benchmark datasets validate the effectiveness of our method. RPIM significantly improves the performance regarding unknown classes, surpassing the state-of-the-art method by an average margin of 3.5%.
Comment: Preprint version
Published: 2024

17. SCMix: Stochastic Compound Mixing for Open Compound Domain Adaptation in Semantic Segmentation

Author: Yao, Kai, Tan, Zhaorui, Su, Zixian, Yang, Xi, Sun, Jie, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Open compound domain adaptation (OCDA) aims to transfer knowledge from a labeled source domain to a mix of unlabeled homogeneous compound target domains while generalizing to open unseen domains. Existing OCDA methods solve the intra-domain gaps by a divide-and-conquer strategy, which divides the problem into several individual and parallel domain adaptation (DA) tasks. Such approaches often contain multiple sub-networks or stages, which may constrain the model's performance. In this work, starting from the general DA theory, we establish the generalization bound for the setting of OCDA. Built upon this, we argue that conventional OCDA approaches may substantially underestimate the inherent variance inside the compound target domains for model generalization. We subsequently present Stochastic Compound Mixing (SCMix), an augmentation strategy with the primary objective of mitigating the divergence between source and mixed target distributions. We provide theoretical analysis to substantiate the superiority of SCMix and prove that the previous methods are sub-groups of our methods. Extensive experiments show that our method attains a lower empirical risk on OCDA semantic segmentation tasks, thus supporting our theories. Combining the transformer architecture, SCMix achieves a notable performance boost compared to the SoTA results.
Published: 2024

18. FilterPrompt: Guiding Image Transfer in Diffusion Models

Author: Wang, Xi, Peng, Yichen, Fang, Heng, Xie, Haoran, Yang, Xi, and Li, Chuntao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In controllable generation tasks, flexibly manipulating the generated images to attain a desired appearance or structure based on a single input image cue remains a critical and longstanding challenge. Achieving this requires the effective decoupling of key attributes within the input image data, aiming to get representations accurately. Previous research has predominantly concentrated on disentangling image attributes within feature space. However, the complex distribution present in real-world data often makes the application of such decoupling algorithms to other datasets challenging. Moreover, the granularity of control over feature encoding frequently fails to meet specific task requirements. Upon scrutinizing the characteristics of various generative models, we have observed that the input sensitivity and dynamic evolution properties of the diffusion model can be effectively fused with the explicit decomposition operation in pixel space. This integration enables the image processing operations performed in pixel space for a specific feature distribution of the input image, and can achieve the desired control effect in the generated results. Therefore, we propose FilterPrompt, an approach to enhance the model control effect. It can be universally applied to any diffusion model, allowing users to adjust the representation of specific image features in accordance with task requirements, thereby facilitating more precise and controllable generation outcomes. In particular, our designed experiments demonstrate that the FilterPrompt optimizes feature correlation, mitigates content conflicts during the generation process, and enhances the model's control capability.
Comment: Project Page: https://meaoxixi.github.io/FilterPrompt/
Published: 2024

19. CanvasPic: An Interactive Tool for Freely Generating Facial Images Based on Spatial Layout

Author: Wei, Jiafu, Chang, Chia-Ming, Yang, Xi, and Igarashi, Takeo
Subjects: Computer Science - Human-Computer Interaction
Abstract: In real-world usage, existing GAN image generation tools come up short due to their lack of intuitive interfaces and limited flexibility. To overcome these limitations, we developed CanvasPic, an innovative tool for flexible GAN image generation. Our tool introduces a novel 2D layout design that allows users to intuitively control image attributes based on real-world images. By interacting with the distances between images in the spatial layout, users are able to conveniently control the influence of each attribute on the target image and explore a wide range of generated results. Considering practical application scenarios, a user study involving 24 participants was conducted to compare our tool with existing tools in GAN image generation. The results of the study demonstrate that our tool significantly enhances the user experience, enabling more effective achievement of desired generative results.
Published: 2024

20. Towards Understanding the Influence of Reward Margin on Preference Model Performance

Author: Qin, Bowen, Feng, Duanyu, and Yang, Xi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a widely used framework for the training of language models. However, the process of using RLHF to develop a language model that is well-aligned presents challenges, especially when it comes to optimizing the reward model. Our research has found that existing reward models, when trained using the traditional ranking objective based on human preference data, often struggle to effectively distinguish between responses that are more or less favorable in real-world scenarios. To bridge this gap, our study introduces a novel method to estimate the preference differences without the need for detailed, exhaustive labels from human annotators. Our experimental results provide empirical evidence that incorporating margin values into the training process significantly improves the effectiveness of reward models. This comparative analysis not only demonstrates the superiority of our approach in terms of reward prediction accuracy but also highlights its effectiveness in practical applications.
Published: 2024
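
Note on record 20: the abstract does not state the training objective in closed form. As a minimal sketch, assuming the commonly used pairwise ranking loss -log(sigmoid(r_chosen - r_rejected)) with a margin subtracted inside the sigmoid, the effect of a margin could look like the following. The function name `ranking_loss` and the way the margin enters are illustrative assumptions, not the authors' method, and the margin is simply passed in rather than estimated.

```python
import math


def ranking_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise ranking loss -log(sigmoid(r_chosen - r_rejected - margin)).

    Hypothetical helper: the margin term is an assumed formulation.
    """
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch on the sign of z
    # so that exp() is only ever called on a non-positive argument.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# The same reward gap is penalized more once a margin is required.
loss_without_margin = ranking_loss(1.0, 0.0)
loss_with_margin = ranking_loss(1.0, 0.0, margin=0.5)
```

With a positive margin, the chosen response has to beat the rejected one by at least that amount before the loss becomes small.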

21. Beneath the Surface: Revealing Deep-Tissue Blood Flow in Human Subjects with Massively Parallelized Diffuse Correlation Spectroscopy

Author: Kreiss, Lucas, Wu, Melissa, Wayne, Michael, Xu, Shiqi, McKee, Paul, Dwamena, Derrick, Kim, Kanghyun, Lee, Kyung Chul, Liu, Wenhui, Ulku, Aarin, Harfouche, Mark, Yang, Xi, Cook, Clare, Chaware, Amey, Lee, Seung Ah, Buckley, Erin, Bruschini, Claudio, Charbon, Edoardo, Huettel, Scott, and Horstmeyer, Roarke
Subjects: Physics - Medical Physics, Physics - Optics
Abstract: Diffuse Correlation Spectroscopy (DCS) allows the label-free investigation of microvascular dynamics deep within living tissue. However, common implementations of DCS are currently limited to measurement depths of $\sim 1-1.5cm$, which can limit the accuracy of cerebral hemodynamics measurement. Here we present massively parallelized DCS (pDCS) using novel single photon avalanche detector (SPAD) arrays with up to 500x500 individual channels. The new SPAD array technology can boost the signal-to-noise ratio by a factor of up to 500 compared to single-pixel DCS, or by more than 15-fold compared to the most recent state-of-the-art pDCS demonstrations. Our results demonstrate the first in vivo use of this massively parallelized DCS system to measure cerebral blood flow changes at $\sim 2cm$ depth in human adults. We compared different modes of operation and applied a dual detection strategy, where a secondary SPAD array is used to simultaneously assess the superficial blood flow as a built-in reference measurement. While the blood flow in the superficial scalp tissue showed no significant change during cognitive activation, the deep pDCS measurement showed a statistically significant increase in the derived blood flow index of 8-12% when compared to the control rest state.
Published: 2024
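
Note on record 21: the quoted factor of up to 500 is consistent with the standard rule that averaging N independent, equally noisy channels improves the signal-to-noise ratio by sqrt(N). The sketch below checks that arithmetic. The independence assumption and the 32x32 comparison array are assumptions made here for illustration and are not taken from the abstract.

```python
import math


def snr_gain(num_channels: int) -> float:
    """SNR gain from averaging independent, equally noisy channels (sqrt(N) rule)."""
    return math.sqrt(num_channels)


full_array_gain = snr_gain(500 * 500)  # 250,000 channels vs. a single pixel
# Hypothetical smaller array, used only to illustrate a relative gain.
relative_gain = full_array_gain / snr_gain(32 * 32)
```

Under these assumptions `full_array_gain` is 500.0 and `relative_gain` is 15.625.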

22. Rethinking Multi-domain Generalization with A General Learning Objective

Author: Tan, Zhaorui, Yang, Xi, and Huang, Kaizhu
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Multi-domain generalization (mDG) universally aims to minimize the discrepancy between training and testing distributions to enhance marginal-to-label distribution mapping. However, existing mDG literature lacks a general learning objective paradigm and often imposes constraints on static target marginal distributions. In this paper, we propose to leverage a $Y$-mapping to relax the constraint. We rethink the learning objective for mDG and design a new general learning objective to interpret and analyze most existing mDG wisdom. This general objective is bifurcated into two synergistic aims: learning domain-independent conditional features and maximizing a posterior. Explorations also extend to two effective regularization terms that incorporate prior information and suppress invalid causality, alleviating the issues that come with relaxed constraints. We theoretically contribute an upper bound for the domain alignment of domain-independent conditional features, disclosing that many previous mDG endeavors actually optimize the objective only partially and thus lead to limited performance. As such, our study distills a general learning objective into four practical components, providing a general, robust, and flexible mechanism to handle complex domain shifts. Extensive empirical results indicate that the proposed objective with $Y$-mapping leads to substantially better mDG performance in various downstream tasks, including regression, segmentation, and classification.
Comment: Accepted by CVPR24
Published: 2024

23. CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning

Author: He, Zheqi, Wu, Xinya, Zhou, Pengfei, Xuan, Richeng, Liu, Guang, Yang, Xi, Zhu, Qiannan, and Huang, Hua
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Multi-modal large language models (MLLMs) have achieved remarkable progress and demonstrated powerful knowledge comprehension and reasoning abilities. However, the mastery of domain-specific knowledge, which is essential for evaluating the intelligence of MLLMs, continues to be a challenge. Current multi-modal benchmarks for domain-specific knowledge concentrate on multiple-choice questions and are predominantly available in English, which imposes limitations on the comprehensiveness of the evaluation. To this end, we introduce CMMU, a novel benchmark for multi-modal and multi-type question understanding and reasoning in Chinese. CMMU consists of 3,603 questions in 7 subjects, covering knowledge from primary to high school. The questions can be categorized into 3 types: multiple-choice, multiple-response, and fill-in-the-blank, bringing greater challenges to MLLMs. In addition, we propose an evaluation strategy called Positional Error Variance for assessing multiple-choice questions. The strategy aims to perform a quantitative analysis of position bias. We evaluate seven open-source MLLMs along with GPT4-V, Gemini-Pro, and Qwen-VL-Plus. The results demonstrate that CMMU poses a significant challenge to the recent MLLMs. The data and code are available at https://github.com/FlagOpen/CMMU.
Published: 2024

24. Rotating massive strangeon stars and X-ray plateau of short GRBs

Author: Yang, Xi-Yan, Lai, Xiao-Yu, Tan, Wei-Wei, and Xu, Ren-Xin
Subjects: Astrophysics - High Energy Astrophysical Phenomena
Abstract: Strangeon stars, which are proposed to describe the nature of pulsar-like compact stars, have passed various observational tests. The maximum mass of a non-rotating strangeon star could be high, which implies that the remnants of binary strangeon star mergers could even be long-lived massive strangeon stars. We study rigidly rotating strangeon stars in the slowly rotating approximation, using the Lennard-Jones model for the equation of state. Rotation can significantly increase the maximum mass of strangeon stars with unchanged baryon numbers, enlarging the mass range of long-lived strangeon stars. During spin-down after merger, the decrease in the radius of the remnant will lead to the release of gravitational energy. Taking into account the efficiency of converting the gravitational energy luminosity to the observed X-ray luminosity, we find that the gravitational energy could provide an alternative energy source for the plateau emission of the X-ray afterglow. The fitting results of the X-ray plateau emission of some short gamma-ray bursts suggest that the magnetic dipole field strength of the remnants can be much smaller than that expected when the plateau emission is powered only by the spin-down luminosity of magnetars.
Comment: Accepted by RAA
Published: 2024

25. Joint Beamforming Optimization and Mode Selection for RDARS-aided MIMO Systems

Author: Wang, Jintao, Ma, Chengzhi, Gong, Shiqi, Yang, Xi, and Ma, Shaodan
Subjects: Computer Science - Information Theory, Electrical Engineering and Systems Science - Signal Processing
Abstract: Considering the appealing distribution gains of distributed antenna systems (DAS) and passive gains of reconfigurable intelligent surface (RIS), a flexible reconfigurable architecture called reconfigurable distributed antenna and reflecting surface (RDARS) is proposed. RDARS encompasses DAS and RIS as two special cases and maintains the advantages of distributed antennas while reducing the hardware cost by replacing some active antennas with low-cost passive reflecting surfaces. In this paper, we present an RDARS-aided uplink multi-user communication system and investigate the system transmission reliability with the newly proposed architecture. Specifically, in addition to the distribution gain and the reflection gain provided by the connection and reflection modes, respectively, we also consider the dynamic mode switching of each element, which introduces an additional degree of freedom (DoF) and thus results in a selection gain. As such, we aim to minimize the total sum mean-square-error (MSE) of all data streams by jointly optimizing the receive beamforming matrix, the reflection phase shifts and the channel-aware placement of elements in the connection mode. To tackle this nonconvex problem with intractable binary and cardinality constraints, we propose an inexact block coordinate descent (BCD) based penalty dual decomposition (PDD) algorithm with guaranteed convergence. Since the PDD algorithm usually suffers from high computational complexity, a low-complexity greedy-search-based alternating optimization (AO) algorithm is developed to yield a semi-closed-form solution with acceptable performance. Numerical results demonstrate the superiority of the proposed architecture compared to the conventional fully passive RIS or DAS. Furthermore, some insights about the practical implementation of RDARS are provided.
Comment: 13 pages, 9 figures. This paper has been submitted to IEEE journal for possible publication
Published: 2024

26. Data-driven Option Pricing

Author: Dai, Min, Jin, Hanqing, and Yang, Xi
Subjects: Quantitative Finance - Pricing of Securities
Abstract: We propose an innovative data-driven option pricing methodology that relies exclusively on the dataset of historical underlying asset prices. While the dataset is rooted in the objective world, option prices are commonly expressed as discounted expectations of their terminal payoffs in a risk-neutral world. Bridging this gap motivates us to identify a pricing kernel process, transforming option pricing into evaluating expectations in the objective world. We recover the pricing kernel by solving a utility maximization problem, and evaluate the expectations in terms of a functional optimization problem. Leveraging the deep learning technique, we design data-driven algorithms to solve both optimization problems over the dataset. Numerical experiments are presented to demonstrate the efficiency of our methodology.
Comment: 15 pages, 3 figures
Published: 2024

27. Toward Accurate and Temporally Consistent Video Restoration from Raw Data

Author: Guo, Shi, Ma, Jianqi, Yang, Xi, Zhang, Zhengqiang, and Zhang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Denoising and demosaicking are two fundamental steps in reconstructing a clean full-color video from raw data, and performing video denoising and demosaicking jointly, namely VJDD, could lead to better video restoration performance than performing them separately. In addition to restoration accuracy, another key challenge to VJDD lies in the temporal consistency of consecutive frames. This issue is exacerbated when perceptual regularization terms are introduced to enhance video perceptual quality. To address these challenges, we present a new VJDD framework by consistent and accurate latent space propagation, which leverages the estimation of previous frames as prior knowledge to ensure consistent recovery of the current frame. A data temporal consistency (DTC) loss and a relational perception consistency (RPC) loss are accordingly designed. Compared with the commonly used flow-based losses, the proposed losses can circumvent the error accumulation problem caused by inaccurate flow estimation and effectively handle intensity changes in videos, greatly improving the temporal consistency of output videos while preserving texture details. Extensive experiments demonstrate the leading VJDD performance of our method in terms of restoration accuracy, perceptual quality and temporal consistency. Codes and dataset are available at https://github.com/GuoShi28/VJDD.
Published: 2023

28. Optimize electron beam energy toward in-situ imaging of large thick bio-samples with nanometer resolution

Author: Yang, Xi, Smaluk, Victor, and Shaftan, Timur
Subjects: Physics - Applied Physics
Abstract: To optimize the electron energy for in-situ imaging of large bio-samples up to 10 µm thick with nanoscale resolution, we implemented an analytical model based on elastic and inelastic characteristic angles [1]. This model can be used to predict the transverse beam size broadening as a function of electron energy while the probe beam traverses the sample. As a result, the optimal choice of the electron beam energy can be determined. When the sample thickness is less than 10 µm, there exists an optimal electron beam energy below 10 MeV for a specific sample thickness. However, for samples thicker than 10 µm, the optimal beam energy is 10 MeV, and the ultimate resolution could become worse as the sample thickness increases.
Published: 2023

29. Point Deformable Network with Enhanced Normal Embedding for Point Cloud Analysis

Author: Yin, Xingyilang, Yang, Xi, Liu, Liangchen, Wang, Nannan, and Gao, Xinbo
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, MLP-based methods have shown strong performance in point cloud analysis. Simple MLP architectures are able to learn geometric features in local point groups yet fail to model long-range dependencies directly. In this paper, we propose Point Deformable Network (PDNet), a concise MLP-based network that can capture long-range relations with strong representation ability. Specifically, we put forward the Point Deformable Aggregation Module (PDAM) to improve representation capability in both long-range dependency and adaptive aggregation among points. For each query point, PDAM aggregates information from deformable reference points rather than points in limited local areas. The deformable reference points are generated in a data-dependent manner, and we initialize them according to the input point positions. Additional offsets and modulation scalars are learned on the whole point features, which shift the deformable reference points to the regions of interest. We also suggest estimating the normal vector for point clouds and applying Enhanced Normal Embedding (ENE) to the geometric extractors to improve the representation ability of single points. Extensive experiments and ablation studies on various benchmarks demonstrate the effectiveness and superiority of our PDNet.
Published: 2023

30. RDARS Empowered Massive MIMO System: Two-Timescale Transceiver Design with Imperfect CSI

Author: Ma, Chengzhi, Wang, Jintao, Yang, Xi, Yang, Guanghua, Zhang, Wei, and Ma, Shaodan
Subjects: Computer Science - Information Theory
Abstract: In this paper, we investigate a novel reconfigurable distributed antennas and reflecting surface (RDARS) aided multi-user massive MIMO system with imperfect CSI and propose a practical two-timescale (TTS) transceiver design to reduce the communication overhead and computational complexity of the system. In the RDARS-aided system, not only distribution gain but also reflection gain can be obtained by a flexible combination of the distributed antennas and reflecting surface, which differentiates the system from the others and also makes the TTS design challenging. To enable the optimal TTS transceiver design, the achievable rate of the system is first derived in closed-form. Then the TTS design aiming at the weighted sum rate maximization is considered. To solve the challenging non-convex optimization problem with high-order design variables, i.e., the transmit powers and the phase shifts at the RDARS, a block coordinate descent based method is proposed to find the optimal solutions in semi-closed forms iteratively. Specifically, two efficient algorithms are proposed with provable convergence for the optimal phase shift design, i.e., Riemannian Gradient Ascent based algorithm by exploiting the unit-modulus constraints, and Two-Tier Majorization-Minimization based algorithm with closed-form optimal solutions in each iteration. Simulation results validate the effectiveness of the proposed algorithm and demonstrate the superiority of deploying RDARS in massive MIMO systems to provide substantial rate improvement with a significantly reduced total number of active antennas/RF chains and lower transmit power when compared to the DAS and RIS-aided systems.
Comment: 13 pages, 6 figures
Published: 2023

31. PairingNet: A Learning-based Pair-searching and -matching Network for Image Fragments

Author: Zhou, Rixin, Xia, Ding, Zhang, Yi, Pang, Honglin, Yang, Xi, and Li, Chuntao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: In this paper, we propose a learning-based image fragment pair-searching and -matching approach to solve the challenging restoration problem. Existing works use rule-based methods to match similar contour shapes or textures, whose hyperparameters are difficult to tune for extensive data and which are computationally time-consuming. Therefore, we propose a neural network that can effectively utilize neighbor textures with contour shape information to fundamentally improve performance. First, we employ a graph-based network to extract the local contour and texture features of fragments. Then, for the pair-searching task, we adopt a linear transformer-based module to integrate these local features and use contrastive loss to encode the global features of each fragment. For the pair-matching task, we design a weighted fusion module to dynamically fuse extracted local contour and texture features, and formulate a similarity matrix for each pair of fragments to calculate the matching score and infer the adjacent segment of contours. To faithfully evaluate our proposed network, we created a new image fragment dataset through an algorithm we designed that tears complete images into irregular fragments. The experimental results show that our proposed network achieves excellent pair-searching accuracy, reduces matching errors, and significantly reduces computational time. Details, source code, and data are available in our supplementary material.
Comment: 14 pages, 16 figures, 4 tables
Published: 2023

32. Unraveling Batch Normalization for Realistic Test-Time Adaptation

Author: Su, Zixian, Guo, Jingwei, Yao, Kai, Yang, Xi, Wang, Qiufeng, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: While recent test-time adaptations exhibit efficacy by adjusting batch normalization to narrow domain disparities, their effectiveness diminishes with realistic mini-batches due to inaccurate target estimation. As previous attempts merely introduce source statistics to mitigate this issue, the fundamental problem of inaccurate target estimation still persists, leaving the intrinsic test-time domain shifts unresolved. This paper delves into the problem of mini-batch degradation. By unraveling batch normalization, we discover that the inexact target statistics largely stem from the substantially reduced class diversity in batch. Drawing upon this insight, we introduce a straightforward tool, Test-time Exponential Moving Average (TEMA), to bridge the class diversity gap between training and testing batches. Importantly, our TEMA adaptively extends the scope of typical methods beyond the current batch to incorporate a diverse set of class information, which in turn boosts an accurate target estimation. Built upon this foundation, we further design a novel layer-wise rectification strategy to consistently promote test-time performance. Our proposed method enjoys a unique advantage as it requires neither training nor tuning parameters, offering a truly hassle-free solution. It significantly enhances model robustness against shifted domains and maintains resilience in diverse real-world scenarios with various batch sizes, achieving state-of-the-art performance on several major benchmarks. Code is available at https://github.com/kiwi12138/RealisticTTA.
Comment: Accepted by AAAI 2024
Published: 2023
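
Note on record 32: the abstract names an exponential moving average over test-time batch statistics but gives no update rule. The sketch below shows only a generic EMA of per-feature batch means, to illustrate how statistics from earlier batches can be carried beyond the current one. The helper names, the momentum value, and the toy data are assumptions, and the paper's layer-wise rectification is not shown.

```python
from typing import List, Sequence


def batch_mean(batch: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature mean of a batch given as rows of feature values."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]


def ema_update(running: List[float], current: List[float], momentum: float = 0.1) -> List[float]:
    """Blend the running statistic with the current batch statistic."""
    return [(1.0 - momentum) * r + momentum * c for r, c in zip(running, current)]


# Two tiny batches, each dominated by a different hypothetical class.
running = [0.0]
for batch in ([[1.0], [1.0]], [[3.0], [3.0]]):
    running = ema_update(running, batch_mean(batch), momentum=0.5)
```

After both batches the running mean reflects both of them rather than only the most recent one.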

33. Semantic-aware Data Augmentation for Text-to-image Synthesis

Author: Tan, Zhaorui, Yang, Xi, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Data augmentation has been recently leveraged as an effective regularizer in various vision-language deep neural networks. However, in text-to-image synthesis (T2Isyn), current augmentation wisdom still suffers from the semantic mismatch between augmented paired data. Even worse, semantic collapse may occur when generated images are less semantically constrained. In this paper, we develop a novel Semantic-aware Data Augmentation (SADA) framework dedicated to T2Isyn. In particular, we propose to augment texts in the semantic space via an Implicit Textual Semantic Preserving Augmentation ($ITA$), in conjunction with a specifically designed Image Semantic Regularization Loss ($L_r$) as Generated Image Semantic Conservation, to cope well with semantic mismatch and collapse. As one major contribution, we theoretically show that $ITA$ can certify better text-image consistency while $L_r$ regularizing the semantics of generated images would avoid semantic collapse and enhance image quality. Extensive experiments validate that SADA enhances text-image consistency and improves image quality significantly in T2Isyn models across various backbones. Especially, incorporating SADA during the tuning process of Stable Diffusion models also yields performance improvements.
Comment: Accepted by AAAI24
Published: 2023

34. Generative Large Language Models Are All-purpose Text Analytics Engines: Text-to-text Learning Is All Your Need

Author: Peng, Cheng, Yang, Xi, Chen, Aokun, Yu, Zehao, Smith, Kaleb E, Costa, Anthony B, Flores, Mona G, Bian, Jiang, and Wu, Yonghui
Subjects: Computer Science - Computation and Language
Abstract: Objective: To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning. Methods: We formulated 7 key clinical NLP tasks as text-to-text learning and solved them using one unified generative clinical LLM, GatorTronGPT, developed using GPT-3 architecture and trained with up to 20 billion parameters. We adopted soft prompts (i.e., trainable vectors) with frozen LLM, where the LLM parameters were not updated (i.e., frozen) and only the vectors of soft prompts were updated, known as prompt tuning. We added additional soft prompts as a prefix to the input layer, which were optimized during the prompt tuning. We evaluated the proposed method using 7 clinical NLP tasks and compared them with previous task-specific solutions based on Transformer models. Results and Conclusion: The proposed approach achieved state-of-the-art performance for 5 out of 7 major clinical NLP tasks using one unified generative LLM. Our approach outperformed previous task-specific transformer models by ~3% for concept extraction and 7% for relation extraction applied to social determinants of health, 3.4% for clinical concept normalization, 3.4-10% for clinical abbreviation disambiguation, and 5.5-9% for natural language inference. Our approach also outperformed a previously developed prompt-based machine reading comprehension (MRC) model, GatorTron-MRC, for clinical concept and relation extraction. The proposed approach can deliver the "one model for all" promise from training to deployment using a unified generative LLM.
Published: 2023

35. A Low-Overhead Incorporation-Extrapolation based Few-Shot CSI Feedback Framework for Massive MIMO Systems

Author: Zhou, Binggui, Yang, Xi, Wang, Jintao, Ma, Shaodan, Gao, Feifei, and Yang, Guanghua
Subjects: Computer Science - Information Theory, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Signal Processing
Abstract: Accurate channel state information (CSI) is essential for downlink precoding in frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems with orthogonal frequency-division multiplexing (OFDM). However, obtaining CSI through feedback from the user equipment (UE) becomes challenging with the increasing scale of antennas and subcarriers and leads to extremely high CSI feedback overhead. Deep learning-based methods have emerged for compressing CSI but these methods generally require substantial collected samples and thus pose practical challenges. Moreover, existing deep learning methods also suffer from dramatically growing feedback overhead owing to their focus on full-dimensional CSI feedback. To address these issues, we propose a low-overhead Incorporation-Extrapolation based Few-Shot CSI feedback Framework (IEFSF) for massive MIMO systems. An incorporation-extrapolation scheme for eigenvector-based CSI feedback is proposed to reduce the feedback overhead. Then, to alleviate the necessity of extensive collected samples and enable few-shot CSI feedback, we further propose a knowledge-driven data augmentation (KDDA) method and an artificial intelligence-generated content (AIGC)-based data augmentation method by exploiting the domain knowledge of wireless channels and by exploiting a novel generative model, respectively. Experimental results based on the DeepMIMO dataset demonstrate that the proposed IEFSF significantly reduces CSI feedback overhead by 64 times compared with existing methods while maintaining higher feedback accuracy using only several hundred collected samples.
Comment: 16 pages, 12 figures, 5 tables. Accepted by IEEE Transactions on Wireless Communications
Published: 2023

36. Detecting Voice Cloning Attacks via Timbre Watermarking

Author: Liu, Chang, Zhang, Jie, Zhang, Tianwei, Yang, Xi, Zhang, Weiming, and Yu, Nenghai
Subjects: Computer Science - Sound, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Nowadays, it is common to release audio content to the public. However, with the rise of voice cloning technology, attackers have the potential to easily impersonate a specific person by utilizing his publicly released audio without any permission. Therefore, it becomes significant to detect any potential misuse of the released audio content and protect its timbre from being impersonated. To this end, we introduce a novel concept, "Timbre Watermarking", which embeds watermark information into the target individual's speech, eventually defeating the voice cloning attacks. To ensure the watermark is robust to the voice cloning model's learning process, we design an end-to-end voice cloning-resistant detection framework. The core idea of our solution is to embed and extract the watermark in the frequency domain in a temporally invariant manner. To acquire generalization across different voice cloning attacks, we modulate their shared process and integrate it into our framework as a distortion layer. Experiments demonstrate that the proposed timbre watermarking can defend against different voice cloning attacks, exhibit strong resistance against various adaptive attacks (e.g., reconstruction-based removal attacks, watermark overwriting attacks), and achieve practicality in real-world services such as PaddleSpeech, Voice-Cloning-App, and so-vits-svc. In addition, ablation studies are also conducted to verify the effectiveness of our design. Some audio samples are available at https://timbrewatermarking.github.io/samples.
Comment: NDSS 2024
Published: 2023

37. DIPR: Efficient Point Cloud Registration via Dynamic Iteration

Author: Ai, Yang, Bai, Qiang, Li, Jindong, and Yang, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point cloud registration (PCR) is an essential task in 3D vision. Existing methods achieve increasingly higher accuracy. However, a large proportion of non-overlapping points in point cloud registration consume a lot of computational resources while negatively affecting registration accuracy. To overcome this challenge, we introduce a novel Efficient Point Cloud Registration via Dynamic Iteration framework, DIPR, that makes the neural network interactively focus on overlapping points based on sparser input points. We design global and local registration stages to achieve efficient coarse-to-fine processing. Beyond basic matching modules, we propose the Refined Nodes to narrow down the scope of overlapping points by using adopted density-based clustering to significantly reduce the computation amount. Our SC Classifier serves as an early-exit mechanism to terminate the registration process in time according to matching accuracy. Extensive experiments on multiple datasets show that our proposed approach achieves superior registration accuracy while significantly reducing computational time and GPU memory consumption compared to state-of-the-art methods.
Published: 2023

38. Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Author: Yang, Xi, He, Chenhang, Ma, Jianqi, and Zhang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process has randomness, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.
Published: 2023

39. Feasibility study of a hard x-ray FEL oscillator at 3 to 4 GeV based on harmonic lasing and transverse gradient undulator

Author: Yu, Li Hua, Smaluk, Victor, Shaftan, Timur, Tiwari, Ganesh, and Yang, Xi
Subjects: Physics - Accelerator Physics
Abstract: We studied the feasibility of a hard x-ray FEL oscillator (XFELO) based on a 3 to 4 GeV storage ring considered for the low-emittance upgrade of NSLS-II. We present a more detailed derivation of a formula for the small-gain gain calculation for 3 GeV XFELO published in the proceedings of IPAC'21 [1]. We modified the small-signal low-gain formula developed by K.J. Kim et al. [4-6] so that the gain can be derived without taking the "no focusing approximation" and a strong focusing can be applied. In this formula, the gain is cast in the form of a product of two factors with one of them depending only on the harmonic number, undulator period, and gap. Using this factor, we show that it is favorable to use harmonic lasing to achieve hard x-ray FEL working in the small-signal low-gain regime with the medium-energy electron beam (3-4 GeV). Our formula also allows FEL optimization by varying the vertical gradient of the undulator, the vertical dispersion, and the horizontal and vertical focusing, independently. Since a quite high peak current is required for the FEL, the collective effects of beam dynamics in medium-energy synchrotrons significantly affect the electron beam parameters. We carried out a multiple-parameter optimization taking collective effects into account and the result indicates the XFELO is feasible for storage ring energy as low as 3 GeV, with local correction of betatron coupling.
Published: 2023

40. A fast normal splitting preconditioner for attractive coupled nonlinear Schrödinger equations with fractional Laplacian

Author: Cheng, Yan and Yang, Xi
Subjects: Mathematics - Numerical Analysis
Abstract: A linearly implicit conservative difference scheme is applied to discretize the attractive coupled nonlinear Schrödinger equations with fractional Laplacian. Complex symmetric linear systems can be obtained, and the system matrices are indefinite and Toeplitz-plus-diagonal. Neither efficient preconditioned iteration method nor fast direct method is available to deal with these systems. In this paper, we propose a novel matrix splitting iteration method based on a normal splitting of an equivalent real block form of the complex linear systems. This new iteration method converges unconditionally, and the quasi-optimal iteration parameter is deduced. The corresponding new preconditioner is obtained naturally, which can be constructed easily and implemented efficiently by fast Fourier transform. Theoretical analysis indicates that the eigenvalues of the preconditioned system matrix are tightly clustered. Numerical experiments show that the new preconditioner can significantly accelerate the convergence rate of the Krylov subspace iteration methods. Specifically, the convergence behavior of the related preconditioned GMRES iteration method is spatial mesh-size-independent, and almost fractional order insensitive. Moreover, the linearly implicit conservative difference scheme in conjunction with the preconditioned GMRES iteration method conserves the discrete mass and energy in terms of a given precision.
Published: 2023

41. On the Impact of Cross-Domain Data on German Language Models

Author: Dada, Amin, Chen, Aokun, Peng, Cheng, Smith, Kaleb E, Idrissi-Yaghir, Ahmad, Seibold, Constantin Marc, Li, Jianning, Heiliger, Lars, Yang, Xi, Friedrich, Christoph M., Truhn, Daniel, Egger, Jan, Bian, Jiang, Kleesiek, Jens, and Wu, Yonghui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to 4.45% over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
Comment: 13 pages, 1 figure, accepted at Findings of the Association for Computational Linguistics: EMNLP 2023
Published: 2023

42. Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction

Author: Peng, Cheng, Yang, Xi, Smith, Kaleb E, Yu, Zehao, Chen, Aokun, Bian, Jiang, and Wu, Yonghui
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Objective: To develop soft prompt-based learning algorithms for large language models (LLMs), examine the shape of prompts, prompt-tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities. Methods: We developed a soft prompt-based LLM model and compared 4 training strategies including (1) fine-tuning without prompts; (2) hard-prompt with unfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting. We also assessed the few-shot learning ability. Results and Conclusion: When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. When LLMs are frozen, small (i.e., 345 million parameters) LLMs have a big gap to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institute evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot learning ability and transfer learning ability to facilitate multi-institution applications, and (3) frozen LLMs require large models.
Published: 2023

43. IPMix: Label-Preserving Data Augmentation Method for Training Robust Classifiers

Author: Huang, Zhenglin, Bao, Xiaoan, Zhang, Na, Zhang, Qingqi, Tu, Xiaomei, Wu, Biao, and Yang, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Data augmentation has been proven effective for training high-accuracy convolutional neural network classifiers by preventing overfitting. However, building deep neural networks in real-world scenarios requires not only high accuracy on clean data but also robustness when data distributions shift. While prior methods have proposed that there is a trade-off between accuracy and robustness, we propose IPMix, a simple data augmentation approach to improve robustness without hurting clean accuracy. IPMix integrates three levels of data augmentation (image-level, patch-level, and pixel-level) into a coherent and label-preserving technique to increase the diversity of training data with limited computational overhead. To further improve the robustness, IPMix introduces structural complexity at different levels to generate more diverse images and adopts the random mixing method for multi-scale information fusion. Experiments demonstrate that IPMix outperforms state-of-the-art corruption robustness on CIFAR-C and ImageNet-C. In addition, we show that IPMix also significantly improves the other safety measures, including robustness to adversarial perturbations, calibration, prediction consistency, and anomaly detection, achieving state-of-the-art or comparable results on several benchmarks, including ImageNet-R, ImageNet-A, and ImageNet-O.
Comment: NeurIPS 2023
Published: 2023

44. Diagonal and normal with Toeplitz-block splitting iteration method for space fractional coupled nonlinear Schrödinger equations with repulsive nonlinearities

Author: Zhang, Fei-Yan, Yang, Xi, and Chen, Chao
Subjects: Mathematics - Numerical Analysis
Abstract: By applying the linearly implicit conservative difference scheme proposed in [D.-L. Wang, A.-G. Xiao, W. Yang. J. Comput. Phys. 2014;272:670-681], the system of repulsive space fractional coupled nonlinear Schrödinger equations leads to a sequence of linear systems with complex symmetric and Toeplitz-plus-diagonal structure. In this paper, we propose the diagonal and normal with Toeplitz-block splitting iteration method to solve the above linear systems. The new iteration method is proved to converge unconditionally, and the optimal iteration parameter is deduced. Naturally, this new iteration method leads to a diagonal and normal with circulant-block preconditioner which can be executed efficiently by fast algorithms. In theory, we provide sharp bounds for the eigenvalues of the discrete fractional Laplacian and its circulant approximation, and further analysis indicates that the spectral distribution of the preconditioned system matrix is tight. Numerical experiments show that the new preconditioner can significantly improve the computational efficiency of the Krylov subspace iteration methods. Moreover, the behavior of the corresponding preconditioned GMRES method exhibits a linear dependence on the space mesh size, which weakens as the fractional order parameter decreases.
Published: 2023

45. Gradient constrained sharpness-aware prompt learning for vision-language models

Author: Liu, Liangchen, Wang, Nannan, Zhou, Dawei, Gao, Xinbo, Liu, Decheng, Yang, Xi, and Liu, Tongliang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLM), i.e., improving the performance on unseen classes while maintaining the performance on seen classes. Compared with existing generalizable methods that neglect the degradation on seen classes, the setting of this problem is stricter and fits more closely with practical applications. To solve this problem, we start from the optimization perspective, and leverage the relationship between loss landscape geometry and model generalization ability. By analyzing the loss landscapes of the state-of-the-art method and the vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates with both loss value and loss sharpness, and that each of them is indispensable. However, we find the optimizing gradient of existing methods cannot maintain high relevance to both loss value and loss sharpness during optimization, which severely affects their trade-off performance. To this end, we propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp), to dynamically constrain the optimizing gradient, thus achieving the above two-fold optimization objective simultaneously. Extensive experiments verify the effectiveness of GCSCoOp in the trade-off problem.
Comment: 19 pages, 11 figures
Published: 2023

46. AI Mobile Application for Archaeological Dating of Bronze Dings

Author: Li, Chuntao, Qi, Ruihua, Tang, Chuan, Wei, Jiafu, Yang, Xi, Zhang, Qian, and Zhou, Rixin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We develop an AI application for archaeological dating of bronze Dings. A classification model is employed to predict the period of the input Ding, and a detection model is used to show the feature parts that inform the dating decision. To train the two deep learning models, we collected a large number of Ding images from published materials, and archaeological experts annotated the period and the feature parts on each image. Furthermore, we design a user system and deploy our pre-trained models on the WeChat Mini Program platform for ease of use. With only a smartphone that has the WeChat app installed, users can easily obtain the result of intelligent archaeological dating, the feature parts, and other reference artifacts by taking a photo of a bronze Ding. To use our application, please scan this QR code by WeChat.
Published: 2023

47. A Benchmark for Chinese-English Scene Text Image Super-resolution

Author: Ma, Jianqi, Liang, Zhetong, Xiang, Wangmeng, Yang, Xi, and Zhang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have relatively simple character structures, while little work has been done on the more challenging Chinese texts with diverse and complex character structures. In this paper, we propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with the emphasis on restoring structurally complex Chinese characters. The benchmark provides 1,935/783 real-world LR-HR text image pairs (containing 33,789 text lines in total) for training/testing in 2× and 4× zooming modes, complemented by detailed annotations, including detection boxes and text transcripts. Moreover, we design an edge-aware learning method, which provides structural supervision in image and feature domains, to effectively reconstruct the dense structures of Chinese characters. We conduct experiments on the proposed Real-CE benchmark and evaluate the existing STISR models with and without our edge-aware loss. The benchmark, including data and source code, is available at https://github.com/mjq11302010044/Real-CE.
Comment: Accepted by ICCV2023
Published: 2023

48. The KiTS21 Challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase CT

Author: Heller, Nicholas, Isensee, Fabian, Trofimova, Dasha, Tejpaul, Resha, Zhao, Zhongchen, Chen, Huai, Wang, Lisheng, Golts, Alex, Khapun, Daniel, Shats, Daniel, Shoshan, Yoel, Gilboa-Solomon, Flora, George, Yasmeen, Yang, Xi, Zhang, Jianpeng, Zhang, Jing, Xia, Yong, Wu, Mengran, Liu, Zhiyang, Walczak, Ed, McSweeney, Sean, Vasdev, Ranveer, Hornung, Chris, Solaiman, Rafat, Schoephoerster, Jamee, Abernathy, Bailey, Wu, David, Abdulkadir, Safa, Byun, Ben, Spriggs, Justice, Struyk, Griffin, Austin, Alexandra, Simpson, Ben, Hagstrom, Michael, Virnig, Sierra, French, John, Venkatesh, Nitin, Chan, Sarah, Moore, Keenan, Jacobsen, Anna, Austin, Susan, Austin, Mark, Regmi, Subodh, Papanikolopoulos, Nikolaos, and Weight, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: This paper presents the challenge report for the 2021 Kidney and Kidney Tumor Segmentation Challenge (KiTS21) held in conjunction with the 2021 international conference on Medical Image Computing and Computer Assisted Interventions (MICCAI). KiTS21 is a sequel to its first edition in 2019, and it features a variety of innovations in how the challenge was designed, in addition to a larger dataset. A novel annotation method was used to collect three separate annotations for each region of interest, and these annotations were performed in a fully transparent setting using a web-based annotation tool. Further, the KiTS21 test set was collected from an outside institution, challenging participants to develop methods that generalize well to new populations. Nonetheless, the top-performing teams achieved a significant improvement over the state of the art set in 2019, and this performance is shown to inch ever closer to human-level performance. An in-depth meta-analysis is presented describing which methods were used and how they fared on the leaderboard, as well as the characteristics of which cases generally saw good performance, and which did not. Overall, KiTS21 facilitated a significant advancement in the state of the art in kidney tumor segmentation, and provides useful insights that are applicable to the field of semantic segmentation as a whole.
Comment: 34 pages, 12 figures
Published: 2023

49. A Simple and Effective Baseline for Attentional Generative Adversarial Networks

Author: Jin, Mingyu, Zhang, Chong, Yu, Qinkai, Xue, Haochen, Jin, Xiaobo, and Yang, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Synthesising high-quality images with a text-to-image model by guiding the generative model through the text description is an innovative and challenging task. In recent years, several models have been proposed: AttnGAN, which uses the attention mechanism to guide GAN training; SD-GAN, which adopts a self-distillation technique to improve the performance of the generator and the quality of image generation; and Stack-GAN++, which gradually improves the details and quality of the image by stacking multiple generators and discriminators. However, this series of improvements to GAN all have redundancy to a certain extent, which affects the generation performance and complexity. We use the popular simple and effective idea (1) to remove redundant structure and improve the backbone network of AttnGAN, and (2) to integrate and reconstruct multiple losses of DAMSM. Our improvements have significantly improved the model size and training efficiency while ensuring that the model's performance is unchanged, and we finally propose our SEAttnGAN. Code is available at https://github.com/jmyissb/SEAttnGAN.
Comment: 12 pages, 3 figures
Published: 2023

50. SaliencyCut: Augmenting Plausible Anomalies for Anomaly Detection

Author: Ye, Jianan, Hu, Yijie, Yang, Xi, Wang, Qiu-Feng, Huang, Chao, and Huang, Kaizhu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Anomaly detection under the open-set scenario is a challenging task that requires learning discriminative fine-grained features to detect anomalies, including those unseen during training. As a cheap yet effective approach, data augmentation has been widely used to create pseudo anomalies for better training of such models. Recent wisdom of augmentation methods focuses on generating random pseudo instances that may lead to a mixture of augmented instances with seen anomalies, or out of the typical range of anomalies. To address this issue, we propose a novel saliency-guided data augmentation method, SaliencyCut, to produce pseudo but more common anomalies which tend to stay in the plausible range of anomalies. Furthermore, we deploy a two-head learning strategy consisting of normal and anomaly learning heads, to learn the anomaly score of each sample. Theoretical analyses show that this mechanism offers a more tractable and tighter lower bound of the data log-likelihood. We then design a novel patch-wise residual module in the anomaly learning head to extract and assess the fine-grained anomaly features from each sample, facilitating the learning of discriminative representations of anomaly instances. Extensive experiments conducted on six real-world anomaly detection datasets demonstrate the superiority of our method to competing methods under various settings.
Published: 2023
