Author: "Duan, Lixin" / Database: OAIster - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Duan, Lixin"' showing total 39 results

1. SSR: SAM is a Strong Regularizer for domain adaptive semantic segmentation

Author: Ge, Yanqi, Huang, Ye, Li, Wen, and Duan, Lixin
Abstract: We introduced SSR, which utilizes SAM (segment-anything) as a strong regularizer during training, to greatly enhance the robustness of the image encoder for handling various domains. Specifically, given the fact that SAM is pre-trained with a large number of images over the internet, which cover a diverse variety of domains, the feature encoding extracted by the SAM is obviously less dependent on specific domains when compared to the traditional ImageNet pre-trained image encoder. Meanwhile, the ImageNet pre-trained image encoder is still a mature choice of backbone for the semantic segmentation task, especially when the SAM is category-irrelevant. As a result, our SSR provides a simple yet highly effective design. It uses the ImageNet pre-trained image encoder as the backbone, and the intermediate feature of each stage (i.e., there are 4 stages in MiT-B5) is regularized by SAM during training. After extensive experimentation on GTA5 → Cityscapes, our SSR significantly improved performance over the baseline without introducing any extra inference overhead.
Published: 2024

2. Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition

Author: Li, Xunsong, Sun, Pengzhan, Liu, Yangcen, Duan, Lixin, and Li, Wen
Abstract: The interactions between humans and objects are important for recognizing object-centric actions. Existing methods usually adopt a two-stage pipeline, where object proposals are first detected using a pretrained detector, and then are fed to an action recognition model for extracting video features and learning the object relations for action recognition. However, since the action prior is unknown in the object detection stage, important objects could be easily overlooked, leading to inferior action recognition performance. In this paper, we propose an end-to-end object-centric action recognition framework that simultaneously performs Detection And Interaction Reasoning in one stage. Particularly, after extracting video features with a base network, we create three modules for concurrent object detection and interaction reasoning. First, a Patch-based Object Decoder generates proposals from video patch tokens. Then, an Interactive Object Refining and Aggregation module identifies important objects for action recognition, adjusts proposal scores based on position and appearance, and aggregates object-level information into a global video representation. Lastly, an Object Relation Modeling module encodes object relations. These three modules together with the video feature extractor can be trained jointly in an end-to-end fashion, thus avoiding the heavy reliance on an off-the-shelf object detector, and reducing the multi-stage training burden. We conduct experiments on two datasets, Something-Else and Ikea-Assembly, to evaluate the performance of our proposed approach on conventional, compositional, and few-shot action recognition tasks. Through in-depth experimental analysis, we show the crucial role of interactive objects in learning for action recognition, and we can outperform state-of-the-art methods on both datasets.
Comment: 12 pages, 5 figures, submitted to IEEE Transactions on Multimedia
Published: 2024

3. Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

Author: Ge, Yanqi, Liu, Jiaqi, Fan, Qingnan, Jiang, Xi, Huang, Ye, Qin, Shuai, Gu, Hong, Li, Wen, and Duan, Lixin
Abstract: In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is consistent structure preservation while enabling effective style transfer effects. Past approaches in this field directly concatenate the content and style prompts for a prompt-level style injection, leading to unavoidable structure distortions. In this work, we propose a novel solution to the text-driven style transfer task, namely, Adaptive Style Incorporation (ASI), to achieve fine-grained feature-level style incorporation. It consists of the Siamese Cross-Attention (SiCA) to decouple the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module to couple the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
Published: 2024

4. High-level Feature Guided Decoding for Semantic Segmentation

Author: Huang, Ye, Kang, Di, Gao, Shenghua, Li, Wen, and Duan, Lixin
Abstract: Existing pyramid-based upsamplers (e.g. SemanticFPN), although efficient, usually produce less accurate results compared to dilation-based models when using the same backbone. This is partially caused by the contaminated high-level features since they are fused and fine-tuned with noisy low-level features on limited data. To address this issue, we propose to use powerful pre-trained high-level features as guidance (HFG) so that the upsampler can produce robust results. Specifically, only the high-level features from the backbone are used to train the class tokens, which are then reused by the upsampler for classification, guiding the upsampler features to more discriminative backbone features. One crucial design of the HFG is to protect the high-level features from being contaminated by using proper stop-gradient operations so that the backbone does not update according to the noisy gradient from the upsampler. To push the upper limit of HFG, we introduce a context augmentation encoder (CAE) that can efficiently and effectively operate on the low-resolution high-level feature, resulting in improved representation and thus better guidance. We name our complete solution the High-Level Features Guided Decoder (HFGD). We evaluate the proposed HFGD on three benchmarks: Pascal Context, COCOStuff164k, and Cityscapes. HFGD achieves state-of-the-art results among methods that do not use extra training data, demonstrating its effectiveness and generalization ability.
Comment: Revised version, refactored presentation and added more experiments
Published: 2023

5. CARD: Semantic Segmentation with Efficient Class-Aware Regularized Decoder

Author: Huang, Ye, Kang, Di, Chen, Liang, Jia, Wenjing, He, Xiangjian, Duan, Lixin, Zhe, Xuefei, and Bao, Linchao
Abstract: Semantic segmentation has recently achieved notable advances by exploiting "class-level" contextual information during learning. However, these approaches simply concatenate class-level information to pixel features to boost the pixel representation learning, which cannot fully utilize intra-class and inter-class contextual information. Moreover, these approaches learn soft class centers based on coarse mask prediction, which is prone to error accumulation. To better exploit class-level information, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Moreover, we design a dedicated decoder for CAR (CARD), which consists of a novel spatial token mixer and an upsampling module, to maximize its gain for existing baselines while being highly efficient in terms of computational cost. Specifically, CAR consists of three novel loss functions. The first loss function encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class center in our approach is directly generated from ground truth instead of from the error-prone coarse prediction. CAR can be directly applied to most existing segmentation models during training, and can largely improve their accuracy at no additional inference overhead. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability. CARD outperforms SOTA approaches on multiple benchmarks with a highly efficient architecture.
Comment: Tech report, text extended from arXiv:2203.07160
Published: 2023
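
The three CAR loss terms named in the abstract of record 5 can be illustrated with a minimal NumPy sketch. The abstract only states what each term encourages; the squared-distance compactness term, the hinge with a `margin` parameter, and the function names `class_centers` and `car_style_losses` are assumptions made here for illustration, not the authors' formulation.

```python
import numpy as np

def class_centers(features, labels, num_classes):
    """Mean feature vector per class, computed from ground-truth labels.

    Assumes every class appears at least once; absent classes keep a zero center.
    """
    centers = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            centers[c] = features[mask].mean(axis=0)
    return centers

def car_style_losses(features, labels, num_classes, margin=1.0):
    """Return (intra, inter_center, inter_pixel) losses; forms are illustrative."""
    centers = class_centers(features, labels, num_classes)

    # (1) Intra-class compactness: pull each pixel feature toward its own class center.
    intra = np.mean(np.sum((features - centers[labels]) ** 2, axis=1))

    # (2) Center-to-center separation: hinge penalty when two class centers
    #     are closer than `margin` (hypothetical choice of penalty).
    center_dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    off_diag = ~np.eye(num_classes, dtype=bool)
    inter_center = np.mean(np.maximum(0.0, margin - center_dist[off_diag]))

    # (3) Pixel-to-other-center separation: hinge penalty when a pixel is
    #     closer than `margin` to the center of a class it does not belong to.
    pixel_dist = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
    other = np.ones_like(pixel_dist, dtype=bool)
    other[np.arange(len(labels)), labels] = False
    inter_pixel = np.mean(np.maximum(0.0, margin - pixel_dist[other]))

    return intra, inter_center, inter_pixel

# Two overlapping classes on a line: every term is non-zero.
feats = np.array([[0.0, 0.0], [2.0, 0.0], [0.5, 0.0], [1.5, 0.0]])
labs = np.array([0, 0, 1, 1])
intra, inter_center, inter_pixel = car_style_losses(feats, labs, num_classes=2)
```

Because the centers come from ground-truth labels rather than from a coarse prediction, the sketch mirrors the abstract's point that class centers need not depend on an error-prone intermediate mask.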

6. Deep Cross-Attention Network for Crowdfunding Success Prediction

Author: Tang, Zhe, Yang, Yi, Li, Wei, Lian, Defu, and Duan, Lixin
Abstract: Crowdfunding creates opportunities for entrepreneurs. It allows startup companies to reach a large audience for fundraising and bring their creative ideas to life. In this work, we are concerned with the crowdfunding project success prediction problem, i.e., to predict whether a project will successfully reach its funding goal by using its project profiles. This is important for startup companies to refine their project profiles and achieve their goals. Crowdfunding project success prediction is a typical classification problem but with a few critical challenges. On the one hand, with only coarse-grained project status as weak supervision, it is hard for a deep learning network to learn the relationship between project profiles and explain why it makes this prediction. On the other hand, on the project homepage, there are various modalities of description, including metadata, textual description, images, and videos. Among those, videos play an important role in the success of a crowdfunding project but were ignored in previous work due to the difficulty in extracting useful semantic and authentic information from videos, especially for crowdfunding projects where information in different modalities is unaligned. To this end, we propose a novel framework called Deep Cross-Attention Network to learn and fuse information from introduction videos and textual descriptions of project profiles. More specifically, we develop a cross-attention block to align and represent mismatched textual descriptions and untrimmed introduction videos and fuse the information from these two modalities, which effectively remedies the lack of supervised information caused by project status as weak supervision. More importantly, with our cross-attention mechanism, the model is able to interpret how it makes such predictions and show which keywords and keyframes it depends on.
We conduct extensive experiments on two crowdfunding datasets (collected from Kickstarter and Indiegogo) and
Published: 2023

7. Beyond Prototypes: Semantic Anchor Regularization for Better Representation Learning

Author: Ge, Yanqi, Nie, Qiang, Huang, Ye, Liu, Yong, Wang, Chengjie, Zheng, Feng, Li, Wen, and Duan, Lixin
Abstract: One of the ultimate goals of representation learning is to achieve compactness within a class and good separability between classes. Many outstanding metric-based and prototype-based methods following the Expectation-Maximization paradigm have been proposed for this objective. However, they inevitably introduce biases into the learning process, particularly with long-tail distributed training data. In this paper, we reveal that the class prototype need not be derived from training features and propose a novel perspective that uses pre-defined class anchors serving as feature centroids to unidirectionally guide feature learning. However, the pre-defined anchors may have a large semantic distance from the pixel features, which prevents them from being directly applied. To address this issue and generate feature centroids independent from feature learning, a simple yet effective Semantic Anchor Regularization (SAR) is proposed. SAR ensures the inter-class separability of semantic anchors in the semantic space by employing a classifier-aware auxiliary cross-entropy loss during training via disentanglement learning. By pulling the learned features to these semantic anchors, several advantages can be attained: 1) intra-class compactness and, naturally, inter-class separability, 2) induced bias or errors from feature learning can be avoided, and 3) robustness to the long-tailed problem. The proposed SAR can be used in a plug-and-play manner in existing models. Extensive experiments demonstrate that SAR performs better than previous sophisticated prototype-based methods. The implementation is available at https://github.com/geyanqi/SAR.
Comment: AAAI 2024
Published: 2023

8. Multi-modal Instance Refinement for Cross-domain Action Recognition

Author: Qing, Yuan, Wu, Naixing, Wan, Shaohua, and Duan, Lixin
Abstract: Unsupervised cross-domain action recognition aims at adapting the model trained on an existing labeled source domain to a new unlabeled target domain. Most existing methods solve the task by directly aligning the feature distributions of source and target domains. However, this would cause negative transfer during domain adaptation due to some negative training samples in both domains. In the source domain, some training samples are of low relevance to the target domain due to the difference in viewpoints, action styles, etc. In the target domain, there are some ambiguous training samples that can be easily classified as another type of action under the source domain. The problem of negative transfer has been explored in cross-domain object detection, while it remains under-explored in cross-domain action recognition. Therefore, we propose a Multi-modal Instance Refinement (MMIR) method to alleviate the negative transfer based on reinforcement learning. Specifically, a reinforcement learning agent is trained in both domains for every modality to refine the training data by filtering out negative samples from each domain. Our method outperforms several other state-of-the-art baselines in cross-domain action recognition on the benchmark EPIC-Kitchens dataset, which demonstrates the advantage of MMIR in reducing negative transfer.
Comment: Accepted by PRCV 2023
Published: 2023

9. Learning Motion Refinement for Unsupervised Face Animation

Author: Tao, Jiale, Gu, Shuhang, Li, Wen, and Duan, Lixin
Abstract: Unsupervised face animation aims to generate a human face video based on the appearance of a source image, mimicking the motion from a driving video. Existing methods typically adopt a prior-based motion model (e.g., the local affine motion model or the local thin-plate-spline motion model). While such a model is able to capture the coarse facial motion, artifacts can often be observed around the tiny motion in local areas (e.g., lips and eyes), due to the limited ability of these methods to model the finer facial motions. In this work, we design a new unsupervised face animation approach to learn simultaneously the coarse and finer motions. In particular, while exploiting the local affine motion model to learn the global coarse facial motion, we design a novel motion refinement module to compensate for the local affine motion model for modeling finer face motions in local areas. The motion refinement is learned from the dense correlation between the source and driving images. Specifically, we first construct a structure correlation volume based on the keypoint features of the source and driving images. Then, we train a model to generate the tiny facial motions iteratively from low to high resolution. The learned motion refinements are combined with the coarse motion to generate the new image. Extensive experiments on widely used benchmarks demonstrate that our method achieves the best results among state-of-the-art baselines.
Comment: NeurIPS 2023
Published: 2023

10. Deep Cross-Attention Network for Crowdfunding Success Prediction

Author: Tang, Zhe, Yang, Yi, Li, Wei, Lian, Defu, and Duan, Lixin
Abstract: Crowdfunding creates opportunities for entrepreneurs. It allows startup companies to reach a large audience for fundraising and bring their creative ideas to life. In this work, we are interested in the crowdfunding project success prediction problem, i.e., to predict whether a project will successfully reach its funding goal by using its project profiles. This is important for startup companies to refine their project profiles and achieve their goals. Crowdfunding project success prediction is a typical classification problem but with a few critical challenges. On the one hand, with only coarse-grained project status as weak supervision, it is hard for a deep learning network to learn the relationship between project profiles and explain why it makes this prediction. On the other hand, on the project homepage, there are various modalities of description, including metadata, textual description, images, and videos. Among those, videos play an important role in the success of a crowdfunding project but were ignored in previous work due to the difficulty in extracting useful semantic and authentic information from videos, especially for crowdfunding projects where information in different modalities is unaligned. To this end, we propose a novel framework called Deep Cross-Attention Network to learn and fuse information from introduction videos and textual descriptions of project profiles. More specifically, we develop a cross-attention block to align and represent mismatched textual descriptions and untrimmed introduction videos and fuse the information from these two modalities, which effectively remedies the lack of supervised information caused by project status as weak supervision. More importantly, with our cross-attention mechanism, the model is able to interpret how it makes such predictions and show which keywords and keyframes it depends on.
We conduct extensive experiments on two crowdfunding datasets (collected from Kickstarter and Indiegogo) and s
Published: 2022

11. Minimizing Maximum Model Discrepancy for Transferable Black-box Targeted Attacks

Author: Zhao, Anqi, Chu, Tong, Liu, Yahao, Li, Wen, Li, Jingjing, and Duan, Lixin
Abstract: In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy (M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. Our code will be released.
Published: 2022

12. Multi-rater Prism: Learning self-calibrated medical image segmentation from multiple raters

Author: Wu, Junde, Fang, Huihui, Yang, Yehui, Liu, Yuanpei, Gao, Jing, Duan, Lixin, Yang, Weihua, and Xu, Yanwu
Abstract: In medical image segmentation, it is often necessary to collect opinions from multiple experts to make the final decision. This clinical routine helps to mitigate individual bias. But when data is multiply annotated, standard deep learning models are often not applicable. In this paper, we propose a novel neural network framework, called Multi-Rater Prism (MrPrism) to learn the medical image segmentation from multiple labels. Inspired by the iterative half-quadratic optimization, the proposed MrPrism will combine the multi-rater confidences assignment task and calibrated segmentation task in a recurrent manner. In this recurrent process, MrPrism can learn inter-observer variability taking into account the image semantic properties, and finally converges to a self-calibrated segmentation result reflecting the inter-observer agreement. Specifically, we propose Converging Prism (ConP) and Diverging Prism (DivP) to process the two tasks iteratively. ConP learns calibrated segmentation based on the multi-rater confidence maps estimated by DivP. DivP generates multi-rater confidence maps based on the segmentation masks estimated by ConP. The experimental results show that by recurrently running ConP and DivP, the two tasks can achieve mutual improvement. The final converged segmentation result of MrPrism outperforms state-of-the-art (SOTA) strategies on a wide range of medical image segmentation tasks.
Published: 2022

13. Motion Transformer for Unsupervised Image Animation

Author: Tao, Jiale, Wang, Biao, Ge, Tiezheng, Jiang, Yuning, Li, Wen, and Duan, Lixin
Abstract: Image animation aims to animate a source image by using motion learned from a driving video. Current state-of-the-art methods typically use convolutional neural networks (CNNs) to predict motion information, such as motion keypoints and corresponding local transformations. However, these CNN-based methods do not explicitly model the interactions between motions; as a result, the important underlying motion relationship may be neglected, which can potentially lead to noticeable artifacts being produced in the generated animation video. To this end, we propose a new method, the motion transformer, which is the first attempt to build a motion estimator based on a vision transformer. More specifically, we introduce two types of tokens in our proposed method: i) image tokens formed from patch features and corresponding position encoding; and ii) motion tokens encoded with motion information. Both types of tokens are sent into vision transformers to promote underlying interactions between them through multi-head self-attention blocks. By adopting this process, the motion information can be better learned to boost the model performance. The final embedded motion tokens are then used to predict the corresponding motion keypoints and local transformations. Extensive experiments on benchmark datasets show that our proposed method achieves promising results compared to the state-of-the-art baselines. Our source code will be publicly available.
Published: 2022

14. Motion and Appearance Adaptation for Cross-Domain Motion Transfer

Author: Xu, Borun, Wang, Biao, Deng, Jinhong, Tao, Jiale, Ge, Tiezheng, Jiang, Yuning, Li, Wen, and Duan, Lixin
Abstract: Motion transfer aims to transfer the motion of a driving video to a source image. When there are considerable differences between the object in the driving video and that in the source image, traditional single-domain motion transfer approaches often produce notable artifacts; for example, the synthesized image may fail to preserve the human shape of the source image (cf. Fig. 1(a)). To address this issue, in this work, we propose a Motion and Appearance Adaptation (MAA) approach for cross-domain motion transfer, in which we regularize the object in the synthesized image to capture the motion of the object in the driving frame, while still preserving the shape and appearance of the object in the source image. On one hand, considering the object shapes of the synthesized image and the driving frame might be different, we design a shape-invariant motion adaptation module that enforces the consistency of the angles of object parts in two images to capture the motion information. On the other hand, we introduce a structure-guided appearance consistency module designed to regularize the similarity between the corresponding patches of the synthesized image and the source image without affecting the learned motion in the synthesized image. Our proposed MAA model can be trained in an end-to-end manner with a cyclic reconstruction loss, and ultimately produces a satisfactory motion transfer result (cf. Fig. 1(b)). We conduct extensive experiments on human dancing dataset Mixamo-Video to Fashion-Video and human face dataset Vox-Celeb to Cufs; on both of these, our MAA model outperforms existing methods both quantitatively and qualitatively.
Comment: fix bugs
Published: 2022

15. Calibrate the inter-observer segmentation uncertainty via diagnosis-first principle

Author: Wu, Junde, Fang, Huihui, Xiong, Hoayi, Duan, Lixin, Tan, Mingkui, Yang, Weihua, Liu, Huiying, and Xu, Yanwu
Abstract: In medical images, many of the tissues/lesions may be ambiguous. That is why medical segmentation is typically annotated by a group of clinical experts to mitigate personal bias. However, this clinical routine also brings new challenges to the application of machine learning algorithms. Without a definite ground-truth, it is difficult to train and evaluate deep learning models. When the annotations are collected from different graders, a common choice is majority vote. However, such a strategy ignores the differences in grader expertise. In this paper, we consider the task of predicting the segmentation with the calibrated inter-observer uncertainty. We note that in clinical practice, medical image segmentation is usually used to assist disease diagnosis. Inspired by this observation, we propose the diagnosis-first principle, which is to take disease diagnosis as the criterion to calibrate the inter-observer segmentation uncertainty. Following this idea, a framework named Diagnosis First segmentation Framework (DiFF) is proposed to estimate diagnosis-first segmentation from the raw images. Specifically, DiFF will first learn to fuse the multi-rater segmentation labels into a single ground-truth that maximizes the disease diagnosis performance. We dub the fused ground-truth the Diagnosis First Ground-truth (DF-GT). Then, we further propose the Take and Give Model to segment DF-GT from the raw image. We verify the effectiveness of DiFF on three different medical segmentation tasks: OD/OC segmentation on fundus images, thyroid nodule segmentation on ultrasound images, and skin lesion segmentation on dermoscopic images. Experimental results show that the proposed DiFF is able to significantly facilitate the corresponding disease diagnosis and outperforms previous state-of-the-art multi-rater learning methods.
Comment: arXiv admin note: text overlap with arXiv:2202.06505
Published: 2022
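
The majority-vote baseline that the abstract of record 15 contrasts itself with can be sketched in a few lines of NumPy. The weighted variant and its per-rater weights are hypothetical, added only to show how treating graders unequally changes the fused mask; neither function is the DiFF method, whose fusion is learned from diagnosis performance.

```python
import numpy as np

def majority_vote(masks):
    """Fuse binary masks of shape (raters, H, W): a pixel is foreground
    when strictly more than half of the raters marked it."""
    masks = np.asarray(masks)
    return masks.sum(axis=0) * 2 > masks.shape[0]

def weighted_vote(masks, weights):
    """Same idea with per-rater weights (hypothetical expertise scores):
    a pixel is foreground when the weighted votes exceed half the total weight."""
    masks = np.asarray(masks, dtype=float)
    weights = np.asarray(weights, dtype=float)
    votes = np.tensordot(weights, masks, axes=1)  # shape (H, W)
    return votes > 0.5 * weights.sum()

# Three raters annotating a 1x3 image.
rater_masks = np.array([
    [[1, 0, 0]],
    [[1, 1, 0]],
    [[0, 1, 0]],
])
fused_equal = majority_vote(rater_masks)
fused_weighted = weighted_vote(rater_masks, weights=[0.6, 0.2, 0.2])
```

With equal votes the first two pixels are foreground; once the first rater is trusted more, the second pixel drops out, which is the kind of rater-dependent effect that plain majority vote cannot express.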

16. Undoing the Damage of Label Shift for Cross-domain Semantic Segmentation

Author: Liu, Yahao, Deng, Jinhong, Tao, Jiale, Chu, Tong, Duan, Lixin, and Li, Wen
Abstract: Existing works typically treat cross-domain semantic segmentation (CDSS) as a data distribution mismatch problem and focus on aligning the marginal distribution or conditional distribution. However, the label shift issue is unfortunately overlooked, which actually commonly exists in the CDSS task, and often causes a classifier bias in the learnt model. In this paper, we give an in-depth analysis and show that the damage of label shift can be overcome by aligning the data conditional distribution and correcting the posterior probability. To this end, we propose a novel approach to undo the damage of the label shift problem in CDSS. In implementation, we adopt class-level feature alignment for conditional distribution alignment, as well as two simple yet effective methods to rectify the classifier bias from source to target by remolding the classifier predictions. We conduct extensive experiments on the benchmark datasets of urban scenes, including GTA5 to Cityscapes and SYNTHIA to Cityscapes, where our proposed approach outperforms previous methods by a large margin. For instance, our model equipped with a self-training strategy reaches 59.3% mIoU on GTA5 to Cityscapes, pushing to a new state-of-the-art. The code will be available at https://github.com/manmanjun/Undoing UDA.
Comment: Tech report
Published: 2022
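
The idea of "correcting the posterior probability" under label shift, mentioned in the abstract of record 16, can be illustrated with the classical prior-ratio re-weighting. This is a textbook correction that assumes the class-conditional distribution p(x | y) is the same in both domains and that the target class prior is known or estimated; the abstract does not specify the paper's two rectification methods, so this sketch should not be read as either of them. The function name `correct_posterior` is hypothetical.

```python
import numpy as np

def correct_posterior(source_posterior, source_prior, target_prior):
    """Re-weight p_s(y | x) by p_t(y) / p_s(y) and renormalize over classes.

    source_posterior: array of shape (..., num_classes), rows sum to 1.
    source_prior, target_prior: arrays of shape (num_classes,).
    """
    source_posterior = np.asarray(source_posterior, dtype=float)
    ratio = np.asarray(target_prior, dtype=float) / np.asarray(source_prior, dtype=float)
    unnormalized = source_posterior * ratio
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

# A classifier trained where class 0 dominates (90%) is undecided on a pixel;
# on a balanced target domain the corrected posterior favors class 1.
corrected = correct_posterior(
    source_posterior=[0.5, 0.5],
    source_prior=[0.9, 0.1],
    target_prior=[0.5, 0.5],
)
```

The example shows the classifier bias the abstract refers to: an even prediction from a source-biased model is actually evidence for the rarer class once the priors are equalized.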

17. Structure-Aware Motion Transfer with Deformable Anchor Model

Author: Tao, Jiale, Wang, Biao, Xu, Borun, Ge, Tiezheng, Jiang, Yuning, Li, Wen, and Duan, Lixin
Abstract: Given a source image and a driving video depicting the same object type, the motion transfer task aims to generate a video by learning the motion from the driving video while preserving the appearance from the source image. In this paper, we propose a novel structure-aware motion modeling approach, the deformable anchor model (DAM), which can automatically discover the motion structure of arbitrary objects without leveraging their prior structure information. Specifically, inspired by the known deformable part model (DPM), our DAM introduces two types of anchors or keypoints: i) a number of motion anchors that capture both appearance and motion information from the source image and driving video; ii) a latent root anchor, which is linked to the motion anchors to facilitate better learning of the representations of the object structure information. Moreover, DAM can be further extended to a hierarchical version through the introduction of additional latent anchors to model more complicated structures. By regularizing motion anchors with latent anchor(s), DAM enforces the correspondences between them to ensure the structural information is well captured and preserved. Moreover, DAM can be learned effectively in an unsupervised manner. We validate our proposed DAM for motion transfer on different benchmark datasets. Extensive experiments clearly demonstrate that DAM achieves superior performance relative to existing state-of-the-art methods.
Comment: CVPR 2022
Published: 2022

18. Learning Pixel-Level Distinctions for Video Highlight Detection

Author: Wei, Fanyue, Wang, Biao, Ge, Tiezheng, Jiang, Yuning, Li, Wen, and Duan, Lixin
Abstract: The goal of video highlight detection is to select the most attractive segments from a long video to depict the most interesting parts of the video. Existing methods typically focus on modeling the relationship between different video segments in order to learn a model that can assign highlight scores to these segments; however, these approaches do not explicitly consider the contextual dependency within individual segments. To this end, we propose to learn pixel-level distinctions to improve video highlight detection. This pixel-level distinction indicates whether or not each pixel in one video belongs to an interesting section. The advantages of modeling such fine-level distinctions are two-fold. First, it allows us to exploit the temporal and spatial relations of the content in one video, since the distinction of a pixel in one frame is highly dependent on both the content before this frame and the content around this pixel in this frame. Second, learning the pixel-level distinction also gives a good explanation of the video highlight task regarding what content in a highlight segment will be attractive to people. We design an encoder-decoder network to estimate the pixel-level distinction, in which we leverage 3D convolutional neural networks to exploit the temporal context information, and further take advantage of visual saliency to model the spatial distinction. State-of-the-art performance on three public benchmarks clearly validates the effectiveness of our framework for video highlight detection.
Comment: Accepted at CVPR 2022
Published: 2022

19. Diverse Preference Augmentation with Multiple Domains for Cold-start Recommendations

Author: Zhang, Yan, Li, Changyu, Tsang, Ivor W., Xu, Hui, Duan, Lixin, Yin, Hongzhi, Li, Wen, and Shao, Jie
Abstract: Cold-start issues have become more and more challenging for providing accurate recommendations with the fast increase of users and items. Most existing approaches attempt to solve these intractable problems via content-aware recommendations based on auxiliary information and/or cross-domain recommendations with transfer learning. Their performance is often constrained by extremely sparse user-item interactions, unavailable side information, or very limited domain-shared users. Recently, meta-learners with meta-augmentation by adding noise to labels have been proven effective at avoiding overfitting and have shown good performance on new tasks. Motivated by the idea of meta-augmentation, in this paper, by treating a user's preference over items as a task, we propose a so-called Diverse Preference Augmentation framework with multiple source domains based on meta-learning (referred to as MetaDPA) to i) generate diverse ratings in a new domain of interest (known as the target domain) to handle overfitting in the case of sparse interactions, and to ii) learn a preference model in the target domain via a meta-learning scheme to alleviate cold-start issues. Specifically, we first conduct multi-source domain adaptation by dual conditional variational autoencoders and impose a Multi-domain InfoMax (MDI) constraint on the latent representations to learn domain-shared and domain-specific preference properties. To avoid overfitting, we add a Mutually-Exclusive (ME) constraint on the output of decoders to generate diverse ratings given content data. Finally, these generated diverse ratings and the original ratings are introduced into the meta-training procedure to learn a preference meta-learner, which produces good generalization ability on cold-start recommendation tasks. Experiments on real-world datasets show our proposed MetaDPA clearly outperforms the current state-of-the-art baselines.
Published: 2022

20. Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment

Author: Deng, Jinhong, Zhang, Xiaoyue, Li, Wen, and Duan, Lixin
Abstract: Detection transformers like DETR have recently shown promising performance on many object detection tasks, but the generalization ability of those methods is still quite challenging for cross-domain adaptation scenarios. To address the cross-domain issue, a straightforward way is to perform token alignment with adversarial training in transformers. However, its performance is often unsatisfactory as the tokens in detection transformers are quite diverse and represent different spatial and semantic information. In this paper, we propose a new method called Spatial-aware and Semantic-aware Token Alignment (SSTA) for cross-domain detection transformers. In particular, we take advantage of the characteristics of cross-attention as used in detection transformer and propose the spatial-aware token alignment (SpaTA) and the semantic-aware token alignment (SemTA) strategies to guide the token alignment across domains. For spatial-aware token alignment, we can extract the information from the cross-attention map (CAM) to align the distribution of tokens according to their attention to object queries. For semantic-aware token alignment, we inject the category information into the cross-attention map and construct domain embedding to guide the learning of a multi-class discriminator so as to model the category relationship and achieve category-level token alignment during the entire adaptation process. We conduct extensive experiments on several widely-used benchmarks, and the results clearly show the effectiveness of our proposed method over existing state-of-the-art baselines., Comment: Technical report
Published: 2022

21. Move As You Like: Image Animation in E-Commerce Scenario

Author: Xu, Borun, Wang, Biao, Tao, Jiale, Ge, Tiezheng, Jiang, Yuning, Li, Wen, and Duan, Lixin
Abstract: Creative image animations are attractive in e-commerce applications, where motion transfer is one of the important ways to generate animations from static images. However, existing methods rarely transfer motion to objects other than the human body or human face, and even fewer apply motion transfer in practical scenarios. In this work, we apply motion transfer to Taobao product images in a real e-commerce scenario to generate creative animations, which are more attractive than static images and will bring more benefits. We animate Taobao products such as dolls, copper running horses and toy dinosaurs based on a motion transfer method for demonstration., Comment: 3 pages, 3 figures, ACM MM 2021 demo session
Published: 2021

22. Real-Time Video Super-Resolution on Smartphones with Deep Learning, Mobile AI 2021 Challenge: Report

Author: Ignatov, Andrey, Romero, Andres, Kim, Heewon, Timofte, Radu, Ho, Chiu Man, Meng, Zibo, Lee, Kyoung Mu, Chen, Yuxiang, Wang, Yutong, Long, Zeyu, Wang, Chenhao, Chen, Yifei, Xu, Boshen, Gu, Shuhang, Duan, Lixin, Li, Wen, Bofei, Wang, Diankai, Zhang, Chengjian, Zheng, Shaoli, Liu, Si, Gao, Xiaofeng, Zhang, Kaidi, Lu, Tianyu, Xu, Hui, Zheng, Gao, Xinbo, Wang, Xiumei, Guo, Jiaming, Zhou, Xueyi, Jia, Hao, and Yan, Youliang
Abstract: Video super-resolution has recently become one of the most important mobile-related problems due to the rise of video communication and streaming services. While many solutions have been proposed for this task, the majority of them are too computationally expensive to run on portable devices with limited hardware resources. To address this problem, we introduce the first Mobile AI challenge, where the target is to develop end-to-end deep learning-based video super-resolution solutions that can achieve real-time performance on mobile GPUs. The participants were provided with the REDS dataset and trained their models to do efficient 4X video upscaling. The runtime of all models was evaluated on the OPPO Find X2 smartphone with the Snapdragon 865 SoC capable of accelerating floating-point networks on its Adreno GPU. The proposed solutions are fully compatible with any mobile GPU and can upscale videos to HD resolution at up to 80 FPS while demonstrating high fidelity results. A detailed description of all models developed in the challenge is provided in this paper., Comment: Mobile AI 2021 Workshop and Challenges: https://ai-benchmark.com/workshops/mai/2021/. arXiv admin note: substantial text overlap with arXiv:2105.07825, arXiv:2105.08629, arXiv:2105.07809, arXiv:2105.08630
Published: 2021

23. Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction

Author: Tao, Chaofan, Jiang, Qinhong, Duan, Lixin, and Luo, Ping
Abstract: Multi-agent motion prediction is challenging because it aims to foresee the future trajectories of multiple agents (e.g., pedestrians) simultaneously in a complicated scene. Existing work addressed this challenge by either learning social spatial interactions represented by the positions of a group of pedestrians, while ignoring their temporal coherence (i.e., dependencies between different long trajectories), or by understanding the complicated scene layout (e.g., scene segmentation) to ensure safe navigation. However, unlike previous work that isolated the spatial interaction, temporal coherence, and scene layout, this paper designs a new mechanism, i.e., the Dynamic and Static Context-aware Motion Predictor (DSCMP), to integrate this rich information into the long short-term memory (LSTM). It has three appealing benefits. (1) DSCMP models the dynamic interactions between agents by learning both their spatial positions and temporal coherence, as well as understanding the contextual scene layout. (2) Different from previous LSTM models that predict motions by propagating hidden features frame by frame, limiting the capacity to learn correlations between long trajectories, we carefully design a differentiable queue mechanism in DSCMP, which is able to explicitly memorize and learn the correlations between long trajectories. (3) DSCMP captures the context of the scene by inferring a latent variable, which enables multimodal predictions with meaningful semantic scene layout. Extensive experiments show that DSCMP outperforms state-of-the-art methods by large margins, such as 9.05% and 7.62% relative improvements on the ETH-UCY and SDD datasets respectively., Comment: 17 pages, 6 figures
Published: 2020

24. Reconstruction Regularized Deep Metric Learning for Multi-label Image Classification

Author: Li, Changsheng, Liu, Chong, Duan, Lixin, Gao, Peng, and Zheng, Kai
Abstract: In this paper, we present a novel deep metric learning method to tackle the multi-label image classification problem. In order to better learn the correlations among image features, as well as labels, we attempt to explore a latent space, where images and labels are embedded via two unique deep neural networks, respectively. To capture the relationships between image features and labels, we aim to learn a two-way deep distance metric over the embedding space from two different views, i.e., the distance between one image and its labels is not only smaller than those distances between the image and its labels' nearest neighbors, but also smaller than the distances between the labels and other images corresponding to the labels' nearest neighbors. Moreover, a reconstruction module for recovering correct labels is incorporated into the whole framework as a regularization term, such that the label embedding space is more representative. Our model can be trained in an end-to-end manner. Experimental results on publicly available image datasets corroborate the efficacy of our method compared with state-of-the-art methods., Comment: Accepted by IEEE TNNLS
Published: 2020

25. Deeply Aligned Adaptation for Cross-domain Object Detection

Author: Fu, Minghao, Xie, Zhenshan, Li, Wen, and Duan, Lixin
Abstract: Cross-domain object detection has recently attracted more and more attention for real-world applications, since it helps build robust detectors adapting well to new environments. In this work, we propose an end-to-end solution based on Faster R-CNN, where ground-truth annotations are available for source images (e.g., cartoon) but not for target ones (e.g., watercolor) during training. Motivated by the observation that the transferabilities of different neural network layers differ from each other, we propose to apply a number of domain alignment strategies to different layers of Faster R-CNN, where the alignment strength is gradually reduced from low to higher layers. Moreover, after obtaining region proposals in our network, we develop a foreground-background aware alignment module to further reduce the domain mismatch by separately aligning features of the foreground and background regions from the source and target domains. Extensive experiments on benchmark datasets demonstrate the effectiveness of our proposed approach.
Published: 2020

26. Learning Cross-domain Semantic-Visual Relationships for Transductive Zero-Shot Learning

Author: Lv, Fengmao, Zhang, Jianyang, Yang, Guowu, Feng, Lei, Yu, Yufeng, and Duan, Lixin
Abstract: Zero-Shot Learning (ZSL) learns models for recognizing new classes. One of the main challenges in ZSL is the domain discrepancy caused by the category inconsistency between training and testing data. Domain adaptation is the most intuitive way to address this challenge. However, existing domain adaptation techniques cannot be directly applied to ZSL due to the disjoint label space between source and target domains. This work proposes the Transferrable Semantic-Visual Relation (TSVR) approach towards transductive ZSL. TSVR redefines image recognition as predicting the similarity/dissimilarity labels for semantic-visual fusions consisting of class attributes and visual features. After the above transformation, the source and target domains can have the same label space, which makes it possible to quantify the domain discrepancy. For the redefined problem, the number of similar semantic-visual pairs is significantly smaller than that of dissimilar ones. To this end, we further propose to use Domain-Specific Batch Normalization to align the domain discrepancy.
Published: 2020

27. Unbiased Mean Teacher for Cross-domain Object Detection

Author: Deng, Jinhong, Li, Wen, Chen, Yuhua, and Duan, Lixin
Abstract: Cross-domain object detection is challenging, because object detection models are often vulnerable to data variance, especially to the considerable domain shift between two distinctive domains. In this paper, we propose a new Unbiased Mean Teacher (UMT) model for cross-domain object detection. We reveal that there often exists a considerable model bias for the simple mean teacher (MT) model in cross-domain scenarios, and eliminate the model bias with several simple yet highly effective strategies. In particular, for the teacher model, we propose a cross-domain distillation method for MT to maximally exploit the expertise of the teacher model. Moreover, for the student model, we alleviate its bias by augmenting training samples with pixel-level adaptation. Finally, for the teaching process, we employ an out-of-distribution estimation strategy to select samples that most fit the current model to further enhance the cross-domain distillation process. By tackling the model bias issue with these strategies, our UMT model achieves mAPs of 44.1%, 58.1%, 41.7%, and 43.1% on benchmark datasets Clipart1k, Watercolor2k, Foggy Cityscapes, and Cityscapes, respectively, which outperforms the existing state-of-the-art results by notable margins. Our implementation is available at https://github.com/kinredon/umt., Comment: Accepted by CVPR2021
Published: 2020

28. Collaborative Generative Hashing for Marketing and Fast Cold-start Recommendation

Author: Zhang, Yan, Tsang, Ivor W., and Duan, Lixin
Abstract: Cold-start has been a critical issue in recommender systems with the explosion of data in e-commerce. Most existing studies proposed to alleviate the cold-start problem are also known as hybrid recommender systems that learn representations of users and items by combining user-item interaction and user/item content information. However, previous hybrid methods regularly suffered from poor efficiency, a bottleneck in online recommendations with large-scale items, because they were designed to project users and items into a continuous latent space where the online recommendation is expensive. To this end, we propose a collaborative generated hashing (CGH) framework to improve the efficiency by denoting users and items as binary codes, so that fast hashing search techniques can be used to speed up the online recommendation. In addition, the proposed CGH can generate potential users or items for marketing applications, where the generative network is designed with the principle of Minimum Description Length (MDL), which is used to learn compact and informative binary codes. Extensive experiments on two public datasets show the advantages for recommendations in various settings over competing baselines and analyze its feasibility in marketing applications., Comment: 11 pages, 8 figures
Published: 2020

29. Region Comparison Network for Interpretable Few-shot Image Classification

Author: Xue, Zhiyu, Duan, Lixin, Li, Wen, Chen, Lin, and Luo, Jiebo
Abstract: While deep learning has been successfully applied to many real-world computer vision tasks, training robust classifiers usually requires a large amount of well-labeled data. However, the annotation is often expensive and time-consuming. Few-shot image classification has thus been proposed to effectively use only a limited number of labeled examples to train models for new classes. Recent works based on transferable metric learning methods have achieved promising classification performance through learning the similarity between the features of samples from the query and support sets. However, few of them explicitly consider the model interpretability, which can actually be revealed during the training phase. For that, in this work, we propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works in a neural network as well as to find specific regions that are related to each other in images coming from the query and support sets. Moreover, we also present a visualization strategy named Region Activation Mapping (RAM) to intuitively explain what our method has learned by visualizing intermediate variables in our network. We also present a new way to generalize the interpretability from the level of tasks to categories, which can also be viewed as a method to find the prototypical parts for supporting the final decision of our RCN. Extensive experiments on four benchmark datasets clearly show the effectiveness of our method over existing baselines.
Published: 2020

30. Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

Author: Xu, Yiming, Chen, Lin, Cheng, Zhongwei, Duan, Lixin, and Luo, Jiebo
Abstract: We study the problem of visual question answering (VQA) in images by exploiting supervised domain adaptation, where there is a large amount of labeled data in the source domain but only limited labeled data in the target domain with the goal to train a good target model. A straightforward solution is to fine-tune a pre-trained source model by using those limited labeled target data, but it usually cannot work well due to the considerable difference between the data distributions of the source and target domains. Moreover, the availability of multiple modalities (i.e., images, questions and answers) in VQA poses further challenges to model the transferability between those different modalities. In this paper, we tackle the above issues by proposing a novel supervised multi-modal domain adaptation method for VQA to learn joint feature embeddings across different domains and modalities. Specifically, we align the data distributions of the source and target domains by considering all modalities together as well as separately for each individual modality. Based on the extensive experiments on the benchmark VQA 2.0 and VizWiz datasets for the realistic open-ended VQA task, we demonstrate that our proposed method outperforms the existing state-of-the-art approaches in this challenging domain adaptation setting for VQA.
Published: 2019

31. Constructing Self-motivated Pyramid Curriculums for Cross-Domain Semantic Segmentation: A Non-Adversarial Approach

Author: Lian, Qing, Lv, Fengmao, Duan, Lixin, and Gong, Boqing
Abstract: We propose a new approach, called self-motivated pyramid curriculum domain adaptation (PyCDA), to facilitate the adaptation of semantic segmentation neural networks from synthetic source domains to real target domains. Our approach draws on an insight connecting two existing works: curriculum domain adaptation and self-training. Inspired by the former, PyCDA constructs a pyramid curriculum which contains various properties about the target domain. Those properties are mainly about the desired label distributions over the target domain images, image regions, and pixels. By enforcing the segmentation neural network to observe those properties, we can improve the network's generalization capability to the target domain. Motivated by the self-training, we infer this pyramid of properties by resorting to the semantic segmentation network itself. Unlike prior work, we do not need to maintain any additional models (e.g., logistic regression or discriminator networks) or to solve minmax problems which are often difficult to optimize. We report state-of-the-art results for the adaptation from both GTAV and SYNTHIA to Cityscapes, two popular settings in unsupervised domain adaptation for semantic segmentation.
Published: 2019

32. Adversarial Multimodal Network for Movie Question Answering

Author: Yuan, Zhaoquan, Sun, Siyuan, Duan, Lixin, Wu, Xiao, and Xu, Changsheng
Abstract: Visual question answering by using information from multiple modalities has attracted more and more attention in recent years. However, it is a very challenging task, as the visual content and natural language have quite different statistical properties. In this work, we present a method called Adversarial Multimodal Network (AMN) to better understand video stories for question answering. In AMN, as inspired by generative adversarial networks, we propose to learn multimodal feature representations by finding a more coherent subspace for video clips and the corresponding texts (e.g., subtitles and questions). Moreover, we introduce a self-attention mechanism to enforce the so-called consistency constraints in order to preserve the self-correlation of visual cues of the original video clips in the learned multimodal representations. Extensive experiments on the MovieQA dataset show the effectiveness of our proposed AMN over other published state-of-the-art methods., Comment: We will revise the paper
Published: 2019

33. Domain Adversarial Reinforcement Learning for Partial Domain Adaptation

Author: Chen, Jin, Wu, Xinxiao, Duan, Lixin, and Gao, Shenghua
Abstract: Partial domain adaptation aims to transfer knowledge from a label-rich source domain to a label-scarce target domain, which relaxes the fully shared label space assumption across different domains. In this more general and practical scenario, a major challenge is how to select source instances in the shared classes across different domains for positive transfer. To address this issue, we propose a Domain Adversarial Reinforcement Learning (DARL) framework to automatically select source instances in the shared classes for circumventing negative transfer as well as to simultaneously learn transferable features between domains by reducing the domain shift. Specifically, in this framework, we employ deep Q-learning to learn policies for an agent to make selection decisions by approximating the action-value function. Moreover, domain adversarial learning is introduced to learn domain-invariant features for the source instances selected by the agent and the target instances, and also to determine rewards for the agent based on how relevant the selected source instances are to the target domain. Experiments on several benchmark datasets demonstrate the superior performance of our DARL method over existing state-of-the-art methods for partial domain adaptation.
Published: 2019

34. Known-class Aware Self-ensemble for Open Set Domain Adaptation

Author: Lian, Qing, Li, Wen, Chen, Lin, and Duan, Lixin
Abstract: Existing domain adaptation methods generally assume different domains have the identical label space, which is quite restrictive for real-world applications. In this paper, we focus on a more realistic and challenging case of open set domain adaptation. Particularly, in open set domain adaptation, we allow the classes from the source and target domains to be partially overlapped. In this case, the assumption of conventional distribution alignment does not hold anymore, due to the different label spaces in two domains. To tackle this challenge, we propose a new approach called Known-class Aware Self-Ensemble (KASE), which is built upon the recently developed self-ensemble model. In KASE, we first introduce a Known-class Aware Recognition (KAR) module to identify the known and unknown classes from the target domain, which is achieved by encouraging a low cross-entropy for known classes and a high entropy based on the source data from the unknown class. Then, we develop a Known-class Aware Adaptation (KAA) module to better adapt from the source domain to the target by reweighing the adaptation loss based on the likeliness to belong to known classes of unlabeled target samples as predicted by KAR. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our approach.
Published: 2019

35. MiniMax Entropy Network: Learning Category-Invariant Features for Domain Adaptation

Author: Tao, Chaofan, Lv, Fengmao, Duan, Lixin, and Wu, Min
Abstract: How to effectively learn from unlabeled data from the target domain is crucial for domain adaptation, as it helps reduce the large performance gap due to domain shift or distribution change. In this paper, we propose an easy-to-implement method dubbed MiniMax Entropy Networks (MMEN) based on adversarial learning. Unlike most existing approaches which employ a generator to deal with domain difference, MMEN focuses on learning the categorical information from unlabeled target samples with the help of labeled source samples. Specifically, we set an unfair multi-class classifier named categorical discriminator, which classifies source samples accurately but is confused about the categories of target samples. The generator learns a common subspace that aligns the unlabeled samples based on the target pseudo-labels. For MMEN, we also provide theoretical explanations to show that the learning of feature alignment reduces domain mismatch at the category level. Experimental results on various benchmark datasets demonstrate the effectiveness of our method over existing state-of-the-art baselines., Comment: 8 pages, 6 figures
Published: 2019

36. Learning Transferable Self-attentive Representations for Action Recognition in Untrimmed Videos with Weak Supervision

Author: Zhang, Xiao-Yu, Shi, Haichao, Li, Changsheng, Zheng, Kai, Zhu, Xiaobin, and Duan, Lixin
Abstract: Action recognition in videos has attracted a lot of attention in the past decade. In order to learn robust models, previous methods usually assume videos are trimmed as short sequences and require ground-truth annotations of each video frame/sequence, which is quite costly and time-consuming. In this paper, given only video-level annotations, we propose a novel weakly supervised framework to simultaneously locate action frames as well as recognize actions in untrimmed videos. Our proposed framework consists of two major components. First, for action frame localization, we take advantage of the self-attention mechanism to weight each frame, such that the influence of background frames can be effectively eliminated. Second, considering that there are trimmed videos publicly available and also they contain useful information to leverage, we present an additional module to transfer the knowledge from trimmed videos for improving the classification performance in untrimmed ones. Extensive experiments are conducted on two benchmark datasets (i.e., THUMOS14 and ActivityNet1.3), and experimental results clearly corroborate the efficacy of our method.
Published: 2019

37. Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks

Author: Yu, Feiwu, Wu, Xinxiao, Sun, Yuchao, and Duan, Lixin
Abstract: Existing deep learning methods of video recognition usually require a large number of labeled videos for training. But for a new task, videos are often unlabeled and it is also time-consuming and labor-intensive to annotate them. Instead of human annotation, we try to make use of existing fully labeled images to help recognize those videos. However, due to the problem of domain shifts and heterogeneous feature representations, the performance of classifiers trained on images may be dramatically degraded for video recognition tasks. In this paper, we propose a novel method, called Hierarchical Generative Adversarial Networks (HiGAN), to enhance recognition in videos (i.e., target domain) by transferring knowledge from images (i.e., source domain). The HiGAN model consists of a low-level conditional GAN and a high-level conditional GAN. By taking advantage of this two-level adversarial learning, our method is capable of learning a domain-invariant feature representation of source images and target videos. Comprehensive experiments on two challenging video recognition datasets (i.e., UCF101 and HMDB51) demonstrate the effectiveness of the proposed method when compared with the existing state-of-the-art domain adaptation methods., Comment: IJCAI 2018
Published: 2018

38. Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Author: Liu, Hao, Yang, Yang, Shen, Fumin, Duan, Lixin, and Shen, Heng Tao
Abstract: Along with the prosperity of recurrent neural networks in modelling sequential data and the power of attention mechanisms in automatically identifying salient information, image captioning, a.k.a. image description, has been remarkably advanced in recent years. Nonetheless, most existing paradigms may suffer from the deficiency of invariance to images with different scaling, rotation, etc.; and effective integration of standalone attention to form a holistic end-to-end system. In this paper, we propose a novel image captioning architecture, termed Recurrent Image Captioner (RIC), which allows the visual encoder and language decoder to coherently cooperate in a recurrent manner. Specifically, we first equip the CNN-based visual encoder with a differentiable layer to enable spatially invariant transformation of visual signals. Moreover, we deploy an attention filter module (differentiable) between encoder and decoder to dynamically determine salient visual parts. We also employ bidirectional LSTM to preprocess sentences for generating better textual representations. Besides, we propose to exploit variational inference to optimize the whole architecture. Extensive experimental results on three benchmark datasets (i.e., Flickr8k, Flickr30k and MS COCO) demonstrate the superiority of our proposed architecture as compared to most of the state-of-the-art methods.
Published: 2016

39. Learning with Augmented Features for Heterogeneous Domain Adaptation

Author: Duan, Lixin, Xu, Dong, and Tsang, Ivor
Abstract: We propose a new learning method for heterogeneous domain adaptation (HDA), in which the data from the source domain and the target domain are represented by heterogeneous features with different dimensions. Using two different projection matrices, we first transform the data from two domains into a common subspace in order to measure the similarity between the data from two domains. We then propose two new feature mapping functions to augment the transformed data with their original features and zeros. The existing learning methods (e.g., SVM and SVR) can be readily incorporated with our newly proposed augmented feature representations to effectively utilize the data from both domains for HDA. Using the hinge loss function in SVM as an example, we introduce the detailed objective function in our method called Heterogeneous Feature Augmentation (HFA) for a linear case and also describe its kernelization in order to efficiently cope with the data with very high dimensions. Moreover, we also develop an alternating optimization algorithm to effectively solve the nontrivial optimization problem in our HFA method. Comprehensive experiments on two benchmark datasets clearly demonstrate that HFA outperforms the existing HDA methods., Comment: ICML2012
Published: 2012
