386 results for "Huang, Feiyue"
Search Results
2. Hallucinated Style Distillation for Single Domain Generalization in Medical Image Segmentation
- Author
-
Yi, Jingjun, Bi, Qi, Zheng, Hao, Zhan, Haolan, Ji, Wei, Huang, Yawen, Li, Shaoxin, Li, Yuexiang, Zheng, Yefeng, Huang, Feiyue, Linguraru, Marius George, editor, Dou, Qi, editor, Feragen, Aasa, editor, Giannarou, Stamatia, editor, Glocker, Ben, editor, Lekadir, Karim, editor, and Schnabel, Julia A., editor
- Published
- 2024
- Full Text
- View/download PDF
3. Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks
- Author
-
Luo, Gen, Zhou, Yiyi, Sun, Xiaoshuai, Wang, Yan, Cao, Liujuan, Wu, Yongjian, Huang, Feiyue, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Despite its exciting performance, the Transformer is criticized for its excessive parameters and computation cost. However, compressing Transformers remains an open problem due to the internal complexity of their layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address this issue, we introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer. LW-Transformer applies Group-wise Transformation to reduce both the parameters and computations of the Transformer, while preserving its two main properties, i.e., the efficient attention modeling on diverse subspaces of MHA, and the expanding-scaling feature transformation of FFN. We apply LW-Transformer to a set of Transformer-based networks and quantitatively evaluate them on three vision-and-language tasks and six benchmark datasets. Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks. To examine the generalization ability, we also apply our optimization strategy to a recently proposed image Transformer, Swin-Transformer, for image classification, where its effectiveness is also confirmed. [A minimal code sketch of the group-wise idea follows this entry.]
- Published
- 2022
- Full Text
- View/download PDF
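The group-wise transformation described in this abstract is concrete enough to sketch. Below is a minimal, hypothetical PyTorch module (the class name and structure are mine, not taken from the paper) that replaces one dense projection with G independent group projections, cutting its parameters roughly G-fold:

```python
import torch
import torch.nn as nn

class GroupLinear(nn.Module):
    """Split features into G groups and project each group independently.

    A dense Linear(d, d) has d*d weights; G group projections have
    G * (d/G)^2 = d*d/G weights, a roughly G-fold parameter reduction.
    """
    def __init__(self, dim: int, groups: int):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.projs = nn.ModuleList(
            nn.Linear(dim // groups, dim // groups) for _ in range(groups)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.groups, dim=-1)   # split the channel dimension
        return torch.cat([p(c) for p, c in zip(self.projs, chunks)], dim=-1)

x = torch.randn(2, 16, 512)                     # (batch, tokens, dim)
layer = GroupLinear(dim=512, groups=8)
print(layer(x).shape)                           # torch.Size([2, 16, 512])
```

Applying the same split-project-concatenate pattern inside MHA and FFN is, as I read the abstract, how the parameter savings are obtained while keeping each layer's overall shape.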
4. The intentions and factors influencing university students to perform CPR for strangers based on the theory of planned behavior study
- Author
-
Xia, Lihua, Zhang, Kebiao, Huang, Feiyue, Jian, Ping, and Yang, Runli
- Published
- 2024
- Full Text
- View/download PDF
5. LSTC: Boosting Atomic Action Detection with Long-Short-Term Context
- Author
-
Li, Yuxi, Zhang, Boshen, Li, Jian, Wang, Yabiao, Lin, Weiyao, Wang, Chengjie, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we place the atomic action detection problem into a Long-Short-Term Context (LSTC) to analyze how the temporal reliance among video signals affects the action detection results. To do this, we decompose the action recognition pipeline into short-term and long-term reliance, under the hypothesis that the two kinds of context are conditionally independent given the objective action instance. Within our design, a local aggregation branch is utilized to gather dense and informative short-term cues, while a high-order long-term inference branch is designed to infer the objective action class from high-order interactions between the actor and other persons or person pairs. Both branches independently predict the context-specific actions, and the results are merged in the end. We demonstrate that both temporal grains are beneficial to atomic action recognition. On the mainstream benchmarks of atomic action detection, our design brings significant performance gains over the existing state-of-the-art pipeline. The code of this project can be found at [this url](https://github.com/TencentYoutuResearch/ActionDetection-LSTC), Comment: ACM Multimedia 2021
- Published
- 2021
6. Transformer-based Dual Relation Graph for Multi-label Image Recognition
- Author
-
Zhao, Jiawei, Yan, Ke, Zhao, Yifan, Guo, Xiaowei, Huang, Feiyue, and Li, Jia
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The simultaneous recognition of multiple objects in one image remains a challenging task, involving multiple difficulties such as various object scales, inconsistent appearances, and confusing inter-class relationships. Recent research efforts mainly resort to statistical label co-occurrences and linguistic word embeddings to enhance the unclear semantics. Unlike these works, in this paper we propose a novel Transformer-based Dual Relation learning framework, constructing complementary relationships by exploring two aspects of correlation, i.e., a structural relation graph and a semantic relation graph. The structural relation graph aims to capture long-range correlations from object context, by developing a cross-scale transformer-based architecture. The semantic graph dynamically models the semantic meanings of image objects with explicit semantic-aware constraints. In addition, we incorporate the learnt structural relationship into the semantic graph, constructing a joint relation graph for robust representations. With the collaborative learning of these two effective relation graphs, our approach achieves new state-of-the-art results on two popular multi-label recognition benchmarks, i.e., the MS-COCO and VOC 2007 datasets., Comment: 10 pages, 5 figures. Published in ICCV 2021
- Published
- 2021
7. Spatiotemporal Inconsistency Learning for DeepFake Video Detection
- Author
-
Gu, Zhihao, Chen, Yang, Yao, Taiping, Ding, Shouhong, Li, Jilin, Huang, Feiyue, and Ma, Lizhuang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The rapid development of facial manipulation techniques has aroused public concern in recent years. Following the success of deep learning, existing methods always formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we formulate this task as a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it into a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions. The ISM simultaneously utilizes the spatial information from SIM and the temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and can be plugged into existing 2D CNNs. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against state-of-the-art competitors., Comment: To appear in ACM MM 2021 [A minimal code sketch of the temporal-difference cue follows this entry.]
- Published
- 2021
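The temporal-difference cue that TIM builds on reduces to a one-line tensor operation. The sketch below is a generic illustration under my own naming, not the paper's actual module:

```python
import torch

def temporal_difference(clip: torch.Tensor) -> torch.Tensor:
    """clip: (batch, time, channels, height, width).

    Returns per-step differences between adjacent frames, a simple
    stand-in for the inconsistency cue TIM builds on; the actual STIL
    block adds directional (horizontal/vertical) modeling on top.
    """
    return clip[:, 1:] - clip[:, :-1]           # (B, T-1, C, H, W)

clip = torch.randn(2, 8, 3, 64, 64)
print(temporal_difference(clip).shape)          # torch.Size([2, 7, 3, 64, 64])
```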
8. Adaptive Normalized Representation Learning for Generalizable Face Anti-Spoofing
- Author
-
Liu, Shubao, Zhang, Ke-Yue, Yao, Taiping, Bi, Mingwei, Ding, Shouhong, Li, Jilin, Huang, Feiyue, and Ma, Lizhuang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
With various face presentation attacks arising under unseen scenarios, face anti-spoofing (FAS) based on domain generalization (DG) has drawn growing attention due to its robustness. Most existing methods utilize DG frameworks to align the features to seek a compact and generalized feature space. However, little attention has been paid to the feature extraction process for the FAS task, especially the influence of normalization, which also has a great impact on the generalization of the learned representation. To address this issue, we propose a novel perspective on face anti-spoofing that focuses on the normalization selection in the feature extraction process. Concretely, an Adaptive Normalized Representation Learning (ANRL) framework is devised, which adaptively selects feature normalization methods according to the inputs, aiming to learn domain-agnostic and discriminative representations. Moreover, to facilitate the representation learning, Dual Calibration Constraints are designed, including an Inter-Domain Compatible loss and an Inter-Class Separable loss, which provide a better optimization direction for generalizable representation. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against state-of-the-art competitors., Comment: Accepted at ACM MM 2021
- Published
- 2021
9. Distributed Attention for Grounded Image Captioning
- Author
-
Chen, Nenglun, Pan, Xingjia, Chen, Runnan, Yang, Lei, Lin, Zhiwen, Ren, Yuqiang, Yuan, Haolei, Guo, Xiaowei, Huang, Feiyue, and Wang, Wenping
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Multimedia - Abstract
We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the context of the image, with each noun word grounded to the corresponding region in the image. This task is challenging due to the lack of explicit fine-grained region-word alignments as supervision. Previous weakly supervised methods mainly explore various kinds of regularization schemes to improve attention accuracy. However, their performance is still far from the fully supervised ones. One main issue that has been ignored is that the attention for generating visually groundable words may focus only on the most discriminative parts and cannot cover the whole object. To this end, we propose a simple yet effective method to alleviate this issue, termed the partial grounding problem in our paper. Specifically, we design a distributed attention mechanism to enforce the network to aggregate information from multiple spatially different regions with consistent semantics while generating the words. Therefore, the union of the focused region proposals should form a visual region that encloses the object of interest completely. Extensive experiments have demonstrated the superiority of our proposed method compared with the state-of-the-art., Comment: ACM MM 2021
- Published
- 2021
- Full Text
- View/download PDF
10. Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework
- Author
-
Song, Qingyu, Wang, Changan, Jiang, Zhengkai, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Wu, Yang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Localizing individuals in crowds is more in accordance with the practical demands of subsequent high-level crowd analysis tasks than simply counting. However, existing localization-based methods relying on intermediate representations (i.e., density maps or pseudo boxes) as learning targets are counter-intuitive and error-prone. In this paper, we propose a purely point-based framework for joint crowd counting and individual localization. For this framework, instead of merely reporting the absolute counting error at image level, we propose a new metric, called density Normalized Average Precision (nAP), to provide more comprehensive and more precise performance evaluation. Moreover, we design an intuitive solution under this framework, called the Point-to-Point Network (P2PNet). P2PNet discards superfluous steps and directly predicts a set of point proposals to represent heads in an image, consistent with the human annotation results. Through thorough analysis, we reveal that the key step in implementing such a novel idea is to assign optimal learning targets to these proposals. Therefore, we propose to conduct this crucial association in a one-to-one matching manner using the Hungarian algorithm. P2PNet not only significantly surpasses state-of-the-art methods on popular counting benchmarks, but also achieves promising localization accuracy. The code will be available at: https://github.com/TencentYoutuResearch/CrowdCounting-P2PNet., Comment: To appear in ICCV 2021 (Oral) [A minimal sketch of the matching step follows this entry.]
- Published
- 2021
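The one-to-one association step named in the abstract is the classic Hungarian assignment. Here is a minimal SciPy sketch assuming a pure Euclidean-distance cost; P2PNet's actual matching cost also incorporates classification scores:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(pred: np.ndarray, gt: np.ndarray):
    """One-to-one assignment of predicted head points to annotated points.

    pred: (N, 2) predicted (x, y) proposals; gt: (M, 2) ground truth.
    Euclidean distance is used as the matching cost, a simplified
    stand-in for the paper's full cost.
    """
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M)
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm
    return rows, cols                           # pred rows[i] <-> gt cols[i]

pred = np.random.rand(5, 2) * 100
gt = np.random.rand(3, 2) * 100
rows, cols = match_points(pred, gt)
print(list(zip(rows, cols)))                    # 3 matched pairs; 2 preds unmatched
```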
11. Discriminator-Free Generative Adversarial Attack
- Author
-
Lu, Shaohao, Xian, Yuqiao, Yan, Ke, Hu, Yi, Sun, Xing, Guo, Xiaowei, Huang, Feiyue, and Zheng, Wei-Shi
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep Neural Networks are vulnerable to adversarial examples (Figure 1): DNN-based systems can be made to collapse by adding inconspicuous perturbations to the images. Most existing works on adversarial attack are gradient-based and suffer from latency inefficiencies and heavy load on GPU memory. Generative-based adversarial attacks can get rid of this limitation, and some related works propose approaches based on GANs. However, owing to the difficulty of training a GAN to convergence, the resulting adversarial examples have either poor attack ability or poor visual quality. In this work, we find that the discriminator may not be necessary for generative-based adversarial attack, and propose the Symmetric Saliency-based Auto-Encoder (SSAE) to generate the perturbations, which is composed of a saliency map module and an angle-norm disentanglement of the features module. The advantage of our proposed method is that it does not depend on a discriminator, and it uses the generative saliency map to pay more attention to label-relevant regions. Extensive experiments across various tasks, datasets, and models demonstrate that the adversarial examples generated by SSAE not only make the widely-used models collapse, but also achieve good visual quality. The code is available at https://github.com/BravoLu/SSAE., Comment: 9 pages, 6 figures, 4 tables
- Published
- 2021
- Full Text
- View/download PDF
12. HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping
- Author
-
Wang, Yuhan, Chen, Xu, Zhu, Junwei, Chu, Wenqing, Tai, Ying, Wang, Chengjie, Li, Jilin, Wu, Yongjian, Huang, Feiyue, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing face swapping works that only use a face recognition model to keep identity similarity, we propose 3D shape-aware identity to control the face shape with geometric supervision from a 3DMM and a 3D face reconstruction method. Meanwhile, we introduce a Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method can better preserve identity, especially the face shape, and can generate more photo-realistic results than previous state-of-the-art methods., Comment: Accepted to IJCAI 2021, project website: https://johann.wang/HifiFace
- Published
- 2021
13. Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection
- Author
-
Zhang, Zhenyu, Ge, Yanhao, Chen, Renwang, Tai, Ying, Yan, Yan, Yang, Jian, Wang, Chengjie, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Graphics - Abstract
Non-parametric face modeling aims to reconstruct a 3D face from images alone, without shape assumptions. While plausible facial details are predicted, such models tend to over-depend on local color appearance and suffer from ambiguous noise. To address this problem, this paper presents a novel Learning to Aggregate and Personalize (LAP) framework for unsupervised robust 3D face modeling. Instead of requiring a controlled environment, the proposed method implicitly disentangles ID-consistent and scene-specific faces from an unconstrained photo set. Specifically, to learn an ID-consistent face, LAP adaptively aggregates intrinsic face factors of an identity based on a novel curriculum learning approach with a relaxed consistency loss. To adapt the face to a personalized scene, we propose a novel attribute-refining network to modify the ID-consistent face with target attributes and details. Based on the proposed method, unsupervised 3D face modeling benefits from meaningful image facial structure and possibly higher resolutions. Extensive experiments on benchmarks show LAP recovers superior or competitive face shape and texture compared with state-of-the-art (SOTA) methods with or without prior and supervision., Comment: CVPR 2021 Oral, 11 pages, 9 figures
- Published
- 2021
14. Consistent Instance False Positive Improves Fairness in Face Recognition
- Author
-
Xu, Xingkun, Huang, Yuge, Shen, Pengcheng, Li, Shaoxin, Li, Jilin, Huang, Feiyue, Li, Yong, and Cui, Zhen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Demographic bias is a significant challenge in practical face recognition systems. Existing methods heavily rely on accurate demographic annotations. However, such annotations are usually unavailable in real scenarios. Moreover, these methods are typically designed for a specific demographic group and are not general enough. In this paper, we propose a false positive rate penalty loss, which mitigates face recognition bias by increasing the consistency of the instance False Positive Rate (FPR). Specifically, we first define the instance FPR as the ratio between the number of non-target similarities above a unified threshold and the total number of non-target similarities. The unified threshold is estimated for a given total FPR. Then, an additional penalty term, proportional to the ratio of the instance FPR to the overall FPR, is introduced into the denominator of the softmax-based loss. The larger the instance FPR, the larger the penalty. Through such unequal penalties, the instance FPRs are encouraged to be consistent. Compared with previous debiasing methods, our method requires no demographic annotations. Thus, it can mitigate the bias among demographic groups divided by various attributes, and these attributes need not be predefined before training. Extensive experimental results on popular benchmarks demonstrate the superiority of our method over state-of-the-art competitors. Code and trained models are available at https://github.com/Tencent/TFace., Comment: CVPR2021 [A minimal sketch of the instance-FPR computation follows this entry.]
- Published
- 2021
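The instance-FPR definition in this abstract is easy to make concrete. A NumPy sketch on toy similarity scores, assuming the unified threshold is taken as the (1 - total FPR) quantile of all non-target similarities (the paper's estimator may differ):

```python
import numpy as np

def instance_fpr(non_target_sims: np.ndarray, threshold: float) -> float:
    """Instance FPR per the abstract: the fraction of this instance's
    non-target similarities that exceed the unified threshold."""
    return float(np.mean(non_target_sims > threshold))

# The unified threshold is the similarity whose exceedance rate over ALL
# non-target pairs equals a chosen total FPR (here 1e-3); a quantile
# estimate is one plausible way to obtain it.
all_non_target = np.random.randn(100_000) * 0.1   # toy similarity scores
total_fpr = 1e-3
threshold = np.quantile(all_non_target, 1.0 - total_fpr)

sims_i = np.random.randn(512) * 0.1               # one instance's non-target sims
fpr_i = instance_fpr(sims_i, threshold)
penalty = fpr_i / total_fpr                       # ratio that scales the extra
print(threshold, fpr_i, penalty)                  # softmax-denominator term
```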
15. Adaptive Feature Alignment for Adversarial Training
- Author
-
Wang, Tao, Zhang, Ruixin, Chen, Xingyu, Zhao, Kai, Huang, Xiaolin, Huang, Yuge, Li, Shaoxin, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent studies reveal that Convolutional Neural Networks (CNNs) are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. Many adversarial defense methods improve robustness at the cost of accuracy, raising the contradiction between standard and adversarial accuracies. In this paper, we observe an interesting phenomenon: feature statistics change monotonically and smoothly w.r.t. the rising attacking strength. Based on this observation, we propose adaptive feature alignment (AFA), which is trained to automatically align features of arbitrary attacking strengths by predicting a fusing weight in a dual-BN architecture. Unlike previous works that need to either retrain the model or manually tune hyper-parameters for different attacking strengths, our method can deal with arbitrary attacking strengths with a single model without introducing any hyper-parameter. Importantly, our method improves model robustness against adversarial samples without incurring much loss in standard accuracy. Experiments on the CIFAR-10, SVHN, and tiny-ImageNet datasets demonstrate that our method outperforms the state-of-the-art under a wide range of attacking strengths. [A minimal sketch of the dual-BN fusion follows this entry.]
- Published
- 2021
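The dual-BN fusion at the heart of AFA can be sketched in a few lines. In this hypothetical PyTorch module the gate that predicts the fusing weight is my own assumption; the paper's predictor is more involved:

```python
import torch
import torch.nn as nn

class DualBN(nn.Module):
    """Two BN branches (clean / adversarial statistics) fused by a weight.

    A minimal sketch of the dual-BN idea in the abstract; in AFA the
    fusing weight is predicted from the input, so a small gate head
    (an assumption of mine) is included here.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(channels)
        self.bn_adv = nn.BatchNorm2d(channels)
        self.gate = nn.Sequential(              # predicts fusing weight in [0, 1]
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.gate(x).view(-1, 1, 1, 1)
        return w * self.bn_clean(x) + (1 - w) * self.bn_adv(x)

x = torch.randn(4, 32, 16, 16)
print(DualBN(32)(x).shape)                      # torch.Size([4, 32, 16, 16])
```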
16. Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model
- Author
-
Zhang, Jiangning, Xu, Chao, Li, Jian, Chen, Wenzhou, Wang, Yabiao, Tai, Ying, Chen, Shuo, Wang, Chengjie, Huang, Feiyue, and Liu, Yong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Inspired by biological evolution, we explain the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA), and show that the two share a consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and design task-related heads to deal with different tasks more flexibly. Moreover, we introduce a space-filling curve into the current vision transformer to sequence image data into a uniform sequential format. Thus we can design a unified EAT framework to address multi-modal tasks, separating the network architecture from the data format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works, while having fewer parameters and greater throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, where our approach improves rank-1 by +3.7 points over the baseline on the CSS dataset.
- Published
- 2021
17. Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing
- Author
-
Chen, Zhihong, Yao, Taiping, Sheng, Kekai, Ding, Shouhong, Tai, Ying, Li, Jilin, Huang, Feiyue, and Jin, Xinyu
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Face anti-spoofing based on domain generalization (DG) has drawn growing attention due to its robustness to unseen scenarios. Existing DG methods assume that the domain label is known. However, in real-world applications, the collected dataset always contains mixture domains, where the domain label is unknown. In this case, most existing methods may not work. Further, even if we could obtain the domain label as existing methods assume, we argue this is merely a sub-optimal partition. To overcome this limitation, we propose domain dynamic adjustment meta-learning (D2AM) without using domain labels, which iteratively divides mixture domains via discriminative domain representation and trains a generalizable face anti-spoofing model with meta-learning. Specifically, we design a domain feature based on Instance Normalization (IN) and propose a domain representation learning module (DRLM) to extract discriminative domain features for clustering. Moreover, to reduce the side effect of outliers on clustering performance, we additionally utilize maximum mean discrepancy (MMD) to align the distribution of sample features to a prior distribution, which improves the reliability of clustering. Extensive experiments show that the proposed method outperforms conventional DG-based face anti-spoofing methods, including those utilizing domain labels. Furthermore, we enhance interpretability through visualization., Comment: Accepted for publication in AAAI2021
- Published
- 2021
18. ISTR: End-to-End Instance Segmentation with Transformers
- Author
-
Hu, Jie, Cao, Liujuan, Lu, Yao, Zhang, ShengChuan, Wang, Yan, Li, Ke, Huang, Feiyue, Shao, Ling, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
End-to-end paradigms significantly improve the accuracy of various deep-learning-based computer vision models. To this end, tasks like object detection have been upgraded by replacing non-end-to-end components, such as removing non-maximum suppression by training with a set loss based on bipartite matching. However, such an upgrade is not applicable to instance segmentation, due to its significantly higher output dimensions compared to object detection. In this paper, we propose an instance segmentation Transformer, termed ISTR, which is the first end-to-end framework of its kind. ISTR predicts low-dimensional mask embeddings, and matches them with ground truth mask embeddings for the set loss. Besides, ISTR concurrently conducts detection and segmentation with a recurrent refinement strategy, which provides a new way to achieve instance segmentation compared to the existing top-down and bottom-up frameworks. Benefiting from the proposed end-to-end mechanism, ISTR demonstrates state-of-the-art performance even with approximation-based suboptimal embeddings. Specifically, ISTR obtains a 46.8/38.6 box/mask AP using ResNet50-FPN, and a 48.1/39.9 box/mask AP using ResNet101-FPN, on the MS COCO dataset. Quantitative and qualitative results reveal the promising potential of ISTR as a solid baseline for instance-level recognition. Code has been made available at: https://github.com/hujiecpp/ISTR.
- Published
- 2021
19. Black-Box Dissector: Towards Erasing-based Hard-Label Model Stealing Attack
- Author
-
Wang, Yixu, Li, Jie, Liu, Hong, Wang, Yan, Wu, Yongjian, Huang, Feiyue, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Previous studies have verified that the functionality of black-box models can be stolen with full probability outputs. However, under the more practical hard-label setting, we observe that existing methods suffer from catastrophic performance degradation. We argue this is due to the lack of rich information in the probability prediction and the overfitting caused by hard labels. To this end, we propose a novel hard-label model stealing method termed black-box dissector, which consists of two erasing-based modules. One is a CAM-driven erasing strategy designed to increase the information capacity hidden in hard labels from the victim model. The other is a random-erasing-based self-knowledge distillation module that utilizes soft labels from the substitute model to mitigate overfitting. Extensive experiments on four widely-used datasets consistently demonstrate that our method outperforms state-of-the-art methods, with an improvement of up to 8.27%. We also validate the effectiveness and practical potential of our method on real-world APIs and against defense methods. Furthermore, our method benefits other downstream tasks, i.e., transfer adversarial attacks.
- Published
- 2021
20. Delving into Data: Effectively Substitute Training for Black-box Attack
- Author
-
Wang, Wenxuan, Yin, Bangjie, Yao, Taiping, Zhang, Li, Fu, Yanwei, Ding, Shouhong, Li, Jilin, Huang, Feiyue, and Xue, Xiangyang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Deep models have shown their vulnerability when processing adversarial samples. For black-box attacks, without access to the architecture and weights of the attacked model, training a substitute model for adversarial attacks has attracted wide attention. Previous substitute training approaches focus on stealing the knowledge of the target model based on real training data or synthetic data, without exploring what kind of data can further improve the transferability between the substitute and target models. In this paper, we propose a novel perspective on substitute training that focuses on designing the distribution of data used in the knowledge stealing process. More specifically, a diverse data generation module is proposed to synthesize large-scale data with wide distribution, and an adversarial substitute training strategy is introduced to focus on the data distributed near the decision boundary. The combination of these two modules further boosts the consistency of the substitute model and target model, which greatly improves the effectiveness of the adversarial attack. Extensive experiments demonstrate the efficacy of our method against state-of-the-art competitors under non-target and target attack settings. Detailed visualization and analysis are also provided to help understand the advantage of our method., Comment: 10 pages, 6 figures, 6 tables, 1 algorithm, To appear in CVPR 2021 as a poster paper
- Published
- 2021
21. Carrying out CNN Channel Pruning in a White Box
- Author
-
Zhang, Yuxin, Lin, Mingbao, Lin, Chia-Wen, Chen, Jie, Huang, Feiyue, Wu, Yongjian, Tian, Yonghong, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Channel pruning has long been studied to compress CNNs, as it significantly reduces the overall computation. Prior works implement channel pruning in an unexplainable manner, which tends to reduce the final classification error while failing to consider the internal influence of each channel. In this paper, we conduct channel pruning in a white box. Through deep visualization of feature maps activated by different channels, we observe that different channels contribute differently to different categories in image classification. Inspired by this, we choose to preserve channels contributing to most categories. Specifically, to model the contribution of each channel to differentiating categories, we develop a class-wise mask for each channel, implemented in a dynamic training manner w.r.t. the input image's category. On the basis of the learned class-wise masks, we perform a global voting mechanism to remove channels with less category discrimination. Lastly, a fine-tuning process is conducted to recover the performance of the pruned model. To the best of our knowledge, this is the first time that CNN interpretability theory has been used to guide channel pruning. Extensive experiments on representative image classification tasks demonstrate the superiority of our White-Box over many state-of-the-art methods. For instance, on CIFAR-10, it reduces FLOPs by 65.23% with even a 0.62% accuracy improvement for ResNet-110. On ILSVRC-2012, White-Box achieves a 45.6% FLOPs reduction with only a small 0.83% loss in top-1 accuracy for ResNet-50., Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS) [A minimal sketch of the voting step follows this entry.]
- Published
- 2021
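The global voting step can be illustrated directly from the abstract's description. A NumPy sketch, assuming soft class-wise masks thresholded at 0.5 (both the threshold and the keep ratio are my assumptions):

```python
import numpy as np

def vote_channels(class_masks: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Global voting over learned class-wise masks.

    class_masks: (num_channels, num_classes) soft masks in [0, 1]; a
    channel "votes" for a category when its mask exceeds 0.5. Channels
    contributing to the most categories are preserved, matching the
    abstract's criterion.
    """
    votes = (class_masks > 0.5).sum(axis=1)     # categories per channel
    keep = int(len(votes) * keep_ratio)
    return np.argsort(votes)[::-1][:keep]       # indices of kept channels

masks = np.random.rand(64, 10)                  # toy masks: 64 channels, 10 classes
print(vote_channels(masks, keep_ratio=0.25))    # 16 most category-discriminative
```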
22. Learning Dynamic Alignment via Meta-filter for Few-shot Learning
- Author
-
Xu, Chengming, Liu, Chen, Zhang, Li, Wang, Chengjie, Li, Jilin, Huang, Feiyue, Xue, Xiangyang, and Fu, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Few-shot learning (FSL), which aims to recognise new classes by adapting learned knowledge with extremely limited few-shot (support) examples, remains an important open problem in computer vision. Most existing methods for feature alignment in few-shot learning consider only image-level or spatial-level alignment while omitting the channel disparity. Our insight is that these methods lead to poor adaptation with redundant matching, and that leveraging channel-wise adjustment is the key to well adapting the learned knowledge to new classes. Therefore, in this paper, we propose to learn a dynamic alignment, which can effectively highlight both query regions and channels according to different local support information. Specifically, this is achieved by first dynamically sampling the neighbourhood of the feature position conditioned on the input few-shot examples, based on which we further predict a position-dependent and channel-dependent Dynamic Meta-filter. The filter is used to align the query feature with position-specific and channel-specific knowledge. Moreover, we adopt Neural Ordinary Differential Equations (ODE) to enable more accurate control of the alignment. In this sense our model is able to better capture the fine-grained semantic context of the few-shot example and thus facilitates dynamic knowledge adaptation for few-shot learning. The resulting framework establishes new state-of-the-art results on major few-shot visual recognition benchmarks, including miniImageNet and tieredImageNet., Comment: accepted by CVPR2021
- Published
- 2021
23. On Evolving Attention Towards Domain Adaptation
- Author
-
Sheng, Kekai, Li, Ke, Zheng, Xiawu, Liang, Jian, Dong, Weiming, Huang, Feiyue, Ji, Rongrong, and Sun, Xing
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Towards better unsupervised domain adaptation (UDA), researchers have recently proposed various domain-conditioned attention modules and made promising progress. However, considering that the configuration of attention, i.e., the type and the position of the attention module, affects performance significantly, a more general approach is to optimize the attention configuration automatically so that it is specialized for an arbitrary UDA scenario. For the first time, this paper proposes EvoADA: a novel framework to evolve the attention configuration for a given UDA task without human intervention. In particular, we propose a novel search space containing diverse attention configurations. Then, to evaluate the attention configurations and make the search procedure UDA-oriented (transferability + discrimination), we apply a simple and effective evaluation strategy: 1) training the network weights on two domains with off-the-shelf domain adaptation methods; 2) evolving the attention configurations under the guidance of the discriminative ability on the target domain. Experiments on various kinds of cross-domain benchmarks, i.e., Office-31, Office-Home, CUB-Paintings, and Duke-Market-1510, reveal that the proposed EvoADA consistently boosts multiple state-of-the-art domain adaptation approaches, and the optimal attention configurations help them achieve better performance., Comment: Among the first to study arbitrary domain adaptation from the perspective of network architecture design
- Published
- 2021
24. Learning Salient Boundary Feature for Anchor-free Temporal Action Localization
- Author
-
Lin, Chuming, Xu, Chengming, Luo, Donghao, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Fu, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Temporal action localization is an important yet challenging task in video understanding. Typically, such a task aims at inferring both the action category and the localization of the start and end frames for each action instance in a long, untrimmed video. While most current models achieve good results by using pre-defined anchors and numerous actionness scores, such methods are burdened with a large number of outputs and heavy tuning of locations and sizes corresponding to different anchors. In contrast, anchor-free methods are lighter, getting rid of redundant hyper-parameters, but have received little attention. In this paper, we propose the first purely anchor-free temporal localization method, which is both efficient and effective. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module that gathers more valuable boundary features for each proposal with a novel boundary pooling, and (iii) several consistency constraints to make sure our model can find the accurate boundary given arbitrary proposals. Extensive experiments show that our method beats all anchor-based and actionness-guided methods by a remarkable margin on THUMOS14, achieving state-of-the-art results, and comparable ones on ActivityNet v1.3. Code is available at https://github.com/TencentYoutuResearch/ActionDetection-AFSD., Comment: Accepted by CVPR2021
- Published
- 2021
25. Learning Comprehensive Motion Representation for Action Recognition
- Author
-
Wu, Mingyu, Jiang, Boyuan, Luo, Donghao, Yan, Junchi, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Yang, Xiaokang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
For action recognition learning, 2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame. Recent efforts attempt to capture motion information by establishing inter-frame connections, while still suffering from a limited temporal receptive field or high latency. Moreover, feature enhancement in action recognition is often performed along only the channel or spatial dimension. To address these issues, we first devise a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector. The channel gates generated by CME incorporate information from all the other frames in the video. We further propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps. The intuition is that the background typically changes more slowly than the motion area. Both CME and SME have clear physical meaning in capturing action clues. By integrating the two modules into an off-the-shelf 2D network, we obtain a Comprehensive Motion Representation (CMR) learning method for action recognition, which achieves competitive performance on Something-Something V1 & V2 and Kinetics-400. On the temporal reasoning datasets Something-Something V1 and V2, our method outperforms the current state-of-the-art by 2.3% and 1.9% respectively when using 16 frames as input., Comment: Accepted by AAAI21 [A minimal sketch of the channel-gating idea follows this entry.]
- Published
- 2021
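The channel-wise gate vector described for CME follows the familiar squeeze-and-gate pattern. A minimal PyTorch sketch under my own simplifications; the real CME module aggregates cross-frame information more carefully:

```python
import torch
import torch.nn as nn

class ChannelMotionGate(nn.Module):
    """Channel-wise gating from cross-frame context, in the spirit of CME.

    Per-channel statistics are pooled over space and over all frames,
    then turned into a sigmoid gate, one weight per channel.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        ctx = clip.mean(dim=(1, 3, 4))          # pool over time and space -> (B, C)
        gate = torch.sigmoid(self.fc(ctx))      # (B, C)
        return clip * gate[:, None, :, None, None]

clip = torch.randn(2, 8, 32, 14, 14)
print(ChannelMotionGate(32)(clip).shape)        # torch.Size([2, 8, 32, 14, 14])
```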
26. Unveiling the Potential of Structure Preserving for Weakly Supervised Object Localization
- Author
-
Pan, Xingjia, Gao, Yingguo, Lin, Zhiwen, Tang, Fan, Dong, Weiming, Yuan, Haolei, Huang, Feiyue, and Xu, Changsheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Weakly supervised object localization (WSOL) remains an open problem given the difficulty of finding object extent information using a classification network. Although prior works have striven to localize objects through various spatial regularization strategies, we argue that extracting object structural information from the trained classification network has been neglected. In this paper, we propose a two-stage approach, termed structure-preserving activation (SPA), to fully leverage the structure information incorporated in convolutional features for WSOL. First, a restricted activation module (RAM) is designed to alleviate the structure-missing issue caused by the classification network, based on the observation that the unbounded classification map and global average pooling layer drive the network to focus only on object parts. Second, we design a post-processing approach, termed the self-correlation map generating (SCG) module, to obtain structure-preserving localization maps on the basis of the activation maps acquired in the first stage. Specifically, we utilize the high-order self-correlation (HSC) to extract the inherent structural information retained in the learned model and then aggregate the HSC of multiple points for precise object localization. Extensive experiments on two publicly available benchmarks, CUB-200-2011 and ILSVRC, show that the proposed SPA achieves substantial and consistent performance gains compared with baseline approaches. Code and models are available at https://github.com/Panxjia/SPA_CVPR2021, Comment: Accepted by CVPR2021
- Published
- 2021
27. Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval with Partial Query
- Author
-
Cai, Guanyu, Zhang, Jun, Jiang, Xinyang, Gong, Yifei, He, Lianghua, Yu, Fufu, Peng, Pai, Guo, Xiaowei, Huang, Feiyue, and Sun, Xing
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Text-based image retrieval has seen considerable progress in recent years. However, the performance of existing methods suffers in real life, since a user is likely to provide an incomplete description of an image, which often leads to results filled with false positives that fit the incomplete description. In this work, we introduce the partial-query problem and extensively analyze its influence on text-based image retrieval. Previous interactive methods tackle the problem by passively receiving users' feedback to supplement the incomplete query iteratively, which is time-consuming and requires heavy user effort. Instead, we propose a novel retrieval framework that conducts the interactive process in an Ask-and-Confirm fashion, where the AI actively searches for discriminative details missing from the current query, and users only need to confirm the AI's proposals. Specifically, we propose an object-based interaction to make interactive retrieval more user-friendly and present a reinforcement-learning-based policy to search for discriminative objects. Furthermore, since fully-supervised training is often infeasible due to the difficulty of obtaining human-machine dialog data, we present a weakly-supervised training strategy that needs no human-annotated dialogs other than a text-image dataset. Experiments show that our framework significantly improves the performance of text-based image retrieval. Code is available at https://github.com/CuthbertCai/Ask-Confirm., Comment: Accepted by ICCV2021
- Published
- 2021
28. Image-to-image Translation via Hierarchical Style Disentanglement
- Author
-
Li, Xinyang, Zhang, Shengchuan, Hu, Jie, Cao, Liujuan, Hong, Xiaopeng, Mao, Xudong, Huang, Feiyue, Wu, Yongjian, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, image-to-image translation has made significant progress in achieving both multi-label (i.e., translation conditioned on different labels) and multi-style (i.e., generation with diverse styles) tasks. However, due to the unexplored independence and exclusiveness among the labels, existing endeavors are defeated by uncontrolled manipulations in the translation results. In this paper, we propose Hierarchical Style Disentanglement (HiSD) to address this issue. Specifically, we organize the labels into a hierarchical tree structure, in which independent tags, exclusive attributes, and disentangled styles are allocated from top to bottom. Correspondingly, a new translation process is designed to adapt to the above structure, in which the styles are identified for controllable translations. Both qualitative and quantitative results on the CelebA-HQ dataset verify the ability of the proposed HiSD. We hope our method will serve as a solid baseline and provide fresh insights with the hierarchically organized annotations for future research in image-to-image translation. The code has been released at https://github.com/imlixinyang/HiSD., Comment: CVPR 2021
- Published
- 2021
29. DeeperForensics Challenge 2020 on Real-World Face Forgery Detection: Methods and Results
- Author
-
Jiang, Liming, Guo, Zhengkui, Wu, Wayne, Liu, Zhaoyang, Liu, Ziwei, Loy, Chen Change, Yang, Shuo, Xiong, Yuanjun, Xia, Wei, Chen, Baoying, Zhuang, Peiyu, Li, Sili, Chen, Shen, Yao, Taiping, Ding, Shouhong, Li, Jilin, Huang, Feiyue, Cao, Liujuan, Ji, Rongrong, Lu, Changlei, and Tan, Ganchao
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Machine Learning - Abstract
This paper reports methods and results in the DeeperForensics Challenge 2020 on real-world face forgery detection. The challenge employs the DeeperForensics-1.0 dataset, one of the most extensive publicly available real-world face forgery detection datasets, with 60,000 videos constituted by a total of 17.6 million frames. The model evaluation is conducted online on a high-quality hidden test set with multiple sources and diverse distortions. A total of 115 participants registered for the competition, and 25 teams made valid submissions. We will summarize the winning solutions and present some discussions on potential research directions., Comment: Technical report. Challenge website: https://competitions.codalab.org/competitions/25228
- Published
- 2021
30. Aurora Guard: Reliable Face Anti-Spoofing via Mobile Lighting System
- Author
-
Zhang, Jian, Tai, Ying, Yao, Taiping, Meng, Jia, Ding, Shouhong, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Face authentication on mobile devices has been widely applied in various scenarios. Despite the increasing reliability of cutting-edge face authentication/verification systems to variations like eye blinking and subtle facial expressions, anti-spoofing against high-resolution rendered replays of paper photos or digital videos remains an open problem. In this paper, we propose a simple yet effective face anti-spoofing system, termed Aurora Guard (AG). Our system first extracts the normal cues via light reflection analysis, and then adopts an end-to-end trainable multi-task Convolutional Neural Network (CNN) to accurately recover the subject's intrinsic depth and material map to assist liveness classification, along with a light CAPTCHA checking mechanism in the regression branch to further improve system reliability. Experiments on the public Replay-Attack and CASIA datasets demonstrate the merits of our proposed method over the state-of-the-art. We also conduct extensive experiments on a large-scale dataset containing 12,000 live and diverse spoofing samples, which further validates the generalization ability of our method in the wild., Comment: arXiv admin note: substantial text overlap with arXiv:1902.10311
- Published
- 2021
31. Network Pruning using Adaptive Exemplar Filters
- Author
-
Lin, Mingbao, Ji, Rongrong, Li, Shaojie, Wang, Yan, Wu, Yongjian, Huang, Feiyue, and Ye, Qixiang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Popular network pruning algorithms reduce redundant information by optimizing hand-crafted models, which may cause suboptimal performance and take a long time to select filters. We innovatively introduce adaptive exemplar filters to simplify the algorithm design, resulting in an automatic and efficient pruning approach called EPruner. Inspired by the face recognition community, we use the message-passing algorithm Affinity Propagation on the weight matrices to obtain an adaptive number of exemplars, which then act as the preserved filters. EPruner breaks the dependency on the training data in determining the "important" filters and allows a CPU implementation that runs in seconds, an order of magnitude faster than GPU-based SOTAs. Moreover, we show that the weights of the exemplars provide a better initialization for fine-tuning. On VGGNet-16, EPruner achieves a 76.34% FLOPs reduction by removing 88.80% of parameters, with a 0.06% accuracy improvement on CIFAR-10. On ResNet-152, EPruner achieves a 65.12% FLOPs reduction by removing 64.18% of parameters, with only a 0.71% top-5 accuracy loss on ILSVRC-2012. Our code is available at https://github.com/lmbxmu/EPruner., Comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems (IEEE TNNLS) [A minimal sketch of the exemplar selection follows this entry.]
- Published
- 2021
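The exemplar-selection step maps directly onto scikit-learn's Affinity Propagation. A minimal sketch on random weights; treating each flattened filter as a sample is my reading of the abstract, not necessarily the paper's exact setup:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def exemplar_filters(weight: np.ndarray) -> np.ndarray:
    """Pick preserved filters as Affinity Propagation exemplars.

    weight: (out_channels, in_channels, k, k) conv weights. Each filter
    is flattened to a vector; AP chooses the number of exemplars on its
    own, which is the 'adaptive' property the abstract highlights.
    """
    flat = weight.reshape(weight.shape[0], -1)
    ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=0).fit(flat)
    return ap.cluster_centers_indices_          # indices of exemplar filters

w = np.random.randn(64, 16, 3, 3)
keep = exemplar_filters(w)
print(len(keep), keep[:5])                      # adaptive number of kept filters
```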
32. Dual-Level Collaborative Transformer for Image Captioning
- Author
-
Luo, Yunpeng, Ji, Jiayi, Sun, Xiaoshuai, Cao, Liujuan, Wu, Yongjian, Huang, Feiyue, Lin, Chia-Wen, and Ji, Rongrong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two kinds of features. Concretely, in DLCT these two features are first processed by a novel Dual-way Self-Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noise caused by the direct fusion of these two features, where a geometric alignment graph is constructed to accurately align and reinforce region and grid features. To validate our model, we conduct extensive experiments on the highly competitive MS-COCO dataset, and achieve new state-of-the-art performance on both the local and online test sets, i.e., 133.8% CIDEr-D on the Karpathy split and 135.4% CIDEr on the official split. Code is available at https://github.com/luo3300612/image-captioning-DLCT., Comment: AAAI 2021
- Published
- 2021
33. Frequency Consistent Adaptation for Real World Super Resolution
- Author
-
Ji, Xiaozhong, Tao, Guangpin, Cao, Yun, Tai, Ying, Lu, Tong, Wang, Chengjie, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent deep-learning based Super-Resolution (SR) methods have achieved remarkable performance on images with known degradation. However, these methods often fail in real-world scenes, since Low-Resolution (LR) images produced by an idealized degradation (e.g., bicubic down-sampling) deviate from the real source domain. The domain gap between such LR images and real-world images can be observed clearly in frequency density, which inspires us to explicitly narrow the undesired gap caused by incorrect degradation. From this point of view, we design a novel Frequency Consistent Adaptation (FCA) that ensures frequency-domain consistency when applying existing SR methods to the real scene. We estimate degradation kernels from unsupervised images and generate the corresponding LR images. To provide useful gradient information for kernel estimation, we propose a Frequency Density Comparator (FDC) that distinguishes the frequency density of images at different scales. Based on the domain-consistent LR-HR pairs, we train easily implemented Convolutional Neural Network (CNN) SR models. Extensive experiments show that the proposed FCA improves the performance of the SR model in real-world settings, achieving state-of-the-art results with high fidelity and plausible perception, thus providing a novel effective framework for real-world SR applications.
- Published
- 2020
34. Effective Label Propagation for Discriminative Semi-Supervised Domain Adaptation
- Author
-
Huang, Zhiyong, Sheng, Kekai, Dong, Weiming, Mei, Xing, Ma, Chongyang, Huang, Feiyue, Zhou, Dengwen, and Xu, Changsheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Semi-supervised domain adaptation (SSDA) methods have demonstrated great potential in large-scale image classification tasks when massive labeled data are available in the source domain but very few labeled samples are provided in the target domain. Existing solutions usually focus on feature alignment between the two domains while paying little attention to the discrimination capability of learned representations in the target domain. In this paper, we present a novel and effective method, namely Effective Label Propagation (ELP), to tackle this problem by using effective inter-domain and intra-domain semantic information propagation. For inter-domain propagation, we propose a new cycle discrepancy loss to encourage consistency of semantic information between the two domains. For intra-domain propagation, we propose an effective self-training strategy to mitigate the noise in pseudo-labeled target domain data and improve feature discriminability in the target domain. As a general method, our ELP can be easily applied to various domain adaptation approaches and can facilitate their feature discrimination in the target domain. Experiments on the Office-Home and DomainNet benchmarks show that ELP consistently improves the classification accuracy of mainstream SSDA methods by 2%~3%. Additionally, ELP improves the performance of UDA methods as well (81.5% vs. 86.1%) in UDA experiments on the VisDA-2017 benchmark. Our source code and pre-trained models will be released soon.
- Published
- 2020
35. Fast Class-wise Updating for Online Hashing
- Author
-
Lin, Mingbao, Ji, Rongrong, Sun, Xiaoshuai, Zhang, Baochang, Huang, Feiyue, Tian, Yonghong, and Tao, Dacheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Information Retrieval - Abstract
Online image hashing has received increasing research attention recently; it processes large-scale data in a streaming fashion to update the hash functions on the fly. Most existing works exploit this problem under a supervised setting, i.e., using class labels to boost the hashing performance, which suffers from defects in both adaptivity and efficiency: First, large amounts of training batches are required to learn up-to-date hash functions, which leads to poor online adaptivity. Second, the training is time-consuming, which contradicts the core need of online learning. In this paper, a novel supervised online hashing scheme, termed Fast Class-wise Updating for Online Hashing (FCOH), is proposed to address the above two challenges by introducing a novel and efficient inner product operation. To achieve fast online adaptivity, a class-wise updating method is developed to decompose the binary code learning and alternately renew the hash functions in a class-wise fashion, which well addresses the burden of large amounts of training batches. Quantitatively, such a decomposition further leads to at least 75% storage saving. To further achieve online efficiency, we propose a semi-relaxation optimization, which accelerates the online training by treating different binary constraints independently. Without additional constraints and variables, the time complexity is significantly reduced. Such a scheme is also quantitatively shown to well preserve past information while updating the hashing functions. We quantitatively demonstrate that the collective effort of class-wise updating and semi-relaxation optimization provides superior performance compared to various state-of-the-art methods, as verified through extensive experiments on three widely-used datasets., Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
- Published
- 2020
36. Adversarial Refinement Network for Human Motion Prediction
- Author
-
Chao, Xianjin, Bin, Yanrui, Chu, Wenqing, Cao, Xuan, Ge, Yanhao, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Leung, Howard
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Human motion prediction aims to predict future 3D skeletal sequences given a limited human motion sequence as input. Two popular methods, recurrent neural networks and feed-forward deep networks, are able to predict the rough motion trend, but motion details such as limb movement may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) following a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and the coarse prediction as input to our cascaded refinement network to predict refined human motion, and strengthen the refinement network with adversarial error augmentation. During training, we deliberately introduce the error distribution by learning through the adversarial mechanism among different subjects. In testing, our cascaded refinement network robustly alleviates the prediction error from the coarse predictor, resulting in a finer prediction. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions in both short-term and long-term predictions., Comment: Accepted by ACCV 2020 (Oral)
- Published
- 2020
37. Rotated Binary Neural Network
- Author
-
Lin, Mingbao, Ji, Rongrong, Xu, Zihan, Zhang, Baochang, Wang, Yan, Wu, Yongjian, Huang, Feiyue, and Lin, Chia-Wen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Binary Neural Networks (BNNs) excel at reducing the complexity of deep neural networks. However, they suffer severe performance degradation. One of the major impediments is the large quantization error between the full-precision weight vector and its binary vector. Previous works focus on compensating for the norm gap while leaving the angular bias hardly touched. In this paper, for the first time, we explore the influence of angular bias on the quantization error and then introduce a Rotated Binary Neural Network (RBNN), which considers the angle alignment between the full-precision weight vector and its binarized version. At the beginning of each training epoch, we propose to rotate the full-precision weight vector towards its binary vector to reduce the angular bias. To avoid the high complexity of learning a large rotation matrix, we further introduce a bi-rotation formulation that learns two smaller rotation matrices. In the training stage, we devise an adjustable rotated weight vector for binarization to escape potential local optima. Our rotation leads to around 50% weight flips, which maximizes the information gain. Finally, we propose a training-aware approximation of the sign function for the backward gradient. Experiments on CIFAR-10 and ImageNet demonstrate the superiority of RBNN over many state-of-the-art methods. Our source code, experimental settings, training logs and binary models are available at https://github.com/lmbxmu/RBNN., Comment: Accepted by NeurIPS2020 (The 34th Conference on Neural Information Processing Systems) [A minimal sketch of the angular bias follows this entry.]
- Published
- 2020
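As a toy illustration of the angular bias RBNN targets, the NumPy snippet below measures the angle between a full-precision weight vector and its sign vector, before and after an orthogonal rotation. The random rotation here only demonstrates the mechanics; RBNN instead learns two smaller rotation matrices whose combination plays this role and is chosen to shrink the angle.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)            # full-precision weight vector

def angular_bias(v):
    """Angle (degrees) between v and its binarization sign(v)."""
    b = np.sign(v)
    cos = v @ b / (np.linalg.norm(v) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# A random orthogonal matrix via QR decomposition, standing in for the
# learned bi-rotation.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
print("angular bias before rotation:", angular_bias(w))
print("angular bias after rotation: ", angular_bias(Q @ w))
```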
38. Removing the Background by Adding the Background: Towards Background Robust Self-supervised Video Representation Learning
- Author
-
Wang, Jinpeng, Gao, Yuting, Li, Ke, Lin, Yiqi, Ma, Andy J., Cheng, Hao, Peng, Pai, Huang, Feiyue, Ji, Rongrong, and Sun, Xing
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Self-supervised learning has shown great potential for improving the video representation ability of deep neural networks by deriving supervision from the data itself. However, some current methods tend to cheat from the background, i.e., the prediction depends heavily on the video background rather than the motion, making the model vulnerable to background changes. To mitigate the model's reliance on the background, we propose to remove the background's impact by adding the background. That is, given a video, we randomly select a static frame and add it to every other frame to construct a distracting video sample. We then force the model to pull the feature of the distracting video and the feature of the original video closer, so that the model is explicitly restricted to resist the background influence and focus more on the motion changes. We term our method \emph{Background Erasing} (BE). Notably, the implementation of our method is simple and neat and can be added to most SOTA methods with little effort. Specifically, BE brings 16.4% and 19.1% improvements with MoCo on the severely biased datasets UCF101 and HMDB51, and a 14.5% improvement on the less biased dataset Diving48., Comment: CVPR 2021 camera-ready. [A toy sketch of the distracting-sample construction follows this entry.]
- Published
- 2020
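The distracting-sample construction in BE is easy to sketch: add one randomly chosen static frame to the frames of the clip, then pull the two clips' features together. The equal-weight blend and the toy `encoder` below are assumptions for illustration; the actual mixing weights and backbone follow the paper.

```python
import torch
import torch.nn.functional as F

def background_erase(video: torch.Tensor) -> torch.Tensor:
    """video: (T, C, H, W). Add one random static frame to every frame."""
    t = torch.randint(video.shape[0], (1,)).item()
    static = video[t:t + 1]             # (1, C, H, W), broadcast over T
    return 0.5 * (video + static)       # simple equal-weight blend

video = torch.rand(16, 3, 112, 112)
distracted = background_erase(video)

# Consistency objective: the encoder should give both clips the same
# feature, so it cannot rely on the (now corrupted) background.
encoder = lambda v: v.mean(dim=(0, 2, 3))   # toy stand-in backbone
loss = 1 - F.cosine_similarity(encoder(video), encoder(distracted), dim=0)
```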
39. Face Anti-Spoofing Via Disentangled Representation Learning
- Author
-
Zhang, Ke-Yue, Yao, Taiping, Zhang, Jian, Tai, Ying, Ding, Shouhong, Li, Jilin, Huang, Feiyue, Song, Haichuan, and Ma, Lizhuang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Face anti-spoofing is crucial to the security of face recognition systems. Previous approaches focus on developing discriminative models based on features extracted from images, in which spoof patterns and identity content may still be entangled. In this paper, motivated by disentangled representation learning, we propose a novel perspective on face anti-spoofing that disentangles liveness features and content features from images, with the liveness features then used for classification. We also put forward a Convolutional Neural Network (CNN) architecture with a disentanglement process and a combination of low-level and high-level supervision to improve generalization. We evaluate our method on public benchmark datasets, and extensive experimental results demonstrate its effectiveness against state-of-the-art competitors. Finally, we visualize some results to help understand the effect and advantage of disentanglement., Comment: To appear in ECCV 2020. [A generic sketch of the feature split follows this entry.]
- Published
- 2020
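A generic PyTorch sketch of the liveness/content split described above; the encoder, layer sizes, and the 32-dimensional liveness slice are all made up for illustration, and the paper's low-/high-level supervision is omitted.

```python
import torch
import torch.nn as nn

class DisentangleFAS(nn.Module):
    """Split the latent into liveness and content; classify on liveness."""
    def __init__(self, latent=128, live_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, latent))
        self.live_dim = live_dim
        self.classifier = nn.Linear(live_dim, 2)   # live vs. spoof

    def forward(self, x):
        z = self.encoder(x)
        z_live, z_content = z[:, :self.live_dim], z[:, self.live_dim:]
        return self.classifier(z_live), z_content

model = DisentangleFAS()
logits, content = model(torch.rand(4, 3, 64, 64))
```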
40. Adversarial Semantic Data Augmentation for Human Pose Estimation
- Author
-
Bin, Yanrui, Cao, Xuan, Chen, Xinya, Ge, Yanhao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, Gao, Changxin, and Sang, Nong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Human pose estimation is the task of localizing body keypoints in still images. State-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and nearby persons. To enlarge the set of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts at various semantic granularities. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict a tailored pasting configuration. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the discriminator's loss, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks. [A toy sketch of the semantic pasting step follows this entry.]
- Published
- 2020
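The core SDA operation is pasting a segmented body part onto a training image. Below is a minimal NumPy sketch of that step; in ASDA the pasting configuration (which part, where, at what scale and rotation) would be predicted by the adversarial generator rather than passed in by hand.

```python
import numpy as np

def paste_part(image, part_rgba, y, x):
    """Alpha-composite a segmented body-part crop onto `image` at (y, x)."""
    h, w = part_rgba.shape[:2]
    region = image[y:y + h, x:x + w]
    alpha = part_rgba[..., 3:4] / 255.0      # segmentation mask as alpha
    region[:] = alpha * part_rgba[..., :3] + (1 - alpha) * region
    return image

img = np.zeros((256, 256, 3), dtype=np.float32)
arm = np.full((40, 20, 4), 255.0, dtype=np.float32)  # dummy RGBA part
# (y, x) is the configuration an adversarial generator would choose.
augmented = paste_part(img, arm, y=100, x=120)
```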
41. Dense Scene Multiple Object Tracking with Box-Plane Matching
- Author
-
Peng, Jinlong, Gu, Yueyang, Wang, Yabiao, Wang, Chengjie, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Multiple Object Tracking (MOT) is an important task in computer vision. MOT remains challenging due to occlusion, especially in dense scenes. Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes. First, we design the Layer-wise Aggregation Discriminative Model (LADM) to filter noisy detections. Then, to associate the remaining detections correctly, we introduce the Global Attention Feature Model (GAFM) to extract appearance features and use them to compute the appearance similarity between history tracklets and current detections. Finally, we propose the Box-Plane Matching strategy to achieve data association according to the motion similarity and appearance similarity between tracklets and detections. With these three modules, our team achieved 1st place on the Track-1 leaderboard of the ACM MM Grand Challenge HiEve 2020., Comment: ACM Multimedia 2020 GC paper. ACM Multimedia Grand Challenge HiEve 2020 Track-1 Winner. [A toy sketch of the association step follows this entry.]
- Published
- 2020
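A toy sketch of the final association step under assumed inputs: given tracklet-detection motion and appearance similarity matrices (appearance as GAFM would supply), matches are found with the Hungarian algorithm. The linear weighting and threshold are guesses, not the paper's exact Box-Plane Matching formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_sim, appear_sim, alpha=0.5, min_sim=0.3):
    """Match tracklets (rows) to detections (cols) on combined similarity."""
    sim = alpha * motion_sim + (1 - alpha) * appear_sim
    rows, cols = linear_sum_assignment(-sim)     # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]

motion = np.random.rand(5, 6)    # 5 tracklets x 6 detections
appear = np.random.rand(5, 6)
matches = associate(motion, appear)
```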
42. Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking
- Author
-
Peng, Jinlong, Wang, Changan, Wan, Fangbin, Wu, Yang, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, and Fu, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm and conduct object detection, feature extraction and data association separately, or integrate two of the three subtasks to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all three subtasks into an end-to-end solution (the first, as far as we know). It chains paired bounding-box regression results estimated from overlapping nodes, where each node covers two adjacent frames. The paired regression is made attentive by object-attention (from a detection module) and identity-attention (from an ID verification module). The two major novelties, the chained structure and paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on the MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively) without relying on any extra training data. The source code of CTracker can be found at: github.com/pjl1995/CTracker., Comment: European Conference on Computer Vision 2020 (Spotlight). [A toy sketch of the chaining rule follows this entry.]
- Published
- 2020
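The chaining rule is the heart of CTracker: each node outputs a box pair over two adjacent frames, and a track is extended when a pair's second box overlaps the next node's first box. A greedy NumPy sketch under that assumption:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def chain(pairs_t, pairs_t1, thr=0.5):
    """pairs_*: list of (box_frame_a, box_frame_b); greedy IoU chaining."""
    links = []
    for i, (_, tail) in enumerate(pairs_t):
        scores = [iou(tail, head) for head, _ in pairs_t1]
        j = int(np.argmax(scores)) if scores else -1
        if j >= 0 and scores[j] > thr:
            links.append((i, j))
    return links

node_t = [((0, 0, 10, 10), (1, 0, 11, 10))]
node_t1 = [((1, 0, 11, 10), (2, 0, 12, 10))]
print(chain(node_t, node_t1))   # -> [(0, 0)]
```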
43. NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination
- Author
-
Zhou, Penghao, Zhou, Chong, Peng, Pai, Du, Junlong, Sun, Xing, Guo, Xiaowei, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Greedy-NMS inherently poses a dilemma: a lower NMS threshold potentially lowers the recall rate, while a higher threshold introduces more false positives. This problem is more severe in pedestrian detection because instance density varies more widely. However, previous works on NMS either ignore or only vaguely account for the presence of nearby pedestrians. We therefore propose the Nearby Objects Hallucinator (NOH), which pinpoints the objects near each proposal with a Gaussian distribution, together with NOH-NMS, which dynamically eases suppression in regions likely to contain other objects. Compared to Greedy-NMS, our method, as the state of the art, improves AP by $3.9\%$ to $89.0\%$, Recall by $5.1\%$ to $92.9\%$, and $\text{MR}^{-2}$ by $0.8\%$ to $43.9\%$ on CrowdHuman., Comment: Accepted at the ACM International Conference on Multimedia (ACM MM) 2020. [A simplified sketch of the eased suppression follows this entry.]
- Published
- 2020
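A simplified sketch of the eased suppression: standard greedy NMS, except that an overlapping candidate survives when the kept box's hallucinated nearby-object Gaussian assigns its center high density. The density threshold and the hard keep/suppress decision are simplifications of the paper's scheme.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def center(b):
    return np.array([(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0])

def density(x, mu, sigma):
    return np.exp(-np.sum((x - mu) ** 2) / (2.0 * sigma ** 2))

def noh_nms(boxes, scores, noh_mu, noh_sigma, iou_thr=0.5, keep_thr=0.4):
    """noh_mu/noh_sigma: per-box Gaussian over nearby-object locations."""
    order = np.argsort(scores)[::-1]
    alive = np.ones(len(boxes), dtype=bool)
    keep = []
    for i in order:
        if not alive[i]:
            continue
        keep.append(i)
        for j in order:
            if alive[j] and j != i and iou(boxes[i], boxes[j]) > iou_thr:
                # Spare the candidate if a nearby object is predicted there.
                if density(center(boxes[j]), noh_mu[i], noh_sigma[i]) < keep_thr:
                    alive[j] = False
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10]], dtype=float)
print(noh_nms(boxes, np.array([0.9, 0.8]),
              noh_mu=np.array([[6.0, 5.0], [5.0, 5.0]]),
              noh_sigma=np.array([2.0, 2.0])))   # -> [0, 1]: both kept
```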
44. Temporal Distinct Representation Learning for Action Recognition
- Author
-
Weng, Junwu, Luo, Donghao, Wang, Yabiao, Tai, Ying, Wang, Chengjie, Li, Jilin, Huang, Feiyue, Jiang, Xudong, and Yuan, Junsong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Motivated by the success of Two-Dimensional Convolutional Neural Networks (2D CNNs) in image recognition, researchers have endeavored to leverage them to characterize videos. However, one limitation of applying 2D CNNs to videos is that different frames of a video share the same 2D CNN kernels, which may lead to repeated and redundant information utilization, especially in the spatial semantics extraction process, hence neglecting the critical variations among frames. In this paper, we tackle this issue in two ways. 1) We design a sequential channel filtering mechanism, the Progressive Enhancement Module (PEM), to excite the discriminative channels of features from different frames step by step, thus avoiding repeated information extraction. 2) We create a Temporal Diversity Loss (TD Loss) to force the kernels to concentrate on and capture the variations among frames rather than image regions with similar appearance. Our method is evaluated on the benchmark temporal-reasoning datasets Something-Something V1 and V2, where it achieves improvements of 2.4% and 1.3%, respectively, over the best competitor. Performance improvements over 2D-CNN-based state-of-the-art methods on the large-scale Kinetics dataset are also observed., Comment: 16 pages, 4 figures, 7 tables. [A toy sketch of the TD Loss follows this entry.]
- Published
- 2020
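A toy PyTorch version of a Temporal Diversity Loss: penalize the per-channel cosine similarity between features of adjacent frames so that the shared 2D kernels are pushed to respond to frame-to-frame variation. Adjacent-frame pairing is an assumption; the paper's exact pairing and channel selection may differ.

```python
import torch
import torch.nn.functional as F

def td_loss(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, T, C, H, W) frame-wise feature maps."""
    b, t, c, h, w = feats.shape
    x = feats.reshape(b, t, c, h * w)
    # Per-channel similarity between adjacent frames: (B, T-1, C).
    sim = F.cosine_similarity(x[:, :-1], x[:, 1:], dim=-1)
    return sim.mean()        # smaller => more temporally diverse features

loss = td_loss(torch.rand(2, 8, 16, 14, 14))
```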
45. Collaborative Learning for Faster StyleGAN Embedding
- Author
-
Guan, Shanyan, Tai, Ying, Ni, Bingbing, Zhu, Feida, Huang, Feiyue, and Yang, Xiaokang
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The latent code of the recently popular StyleGAN model provides disentangled representations thanks to its multi-layer style-based generator. Embedding a given image back into the latent space of StyleGAN enables a wide range of interesting semantic image-editing applications. Previous works yield impressive inversion results within an optimization framework, which, however, suffers from efficiency issues. In this work, we propose a novel collaborative learning framework that consists of an efficient embedding network and an optimization-based iterator. On one hand, as training progresses, the embedding network gives a reasonable latent-code initialization to the iterator. On the other hand, the updated latent code from the iterator in turn supervises the embedding network. In the end, a high-quality latent code can be obtained efficiently with a single forward pass through our embedding network. Extensive experiments demonstrate the effectiveness and efficiency of our work., Comment: 10 pages, 11 figures. [A toy sketch of the encoder-plus-iterator inversion follows this entry.]
- Published
- 2020
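At inference time, the collaboration above reduces to "encoder warm start, then a short latent optimization". A self-contained PyTorch sketch with toy stand-ins for StyleGAN and the embedding network; the MSE objective is a placeholder for the paper's full reconstruction loss.

```python
import torch

def invert(image, encoder, generator, steps=100, lr=0.01):
    """Encoder provides the initial latent; the iterator refines it."""
    w = encoder(image).detach().requires_grad_(True)   # warm start
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((generator(w) - image) ** 2)  # reconstruction
        loss.backward()
        opt.step()
    return w.detach()

# Toy stand-ins so the sketch runs end to end.
encoder = lambda img: img.mean(dim=(2, 3))              # (B, C)
generator = lambda w: w[:, :, None, None].expand(-1, -1, 8, 8)
latent = invert(torch.rand(1, 512, 8, 8), encoder, generator)
```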
46. ACFD: Asymmetric Cartoon Face Detector
- Author
-
Zhang, Bin, Li, Jian, Wang, Yabiao, Cui, Zhipeng, Xia, Yili, Wang, Chengjie, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Cartoon face detection is more challenging than human face detection because many difficult scenarios are involved. Targeting the characteristics of cartoon faces, such as large intra-face variation, we propose an asymmetric cartoon face detector, named ACFD. It consists of the following modules: a novel backbone, VoVNetV3, comprising several asymmetric one-shot aggregation modules (AOSA); an asymmetric bi-directional feature pyramid network (ABi-FPN); a dynamic anchor matching strategy (DAM); and a corresponding margin binary classification loss (MBC). In particular, to generate features with diverse receptive fields, multi-scale pyramid features are extracted by VoVNetV3 and then fused and enhanced simultaneously by ABi-FPN to handle faces in extreme poses or with disparate aspect ratios. Besides, DAM matches enough high-quality anchors for each face, and MBC provides strong discriminative power. With these modules, our ACFD achieved 1st place on the detection track of the 2020 iCartoon Face Challenge under the constraints of a 200 MB model size, 50 ms inference time per image, and no pretrained models., Comment: 1st place of IJCAI 2020 iCartoon Face Challenge (Detection Track)
- Published
- 2020
47. Arbitrary Style Transfer via Multi-Adaptation Network
- Author
-
Deng, Yingying, Tang, Fan, Dong, Weiming, Sun, Wen, Huang, Feiyue, and Xu, Changsheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Arbitrary style transfer is a significant topic with both research value and application prospects. Given a content image and a referenced style painting, a desired style transfer would render the content image with the color tone and vivid stroke patterns of the style painting while synchronously maintaining the detailed content structure. Style transfer approaches first learn content and style representations of the content and style references and then generate the stylized images guided by these representations. In this paper, we propose a multi-adaptation network that involves two self-adaptation (SA) modules and one co-adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., the content SA module uses position-wise self-attention to enhance the content representation and the style SA module uses channel-wise self-attention to enhance the style representation; the CA module rearranges the distribution of the style representation based on the content-representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. Moreover, a new disentanglement loss function enables our network to extract the main style patterns and the exact content structures so as to adapt to various input images. Various qualitative and quantitative experiments demonstrate that the proposed multi-adaptation network leads to better results than state-of-the-art style transfer methods. [A toy sketch of the channel-wise self-attention follows this entry.]
- Published
- 2020
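A minimal PyTorch sketch of the channel-wise self-attention attributed to the style SA module; position-wise attention for content is the same computation with the channel and spatial axes swapped. The residual connection is an assumption.

```python
import torch
import torch.nn.functional as F

def channel_self_attention(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) -> channel-reweighted feature of the same shape."""
    b, c, h, w = feat.shape
    x = feat.reshape(b, c, h * w)
    attn = F.softmax(x @ x.transpose(1, 2), dim=-1)   # (B, C, C) affinity
    return (attn @ x).reshape(b, c, h, w) + feat      # residual add

out = channel_self_attention(torch.rand(2, 64, 32, 32))
```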
48. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition
- Author
-
Huang, Yuge, Wang, Yuhan, Tai, Ying, Liu, Xiaoming, Shen, Pengcheng, Li, Shaoxin, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
As an emerging topic in face recognition, designing margin-based loss functions can increase the feature margin between different classes for enhanced discriminability. More recently, mining-based strategies have been adopted to emphasize misclassified samples, achieving promising results. However, during the entire training process, prior methods either do not explicitly emphasize each sample according to its importance, leaving hard samples not fully exploited, or explicitly emphasize semi-hard/hard samples even at the early training stage, which may lead to convergence issues. In this work, we propose a novel Adaptive Curriculum Learning loss (CurricularFace) that embeds the idea of curriculum learning into the loss function to achieve a new training strategy for deep face recognition, which mainly addresses easy samples in the early training stage and hard ones in the later stage. Specifically, CurricularFace adaptively adjusts the relative importance of easy and hard samples during different training stages: in each stage, different samples are assigned different importance according to their difficulty. Extensive experimental results on popular benchmarks demonstrate the superiority of CurricularFace over state-of-the-art competitors., Comment: CVPR 2020. [A toy sketch of the logit modulation follows this entry.]
- Published
- 2020
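A compact PyTorch sketch of the CurricularFace logit modulation as described in the paper: the positive logit gets the usual angular margin, hard negatives (those exceeding it) are scaled by (t + cos θ), and t grows over training via a moving average of the positive cosine. The hyperparameters below are typical values, not guarantees.

```python
import torch

def curricular_logits(cos, labels, m=0.5, s=64.0, t=0.0):
    """cos: (B, n_classes) cosine logits; labels: (B,) ground-truth ids."""
    pos = cos.gather(1, labels[:, None])                  # cos(theta_y)
    pos_m = torch.cos(torch.acos(pos.clamp(-1, 1)) + m)   # cos(theta_y + m)
    hard = cos > pos_m            # negatives harder than the margined positive
    out = torch.where(hard, cos * (t + cos), cos)
    out.scatter_(1, labels[:, None], pos_m)               # write positive back
    t_new = 0.99 * t + 0.01 * pos.mean().item()           # EMA curriculum
    return s * out, t_new

logits, t = curricular_logits(torch.rand(8, 100) * 2 - 1,
                              torch.randint(100, (8,)))
```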
49. Towards Palmprint Verification On Smartphones
- Author
-
Zhang, Yingyi, Zhang, Lin, Zhang, Ruixin, Li, Shaoxin, Li, Jilin, and Huang, Feiyue
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
With the rapid development of mobile devices, smartphones have gradually become an indispensable part of people's lives. Meanwhile, biometric authentication has proven to be an effective method for establishing a person's identity with high confidence. Hence, biometric technologies for smartphones have become increasingly sophisticated and popular. It is noteworthy, however, that the application potential of palmprints on smartphones is seriously underestimated. Studies over the past two decades have shown that palmprints have outstanding merits in uniqueness and permanence, and enjoy high user acceptance. Yet studies specializing in palmprint verification for smartphones remain quite sporadic, especially compared to face- or fingerprint-oriented ones. In this paper, aiming to fill this research gap, we conduct a thorough study of palmprint verification on smartphones, and our contributions are twofold. First, to facilitate the study of palmprint verification on smartphones, we established an annotated palmprint dataset named MPD, collected with smartphones of multiple brands in two separate sessions under various backgrounds and illumination conditions. As the largest dataset in this field, MPD contains 16,000 palm images collected from 200 subjects. Second, we built a DCNN-based palmprint verification system for smartphones named DeepMPV+. In DeepMPV+, the two key steps, ROI extraction and ROI matching, are both formulated as learning problems and solved naturally by modern DCNN models. The efficiency and efficacy of DeepMPV+ have been corroborated by extensive experiments. To make our results fully reproducible, the labeled dataset and the relevant source code have been made publicly available at https://cslinzhang.github.io/MobilePalmPrint/.
- Published
- 2020
50. Architecture Disentanglement for Deep Neural Networks
- Author
-
Hu, Jie, Cao, Liujuan, Ye, Qixiang, Tong, Tong, Zhang, ShengChuan, Li, Ke, Huang, Feiyue, Ji, Rongrong, and Shao, Ling
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Understanding the inner workings of deep neural networks (DNNs) is essential to provide trustworthy artificial intelligence techniques for practical applications. Existing studies typically involve linking semantic concepts to units or layers of DNNs, but fail to explain the inference process. In this paper, we introduce neural architecture disentanglement (NAD) to fill the gap. Specifically, NAD learns to disentangle a pre-trained DNN into sub-architectures according to independent tasks, forming information flows that describe the inference processes. We investigate whether, where, and how the disentanglement occurs through experiments conducted with handcrafted and automatically-searched network architectures, on both object-based and scene-based datasets. Based on the experimental results, we present three new findings that provide fresh insights into the inner logic of DNNs. First, DNNs can be divided into sub-architectures for independent tasks. Second, deeper layers do not always correspond to higher semantics. Third, the connection type in a DNN affects how the information flows across layers, leading to different disentanglement behaviors. With NAD, we further explain why DNNs sometimes give wrong predictions. Experimental results show that misclassified images have a high probability of being assigned to task sub-architectures similar to the correct ones. Code will be available at: https://github.com/hujiecpp/NAD.
- Published
- 2020