738 results for "Yan, Shuicheng"
Search Results
2. Correction: Instant3D: Instant Text-to-3D Generation
- Author
-
Li, Ming, Zhou, Pan, Liu, Jia-Wei, Keppo, Jussi, Lin, Min, Yan, Shuicheng, and Xu, Xiangyu
- Published
- 2024
- Full Text
- View/download PDF
3. Data-Driven single image deraining: A Comprehensive review and new perspectives
- Author
-
Zhang, Zhao, Wei, Yanyan, Zhang, Haijun, Yang, Yi, Yan, Shuicheng, and Wang, Meng
- Published
- 2023
- Full Text
- View/download PDF
4. Dual-Constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior
- Author
-
Zhang, Yan, Zhang, Zhao, Wang, Yang, Zhang, Zheng, Zhang, Li, Yan, Shuicheng, and Wang, Meng
- Published
- 2021
- Full Text
- View/download PDF
5. A Survey on Concept Factorization: From Shallow to Deep Representation Learning
- Author
-
Zhang, Zhao, Zhang, Yan, Xu, Mingliang, Zhang, Li, Yang, Yi, and Yan, Shuicheng
- Published
- 2021
- Full Text
- View/download PDF
6. Arbitrary Virtual Try-on Network: Characteristics Preservation and Tradeoff between Body and Clothing.
- Author
-
Liu, Yu, Zhao, Mingbo, Zhang, Zhao, Liu, Yuping, and Yan, Shuicheng
- Subjects
VIRTUAL networks, DEEP learning, GENERATIVE adversarial networks, HUMAN skin color, CLOTHING & dress
- Abstract
Deep learning based virtual try-on systems have achieved encouraging progress recently, but several big challenges remain to be solved, such as trying on arbitrary clothes of all types, trying on clothes from one category to another, and generating image-realistic results with few artifacts. To handle these issues, in this article we first collect a new dataset with all types of clothes, i.e., tops, bottoms, and whole clothes, each with multiple categories and rich information on clothing characteristics such as patterns, logos, and other details. Based on this dataset, we then propose the Arbitrary Virtual Try-On Network (AVTON), applicable to all clothing types, which can synthesize realistic try-on images by preserving and trading off the characteristics of the target clothes and the reference person. Our approach includes three modules: (1) Limbs Prediction Module, which predicts the human body parts while preserving the characteristics of the reference person. This is especially useful for cross-category try-on tasks (e.g., long sleeves ↔ short sleeves or long pants ↔ skirts), where the exposed arms or legs, with their skin colors and details, can be reasonably predicted; (2) Improved Geometric Matching Module, which warps clothes according to the geometry of the target person. We improve the TPS-based warping method with a compactly supported radial function (Wendland's Ψ-function); (3) Trade-Off Fusion Module, which trades off the characteristics of the warped clothes and the reference person. This module makes the generated try-on images look more natural and realistic based on a fine-tuned symmetry of the network structure. Extensive simulations are conducted, and our approach achieves better performance than state-of-the-art virtual try-on methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
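The record above mentions replacing part of TPS-based warping with a compactly supported radial function (Wendland's Ψ-function). As an illustrative reference only, here is a minimal NumPy sketch of one commonly used Wendland function, Ψ(r) = (1 − r)⁴₊(4r + 1), plugged into a generic RBF warp; the specific Wendland variant, the support radius, and how AVTON combines it with TPS are assumptions, not taken from the paper.

```python
import numpy as np

def wendland_psi(r, support=1.0):
    """Wendland's C^2 compactly supported RBF: psi(r) = (1 - r)^4_+ (4 r + 1).

    `support` rescales the radius so the function vanishes for r >= support.
    NOTE: the exact Wendland variant used in AVTON is an assumption here.
    """
    r = np.asarray(r, dtype=np.float64) / support
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)

def rbf_warp_weights(control_pts, displacements, support=0.25):
    """Solve for RBF weights so the warp interpolates the control-point
    displacements (a generic compactly supported RBF warp, not the paper's code)."""
    d = np.linalg.norm(control_pts[:, None, :] - control_pts[None, :, :], axis=-1)
    K = wendland_psi(d, support)                       # (N, N) kernel matrix
    return np.linalg.solve(K + 1e-8 * np.eye(len(K)), displacements)

# Usage: evaluate the warp at arbitrary image coordinates.
pts = np.random.rand(16, 2)                            # control points in [0, 1]^2
disp = 0.01 * np.random.randn(16, 2)                   # their target displacements
w = rbf_warp_weights(pts, disp)
query = np.random.rand(5, 2)
K_q = wendland_psi(np.linalg.norm(query[:, None, :] - pts[None, :, :], axis=-1), 0.25)
warped = query + K_q @ w
```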
7. Fine-Grained Multi-human Parsing
- Author
-
Zhao, Jian, Li, Jianshu, Liu, Hengzhu, Yan, Shuicheng, and Feng, Jiashi
- Published
- 2020
- Full Text
- View/download PDF
8. Recognizing Profile Faces by Imagining Frontal View
- Author
-
Zhao, Jian, Xing, Junliang, Xiong, Lin, Yan, Shuicheng, and Feng, Jiashi
- Published
- 2020
- Full Text
- View/download PDF
9. Learning with rethinking: Recurrently improving convolutional neural networks through feedback
- Author
-
Li, Xin, Jie, Zequn, Feng, Jiashi, Liu, Changsong, and Yan, Shuicheng
- Published
- 2018
- Full Text
- View/download PDF
10. Video super-resolution based on spatial-temporal recurrent residual networks
- Author
-
Yang, Wenhan, Feng, Jiashi, Xie, Guosen, Liu, Jiaying, Guo, Zongming, and Yan, Shuicheng
- Published
- 2018
- Full Text
- View/download PDF
11. Robust Alternating Low-Rank Representation by joint [formula omitted]- and [formula omitted]-norm minimization
- Author
-
Zhang, Zhao, Zhao, Mingbo, Li, Fanzhang, Zhang, Li, and Yan, Shuicheng
- Published
- 2017
- Full Text
- View/download PDF
12. LG-CNN: From local parts to global discrimination for fine-grained recognition
- Author
-
Xie, Guo-Sen, Zhang, Xu-Yao, Yang, Wenhan, Xu, Mingliang, Yan, Shuicheng, and Liu, Cheng-Lin
- Published
- 2017
- Full Text
- View/download PDF
13. Discriminative sparse flexible manifold embedding with novel graph for robust visual representation and label propagation
- Author
-
Zhang, Zhao, Zhang, Yan, Li, Fanzhang, Zhao, Mingbo, Zhang, Li, and Yan, Shuicheng
- Published
- 2017
- Full Text
- View/download PDF
14. Towards Garment Sewing Pattern Reconstruction from a Single Image.
- Author
-
Liu, Lijuan, Xu, Xiangyu, Lin, Zhijie, Liang, Jiabin, and Yan, Shuicheng
- Abstract
A garment sewing pattern represents the intrinsic rest shape of a garment and is the core of many applications such as fashion design, virtual try-on, and digital avatars. In this work, we explore the challenging problem of recovering garment sewing patterns from daily photos for augmenting these applications. To solve the problem, we first synthesize a versatile dataset, named SewFactory, which consists of around 1M images and ground-truth sewing patterns for model training and quantitative evaluation. SewFactory covers a wide range of human poses, body shapes, and sewing patterns, and possesses realistic appearances thanks to the proposed human texture synthesis network. Then, we propose a two-level Transformer network called Sewformer, which significantly improves the sewing pattern prediction performance. Extensive experiments demonstrate that the proposed framework is effective in recovering sewing patterns and generalizes well to casually taken human photos. Code, dataset, and pre-trained models will be released. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
15. Offline Prioritized Experience Replay
- Author
-
Yue, Yang, Kang, Bingyi, Ma, Xiao, Huang, Gao, Song, Shiji, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
- Abstract
Offline reinforcement learning (RL) is challenged by the distributional shift problem. To address this problem, existing works mainly focus on designing sophisticated policy constraints between the learned policy and the behavior policy. However, these constraints are applied equally to well-performing and inferior actions through uniform sampling, which might negatively affect the learned policy. To alleviate this issue, we propose Offline Prioritized Experience Replay (OPER), featuring a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training. Through theoretical analysis, we show that this class of priority functions induces an improved behavior policy, and when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution. We develop two practical strategies to obtain priority weights by estimating advantages based on a fitted value network (OPER-A) or utilizing trajectory returns (OPER-R) for quick computation. OPER is a plug-and-play component for offline RL algorithms. As case studies, we evaluate OPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and IQL. Extensive experiments demonstrate that both OPER-A and OPER-R significantly improve the performance of all baseline methods. Codes and priority weights are available at https://github.com/sail-sg/OPER., preprint
- Published
- 2023
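The OPER abstract above describes prioritizing highly-rewarding transitions, with OPER-R deriving priority weights directly from trajectory returns. Below is a minimal, hypothetical sketch of that idea for an offline dataset of episodes; the min-max normalization and temperature scaling are my assumptions, not the paper's exact priority function.

```python
import numpy as np

def oper_r_weights(trajectories, temperature=1.0, eps=1e-6):
    """Assign each transition a sampling priority based on its trajectory return.

    `trajectories` is a list of dicts with key "rewards" (1-D array per episode).
    Returns per-transition probabilities for prioritized sampling.
    NOTE: min-max normalization + exponential temperature is an assumption.
    """
    returns = np.array([traj["rewards"].sum() for traj in trajectories])
    norm = (returns - returns.min()) / (returns.max() - returns.min() + eps)
    traj_priority = np.exp(norm / temperature)   # higher-return episodes get larger weight
    # Broadcast each trajectory's priority to all of its transitions.
    per_step = np.concatenate([
        np.full(len(traj["rewards"]), p) for traj, p in zip(trajectories, traj_priority)
    ])
    return per_step / per_step.sum()

# Usage: sample a prioritized minibatch of transition indices.
dataset = [{"rewards": np.random.randn(100)} for _ in range(50)]
probs = oper_r_weights(dataset)
batch_idx = np.random.choice(len(probs), size=256, p=probs)
```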
16. Nonparametric Generative Modeling with Conditional Sliced-Wasserstein Flows
- Author
-
Du, Chao, Li, Tianbo, Pang, Tianyu, Yan, Shuicheng, and Lin, Min
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
Sliced-Wasserstein Flow (SWF) is a promising approach to nonparametric generative modeling but has not been widely adopted due to its suboptimal generative quality and lack of conditional modeling capabilities. In this work, we make two major contributions to bridging this gap. First, based on a pleasant observation that (under certain conditions) the SWF of joint distributions coincides with those of conditional distributions, we propose Conditional Sliced-Wasserstein Flow (CSWF), a simple yet effective extension of SWF that enables nonparametric conditional modeling. Second, we introduce appropriate inductive biases of images into SWF with two techniques inspired by local connectivity and multiscale representation in vision research, which greatly improve the efficiency and quality of modeling images. With all the improvements, we achieve generative performance comparable with many deep parametric generative models on both conditional and unconditional tasks in a purely nonparametric fashion, demonstrating its great potential., ICML 2023
- Published
- 2023
17. A Review of Deep Learning for Video Captioning
- Author
-
Abdar, Moloud, Kollati, Meenakshi, Kuraparthi, Swaraja, Pourpanah, Farhad, McDuff, Daniel, Ghavamzadeh, Mohammad, Yan, Shuicheng, Mohamed, Abduallah, Khosravi, Abbas, Cambria, Erik, and Porikli, Fatih
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Video captioning (VC) is a fast-moving, cross-disciplinary area of research that bridges work in the fields of computer vision, natural language processing (NLP), linguistics, and human-computer interaction. In essence, VC involves understanding a video and describing it with language. Captioning is used in a host of applications, from creating more accessible interfaces (e.g., low-vision navigation) to video question answering (V-QA), video retrieval, and content generation. This survey covers deep learning-based VC, including but not limited to attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more. We discuss the datasets and evaluation metrics used in the field, and limitations, applications, challenges, and future directions for VC., 42 pages, 10 figures
- Published
- 2023
18. CoSDA: Continual Source-Free Domain Adaptation
- Author
-
Feng, Haozhe, Yang, Zhaorui, Chen, Hesun, Pang, Tianyu, Du, Chao, Zhu, Minfeng, Chen, Wei, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Without access to the source data, source-free domain adaptation (SFDA) transfers knowledge from a source-domain trained model to target domains. Recently, SFDA has gained popularity due to the need to protect the data privacy of the source domain, but it suffers from catastrophic forgetting on the source domain due to the lack of data. To systematically investigate the mechanism of catastrophic forgetting, we first reimplement previous SFDA approaches within a unified framework and evaluate them on four benchmarks. We observe that there is a trade-off between adaptation gain and forgetting loss, which motivates us to design a consistency regularization to mitigate forgetting. In particular, we propose a continual source-free domain adaptation approach named CoSDA, which employs a dual-speed optimized teacher-student model pair and is equipped with consistency learning capability. Our experiments demonstrate that CoSDA outperforms state-of-the-art approaches in continuous adaptation. Notably, our CoSDA can also be integrated with other SFDA methods to alleviate forgetting., 15 pages, 6 figures
- Published
- 2023
19. Learning to segment with image-level annotations
- Author
-
Wei, Yunchao, Liang, Xiaodan, Chen, Yunpeng, Jie, Zequn, Xiao, Yanhui, Zhao, Yao, and Yan, Shuicheng
- Published
- 2016
- Full Text
- View/download PDF
20. Kinship-Guided Age Progression
- Author
-
Shu, Xiangbo, Tang, Jinhui, Lai, Hanjiang, Niu, Zhiheng, and Yan, Shuicheng
- Published
- 2016
- Full Text
- View/download PDF
21. Event-based media processing and analysis: A survey of the literature
- Author
-
Tzelepis, Christos, Ma, Zhigang, Mezaris, Vasileios, Ionescu, Bogdan, Kompatsiaris, Ioannis, Boato, Giulia, Sebe, Nicu, and Yan, Shuicheng
- Published
- 2016
- Full Text
- View/download PDF
22. Recognizing violent activity without decoding video streams
- Author
-
Xie, Jianbin, Yan, Wei, Mu, Chundi, Liu, Tong, Li, Peiqin, and Yan, Shuicheng
- Published
- 2016
- Full Text
- View/download PDF
23. Contrastive Video Question Answering via Video Graph Transformer
- Author
-
Xiao, Junbin, Zhou, Pan, Yao, Angela, Li, Yicong, Hong, Richang, Yan, Shuicheng, and Chua, Tat-Seng
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Multimedia (cs.MM)
- Abstract
We propose to perform video question answering (VideoQA) in a Contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) It proposes a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations and dynamics, for complex spatio-temporal reasoning. 2) It designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of multi-modal transformer for answer classification. Fine-grained video-text communication is done by additional cross-modal interaction modules. 3) It is optimized by the joint fully- and self-supervised contrastive objectives between the correct and incorrect answers, as well as the relevant and irrelevant questions respectively. With superior video encoding and QA solution, we show that CoVGT can achieve much better performances than previous arts on video reasoning tasks. Its performances even surpass those models that are pretrained with millions of external data. We further show that CoVGT can also benefit from cross-modal pretraining, yet with orders of magnitude smaller data. The results demonstrate the effectiveness and superiority of CoVGT, and additionally reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description towards fine-grained relation reasoning of video contents. Our code is available at https://github.com/doc-doc/CoVGT., Accepted by IEEE T-PAMI'23
- Published
- 2023
24. On Calibrating Diffusion Probabilistic Models
- Author
-
Pang, Tianyu, Lu, Cheng, Du, Chao, Lin, Min, Yan, Shuicheng, and Deng, Zhijie
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Machine Learning (cs.LG)
- Abstract
Recently, diffusion probabilistic models (DPMs) have achieved promising results in diverse generative tasks. A typical DPM framework includes a forward process that gradually diffuses the data distribution and a reverse process that recovers the data distribution from time-dependent data scores. In this work, we observe that the stochastic reverse process of data scores is a martingale, from which concentration bounds and the optional stopping theorem for data scores can be derived. Then, we discover a simple way for calibrating an arbitrary pretrained DPM, with which the score matching loss can be reduced and the lower bounds of model likelihood can consequently be increased. We provide general calibration guidelines under various model parametrizations. Our calibration method is performed only once and the resulting models can be used repeatedly for sampling. We conduct experiments on multiple datasets to empirically validate our proposal. Our code is at https://github.com/thudzj/Calibrated-DPMs.
- Published
- 2023
25. Bag of Tricks for Training Data Extraction from Language Models
- Author
-
Yu, Weichen, Pang, Tianyu, Liu, Qian, Du, Chao, Kang, Bingyi, Huang, Yan, Lin, Min, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
- Abstract
With the advance of language models, privacy protection is receiving more attention. Training data extraction is therefore of great importance, as it can serve as a potential tool to assess privacy leakage. However, due to the difficulty of this task, most of the existing methods are proof-of-concept and still not effective enough. In this paper, we investigate and benchmark tricks for improving training data extraction using a publicly available dataset. Because most existing extraction methods use a pipeline of generating-then-ranking, i.e., generating text candidates as potential training data and then ranking them based on specific criteria, our research focuses on the tricks for both text generation (e.g., sampling strategy) and text ranking (e.g., token-level criteria). The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction. Based on the GPT-Neo 1.3B evaluation results, our proposed tricks outperform the baseline by a large margin in most cases, providing a much stronger baseline for future research. The code is available at https://github.com/weichen-yu/LM-Extraction., ICML 2023
- Published
- 2023
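The abstract in the record above describes the common generate-then-rank pipeline for training-data extraction: sample many candidate continuations, then rank them by some criterion such as model perplexity. The sketch below illustrates that pipeline with Hugging Face Transformers and the GPT-Neo 1.3B checkpoint named in the abstract; the particular sampling settings and the perplexity ranking are generic placeholders, not the paper's tuned tricks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"        # checkpoint referenced in the abstract
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def generate_candidates(prefix, n=8, new_tokens=64):
    """Step 1: sample candidate continuations (possible memorized training data)."""
    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, do_sample=True, top_k=40, temperature=0.8,
                         max_new_tokens=new_tokens, num_return_sequences=n,
                         pad_token_id=tok.eos_token_id)
    return [tok.decode(seq, skip_special_tokens=True) for seq in out]

@torch.no_grad()
def perplexity(text):
    """Step 2 criterion: lower perplexity = more likely memorized (one simple choice)."""
    ids = tok(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

candidates = generate_candidates("My email address is ")
ranked = sorted(candidates, key=perplexity)   # Step 2: rank candidates
print(ranked[0])
```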
26. Learning to Optimize for Reinforcement Learning
- Author
-
Lan, Qingfeng, Mahmood, A. Rupam, Yan, Shuicheng, and Xu, Zhongwen
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
- Abstract
In recent years, by leveraging more data, computation, and diverse tasks, learned optimizers have achieved remarkable success in supervised learning, outperforming classical hand-designed optimizers. Reinforcement learning (RL) is essentially different from supervised learning, and in practice these learned optimizers do not work well even in simple RL tasks. We investigate this phenomenon and identify three issues. First, the gradients of an RL agent vary across a wide range in logarithms while their absolute values are in a small range, making it hard for neural networks to obtain accurate parameter updates. Second, the agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training. Finally, due to highly stochastic agent-environment interactions, the agent-gradients have high bias and variance, which increase the difficulty of learning an optimizer for RL. We propose gradient processing, pipeline training, and a novel optimizer structure with good inductive bias to address these issues. By applying these techniques, for the first time, we show that learning an optimizer for RL from scratch is possible. Although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax., For code release, see https://github.com/sail-sg/optim4rl
- Published
- 2023
27. Visual Imitation Learning with Patch Rewards
- Author
-
Liu, Minghuan, He, Tairan, Zhang, Weinan, Yan, Shuicheng, and Xu, Zhongwen
- Subjects
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence
- Abstract
Visual imitation learning enables reinforcement learning agents to learn to behave from expert visual demonstrations such as videos or image sequences, without explicit, well-defined rewards. Previous research either adopted supervised learning techniques or induced simple and coarse scalar rewards from pixels, neglecting the dense information contained in the image demonstrations. In this work, we propose to measure the expertise of various local regions of image samples, called patches, and recover multi-dimensional patch rewards accordingly. The patch reward is a more precise rewarding characterization that serves as a fine-grained expertise measurement and visual explainability tool. Specifically, we present Adversarial Imitation Learning with Patch Rewards (PatchAIL), which employs a patch-based discriminator to measure the expertise of different local parts of given images and provide patch rewards. The patch-based knowledge is also used to regularize the aggregated reward and stabilize the training. We evaluate our method on DeepMind Control Suite and Atari tasks. The experimental results demonstrate that PatchAIL outperforms baseline methods and provides valuable interpretations for visual demonstrations., Accepted by ICLR 2023. 18 pages, 14 figures, 2 tables. Codes are available at https://github.com/sail-sg/PatchAIL
- Published
- 2023
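PatchAIL's key component, per the abstract above, is a patch-based discriminator that scores local regions of an image and turns those scores into per-patch rewards. The PyTorch sketch below shows the general shape of such a component (a fully convolutional discriminator producing a reward map); the layer sizes, the log-D reward mapping, and mean aggregation are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator: each output cell scores one image patch."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=1, padding=1),   # 1-channel map of patch logits
        )

    def forward(self, obs):                  # obs: (B, C, H, W)
        return self.net(obs)                 # (B, 1, H', W') patch logits

def patch_rewards(disc, obs):
    """Turn patch logits into rewards; mean aggregation to a scalar is one simple choice."""
    logits = disc(obs)
    per_patch = torch.log(torch.sigmoid(logits) + 1e-8)   # GAIL-style log D reward, per patch
    return per_patch, per_patch.mean(dim=(1, 2, 3))        # (reward map, scalar reward)

# Usage: score a batch of observation frames.
disc = PatchDiscriminator()
frames = torch.randn(4, 3, 84, 84)
reward_map, scalar_r = patch_rewards(disc, frames)
```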
28. Reinforcement Learning from Diverse Human Preferences
- Author
-
Xue, Wanqi, An, Bo, Yan, Shuicheng, and Xu, Zhongwen
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for crowd-sourcing preference labels and learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to the prior distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
- Published
- 2023
29. MADAv2: Advanced Multi-Anchor Based Active Domain Adaptation Segmentation
- Author
-
Ning, Munan, Lu, Donghuan, Xie, Yujia, Chen, Dongdong, Wei, Dong, Zheng, Yefeng, Tian, Yonghong, Yan, Shuicheng, and Yuan, Li
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Unsupervised domain adaptation has been widely adopted in tasks with scarce annotated data. Unfortunately, mapping the target-domain distribution to the source domain unconditionally may distort the essential structural information of the target-domain data, leading to inferior performance. To address this issue, we first propose to introduce active sample selection to assist domain adaptation for the semantic segmentation task. By innovatively adopting multiple anchors instead of a single centroid, both source and target domains can be better characterized as multimodal distributions, so that more complementary and informative samples are selected from the target domain. With only a little workload to manually annotate these active samples, the distortion of the target-domain distribution can be effectively alleviated, achieving a large performance gain. In addition, a powerful semi-supervised domain adaptation strategy is proposed to alleviate the long-tail distribution problem and further improve the segmentation performance. Extensive experiments are conducted on public datasets, and the results demonstrate that the proposed approach outperforms state-of-the-art methods by large margins and achieves similar performance to the fully supervised upper bound, i.e., 71.4% mIoU on GTA5 and 71.8% mIoU on SYNTHIA. The effectiveness of each component is also verified by thorough ablation studies., Accepted by TPAMI-IEEE Transactions on Pattern Analysis and Machine Intelligence. arXiv admin note: substantial text overlap with arXiv:2108.08012
- Published
- 2023
30. Special Issue on Generating Realistic Visual Data of Human Behavior
- Author
-
Alameda-Pineda, Xavier, Ricci, Elisa, Salah, Albert Ali, Sebe, Nicu, and Yan, Shuicheng
- Published
- 2020
- Full Text
- View/download PDF
31. Weakly-supervised scene parsing with multiple contextual cues
- Author
-
Li, Teng, Wu, Xinyu, Ni, Bingbing, Lu, Ke, and Yan, Shuicheng
- Published
- 2015
- Full Text
- View/download PDF
32. On robust image spam filtering via comprehensive visual modeling
- Author
-
Shen, Jialie, Deng, Robert H., Cheng, Zhiyong, Nie, Liqiang, and Yan, Shuicheng
- Published
- 2015
- Full Text
- View/download PDF
33. Visual data denoising with a unified Schatten-p norm and ℓq norm regularized principal component pursuit
- Author
-
Wang, Jing, Wang, Meng, Hu, Xuegang, and Yan, Shuicheng
- Published
- 2015
- Full Text
- View/download PDF
34. Bilinear low-rank coding framework and extension for robust image recovery and feature representation
- Author
-
Zhang, Zhao, Yan, Shuicheng, Zhao, Mingbo, and Li, Fan-Zhang
- Published
- 2015
- Full Text
- View/download PDF
35. Visibility-aware part model for robust facial point detection
- Author
-
Liu, Yanfei, Zhou, Xi, Li, Yuanqian, Wang, Yihao, and Yan, Shuicheng
- Published
- 2015
- Full Text
- View/download PDF
36. Attentive Systems: A Survey
- Author
-
Nguyen, Tam V., Zhao, Qi, and Yan, Shuicheng
- Published
- 2017
- Full Text
- View/download PDF
37. SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition
- Author
-
Xie, Guo-Sen, Zhang, Xu-Yao, Yan, Shuicheng, and Liu, Cheng-Lin
- Published
- 2017
- Full Text
- View/download PDF
38. A survey on deep learning-based fine-grained object classification and semantic segmentation
- Author
-
Zhao, Bo, Feng, Jiashi, Wu, Xiao, and Yan, Shuicheng
- Published
- 2017
- Full Text
- View/download PDF
39. Position-guided Text Prompt for Vision-Language Pre-training
- Author
-
Wang, Alex Jinpeng, Zhou, Pan, Shou, Mike Zheng, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into N×N blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g., filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g., zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for the ViLT baseline, and COCO Captioning (+5.3 in CIDEr) for the SOTA BLIP baseline. Moreover, PTP achieves comparable results with object-detector-based methods, and much faster inference speed, since PTP discards its object detector for inference while the latter cannot. Our code and pre-trained weights will be released at https://github.com/sail-sg/ptp., Camera-ready version, code is in https://github.com/sail-sg/ptp
- Published
- 2022
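The PTP abstract above gives the prompt template explicitly: the image is split into N×N blocks, objects are detected per block, and the model fills the blanks in "The block P has a O". A tiny sketch of building such prompts from detector output follows; the row-major block-indexing scheme and the detection format are assumptions for illustration only.

```python
from typing import List, Tuple

def block_index(cx: float, cy: float, n: int = 3) -> int:
    """Map a normalized box center (cx, cy) in [0, 1] to one of the N x N blocks."""
    col = min(int(cx * n), n - 1)
    row = min(int(cy * n), n - 1)
    return row * n + col            # assumed row-major block numbering

def build_ptp_prompts(detections: List[Tuple[str, float, float]], n: int = 3) -> List[str]:
    """detections: (object_name, center_x, center_y) triples from any object detector."""
    return [f"The block {block_index(cx, cy, n)} has a {name}."
            for name, cx, cy in detections]

# Usage: prompts like these are appended to the image-text pair during pre-training,
# and either the block id or the object word serves as the fill-in-the-blank target.
dets = [("dog", 0.15, 0.80), ("frisbee", 0.55, 0.30)]
print(build_ptp_prompts(dets))      # ['The block 6 has a dog.', 'The block 1 has a frisbee.']
```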
40. Decoupled Cross-Scale Cross-View Interaction for Stereo Image Enhancement in The Dark
- Author
-
Zheng, Huan, Zhang, Zhao, Fan, Jicong, Hong, Richang, Yang, Yi, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Low-light stereo image enhancement (LLSIE) is a relatively new task to enhance the quality of visually unpleasant stereo images captured in dark conditions. However, current methods achieve inferior performance on detail recovery and illumination adjustment. We find this is because: 1) the insufficient single-scale inter-view interaction makes the cross-view cues unable to be fully exploited; 2) lacking long-range dependency leads to the inability to deal with the spatial long-range effects caused by illumination degradation. To alleviate such limitations, we propose an LLSIE model termed Decoupled Cross-scale Cross-view Interaction Network (DCI-Net). Specifically, we present a decoupled interaction module (DIM) that aims for sufficient dual-view information interaction. DIM decouples the dual-view information exchange into discovering multi-scale cross-view correlations and further exploring cross-scale information flow. Besides, we present a spatial-channel information mining block (SIMB) for intra-view feature extraction, and the benefits are twofold. One is the long-range dependency capture to build spatial long-range relationships, and the other is expanded channel information refinement that enhances information flow in the channel dimension. Extensive experiments on Flickr1024, KITTI 2012, KITTI 2015 and Middlebury datasets show that our method obtains better illumination adjustment and detail recovery, and achieves SOTA performance compared to other related methods. Our codes, datasets and models will be publicly available.
- Published
- 2022
41. RPM: Generalizable Behaviors for Multi-Agent Reinforcement Learning
- Author
-
Qiu, Wei, Ma, Xiao, An, Bo, Obraztsova, Svetlana, Yan, Shuicheng, and Xu, Zhongwen
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Multiagent Systems, Multiagent Systems (cs.MA), Machine Learning (cs.LG)
- Abstract
Despite the recent advancement in multi-agent reinforcement learning (MARL), the MARL agents easily overfit the training environment and perform poorly in the evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging mainly due to complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, ranked policy memory (RPM), to collect diverse multi-agent trajectories for training MARL policies with good generalizability. The main idea of RPM is to maintain a look-up memory of policies. In particular, we try to acquire various levels of behaviors by saving policies via ranking the training episode return, i.e., the episode return of agents in the training environment; when an episode starts, the learning agent can then choose a policy from the RPM as the behavior policy. This innovative self-play training framework leverages agents' past policies and guarantees the diversity of multi-agent interaction in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. It has been demonstrated that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks, and it significantly boosts the performance up to 402% on average.
- Published
- 2022
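The RPM abstract above describes a look-up memory of past policies ranked by their training-episode return; at the start of each episode, a behavior policy is drawn from the memory to diversify multi-agent interactions. Here is a schematic Python sketch of such a memory; the bucketing of returns and the uniform two-stage sampling are assumptions, not the paper's exact procedure.

```python
import random
from collections import defaultdict

class RankedPolicyMemory:
    """Keep snapshots of policies keyed by (coarsely bucketed) training return."""

    def __init__(self, bucket_size: float = 10.0, max_per_bucket: int = 5):
        self.bucket_size = bucket_size
        self.max_per_bucket = max_per_bucket
        self.memory = defaultdict(list)      # return bucket -> list of policy snapshots

    def save(self, policy_state, episode_return: float):
        key = int(episode_return // self.bucket_size)   # rank/bucket by episode return
        bucket = self.memory[key]
        bucket.append(policy_state)
        if len(bucket) > self.max_per_bucket:
            bucket.pop(0)                                # keep only recent snapshots per level

    def sample(self):
        """Pick a behavior policy for the next episode: first a return level,
        then a snapshot within that level (uniform at both steps, an assumption here)."""
        level = random.choice(list(self.memory.keys()))
        return random.choice(self.memory[level])

# Usage inside a MARL training loop (schematic):
rpm = RankedPolicyMemory()
rpm.save({"weights": "checkpoint-1"}, episode_return=37.2)
rpm.save({"weights": "checkpoint-2"}, episode_return=85.0)
behavior_policy_state = rpm.sample()   # other agents act with this policy this episode
```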
42. Mutual Information Regularized Offline Reinforcement Learning
- Author
-
Ma, Xiao, Kang, Bingyi, Xu, Zhongwen, Lin, Min, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
- Abstract
Offline reinforcement learning (RL) aims at learning an effective policy from offline datasets without active interactions with the environment. The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement or making conservative updates for value functions during policy evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, which reflects how a behavior agent reacts to certain environment states during data collection. To effectively utilize this information to facilitate policy learning, MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. In this way, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding a mutual information regularization. MISA is a general offline RL framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves new state-of-the-art on various tasks of the D4RL benchmark., 15 pages
- Published
- 2022
43. Seeing Through the Noisy Dark: Towards Real-world Low-Light Image Enhancement and Denoising
- Author
-
Ren, Jiahuan, Zhang, Zhao, Hong, Richang, Xu, Mingliang, Yang, Yi, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Low-light image enhancement (LLIE) aims at improving the illumination and visibility of dark images with lighting noise. To handle real-world low-light images, which often contain heavy and complex noise, some efforts have been made toward joint LLIE and denoising, which however only achieve inferior restoration performance. We attribute this to two challenges: 1) in real-world low-light images, noise is somewhat masked by the low lighting, and the noise left after denoising is inevitably amplified during enhancement; 2) conversion of raw data to sRGB causes information loss and additional noise, and hence prior LLIE methods trained on raw data are unsuitable for the more common sRGB images. In this work, we propose a novel Low-light Enhancement & Denoising Network for real-world low-light images (RLED-Net) in the sRGB color space. In RLED-Net, we apply a plug-and-play differentiable Latent Subspace Reconstruction Block (LSRB) to embed the real-world images into low-rank subspaces to suppress the noise and rectify the errors, such that the impact of noise during enhancement can be effectively shrunk. We then present an efficient Crossed-channel & Shift-window Transformer (CST) layer with two branches to calculate the window and channel attentions to resist the degradation (e.g., speckle noise and blur) caused by the noise in input images. Based on the CST layers, we further present a U-structure network, CSTNet, as the backbone for deep feature recovery, and construct a feature refine block to refine the final features. Extensive experiments on both real noisy images and public image databases verify the effectiveness of the proposed RLED-Net for RLLIE and denoising simultaneously.
- Published
- 2022
44. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
- Author
-
Xie, Xingyu, Zhou, Pan, Li, Huan, Lin, Zhouchen, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
- Abstract
In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an ε-approximate first-order stationary point within O(ε^{-3.5}) stochastic gradient complexity on non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has been used in multiple popular deep learning frameworks or projects.
- Published
- 2022
45. Video Graph Transformer for Video Question Answering
- Author
-
Xiao, Junbin, Zhou, Pan, Chua, Tat-Seng, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and QA solution, we show that VGT can achieve much better performances on VideoQA tasks that challenge dynamic relation reasoning than prior arts in the pretraining-free scenario. Its performances even surpass those models that are pretrained with millions of external data. We further show that VGT can also benefit a lot from self-supervised cross-modal pretraining, yet with orders of magnitude smaller data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT., ECCV'22
- Published
- 2022
46. Recognizing human group action by layered model with multiple cues
- Author
-
Cheng, Zhongwei, Qin, Lei, Huang, Qingming, Yan, Shuicheng, and Tian, Qi
- Published
- 2014
- Full Text
- View/download PDF
47. Similarity preserving low-rank representation for enhanced data representation and effective subspace learning
- Author
-
Zhang, Zhao, Yan, Shuicheng, and Zhao, Mingbo
- Published
- 2014
- Full Text
- View/download PDF
48. Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks
- Author
-
Pan, Jiachun, Zhou, Pan, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Neural and Evolutionary Computing, Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Machine Learning (cs.LG)
- Abstract
For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g., MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then, for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional "supervised learning" (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we first theoretically show that on an auto-encoder with a two/one-layered convolution encoder/decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, considering the fact that the pretraining dataset is of huge size and high diversity and thus covers most features in the downstream dataset, in the fine-tuning phase the pretrained encoder can capture as many features as it can in downstream datasets, and would not lose these features, with theoretical guarantees. In contrast, SL only randomly captures some features due to the lottery ticket hypothesis. So MRP provably achieves better performance than SL on classification tasks. Experimental results testify to our data assumptions and also our theoretical implications.
- Published
- 2022
49. Mugs: A Multi-Granular Self-Supervised Learning Framework
- Author
-
Zhou, Pan, Zhou, Yichen, Si, Chenyang, Yu, Weihao, Ng, Teck Khim, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
In self-supervised learning, multi-granular features are heavily desired though rarely investigated, as different downstream tasks (e.g., general and fine-grained classification) often require different or multi-granular features, e.g., fine- or coarse-grained ones or their mixture. In this work, for the first time, we propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features. Mugs has three complementary granular supervisions: 1) an instance discrimination supervision (IDS), 2) a novel local-group discrimination supervision (LGDS), and 3) a group discrimination supervision (GDS). IDS distinguishes different instances to learn instance-level fine-grained features. LGDS aggregates features of an image and its neighbors into a local-group feature, pulls local-group features from different crops of the same image together, and pushes them away from those of other images. It provides complementary instance supervision to IDS via an extra alignment on local neighbors, and scatters different local-groups separately to increase discriminability. Accordingly, it helps learn high-level fine-grained features at a local-group level. Finally, to prevent similar local-groups from being scattered randomly or far away, GDS brings similar samples close and thus pulls similar local-groups together, capturing coarse-grained features at a (semantic) group level. Consequently, Mugs can capture three granular features that often enjoy higher generality on diverse downstream tasks over single-granular features, e.g., instance-level fine-grained features in contrastive learning. By only pretraining on ImageNet-1K, Mugs sets a new SoTA linear probing accuracy of 82.1% on ImageNet-1K and improves the previous SoTA by 1.1%. It also surpasses SoTAs on other tasks, e.g., transfer learning, detection and segmentation., code and models are available at https://github.com/sail-sg/mugs
- Published
- 2022
50. Robustness and Accuracy Could Be Reconcilable by (Proper) Definition
- Author
-
Pang, Tianyu, Lin, Min, Yang, Xiao, Zhu, Jun, and Yan, Shuicheng
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
- Abstract
The trade-off between robustness and accuracy has been widely studied in the adversarial literature. Although still controversial, the prevailing view is that this trade-off is inherent, either empirically or theoretically. Thus, we dig for the origin of this trade-off in adversarial training and find that it may stem from the improperly defined robust error, which imposes an inductive bias of local invariance -- an overcorrection towards smoothness. Given this, we advocate employing local equivariance to describe the ideal behavior of a robust model, leading to a self-consistent robust error named SCORE. By definition, SCORE facilitates the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty via robust optimization. By simply substituting KL divergence with variants of distance metrics, SCORE can be efficiently minimized. Empirically, our models achieve top-rank performance on RobustBench under AutoAttack. Besides, SCORE provides instructive insights for explaining the overfitting phenomenon and semantic input gradients observed on robust models. Code is available at https://github.com/P2333/SCORE., ICML 2022
- Published
- 2022
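The SCORE abstract above states that the method can be minimized "by simply substituting KL divergence with variants of distance metrics" in the robust objective. The PyTorch sketch below shows that substitution on a TRADES-style consistency loss, with a squared-L2 distance between clean and adversarial probability vectors standing in for KL; the choice of distance, the weighting, and the inner attack are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def trades_like_loss(model, x, y, x_adv, beta=6.0, use_kl=False):
    """Clean cross-entropy + beta * consistency term between clean/adversarial outputs.

    use_kl=True  -> KL-based consistency (TRADES-like baseline).
    use_kl=False -> SCORE-flavored variant: a distance metric (squared L2 on probs)
                    replaces KL, per the abstract's description (details assumed).
    """
    logits_clean = model(x)
    logits_adv = model(x_adv)            # x_adv produced by any inner attack (e.g., PGD)
    ce = F.cross_entropy(logits_clean, y)

    p_clean = F.softmax(logits_clean, dim=1)
    if use_kl:
        consistency = F.kl_div(F.log_softmax(logits_adv, dim=1), p_clean,
                               reduction="batchmean")
    else:
        p_adv = F.softmax(logits_adv, dim=1)
        consistency = ((p_adv - p_clean) ** 2).sum(dim=1).mean()

    return ce + beta * consistency
```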