36 results for "Zhang, Shanghang"
Search Results
2. A multimodal physiological dataset for driving behaviour analysis
- Author
Tao, Xiaoming, Gao, Dingcheng, Zhang, Wenqi, Liu, Tianqi, Du, Bing, Zhang, Shanghang, and Qin, Yanjun
- Published
- 2024
- Full Text
- View/download PDF
3. EfficientBioAI: making bioimaging AI models efficient in energy and latency
- Author
Zhou, Yu, Cao, Jiajun, Sonneck, Justin, Banerjee, Sweta, Dörr, Stefanie, Grüneboom, Anika, Lorenz, Kristina, Zhang, Shanghang, and Chen, Jianxu
- Published
- 2024
- Full Text
- View/download PDF
4. A lightweight multi-layer perceptron for efficient multivariate time series forecasting
- Author
Wang, Zhenghong, Ruan, Sijie, Huang, Tianqiang, Zhou, Haoyi, Zhang, Shanghang, Wang, Yi, Wang, Leye, Huang, Zhou, and Liu, Yu
- Published
- 2024
- Full Text
- View/download PDF
5. Expanding the prediction capacity in long sequence time-series forecasting
- Author
Zhou, Haoyi, Li, Jianxin, Zhang, Shanghang, Zhang, Shuai, Yan, Mengyi, and Xiong, Hui
- Published
- 2023
- Full Text
- View/download PDF
6. Learning graph attention-aware knowledge graph embedding
- Author
Li, Chen, Peng, Xutan, Niu, Yuhang, Zhang, Shanghang, Peng, Hao, Zhou, Chuan, and Li, Jianxin
- Published
- 2021
- Full Text
- View/download PDF
7. Modeling relation paths for knowledge base completion via joint adversarial training
- Author
Li, Chen, Peng, Xutan, Zhang, Shanghang, Peng, Hao, Yu, Philip S., He, Min, Du, Linfeng, and Wang, Lihong
- Published
- 2020
- Full Text
- View/download PDF
8. PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers
- Author
Jia, Peidong, Liu, Jiaming, Yang, Senqiao, Wu, Jiarui, Xie, Xiaodong, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,68T07 ,I.5.1 ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
The Transformer-based detectors (i.e., DETR) have demonstrated impressive performance on end-to-end object detection. However, transferring DETR to different data distributions may lead to a significant performance degradation. Existing adaptation techniques focus on model-based approaches, which aim to leverage feature alignment to narrow the distribution shift between different domains. In this study, we propose a hierarchical Prompt Domain Memory (PDM) for adapting detection transformers to different distributions. PDM comprehensively leverages the prompt memory to extract domain-specific knowledge and explicitly constructs a long-term memory space for the data distribution, which represents better domain diversity compared to existing methods. Specifically, each prompt and its corresponding distribution value are paired in the memory space, and we inject the top-M distribution-similar prompts into the input and multi-level embeddings of DETR. Additionally, we introduce Prompt Memory Alignment (PMA) to reduce the discrepancy between the source and target domains by fully leveraging the domain-specific knowledge extracted from the prompt domain memory. Extensive experiments demonstrate that our method outperforms state-of-the-art domain adaptive object detection methods on three benchmarks, including scene, synthetic-to-real, and weather adaptation. Codes will be released.
- Published
- 2023
9. Chain of Thought Prompt Tuning in Vision Language Models
- Author
Ge, Jiaxin, Luo, Hongyin, Qian, Siyuan, Gan, Yulu, Fu, Jie, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Language-Image Pre-training has demonstrated promising results on zero-shot and few-shot downstream tasks by prompting visual models with natural language prompts. However, most recent studies only use a single prompt for tuning, neglecting the inherent step-by-step cognitive reasoning process that humans conduct in complex task settings, for example, when processing images from unfamiliar domains. Chain of Thought is a simple and effective approximation of the human reasoning process and has been proven useful for natural language processing (NLP) tasks. Based on this cognitive intuition, we believe that conducting effective reasoning is also an important problem in visual tasks, and a chain of thought could be a solution to this problem. In this work, we propose a novel chain-of-thought prompt tuning for vision-language modeling. Extensive experiments show that our method not only generalizes better in image classification tasks, has greater transferability beyond a single dataset, and has stronger domain generalization performance, but also performs much better in image-text retrieval and visual question answering, which require more reasoning capabilities. We are the first to successfully adapt chain-of-thought prompting that combines visual and textual embeddings. We will release our code.
- Published
- 2023
10. MoWE: Mixture of Weather Experts for Multiple Adverse Weather Removal
- Author
Luo, Yulin, Zhao, Rui, Wei, Xiaobao, Chen, Jinwei, Lu, Yijie, Xie, Shenghao, Wang, Tianyu, Xiong, Ruiqin, Lu, Ming, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Currently, most adverse weather removal tasks are handled independently, such as deraining, desnowing, and dehazing. However, in autonomous driving scenarios, the type, intensity, and mixing degree of the weather are unknown, so the separated task setting cannot deal with these complex conditions well. Besides, vision applications in autonomous driving often aim at high-level tasks, but existing weather removal methods neglect the connection between performance on perceptual tasks and signal fidelity. To this end, for the upstream task, we propose a novel Mixture of Weather Experts (MoWE) Transformer framework to handle complex weather removal in a perception-aware fashion. We design a Weather-aware Router to make the experts more targeted to weather types, without the need for weather-type labels during inference. To handle diverse weather conditions, we propose Multi-scale Experts to fuse information among neighboring tokens. For the downstream task, we propose a Label-free Perception-aware Metric to measure whether the outputs of image processing models are suitable for high-level perception tasks without the demand for semantic labels. We collect a synthetic dataset, MAW-Sim, towards autonomous driving scenarios to benchmark the multiple weather removal performance of existing methods. Our MoWE achieves SOTA performance on the upstream task on the proposed dataset and two public datasets, i.e., All-Weather and Rain/Fog-Cityscapes, and also achieves better perceptual results in the downstream segmentation task compared to other methods. Our codes and datasets will be released after acceptance.
- Published
- 2023
11. MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID
- Author
Gu, Jianyang, Wang, Kai, Luo, Hao, Chen, Chen, Jiang, Wei, Fang, Yuqiang, Zhang, Shanghang, You, Yang, and Zhao, Jian
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Neural Architecture Search (NAS) has become increasingly appealing to the object re-identification (ReID) community, because task-specific architectures significantly improve retrieval performance. Previous works explore new optimization targets and search spaces for NAS ReID, yet they neglect the difference in training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlap between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance attention consistency when confronted with images from different sources. Under the proposed NAS scheme, a specific architecture named MSINet is automatically searched. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios. Source code is available at https://github.com/vimar-gu/MSINet. Accepted by CVPR 2023.
- Published
- 2023
12. BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection
- Author
Li, Jianing, Lu, Ming, Liu, Jiaming, Guo, Yandong, Du, Li, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,I.4.9 - Abstract
Recently, Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection, which has demonstrated promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information makes current approaches adopt large models for good performance. Therefore, it is essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques to train efficient yet accurate models. However, BEV KD is still under-explored to the best of our knowledge. Different from image classification tasks, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer the knowledge in the teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfying results due to heavy background information in RGB cameras. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points to BEV space and generate the foreground mask and view-dependent mask for the teacher-student paradigm. It is to be noted that our method only uses LiDAR points to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation to our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released., 12pages
- Published
- 2022
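As a rough illustration of the LiDAR-guided masking idea summarized in record 12 above, the Python sketch below splats LiDAR points into a binary BEV occupancy mask and uses it to weight a feature-distillation loss between a student and a teacher BEV map. This is a hypothetical reading of the abstract, not the authors' BEV-LGKD code: the grid size, BEV range, and the plain masked MSE loss are illustrative assumptions, and the paper's view-dependent mask and depth distillation are omitted.

```python
import torch

def lidar_foreground_mask(points_xy, bev_hw=(128, 128), bev_range=51.2):
    """Splat LiDAR (x, y) points into a binary BEV occupancy mask.

    points_xy: (N, 2) ego-frame coordinates in meters; the grid is assumed to
    cover [-bev_range, bev_range] in both axes (made-up values).
    """
    H, W = bev_hw
    mask = torch.zeros(H, W)
    ij = ((points_xy + bev_range) / (2 * bev_range) * torch.tensor([H, W])).long()
    keep = (ij[:, 0] >= 0) & (ij[:, 0] < H) & (ij[:, 1] >= 0) & (ij[:, 1] < W)
    ij = ij[keep]
    mask[ij[:, 0], ij[:, 1]] = 1.0
    return mask

def masked_bev_distill_loss(student_bev, teacher_bev, fg_mask):
    """MSE between student and teacher BEV features, weighted by the LiDAR mask."""
    diff = (student_bev - teacher_bev) ** 2                  # (C, H, W)
    return (diff * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)

# Toy usage with made-up shapes.
pts = (torch.rand(5000, 2) - 0.5) * 80.0                     # fake LiDAR points in a +/-40 m square
mask = lidar_foreground_mask(pts)
student, teacher = torch.randn(64, 128, 128), torch.randn(64, 128, 128)
print(masked_bev_distill_loss(student, teacher, mask).item())
```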
13. Multi-latent Space Alignments for Unsupervised Domain Adaptation in Multi-view 3D Object Detection
- Author
Liu, Jiaming, Zhang, Rongyu, Chi, Xiaowei, Li, Xiaoqi, Lu, Ming, Guo, Yandong, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Vision-Centric Bird-Eye-View (BEV) perception has shown promising potential and attracted increasing attention in autonomous driving. Recent works mainly focus on improving efficiency or accuracy but neglect the domain shift problem, resulting in severe degradation of transfer performance. With extensive observations, we figure out the significant domain gaps existing in the scene, weather, and day-night changing scenarios and make the first attempt to solve the domain adaption problem for multi-view 3D object detection. Since BEV perception approaches are usually complicated and contain several components, the domain shift accumulation on multi-latent spaces makes BEV domain adaptation challenging. In this paper, we propose a novel Multi-level Multi-space Alignment Teacher-Student ($M^{2}ATS$) framework to ease the domain shift accumulation, which consists of a Depth-Aware Teacher (DAT) and a Multi-space Feature Aligned (MFA) student model. Specifically, DAT model adopts uncertainty guidance to sample reliable depth information in target domain. After constructing domain-invariant BEV perception, it then transfers pixel and instance-level knowledge to student model. To further alleviate the domain shift at the global level, MFA student model is introduced to align task-relevant multi-space features of two domains. To verify the effectiveness of $M^{2}ATS$, we conduct BEV 3D object detection experiments on four cross domain scenarios and achieve state-of-the-art performance (e.g., +12.6% NDS and +9.1% mAP on Day-Night). Code and dataset will be released.
- Published
- 2022
14. NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers
- Author
Liu, Yijiang, Yang, Huanrui, Dong, Zhen, Keutzer, Kurt, Du, Li, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
The complicated architecture and high training cost of vision transformers urge the exploration of post-training quantization. However, the heavy-tailed distribution of vision transformer activations hinders the effectiveness of previous post-training quantization methods, even with advanced quantizer designs. Instead of tuning the quantizer to better fit the complicated activation distribution, this paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers. We make a surprising theoretical discovery that for a given quantizer, adding a fixed Uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions. Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer. Extensive experiments show NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead. For instance, on linear uniform 6-bit activation quantization, NoisyQuant improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1% and 0.5% for ViT, DeiT, and Swin Transformer respectively, achieving on-par or even higher performance than previous nonlinear, mixed-precision quantization., Accepted to CVPR2023
- Published
- 2022
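The core trick described in record 14 above, adding a fixed uniform noisy bias before quantization and removing it afterwards, can be illustrated in a few lines of NumPy. This is a toy sketch under assumed settings (6-bit linear uniform quantization, a synthetic heavy-tailed activation tensor, a bias width tied to the quantization step); it is not the authors' released implementation, and the paper's provable conditions govern when the second error actually comes out lower.

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_quantize(x, n_bits=6):
    """Plain linear uniform quantization followed by dequantization."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    return np.round((x - lo) / scale) * scale + lo

# Synthetic heavy-tailed activations, loosely mimicking post-GELU ViT statistics.
acts = rng.standard_t(df=3, size=100_000).astype(np.float32)

# Baseline: quantize the activations directly.
err_plain = np.mean((uniform_quantize(acts) - acts) ** 2)

# NoisyQuant-style: add a fixed uniform noisy bias, quantize, then subtract the bias back.
step = (acts.max() - acts.min()) / (2 ** 6 - 1)
noisy_bias = rng.uniform(-step / 2, step / 2, size=acts.shape).astype(np.float32)
deq = uniform_quantize(acts + noisy_bias) - noisy_bias
err_noisy = np.mean((deq - acts) ** 2)

print(f"quantization MSE without noisy bias: {err_plain:.6e}")
print(f"quantization MSE with noisy bias:    {err_noisy:.6e}")
```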
15. Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer
- Author
Liu, Jiaming, Zhang, Qizhe, Li, Jianing, Lu, Ming, Huang, Tiejun, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Neuromorphic spike data, an upcoming modality with high temporal resolution, has shown promising potential in real-world applications due to its inherent advantage in overcoming high-velocity motion blur. However, training a spike depth estimation network poses significant challenges in two aspects: sparse spatial information for dense regression tasks, and difficulty in obtaining paired depth labels for temporally intensive spike streams. In this paper, we thus propose a cross-modality cross-domain (BiCross) framework to realize unsupervised spike depth estimation with the help of open-source RGB data. It first transfers cross-modality knowledge from source RGB data to intermediate simulated source spike data, then realizes cross-domain learning from the simulated source spike data to the target spike data. Specifically, Coarse-to-Fine Knowledge Distillation (CFKD) is introduced to transfer cross-modality knowledge at the global and pixel levels in the source domain, which complements sparse spike features with the rich semantic knowledge of image features. We then propose an Uncertainty-Guided Teacher-Student (UGTS) method to realize cross-domain learning on the spike target domain, ensuring domain-invariant global and pixel-level knowledge of the teacher and student models through alignment and uncertainty-guided depth selection measurement. To verify the effectiveness of BiCross, we conduct extensive experiments on three scenarios, including Synthetic to Real, Extreme Weather, and Scene Changing. The code and datasets will be released.
- Published
- 2022
16. Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning
- Author
Lu, Yuheng, Xu, Chenfeng, Wei, Xiaobao, Xie, Xiaodong, Tomizuka, Masayoshi, Keutzer, Kurt, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Current point-cloud detection methods have difficulty detecting the open-vocabulary objects in the real world, due to their limited generalization capability. Moreover, it is extremely laborious and expensive to collect and fully annotate a point-cloud detection dataset with numerous classes of objects, leading to the limited classes of existing point-cloud datasets and hindering the model to learn general representations to achieve open-vocabulary point-cloud detection. As far as we know, we are the first to study the problem of open-vocabulary 3D point-cloud detection. Instead of seeking a point-cloud dataset with full labels, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. Then we propose a novel debiased cross-modal contrastive learning method to transfer the knowledge from image modality to point-cloud modality during training. Without hurting the latency during inference, OV-3DETIC makes the point-cloud detector capable of achieving open-vocabulary detection. Extensive experiments demonstrate that the proposed OV-3DETIC achieves at least 10.77 % mAP improvement (absolute value) and 9.56 % mAP improvement (absolute value) by a wide range of baselines on the SUN-RGBD dataset and ScanNet dataset, respectively. Besides, we conduct sufficient experiments to shed light on why the proposed OV-3DETIC works.
- Published
- 2022
17. UnrealNAS: Can We Search Neural Architectures with Unreal Data?
- Author
Dong, Zhen, Zhou, Kaicheng, Li, Guohao, Zhou, Qiang, Guo, Mingfei, Ghanem, Bernard, Keutzer, Kurt, and Zhang, Shanghang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Neural architecture search (NAS) has shown great success in the automatic design of deep neural networks (DNNs). However, the best way to use data to search network architectures is still unclear and under exploration. Previous work has analyzed the necessity of having ground-truth labels in NAS and inspired broad interest. In this work, we take a further step to question whether real data is necessary for NAS to be effective. The answer to this question is important for applications with limited amount of accessible data, and can help people improve NAS by leveraging the extra flexibility of data generation. To explore if NAS needs real data, we construct three types of unreal datasets using: 1) randomly labeled real images; 2) generated images and labels; and 3) generated Gaussian noise with random labels. These datasets facilitate to analyze the generalization and expressivity of the searched architectures. We study the performance of architectures searched on these constructed datasets using popular differentiable NAS methods. Extensive experiments on CIFAR, ImageNet and CheXpert show that the searched architectures can achieve promising results compared with those derived from the conventional NAS pipeline with real labeled data, suggesting the feasibility of performing NAS with unreal data.
- Published
- 2022
18. P2FEViT: Plug-and-Play CNN Feature Embedded Hybrid Vision Transformer for Remote Sensing Image Classification.
- Author
Wang, Guanqun, Chen, He, Chen, Liang, Zhuang, Yin, Zhang, Shanghang, Zhang, Tong, Dong, Hao, and Gao, Peng
- Subjects
IMAGE recognition (Computer vision) ,TRANSFORMER models ,REMOTE sensing ,CONVOLUTIONAL neural networks ,DATA mining ,SPATIAL ability - Abstract
Remote sensing image classification (RSIC) is a classical and fundamental task in the intelligent interpretation of remote sensing imagery, which can provide unique labeling information for each acquired remote sensing image. Thanks to the potent global context information extraction ability of the multi-head self-attention (MSA) mechanism, visual transformer (ViT)-based architectures have shown excellent capability in natural scene image classification. However, in order to achieve powerful RSIC performance, it is insufficient to capture global spatial information alone. Specifically, for fine-grained target recognition tasks with high inter-class similarity, discriminative and effective local feature representations are key to correct classification. In addition, due to the lack of inductive biases, the powerful global spatial context representation capability of ViT requires lengthy training procedures and large-scale pre-training data volume. To solve the above problems, a hybrid architecture of convolution neural network (CNN) and ViT is proposed to improve the RSIC ability, called P 2 FEViT, which integrates plug-and-play CNN features with ViT. In this paper, the feature representation capabilities of CNN and ViT applying for RSIC are first analyzed. Second, aiming to integrate the advantages of CNN and ViT, a novel approach embedding CNN features into the ViT architecture is proposed, which can make the model synchronously capture and fuse global context and local multimodal information to further improve the classification capability of ViT. Third, based on the hybrid structure, only a simple cross-entropy loss is employed for model training. The model can also have rapid and comfortable convergence with relatively less training data than the original ViT. Finally, extensive experiments are conducted on the public and challenging remote sensing scene classification dataset of NWPU-RESISC45 (NWPU-R45) and the self-built fine-grained target classification dataset called BIT-AFGR50. The experimental results demonstrate that the proposed P 2 FEViT can effectively improve the feature description capability and obtain outstanding image classification performance, while significantly reducing the high dependence of ViT on large-scale pre-training data volume and accelerating the convergence speed. The code and self-built dataset will be released at our webpages. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
19. 2nd Place Solution for VisDA 2021 Challenge -- Universally Domain Adaptive Image Recognition
- Author
Liao, Haojin, Song, Xiaolin, Zhao, Sicheng, Zhang, Shanghang, Yue, Xiangyu, Yao, Xingxu, Zhang, Yueming, Xing, Tengfei, Xu, Pengfei, and Wang, Qiang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
The Visual Domain Adaptation (VisDA) 2021 Challenge calls for unsupervised domain adaptation (UDA) methods that can deal with both input distribution shift and label set variance between the source and target domains. In this report, we introduce a universal domain adaptation (UniDA) method by aggregating several popular feature extraction and domain adaptation schemes. First, we utilize VOLO, a Transformer-based architecture with state-of-the-art performance in several visual tasks, as the backbone to extract effective feature representations. Second, we modify the open-set classifier of OVANet to recognize the unknown class with competitive accuracy and robustness. As shown in the leaderboard, our proposed UniDA method ranks the 2nd place with 48.56% ACC and 70.72% AUROC in the VisDA 2021 Challenge.
- Published
- 2021
20. Online Continual Adaptation with Active Self-Training
- Author
Zhou, Shiji, Zhao, Han, Zhang, Shanghang, Wang, Lianzhe, Chang, Heng, Wang, Zhi, and Zhu, Wenwu
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Models trained with offline data often suffer from continual distribution shifts and expensive labeling in changing environments. This calls for a new online learning paradigm where the learner can continually adapt to changing environments with limited labels. In this paper, we propose a new online setting -- Online Active Continual Adaptation, where the learner aims to continually adapt to changing distributions using both unlabeled samples and active queries of limited labels. To this end, we propose Online Self-Adaptive Mirror Descent (OSAMD), which adopts an online teacher-student structure to enable online self-training from unlabeled data, and a margin-based criterion that decides whether to query the labels to track changing distributions. Theoretically, we show that, in the separable case, OSAMD has an $O({T}^{2/3})$ dynamic regret bound under mild assumptions, which is aligned with the $\Omega(T^{2/3})$ lower bound of online learning algorithms with full labels. In the general case, we show a regret bound of $O({T}^{2/3} + \alpha^* T)$, where $\alpha^*$ denotes the separability of domains and is usually small. Our theoretical results show that OSAMD can fast adapt to changing environments with active queries. Empirically, we demonstrate that OSAMD achieves favorable regrets under changing environments with limited labels on both simulated and real-world data, which corroborates our theoretical findings.
- Published
- 2021
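A minimal, hypothetical sketch of the margin-based query rule described in record 20 above: a linear student is updated online, a true label is requested only when the prediction margin is small, and confident predictions are self-trained on as pseudo-labels. The margin threshold, learning rate, plain SGD update (standing in for mirror descent), and the toy drifting stream are all illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def osamd_like_stream(stream, dim, margin=0.3, lr=0.1):
    """Online adaptation with margin-based active label queries (toy binary linear model).

    `stream` yields (x, get_label) pairs; get_label() is only called when the
    learner decides to spend a label query.
    """
    w = np.zeros(dim)
    n_queries = 0
    for x, get_label in stream:
        score = w @ x
        if abs(score) < margin:          # uncertain prediction -> query the true label
            y = get_label()
            n_queries += 1
        else:                            # confident prediction -> self-train on the pseudo-label
            y = 1.0 if score > 0 else -1.0
        if y * score < 1.0:              # hinge-style online (sub)gradient step
            w += lr * y * x
    return w, n_queries

def make_stream(T=2000, dim=5):
    """Toy drifting stream: the true linear boundary slowly rotates over time."""
    w_true = rng.normal(size=dim)
    for _ in range(T):
        w_true += 0.01 * rng.normal(size=dim)     # continual distribution shift
        x = rng.normal(size=dim)
        yield x, (lambda x=x, w=w_true.copy(): 1.0 if w @ x > 0 else -1.0)

w, n_queries = osamd_like_stream(make_stream(), dim=5)
print("labels queried:", n_queries)
```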
21. P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding
- Author
Liu, Yunze, Yi, Li, Zhang, Shanghang, Fan, Qingnan, Funkhouser, Thomas, and Dong, Hao
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks. A promising approach is to use contrastive learning to learn a latent space where features are close for similar data samples and far apart for dissimilar ones. This approach has demonstrated tremendous success for pretraining both image and point cloud feature extractors, but it has been barely investigated for multi-modal RGB-D scans, especially with the goal of facilitating high-level scene understanding. To solve this problem, we propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed and/or the two RGB-D points are not in correspondence. This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two. Experiments show that this proposed approach yields better performance on three large-scale RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than previous pretraining approaches.
- Published
- 2020
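A toy rendering of the contrastive objective described in record 21 above: matched pixel/point feature pairs are treated as positives, while negatives come from mismatched pairs plus pairs in which the point modality has been disturbed (here simply permuted). Feature extraction, the paper's actual disturbances, and hard-negative construction are all omitted, so the function and its arguments should be read as hypothetical.

```python
import torch
import torch.nn.functional as F

def p4contrast_like_loss(rgb_feat, pts_feat, tau=0.07, n_disturb=2):
    """InfoNCE over 'pairs of point-pixel pairs' (toy version).

    rgb_feat, pts_feat: (N, D) features of N corresponding pixel/point pairs.
    The diagonal of the similarity matrix holds the positives; extra columns
    built from permuted point features act as disturbed-modality negatives
    (a permuted column can occasionally coincide with a positive; ignored here).
    """
    rgb = F.normalize(rgb_feat, dim=1)
    pts = F.normalize(pts_feat, dim=1)
    logits = rgb @ pts.t() / tau                          # (N, N)
    for _ in range(n_disturb):                            # append disturbed-modality negatives
        perm = torch.randperm(pts.size(0))
        logits = torch.cat([logits, rgb @ pts[perm].t() / tau], dim=1)
    target = torch.arange(rgb.size(0))
    return F.cross_entropy(logits, target)

# Toy usage with made-up feature dimensions.
rgb, pts = torch.randn(256, 128), torch.randn(256, 128)
print(p4contrast_like_loss(rgb, pts).item())
```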
22. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
- Author
Zhou, Haoyi, Zhang, Shanghang, Peng, Jieqi, Zhang, Shuai, Li, Jianxin, Xiong, Hui, and Zhang, Wancai
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,General Medicine ,Information Retrieval (cs.IR) ,Machine Learning (cs.LG) ,Computer Science - Information Retrieval - Abstract
Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a $ProbSparse$ self-attention mechanism, which achieves $O(L \log L)$ in time complexity and memory usage, and has comparable performance on sequences' dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem., 8 pages (main), 5 pages (appendix) and to be appeared in AAAI2021
- Published
- 2020
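To make the ProbSparse idea from record 22 above concrete, here is a simplified NumPy sketch: each query is scored by how far its attention distribution is from uniform (max score minus mean score), and only the top-u queries receive full attention while the rest fall back to the mean of the values. The real Informer estimates this score from a sampled subset of keys to reach O(L log L) and handles batching, heads, and masking; none of that is shown, so treat this purely as an assumed illustration.

```python
import numpy as np

def probsparse_attention(Q, K, V, u=None):
    """Simplified ProbSparse-style attention (single head, no key sampling)."""
    L_q, d = Q.shape
    if u is None:
        u = max(1, int(np.ceil(np.log(L_q))))             # u on the order of log L
    scores = Q @ K.T / np.sqrt(d)                         # (L_q, L_k)
    sparsity = scores.max(axis=1) - scores.mean(axis=1)   # query sparsity measure M(q, K)
    top = np.argsort(-sparsity)[:u]                       # 'active' queries

    out = np.repeat(V.mean(axis=0, keepdims=True), L_q, axis=0)   # 'lazy' queries get mean(V)
    s = scores[top]
    attn = np.exp(s - s.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    out[top] = attn @ V
    return out

# Toy usage with made-up shapes: 96 time steps, 64-dimensional head.
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(96, 64)) for _ in range(3))
print(probsparse_attention(Q, K, V).shape)   # (96, 64)
```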
23. Dual Adversarial Semantics-Consistent Network for Generalized Zero-Shot Learning
- Author
Ni, Jian, Zhang, Shanghang, and Xie, Haiyong
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Generalized zero-shot learning (GZSL) is a challenging class of vision and knowledge transfer problems in which both seen and unseen classes appear during testing. Existing GZSL approaches either suffer from semantic loss and discard discriminative information at the embedding stage, or cannot guarantee the visual-semantic interactions. To address these limitations, we propose the Dual Adversarial Semantics-Consistent Network (DASCN), which learns primal and dual Generative Adversarial Networks (GANs) in a unified framework for GZSL. In particular, the primal GAN learns to synthesize inter-class discriminative and semantics-preserving visual features from both the semantic representations of seen/unseen classes and the ones reconstructed by the dual GAN. The dual GAN enforces the synthetic visual features to represent prior semantic knowledge well via semantics-consistent adversarial learning. To the best of our knowledge, this is the first work that employs a novel dual-GAN mechanism for GZSL. Extensive experiments show that our approach achieves significant improvements over the state-of-the-art approaches., 10 pages, 5 figures
- Published
- 2019
24. MaCow: Masked Convolutional Generative Flow
- Author
Ma, Xuezhe, Kong, Xiang, Zhang, Shanghang, and Hovy, Eduard
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Flow-based generative models, conceptually attractive due to the tractability of both exact log-likelihood computation and latent-variable inference, and the efficiency of both training and sampling, have led to a number of impressive empirical successes and spawned many advanced variants and theoretical investigations. Despite their computational efficiency, the density estimation performance of flow-based generative models significantly falls behind that of state-of-the-art autoregressive models. In this work, we introduce masked convolutional generative flow (MaCow), a simple yet effective architecture of generative flow using masked convolution. By restricting the local connectivity to a small kernel, MaCow enjoys the properties of fast and stable training and efficient sampling, while achieving significant improvements over Glow for density estimation on standard image benchmarks, considerably narrowing the gap to autoregressive models. In Proceedings of the Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019).
- Published
- 2019
25. A Review of Single-Source Deep Unsupervised Visual Domain Adaptation.
- Author
Zhao, Sicheng, Yue, Xiangyu, Zhang, Shanghang, Li, Bo, Zhao, Han, Wu, Bichen, Krishna, Ravi, Gonzalez, Joseph E., Sangiovanni-Vincentelli, Alberto L., Seshia, Sanjit A., and Keutzer, Kurt
- Subjects
VISUAL accommodation ,MACHINE learning - Abstract
Large-scale labeled training datasets have enabled deep neural networks to excel across a wide range of benchmark vision tasks. However, in many applications, it is prohibitively expensive and time-consuming to obtain large quantities of labeled data. To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain. Unfortunately, direct transfer across domains often performs poorly due to the presence of domain shift or dataset bias. Domain adaptation (DA) is a machine learning paradigm that aims to learn a model from a source domain that can perform well on a different (but related) target domain. In this article, we review the latest single-source deep unsupervised DA methods focused on visual tasks and discuss new perspectives for future research. We begin with the definitions of different DA strategies and the descriptions of existing benchmark datasets. We then summarize and compare different categories of single-source unsupervised DA methods, including discrepancy-based methods, adversarial discriminative methods, adversarial generative methods, and self-supervision-based methods. Finally, we discuss future research directions with challenges and possible solutions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
26. Multiple Source Domain Adaptation with Adversarial Training of Neural Networks
- Author
Zhao, Han, Zhang, Shanghang, Wu, Guanhang, Costeira, João P., Moura, José M. F., and Gordon, Geoffrey J.
- Subjects
FOS: Computer and information sciences ,Computer Science - Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
While domain adaptation has been actively researched in recent years, most theoretical results and algorithms focus on the single-source-single-target adaptation setting. Naive application of such algorithms on multiple source domain adaptation problem may lead to suboptimal solutions. As a step toward bridging the gap, we propose a new generalization bound for domain adaptation when there are multiple source domains with labeled instances and one target domain with unlabeled instances. Compared with existing bounds, the new bound does not require expert knowledge about the target distribution, nor the optimal combination rule for multisource domains. Interestingly, our theory also leads to an efficient learning strategy using adversarial neural networks: we show how to interpret it as learning feature representations that are invariant to the multiple domain shifts while still being discriminative for the learning task. To this end, we propose two models, both of which we call multisource domain adversarial networks (MDANs): the first model optimizes directly our bound, while the second model is a smoothed approximation of the first one, leading to a more data-efficient and task-adaptive model. The optimization tasks of both models are minimax saddle point problems that can be optimized by adversarial training. To demonstrate the effectiveness of MDANs, we conduct extensive experiments showing superior adaptation performance on three real-world datasets: sentiment analysis, digit classification, and vehicle counting.
- Published
- 2017
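The adversarial strategy summarized in record 26 above is typically implemented with a shared feature extractor, a task classifier, one domain discriminator per source, and a gradient-reversal layer; the PyTorch sketch below is a generic illustration of that recipe under made-up module sizes, not the authors' MDAN code, and the logsumexp over per-source losses only loosely mirrors the paper's smoothed maximum. Training would simply minimize the returned objective over batches drawn from each labeled source and the unlabeled target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradient multiplied by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class MDANLike(nn.Module):
    """Shared feature extractor + task classifier + one domain discriminator per source."""
    def __init__(self, in_dim=100, feat_dim=64, n_classes=10, n_sources=3):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.clf = nn.Linear(feat_dim, n_classes)
        self.doms = nn.ModuleList([nn.Linear(feat_dim, 2) for _ in range(n_sources)])

    def forward(self, xs_list, ys_list, xt, lamb=1.0, gamma=10.0):
        losses = []
        ft = self.feat(xt)
        for i, (xs, ys) in enumerate(zip(xs_list, ys_list)):
            fs = self.feat(xs)
            task = F.cross_entropy(self.clf(fs), ys)
            # Discriminator i separates source i (label 0) from the target (label 1);
            # the reversed gradient pushes features toward domain invariance.
            d_in = GradReverse.apply(torch.cat([fs, ft]), lamb)
            d_lab = torch.cat([torch.zeros(len(xs)), torch.ones(len(xt))]).long()
            losses.append(task + F.cross_entropy(self.doms[i](d_in), d_lab))
        # Soft maximum over per-source losses (smoothed stand-in for the hard max).
        return torch.logsumexp(gamma * torch.stack(losses), dim=0) / gamma

# Toy usage: three made-up labeled source batches and one unlabeled target batch.
model = MDANLike()
xs = [torch.randn(32, 100) for _ in range(3)]
ys = [torch.randint(0, 10, (32,)) for _ in range(3)]
loss = model(xs, ys, torch.randn(32, 100))
loss.backward()
```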
27. Traffic flow from a low frame rate city camera.
- Author
Toropov, Evgeny, Gui, Liangyan, Zhang, Shanghang, Kottur, Satwik, and Moura, Jose M. F.
- Published
- 2015
- Full Text
- View/download PDF
28. Bayesian model fusion: Enabling test cost reduction of analog/RF circuits via wafer-level spatial variation modeling.
- Author
Zhang, Shanghang, Li, Xin, Blanton, R. D., da Silva, Jose Machado, Carulli, John M., and Butler, Kenneth M.
- Published
- 2014
- Full Text
- View/download PDF
29. A high-throughput low-latency arithmetic encoder design for HDTV.
- Author
Li, Yuan, Zhang, Shanghang, Jia, Huizhu, Xie, Xiaodong, and Gao, Wen
- Published
- 2013
- Full Text
- View/download PDF
30. A flexible and high-performance hardware video encoder architecture.
- Author
Wei, Kaijin, Zhang, Shanghang, Jia, Huizhu, Xie, Don, and Gao, Wen
- Abstract
This paper presents a new video encoder architecture for H.264 and AVS, which adopts a novel macroblock (MB) encoding order. As a replacement of Level C+ zigzag coding order, the so-called Level C+ slash scan coding order with NOP insertion is used as MB scheduling to remove MB-level data dependency of the pipeline so that the left MB's coded results such as motion vector (MV) and reconstructed pixels can be obtained early in motion estimation (ME) stages. As a result, by sharing the reconstruction (REC) loop, sequential intra prediction (INTRA) can be split into multiple pipeline stages to explore more block-level parallelization and rate distortion optimization (RDO) based mode decision is apt to implement. The exact MV predictors (MVP) obtained in motion estimation can not only improve coding performance but also make pre-skip ME algorithm able to be applied into this architecture for low power applications. Since the proposed scheme is attributed to Level C+ data reuse, the bandwidth is decreased greatly. A real-time high-definition (HD) 1080P AVS encoder implementation on FPGA verification board with search range [−128, 128]×[−96, 96] and two reference frames at an operating frequency of 160 MHz validates the efficiency of proposed architecture. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
31. An Optimized Hardware Video Encoder for AVS with Level C+ Data Reuse Scheme for Motion Estimation.
- Author
Wei, Kaijin, Zhou, Rongwei, Zhang, Shanghang, Jia, Huizhu, Xie, Don, and Gao, Wen
- Abstract
In a hardware video encoder, Level C+ data reuse for motion estimation can reuse the two-dimensional overlapped search window (SW) and thus is a good choice to trade off memory bandwidth against on-chip buffer size. However, the irregular zigzag coding order brings additional difficulties to the encoder implementation. This paper mainly focuses on the special considerations for a Level C+ zigzag encoder. First, we present a guideline on how to select the Level C+ zigzag HFmVn scan for the adopted encoder pipeline. Second, according to the guideline, the zigzag HF5V3 coding order is applied to our Level C+ encoder, in which a new function is added to reorder the zigzag bit-stream into standard raster order, and the exact motion vector predictor (MVP) can be used for most macroblocks (MBs), except some corner MBs, to increase the coding performance. Third, zigzag-aware scheduling for prefetching the SW is proposed so that the pipeline is never disturbed by this irregular coding order and can run smoothly MB by MB. In addition, balancing the bandwidth across each MB processing period improves the bandwidth utilization. With these techniques, a real-time high-definition (HD) 1080P AVS encoder is successfully implemented on an FPGA verification board with a search range of [-128, 128]×[-96, 96] and two reference frames at an operating frequency of 160 MHz. [ABSTRACT FROM PUBLISHER]
- Published
- 2012
- Full Text
- View/download PDF
32. On a Highly Efficient RDO-Based Mode Decision Pipeline Design for AVS.
- Author
Zhu, Chuang, Jia, Huizhu, Zhang, Shanghang, Huang, Xiaofeng, Xie, Xiaodong, and Gao, Wen
- Abstract
Rate distortion optimization (RDO) is the best-known mode decision method, but its high implementation complexity limits its applications, and almost no real-time hardware encoder is truly full-featured RDO-based. In this paper, first, a full-featured RDO-based mode decision (MD) algorithm is presented, which makes more modes enter the RDO process. Second, the throughput of the RDO-based MD pipeline is thoroughly analyzed and modeled. Third, a highly efficient adaptive block-level pipelining architecture of RDO-based MD for the AVS video encoder is proposed, which can achieve the highest throughput to alleviate the RDO burden. Our design is described in high-level Verilog/VHDL hardware description language and implemented in SMIC 0.18-µm CMOS technology with 232 K logic gates and 85 Kb SRAMs. The implementation results validate our architectural design, and the proposed architecture can support real-time processing of 1080P@30 fps. The coding efficiency of our adopted method far outperforms (0.57 dB PSNR gain on average) the traditional low-complexity MD (LCMD) methods, and the throughput of our designed pipeline is increased by 11.3%, 19% and 17% for I, P and B frames, respectively, compared with the existing RDO-based architecture. [ABSTRACT FROM PUBLISHER]
- Published
- 2013
- Full Text
- View/download PDF
33. Associations of gestational thyrotropin levels with disease progression among pregnant women with differentiated thyroid cancer: a retrospective cohort study.
- Author
Li X, Fu P, Xiao WC, Mei F, Zhang F, Zhang S, Chen J, Shan R, Sun BK, Song SB, Yuan CH, and Liu Z
- Subjects
- Humans, Female, Pregnancy, Retrospective Studies, Adult, China epidemiology, Follow-Up Studies, Cohort Studies, Prognosis, Thyroid Neoplasms blood, Thyroid Neoplasms pathology, Thyroid Neoplasms surgery, Disease Progression, Thyrotropin blood, Pregnancy Complications, Neoplastic blood, Pregnancy Complications, Neoplastic pathology, Pregnancy Complications, Neoplastic surgery
- Abstract
Purpose: Pregnant women with a diagnosis of differentiated thyroid cancer (DTC) were potentially high-risk but largely ignored study population. We aimed to explore whether gestational thyrotropin levels were associated with progression of DTC., Methods: We conducted a retrospective cohort study at Peking University Third Hospital in Beijing, China from January 2012 to December 2022. We included pregnant women with a pre-pregnancy DTC managed by active surveillance (under-surveillance DTC) or surgical treatment (after-surgery DTC). Dynamic changes of gestational thyrotropin levels across multiple time points were characterized by both statistical (average level, change instability, longitudinal trajectory) and clinical (thyroid dysfunction, thyrotropin suppression, and achievement of thyrotropin suppression target) indicators. Outcomes were clinician-validated progression of DTC, measured separately for patients under surveillance (tumor enlargement or lymph node metastasis) and those after surgery (≥ 3 mm growth in the size of existing metastatic foci, development of new lymph node metastases, ≥ 2 mm growth in the size of existing cancer foci in the contralateral thyroid, or biochemical progression)., Results: Among 43 and 118 patients with under-surveillance and after-surgery DTC, we observed no evidence of associations between any of the quantitative or clinical indicators of gestational thyrotropin levels and progression-free survival, after a median of 2.63 (IQR: 0.90-4.73) and 4.22 (2.53-6.02) year follow-up, respectively (all P values > 0.05)., Conclusions: Gestational thyrotropin levels appeared to play a minor role in the progression of under-surveillance or after-surgery DTC. Clinicians might focus on the risk of adverse pregnancy outcomes when optimizing thyrotropin levels for pregnant women with a diagnosis of DTC., Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest., (Copyright © 2024 Li, Fu, Xiao, Mei, Zhang, Zhang, Chen, Shan, Sun, Song, Yuan and Liu.)
- Published
- 2024
- Full Text
- View/download PDF
34. Differentiated thyroid cancer and adverse pregnancy outcomes: a propensity score-matched retrospective cohort study.
- Author
Li X, Mei F, Xiao WC, Zhang F, Zhang S, Fu P, Chen J, Shan R, Sun BK, Song SB, Yuan C, and Liu Z
- Abstract
Background: Differentiated thyroid cancer (DTC) has been increasingly common in women of reproductive age. However, the evidence remains mixed regarding the association of DTC with adverse pregnancy outcomes in pregnant women previously diagnosed with DTC., Methods: We conducted a retrospective cohort study in the Peking University Third Hospital in Beijing, China between January 2012 and December 2022. We included singleton-pregnancy women with a pre-pregnancy DTC managed by surgical treatment (after-surgery DTC) or active surveillance (under-surveillance DTC). To reduce the confounding effects, we adopted a propensity score to match the after-surgery and under-surveillance DTC groups with the non-DTC group, respectively, on age, parity, gravidity, pre-pregnancy weight, height, and Hashimoto's thyroiditis. We used conditional logistics regressions, separately for the after-surgery and under-surveillance DTC groups, to estimate the adjusted associations of DTC with both the composite of adverse pregnancy outcomes and the specific mother-, neonate-, and placenta-related pregnancy outcomes., Results: After the propensity-score matching, the DTC and non-DTC groups were comparable in the measured confounders. In the after-surgery DTC group ( n = 204), the risk of the composite or specific adverse pregnancy outcomes was not significantly different from that of the matched, non-DTC groups ( n = 816; P > 0.05), and the results showed no evidence of difference across different maternal thyroid dysfunctions, gestational thyrotropin levels, and other pre-specified subgroup variables. We observed broadly similar results in the under-surveillance DTC group ( n = 37), except that the risk of preterm birth, preeclampsia, and delivering the low-birth-weight births was higher than that of the matched, non-DTC group [ n = 148; OR (95% CI): 4.79 (1.31, 17.59); 4.00 (1.16, 13.82); 6.67 (1.59, 27.90)]., Conclusions: DTC was not associated with adverse pregnancy outcomes in pregnant women previously treated for DTC. However, more evidence is urgently needed for pregnant women with under-surveillance DTC, which finding will be clinically significant in individualizing prenatal care., Competing Interests: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest., (© 2024 Li, Mei, Xiao, Zhang, Zhang, Fu, Chen, Shan, Sun, Song, Yuan and Liu.)
- Published
- 2024
- Full Text
- View/download PDF
35. Exploring Generalizable Distillation for Efficient Medical Image Segmentation.
- Author
Qi X, Wu Z, Zou W, Ren M, Gao Y, Sun M, Zhang S, Shan C, and Sun Z
- Subjects
- Humans, Algorithms, Neural Networks, Computer, Databases, Factual, Deep Learning, Image Processing, Computer-Assisted methods
- Abstract
Efficient medical image segmentation aims to provide accurate pixel-wise predictions with a lightweight implementation framework. However, existing lightweight networks generally overlook the generalizability of the cross-domain medical segmentation tasks. In this paper, we propose Generalizable Knowledge Distillation (GKD), a novel framework for enhancing the performance of lightweight networks on cross-domain medical segmentation by generalizable knowledge distillation from powerful teacher networks. Considering the domain gaps between different medical datasets, we propose the Model-Specific Alignment Networks (MSAN) to obtain the domain-invariant representations. Meanwhile, a customized Alignment Consistency Training (ACT) strategy is designed to promote the MSAN training. Based on the domain-invariant vectors in MSAN, we propose two generalizable distillation schemes, Dual Contrastive Graph Distillation (DCGD) and Domain-Invariant Cross Distillation (DICD). In DCGD, two implicit contrastive graphs are designed to model the intra-coupling and inter-coupling semantic correlations. Then, in DICD, the domain-invariant semantic vectors are reconstructed from two networks (i.e., teacher and student) with a crossover manner to achieve simultaneous generalization of lightweight networks, hierarchically. Moreover, a metric named Fréchet Semantic Distance (FSD) is tailored to verify the effectiveness of the regularized domain-invariant features. Extensive experiments conducted on the Liver, Retinal Vessel and Colonoscopy segmentation datasets demonstrate the superiority of our method, in terms of performance and generalization ability on lightweight networks.
- Published
- 2024
- Full Text
- View/download PDF
36. Biphasic Face Photo-Sketch Synthesis via Semantic-Driven Generative Adversarial Network With Graph Representation Learning.
- Author
Qi X, Sun M, Wang Z, Liu J, Li Q, Zhao F, Zhang S, and Shan C
- Abstract
Biphasic face photo-sketch synthesis has significant practical value in wide-ranging fields such as digital entertainment and law enforcement. Previous approaches directly generate the photo-sketch in a global view, they always suffer from the low quality of sketches and complex photograph variations, leading to unnatural and low-fidelity results. In this article, we propose a novel semantic-driven generative adversarial network to address the above issues, cooperating with graph representation learning. Considering that human faces have distinct spatial structures, we first inject class-wise semantic layouts into the generator to provide style-based spatial information for synthesized face photographs and sketches. In addition, to enhance the authenticity of details in generated faces, we construct two types of representational graphs via semantic parsing maps upon input faces, dubbed the intraclass semantic graph (IASG) and the interclass structure graph (IRSG). Specifically, the IASG effectively models the intraclass semantic correlations of each facial semantic component, thus producing realistic facial details. To preserve the generated faces being more structure-coordinated, the IRSG models interclass structural relations among every facial component by graph representation learning. To further enhance the perceptual quality of synthesized images, we present a biphasic interactive cycle training strategy by fully taking advantage of the multilevel feature consistency between the photograph and sketch. Extensive experiments demonstrate that our method outperforms the state-of-the-art competitors on the CUHK Face Sketch (CUFS) and CUHK Face Sketch FERET (CUFSF) datasets.
- Published
- 2023
- Full Text
- View/download PDF