281 results for "Wang, Zhangyang"
Search Results
2. Harnessing the power of longitudinal medical imaging for eye disease prognosis using Transformer-based sequence modeling
- Author
Holste, Gregory, Lin, Mingquan, Zhou, Ruiwen, Wang, Fei, Liu, Lei, Yan, Qi, Van Tassel, Sarah H., Kovacs, Kyle, Chew, Emily Y., Lu, Zhiyong, Wang, Zhangyang, and Peng, Yifan
- Published
- 2024
- Full Text
- View/download PDF
3. Efficient deep learning-based automated diagnosis from echocardiography with contrastive self-supervised learning
- Author
Holste, Gregory, Oikonomou, Evangelos K., Mortazavi, Bobak J., Wang, Zhangyang, and Khera, Rohan
- Published
- 2024
- Full Text
- View/download PDF
4. Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge
- Author
Holste, Gregory, Zhou, Yiliang, Wang, Song, Jaiswal, Ajay, Lin, Mingquan, Zhuge, Sherry, Yang, Yuzhe, Kim, Dongkyun, Nguyen-Mau, Trong-Hieu, Tran, Minh-Triet, Jeong, Jaehyup, Park, Wongi, Ryu, Jongbin, Hong, Feng, Verma, Arsh, Yamagishi, Yosuke, Kim, Changhyun, Seo, Hyeryeong, Kang, Myungjoo, Celi, Leo Anthony, Lu, Zhiyong, Summers, Ronald M., Shih, George, Wang, Zhangyang, and Peng, Yifan
- Published
- 2024
- Full Text
- View/download PDF
5. Saccharide mapping as an extraordinary method on characterization and identification of plant and fungi polysaccharides: A review
- Author
Ma, Yuntian, Zhang, Lichen, Ma, Xiaoyu, Bai, Ke, Tian, Zhuoer, Wang, Zhangyang, Muratkhan, Marat, Wang, Xin, Lü, Xin, and Liu, Manshun
- Published
- 2024
- Full Text
- View/download PDF
6. Improving model fairness in image-based computer-aided diagnosis
- Author
Lin, Mingquan, Li, Tianhao, Yang, Yifan, Holste, Gregory, Ding, Ying, Van Tassel, Sarah H., Kovacs, Kyle, Shih, George, Wang, Zhangyang, Lu, Zhiyong, Wang, Fei, and Peng, Yifan
- Published
- 2023
- Full Text
- View/download PDF
7. Troubleshooting image segmentation models with human-in-the-loop
- Author
Wang, Haotao, Chen, Tianlong, Wang, Zhangyang, and Ma, Kede
- Published
- 2023
- Full Text
- View/download PDF
8. Integrating the traffic science with representation learning for city-wide network congestion prediction
- Author
Zheng, Wenqing, Yang, Hao (Frank), Cai, Jiarui, Wang, Peihao, Jiang, Xuan, Du, Simon Shaolei, Wang, Yinhai, and Wang, Zhangyang
- Published
- 2023
- Full Text
- View/download PDF
9. Abstract 18776: ECG-GPT: Automated Complete Diagnosis Generation From ECG Images Using Novel Vision-Text Transformer Model
- Author
Khunte, Akshay, Sangha, Veer, Holste, Gregory, Dhingra, Lovedeep S, Aminorroaya, Arya, Wang, Zhangyang, and Khera, Rohan
- Published
- 2023
- Full Text
- View/download PDF
10. Conditional knockout of ASK1 in microglia/macrophages attenuates epileptic seizures and long-term neurobehavioural comorbidities by modulating the inflammatory responses of microglia/macrophages
- Author
Zhang, Yiying, Wang, Zhangyang, Wang, Rongrong, Xia, Lu, Cai, Yiying, Tong, Fangchao, Gao, Yanqin, Ding, Jing, and Wang, Xin
- Published
- 2022
- Full Text
- View/download PDF
11. Contrastive learning improves critical event prediction in COVID-19 patients
- Author
Wanyan, Tingyi, Honarvar, Hossein, Jaladanki, Suraj K., Zang, Chengxi, Naik, Nidhi, Somani, Sulaiman, De Freitas, Jessica K., Paranjpe, Ishan, Vaid, Akhil, Zhang, Jing, Miotto, Riccardo, Wang, Zhangyang, Nadkarni, Girish N., Zitnik, Marinka, Azad, Ariful, Wang, Fei, Ding, Ying, and Glicksberg, Benjamin S.
- Published
- 2021
- Full Text
- View/download PDF
12. Report on UG[formula omitted] challenge Track 1: Assessing algorithms to improve video object detection and classification from unconstrained mobility platforms
- Author
Banerjee, Sreya, VidalMata, Rosaura G., Wang, Zhangyang, and Scheirer, Walter J.
- Published
- 2021
- Full Text
- View/download PDF
13. Modeling user choice behavior under data corruption: Robust learning of the latent decision threshold model.
- Author
Lin, Feng, Qian, Xiaoning, Mortazavi, Bobak, Wang, Zhangyang, Huang, Shuai, and Chen, Cynthia
- Subjects
DATA corruption, MOBILE apps, PREDICTION models, ALGORITHMS, SUCCESS
- Abstract
Recent years have witnessed the emergence of many new mobile apps and user-centered systems that interact with users by offering choices with rewards. These applications have been promising to address challenging societal problems such as congestion in transportation and behavior changes for healthier lifestyles. Considerable research efforts have been devoted to modeling user behaviors in these new applications. However, as real-world user data is often prone to data corruption, the success of these models hinges on a robust learning method. Building on the recently proposed Latent Decision Threshold model, this article shows that, among the existing robust learning frameworks, the L0-norm-based framework can outperform other state-of-the-art methods in terms of prediction accuracy and model estimation. Based on the L0-norm framework, we further develop a user screening algorithm to identify potential bad actors. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
14. Abstract 13659: Automated Detection of Aortic Stenosis From Single-View 2-Dimensional Echocardiography Using a Semi-Supervised, Contrastive Learning Approach
- Author
Oikonomou, Evangelos K, Holste, Gregory, Mortazavi, Bobak, Wang, Zhangyang, and Khera, Rohan
- Published
- 2022
- Full Text
- View/download PDF
15. Exposing Semantic Segmentation Failures via Maximum Discrepancy Competition
- Author
Yan, Jiebin, Zhong, Yu, Fang, Yuming, Wang, Zhangyang, and Ma, Kede
- Published
- 2021
- Full Text
- View/download PDF
16. A Comprehensive Benchmark Analysis of Single Image Deraining: Current Challenges and Future Perspectives
- Author
Li, Siyuan, Ren, Wenqi, Wang, Feng, Araujo, Iago Breno, Tokuda, Eric K., Junior, Roberto Hirata, Cesar-Jr., Roberto M., Wang, Zhangyang, and Cao, Xiaochun
- Published
- 2021
- Full Text
- View/download PDF
17. A Multimodal Video-Based AI Biomarker for Aortic Stenosis Development and Progression.
- Author
Oikonomou, Evangelos K., Holste, Gregory, Yuan, Neal, Coppi, Andreas, McNamara, Robert L., Haynes, Norrisa A., Vora, Amit N., Velazquez, Eric J., Li, Fan, Menon, Venu, Kapadia, Samir R., Gill, Thomas M., Nadkarni, Girish N., Krumholz, Harlan M., Wang, Zhangyang, Ouyang, David, and Khera, Rohan
- Published
- 2024
- Full Text
- View/download PDF
18. Novel mutations in HINT1 gene cause the autosomal recessive axonal neuropathy with neuromyotonia
- Author
Wang, Zhangyang, Lin, Jie, Qiao, Kai, Cai, Shuang, Zhang, Victor W., Zhao, Chongbo, and Lu, Jiahong
- Published
- 2019
- Full Text
- View/download PDF
19. Biometric contrastive learning for data-efficient deep learning from electrocardiographic images.
- Author
Sangha, Veer, Khunte, Akshay, Holste, Gregory, Mortazavi, Bobak J, Wang, Zhangyang, Oikonomou, Evangelos K, and Khera, Rohan
- Abstract
Objective: Artificial intelligence (AI) detects heart disease from images of electrocardiograms (ECGs). However, traditional supervised learning is limited by the need for large amounts of labeled data. We report the development of Biometric Contrastive Learning (BCL), a self-supervised pretraining approach for label-efficient deep learning on ECG images.
Materials and Methods: Using pairs of ECGs from 78,288 individuals from Yale (2000-2015), we trained a convolutional neural network to identify temporally separated ECG pairs that varied in layouts from the same patient. We fine-tuned BCL-pretrained models to detect atrial fibrillation (AF), gender, and LVEF < 40%, using ECGs from 2015 to 2021. We externally tested the models in cohorts from Germany and the United States. We compared BCL with ImageNet initialization and general-purpose self-supervised contrastive learning for images (SimCLR).
Results: While with 100% labeled training data BCL performed similarly to other approaches for detecting AF/gender/LVEF < 40%, with an AUROC of 0.98/0.90/0.90 in the held-out test sets, it consistently outperformed other methods with smaller proportions of labeled data, reaching equivalent performance at 50% of the data. With 0.1% of the data, BCL achieved an AUROC of 0.88/0.79/0.75, compared with 0.51/0.52/0.60 (ImageNet) and 0.61/0.53/0.49 (SimCLR). In external validation, BCL outperformed other methods even at 100% labeled training data, with an AUROC of 0.88/0.88 for gender and LVEF < 40%, compared with 0.83/0.83 (ImageNet) and 0.84/0.83 (SimCLR).
Discussion and Conclusion: A pretraining strategy that leverages biometric signatures of different ECGs from the same patient enhances the efficiency of developing AI models for ECG images. This represents a major advance in detecting disorders from ECG images with limited labeled data. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
20. Physics-Driven Turbulence Image Restoration with Stochastic Refinement
- Author
Jaiswal, Ajay, Zhang, Xingguang, Chan, Stanley H., and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Image and Video Processing
- Abstract
Image distortion by atmospheric turbulence is a stochastic degradation, which is a critical problem in long-range optical imaging systems. A substantial body of research has been conducted over the past decades, including model-based and emerging deep-learning solutions aided by synthetic data. Although fast and physics-grounded simulation tools have recently been introduced to help deep-learning models adapt to real-world turbulence conditions, the training of such models relies only on synthetic data and ground truth pairs. This paper proposes the Physics-integrated Restoration Network (PiRN) to bring the physics-based simulator directly into the training process, helping the network disentangle the stochasticity from the degradation and the underlying image. Furthermore, to overcome the "average effect" introduced by deterministic models and the domain gap between synthetic and real-world degradation, we introduce PiRN with Stochastic Refinement (PiRN-SR) to boost its perceptual quality. Overall, our PiRN and PiRN-SR improve generalization to real-world unknown turbulence conditions and provide state-of-the-art restoration in both pixel-wise accuracy and perceptual quality. Our codes are available at https://github.com/VITA-Group/PiRN. Accepted by ICCV 2023.
- Published
- 2023
21. Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
- Author
Li, Guihong, Hoang, Duc, Bhardwaj, Kartikeya, Lin, Ming, Wang, Zhangyang, and Marculescu, Radu
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate the NAS from training requirements. The key idea behind zero-shot NAS approaches is to design proxies that predict the accuracies of the given networks without training network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical deep learning and have shown great potential on several NAS benchmark datasets. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies. Our source code and the related paper list are available on https://github.com/SLDGroup/survey-zero-shot-nas.
- Published
- 2023
22. FarSight: A Physics-Driven Whole-Body Biometric System at Large Distance and Altitude
- Author
Liu, Feng, Ashbaugh, Ryan, Chimitt, Nicholas, Hassan, Najmul, Hassani, Ali, Jaiswal, Ajay, Kim, Minchul, Mao, Zhiyuan, Perry, Christopher, Ren, Zhiyuan, Su, Yiyang, Varghaei, Pegah, Wang, Kai, Zhang, Xingguang, Chan, Stanley, Ross, Arun, Shi, Humphrey, Wang, Zhangyang, Jain, Anil, and Liu, Xiaoming
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Whole-body biometric recognition is an important area of research due to its vast applications in law enforcement, border security, and surveillance. This paper presents the end-to-end design, development and evaluation of FarSight, an innovative software system designed for whole-body (fusion of face, gait and body shape) biometric recognition. FarSight accepts videos from elevated platforms and drones as input and outputs a candidate list of identities from a gallery. The system is designed to address several challenges, including (i) low-quality imagery, (ii) large yaw and pitch angles, (iii) robust feature extraction to accommodate large intra-person variabilities and large inter-person similarities, and (iv) the large domain gap between training and test sets. FarSight combines the physics of imaging and deep learning models to enhance image restoration and biometric feature encoding. We test FarSight's effectiveness using the newly acquired IARPA Biometric Recognition and Identification at Altitude and Range (BRIAR) dataset. Notably, FarSight demonstrated a substantial performance increase on the BRIAR dataset, with gains of +11.82% Rank-20 identification and +11.3% TAR@1% FAR. 11 pages, 7 figures.
- Published
- 2023
23. Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models
- Author
Jaiswal, Ajay, Liu, Shiwei, Chen, Tianlong, Ding, Ying, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Large pre-trained transformers have received explosive attention in the past few years due to their wide adaptability for numerous downstream applications via fine-tuning, but their exponentially increasing parameter counts are becoming a primary hurdle to even just fine-tuning them without industry-standard hardware. Recently, the Lottery Ticket Hypothesis (LTH) and its variants have been exploited to prune these large pre-trained models, generating subnetworks that can achieve performance similar to their dense counterparts. However, LTH pragmatism is enormously inhibited by the repetitive full training and pruning routine of iterative magnitude pruning (IMP), which worsens with increasing model size. Motivated by recent observations of model soups, which suggest that fine-tuned weights of multiple models can be merged to a better minimum, we propose Instant Soup Pruning (ISP) to generate lottery-ticket-quality subnetworks at a fraction of the original IMP cost, by replacing the expensive intermediate pruning stages of IMP with a computationally efficient weak mask generation and aggregation routine. More specifically, during the mask generation stage, ISP takes a small handful of iterations using varying training protocols and data subsets to generate many weak and noisy subnetworks, and superposes them to average out the noise, creating a high-quality denoised subnetwork. Our extensive experiments and ablations on two popular large-scale pre-trained models, CLIP (unexplored in pruning to date) and BERT, across multiple benchmark vision and language datasets validate the effectiveness of ISP compared to several state-of-the-art pruning methods. Codes are available at https://github.com/VITA-Group/instant_soup. Accepted at ICML 2023.
- Published
- 2023
24. Graph Ladling: Shockingly Simple Parallel GNN Training without Intermediate Communication
- Author
Jaiswal, Ajay, Liu, Shiwei, Chen, Tianlong, Ding, Ying, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
Graphs are omnipresent, and GNNs are a powerful family of neural networks for learning over graphs. Despite their popularity, scaling GNNs either by deepening or widening suffers from prevalent issues of unhealthy gradients, over-smoothing, and information squashing, which often lead to sub-standard performance. In this work, we are interested in exploring a principled way to scale GNN capacity without deepening or widening, which can improve performance across multiple small and large graphs. Motivated by the recent intriguing phenomenon of model soups, which suggests that fine-tuned weights of multiple large pre-trained language models can be merged to a better minimum, we argue for exploiting the fundamentals of model soups to mitigate the aforementioned issues of memory bottleneck and trainability during GNN scaling. More specifically, we propose not to deepen or widen current GNNs, but instead present a data-centric perspective of model soups tailored for GNNs: to build powerful GNNs by dividing giant graph data so as to build multiple independently and parallelly trained, comparatively weaker GNNs without any intermediate communication, and then combining their strength using a greedy interpolation soup procedure to achieve state-of-the-art performance. Moreover, we provide a wide variety of model soup preparation techniques by leveraging state-of-the-art graph sampling and graph partitioning approaches that can handle large graph data structures. Our extensive experiments across many real-world small and large graphs illustrate the effectiveness of our approach and point towards a promising orthogonal direction for GNN scaling. Codes are available at https://github.com/VITA-Group/graph_ladling. Accepted at ICML 2023.
- Published
- 2023
25. Evaluate underdiagnosis and overdiagnosis bias of deep learning model on primary open-angle glaucoma diagnosis in under-served populations
- Author
Lin, Mingquan, Xiao, Yunyu, Hou, Bojian, Wanyan, Tingyi, Sharma, Mohit Manoj, Wang, Zhangyang, Wang, Fei, Van Tassel, Sarah, and Peng, Yifan
- Subjects
Articles
- Abstract
In the United States, primary open-angle glaucoma (POAG) is the leading cause of blindness, especially among African American and Hispanic individuals. Deep learning has been widely used to detect POAG using fundus images, as its performance is comparable to or even surpasses diagnosis by clinicians. However, human bias in clinical diagnosis may be reflected and amplified in the widely used deep learning models, thus impacting their performance. Biases may cause (1) underdiagnosis, increasing the risks of delayed or inadequate treatment, and (2) overdiagnosis, which may increase individuals' stress and fear, harm well-being, and lead to unnecessary or costly treatment. In this study, we examined underdiagnosis and overdiagnosis when applying deep learning to POAG detection based on the Ocular Hypertension Treatment Study (OHTS) from 22 centers across 16 states in the United States. Our results show that the widely used deep learning model can underdiagnose or overdiagnose under-served populations. The most underdiagnosed group is younger females (< 60 yrs), and the most overdiagnosed group is older Black individuals (≥ 60 yrs). Biased diagnosis through traditional deep learning methods may delay disease detection and treatment and create burdens among under-served populations, thereby raising ethical concerns about using deep learning models in ophthalmology clinics.
- Published
- 2023
26. Dynamic Sparsity Is Channel-Level Sparsity Learner
- Author
Yin, Lu, Li, Gen, Fang, Meng, Shen, Li, Huang, Tianjin, Wang, Zhangyang, Menkovski, Vlado, Ma, Xiaolong, Pechenizkiy, Mykola, and Liu, Shiwei
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Sparse training has received an upsurge of interest in machine learning due to its tantalizing saving potential for the entire training process as well as inference. Dynamic sparse training (DST), a leading sparse training approach, can train deep neural networks at high sparsity from scratch to match the performance of their dense counterparts. However, most if not all prior DST work demonstrates effectiveness on unstructured sparsity with highly irregular sparse patterns, which receive limited support in common hardware. This limitation hinders the usage of DST in practice. In this paper, we propose Channel-aware dynamic sparse (Chase), which for the first time seamlessly translates the promise of unstructured dynamic sparsity into GPU-friendly channel-level sparsity (not fine-grained N:M or group sparsity) during one end-to-end training process, without any ad-hoc operations. The resulting small sparse networks can be directly accelerated by commodity hardware, without using any particular sparsity-aware hardware accelerators. This appealing outcome is partially motivated by a hidden phenomenon of dynamic sparsity: off-the-shelf unstructured DST implicitly involves biased parameter reallocation across channels, with a large fraction of channels (up to 60%) being sparser than others. By progressively identifying and removing these channels during training, our approach translates unstructured sparsity into channel-wise sparsity. Our experimental results demonstrate that Chase achieves a 1.7× inference throughput speedup on common GPU devices without compromising accuracy, with ResNet-50 on ImageNet. We release our codes at https://github.com/luuyin/chase.
- Published
- 2023
27. POPE: 6-DoF Promptable Pose Estimation of Any Object, in Any Scene, with One Reference
- Author
Fan, Zhiwen, Pan, Panwang, Wang, Peihao, Jiang, Yifan, Xu, Dejia, Jiang, Hanwen, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Despite significant progress in six degrees-of-freedom (6DoF) object pose estimation, existing methods have limited applicability in real-world scenarios involving embodied agents and downstream 3D vision tasks. These limitations mainly come from the necessity of 3D models, closed-category detection, and a large number of densely annotated support views. To mitigate this issue, we propose a general paradigm for object pose estimation, called Promptable Object Pose Estimation (POPE). POPE enables zero-shot 6DoF object pose estimation for any target object in any scene, with only a single reference adopted as the support view. To achieve this, POPE leverages the power of a pre-trained large-scale 2D foundation model and employs a framework with hierarchical feature representation and 3D geometry principles. Moreover, it estimates the relative camera pose between object prompts and the target object in new views, enabling both two-view and multi-view 6DoF pose estimation tasks. Comprehensive experimental results demonstrate that POPE exhibits unrivaled robust performance in zero-shot settings, achieving a significant reduction in averaged Median Pose Error of 52.38% and 50.47% on the LINEMOD and OnePose datasets, respectively. We also conduct more challenging tests on casually captured images (see Figure 1), which further demonstrate the robustness of POPE. The project page can be found at https://paulpanwang.github.io/POPE/.
- Published
- 2023
28. MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition
- Author
Gong, Xinyu, Mohan, Sreyas, Dhingra, Naina, Bazin, Jean-Charles, Li, Yilei, Wang, Zhangyang, and Ranjan, Rakesh
- Subjects
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this paper, we study a novel problem in egocentric action recognition, which we term "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security and efficiency considerations in real-world applications: (1) missing modality generalization, where some modalities that were present during training are missing during inference, and (2) cross-modal zero-shot generalization, where the modalities present during inference and training are disjoint. To enable this investigation, we construct a new dataset, MMG-Ego4D, containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from the Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research on the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research on multimodal generalization problems. The benchmark and code will be available at https://github.com/facebookresearch/MMG_Ego4D. Accepted to CVPR 2023.
- Published
- 2023
29. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity
- Author
Liu, Shiwei, Chen, Tianlong, Chen, Xiaohan, Chen, Xuxi, Xiao, Qiao, Wu, Boqian, Pechenizkiy, Mykola, Mocanu, Decebal Constantin, and Wang, Zhangyang
- Abstract
Transformers have quickly shined in the computer vision world since the emergence of Vision Transformers (ViTs). The dominant role of convolutional neural networks (CNNs) seems to be challenged by increasingly effective transformer-based models. Very recently, a couple of advanced convolutional models strike back with large kernels motivated by the local-window attention mechanism, showing appealing performance and efficiency. While one of them, i.e. RepLKNet, impressively manages to scale the kernel size to 31x31 with improved performance, the performance starts to saturate as the kernel size continues growing, compared to the scaling trend of advanced ViTs such as Swin Transformer. In this paper, we explore the possibility of training extreme convolutions larger than 31x31 and test whether the performance gap can be eliminated by strategically enlarging convolutions. This study ends up with a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale up kernels to 61x61 with better performance. Built on this recipe, we propose Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse factorized 51x51 kernels that can perform on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures like ConvNeXt and RepLKNet, on ImageNet classification as well as a wide range of downstream tasks including semantic segmentation on ADE20K, object detection on PASCAL VOC 2007, and object detection/segmentation on MS COCO.
- Published
- 2023
30. Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation
- Author
Zheng, Wenqing, Sharan, S P, Jaiswal, Ajay Kumar, Wang, Kevin, Xi, Yihan, Xu, Dejia, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Programming Languages, Computer Science - Artificial Intelligence, Programming Languages (cs.PL), Machine Learning (cs.LG)
- Abstract
For a complicated algorithm, its implementation by a human programmer usually starts with outlining a rough control flow, followed by iterative enrichments, eventually yielding carefully generated syntactic structures and variables in a hierarchy. However, state-of-the-art large language models generate code in a single pass, without intermediate warm-ups to reflect the structured thought process of "outline-then-detail". Inspired by the recent success of chain-of-thought prompting, we propose ChainCoder, a program synthesis language model that generates Python code progressively, i.e., from coarse to fine in multiple passes. We first decompose source code into layout frame components and accessory components via abstract syntax tree parsing to construct a hierarchical representation. We then reform our prediction target into a multi-pass objective, where each pass generates a subsequence that is concatenated in the hierarchy. Finally, a tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples. Extensive evaluations show that ChainCoder outperforms state-of-the-art methods, demonstrating that our progressive generation eases the reasoning procedure and guides the language model to generate higher-quality solutions. Our codes are available at https://github.com/VITA-Group/ChainCoder. Accepted at ICML 2023.
- Published
- 2023
31. Graph Mixture of Experts: Learning on Large-Scale Graphs with Explicit Diversity Modeling
- Author
Wang, Haotao, Jiang, Ziyu, Han, Yan, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
Graph neural networks (GNNs) have been widely applied to learning over graph data. Yet, real-world graphs commonly exhibit diverse graph structures and contain heterogeneous nodes and edges. Moreover, to enhance the generalization ability of GNNs, it has become common practice to further increase the diversity of training graph structures by incorporating graph augmentations and/or performing large-scale pre-training on more graphs. Therefore, it becomes essential for a GNN to simultaneously model diverse graph structures. Yet, naively increasing GNN model capacity will suffer from both higher inference costs and the notorious trainability issue of GNNs. This paper introduces the Mixture-of-Experts (MoE) idea to GNNs, aiming to enhance their ability to accommodate the diversity of training graph structures without incurring computational overheads. Our new Graph Mixture of Experts (GMoE) model enables each node in the graph to dynamically select its own optimal information aggregation experts. These experts are trained to model different subgroups of graph structures in the training set. Additionally, GMoE includes information aggregation experts with varying aggregation hop sizes, where the experts with larger hop sizes are specialized in capturing information over longer ranges. The effectiveness of GMoE is verified through experimental results on a large variety of graph, node, and link prediction tasks in the OGB benchmark. For instance, it enhances ROC-AUC by 1.81% on ogbg-molhiv and by 1.40% on ogbg-molbbbp, compared to the non-MoE baselines. Our code is available at https://github.com/VITA-Group/Graph-Mixture-of-Experts. Preprint.
- Published
- 2023
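To make the per-node routing idea in the GMoE abstract above concrete, here is a hypothetical, minimal sketch in plain Python (not the authors' implementation): each node scores a small set of mean-aggregation "experts" with different hop sizes and routes to the highest-scoring one. The `gate` function and the toy path graph are made up purely for illustration.

```python
# Toy sketch of per-node mixture-of-experts routing over graph aggregators.
# Hypothetical minimal illustration -- not the GMoE reference implementation.

def neighbors(adj, v, hops):
    """Return the set of nodes within `hops` hops of v (excluding v)."""
    frontier, seen = {v}, {v}
    for _ in range(hops):
        frontier = {u for w in frontier for u in adj[w]} - seen
        seen |= frontier
    return seen - {v}

def aggregate(feats, adj, v, hops):
    """Expert: mean of neighbor features within `hops` hops."""
    nbrs = neighbors(adj, v, hops)
    if not nbrs:
        return feats[v]
    return sum(feats[u] for u in nbrs) / len(nbrs)

def gmoe_layer(feats, adj, gate):
    """Each node routes to the expert (hop size) its gate scores highest."""
    out = {}
    for v in feats:
        scores = {h: gate(feats[v], h) for h in (1, 2)}
        best_hop = max(scores, key=scores.get)
        out[v] = aggregate(feats, adj, v, best_hop)
    return out

# Tiny path graph 0-1-2-3 with scalar node features.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
feats = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}
# Hypothetical gate: low-feature nodes prefer the 1-hop expert, others 2-hop.
gate = lambda x, h: -abs(x - h)
out = gmoe_layer(feats, adj, gate)
```

In the full model the gate is learned and experts are trained message-passing functions; here both are stand-ins to show the routing mechanics only.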
32. Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
- Author
-
Khachatryan, Levon, Movsisyan, Andranik, Tadevosyan, Vahram, Henschel, Roberto, Wang, Zhangyang, Navasardyan, Shant, and Shi, Humphrey
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background temporally consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero
- Published
- 2023
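The cross-frame attention reprogramming described in the abstract above can be sketched in toy form: instead of each frame attending to its own keys/values, every frame's queries attend to the keys/values of the first frame, anchoring appearance to it. This is an illustrative scalar-feature sketch under assumed shapes, not the paper's latent-diffusion code.

```python
# Toy sketch of "cross-frame attention on the first frame": all frames' queries
# attend to keys/values computed from frame 0, so foreground appearance stays
# anchored to the first frame. Hypothetical illustration, not the paper's code.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scalar-feature attention: out[i] = sum_j softmax_j(q_i * k_j) * v_j."""
    out = []
    for q in queries:
        w = softmax([q * k for k in keys])
        out.append(sum(wi * vi for wi, vi in zip(w, values)))
    return out

def cross_frame_attention(frames):
    """frames: list of dicts; every frame's queries use frame 0's keys/values."""
    k0, v0 = frames[0]["k"], frames[0]["v"]
    return [attention(f["q"], k0, v0) for f in frames]

frames = [
    {"q": [0.0, 0.0], "k": [0.0, 0.0], "v": [1.0, 3.0]},  # frame 0 (anchor)
    {"q": [5.0]},                                          # a later frame
]
out = cross_frame_attention(frames)
```

With uniform keys the attention weights are uniform, so every query receives the mean of frame 0's values, which makes the anchoring behavior easy to verify by hand.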
33. Severe aortic stenosis detection by deep learning applied to echocardiography.
- Author
-
Holste, Gregory, Oikonomou, Evangelos K, Mortazavi, Bobak J, Coppi, Andreas, Faridi, Kamil F, Miller, Edward J, Forrest, John K, McNamara, Robert L, Ohno-Machado, Lucila, Yuan, Neal, Gupta, Aakriti, Ouyang, David, Krumholz, Harlan M, Wang, Zhangyang, and Khera, Rohan
- Subjects
DEEP learning, AORTIC stenosis, ECHOCARDIOGRAPHY, RECEIVER operating characteristic curves, CONVOLUTIONAL neural networks
- Abstract
Background and Aims: Early diagnosis of aortic stenosis (AS) is critical to prevent morbidity and mortality but requires skilled examination with Doppler imaging. This study reports the development and validation of a novel deep learning model that relies on two-dimensional (2D) parasternal long axis videos from transthoracic echocardiography without Doppler imaging to identify severe AS, suitable for point-of-care ultrasonography. Methods and Results: In a training set of 5257 studies (17 570 videos) from 2016 to 2020 [Yale-New Haven Hospital (YNHH), Connecticut], an ensemble of three-dimensional convolutional neural networks was developed to detect severe AS, leveraging self-supervised contrastive pretraining for label-efficient model development. This deep learning model was validated in a temporally distinct set of 2040 consecutive studies from 2021 from YNHH as well as two geographically distinct cohorts of 4226 and 3072 studies, from California and other hospitals in New England, respectively. The deep learning model achieved an area under the receiver operating characteristic curve (AUROC) of 0.978 (95% CI: 0.966, 0.988) for detecting severe AS in the temporally distinct test set, maintaining its diagnostic performance in geographically distinct cohorts [0.952 AUROC (95% CI: 0.941, 0.963) in California and 0.942 AUROC (95% CI: 0.909, 0.966) in New England]. The model was interpretable, with saliency maps identifying the aortic valve, mitral annulus, and left atrium as the predictive regions. Among non-severe AS cases, predicted probabilities were associated with worse quantitative metrics of AS, suggesting an association with various stages of AS severity. Conclusion: This study developed and externally validated an automated approach for severe AS detection using single-view 2D echocardiography, with potential utility for point-of-care screening. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
34. Increasing brain N‐acetylneuraminic acid alleviates hydrocephalus‐induced neurological deficits.
- Author
-
Wang, Zhangyang, Nie, Xiaoqun, Gao, Fang, Tang, Yanmin, Ma, Yuanyuan, Zhang, Yiying, Gao, Yanqin, Yang, Chen, Ding, Jing, and Wang, Xin
- Subjects
MULTIVARIATE analysis, WHITE matter (Nerve tissue), CEREBROSPINAL fluid, HYDROCEPHALUS, DEMYELINATION
- Abstract
Aims: This metabolomic study aimed to evaluate the role of N‐acetylneuraminic acid (Neu5Ac) in the neurological deficits of normal pressure hydrocephalus (NPH) and its potential therapeutic effect. Methods: We analyzed the metabolic profiles of NPH using cerebrospinal fluid with multivariate and univariate statistical analyses in a set of 42 NPH patients and 38 controls. We further correlated the levels of differential metabolites with severity‐related clinical parameters, including the normal pressure hydrocephalus grading scale (NPHGS). We then established kaolin‐induced hydrocephalus in mice and treated them using N‐acetylmannosamine (ManNAc), a precursor of Neu5Ac. We examined brain Neu5Ac, astrocyte polarization, demyelination, and neurobehavioral outcomes to explore its therapeutic effect. Results: Three metabolites were significantly altered in NPH patients. Only decreased Neu5Ac levels were correlated with NPHGS scores. Decreased brain Neu5Ac levels have been observed in hydrocephalic mice. Increasing brain Neu5Ac by ManNAc suppressed the activation of astrocytes and promoted their transition from A1 to A2 polarization. ManNAc also attenuated the periventricular white matter demyelination and improved neurobehavioral outcomes in hydrocephalic mice. Conclusion: Increasing brain Neu5Ac improved the neurological outcomes associated with the regulation of astrocyte polarization and the suppression of demyelination in hydrocephalic mice, which may be a potential therapeutic strategy for NPH. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
35. Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations
- Author
-
Jiang, Ziyu, Chen, Yinpeng, Liu, Mengchen, Chen, Dongdong, Dai, Xiyang, Yuan, Lu, Liu, Zicheng, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recently, both Contrastive Learning (CL) and Masked Image Modeling (MIM) demonstrate that self-supervision is powerful for learning good representations. However, naively combining them is far from success. In this paper, we start by making the empirical observation that a naive joint optimization of CL and MIM losses leads to conflicting gradient directions - more severe as the layers go deeper. This motivates us to shift the paradigm from combining losses at the end, to choosing the proper learning method per network layer. Inspired by experimental observations, we find that MIM and CL are suited to lower and higher layers, respectively. We hence propose to combine them in a surprisingly simple, "sequential cascade" fashion: early layers are first trained under one MIM loss, on top of which later layers continue to be trained under another CL loss. The proposed Layer Grafted Pre-training learns good visual representations that demonstrate superior label efficiency in downstream applications, in particular yielding strong few-shot performance besides linear evaluation. For instance, on ImageNet-1k, Layer Grafted Pre-training yields 65.5% Top-1 accuracy in terms of 1% few-shot learning with ViT-B/16, which improves the MIM and CL baselines by 14.4% and 2.1% with no bells and whistles. The code is available at https://github.com/VITA-Group/layerGraftedPretraining_ICLR23.git. (Accepted at ICLR 2023)
- Published
- 2023
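The "sequential cascade" described in the abstract above amounts to a two-stage schedule: train the lower layers under one loss, then freeze them and train the upper layers under another. Below is a hypothetical scalar-model sketch with hand-derived gradients; the two stage losses stand in for MIM and CL only schematically.

```python
# Skeletal two-stage "grafting" schedule: stage 1 trains the lower parameter,
# stage 2 freezes it and trains the upper parameter under a different loss.
# Hypothetical toy, not the Layer Grafted Pre-training implementation.

def sgd(params, grad_fn, frozen, lr=0.1, steps=50):
    """Plain gradient descent that skips updates for frozen parameter indices."""
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p if i in frozen else p - lr * g
                  for i, (p, g) in enumerate(zip(params, grads))]
    return params

# Model: y = w1 * (w0 * x). Stage 1 ("MIM-like"): fit w0 so w0*x reconstructs x.
# Stage 2 ("CL-like"): freeze w0, fit w1 so the output matches a target.
x, target = 2.0, 6.0

def stage1_grads(p):          # gradient of (w0*x - x)^2 w.r.t. (w0, w1)
    w0, w1 = p
    return [2 * (w0 * x - x) * x, 0.0]

def stage2_grads(p):          # gradient of (w1*w0*x - target)^2 w.r.t. (w0, w1)
    w0, w1 = p
    return [0.0, 2 * (w1 * w0 * x - target) * w0 * x]

params = sgd([0.0, 0.0], stage1_grads, frozen={1})   # train lower layer first
params = sgd(params, stage2_grads, frozen={0})       # graft upper layer on top
```

After stage 1 the lower weight converges to 1 (perfect reconstruction), and stage 2 then fits the upper weight to 3 on top of the frozen lower layer.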
36. You Only Transfer What You Share: Intersection-Induced Graph Transfer Learning for Link Prediction
- Author
-
Zheng, Wenqing, Huang, Edward W, Rao, Nikhil, Wang, Zhangyang, and Subbian, Karthik
- Subjects
Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Science - Social and Information Networks, Machine Learning (cs.LG)
- Abstract
Link prediction is central to many real-world applications, but its performance may be hampered when the graph of interest is sparse. To alleviate issues caused by sparsity, we investigate a previously overlooked phenomenon: in many cases, a densely connected, complementary graph can be found for the original graph. The denser graph may share nodes with the original graph, which offers a natural bridge for transferring selective, meaningful knowledge. We identify this setting as Graph Intersection-induced Transfer Learning (GITL), which is motivated by practical applications in e-commerce or academic co-authorship predictions. We develop a framework to effectively leverage the structural prior in this setting. We first create an intersection subgraph using the shared nodes between the two graphs, then transfer knowledge from the source-enriched intersection subgraph to the full target graph. In the second step, we consider two approaches: a modified label propagation, and a multi-layer perceptron (MLP) model in a teacher-student regime. Experimental results on proprietary e-commerce datasets and open-source citation graphs show that the proposed workflow outperforms existing transfer learning baselines that do not explicitly utilize the intersection structure., Accepted in TMLR (https://openreview.net/forum?id=Nn71AdKyYH)
- Published
- 2023
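The first step of the GITL workflow described above, forming the intersection subgraph from the nodes shared by the two graphs, can be sketched with plain set operations. This is an illustrative edge-set representation, not the paper's code.

```python
# Sketch of the GITL intersection step: find nodes shared by a sparse target
# graph and a denser source graph, then keep edges from both graphs whose
# endpoints all lie in the shared node set. Hypothetical minimal version.

def intersection_subgraph(source_edges, target_edges):
    shared = ({u for e in source_edges for u in e} &
              {u for e in target_edges for u in e})
    keep = lambda edges: {e for e in edges
                          if e[0] in shared and e[1] in shared}
    # Union of both graphs' edges restricted to the shared node set:
    return shared, keep(source_edges) | keep(target_edges)

source = {(1, 2), (2, 3), (3, 4)}   # denser, complementary graph
target = {(2, 3), (3, 9)}           # sparse graph of interest
nodes, edges = intersection_subgraph(source, target)
```

In the paper's second step, knowledge learned on this source-enriched subgraph is then transferred to the full target graph (via label propagation or a teacher-student MLP); the sketch covers only the subgraph construction.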
37. Learning to Generalize Provably in Learning to Optimize
- Author
-
Yang, Junjie, Chen, Tianlong, Zhu, Mingkang, He, Fengxiang, Tao, Dacheng, Liang, Yingbin, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
- Abstract
Learning to optimize (L2O) has gained increasing popularity, automating the design of optimizers by data-driven approaches. However, current L2O methods often suffer from poor generalization performance in at least two respects: (i) applying the L2O-learned optimizer to unseen optimizees, in terms of lowering their loss function values (optimizer generalization, or "generalizable learning of optimizers"); and (ii) the test performance of an optimizee (itself a machine learning model), trained by the optimizer, in terms of the accuracy over unseen data (optimizee generalization, or "learning to generalize"). While optimizer generalization has been recently studied, optimizee generalization (or learning to generalize) has not been rigorously studied in the L2O context, which is the aim of this paper. We first theoretically establish an implicit connection between the local entropy and the Hessian, and hence unify their roles in the handcrafted design of generalizable optimizers as equivalent metrics of the landscape flatness of loss functions. We then propose to incorporate these two metrics as flatness-aware regularizers into the L2O framework in order to meta-train optimizers to learn to generalize, and theoretically show that such generalization ability can be learned during the L2O meta-training process and then transformed to the optimizee loss function. Extensive experiments consistently validate the effectiveness of our proposals with substantially improved generalization on multiple sophisticated L2O models and diverse optimizees. Our code is available at: https://github.com/VITA-Group/Open-L2O/tree/main/Model_Free_L2O/L2O-Entropy. (Accepted at AISTATS 2023)
- Published
- 2023
38. Ten Lessons We Have Learned in the New 'Sparseland': A Short Handbook for Sparse Neural Network Researchers
- Author
-
Liu, Shiwei and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
- Abstract
This article does not propose any novel algorithm or new hardware for sparsity. Instead, it aims to serve the "common good" for the increasingly prosperous Sparse Neural Network (SNN) research community. We attempt to summarize some of the most common confusions in SNNs that one may come across in various scenarios, such as paper review/rebuttal and talks - many drawn from the authors' own bittersweet experiences! We feel that doing so is meaningful and timely, since the focus of SNN research is notably shifting from traditional pruning to more diverse and profound forms of sparsity before, during, and after training. The intricate relationships between their scopes, assumptions, and approaches lead to misunderstandings, for non-experts or even experts in SNNs. In response, we summarize ten Q&As of SNNs from many key aspects, including dense vs. sparse, unstructured sparse vs. structured sparse, pruning vs. sparse training, dense-to-sparse training vs. sparse-to-sparse training, static sparsity vs. dynamic sparsity, before-training/during-training vs. post-training sparsity, and many more. We strive to provide proper and generically applicable answers to clarify those confusions to the best extent possible. We hope our summary provides useful general knowledge for people who want to enter and engage with this exciting community; and also provides some "mind of ease" convenience for SNN researchers to explain their work in the right contexts. At the very least (and perhaps as this article's most insignificant target functionality), if you are writing/planning to write a paper or rebuttal in the field of SNNs, we hope some of our answers could help you!
- Published
- 2023
39. Evaluate underdiagnosis and overdiagnosis bias of deep learning model on primary open-angle glaucoma diagnosis in under-served patient populations
- Author
-
Lin, Mingquan, Xiao, Yuyun, Hou, Bojian, Wanyan, Tingyi, Sharma, Mohit Manoj, Wang, Zhangyang, Wang, Fei, Van Tassel, Sarah, and Peng, Yifan
- Subjects
FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
In the United States, primary open-angle glaucoma (POAG) is the leading cause of blindness, especially among African American and Hispanic individuals. Deep learning has been widely used to detect POAG using fundus images, as its performance is comparable to or even surpasses diagnosis by clinicians. However, human bias in clinical diagnosis may be reflected and amplified in the widely used deep learning models, thus impacting their performance. Biases may cause (1) underdiagnosis, increasing the risks of delayed or inadequate treatment, and (2) overdiagnosis, which may increase individuals' stress and fear, harm their well-being, and lead to unnecessary or costly treatment. In this study, we examined underdiagnosis and overdiagnosis when applying deep learning in POAG detection, based on the Ocular Hypertension Treatment Study (OHTS) from 22 centers across 16 states in the United States. Our results show that the widely used deep learning model can underdiagnose or overdiagnose under-served populations. The most underdiagnosed group is younger females (< 60 yrs), and the most overdiagnosed group is older Black individuals (>= 60 yrs). Biased diagnosis through traditional deep learning methods may delay disease detection and treatment and create burdens among under-served populations, thereby raising ethical concerns about using deep learning models in ophthalmology clinics. (9 pages, 2 figures; accepted by AMIA 2023 Informatics Summit)
- Published
- 2023
40. You Can Have Better Graph Neural Networks by Not Training Weights at All: Finding Untrained Graph Tickets
- Author
-
Huang, Tianjin, Chen, Tianlong, Fang, Meng, Menkovski, Vlado, Zhao, Jiaxu, Yin, Lu, Pei, Yulong, Mocanu, Decebal Constantin, Wang, Zhangyang, Pechenizkiy, Mykola, and Liu, Shiwei
- Abstract
Recent works have impressively demonstrated that there exists a subnetwork in randomly initialized convolutional neural networks (CNNs) that can match the performance of the fully trained dense networks at initialization, without any optimization of the weights of the network (i.e., untrained networks). However, the presence of such untrained subnetworks in graph neural networks (GNNs) still remains mysterious. In this paper, we carry out the first-of-its-kind exploration of discovering matching untrained GNNs. With sparsity as the core tool, we can find untrained sparse subnetworks at initialization that can match the performance of fully trained dense GNNs. Besides this already encouraging finding of comparable performance, we show that the found untrained subnetworks can substantially mitigate the GNN over-smoothing problem, hence becoming a powerful tool to enable deeper GNNs without bells and whistles. We also observe that such sparse untrained subnetworks have appealing performance in out-of-distribution detection and robustness to input perturbations. We evaluate our method across widely used GNN architectures on various popular datasets, including the Open Graph Benchmark (OGB).
- Published
- 2022
41. Augmentations in Hypergraph Contrastive Learning: Fabricated and Generative
- Author
-
Wei, Tianxin, You, Yuning, Chen, Tianlong, Shen, Yang, He, Jingrui, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Article, Machine Learning (cs.LG)
- Abstract
This paper targets improving the generalizability of hypergraph neural networks in the low-label regime, through applying the contrastive learning approach from images/graphs (we refer to it as HyperGCL). We focus on the following question: How to construct contrastive views for hypergraphs via augmentations? We provide solutions in two parts. First, guided by domain knowledge, we fabricate two schemes to augment hyperedges with higher-order relations encoded, and adopt three vertex augmentation strategies from graph-structured data. Second, in search of more effective views in a data-driven manner, we for the first time propose a hypergraph generative model to generate augmented views, and then an end-to-end differentiable pipeline to jointly learn hypergraph augmentations and model parameters. Our technical innovations are reflected in designing both fabricated and generative augmentations of hypergraphs. The experimental findings include: (i) Among fabricated augmentations in HyperGCL, augmenting hyperedges provides the most numerical gains, implying that higher-order information in structures is usually more downstream-relevant; (ii) Generative augmentations do better in preserving higher-order information to further benefit generalizability; (iii) HyperGCL also boosts robustness and fairness in hypergraph representation learning. Codes are released at https://github.com/weitianxin/HyperGCL. (NeurIPS 2022; supplementary materials: https://weitianxin.github.io/files/neurips22_hypergcl_appendix.pdf)
- Published
- 2022
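One of the fabricated hypergraph augmentations discussed above, dropping hyperedges to create a contrastive view, can be sketched in a few lines. This is a hypothetical toy (hyperedges as frozensets of vertices), not the HyperGCL implementation.

```python
# Toy fabricated augmentation for hypergraph contrastive learning: build a
# view by randomly dropping hyperedges (the higher-order relations).
# Hypothetical sketch, not the HyperGCL code.
import random

def drop_hyperedges(hyperedges, drop_ratio, rng):
    """Keep each hyperedge with probability 1 - drop_ratio."""
    kept = [e for e in hyperedges if rng.random() >= drop_ratio]
    return kept or hyperedges[:1]  # never return an empty view

H = [frozenset({0, 1, 2}), frozenset({1, 3}), frozenset({2, 3, 4})]
rng = random.Random(0)
view1 = drop_hyperedges(H, 0.3, rng)
view2 = drop_hyperedges(H, 0.3, rng)
```

In a contrastive setup, `view1` and `view2` would be encoded separately and pulled together in embedding space; the paper's generative variant instead learns the augmentation distribution end to end.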
42. NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360° Views
- Author
-
Xu, Dejia, Jiang, Yifan, Wang, Peihao, Fan, Zhiwen, Wang, Yi, and Wang, Zhangyang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Virtual reality and augmented reality (XR) bring increasing demand for 3D content. However, creating high-quality 3D content requires tedious work that a human expert must do. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360° views that correspond well with the given reference image. By conditioning on the reference image, our model can fulfill the everlasting curiosity for synthesizing novel views of objects from images. Our technique sheds light on a promising direction of easing the workflows for 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. By introducing a ranking loss, our NeuralLift-360 can be guided with rough depth estimation in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior to provide coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/
- Published
- 2022
43. StyleNAT: Giving Each Head a New Perspective
- Author
-
Walton, Steven, Hassani, Ali, Xu, Xingqian, Wang, Zhangyang, and Shi, Humphrey
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Image generation has been a long sought-after but challenging task, and performing the generation task in an efficient manner is similarly difficult. Often researchers attempt to create a "one size fits all" generator, where there are few differences in the parameter space for drastically different datasets. Herein, we present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility. At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information, which is achieved through using Neighborhood Attention (NA). With different heads able to pay attention to varying receptive fields, the model is able to better combine this information and adapt, in a highly flexible manner, to the data at hand. StyleNAT attains a new SOTA FID score on FFHQ-256 with 2.046, beating prior art with convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin, and a new transformer SOTA on FFHQ-1024 with an FID score of 4.174. These results show a 6.4% improvement on FFHQ-256 scores when compared to StyleGAN-XL, with a 28% reduction in the number of parameters and a 56% improvement in sampling throughput. Code and models will be open-sourced at https://github.com/SHI-Labs/StyleNAT.
- Published
- 2022
44. Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
- Author
-
Wu, Junru, Liang, Yi, Han, Feng, Akbari, Hassan, Wang, Zhangyang, and Yu, Cong
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM)
- Abstract
Self-supervised pre-training recently demonstrates success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce the feature consistency from cross-modality inputs, such as video/audio or video/text pairs. Despite its convenience to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and the semantic misalignment would only be more unpredictable for raw videos from the internet. We conjecture that this might cause conflicts and biases among modalities, and may hence prohibit CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating them as a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying different CMA loss gradients for each sample triplet, so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness, to develop a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying these techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube8M dataset to further improve the state of the art. (Accepted at NeurIPS 2022)
- Published
- 2022
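The gradient realignment idea in the abstract above can be illustrated in the spirit of generic gradient-surgery methods: when two loss gradients conflict (negative dot product), remove from one the component that opposes the other. The exact procedure and parameter names here are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of cross-modality gradient realignment: if two CMA loss gradients
# conflict (negative dot product), project one onto the normal plane of the
# other so their directions no longer oppose. Hypothetical illustration in the
# spirit of gradient-surgery techniques; not the paper's exact procedure.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def realign(g1, g2):
    """Return g1 with its component along a conflicting g2 removed."""
    d = dot(g1, g2)
    if d >= 0:          # no conflict: leave the gradient untouched
        return list(g1)
    scale = d / dot(g2, g2)
    return [x - scale * y for x, y in zip(g1, g2)]

g_va = [1.0, -2.0]   # e.g. a video/audio alignment gradient (made-up values)
g_vt = [1.0, 1.0]    # e.g. a video/text alignment gradient (made-up values)
g_va_fixed = realign(g_va, g_vt)
```

After realignment the two gradients are orthogonal rather than opposed, so a shared encoder update no longer trades one alignment loss against the other; the paper's second technique additionally uses the conflict magnitude to order training samples.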
45. M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design
- Author
-
Liang, Hanxue, Fan, Zhiwen, Sarkar, Rishov, Jiang, Ziyu, Chen, Tianlong, Zou, Kai, Cheng, Yu, Hao, Cong, and Wang, Zhangyang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems that are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to just execute a single task. Yet most real systems demand only one or two tasks at each moment, and switch between tasks as needed: therefore such all-tasks-activated inference is also highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M³ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. Then, at inference with any task of interest, the same design allows for activating only the task-corresponding sparse expert pathway, instead of the full model. Our new model design is further enhanced by hardware-level innovations, in particular, a novel computation reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. When executing single-task inference, M³ViT achieves higher accuracies than encoder-focused MTL methods, while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.
- Published
- 2022
46. Old can be Gold: Better Gradient Flow can Make Vanilla-GCNs Great Again
- Author
-
Jaiswal, Ajay, Wang, Peihao, Chen, Tianlong, Rousseau, Justin F., Ding, Ying, and Wang, Zhangyang
- Subjects
Social and Information Networks (cs.SI), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Social and Information Networks, Machine Learning (cs.LG)
- Abstract
Despite the enormous success of Graph Convolutional Networks (GCNs) in modeling graph-structured data, most current GCNs are shallow due to the notoriously challenging problems of over-smoothing and information squashing, along with the conventional difficulties caused by vanishing gradients and over-fitting. Previous works have primarily focused on the study of over-smoothing and over-squashing phenomena in training deep GCNs. Surprisingly, in comparison with CNNs/RNNs, very limited attention has been given to understanding how healthy gradient flow can benefit the trainability of deep GCNs. In this paper, we first provide a new perspective of gradient flow to understand the substandard performance of deep GCNs and hypothesize that by facilitating healthy gradient flow, we can significantly improve their trainability, as well as achieve state-of-the-art (SOTA) level performance from vanilla GCNs. Next, we argue that blindly adopting the Glorot initialization for GCNs is not optimal, and derive a topology-aware isometric initialization scheme for vanilla GCNs based on the principles of isometry. Additionally, contrary to the ad-hoc addition of skip-connections, we propose gradient-guided dynamic rewiring of vanilla GCNs with skip connections. Our dynamic rewiring method uses the gradient flow within each layer during training to introduce on-demand skip-connections adaptively. We provide extensive empirical evidence across multiple datasets that our methods improve gradient flow in deep vanilla GCNs and significantly boost their performance to comfortably compete with and outperform many fancy state-of-the-art methods. Codes are available at: https://github.com/VITA-Group/GradientGCN. (NeurIPS 2022)
- Published
- 2022
47. Long-Tailed Classification of Thorax Diseases on Chest X-Ray: A New Benchmark Study
- Author
-
Holste, Gregory, Wang, Song, Jiang, Ziyu, Shen, Thomas C., Shih, George, Summers, Ronald M., Peng, Yifan, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
- Abstract
Imaging exams, such as chest radiography, will yield a small set of common findings and a much larger set of uncommon findings. While a trained radiologist can learn the visual presentation of rare conditions by studying a few representative examples, teaching a machine to learn from such a "long-tailed" distribution is much more difficult, as standard methods would be easily biased toward the most frequent classes. In this paper, we present a comprehensive benchmark study of the long-tailed learning problem in the specific domain of thorax diseases on chest X-rays. We focus on learning from naturally distributed chest X-ray data, optimizing classification accuracy over not only the common "head" classes, but also the rare yet critical "tail" classes. To accomplish this, we introduce a challenging new long-tailed chest X-ray benchmark to facilitate research on developing long-tailed learning methods for medical image classification. The benchmark consists of two chest X-ray datasets for 19- and 20-way thorax disease classification, containing classes with as many as 53,000 and as few as 7 labeled training images. We evaluate both standard and state-of-the-art long-tailed learning methods on this new benchmark, analyzing which aspects of these methods are most beneficial for long-tailed medical image classification and summarizing insights for future algorithm design. The datasets, trained models, and code are available at https://github.com/VITA-Group/LongTailCXR., DALI 2022 (MICCAI workshop)
- Published
- 2022
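One standard long-tailed baseline of the kind such benchmarks evaluate is class-balanced loss re-weighting via the "effective number of samples", (1 - β^n) / (1 - β). The sketch below uses made-up class counts, not the benchmark's actual label distribution.

```python
# Class-balanced loss weights from the "effective number of samples":
# rare ("tail") classes receive larger weights than frequent ("head") classes.
# Illustrative sketch with invented counts; not tied to the CXR benchmark data.

def class_balanced_weights(counts, beta=0.9999):
    """Weight each class by the inverse of its effective sample count."""
    eff = [(1.0 - beta ** n) / (1.0 - beta) for n in counts]
    w = [1.0 / e for e in eff]
    mean_w = sum(w) / len(w)
    return [x / mean_w for x in w]  # normalize so weights average to 1

counts = [53000, 1200, 7]          # hypothetical head, medium, tail class sizes
weights = class_balanced_weights(counts)
```

These weights would typically multiply the per-class terms of a cross-entropy loss, countering the bias toward frequent classes that the abstract describes.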
48. Is Attention All That NeRF Needs?
- Author
-
T, Mukund Varma, Wang, Peihao, Chen, Xuxi, Chen, Tianlong, Venugopalan, Subhashini, and Wang, Zhangyang
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
We present Generalizable NeRF Transformer (GNT), a transformer-based architecture that reconstructs Neural Radiance Fields (NeRFs) and learns to render novel views on the fly from source views. While prior works on NeRFs optimize a scene representation by inverting a handcrafted rendering equation, GNT achieves neural representation and rendering that generalizes across scenes using transformers at two stages. (1) The view transformer leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. (2) The ray transformer renders novel views using attention to decode the features from the view transformer along the sampled points during ray marching. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, thanks to the learned ray renderer. When trained on multiple scenes, GNT consistently achieves state-of-the-art performance when transferring to unseen scenes and outperforms all other methods by ~10% on average. Our analysis of the learned attention maps to infer depth and occlusion indicates that attention enables learning a physically-grounded rendering. Our results show the promise of transformers as a universal modeling tool for graphics. Please refer to our project page for video results: https://vita-group.github.io/GNT/., International Conference on Learning Representations (ICLR), 2023
- Published
- 2022
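The ray transformer described above decodes features along sampled ray points with attention. A minimal single-head attention sketch over one ray's point features (omitting positional encodings, multi-head splits, and everything else GNT-specific; weight matrices here are random placeholders) might look like:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ray_attention(point_feats, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over the features of
    points sampled along one ray during ray marching."""
    q = point_feats @ Wq                      # queries, (n_points, d)
    k = point_feats @ Wk                      # keys
    v = point_feats @ Wv                      # values
    scores = q @ k.T / np.sqrt(q.shape[-1])   # pairwise attention logits
    attn = softmax(scores, axis=-1)           # each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))  # 8 samples along a ray, 16-dim features
W = [rng.normal(size=(16, 16)) for _ in range(3)]
out, attn = ray_attention(feats, *W)
```

In the full model, the attended ray features are further pooled and mapped to a pixel color, replacing the handcrafted volume-rendering integral.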
49. Self-Supervised Learning of Echocardiogram Videos Enables Data-Efficient Clinical Diagnosis
- Author
-
Holste, Gregory, Oikonomou, Evangelos K., Mortazavi, Bobak, Wang, Zhangyang, and Khera, Rohan
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Given the difficulty of obtaining high-quality labels for medical image recognition tasks, there is a need for deep learning techniques that can be adequately fine-tuned on small labeled data sets. Recent advances in self-supervised learning have shown that such an in-domain representation learning approach can provide a strong initialization for supervised fine-tuning, proving far more data-efficient than standard transfer learning from a supervised pretraining task. However, these techniques have not been adapted to medical diagnostics captured in a video format. With this progress in mind, we developed a self-supervised learning approach catered to echocardiogram videos with the goal of learning strong representations for downstream fine-tuning on the task of diagnosing aortic stenosis (AS), a common and dangerous disease of the aortic valve. When fine-tuned on 1% of the training data, our best self-supervised learning model achieves 0.818 AUC (95% CI: 0.794, 0.840), while the standard transfer learning approach reaches 0.644 AUC (95% CI: 0.610, 0.677). We also find that our self-supervised model attends more closely to the aortic valve when predicting severe AS, as demonstrated by saliency map visualizations., Accepted to IMLH 2022 (https://sites.google.com/view/imlh2022)
- Published
- 2022
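Contrastive self-supervised pretraining of the kind described above typically pulls together embeddings of two augmented views of the same clip and pushes apart embeddings of different clips. A generic SimCLR-style NT-Xent sketch (not the paper's exact recipe or hyperparameters) illustrates the idea:

```python
import numpy as np

def ntxent_loss(z1, z2, tau=0.1):
    """InfoNCE/NT-Xent loss: z1[i] and z2[i] are embeddings of two
    augmented views of the same clip; all other pairs are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau            # (N, N) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives lie on the diagonal: view 1 of clip i matches view 2 of clip i
    return -np.mean(np.diag(log_prob))
```

Pretraining with a loss like this yields the encoder that is then fine-tuned on the small labeled AS data set.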
50. Single Frame Atmospheric Turbulence Mitigation: A Benchmark Study and A New Physics-Inspired Transformer Model
- Author
-
Mao, Zhiyuan, Jaiswal, Ajay, Wang, Zhangyang, and Chan, Stanley H.
- Subjects
FOS: Computer and information sciences ,Computer Vision and Pattern Recognition (cs.CV) ,Image and Video Processing (eess.IV) ,FOS: Electrical engineering, electronic engineering, information engineering ,Computer Science - Computer Vision and Pattern Recognition ,Electrical Engineering and Systems Science - Image and Video Processing - Abstract
Image restoration algorithms for atmospheric turbulence are known to be much more challenging to design than traditional ones for blur or noise, because the distortion caused by the turbulence is an entanglement of spatially varying blur, geometric distortion, and sensor noise. Existing CNN-based restoration methods built upon convolutional kernels with static weights are insufficient to handle the spatially varying atmospheric turbulence effect. To address this problem, in this paper, we propose a physics-inspired transformer model for imaging through atmospheric turbulence. The proposed network utilizes the power of transformer blocks to jointly extract a dynamical turbulence distortion map and restore a turbulence-free image. In addition, recognizing the lack of a comprehensive dataset, we collect and present two new real-world turbulence datasets that allow for evaluation with both classical objective metrics (e.g., PSNR and SSIM) and a new task-driven metric using text recognition accuracy. Both real testing sets and all related code will be made publicly available., This paper is accepted as a poster at ECCV 2022
- Published
- 2022
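One of the classical objective metrics mentioned above, PSNR, is simple to state: it is the log-scaled ratio of the peak signal power to the mean squared restoration error. A minimal reference implementation (standard formula, not code from the paper) is:

```python
import numpy as np

def psnr(ref, restored, max_val=255.0):
    """Peak signal-to-noise ratio in dB between a reference image
    and a restored image; higher means a closer restoration."""
    mse = np.mean((ref.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

SSIM, the other metric cited, additionally compares local luminance, contrast, and structure rather than raw pixel error.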