Author: "Zhai, Xiaohua" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhai, Xiaohua"' showing total 185 results

1. PaliGemma 2: A Family of Versatile VLMs for Transfer

Author: Steiner, Andreas, Pinto, André Susano, Tschannen, Michael, Keysers, Daniel, Wang, Xiao, Bitton, Yonatan, Gritsenko, Alexey, Minderer, Matthias, Sherbondy, Anthony, Long, Shangbang, Qin, Siyang, Ingle, Reeve, Bugliarello, Emanuele, Kazemzadeh, Sahar, Mesnard, Thomas, Alabdulmohsin, Ibrahim, Beyer, Lucas, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
Published: 2024

2. PaliGemma: A versatile 3B VLM for transfer

Author: Beyer, Lucas, Steiner, Andreas, Pinto, André Susano, Kolesnikov, Alexander, Wang, Xiao, Salz, Daniel, Neumann, Maxim, Alabdulmohsin, Ibrahim, Tschannen, Michael, Bugliarello, Emanuele, Unterthiner, Thomas, Keysers, Daniel, Koppula, Skanda, Liu, Fangyu, Grycner, Adam, Gritsenko, Alexey, Houlsby, Neil, Kumar, Manoj, Rong, Keran, Eisenschlos, Julian, Kabra, Rishabh, Bauer, Matthias, Bošnjak, Matko, Chen, Xi, Minderer, Matthias, Voigtlaender, Paul, Bica, Ioana, Balazevic, Ivana, Puigcerver, Joan, Papalampidi, Pinelopi, Henaff, Olivier, Xiong, Xi, Soricut, Radu, Harmsen, Jeremiah, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.
Comment: v2 adds Appendix H and I and a few citations
Published: 2024

3. Toward a Diffusion-Based Generalist for Dense Vision Tasks

Author: Fan, Yue, Xian, Yongqin, Zhai, Xiaohua, Kolesnikov, Alexander, Naeem, Muhammad Ferjad, Schiele, Bernt, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show performance competitive with other vision generalists.
Comment: Published at CVPR 2024 as a workshop paper
Published: 2024

4. No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models

Author: Pouget, Angéline, Beyer, Lucas, Bugliarello, Emanuele, Wang, Xiao, Steiner, Andreas Peter, Zhai, Xiaohua, and Alabdulmohsin, Ibrahim
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by - and even at odds with - the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs. Our work underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.
Comment: 17 pages, 5 figures, 4 tables. 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Published: 2024

5. LocCa: Visual Pretraining with Location-aware Captioners

Author: Wan, Bo, Tschannen, Michael, Xian, Yongqin, Pavetic, Filip, Alabdulmohsin, Ibrahim, Wang, Xiao, Pinto, André Susano, Steiner, Andreas, Beyer, Lucas, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image captioning has been shown to be an effective pretraining method similar to contrastive pretraining. However, the incorporation of location-aware information into visual pretraining remains an area with limited research. In this paper, we propose a simple visual pretraining method with location-aware captioners (LocCa). LocCa uses a simple image captioner task interface to teach a model to read out rich information, i.e., bounding box coordinates and captions, conditioned on the image pixel input. Thanks to the multitask capabilities of an encoder-decoder architecture, we show that an image captioner can easily handle multiple tasks during pretraining. Our experiments demonstrate that LocCa outperforms standard captioners significantly on localization downstream tasks while maintaining comparable performance on holistic tasks.
Published: 2024

6. CLIP the Bias: How Useful is Balancing Data in Multimodal Learning?

Author: Alabdulmohsin, Ibrahim, Wang, Xiao, Steiner, Andreas, Goyal, Priya, D'Amour, Alexander, and Zhai, Xiaohua
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We study the effectiveness of data-balancing for mitigating biases in contrastive language-image pretraining (CLIP), identifying areas of strength and limitation. First, we reaffirm prior conclusions that CLIP models can inadvertently absorb societal stereotypes. To counter this, we present a novel algorithm, called Multi-Modal Moment Matching (M4), designed to reduce both representation and association biases (i.e. in first- and second-order statistics) in multimodal data. We use M4 to conduct an in-depth analysis taking into account various factors, such as the model, representation, and data size. Our study also explores the dynamic nature of how CLIP learns and unlearns biases. In particular, we find that fine-tuning is effective in countering representation biases, though its impact diminishes for association biases. Also, data balancing has a mixed impact on quality: it tends to improve classification but can hurt retrieval. Interestingly, data and architectural improvements seem to mitigate the negative impact of data balancing on performance; e.g. applying M4 to SigLIP-B/16 with data quality filters improves COCO image-to-text retrieval @5 from 86% (without data balancing) to 87% and ImageNet 0-shot classification from 77% to 77.5%! Finally, we conclude with recommendations for improving the efficacy of data balancing in multimodal systems.
Comment: 32 pages, 20 figures, 7 tables
Published: 2024

7. SILC: Improving Vision Language Pretraining with Self-distillation

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, and Tombari, Federico
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series: Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

8. SILC: Improving Vision Language Pretraining with Self-Distillation

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation, while also providing improvements on image-level tasks such as classification and retrieval. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation. We further show that SILC features greatly benefit open vocabulary detection, captioning and visual question answering.
Published: 2023
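Note: the SILC abstract above mentions distilling from an exponential moving average (EMA) teacher. As a rough illustration of what an EMA teacher update generally looks like, here is a minimal pure-Python sketch. It is not the SILC implementation; the function name `ema_update`, the flat-list parameter representation, and the `momentum` value are assumptions made for this example.

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """Return teacher parameters moved slightly toward the student.

    Both arguments are flat lists of floats (a stand-in for model weights).
    `momentum` close to 1.0 makes the teacher change slowly; the value 0.99
    is an illustrative assumption, not a setting taken from the paper.
    """
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_params, student_params)
    ]


# Example: the teacher drifts toward the student over repeated updates.
teacher = [1.0, 1.0]
student = [0.0, 2.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
```

In a self-distillation setup the teacher is typically not trained by gradient descent; it only receives updates of this kind after each student step.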

9. PaLI-3 Vision Language Models: Smaller, Faster, Stronger

Author: Chen, Xi, Wang, Xiao, Beyer, Lucas, Kolesnikov, Alexander, Wu, Jialin, Voigtlaender, Paul, Mustafa, Basil, Goodman, Sebastian, Alabdulmohsin, Ibrahim, Padlewski, Piotr, Salz, Daniel, Xiong, Xi, Vlasic, Daniel, Pavetic, Filip, Rong, Keran, Yu, Tianli, Keysers, Daniel, Zhai, Xiaohua, and Soricut, Radu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
Published: 2023

10. Image Captioners Are Scalable Vision Learners Too

Author: Tschannen, Michael, Kumar, Manoj, Steiner, Andreas, Zhai, Xiaohua, Houlsby, Neil, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive pretraining on image-text pairs from the web is one of the most popular large-scale pretraining strategies for vision backbones, especially in the context of large multimodal models. At the same time, image captioning on this type of data is commonly considered an inferior pretraining strategy. In this paper, we perform a fair comparison of these two pretraining strategies, carefully matching training data, compute, and model capacity. Using a standard encoder-decoder transformer, we find that captioning alone is surprisingly effective: on classification tasks, captioning produces vision encoders competitive with contrastively pretrained encoders, while surpassing them on vision & language tasks. We further analyze the effect of the model architecture and scale, as well as the pretraining data on the representation quality, and find that captioning exhibits the same or better scaling behavior along these axes. Overall our results show that plain image captioning is a more powerful pretraining strategy than was previously believed.
Comment: Accepted at NeurIPS 2023. v2 adds SugarCrepe results and more ablations, v3 has minor fixes. v4 adds a code link (https://github.com/google-research/big_vision). v5 has minor fixes
Published: 2023

11. PaLI-X: On Scaling up a Multilingual Vision and Language Model

Author: Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, and Soricut, Radu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Published: 2023

12. Three Towers: Flexible Contrastive Learning with Pretrained Image Models

Author: Kossen, Jannik, Collier, Mark, Mustafa, Basil, Wang, Xiao, Zhai, Xiaohua, Beyer, Lucas, Steiner, Andreas, Berent, Jesse, Jenatton, Rodolphe, and Kokiopoulou, Efi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT (Zhai et al., 2022) has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits from training the image tower contrastively. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining.
Comment: Accepted for publication at NeurIPS 2023
Published: 2023

13. Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design

Author: Alabdulmohsin, Ibrahim, Zhai, Xiaohua, Kolesnikov, Alexander, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, I.2.10, I.2.6
Abstract: Scaling laws have been recently employed to derive compute-optimal model size (number of parameters) for a given compute duration. We advance and refine such methods to infer compute-optimal model shapes, such as width and depth, and successfully implement this in vision transformers. Our shape-optimized vision transformer, SoViT, achieves results competitive with models that exceed twice its size, despite being pre-trained with an equivalent amount of compute. For example, SoViT-400m/14 achieves 90.3% fine-tuning accuracy on ILSVRC2012, surpassing the much larger ViT-g/14 and approaching ViT-G/14 under identical settings, at less than half the inference cost. We conduct a thorough evaluation across multiple tasks, such as image classification, captioning, VQA and zero-shot transfer, demonstrating the effectiveness of our model across a broad range of domains and identifying limitations. Overall, our findings challenge the prevailing approach of blindly scaling up vision models and pave a path for more informed scaling.
Comment: 10 pages, 7 figures, 9 tables. Version 2: Layout fixes
Published: 2023

14. A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

Author: Beyer, Lucas, Wan, Bo, Madan, Gagan, Pavetic, Filip, Steiner, Andreas, Kolesnikov, Alexander, Pinto, André Susano, Bugliarello, Emanuele, Wang, Xiao, Yu, Qihang, Chen, Liang-Chieh, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: There has been a recent explosion of computer vision models which perform many tasks and are composed of an image encoder (usually a ViT) and an autoregressive decoder (usually a Transformer). However, most of this work simply presents one system and its results, leaving many questions regarding design decisions and trade-offs of such systems unanswered. In this work, we aim to provide such answers. We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision, including classification, captioning, visual question answering, and optical character recognition. Through extensive systematic experiments, we study the effects of task and data mixture, training and regularization hyperparameters, conditioning type and specificity, modality combination, and more. Importantly, we compare these to well-tuned single-task baselines to highlight the cost incurred by multi-tasking. A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well. We call this setup locked-image tuning with decoder (LiT-decoder). It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
Published: 2023

15. Sigmoid Loss for Language Image Pre-Training

Author: Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP). Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days. The disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We release our models at https://github.com/google-research/big_vision and hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
Comment: ICCV'23 Oral. arXiv v2: fix typo in pseudocode; v3: clarify t vs t' init; v4: add SigLIP Base, Large, Shape-Optimized 400M results. Models released at: https://github.com/google-research/big_vision. Xiaohua and Lucas contributed equally
Published: 2023
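Note: the pairwise sigmoid loss summarized in the SigLIP abstract above can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' released code (see the big_vision repository for that). The function names and the default `temperature` and `bias` values are assumptions chosen for this example. Each image-text pair is scored independently with a binary label (+1 for a matching pair, -1 otherwise), so no softmax normalization over the whole batch is needed.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec):
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_pairwise_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Average (per example) sum of binary log-losses over all image-text pairs.

    image_embs[i] and text_embs[i] are assumed to be a matching pair.
    `temperature` and `bias` are fixed here for illustration; in practice
    they would be learnable scalars.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        for j, txt in enumerate(texts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0  # positive on the diagonal only
            total -= log_sigmoid(label * logit)
    return total / len(images)


# Matching pairs should give a lower loss than deliberately mismatched ones.
aligned = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Because every term depends only on one image-text pair, the double loop can be split across devices without gathering a full similarity matrix, which is the property the abstract refers to when it says the loss needs no global view for normalization.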

16. Tuning computer vision models with task rewards

Author: Pinto, André Susano, Kolesnikov, Alexander, Shi, Yuge, Beyer, Lucas, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning. We believe this approach has the potential to be widely useful for better aligning models with a diverse range of computer vision tasks.
Comment: 11 pages
Published: 2023

17. Scaling Vision Transformers to 22 Billion Parameters

Author: Dehghani, Mostafa, Djolonga, Josip, Mustafa, Basil, Padlewski, Piotr, Heek, Jonathan, Gilmer, Justin, Steiner, Andreas, Caron, Mathilde, Geirhos, Robert, Alabdulmohsin, Ibrahim, Jenatton, Rodolphe, Beyer, Lucas, Tschannen, Michael, Arnab, Anurag, Wang, Xiao, Riquelme, Carlos, Minderer, Matthias, Puigcerver, Joan, Evci, Utku, Kumar, Manoj, van Steenkiste, Sjoerd, Elsayed, Gamaleldin F., Mahendran, Aravindh, Yu, Fisher, Oliver, Avital, Huot, Fantine, Bastings, Jasmijn, Collier, Mark Patrick, Gritsenko, Alexey, Birodkar, Vighnesh, Vasconcelos, Cristina, Tay, Yi, Mensink, Thomas, Kolesnikov, Alexander, Pavetić, Filip, Tran, Dustin, Kipf, Thomas, Lučić, Mario, Zhai, Xiaohua, Keysers, Daniel, Harmsen, Jeremiah, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
Published: 2023

18. Guest Editorial: Special Issue on the Promises and Dangers of Large Vision Models

Author: Zhou, Kaiyang, Liu, Ziwei, Zhai, Xiaohua, Li, Chunyuan, and Saenko, Kate
Published: 2024

19. FlexiViT: One Model for All Patch Sizes

Author: Beyer, Lucas, Izmailov, Pavel, Kolesnikov, Alexander, Caron, Mathilde, Kornblith, Simon, Zhai, Xiaohua, Minderer, Matthias, Tschannen, Michael, Alabdulmohsin, Ibrahim, and Pavetic, Filip
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision.
Comment: Code and pre-trained models available at https://github.com/google-research/big_vision. All authors made significant technical contributions. CVPR 2023
Published: 2022

20. PaLI: A Jointly-Scaled Multilingual Language-Image Model

Author: Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, Kolesnikov, Alexander, Puigcerver, Joan, Ding, Nan, Rong, Keran, Akbari, Hassan, Mishra, Gaurav, Xue, Linting, Thapliyal, Ashish, Bradbury, James, Kuo, Weicheng, Seyedhosseini, Mojtaba, Jia, Chao, Ayan, Burcu Karagol, Riquelme, Carlos, Steiner, Andreas, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, and Soricut, Radu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
Comment: ICLR 2023 (Notable-top-5%)
Published: 2022

21. Revisiting Neural Scaling Laws in Language and Vision

Author: Alabdulmohsin, Ibrahim, Neyshabur, Behnam, and Zhai, Xiaohua
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.
Published: 2022

22. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes

Author: Kolesnikov, Alexander, Pinto, André Susano, Beyer, Lucas, Zhai, Xiaohua, Harmsen, Jeremiah, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks. In contrast to previous models, UViM has the same functional form for all tasks; it requires no task-specific modifications which require extensive human expertise. The approach involves two components: (I) a base model (feed-forward) which is trained to directly predict raw vision outputs, guided by a learned discrete code and (II) a language model (autoregressive) that is trained to generate the guiding code. These components complement each other: the language model is well-suited to modeling structured interdependent data, while the base model is efficient at dealing with high-dimensional outputs. We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks: panoptic segmentation, depth prediction and image colorization, where we achieve competitive and near state-of-the-art results. Our experimental results suggest that UViM is a promising candidate for a unified modeling approach in computer vision.
Comment: 22 pages. Accepted at NeurIPS 2022
Published: 2022

23. Simple Open-Vocabulary Object Detection with Vision Transformers

Author: Minderer, Matthias, Gritsenko, Alexey, Stone, Austin, Neumann, Maxim, Weissenborn, Dirk, Dosovitskiy, Alexey, Mahendran, Aravindh, Arnab, Anurag, Dehghani, Mostafa, Shen, Zhuoran, Wang, Xiao, Zhai, Xiaohua, Kipf, Thomas, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
Comment: ECCV 2022 camera-ready version
Published: 2022

24. Better plain ViT baselines for ImageNet-1k

Author: Beyer, Lucas, Zhai, Xiaohua, and Kolesnikov, Alexander
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day., Comment: Code available at https://github.com/google-research/big_vision
Published: 2022

25. A Simple Single-Scale Vision Transformer for Object Localization and Instance Segmentation

Author: Chen, Wuyang, Du, Xianzhi, Yang, Fan, Beyer, Lucas, Zhai, Xiaohua, Lin, Tsung-Yi, Chen, Huizhong, Li, Jing, Song, Xiaodan, Wang, Zhangyang, and Zhou, Denny
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work presents a simple vision transformer design as a strong baseline for object localization and instance segmentation tasks. Transformers have recently demonstrated competitive performance in image classification tasks. To adapt ViT to object detection and dense prediction tasks, many works inherit the multistage design from convolutional networks and use highly customized ViT architectures. Behind this design, the goal is to pursue a better trade-off between computational cost and effective aggregation of multiscale global contexts. However, existing works adopt the multistage architectural design as a black-box solution without a clear understanding of its true benefits. In this paper, we comprehensively study three architecture design choices on ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy. We further complete a scaling rule to optimize our model's trade-off between accuracy and computation cost / model size. By leveraging a constant feature resolution and hidden size throughout the encoder blocks, we propose a simple and compact ViT architecture called Universal Vision Transformer (UViT) that achieves strong performance on COCO object detection and instance segmentation tasks., Comment: ECCV 2022 accepted
Published: 2021

26. LiT: Zero-Shot Transfer with Locked-image text Tuning

Author: Zhai, Xiaohua, Wang, Xiao, Mustafa, Basil, Steiner, Andreas, Keysers, Daniel, Kolesnikov, Alexander, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set., Comment: Xiaohua, Xiao, Basil, Andreas and Lucas contributed equally; CVPR 2022
Published: 2021

27. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Author: Steiner, Andreas, Kolesnikov, Alexander, Zhai, Xiaohua, Wightman, Ross, Uszkoreit, Jakob, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset., Comment: Andreas, Alex, Xiaohua and Lucas contributed equally. We release more than 50'000 ViT models trained under diverse settings on various datasets. Available at https://github.com/google-research/big_vision, https://github.com/google-research/vision_transformer and https://github.com/rwightman/pytorch-image-models. TMLR review at https://openreview.net/forum?id=4nPswr1KcP
Published: 2021

28. Revisiting the Calibration of Modern Neural Networks

Author: Minderer, Matthias, Djolonga, Josip, Romijnders, Rob, Hubis, Frances, Zhai, Xiaohua, Houlsby, Neil, Tran, Dustin, and Lucic, Mario
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties., Comment: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)
Published: 2021

29. Knowledge distillation: A good teacher is patient and consistent

Author: Beyer, Lucas, Zhai, Xiaohua, Royer, Amélie, Markeeva, Larisa, Anil, Rohan, and Kolesnikov, Alexander
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not aim to necessarily propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings by a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy., Comment: Lucas, Xiaohua, Amélie, Larisa, and Alex contributed equally; CVPR 2022
Published: 2021

30. Scaling Vision Transformers

Author: Zhai, Xiaohua, Kolesnikov, Alexander, Houlsby, Neil, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model's scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class., Comment: Xiaohua, Alex, and Lucas contributed equally; CVPR 2022
Published: 2021

31. MLP-Mixer: An all-MLP Architecture for Vision

Author: Tolstikhin, Ilya, Houlsby, Neil, Kolesnikov, Alexander, Beyer, Lucas, Zhai, Xiaohua, Unterthiner, Thomas, Yung, Jessica, Steiner, Andreas, Keysers, Daniel, Uszkoreit, Jakob, Lucic, Mario, and Dosovitskiy, Alexey
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Convolutional Neural Networks (CNNs) are the go-to model for computer vision. Recently, attention-based networks, such as the Vision Transformer, have also become popular. In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them is necessary. We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information). When trained on large datasets, or with modern regularization schemes, MLP-Mixer attains competitive scores on image classification benchmarks, with pre-training and inference cost comparable to state-of-the-art models. We hope that these results spark further research beyond the realms of well established CNNs and Transformers., Comment: v2: Fixed parameter counts in Table 1. v3: Added results on JFT-3B in Figure 2(right); Added Section 3.4 on the input permutations. v4: Updated the x label in Figure 2(right)
Published: 2021

32. SI-Score: An image dataset for fine-grained analysis of robustness to object location, rotation and size

Author: Yung, Jessica, Romijnders, Rob, Kolesnikov, Alexander, Beyer, Lucas, Djolonga, Josip, Houlsby, Neil, Gelly, Sylvain, Lucic, Mario, and Zhai, Xiaohua
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Before deploying machine learning models it is critical to assess their robustness. In the context of deep neural networks for image understanding, changing the object location, rotation and size may affect the predictions in non-trivial ways. In this work we perform a fine-grained analysis of robustness with respect to these factors of variation using SI-Score, a synthetic dataset. In particular, we investigate ResNets, Vision Transformers and CLIP, and identify interesting qualitative differences between these., Comment: 4 pages (10 pages including references and appendix), 10 figures. Accepted at the ICLR 2021 RobustML Workshop. arXiv admin note: text overlap with arXiv:2007.08558
Published: 2021

33. Comparing Transfer and Meta Learning Approaches on a Unified Few-Shot Classification Benchmark

Author: Dumoulin, Vincent, Houlsby, Neil, Evci, Utku, Zhai, Xiaohua, Goroshin, Ross, Gelly, Sylvain, and Larochelle, Hugo
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Meta and transfer learning are two successful families of approaches to few-shot learning. Despite highly related goals, state-of-the-art advances in each family are measured largely in isolation of each other. As a result of diverging evaluation norms, a direct or thorough comparison of different approaches is challenging. To bridge this gap, we perform a cross-family study of the best transfer and meta learners on both a large-scale meta-learning benchmark (Meta-Dataset, MD), and a transfer learning benchmark (Visual Task Adaptation Benchmark, VTAB). We find that, on average, large-scale transfer methods (Big Transfer, BiT) outperform competing approaches on MD, even when trained only on ImageNet. In contrast, meta-learning approaches struggle to compete on VTAB when trained and validated on MD. However, BiT is not without limitations, and pushing for scale does not improve performance on highly out-of-distribution MD tasks. In performing this study, we reveal a number of discrepancies in evaluation norms and study some of these in light of the performance gap. We hope that this work facilitates sharing of insights from each community, and accelerates progress on few-shot learning.
Published: 2021

34. Underspecification Presents Challenges for Credibility in Modern Machine Learning

Author: D'Amour, Alexander, Heller, Katherine, Moldovan, Dan, Adlam, Ben, Alipanahi, Babak, Beutel, Alex, Chen, Christina, Deaton, Jonathan, Eisenstein, Jacob, Hoffman, Matthew D., Hormozdiari, Farhad, Houlsby, Neil, Hou, Shaobo, Jerfel, Ghassen, Karthikesalingam, Alan, Lucic, Mario, Ma, Yian, McLean, Cory, Mincu, Diana, Mitani, Akinori, Montanari, Andrea, Nado, Zachary, Natarajan, Vivek, Nielson, Christopher, Osborne, Thomas F., Raman, Rajiv, Ramasamy, Kim, Sayres, Rory, Schrouff, Jessica, Seneviratne, Martin, Sequeira, Shannon, Suresh, Harini, Veitch, Victor, Vladymyrov, Max, Wang, Xuezhi, Webster, Kellie, Yadlowsky, Steve, Yun, Taedong, Zhai, Xiaohua, and Sculley, D.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain., Comment: Updates: Updated statistical analysis in Section 6; Additional citations
Published: 2020

35. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Author: Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, Uszkoreit, Jakob, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train., Comment: Fine-tuning code and pre-trained models are available at https://github.com/google-research/vision_transformer. ICLR camera-ready version with 2 small modifications: 1) Added a discussion of CLS vs GAP classifier in the appendix, 2) Fixed an error in exaFLOPs computation in Figure 5 and Table 6 (relative performance of models is basically not affected)
Published: 2020

36. Training general representations for remote sensing using in-domain knowledge

Author: Neumann, Maxim, Pinto, André Susano, Zhai, Xiaohua, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Automatically finding good and general remote sensing representations makes it possible to perform transfer learning on a wide range of applications, improving the accuracy and reducing the required number of training samples. This paper investigates the development of generic remote sensing representations, and explores which characteristics are important for a dataset to be a good source for representation learning. For this analysis, five diverse remote sensing datasets are selected and used for both disjoint upstream representation learning and downstream model training and evaluation. A common evaluation protocol is used to establish baselines for these datasets that achieve state-of-the-art performance. As the results indicate, especially with a low number of available training samples, a significant performance enhancement can be observed when additionally including in-domain data in comparison to training models from scratch or fine-tuning only on ImageNet (up to 11% and 40%, respectively, at 100 training samples). All datasets and pretrained representation models are published online., Comment: Accepted at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2020. arXiv admin note: substantial text overlap with arXiv:1911.06721
Published: 2020

37. On Robustness and Transferability of Convolutional Neural Networks

Author: Djolonga, Josip, Yung, Jessica, Tschannen, Michael, Romijnders, Rob, Beyer, Lucas, Kolesnikov, Alexander, Puigcerver, Joan, Minderer, Matthias, D'Amour, Alexander, Moldovan, Dan, Gelly, Sylvain, Houlsby, Neil, Zhai, Xiaohua, and Lucic, Mario
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Modern deep convolutional networks (CNNs) are often criticized for not generalizing under distributional shifts. However, several recent breakthroughs in transfer learning suggest that these networks can cope with severe distribution shifts and successfully adapt to new tasks from a few training examples. In this work we study the interplay between out-of-distribution and transfer performance of modern image classification CNNs for the first time and investigate the impact of the pre-training data size, the model scale, and the data preprocessing pipeline. We find that increasing both the training set and model sizes significantly improves the distributional shift robustness. Furthermore, we show that, perhaps surprisingly, simple changes in the preprocessing such as modifying the image resolution can significantly mitigate robustness issues in some cases. Finally, we outline the shortcomings of existing robustness evaluation datasets and introduce a synthetic dataset SI-Score we use for a systematic analysis across factors of variation common in visual data such as object size and position., Comment: Accepted at CVPR 2021
Published: 2020

38. Are we done with ImageNet?

Author: Beyer, Lucas, Hénaff, Olivier J., Kolesnikov, Alexander, Zhai, Xiaohua, and Oord, Aäron van den
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Yes, and no. We ask whether recent progress on the ImageNet classification benchmark continues to represent meaningful generalization, or whether the community has started to overfit to the idiosyncrasies of its labeling procedure. We therefore develop a significantly more robust procedure for collecting human annotations of the ImageNet validation set. Using these new labels, we reassess the accuracy of recently proposed ImageNet classifiers, and find their gains to be substantially smaller than those reported on the original labels. Furthermore, we find the original ImageNet labels to no longer be the best predictors of this independently-collected set, indicating that their usefulness in evaluating vision models may be nearing an end. Nevertheless, we find our annotation procedure to have largely remedied the errors in the original labels, reinforcing ImageNet as a powerful benchmark for future research in visual recognition., Comment: All five authors contributed equally. New labels at https://github.com/google-research/reassessed-imagenet
Published: 2020

39. Big Transfer (BiT): General Visual Representation Learning

Author: Kolesnikov, Alexander, Beyer, Lucas, Zhai, Xiaohua, Puigcerver, Joan, Yung, Jessica, Gelly, Sylvain, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes -- from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance., Comment: The first three authors contributed equally. Results on ObjectNet are reported in v3
Published: 2019

40. Self-Supervised Learning of Video-Induced Visual Invariances

Author: Tschannen, Michael, Djolonga, Josip, Ritter, Marvin, Mahendran, Aravindh, Zhai, Xiaohua, Houlsby, Neil, Gelly, Sylvain, and Lucic, Mario
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 by 0.8 points with 10x fewer labeled images, as well as the previous best supervised model by 3.7 points using the full ImageNet data set., Comment: CVPR 2020
Published: 2019

41. In-domain representation learning for remote sensing

Author: Neumann, Maxim, Pinto, Andre Susano, Zhai, Xiaohua, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Given the importance of remote sensing, surprisingly little attention has been paid to it by the representation learning community. To address this and to establish baselines and a common evaluation protocol in this domain, we provide simplified access to 5 diverse remote sensing datasets in a standardized form. Specifically, we investigate in-domain representation learning to develop generic remote sensing representations and explore which characteristics are important for a dataset to be a good source for remote sensing representation learning. The established baselines achieve state-of-the-art performance on these datasets.
Published: 2019

42. A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Author: Zhai, Xiaohua, Puigcerver, Joan, Kolesnikov, Alexander, Ruyssen, Pierre, Riquelme, Carlos, Lucic, Mario, Djolonga, Josip, Pinto, Andre Susano, Neumann, Maxim, Dosovitskiy, Alexey, Beyer, Lucas, Bachem, Olivier, Tschannen, Michael, Michalski, Marcin, Bousquet, Olivier, Gelly, Sylvain, and Houlsby, Neil
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?
Published: 2019

43. S4L: Self-Supervised Semi-Supervised Learning

Author: Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This work tackles the problem of semi-supervised learning of image classifiers. Our main insight is that the field of semi-supervised learning can benefit from the quickly advancing field of self-supervised visual representation learning. Unifying these two approaches, we propose the framework of self-supervised semi-supervised learning and use it to derive two novel semi-supervised image classification methods. We demonstrate the effectiveness of these methods in comparison to both carefully tuned baselines, and existing semi-supervised learning methods. We then show that our approach and existing semi-supervised methods can be jointly trained, yielding a new state-of-the-art result on semi-supervised ILSVRC-2012 with 10% of labels., Comment: All four authors contributed equally
Published: 2019

44. High-Fidelity Image Generation With Fewer Labels

Author: Lucic, Mario, Tschannen, Michael, Ritter, Marvin, Zhai, Xiaohua, Bachem, Olivier, and Gelly, Sylvain
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on both unsupervised ImageNet synthesis, as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels., Comment: Mario Lucic, Michael Tschannen, and Marvin Ritter contributed equally to this work. ICML 2019 camera-ready version. Code available at https://github.com/google/compare_gan
Published: 2019

45. Revisiting Self-Supervised Visual Representation Learning

Author: Kolesnikov, Alexander, Zhai, Xiaohua, and Beyer, Lucas
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unsupervised visual representation learning remains a largely unsolved problem in computer vision research. Among a big body of recently proposed approaches for unsupervised learning of visual representations, a class of self-supervised techniques achieves superior performance on many challenging benchmarks. A large number of the pretext tasks for self-supervised learning have been studied, but other important aspects, such as the choice of convolutional neural networks (CNN), have not received equal attention. Therefore, we revisit numerous previously proposed self-supervised models, conduct a thorough large scale study and, as a result, uncover multiple crucial insights. We challenge a number of common practices in self-supervised visual representation learning and observe that standard recipes for CNN design do not always translate to self-supervised representation learning. As part of our study, we drastically boost the performance of previously proposed techniques and outperform previously published state-of-the-art results by a large margin., Comment: All three authors contributed equally. Code is available at https://github.com/google/revisiting-self-supervised
Published: 2019

46. A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation

Author: Chen, Wuyang, Du, Xianzhi, Yang, Fan, Beyer, Lucas, Zhai, Xiaohua, Lin, Tsung-Yi, Chen, Huizhong, Li, Jing, Song, Xiaodan, Wang, Zhangyang, and Zhou, Denny
Editor: Avidan, Shai, Brostow, Gabriel, Cissé, Moustapha, Farinella, Giovanni Maria, and Hassner, Tal
Series Editors: Goos, Gerhard and Hartmanis, Juris (Founding Editors); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2022

47. Simple Open-Vocabulary Object Detection

Author: Minderer, Matthias, Gritsenko, Alexey, Stone, Austin, Neumann, Maxim, Weissenborn, Dirk, Dosovitskiy, Alexey, Mahendran, Aravindh, Arnab, Anurag, Dehghani, Mostafa, Shen, Zhuoran, Wang, Xiao, Zhai, Xiaohua, Kipf, Thomas, and Houlsby, Neil
Editor: Avidan, Shai, Brostow, Gabriel, Cissé, Moustapha, Farinella, Giovanni Maria, and Hassner, Tal
Series Editors: Goos, Gerhard and Hartmanis, Juris (Founding Editors); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2022

48. Self-Supervised GANs via Auxiliary Rotation Loss

Author: Chen, Ting, Zhai, Xiaohua, Ritter, Marvin, Lucic, Mario, and Houlsby, Neil
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Conditional GANs are at the forefront of natural image synthesis. The main drawback of such models is the necessity for labeled data. In this work we exploit two popular unsupervised learning techniques, adversarial training and self-supervision, and take a step towards bridging the gap between conditional and unconditional GANs. In particular, we allow the networks to collaborate on the task of representation learning, while being adversarial with respect to the classic GAN game. The role of self-supervision is to encourage the discriminator to learn meaningful feature representations which are not forgotten during training. We test empirically both the quality of the learned image representations, and the quality of the synthesized images. Under the same conditions, the self-supervised GAN attains a similar performance to state-of-the-art conditional counterparts. Finally, we show that this approach to fully unsupervised learning can be scaled to attain an FID of 23.4 on unconditional ImageNet generation.
Published: 2018

49. Self-Supervised GAN to Counter Forgetting

Author: Chen, Ting, Zhai, Xiaohua, and Houlsby, Neil
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: GANs involve training two networks in an adversarial game, where each network's task depends on its adversary. Recently, several works have framed GAN training as an online or continual learning problem. We focus on the discriminator, which must perform classification under an (adversarially) shifting data distribution. When trained on sequential tasks, neural networks exhibit forgetting. For GANs, discriminator forgetting leads to training instability. To counter forgetting, we encourage the discriminator to maintain useful representations by adding self-supervision. Conditional GANs have a similar effect using labels. However, our self-supervised GAN does not require labels, and closes the performance gap between conditional and unconditional models. We show that, in doing so, the self-supervised discriminator learns better representations than regular GANs. Comment: NeurIPS'18 Continual Learning workshop
Published: 2018

50. A Large-Scale Study on Regularization and Normalization in GANs

Author: Kurach, Karol, Lucic, Mario, Zhai, Xiaohua, Michalski, Marcin, and Gelly, Sylvain
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Generative adversarial networks (GANs) are a class of deep generative models which aim to learn a target distribution in an unsupervised fashion. While they have been successfully applied to many problems, training a GAN is a notoriously challenging task and requires a significant amount of hyperparameter tuning, neural architecture engineering, and a non-trivial number of "tricks". The success in many practical applications coupled with the lack of a measure to quantify the failure modes of GANs resulted in a plethora of proposed losses, regularization and normalization schemes, as well as neural architectures. In this work we take a sober view of the current state of GANs from a practical perspective. We discuss and evaluate common pitfalls and reproducibility issues, open-source our code on GitHub, and provide pre-trained models on TensorFlow Hub. Comment: Revision accepted to ICML'19: More focus on regularization and normalization aspects. Added recent references and promising future directions
Published: 2018
