Author: "Piergiovanni, AJ" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Piergiovanni, AJ"' showing total 130 results

1. Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Author: Piergiovanni, AJ, Kim, Dahun, Ryoo, Michael S., Noble, Isaac, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputs localized descriptions, and efficiently leverages the context from the previous video segments. This allows the model to output frequent, detailed captions to more comprehensively describe the video, according to its actual local content, rather than mimic the training data. Second, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute. The annotations produced are much more comprehensive and frequent, and can further be utilized in automatic video tagging and in large-scale video data harvesting.
Published: 2024

2. Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

Author: Piergiovanni, AJ, Noble, Isaac, Kim, Dahun, Ryoo, Michael S., Gomes, Victor, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text, which comes as a global context, e.g., a title, or a description. Furthermore, video and audio inputs are of much larger volumes, and grow as the video length increases, which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling, dividing it into separate, focused autoregressive models, processing the inputs according to the characteristics of the modalities. We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long sequences of the video-audio inputs, we propose to further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end, we propose a Combiner mechanism, which models the audio-video information jointly within a timeframe. The Combiner learns to extract audio and video features from raw spatio-temporal signals, and then learns to fuse these features producing compact but expressive representations per snippet. Our approach achieves the state-of-the-art on well-established multimodal benchmarks, outperforming much larger models. It effectively addresses the high computational demand of media inputs by learning compact representations, controlling the sequence length of the audio-video feature representations, and modeling their dependencies in time.
Comment: CVPR 2024
Published: 2023

3. Diversifying Joint Vision-Language Tokenization Learning

Author: Pahuja, Vardaan, Piergiovanni, AJ, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering. In this work, we find that the representations must not only jointly capture features from both modalities but should also be diverse for better generalization performance. To this end, we propose joint vision-language representation learning by diversifying the tokenization learning process, enabling tokens that are sufficiently disentangled from each other to be learned from both modalities. We observe that our approach outperforms the baseline models in a majority of settings and is competitive with state-of-the-art methods.
Comment: Accepted to Transformers for Vision (T4V) workshop, CVPR 2023; 7 pages, 5 figures
Published: 2023

4. Joint Adaptive Representations for Image-Language Learning

Author: Piergiovanni, AJ and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image-language learning has made unprecedented progress in visual understanding. These developments have come at high costs, as contemporary vision-language models require large model scales and amounts of data. We here propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets. Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features. This results in a more effective image-language learning, greatly lowering the FLOPs by combining and reducing the number of tokens for both text and images, e.g. a 33% reduction in FLOPs is achieved, compared to baseline fusion techniques used by popular image-language models, while improving performance. This also allows the model to scale without a large increase in FLOPs or memory. In addition, we propose adaptive pre-training data sampling which improves the data efficiency. The proposed approach achieves competitive performance compared to much larger models, and does so with significantly less data and FLOPs. With only 40M training examples and with 39 GFLOPs our lightweight model outperforms many times larger state-of-the-art models of 2-20x more FLOPs and using bigger datasets, some of which have close to 1B training examples.
Comment: T4V Workshop
Published: 2023

5. PaLI-X: On Scaling up a Multilingual Vision and Language Model

Author: Chen, Xi, Djolonga, Josip, Padlewski, Piotr, Mustafa, Basil, Changpinyo, Soravit, Wu, Jialin, Ruiz, Carlos Riquelme, Goodman, Sebastian, Wang, Xiao, Tay, Yi, Shakeri, Siamak, Dehghani, Mostafa, Salz, Daniel, Lucic, Mario, Tschannen, Michael, Nagrani, Arsha, Hu, Hexiang, Joshi, Mandar, Pang, Bo, Montgomery, Ceslee, Pietrzyk, Paulina, Ritter, Marvin, Piergiovanni, AJ, Minderer, Matthias, Pavetic, Filip, Waters, Austin, Li, Gang, Alabdulmohsin, Ibrahim, Beyer, Lucas, Amelot, Julien, Lee, Kenton, Steiner, Andreas Peter, Li, Yang, Keysers, Daniel, Arnab, Anurag, Xu, Yuanzhong, Rong, Keran, Kolesnikov, Alexander, Seyedhosseini, Mojtaba, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, and Soricut, Radu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. PaLI-X advances the state-of-the-art on most vision-and-language benchmarks considered (25+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
Published: 2023

6. MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

Author: Kuo, Weicheng, Piergiovanni, AJ, Kim, Dahun, Luo, Xiyang, Caine, Ben, Li, Wei, Ogale, Abhijit, Zhou, Luowei, Dai, Andrew, Chen, Zhifeng, Cui, Claire, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.
Comment: Published in Transactions on Machine Learning Research (https://jmlr.org/tmlr/). 18 pages, 4 figures
Published: 2023

7. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Author: Piergiovanni, AJ, Kuo, Weicheng, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.
Published: 2022

8. Compound Tokens: Channel Fusion for Vision-Language Representation Learning

Author: Aladago, Maxwell Mbabilla and Piergiovanni, AJ
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then chain the vision tokens and the queried text tokens along the channel dimension. We call the resulting representations compound tokens. A second group of compound tokens is generated using an analogous process where the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.
Published: 2022

9. F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

Author: Kuo, Weicheng, Cui, Yin, Gu, Xiuye, Piergiovanni, AJ, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and cross-dataset transfer detection, in addition to significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home.
Comment: Accepted to ICLR 2023 (https://iclr.cc/Conferences/2023). 20 pages, 7 figures
Published: 2022

10. PaLI: A Jointly-Scaled Multilingual Language-Image Model

Author: Chen, Xi, Wang, Xiao, Changpinyo, Soravit, Piergiovanni, AJ, Padlewski, Piotr, Salz, Daniel, Goodman, Sebastian, Grycner, Adam, Mustafa, Basil, Beyer, Lucas, Kolesnikov, Alexander, Puigcerver, Joan, Ding, Nan, Rong, Keran, Akbari, Hassan, Mishra, Gaurav, Xue, Linting, Thapliyal, Ashish, Bradbury, James, Kuo, Weicheng, Seyedhosseini, Mojtaba, Jia, Chao, Ayan, Burcu Karagol, Riquelme, Carlos, Steiner, Andreas, Angelova, Anelia, Zhai, Xiaohua, Houlsby, Neil, and Soricut, Radu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
Comment: ICLR 2023 (Notable-top-5%)
Published: 2022

11. Pre-training image-language transformers for open-vocabulary tasks

Author: Piergiovanni, AJ, Kuo, Weicheng, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks. We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model. We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
Published: 2022

12. Video Question Answering with Iterative Video-Text Co-Tokenization

Author: Piergiovanni, AJ, Morton, Kairo, Kuo, Weicheng, Ryoo, Michael S., and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA, outperforming the previous state-of-the-art by large margins. Simultaneously, our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
Comment: ECCV 2022
Published: 2022

13. Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

Author: Piergiovanni, AJ, Li, Wei, Kuo, Weicheng, Saffar, Mohammad, Bertsch, Fred, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.
Published: 2022

14. FindIt: Generalized Localization with Natural Language Queries

Author: Kuo, Weicheng, Bertsch, Fred, Li, Wei, Piergiovanni, AJ, Saffar, Mohammad, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across the tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without a need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on both referring expression and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared to strong single-task baselines. All of these are accomplished by a single, unified and efficient model. The code will be released.
Comment: Accepted to ECCV 2022 (European Conference on Computer Vision)
Published: 2022

15. 4D-Net for Learned Multi-Modal Alignment

Author: Piergiovanni, AJ, Casser, Vincent, Ryoo, Michael S., and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time. We are able to incorporate the 4D information by performing a novel dynamic connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints. Our approach outperforms the state-of-the-art and strong baselines on the Waymo Open Dataset. 4D-Net is better able to use motion cues and dense image information to detect distant objects more successfully.
Comment: ICCV 2021
Published: 2021

16. Unsupervised Discovery of Actions in Instructional Videos

Author: Piergiovanni, AJ, Angelova, Anelia, Ryoo, Michael S., and Essa, Irfan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos. Instructional videos contain complex activities and are a rich source of information for intelligent agents, such as autonomous robots or virtual assistants, which can, for example, automatically 'read' the steps from an instructional video and execute them. However, videos are rarely annotated with atomic activities, their boundaries or duration. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos. We propose a sequential stochastic autoregressive model for temporal segmentation of videos, which learns to represent and discover the sequential relationship between different atomic actions of the task, and which provides automatic and unsupervised self-labeling for videos. Our approach outperforms the state-of-the-art unsupervised methods by large margins. We will open source the code.
Comment: Full paper
Published: 2021

17. TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Author: Ryoo, Michael S., Piergiovanni, AJ, Arnab, Anurag, Dehghani, Mostafa, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, due to our tokens being adaptive, we accomplish competitive results at significantly reduced compute amount. We obtain comparable results to the state of the art on ImageNet while being computationally more efficient. We also confirm the effectiveness of the approach on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD. The code is available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner
Comment: This is the full version of the paper, extending its conference paper at NeurIPS 2021. Version 1.1 of the code is released
Published: 2021

18. Unsupervised Action Segmentation for Instructional Videos

Author: Piergiovanni, AJ, Angelova, Anelia, Ryoo, Michael S., and Essa, Irfan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions. We present an unsupervised approach to learn atomic actions of structured human tasks from a variety of instructional videos based on a sequential stochastic autoregressive model for temporal segmentation of videos. This model learns to represent and discover the sequential relationship between different atomic actions of the task, and provides automatic and unsupervised self-labeling.
Comment: 4 page abstract for LUV workshop
Published: 2021

19. Adaptive Intermediate Representations for Video Understanding

Author: Kangaspunta, Juhana, Piergiovanni, AJ, Jonschkowski, Rico, Ryoo, Michael, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow. In this work, we introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding and use it in a way that requires no additional labeling. Second, we propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task and allows the adaptation of the representations to the end goal. Despite the use of intermediate representations within the network, during inference, no additional data beyond RGB sequences is needed, enabling efficient recognition with a single network. Finally, we present a way to find the optimal learning configuration by searching the best loss weighting via evolution. We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
Published: 2021

20. Recognizing Actions in Videos from Unseen Viewpoints

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Standard methods for video recognition use large CNNs designed to capture spatio-temporal data. However, training these models requires a large amount of labeled training data, containing a wide variety of actions, scenes, settings and camera viewpoints. In this paper, we show that current convolutional neural network models are unable to recognize actions from camera viewpoints not present in their training data (i.e., unseen view action recognition). To address this, we develop approaches based on 3D representations and introduce a new geometric convolutional layer that can learn viewpoint invariant representations. Further, we introduce a new, challenging dataset for unseen view recognition and show the approaches' ability to learn viewpoint invariant representations.
Published: 2021

21. AssembleNet++: Assembling Modality Representations via Attention Connections

Author: Ryoo, Michael S., Piergiovanni, AJ, Kangaspunta, Juhana, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform the previous work on standard public activity recognition datasets with continuous videos, establishing new state-of-the-art. We also confirm that our findings of having neural connections from the object modality and the use of peer-attention are generally applicable for different existing architectures, improving their performances. We name our model explicitly as AssembleNet++. The code will be available at: https://sites.google.com/corp/view/assemblenet/
Comment: ECCV 2020 camera-ready version
Published: 2020

22. Adversarial Generative Grammars for Human Activity Prediction

Author: Piergiovanni, AJ, Angelova, Anelia, Toshev, Alexander, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further in the future, than prior work.
Comment: ECCV 2020 (Oral)
Published: 2020

23. AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

Author: Wang, Xiaofang, Xiong, Xuehan, Neumann, Maxim, Piergiovanni, AJ, Ryoo, Michael S., Angelova, Anelia, Kitani, Kris M., and Hua, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Convolutional operations have two limitations: (1) they do not explicitly model where to focus as the same filter is applied to all the positions, and (2) they are unsuitable for modeling long-range dependencies as they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined to use attention, especially when applying attention to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets. The discovered attention cells outperform non-local blocks on both datasets, and demonstrate strong generalization across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets.
Comment: ECCV 2020
Published: 2020

24. AViD Dataset: Anonymized Videos from Diverse Countries

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a new public video dataset for action recognition: Anonymized Videos from Diverse countries (AViD). Unlike existing public video datasets, AViD is a collection of action videos from many different countries. The motivation is to create a public dataset that would benefit training and pretraining of action recognition models for everybody, rather than making it useful for limited countries. Further, all the face identities in the AViD videos are properly anonymized to protect their privacy. It also is a static dataset where each video is licensed with the creative commons license. We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries. We experimentally illustrate that models trained with such biased datasets do not transfer perfectly to action videos from the other countries, and show that AViD addresses this problem. We also confirm that the new AViD dataset could serve as a good dataset for pretraining the models, performing comparably or better than prior datasets.
Comment: https://github.com/piergiaj/AViD
Published: 2020

25. Evolving Losses for Unsupervised Video Representation Learning

Author: Piergiovanni, AJ, Angelova, Anelia, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present a new method to learn video representations from large-scale unlabeled video data. Ideally, this representation will be generic and transferable, directly usable for new tasks such as action recognition and zero or few-shot learning. We formulate unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are shared across different modalities via distillation. Further, we introduce the concept of loss function evolution by using an evolutionary search algorithm to automatically find optimal combination of loss functions capturing many (self-supervised) tasks and modalities. Thirdly, we propose an unsupervised representation evaluation metric using distribution matching to a large unlabeled dataset as a prior constraint, based on Zipf's law. This unsupervised constraint, which is not guided by any labeling, produces similar results to weakly-supervised, task-specific ones. The proposed unsupervised representation learning results in a single RGB network and outperforms previous methods. Notably, it is also more effective than several label-based methods (e.g., ImageNet), with the exception of large, fully labeled video datasets., Comment: arXiv admin note: text overlap with arXiv:1906.03248
Published: 2020

26. Tiny Video Networks

Author: Piergiovanni, AJ, Angelova, Anelia, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real-world. Yet, solutions so far have been computationally intensive, with the fastest algorithms running for more than half a second per video snippet on powerful GPUs. We propose a novel idea on video architecture learning - Tiny Video Networks - which automatically designs highly efficient models for video understanding. The tiny video models run with competitive performance for as low as 37 milliseconds per video on a CPU and 10 milliseconds on a standard GPU.
Published: 2019

27. Model-based Behavioral Cloning with Future Image Similarity Learning

Author: Wu, Alan, Piergiovanni, AJ, and Ryoo, Michael S.
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a visual imitation learning framework that enables learning of robot action policies solely based on expert samples without any robot trials. Robot exploration and on-policy trials in a real-world environment could often be expensive/dangerous. We present a new approach to address this problem by learning a future scene prediction model solely on a collection of expert trajectories consisting of unlabeled example videos and actions, and by enabling generalized action cloning using future image similarity. The robot learns to visually predict the consequences of taking an action, and obtains the policy by evaluating how similar the predicted future image is to an expert image. We develop a stochastic action-conditioned convolutional autoencoder, and present how we take advantage of future images for robot learning. We conduct experiments in simulated and real-life environments using a ground mobility robot with and without obstacles, and compare our models to multiple baseline methods.
Published: 2019

28. Evolving Losses for Unlabeled Video Representation Learning

Author: Piergiovanni, AJ, Angelova, Anelia, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a new method to learn video representations from unlabeled data. Given large-scale unlabeled video data, the objective is to benefit from such data by learning a generic and transferable representation space that can be directly used for a new task such as zero/few-shot learning. We formulate our unsupervised representation learning as a multi-modal, multi-task learning problem, where the representations are also shared across different modalities via distillation. Further, we also introduce the concept of finding a better loss function to train such multi-task multi-modal representation space using an evolutionary algorithm; our method automatically searches over different combinations of loss functions capturing multiple (self-supervised) tasks and modalities. Our formulation allows for the distillation of audio, optical flow and temporal information into a single, RGB-based convolutional neural network. We also compare the effects of using additional unlabeled video data and evaluate our representation learning on standard public video datasets., Comment: Non-archival abstract for CVPR Workshop on Learning from Unlabeled Videos
Published: 2019

29. AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

Author: Ryoo, Michael S., Piergiovanni, AJ, Tan, Mingxing, and Angelova, Anelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Learning to represent videos is a very challenging task both algorithmically and computationally. Standard video CNN architectures have been designed by directly extending architectures devised for image understanding to include the time dimension, using modules such as 3D convolutions, or by using two-stream design to capture both appearance and motion in videos. We interpret a video CNN as a collection of multi-stream convolutional blocks connected to each other, and propose the approach of automatically finding neural architectures with better connectivity and spatio-temporal interactions for video understanding. This is done by evolving a population of overly-connected architectures guided by connection weight learning. Architectures combining representations that abstract different input types (i.e., RGB and optical flow) at multiple temporal resolutions are searched for, allowing different types or sources of information to interact with each other. Our method, referred to as AssembleNet, outperforms prior approaches on public video datasets, in some cases by a great margin. We obtain 58.6% mAP on Charades and 34.27% accuracy on Moments-in-Time.
Published: 2019

30. Early Detection of Injuries in MLB Pitchers from Video

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Injuries are a major cost in sports. Teams spend millions of dollars every year on players who are hurt and unable to play, resulting in lost games, decreased fan interest and additional wages for replacement players. Modern convolutional neural networks have been successfully applied to many video recognition tasks. In this paper, we introduce the problem of injury detection/prediction in MLB pitchers and experimentally evaluate the ability of such convolutional models to detect and predict injuries in pitches only from video data. We conduct experiments on a large dataset of TV broadcast MLB videos of 20 different pitchers who were injured during the 2017 season. We experimentally evaluate the model's performance on each individual pitcher, how well it generalizes to new pitchers, how it performs for various injuries, and how early it can predict or detect an injury., Comment: CVPR Workshop on Computer Vision in Sports 2019
Published: 2019

31. Differentiable Grammars for Videos

Author: Piergiovanni, AJ, Angelova, Anelia, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos. Learning latent terminals, non-terminals, and production rules directly from continuous data allows the construction of a generative model capturing sequential structures with multiple possibilities. Our model is fully differentiable, and provides easily interpretable results which are important in order to understand the learned structures. It outperforms the state-of-the-art on several challenging datasets and is more accurate for forecasting future activities in videos. We plan to open-source the code. https://sites.google.com/view/differentiable-grammars
Published: 2019

32. Evolving Space-Time Neural Architectures for Videos

Author: Piergiovanni, AJ, Angelova, Anelia, Toshev, Alexander, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos. Previous work, taking advantage of 3D convolutions, obtained promising results by manually designing video CNN architectures. We here develop a novel evolutionary search algorithm that automatically explores models with different types and combinations of layers to jointly learn interactions between spatial and temporal aspects of video representations. We demonstrate the generality of this algorithm by applying it to two meta-architectures, obtaining new architectures superior to manually designed architectures. Further, we propose a new component, the iTGM layer, which more efficiently utilizes its parameters to allow learning of space-time interactions over longer time horizons. The iTGM layer is often preferred by the evolutionary algorithm and allows building cost-efficient networks. The proposed approach discovers new and diverse video architectures that were previously unknown. More importantly they are both more accurate and faster than prior models, and outperform the state-of-the-art results on multiple datasets we test, including HMDB, Kinetics, and Moments in Time. We will open source the code and models, to encourage future model development.
Published: 2018

33. Representation Flow for Action Recognition

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a convolutional layer inspired by optical flow algorithms to learn motion representations. Our representation flow layer is a fully-differentiable layer designed to capture the `flow' of any representation channel within a convolutional neural network for action recognition. Its parameters for iterative flow optimization are learned in an end-to-end fashion together with the other CNN model parameters, maximizing the action recognition performance. Furthermore, we newly introduce the concept of learning `flow of flow' representations by stacking multiple representation flow layers. We conducted extensive experimental evaluations, confirming its advantages over previous recognition models using traditional optical flows in both computational speed and performance. Code/models available here: https://piergiaj.github.io/rep-flow-site/, Comment: CVPR 2019
Published: 2018

34. Learning Multimodal Representations for Unseen Activities

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effect of placing various constraints on the embedding space using paired text and video data. We also propose a method to improve the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. By using unpaired text data, we show the ability to learn a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state-of-the-arts.
Published: 2018

35. Learning Real-World Robot Policies by Dreaming

Author: Piergiovanni, AJ, Wu, Alan, and Ryoo, Michael S.
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: Learning to control robots directly based on images is a primary challenge in robotics. However, many existing reinforcement learning approaches require iteratively obtaining millions of robot samples to learn a policy, which can take significant time. In this paper, we focus on learning a realistic world model capturing the dynamics of scene changes conditioned on robot actions. Our dreaming model can emulate samples equivalent to a sequence of images from the actual environment, technically by learning an action-conditioned future representation/scene regressor. This allows the agent to learn action policies (i.e., visuomotor policies) by interacting with the dreaming model rather than the real-world. We experimentally confirm that our dreaming model enables robot learning of policies that transfer to the real-world.
Published: 2018

36. Fine-grained Activity Recognition in Baseball Videos

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection. The dataset contains two settings: segmented video classification as well as activity detection in continuous videos. We experimentally compare various recognition approaches capturing temporal structure in activity videos, by classifying segmented videos and extending those approaches to continuous videos. We also compare models on the extremely difficult task of predicting pitch speed and pitch type from broadcast baseball videos. We find that learning temporal structure is valuable for fine-grained activity recognition., Comment: CVPR Workshop on Computer Vision in Sports
Published: 2018

37. Temporal Gaussian Mixture Layer for Videos

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos. The TGM layer is a temporal convolutional layer governed by a much smaller set of parameters (e.g., location/variance of Gaussians) that are fully differentiable. We present our fully convolutional video models with multiple TGM layers for activity detection. The extensive experiments on multiple datasets, including Charades and MultiTHUMOS, confirm the effectiveness of TGM layers, significantly outperforming the state-of-the-arts., Comment: ICML 2019
Published: 2018

38. Learning Latent Super-Events to Detect Multiple Activities in Videos

Author: Piergiovanni, AJ and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos. We define a super-event as a set of multiple events occurring together in videos with a particular temporal organization; it is the opposite concept of sub-events. Real-world videos contain multiple activities and are rarely segmented (e.g., surveillance videos), and learning latent super-events allows the model to capture how the events are temporally related in videos. We design temporal structure filters that enable the model to focus on particular sub-intervals of the videos, and use them together with a soft attention mechanism to learn representations of latent super-events. Super-event representations are combined with per-frame or per-segment CNNs to provide frame-level annotations. Our approach is designed to be fully differentiable, enabling end-to-end learning of latent super-event representations jointly with the activity detector using them. Our experiments with multiple public video datasets confirm that the proposed concept of latent super-event learning significantly benefits activity detection, advancing the state-of-the-arts., Comment: CVPR 2018
Published: 2017

39. Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

Author: Piergiovanni, AJ, Fan, Chenyou, and Ryoo, Michael S.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we newly introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos. Many high-level activities are often composed of multiple temporal parts (e.g., sub-events) with different duration/speed, and our objective is to make the model explicitly learn such temporal structure using multiple attention filters and benefit from them. Our temporal filters are designed to be fully differentiable, allowing end-to-end training of the temporal filters together with the underlying frame-based or segment-based convolutional neural network architectures. This paper presents an approach of learning a set of optimal static temporal attention filters to be shared across different videos, and extends this approach to dynamically adjust attention filters per testing video using recurrent long short-term memory networks (LSTMs). This allows our temporal attention filters to learn latent sub-events specific to each activity. We experimentally confirm that the proposed concept of temporal attention filters benefits the activity recognition, and we visualize the learned latent sub-events.
Published: 2016

40. Video Question Answering with Iterative Video-Text Co-tokenization

Author: Piergiovanni, AJ, primary, Morton, Kairo, additional, Kuo, Weicheng, additional, Ryoo, Michael S., additional, and Angelova, Anelia, additional
Published: 2022
Full Text: View/download PDF

41. A computational framework for understanding the roles of simplicity and rational support in people's behavior explanations

Author: Jern, Alan, Derrow-Pinion, Austin, and Piergiovanni, AJ
Published: 2021
Full Text: View/download PDF

42. SLVP: Self-Supervised Language-Video Pre-Training for Referring Video Object Segmentation

Author: Mei, Jie, primary, Piergiovanni, AJ, additional, Hwang, Jenq-Neng, additional, and Li, Wei, additional
Published: 2024
Full Text: View/download PDF

43. Computational principles underlying people's behavior explanations

Author: Piergiovanni, AJ and Jern, Alan
Subjects: behavior explanation, social cognition, decision networks
Abstract: There are often multiple explanations for someone's behavior, but people generally find some behavior explanations more satisfying than others. We hypothesized that people prefer behavior explanations that are simple and rational. We present a computational account of behavior explanation that captures these two principles. Our computational account is based on decision networks. Decision networks allow us to formally capture what it means for an explanation to be simple and rational. We tested our account by asking people to rate how satisfying several behavior explanations were (Experiment 1) or to generate their own explanations (Experiment 2). We found that people's responses were well predicted by our account.
Published: 2015

44. AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

Author: Wang, Xiaofang, primary, Xiong, Xuehan, additional, Neumann, Maxim, additional, Piergiovanni, AJ, additional, Ryoo, Michael S., additional, Angelova, Anelia, additional, Kitani, Kris M., additional, and Hua, Wei, additional
Published: 2020
Full Text: View/download PDF

45. Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

Author: Piergiovanni, AJ, primary, Kuo, Weicheng, additional, and Angelova, Anelia, additional
Published: 2023
Full Text: View/download PDF

46. 4D-Net for Learned Multi-Modal Alignment

Author: Piergiovanni, AJ, primary, Casser, Vincent, additional, Ryoo, Michael S., additional, and Angelova, Anelia, additional
Published: 2021
Full Text: View/download PDF

47. Tiny Video Networks

Author: Piergiovanni, AJ, primary, Angelova, Anelia, additional, and Ryoo, Michael, additional
Published: 2021
Full Text: View/download PDF

48. Adaptive Intermediate Representations for Video Understanding

Author: Kangaspunta, Juhana, primary, Piergiovanni, AJ, additional, Jonschkowski, Rico, additional, Ryoo, Michael, additional, and Angelova, Anelia, additional
Published: 2021
Full Text: View/download PDF

49. Recognizing Actions in Videos from Unseen Viewpoints

Author: Piergiovanni, AJ, primary and Ryoo, Michael S., additional
Published: 2021
Full Text: View/download PDF

50. Evolving Losses for Unsupervised Video Representation Learning

Author: Piergiovanni, AJ, primary, Angelova, Anelia, additional, and Ryoo, Michael S., additional
Published: 2020
Full Text: View/download PDF
