Author: "Tapaswi, Makarand" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Tapaswi, Makarand"' showing total 111 results

1. No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Author: Gaur, Manu, S, Darshan Singh, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that more optimally leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% - 7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
Published: 2024

2. Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Author: Manikantan, Kawshik, Toshniwal, Shubham, Tapaswi, Makarand, and Gandhi, Vineet
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, I.2.7
Abstract: The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
Comment: 16 pages, 6 figures
Published: 2024

3. VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Author: Saravanan, Darshana, Singh, Darshan, Gupta, Varun, Khan, Zeeshan, Gandhi, Vineet, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity to a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.
Comment: 26 pages, 17 figures, 3 tables
Published: 2024

4. 'Previously on ...' From Recaps to Story Summarization

Author: Singh, Aditya Kumar, Srivastava, Dhruv, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap, a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations, and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization, our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization, including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks.
Comment: CVPR 2024; Project page: https://katha-ai.github.io/projects/recap-story-summ/
Published: 2024

5. MICap: A Unified Model for Identity-aware Movie Descriptions

Author: Raajesh, Haran, Desanur, Naveen Reddy, Khan, Zeeshan, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with someone, then fill in identities. In this work, we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.
Comment: CVPR 2024, Project Page: https://katha-ai.github.io/projects/micap/
Published: 2024

6. NurtureNet: A Multi-task Video-based Approach for Newborn Anthropometry

Author: Khandelwal, Yash, Arvind, Mayur, Kumar, Sriram, Gupta, Ashish, Danisetty, Sachin Kumar, Bagad, Piyush, Madan, Anish, Lunayach, Mayank, Annavajjala, Aditya, Maiti, Abhishek, Jain, Sansiddh, Dalmia, Aman, Deka, Namrata, White, Jerome, Doshi, Jigar, Kanazawa, Angjoo, Panicker, Rahul, Raval, Alpan, Rana, Srinivas, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Malnutrition among newborns is a top public health concern in developing countries. Identification and subsequent growth monitoring are key to successful interventions. However, this is challenging in rural communities where health systems tend to be inaccessible and under-equipped, with poor adherence to protocol. Our goal is to equip health workers and public health systems with a solution for contactless newborn anthropometry in the community. We propose NurtureNet, a multi-task model that fuses visual information (a video taken with a low-cost smartphone) with tabular inputs to regress multiple anthropometry estimates including weight, length, head circumference, and chest circumference. We show that visual proxy tasks of segmentation and keypoint prediction further improve performance. We establish the efficacy of the model through several experiments and achieve a relative error of 3.9% and mean absolute error of 114.3 g for weight estimation. Model compression to 15 MB also allows offline deployment to low-cost smartphones.
Comment: Accepted at CVPM Workshop at CVPR 2024
Published: 2024

7. Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays

Author: Aimen, Aroof, Verma, Arsh, Tapaswi, Makarand, and Krishnan, Narayanan C.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation. To address these challenges, we present an integrated framework called Generalized Cross-Domain Multi-Label Few-Shot Learning (GenCDML-FSL). The framework supports overlap in classes during training and evaluation, cross-domain transfer, adopts meta-learning to learn using few training samples, and assumes each chest X-ray image is either normal or associated with one or more abnormalities. Furthermore, we propose Generalized Episodic Training (GenET), a training strategy that equips models to operate with multiple challenges observed in the GenCDML-FSL scenario. Comparisons with well-established methods such as transfer learning, hybrid transfer learning, and multi-label meta-learning on multiple datasets show the superiority of our approach.
Comment: 17 pages
Published: 2023

8. How you feelin'? Learning Emotions and Mental States in Movie Scenes

Author: Srivastava, Dhruv, Singh, Aditya Kumar, and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset, we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels to 26. Ablation studies and comparison against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues.
Comment: CVPR 2023. Project Page: https://katha-ai.github.io/projects/emotx/
Published: 2023

9. GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

Author: Taunk, Dhaval, Khanna, Lakshya, Kandru, Pavan, Varma, Vasudeva, Sharma, Charu, and Tapaswi, Makarand
Subjects: Computer Science - Computation and Language
Abstract: Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG). A typical approach collects nodes relevant to the QA pair from a KG to form a Working Graph (WG) followed by reasoning using Graph Neural Networks (GNNs). This faces two major challenges: (i) it is difficult to capture all the information from the QA in the WG, and (ii) the WG contains some irrelevant nodes from the KG. To address these, we propose GrapeQA with two simple improvements on the WG: (i) Prominent Entities for Graph Augmentation identifies relevant text chunks from the QA pair and augments the WG with corresponding latent representations from the LM, and (ii) Context-Aware Node Pruning removes nodes that are less relevant to the QA pair. We evaluate our results on OpenBookQA, CommonsenseQA and MedQA-USMLE and see that GrapeQA shows consistent improvements over its LM + KG predecessor (QA-GNN in particular) and large improvements on OpenBookQA.
Published: 2023

10. Test of Time: Instilling Video-Language Models with a Sense of Time

Author: Bagad, Piyush, Tapaswi, Makarand, and Snoek, Cees G. M.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that seven existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require varying degrees of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.
Comment: Accepted for publication at CVPR 2023. Project page: https://bpiyush.github.io/testoftime-website/index.html
Published: 2023

11. Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations

Author: Shriram, Jaidev, Tapaswi, Makarand, and Alluri, Vinoo
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey. Listening to complementary music has the potential to amplify the reading experience, especially when the music is stylistically cohesive and emotionally relevant. In this paper, we propose the first fully automatic method to build a dense soundtrack for books, which can play high-quality instrumental music for the entirety of the reading duration. Our work employs a unique text processing and music weaving pipeline that determines the context and emotional composition of scenes in a chapter. This allows our method to identify and play relevant excerpts from the soundtrack of the book's movie adaptation. By relying on the movie composer's craftsmanship, our book soundtracks include expert-made motifs and other scene-specific musical characteristics. We validate the design decisions of our approach through a perceptual study. Our readers note that the book soundtrack greatly enhanced their reading experience, due to high immersiveness granted via uninterrupted and style-consistent music, and a heightened emotional state attained via high precision emotion and scene context recognition.
Comment: Accepted to ISMIR 2022. Project page: https://auto-book-soundtrack.github.io/
Published: 2022

12. Can we Adopt Self-supervised Pretraining for Chest X-Rays?

Author: Verma, Arsh and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Chest radiograph (or Chest X-Ray, CXR) is a popular medical imaging modality that is used by radiologists across the world to diagnose heart or lung conditions. Over the last decade, Convolutional Neural Networks (CNNs) have seen success in identifying pathologies in CXR images. Typically, these CNNs are pretrained on the standard ImageNet classification task, but this assumes availability of large-scale annotated datasets. In this work, we analyze the utility of pretraining on unlabeled ImageNet or Chest X-Ray (CXR) datasets using various algorithms and in multiple settings. Some findings of our work include: (i) supervised training with labeled ImageNet learns strong representations that are hard to beat; (ii) self-supervised pretraining on ImageNet (~1M images) shows performance similar to self-supervised pretraining on a CXR dataset (~100K images); and (iii) the CNN trained on supervised ImageNet can be trained further with self-supervised CXR images leading to improvements, especially when the downstream dataset is on the order of a few thousand images.
Comment: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States & Virtual, http://www.ml4h.cc, 10 pages
Published: 2022

13. Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Author: Chen, Shizhe, Guhur, Pierre-Louis, Tapaswi, Makarand, Schmid, Cordelia, and Laptev, Ivan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the leftmost chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs makes it possible to disambiguate spatial relations and to localize objects referred to by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach where the teacher model is first trained using ground-truth object labels, and then helps to train a student model using point cloud inputs. We perform ablation studies showing advantages of our approach. We also demonstrate that our model significantly outperforms the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
Comment: Accepted in NeurIPS 2022; Project website: https://cshizhe.github.io/projects/vil3dref.html
Published: 2022

14. Unsupervised Audio-Visual Lecture Segmentation

Author: S, Darshan Singh, Gupta, Anchit, Jawahar, C. V., and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Over the last decade, online lecture videos have become increasingly popular and have experienced a meteoric rise during the pandemic. However, video-language research has primarily focused on instructional videos or movies, and tools to help students navigate the growing online lectures are lacking. Our first contribution is to facilitate research in the educational domain, by introducing AVLectures, a large-scale dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects. Each course contains video lectures, transcripts, OCR outputs for lecture frames, and optionally lecture notes, slides, assignments, and related educational content that can inspire a variety of tasks. Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics that show promise in improving learner engagement. We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content. We use these representations to generate segments using a temporally consistent 1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses and compare it against various visual and textual baselines, outperforming all of them. Our comprehensive ablation studies also identify the key factors driving the success of our approach.
Comment: 17 pages, 14 figures, 14 tables, Accepted to WACV 2023. Project page: https://cvit.iiit.ac.in/research/projects/cvit-projects/avlectures
Published: 2022

15. Grounded Video Situation Recognition

Author: Khan, Zeeshan, Jawahar, C. V., and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) was framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the video clips to enable fine-grained spatio-temporal reasoning. The second stage sees verb-role queries attend and pool information from object embeddings, localising answers to questions posed about the action. The final stage generates these answers as captions to describe each verb-role pair present in the video. Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time.
Comment: Accepted to NeurIPS 2022. Project Page: https://zeeshank95.github.io/grvidsitu
Published: 2022

16. Instruction-driven history-aware policies for robotic manipulations

Author: Guhur, Pierre-Louis, Chen, Shizhe, Garcia, Ricardo, Tapaswi, Makarand, Laptev, Ivan, and Schmid, Cordelia
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging as it requires fine-grained motor control, long-term memory as well as generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes into account multiple inputs. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions and improves manipulation precision using multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
Comment: Accepted in CoRL 2022 (oral); project page at https://guhur.github.io/hiveformer/
Published: 2022

17. Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

Author: Chen, Shizhe, Guhur, Pierre-Louis, Tapaswi, Makarand, Schmid, Cordelia, and Laptev, Ivan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In vision-and-language navigation (VLN), an embodied agent is required to navigate in realistic 3D environments following natural language instructions. One major bottleneck for existing VLN approaches is the lack of sufficient training data, resulting in unsatisfactory generalization to unseen environments. While VLN data is typically collected manually, such an approach is expensive and prevents scalability. In this work, we address the data scarcity issue by proposing to automatically create a large-scale VLN dataset from 900 unlabeled 3D buildings from HM3D. We generate a navigation graph for each building and transfer object predictions from 2D to generate pseudo 3D object labels by cross-view consistency. We then fine-tune a pretrained language model using pseudo object labels as prompts to alleviate the cross-modal gap in instruction generation. Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions. We experimentally demonstrate that HM3D-AutoVLN significantly increases the generalization ability of resulting VLN models. On the SPL metric, our approach improves over state of the art by 7.1% and 8.1% on the unseen validation splits of REVERIE and SOON datasets respectively.
Comment: ECCV 2022
Published: 2022

18. Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

Author: Petrik, Vladimir, Qureshi, Mohammad Nomaan, Sivic, Josef, and Tapaswi, Makarand
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We aim to teach robots to perform simple object manipulation tasks by watching a single video demonstration. Towards this goal, we propose an optimization approach that outputs a coarse and temporally evolving 3D scene to mimic the action demonstrated in the input video. Similar to previous work, a differentiable renderer ensures perceptual fidelity between the 3D scene and the 2D video. Our key novelty lies in the inclusion of a differentiable approach to solve a set of Ordinary Differential Equations (ODEs) that allows us to approximately model laws of physics such as gravity, friction, and hand-object or object-object interactions. This not only enables us to dramatically improve the quality of estimated hand and object states, but also produces physically admissible trajectories that can be directly translated to a robot without the need for costly reinforcement learning. We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions such as pull something from right to left or put something in front of something. Our approach improves over previous state-of-the-art by almost 30%, demonstrating superior quality on especially challenging actions involving physical interactions of two objects such as put something onto something. Finally, we showcase the learned skills on a Franka Emika Panda robot.
Comment: Accepted for IROS2022, code at https://github.com/petrikvladimir/video_skills_learning_with_approx_physics, project page at https://data.ciirc.cvut.cz/public/projects/2022Real2SimPhysics/
Published: 2022

19. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Author: Chen, Shizhe, Guhur, Pierre-Louis, Tapaswi, Makarand, Schmid, Cordelia, and Laptev, Ivan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Following language instructions to navigate in unseen environments is a challenging problem for autonomous embodied agents. The agent not only needs to ground languages in visual scenes, but also should explore the environment to reach its target. In this work, we propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation (VLN) benchmarks REVERIE and SOON. It also improves the success rate on the fine-grained VLN benchmark R2R.
Published: 2022

20. Feature Generation for Long-tail Classification

Author: Vigneswaran, Rahul, Law, Marc T., Balasubramanian, Vineeth N., and Tapaswi, Makarand
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: The visual world naturally exhibits an imbalance in the number of object or scene instances resulting in a long-tailed distribution. This imbalance poses significant challenges for classification models based on deep learning. Oversampling instances of the tail classes attempts to solve this imbalance. However, the limited visual diversity results in a network with poor representation ability. A simple counter to this is decoupling the representation and classifier networks and using oversampling only to train the classifier. In this paper, instead of repeatedly re-sampling the same image (and thereby features), we explore a direction that attempts to generate meaningful features by estimating the tail category's distribution. Inspired by ideas from recent work on few-shot learning, we create calibrated distributions to sample additional features that are subsequently used to train the classifier. Through several experiments on the CIFAR-100-LT (long-tail) dataset with varying imbalance factors and on mini-ImageNet-LT (long-tail), we show the efficacy of our approach and establish a new state-of-the-art. We also present a qualitative analysis of generated features using t-SNE visualizations and analyze the nearest neighbors used to calibrate the tail class distributions. Our code is available at https://github.com/rahulvigneswaran/TailCalibX.
Comment: Accepted at ICVGIP'21. Code available at https://github.com/rahulvigneswaran/TailCalibX
Published: 2021

21. Airbert: In-domain Pretraining for Vision-and-Language Navigation

Author: Guhur, Pierre-Louis, Tapaswi, Makarand, Chen, Shizhe, Laptev, Ivan, and Schmid, Cordelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Vision-and-language navigation (VLN) aims to enable embodied agents to navigate in realistic environments using natural language instructions. Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging. Recent methods explore pretraining to improve generalization; however, the use of generic image-caption datasets or existing small-scale VLN environments is suboptimal and results in limited improvements. In this work, we introduce BnB, a large-scale and diverse in-domain VLN dataset. We first collect image-caption (IC) pairs from hundreds of thousands of listings from online rental marketplaces. Using IC pairs we next propose automatic strategies to generate millions of VLN path-instruction (PI) pairs. We further propose a shuffling loss that improves the learning of temporal order inside PI pairs. We use BnB to pretrain our Airbert model that can be adapted to discriminative and generative settings and show that it outperforms state of the art for Room-to-Room (R2R) navigation and Remote Referring Expression (REVERIE) benchmarks. Moreover, our in-domain pretraining significantly increases performance on a challenging few-shot VLN evaluation, where we train the model only on VLN instructions from a few houses., Comment: To be published at ICCV 2021. Webpage is at https://airbert-vln.github.io/ linking to our dataset, codes and models
Published: 2021

22. Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

Author: Petrík, Vladimír, Tapaswi, Makarand, Laptev, Ivan, and Sivic, Josef
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Humans are adept at learning new tasks by watching a few instructional videos. On the other hand, robots that learn new actions either require a lot of effort through trial and error, or use expert demonstrations that are challenging to obtain. In this paper, we explore a method that facilitates learning object manipulation skills directly from videos. Leveraging recent advances in 2D visual recognition and differentiable rendering, we develop an optimization based method to estimate a coarse 3D state representation for the hand and the manipulated object(s) without requiring any supervision. We use these trajectories as dense rewards for an agent that learns to mimic them through reinforcement learning. We evaluate our method on simple single- and two-object actions from the Something-Something dataset. Our approach allows an agent to learn actions from single videos, while watching multiple demonstrations makes the policy more robust. We show that policies learned in a simulated environment can be easily transferred to a real robot., Comment: CoRL 2020, code at https://github.com/makarandtapaswi/Real2Sim_CoRL2020, project page at https://data.ciirc.cvut.cz/public/projects/2020Real2Sim/
Published: 2020

23. Clustering based Contrastive Learning for Improving Face Representations

Author: Sharma, Vivek, Tapaswi, Makarand, Sarfraz, M. Saquib, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: A good clustering algorithm can discover natural groupings in data. These groupings, if used wisely, provide a form of weak supervision for learning representations. In this work, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that uses labels obtained from clustering along with video constraints to learn discriminative face features. We demonstrate our method on the challenging task of learning representations for video face clustering. Through several ablation studies, we analyze the impact of creating pair-wise positive and negative labels from different sources. Experiments on three challenging video face clustering datasets: BBT-0101, BF-0502, and ACCIO show that CCL achieves a new state-of-the-art on all datasets., Comment: To appear at IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2020
Published: 2020

24. Deep Multimodal Feature Encoding for Video Ordering

Author: Sharma, Vivek, Tapaswi, Makarand, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions. We present a way to learn a compact multimodal feature representation that encodes all these modalities. Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline. To this end, we create a new multimodal dataset for temporal ordering that consists of approximately 30K scenes (2-6 clips per scene) based on the "Large Scale Movie Description Challenge". We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition. We demonstrate empirically that multimodal representations are indeed complementary, and can play a key role in improving the performance of many applications., Comment: IEEE International Conference on Computer Vision (ICCV) Workshop on Large Scale Holistic Video Understanding. The datasets and code are available at https://github.com/vivoutlaw/tcbp
Published: 2020

25. Learning Interactions and Relationships between Movie Characters

Author: Kukleva, Anna, Tapaswi, Makarand, and Laptev, Ivan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Interactions between people are often governed by their relationships. On the flip side, social relationships are built upon several interactions. Two strangers are more likely to greet and introduce themselves while becoming friends over time. We are fascinated by this interplay between interactions and relationships, and believe that it is an important aspect of understanding social situations. In this work, we propose neural models to learn and jointly predict interactions, relationships, and the pair of characters that are involved. We note that interactions are informed by a mixture of visual and dialog cues, and present a multimodal architecture to extract meaningful information from them. Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels. We evaluate our models on the MovieGraphs dataset and show the impact of modalities, use of longer temporal context for predicting relationships, and achieve encouraging performance using weak labels as compared with ground-truth labels. Code is online., Comment: CVPR 2020 (Oral)
Published: 2020

26. The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries

Author: Chaudhury, Atef, Tapaswi, Makarand, Kim, Seung Wook, and Fidler, Sanja
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies. In this paper, we introduce the Shmoop Corpus: a dataset of 231 stories that are paired with detailed multi-paragraph summaries for each individual chapter (7,234 chapters), where the summary is chronologically aligned with respect to the story chapter. From the corpus, we construct a set of common NLP tasks, including Cloze-form question answering and a simplified form of abstractive summarization, as benchmarks for reading comprehension on stories. We then show that the chronological alignment provides a strong supervisory signal that learning-based methods can exploit leading to significant improvements on these tasks. We believe that the unique structure of this corpus provides an important foothold towards making machine story comprehension more approachable., Comment: Project page: http://www.cs.toronto.edu/~makarand/shmoop/ Dataset at: https://github.com/achaudhury/shmoop-corpus/
Published: 2019

27. Video Face Clustering with Unknown Number of Clusters

Author: Tapaswi, Makarand, Law, Marc T., and Fidler, Sanja
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing. We address the challenging problem of clustering face tracks based on their identity. Different from previous work in this area, we choose to operate in a realistic and difficult setting where: (i) the number of characters is not known a priori; and (ii) face tracks belonging to minor or background characters are not discarded. To this end, we propose Ball Cluster Learning (BCL), a supervised approach to carve the embedding space into balls of equal size, one for each cluster. The learned ball radius is easily translated to a stopping criterion for iterative merging algorithms. This gives BCL the ability to estimate the number of clusters as well as their assignment, achieving promising results on commonly used datasets. We also present a thorough discussion of how existing metric learning literature can be adapted for this task., Comment: Accepted to ICCV 2019, code and data at https://github.com/makarandtapaswi/BallClustering_ICCV2019
Published: 2019

28. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Author: Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, and Sivic, Josef
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/., Comment: Accepted at ICCV 2019
Published: 2019

29. Self-Supervised Learning of Face Representations for Video Face Clustering

Author: Sharma, Vivek, Tapaswi, Makarand, Sarfraz, M. Saquib, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Analyzing the story behind TV series and movies often requires understanding who the characters are and what they are doing. With improving deep face models, this may seem like a solved problem. However, as face detectors get better, clustering/identification needs to be revisited to address increasing diversity in facial appearance. In this paper, we address video face clustering using unsupervised methods. Our emphasis is on distilling the essential information, identity, from the representations obtained using deep pre-trained face networks. We propose a self-supervised Siamese network that can be trained without the need for video/track based supervision, and thus can also be applied to image collections. We evaluate our proposed method on three video face clustering datasets. The experiments show that our methods outperform current state-of-the-art methods on all datasets. Video face clustering is lacking a common benchmark as current works are often evaluated with different metrics and/or different sets of face tracks., Comment: To appear at International Conference on Automatic Face and Gesture Recognition (2019) as an Oral. The datasets and code are available at https://github.com/vivoutlaw/SSIAM
Published: 2019

30. Visual Reasoning by Progressive Module Networks

Author: Kim, Seung Wook, Tapaswi, Makarand, and Fidler, Sanja
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Humans learn to solve tasks of increasing complexity by building on top of previously acquired knowledge. Typically, there exists a natural progression in the tasks that we learn - most do not require completely independent solutions, but can be broken down into simpler subtasks. We propose to represent a solver for each task as a neural module that calls existing modules (solvers for simpler tasks) in a functional program-like manner. Lower modules are a black box to the calling module, and communicate only via a query and an output. Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output. Our model effectively combines previous skill-sets, does not suffer from forgetting, and is fully differentiable. We test our model in learning a set of visual reasoning tasks, and demonstrate improved performances in all tasks by learning progressively. By evaluating the reasoning process using human judges, we show that our model is more interpretable than an attention-based baseline., Comment: 17 pages, 5 figures
Published: 2018

31. MovieGraphs: Towards Understanding Human-Centric Situations from Videos

Author: Vicol, Paul, Tapaswi, Makarand, Castrejon, Lluis, and Fidler, Sanja
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: There is growing interest in artificial intelligence to build socially intelligent robots. This requires machines to have the ability to "read" people's emotions, motivations, and other factors that affect behavior. Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions. In addition, most interactions and many attributes are grounded in the video with time stamps. We provide a thorough analysis of our dataset, showing interesting common-sense correlations between different social aspects of scenes, as well as across scenes over time. We propose a method for querying videos and text with graphs, and show that: 1) our graphs contain rich and sufficient information to summarize and localize each scene; and 2) subgraphs allow us to describe situations at an abstract level and retrieve multiple semantically relevant situations. We also propose methods for interaction understanding via ordering, and reason understanding. MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations, and opens up an exciting avenue towards socially-intelligent AI agents., Comment: Spotlight at CVPR 2018. Webpage: http://moviegraphs.cs.toronto.edu
Published: 2017

32. Situation Recognition with Graph Neural Networks

Author: Li, Ruiyu, Tapaswi, Makarand, Liao, Renjie, Jia, Jiaya, Urtasun, Raquel, and Fidler, Sanja
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We address the problem of recognizing situations in images. Given an image, the task is to predict the most salient verb (action), and fill its semantic roles such as who is performing the action, what is the source and target of the action, etc. Different verbs have different roles (e.g. attacking has weapon), and each role can take on many possible values (nouns). We propose a model based on Graph Neural Networks that allows us to efficiently capture joint dependencies between roles using neural networks defined on a graph. Experiments with different graph connectivities show that our approach that propagates information between roles significantly outperforms existing work, as well as multiple baselines. We obtain roughly 3-5% improvement over previous work in predicting the full situation. We also provide a thorough qualitative analysis of our model and influence of different roles in the verbs., Comment: ICCV2017
Published: 2017

33. Relaxed Earth Mover's Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

Author: Martinez, Manuel, Haurilet, Monica, Al-Halah, Ziad, Tapaswi, Makarand, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them. In deep learning, the EMD loss allows us to embed information during training about the output space structure like hierarchical or semantic relations. This helps in achieving better output smoothness and generalization. However, EMD is computationally expensive. Moreover, solving EMD optimization problems usually requires complex techniques like lasso. These properties limit the applicability of EMD-based approaches in large scale machine learning. We address in this work the difficulties facing incorporation of EMD-based loss in deep learning frameworks. Additionally, we provide insight and novel solutions on how to integrate such loss function in training deep neural networks. Specifically, we make three main contributions: (i) we provide an in-depth analysis of the fastest state-of-the-art EMD algorithm (Sinkhorn Distance) and discuss its limitations in deep learning scenarios; (ii) we derive fast and numerically stable closed-form solutions for the EMD gradient in output spaces with chain- and tree-connectivity; and (iii) we propose a relaxed form of the EMD gradient with equivalent computational complexity but faster convergence rate. We support our claims with experiments on real datasets. In a restricted data setting on the ImageNet dataset, we train a model to classify 1000 categories using 50K images, and demonstrate that our relaxed EMD loss achieves better Top-1 accuracy than the cross entropy loss. Overall, we show that our relaxed EMD loss criterion is a powerful asset for deep learning in the small data regime.
Published: 2016

34. Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning

Author: Al-Halah, Ziad, Tapaswi, Makarand, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Collecting training images for all visual categories is not only expensive but also impractical. Zero-shot learning (ZSL), especially using attributes, offers a pragmatic solution to this problem. However, at test time most attribute-based methods require a full description of attribute associations for each unseen class. Providing these associations is time consuming and often requires domain specific knowledge. In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner. We propose an approach to learn relations that couples class embeddings with their corresponding attributes. Given only the name of an unseen class, the learned relationship model is used to automatically predict the class-attribute associations. Furthermore, our model facilitates transferring attributes across data sets without additional effort. Integrating knowledge from multiple sources results in a significant additional improvement in performance. We evaluate on two public data sets: Animals with Attributes and aPascal/aYahoo. Our approach outperforms state-of-the-art methods in both predicting class-attribute associations and unsupervised ZSL by a large margin., Comment: Published as a conference paper at CVPR 2016
Published: 2016

35. Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

Author: Chen, Shizhe, primary, Guhur, Pierre-Louis, additional, Tapaswi, Makarand, additional, Schmid, Cordelia, additional, and Laptev, Ivan, additional
Published: 2022
Full Text: View/download PDF

36. Long term spatio-temporal modeling for action detection

Author: Tapaswi, Makarand, Kumar, Vijay, and Laptev, Ivan
Published: 2021
Full Text: View/download PDF

37. MovieQA: Understanding Stories in Movies through Question-Answering

Author: Tapaswi, Makarand, Zhu, Yukun, Stiefelhagen, Rainer, Torralba, Antonio, Urtasun, Raquel, and Fidler, Sanja
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The dataset consists of 14,944 questions about 408 movies with high semantic diversity. The questions range from simpler "Who" did "What" to "Whom", to "Why" and "How" certain events occurred. Each question comes with a set of five possible answers; a correct one and four deceiving answers provided by human annotators. Our dataset is unique in that it contains multiple sources of information -- video clips, plots, subtitles, scripts, and DVS. We analyze our data through various statistics and methods. We further extend existing QA techniques to show that question-answering with such open-ended semantics is hard. We make this data set public along with an evaluation benchmark to encourage inspiring work in this challenging domain., Comment: CVPR 2016, Spotlight presentation. Benchmark @ http://movieqa.cs.toronto.edu/ Code @ https://github.com/makarandtapaswi/MovieQA_CVPR2016/
Published: 2015

38. How You Feelin’? Learning Emotions and Mental States in Movie Scenes

Author: Srivastava, Dhruv, primary, Singh, Aditya Kumar, additional, and Tapaswi, Makarand, additional
Published: 2023
Full Text: View/download PDF

39. Test of Time: Instilling Video-Language Models with a Sense of Time

Author: Bagad, Piyush, primary, Tapaswi, Makarand, additional, and Snoek, Cees G.M., additional
Published: 2023
Full Text: View/download PDF

40. GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

Author: Taunk, Dhaval, primary, Khanna, Lakshya, additional, Kandru, Siri Venkata Pavan Kumar, additional, Varma, Vasudeva, additional, Sharma, Charu, additional, and Tapaswi, Makarand, additional
Published: 2023
Full Text: View/download PDF

41. Unsupervised Audio-Visual Lecture Segmentation

Author: Singh S, Darshan, primary, Gupta, Anchit, additional, Jawahar, C. V., additional, and Tapaswi, Makarand, additional
Published: 2023
Full Text: View/download PDF

42. Fusion of Speech, Faces and Text for Person Identification in TV Broadcast

Author: Bredin, Hervé, Poignant, Johann, Tapaswi, Makarand, Fortier, Guillaume, Le, Viet Bac, Napoleon, Thibault, Gao, Hua, Barras, Claude, Rosset, Sophie, Besacier, Laurent, Verbeek, Jakob, Quénot, Georges, Jurie, Frédéric, Ekenel, Hazim Kemal, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Fusiello, Andrea, editor, Murino, Vittorio, editor, and Cucchiara, Rita, editor
Published: 2012
Full Text: View/download PDF

43. Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Author: Chen, Shizhe, primary, Guhur, Pierre-Louis, additional, Tapaswi, Makarand, additional, Schmid, Cordelia, additional, and Laptev, Ivan, additional
Published: 2022
Full Text: View/download PDF

44. Aligning plot synopses to videos for story-based retrieval

Author: Tapaswi, Makarand, Bäuml, Martin, and Stiefelhagen, Rainer
Published: 2015
Full Text: View/download PDF

45. Feature generation for long-tail classification

Author: Vigneswaran, Rahul, primary, Law, Marc T., additional, Balasubramanian, Vineeth N., additional, and Tapaswi, Makarand, additional
Published: 2021
Full Text: View/download PDF

46. Airbert: In-domain Pretraining for Vision-and-Language Navigation

Author: Guhur, Pierre-Louis, primary, Tapaswi, Makarand, additional, Chen, Shizhe, additional, Laptev, Ivan, additional, and Schmid, Cordelia, additional
Published: 2021
Full Text: View/download PDF

47. Clustering based Contrastive Learning for Improving Face Representations

Author: Sharma, Vivek, primary, Tapaswi, Makarand, additional, Sarfraz, M. Saquib, additional, and Stiefelhagen, Rainer, additional
Published: 2020
Full Text: View/download PDF

48. Fusion of Speech, Faces and Text for Person Identification in TV Broadcast

Author: Bredin, Hervé, primary, Poignant, Johann, additional, Tapaswi, Makarand, additional, Fortier, Guillaume, additional, Le, Viet Bac, additional, Napoleon, Thibault, additional, Gao, Hua, additional, Barras, Claude, additional, Rosset, Sophie, additional, Besacier, Laurent, additional, Verbeek, Jakob, additional, Quénot, Georges, additional, Jurie, Frédéric, additional, and Ekenel, Hazim Kemal, additional
Published: 2012
Full Text: View/download PDF

49. Learning Interactions and Relationships Between Movie Characters

Author: Kukleva, Anna, primary, Tapaswi, Makarand, additional, and Laptev, Ivan, additional
Published: 2020
Full Text: View/download PDF

50. Video Face Clustering With Self-Supervised Representation Learning

Author: Sharma, Vivek, primary, Tapaswi, Makarand, additional, Sarfraz, M. Saquib, additional, and Stiefelhagen, Rainer, additional
Published: 2020
Full Text: View/download PDF
