Author: "Mirza, M. Jehanzeb" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Mirza, M. Jehanzeb"' showing total 16 results

1. LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content

Author: Shabtay, Nimrod, Polo, Felipe Maia, Doveh, Sivan, Lin, Wei, Mirza, M. Jehanzeb, Chosen, Leshem, Yurochkin, Mikhail, Sun, Yuekai, Arbelle, Assaf, Karlinsky, Leonid, and Giryes, Raja
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The large-scale training of multi-modal models on data scraped from the web has shown outstanding utility in infusing these models with the required world knowledge to perform effectively on multiple downstream tasks. However, one downside of scraping data from the web can be the potential sacrifice of the benchmarks on which the abilities of these models are often evaluated. To safeguard against test data contamination and to truly test the abilities of these foundation models we propose LiveXiv: A scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs (VQA). This is done without any human-in-the-loop, using the multi-modal content in the manuscripts, like graphs, charts, and tables. Moreover, we introduce an efficient evaluation approach that estimates the performance of all models on the evolving benchmark using evaluations of only a subset of models. This significantly reduces the overall evaluation cost. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities, avoiding contamination. Lastly, in our commitment to high quality, we have collected and evaluated a manually verified subset. By comparing its overall results to our automatic annotations, we have found that the performance variance is indeed minimal (<2.5%). Our dataset is available online on HuggingFace, and our code will be available here.
Published: 2024

2. GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Author: Mirza, M. Jehanzeb, Zhao, Mengjie, Mao, Zhuoyuan, Doveh, Sivan, Lin, Wei, Gavrikov, Paul, Dorkenwald, Michael, Yang, Shiqi, Jha, Saurav, Wakaki, Hiromi, Mitsufuji, Yuki, Possegger, Horst, Feris, Rogerio, Karlinsky, Leonid, and Glass, James
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this work, we propose a novel method (GLOV) enabling Large Language Models (LLMs) to act as implicit Optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. Our GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to a purity measure obtained through a fitness function. In each respective optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of text prompts preferred by the downstream VLM. Furthermore, we also explicitly steer the LLM generation process in each optimization step by specifically adding an offset difference vector of the embeddings from the positive and negative solutions found by the LLM, in previous optimization steps, to the intermediate layer of the network for the next generation step. This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate our GLOV on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVa) models -- showing that the discovered solutions can enhance the recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these models.
Comment: Code: https://github.com/jmiemirza/GLOV
Published: 2024

3. ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs

Author: Huang, Irene, Lin, Wei, Mirza, M. Jehanzeb, Hansen, Jacob A., Doveh, Sivan, Butoi, Victor Ion, Herzig, Roei, Arbelle, Assaf, Kuehne, Hilde, Darrell, Trevor, Gan, Chuang, Oliva, Aude, Feris, Rogerio, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order. Recent Vision-Language Models (VLMs), comprising a visual encoder and a Large Language Model (LLM) decoder, have demonstrated remarkable proficiency in such reasoning tasks. This prompts a crucial question: have VLMs effectively tackled the CR challenge? We conjecture that existing CR benchmarks may not adequately push the boundaries of modern VLMs due to the reliance on an LLM-only negative text generation pipeline. Consequently, the negatives produced either appear as outliers from the natural language distribution learned by VLMs' LLM decoders or as improbable within the corresponding image context. To address these limitations, we introduce ConMe -- a compositional reasoning benchmark and a novel data generation pipeline leveraging VLMs to produce 'hard CR Q&A'. Through a new concept of VLMs conversing with each other to collaboratively expose their weaknesses, our pipeline autonomously generates, evaluates, and selects challenging compositional reasoning questions, establishing a robust CR benchmark, also subsequently validated manually. Our benchmark provokes a noteworthy, up to 33%, decrease in CR performance compared to preceding benchmarks, reinstating the CR challenge even for state-of-the-art VLMs.
Comment: NeurIPS 2024 Camera Ready
Published: 2024

4. Into the Fog: Evaluating Robustness of Multiple Object Tracking

Author: Kirillova, Nadezda, Mirza, M. Jehanzeb, Bischof, Horst, and Possegger, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: State-of-the-art Multiple Object Tracking (MOT) approaches have shown remarkable performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear weather scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of trackers against these challenging conditions remains underexplored. To address this gap, we introduce a physics-based volumetric fog simulation method for arbitrary MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. We enhance our simulation by rendering both homogeneous and heterogeneous fog and propose to use the dark channel prior method to estimate atmospheric light, showing promising results even in night and indoor scenes. We present the leading benchmark MOTChallenge (third release) augmented with fog (smoke for indoor scenes) of various intensities and conduct a comprehensive evaluation of MOT methods, revealing their limitations under fog and fog-like challenges.
Published: 2024

5. Towards Multimodal In-Context Learning for Vision & Language Models

Author: Doveh, Sivan, Perek, Shaked, Mirza, M. Jehanzeb, Lin, Wei, Alfassy, Amit, Arbelle, Assaf, Ullman, Shimon, and Karlinsky, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: State-of-the-art Vision-Language Models (VLMs) ground the vision and the language modality primarily via projecting the vision tokens from the encoder to language-like tokens, which are directly fed to the Large Language Model (LLM) decoder. While these models have shown unprecedented performance in many downstream zero-shot tasks (e.g., image captioning, question answering, etc.), still little emphasis has been put on transferring one of the core LLM capabilities, In-Context Learning (ICL). ICL is the ability of a model to reason about a downstream task with a few example demonstrations embedded in the prompt. In this work, through extensive evaluations, we find that the state-of-the-art VLMs somewhat lack the ability to follow ICL instructions. In particular, we discover that even models that underwent large-scale mixed modality pre-training and were implicitly guided to make use of interleaved image and text information (intended to consume helpful context from multiple images) under-perform when prompted with few-shot demonstrations (in an ICL way), likely due to their lack of direct ICL instruction tuning. To enhance the ICL abilities of the present VLM, we propose a simple yet surprisingly effective multi-turn curriculum-based learning methodology with effective data mixes, leading up to a significant 21.03% (and 11.3% on average) ICL performance boost over the strongest VLM baselines and a variety of ICL benchmarks. Furthermore, we also contribute new benchmarks for ICL evaluation in VLMs and discuss their advantages over the prior art.
Published: 2024

6. Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Author: Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Doveh, Sivan, Micorek, Jakub, Kozinski, Mateusz, Kuehne, Hilde, and Possegger, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance the zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts and still, they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.
Comment: ECCV Camera Ready. Code & Data: https://jmiemirza.github.io/Meta-Prompting/
Published: 2024

7. Meta-prompting for Automating Zero-Shot Visual Recognition with LLMs

Author: Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Doveh, Sivan, Micorek, Jakub, Kozinski, Mateusz, Kuehne, Hilde, and Possegger, Horst
Editor: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Series Editor: Goos, Gerhard
Founding Editor: Hartmanis, Juris
Editorial Board Member: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti
Published: 2025

8. TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification

Author: Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Possegger, Horst, Feris, Rogerio, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs' visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be significantly further improved by taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines.
Comment: Code is available at: https://github.com/jmiemirza/TAP
Published: 2023

9. Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions

Author: Leitner, Stefan, Mirza, M. Jehanzeb, Lin, Wei, Micorek, Jakub, Masana, Marc, Kozinski, Mateusz, Possegger, Horst, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In autonomous driving scenarios, current object detection models show strong performance when tested in clear weather. However, their performance deteriorates significantly when tested in degrading weather conditions. In addition, even when adapted to perform robustly in a sequence of different weather conditions, they are often unable to perform well in all of them and suffer from catastrophic forgetting. To efficiently mitigate forgetting, we propose Domain-Incremental Learning through Activation Matching (DILAM), which employs unsupervised feature alignment to adapt only the affine parameters of a clear weather pre-trained network to different weather conditions. We propose to store these affine parameters as a memory bank for each weather condition and plug in their weather-specific parameters during driving (i.e., test time) when the respective weather conditions are encountered. Our memory bank is extremely lightweight, since affine parameters account for less than 2% of a typical object detector. Furthermore, contrary to previous domain-incremental learning approaches, we do not require the weather label when testing and propose to automatically infer the weather condition by a majority voting linear classifier.
Comment: Intelligent Vehicle Conference (oral presentation)
Published: 2023

10. LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

Author: Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Kozinski, Mateusz, Possegger, Horst, Feris, Rogerio, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision.
Comment: NeurIPS 2023 (Camera Ready) - Project Page: https://jmiemirza.github.io/LaFTer/
Published: 2023

11. MATE: Masked Autoencoders are Online 3D Test-Time Learners

Author: Mirza, M. Jehanzeb, Shin, Inkyu, Lin, Wei, Schriebl, Andreas, Sun, Kunyang, Choe, Jaesung, Possegger, Horst, Kozinski, Mateusz, Kweon, In So, Yoon, Kun-Jin, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications.
Comment: Code is available at this repository: https://github.com/jmiemirza/MATE
Published: 2022

12. An Efficient Domain-Incremental Learning Approach to Drive in All Weather Conditions

Author: Mirza, M. Jehanzeb, Masana, Marc, Possegger, Horst, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although deep neural networks enable impressive visual perception performance for autonomous driving, their robustness to varying weather conditions still requires attention. When adapting these models for changed environments, such as different weather conditions, they are prone to forgetting previously learned information. This catastrophic forgetting is typically addressed via incremental learning approaches which usually re-train the model by either keeping a memory bank of training samples or keeping a copy of the entire model or model parameters for each scenario. While these approaches show impressive results, they can be prone to scalability issues and their applicability for autonomous driving in all weather conditions has not been shown. In this paper we propose DISC -- Domain Incremental through Statistical Correction -- a simple online zero-forgetting approach which can incrementally learn new tasks (i.e., weather conditions) without requiring re-training or expensive memory banks. The only information we store for each task is the statistical parameters, as we categorize each domain by the change in first and second order statistics. Thus, as each task arrives, we simply 'plug and play' the statistical vectors for the corresponding task into the model and it immediately starts to perform well on that task. We show the efficacy of our approach by testing it for object detection in a challenging domain-incremental autonomous driving scenario where we encounter different adverse weather conditions, such as heavy rain, fog, and snow.
Comment: Accepted to CVPR Workshops - Camera Ready Version
Published: 2022

13. The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization

Author: Mirza, M. Jehanzeb, Micorek, Jakub, Possegger, Horst, and Bischof, Horst
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g., autonomous driving in challenging weather conditions. To address this problem of continuous adaptation to distribution shifts, we propose Dynamic Unsupervised Adaptation (DUA). By continuously adapting the statistics of the batch normalization layers we modify the feature representations of the model. We show that by sequentially adapting a model with only a fraction of unlabeled data, a strong performance gain can be achieved. With even less than 1% of unlabeled data from the target domain, DUA already achieves competitive results to strong baselines. In addition, the computational overhead is minimal in contrast to previous approaches. Our approach is simple, yet effective and can be applied to any architecture which uses batch normalization as one of its components. We show the utility of DUA by evaluating it on a variety of domain adaptation datasets and tasks including object recognition, digit recognition and object detection.
Comment: Accepted to CVPR 2022 - Camera Ready Version - Code: https://github.com/jmiemirza/DUA
Published: 2021

14. Into the Fog: Evaluating Multiple Object Tracking Robustness

Author: Kirillova, Nadezda, Mirza, M. Jehanzeb, Possegger, Horst, and Bischof, Horst
Abstract: State-of-the-art (SOTA) trackers have shown remarkable Multiple Object Tracking (MOT) performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of SOTA trackers remains underexplored. To address these limitations, we propose a pipeline for physics-based volumetric fog simulation in arbitrary real-world MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. Moreover, we enhance our simulation by rendering both homogeneous and heterogeneous fog effects. We propose to use the dark channel prior method to estimate fog (smoke) color, which shows promising results even in night and indoor scenes. We present the leading tracking benchmark MOTChallenge (MOT17 dataset) overlaid with fog (smoke for indoor scenes) of various intensity levels and conduct a comprehensive evaluation of SOTA MOT methods, revealing their limitations under fog and fog-like challenges.
Published: 2024

15. ActMAD: Activation Matching to Align Distributions for Test-Time-Training

Author: Mirza, M. Jehanzeb, Soneira, Pol Jané, Lin, Wei, Kozinski, Mateusz, Possegger, Horst, and Bischof, Horst
Published: 2023

16. The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization

Author: Mirza, M. Jehanzeb, Micorek, Jakub, Possegger, Horst, and Bischof, Horst
Published: 2022
