Author: "Naeem, Muhammad Ferjad" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Naeem, Muhammad Ferjad"' showing total 34 results


1. Active Data Curation Effectively Distills Large-Scale Multimodal Models

Author: Udandarao, Vishaal, Parthasarathy, Nikhil, Naeem, Muhammad Ferjad, Evans, Talfan, Albanie, Samuel, Tombari, Federico, Xian, Yongqin, Tonioni, Alessio, and Hénaff, Olivier J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. Prior works have explored ever more complex KD strategies involving different objective functions, teacher-ensembles, and weight inheritance. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations. Further, we find that such an active data curation strategy is in fact complementary to standard KD, and the two can be effectively combined to train highly performant inference-efficient models. Our simple and scalable pretraining framework, ACED, achieves state-of-the-art results across 27 zero-shot classification and retrieval tasks with up to 11% fewer inference FLOPs. We further demonstrate that our ACED models yield strong vision-encoders for training generative multimodal models in the LiT-Decoder setting, outperforming larger vision encoders for image-captioning and visual question-answering tasks.
Published: 2024

2. TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

Author: Wang, Haiyang, Fan, Yue, Naeem, Muhammad Ferjad, Xian, Yongqin, Lenssen, Jan Eric, Wang, Liwei, Tombari, Federico, and Schiele, Bernt
Subjects: Computer Science - Machine Learning
Abstract: Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce TokenFormer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code and models are available at https://github.com/Haiyang-W/TokenFormer.
Published: 2024
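
The token-parameter attention idea described in the TokenFormer abstract above can be sketched roughly as follows. This is an illustrative simplification, not the authors' implementation: the function names are hypothetical, and a standard scaled softmax is assumed where the paper may use a different normalization. An input token acts as the query, and two lists of learnable parameter tokens act as keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_parameter_attention(x, key_params, value_params):
    """Attend from one input token to a set of parameter tokens.

    x            : input token (query), list of d_in floats
    key_params   : n parameter tokens of size d_in (keys)
    value_params : n parameter tokens of size d_out (values)
    Assumption: plain scaled dot-product softmax attention.
    """
    scale = math.sqrt(len(x))
    scores = [sum(xi * ki for xi, ki in zip(x, key)) / scale for key in key_params]
    weights = softmax(scores)
    d_out = len(value_params[0])
    return [sum(w * v[j] for w, v in zip(weights, value_params)) for j in range(d_out)]


# Capacity grows by appending key-value parameter pairs; the input and
# output dimensions of the layer stay the same.
keys = [[1.0, 0.0]]
values = [[2.0, 3.0, 4.0]]
small = token_parameter_attention([1.0, 0.0], keys, values)
keys.append([0.0, 1.0])
values.append([0.0, 0.0, 0.0])
large = token_parameter_attention([1.0, 0.0], keys, values)
```

In this sketch the output width depends only on the value tokens, which is why pairs can be added without changing the shape of the surrounding network.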

3. Toward a Diffusion-Based Generalist for Dense Vision Tasks

Author: Fan, Yue, Xian, Yongqin, Zhai, Xiaohua, Kolesnikov, Alexander, Naeem, Muhammad Ferjad, Schiele, Bernt, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Building generalized models that can solve many computer vision tasks simultaneously is an intriguing direction. Recent works have shown that the image itself can be used as a natural interface for general-purpose visual perception and have demonstrated inspiring results. In this paper, we explore diffusion-based vision generalists, where we unify different types of dense prediction tasks as conditional image generation and re-purpose pre-trained diffusion models for it. However, directly applying off-the-shelf latent diffusion models leads to a quantization issue. Thus, we propose to perform diffusion in pixel space and provide a recipe for finetuning pre-trained text-to-image diffusion models for dense vision tasks. In experiments, we evaluate our method on four different types of tasks and show competitive performance to the other vision generalists.
Comment: Published at CVPR 2024 as a workshop paper
Published: 2024

4. How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Author: Khattak, Muhammad Uzair, Naeem, Muhammad Ferjad, Hassan, Jameel, Naseer, Muzammal, Tombari, Federico, Khan, Fahad Shahbaz, and Khan, Salman
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect assessing their reasoning capabilities over complex videos in the real-world context, and robustness of these models through the lens of user prompts as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 9 recent models, including both open-source and closed-source variants, and find that most of the Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to enhance the performance of existing Video-LMMs. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code are publicly available at: https://mbzuai-oryx.github.io/CVRR-Evaluation-Suite/.
Comment: Technical report
Published: 2024

5. GiT: Towards Generalist Vision Transformer through Universal Language Interface

Author: Wang, Haiyang, Tang, Hao, Jiang, Li, Shi, Shaoshuai, Naeem, Muhammad Ferjad, Li, Hongsheng, Schiele, Bernt, and Wang, Liwei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes a simple, yet effective framework, called GiT, simultaneously applicable for various vision tasks only with a vanilla ViT. Motivated by the universality of the Multi-layer Transformer architecture (e.g., GPT) widely used in large language models (LLMs), we seek to broaden its scope to serve as a powerful vision foundation model (VFM). However, unlike language modeling, visual tasks typically require specific modules, such as bounding box heads for detection and pixel decoders for segmentation, greatly hindering the application of powerful multi-layer transformers in the vision domain. To solve this, we design a universal language interface that empowers the successful auto-regressive decoding to adeptly unify various visual tasks, from image-level understanding (e.g., captioning), over sparse perception (e.g., detection), to dense prediction (e.g., segmentation). Based on the above designs, the entire model is composed solely of a ViT, without any specific additions, offering a remarkable architectural simplification. GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning. Interestingly, our GiT builds a new benchmark in generalist performance, and fosters mutual enhancement across tasks, leading to significant improvements compared to isolated training. This reflects a similar impact observed in LLMs. Further enriching training with 27 datasets, GiT achieves strong zero-shot results over various tasks. Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language. Code and models will be available at https://github.com/Haiyang-W/GiT.
Published: 2024

6. Human Pose Descriptions and Subject-Focused Attention for Improved Zero-Shot Transfer in Human-Centric Classification Tasks

Author: Khan, Muhammad Saif Ullah, Naeem, Muhammad Ferjad, Tombari, Federico, Van Gool, Luc, Stricker, Didier, and Afzal, Muhammad Zeshan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel LLM-based pipeline for creating contextual descriptions of human body poses in images using only auxiliary attributes. This approach facilitates the creation of the MPII Pose Descriptions dataset, which includes natural language annotations for 17,367 images containing people engaged in 410 distinct activities. We demonstrate the effectiveness of our pose descriptions in enabling zero-shot human-centric classification using CLIP. Moreover, we introduce the FocusCLIP framework, which incorporates Subject-Focused Attention (SFA) in CLIP for improved text-to-image alignment. Our models were pretrained on the MPII Pose Descriptions dataset and their zero-shot performance was evaluated on five unseen datasets covering three tasks. FocusCLIP outperformed the baseline CLIP model, achieving an average accuracy increase of 8.61% (33.65% compared to CLIP's 25.04%). Notably, our approach yielded improvements of 3.98% in activity recognition, 14.78% in age classification, and 7.06% in emotion recognition. These results highlight the potential of integrating detailed pose descriptions and subject-level guidance into general pretraining frameworks for enhanced performance in downstream tasks.
Published: 2024

7. Learning to Prompt with Text Only Supervision for Vision-Language Models

Author: Khattak, Muhammad Uzair, Naeem, Muhammad Ferjad, Naseer, Muzammal, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision, due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and often struggle to generalize towards new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods by generating class descriptions from large language models (LLMs) and performing prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher costs by generating LLM descriptions for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial due to the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, it enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive to those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
Comment: Project Page: https://muzairkhattak.github.io/ProText/
Published: 2024

8. SILC: Improving Vision Language Pretraining with Self-distillation

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, Tombari, Federico, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

9. SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Author: Hoyer, Lukas, Tan, David Joseph, Naeem, Muhammad Ferjad, Van Gool, Luc, Tombari, Federico, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025

10. I2DFormer+: Learning Image to Document Summary Attention for Zero-Shot Image Classification

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Gool, Luc Van, and Tombari, Federico
Published: 2024

11. SILC: Improving Vision Language Pretraining with Self-Distillation

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Zhai, Xiaohua, Hoyer, Lukas, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation, while also providing improvements on image-level tasks such as classification and retrieval. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation. We further show that SILC features greatly benefit open vocabulary detection, captioning and visual question answering.
Published: 2023
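
The exponential moving average (EMA) teacher mentioned in the SILC abstract above follows a standard self-distillation pattern, which can be sketched as below. This is a generic illustration and not the SILC training code: the function name `ema_update` and the decay value are assumptions chosen for the example.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight.

    teacher, student : flat lists of floats standing in for model weights
    decay            : assumed smoothing factor; values near 1 make the
                       teacher change slowly
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# The teacher is never trained by gradient descent; it only tracks the student.
teacher_weights = [1.0, -1.0]
student_weights = [0.0, 0.0]
for _ in range(1000):
    teacher_weights = ema_update(teacher_weights, student_weights)
```

With a fixed student, the teacher weights decay geometrically toward the student weights, so the teacher acts as a smoothed copy of recent student states.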

12. Introducing Language Guidance in Prompt-based Continual Learning

Author: Khan, Muhammad Gul Zain Ali, Naeem, Muhammad Ferjad, Van Gool, Luc, Stricker, Didier, Tombari, Federico, and Afzal, Muhammad Zeshan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain still remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive when the number of tasks becomes large or data cannot be stored for privacy reasons. As an alternative, prompt-based methods have been proposed that store the task information in a learnable prompt pool. This prompt pool instructs a frozen image encoder on how to solve each task. While the model faces a disjoint set of classes in each task in this setting, we argue that these classes can be encoded to the same embedding space of a pre-trained language encoder. In this work, we propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods. LGCL is model agnostic and introduces language guidance at the task level in the prompt pool and at the class level on the output feature of the vision encoder. We show with extensive experimentation that LGCL consistently improves the performance of prompt-based continual learning methods to set a new state of the art. LGCL achieves these performance improvements without needing any additional learnable parameters.
Comment: Accepted at ICCV 2023
Published: 2023

13. I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Author: Naeem, Muhammad Ferjad, Khan, Muhammad Gul Zain Ali, Xian, Yongqin, Afzal, Muhammad Zeshan, Stricker, Didier, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
Published: 2022

14. Learning Attention Propagation for Compositional Zero-Shot Learning

Author: Khan, Muhammad Gul Zain Ali, Naeem, Muhammad Ferjad, Van Gool, Luc, Pagani, Alain, Stricker, Didier, and Afzal, Muhammad Zeshan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Compositional zero-shot learning aims to recognize unseen compositions of seen visual primitives of object classes and their states. While all primitives (states and objects) are observable during training in some combination, their complex interaction makes this task especially hard. For example, wet changes the visual appearance of a dog very differently from a bicycle. Furthermore, we argue that relationships between compositions go beyond shared states or objects. A cluttered office can contain a busy table; even though these compositions don't share a state or object, the presence of a busy table can guide the presence of a cluttered office. We propose a novel method called Compositional Attention Propagated Embedding (CAPE) as a solution. The key intuition to our method is that a rich dependency structure exists between compositions arising from complex interactions of primitives in addition to other dependencies between compositions. CAPE learns to identify this structure and propagates knowledge between them to learn class embedding for all seen and unseen compositions. In the challenging generalized compositional zero-shot setting, we show that our method outperforms previous baselines to set a new state-of-the-art on three publicly available benchmarks.
Published: 2022

15. I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the tremendous progress in zero-shot learning (ZSL), the majority of existing methods still rely on human-annotated attributes, which are difficult to annotate and scale. An unsupervised alternative is to represent each class using the word embedding associated with its semantic class name. However, word embeddings extracted from pre-trained language models do not necessarily capture visual similarities, resulting in poor zero-shot performance. In this work, we argue that online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes, and therefore can be used as powerful unsupervised side information for ZSL. To this end, we propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents by aligning both modalities in a shared embedding space. In order to distill discriminative visual words from noisy documents, we introduce a new cross-modal attention module that learns fine-grained interactions between image patches and document words. Consequently, our I2DFormer not only learns highly discriminative document embeddings that capture visual similarities but also gains the ability to localize visually relevant words in image regions. Quantitatively, we demonstrate that our I2DFormer significantly outperforms previous unsupervised semantic embeddings under both zero-shot and generalized zero-shot learning settings on three public datasets. Qualitatively, we show that our method leads to highly interpretable results where document words can be grounded in the image regions.
Comment: 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
Published: 2022

16. 3D Compositional Zero-shot Learning with DeCompositional Consensus

Author: Naeem, Muhammad Ferjad, Örnek, Evin Pınar, Xian, Yongqin, Van Gool, Luc, and Tombari, Federico
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Parts represent a basic unit of geometric and semantic similarity across different objects. We argue that part knowledge should be composable beyond the observed object classes. Towards this, we present 3D Compositional Zero-shot Learning as a problem of part generalization from seen to unseen object classes for semantic segmentation. We provide a structured study through benchmarking the task with the proposed Compositional-PartNet dataset. This dataset is created by processing the original PartNet to maximize part overlap across different objects. The existing point cloud part segmentation methods fail to generalize to unseen object classes in this setting. As a solution, we propose DeCompositional Consensus, which combines a part segmentation network with a part scoring network. The key intuition to our approach is that a segmentation mask over some parts should have a consensus with its part scores when each part is taken apart. The two networks reason over different part combinations defined in a per-object part prior to generate the most suitable segmentation mask. We demonstrate that our method allows compositional zero-shot segmentation and generalized zero-shot classification, and establishes the state of the art on both tasks.
Published: 2021

17. Learning Graph Embeddings for Open World Compositional Zero-Shot Learning

Author: Mancini, Massimiliano, Naeem, Muhammad Ferjad, Xian, Yongqin, and Akata, Zeynep
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available at test time. In this work, we overcome this assumption operating on the open world setting, where no limit is imposed on the compositional space at test time, and the search space contains a large number of unseen compositions. To address this problem, we propose a new approach, Compositional Cosine Graph Embeddings (Co-CGE), based on two principles. First, Co-CGE models the dependency between states, objects and their compositions through a graph convolutional neural network. The graph propagates information from seen to unseen concepts, improving their representations. Second, since not all unseen compositions are equally feasible, and less feasible ones may damage the learned representations, Co-CGE estimates a feasibility score for each unseen composition, using the scores as margins in a cosine similarity-based loss and as weights in the adjacency matrix of the graphs. Experiments show that our approach achieves state-of-the-art performances in standard CZSL while outperforming previous methods in the open world scenario.
Comment: Accepted by T-PAMI in March, 2022. arXiv admin note: text overlap with arXiv:2101.12609
Published: 2021

18. Learning Graph Embeddings for Compositional Zero-shot Learning

Author: Naeem, Muhammad Ferjad, Xian, Yongqin, Tombari, Federico, and Akata, Zeynep
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of visual primitives observed in the training set: states (e.g. old, cute) and objects (e.g. car, dog). This is challenging because the same state can, for example, alter the visual appearance of a dog drastically differently from a car. As a solution, we propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features, compositional classifiers, and latent representations of visual primitives in an end-to-end manner. The key to our approach is exploiting the dependency between states, objects, and their compositions within a graph structure to enforce the relevant knowledge transfer from seen to unseen compositions. By learning a joint compatibility that encodes semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet. We show that in the challenging generalized compositional zero-shot setting our CGE significantly outperforms the state of the art on MIT-States and UT-Zappos. We also propose a new benchmark for this task based on the recent GQA dataset. Code is available at: https://github.com/ExplainableML/czsl
Comment: Accepted in IEEE CVPR 2021
Published: 2021

19. Open World Compositional Zero-Shot Learning

Author: Mancini, Massimiliano, Naeem, Muhammad Ferjad, Xian, Yongqin, and Akata, Zeynep
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Compositional Zero-Shot learning (CZSL) requires recognizing state-object compositions unseen during training. In this work, instead of assuming prior knowledge about the unseen compositions, we operate in the open world setting, where the search space includes a large number of unseen compositions, some of which might be unfeasible. In this setting, we start from the cosine similarity between visual features and compositional embeddings. After estimating the feasibility score of each composition, we use these scores to either directly mask the output space or as a margin for the cosine similarity between visual features and compositional embeddings during training. Our experiments on two standard CZSL benchmarks show that all the methods suffer severe performance degradation when applied in the open world setting. While our simple CZSL model achieves state-of-the-art performances in the closed world scenario, our feasibility scores boost the performance of our approach in the open world setting, clearly outperforming the previous state of the art.
Comment: Accepted in IEEE CVPR 2021
Published: 2021

20. Reliable Fidelity and Diversity Metrics for Generative Models

Author: Naeem, Muhammad Ferjad, Oh, Seong Joon, Uh, Youngjung, Choi, Yunjey, and Yoo, Jaejun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Devising indicative evaluation metrics for the image generation task remains an open problem. The most widely used metric for measuring the similarity between real and generated images has been the Fréchet Inception Distance (FID) score. Because it does not differentiate the fidelity and diversity aspects of the generated images, recent papers have introduced variants of precision and recall metrics to diagnose those properties separately. In this paper, we show that even the latest version of the precision and recall metrics is not reliable yet. For example, they fail to detect the match between two identical distributions, they are not robust against outliers, and the evaluation hyperparameters are selected arbitrarily. We propose density and coverage metrics that solve the above issues. We analytically and experimentally show that density and coverage provide more interpretable and reliable signals for practitioners than the existing metrics. Code: https://github.com/clovaai/generative-evaluation-prdc.
Comment: First two authors have contributed equally; ICML 2020 accepted
Published: 2020
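
The density and coverage metrics named in the abstract above can be illustrated with a small pure-Python sketch. It follows the commonly cited definitions (k-nearest-neighbour balls around real samples) and is an approximation written for this listing, not the released code at the repository linked in the abstract. The function names and the toy one-dimensional features are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_radii(real, k):
    """Distance from each real sample to its k-th nearest other real sample."""
    radii = []
    for i, sample in enumerate(real):
        dists = sorted(euclidean(sample, other) for j, other in enumerate(real) if j != i)
        radii.append(dists[k - 1])
    return radii


def density_and_coverage(real, fake, k):
    """Return (density, coverage) for lists of real and generated features.

    density  : average number of real-sample balls containing each
               generated sample, divided by k
    coverage : fraction of real samples whose ball contains at least one
               generated sample
    """
    radii = knn_radii(real, k)
    inside = [[euclidean(f, r) < radius for f in fake] for r, radius in zip(real, radii)]
    density = sum(sum(row) for row in inside) / (k * len(fake))
    coverage = sum(any(row) for row in inside) / len(real)
    return density, coverage


real_features = [[0.0], [1.0], [2.0]]
matched = density_and_coverage(real_features, real_features, k=1)
far_away = density_and_coverage(real_features, [[100.0]], k=1)
```

In this toy case, generated samples identical to the real ones give a density and coverage of 1.0, while a single distant outlier gives 0.0 for both.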

21. Deep Learning Under the Microscope: Improving the Interpretability of Medical Imaging Neural Networks

Author: Paschali, Magdalini, Naeem, Muhammad Ferjad, Simson, Walter, Steiger, Katja, Mollenhauer, Martin, and Navab, Nassir
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a novel interpretation method tailored to histological Whole Slide Image (WSI) processing. A Deep Neural Network (DNN), inspired by Bag-of-Features models, is equipped with a Multiple Instance Learning (MIL) branch and trained with weak supervision for WSI classification. MIL avoids label ambiguity and enhances our model's expressive power without guiding its attention. We utilize a fine-grained logit heatmap of the model's activations to interpret its decision-making process. The proposed method is quantitatively and qualitatively evaluated on two challenging histology datasets, outperforming a variety of baselines. In addition, two expert pathologists were consulted regarding the interpretability provided by our method and acknowledged its potential for integration into several clinical applications.
Published: 2019

22. Data Augmentation with Manifold Exploring Geometric Transformations for Increased Performance and Robustness

Author: Paschali, Magdalini, Simson, Walter, Roy, Abhijit Guha, Naeem, Muhammad Ferjad, Göbl, Rüdiger, Wachinger, Christian, and Navab, Nassir
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing, Statistics - Machine Learning
Abstract: In this paper we propose a novel augmentation technique that not only improves the performance of deep neural networks on clean test data, but also significantly increases their robustness to random transformations, both affine and projective. Inspired by ManiFool, the augmentation is performed by a line-search manifold-exploration method that learns affine geometric transformations that lead to the misclassification of an image, while ensuring that it remains on the same manifold as the training data. This augmentation method populates any training dataset with images that lie on the border of the manifolds between two classes and maximizes the variance the network is exposed to during training. Our method was thoroughly evaluated on the challenging tasks of fine-grained skin lesion classification from limited data, and breast tumor classification of mammograms. Compared with traditional augmentation methods, and with images synthesized by Generative Adversarial Networks, our method not only achieves state-of-the-art performance but also significantly improves the network's robustness.
Comment: Under Review for the 26th International Conference on Information Processing in Medical Imaging (IPMI) 2019
Published: 2019

23. 3D Compositional Zero-Shot Learning with DeCompositional Consensus

Author: Naeem, Muhammad Ferjad, Örnek, Evin Pınar, Xian, Yongqin, Van Gool, Luc, and Tombari, Federico
Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Avidan, Shai (editor), Brostow, Gabriel (editor), Cissé, Moustapha (editor), Farinella, Giovanni Maria (editor), and Hassner, Tal (editor)
Published: 2022
Full Text: View/download PDF

24. A convolutional recursive deep architecture for unconstrained Urdu handwriting recognition

Author: ul Sehr Zia, Noor, Naeem, Muhammad Ferjad, Raza, Syed Muhammad Kumail, Khan, Muhammad Mubasher, Ul-Hasan, Adnan, and Shafait, Faisal
Published: 2022
Full Text: View/download PDF

25. FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks

Author: Khan, Muhammad Saif Ullah, Naeem, Muhammad Ferjad, Tombari, Federico, Van Gool, Luc, Stricker, Didier, and Afzal, Muhammad Zeshan
Abstract: We propose FocusCLIP, integrating subject-level guidance--a specialized mechanism for target-specific supervision--into the CLIP framework for improved zero-shot transfer on human-centric tasks. Our novel contributions enhance CLIP on both the vision and text sides. On the vision side, we incorporate ROI heatmaps emulating human visual attention mechanisms to emphasize subject-relevant image regions. On the text side, we introduce human pose descriptions to provide rich contextual information. For human-centric tasks, FocusCLIP is trained with images from the MPII Human Pose dataset. The proposed approach surpassed CLIP by an average of 8.61% across five previously unseen datasets covering three human-centric tasks. FocusCLIP achieved an average accuracy of 33.65% compared to 25.04% by CLIP. We observed a 3.98% improvement in activity recognition, a 14.78% improvement in age classification, and a 7.06% improvement in emotion recognition. Moreover, using our proposed single-shot LLM prompting strategy, we release a high-quality MPII Pose Descriptions dataset to encourage further research in multimodal learning for human-centric tasks. Furthermore, we also demonstrate the effectiveness of our subject-level supervision on non-human-centric tasks. FocusCLIP shows a 2.47% improvement over CLIP in zero-shot bird classification using the CUB dataset. Our findings emphasize the potential of integrating subject-level guidance with general pretraining methods for enhanced downstream performance.
Published: 2024

26. I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification

Author: Naeem, Muhammad Ferjad, primary, Ali Khan, Muhammad Gul Zain, additional, Xian, Yongqin, additional, Afzal, Muhammad Zeshan, additional, Stricker, Didier, additional, Van Gool, Luc, additional, and Tombari, Federico, additional
Published: 2023
Full Text: View/download PDF

27. SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Author: Hoyer, Lukas, Tan, David Joseph, Naeem, Muhammad Ferjad, Van Gool, Luc, and Tombari, Federico
Abstract: In semi-supervised semantic segmentation, a model is trained with a limited number of labeled images along with a large corpus of unlabeled images to reduce the high annotation effort. While previous methods are able to learn good segmentation boundaries, they are prone to confuse classes with similar visual appearance due to the limited supervision. On the other hand, vision-language models (VLMs) are able to learn diverse semantic knowledge from image-caption datasets but produce noisy segmentation due to the image-level training. In SemiVL, we propose to integrate rich priors from VLM pre-training into semi-supervised semantic segmentation to learn better semantic decision boundaries. To adapt the VLM from global to local reasoning, we introduce a spatial fine-tuning strategy for label-efficient learning. Further, we design a language-guided decoder to jointly reason over vision and language. Finally, we propose to handle inherent ambiguities in class labels by providing the model with language guidance in the form of class definitions. We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods. For instance, SemiVL improves the state-of-the-art by +13.5 mIoU on COCO with 232 annotated images and by +6.1 mIoU on Pascal VOC with 92 labels. Project page: https://github.com/google-research/semivl
Published: 2023

28. Learning Graph Embeddings for Open World Compositional Zero-Shot Learning

Author: Mancini, Massimiliano, Naeem, Muhammad Ferjad, Xian, Yongqin, and Akata, Zeynep
Abstract: Compositional Zero-Shot learning (CZSL) aims to recognize unseen compositions of state and object visual primitives seen during training. A problem with standard CZSL is the assumption of knowing which unseen compositions will be available at test time. In this work, we overcome this assumption operating on the open world setting, where no limit is imposed on the compositional space at test time, and the search space contains a large number of unseen compositions. To address this problem, we propose a new approach, Compositional Cosine Graph Embeddings (Co-CGE), based on two principles. First, Co-CGE models the dependency between states, objects and their compositions through a graph convolutional neural network. The graph propagates information from seen to unseen concepts, improving their representations. Second, since not all unseen compositions are equally feasible, and less feasible ones may damage the learned representations, Co-CGE estimates a feasibility score for each unseen composition, using the scores as margins in a cosine similarity-based loss and as weights in the adjacency matrix of the graphs. Experiments show that our approach achieves state-of-the-art performances in standard CZSL while outperforming previous methods in the open world scenario.
Published: 2024
Full Text: View/download PDF

29. Learning Attention Propagation for Compositional Zero-Shot Learning

Author: Ali Khan, Muhammad Gul Zain, primary, Naeem, Muhammad Ferjad, additional, Van Gool, Luc, additional, Pagani, A., additional, Stricker, Didier, additional, and Afzal, Muhammad Zeshan, additional
Published: 2023
Full Text: View/download PDF

30. Learning Graph Embeddings for Open World Compositional Zero-Shot Learning

Author: Mancini, Massimiliano, primary, Naeem, Muhammad Ferjad, additional, Xian, Yongqin, additional, and Akata, Zeynep, additional
Published: 2022
Full Text: View/download PDF

31. A convolutional recursive deep architecture for unconstrained Urdu handwriting recognition

Author: ul Sehr Zia, Noor, primary, Naeem, Muhammad Ferjad, additional, Raza, Syed Muhammad Kumail, additional, Khan, Muhammad Mubasher, additional, Ul-Hasan, Adnan, additional, and Shafait, Faisal, additional
Published: 2021
Full Text: View/download PDF

32. Learning Graph Embeddings for Compositional Zero-shot Learning

Author: Naeem, Muhammad Ferjad, primary, Xian, Yongqin, additional, Tombari, Federico, additional, and Akata, Zeynep, additional
Published: 2021
Full Text: View/download PDF

33. Open World Compositional Zero-Shot Learning

Author: Mancini, Massimiliano, primary, Naeem, Muhammad Ferjad, additional, Xian, Yongqin, additional, and Akata, Zeynep, additional
Published: 2021
Full Text: View/download PDF

34. Impact of Ligature Coverage on Training Practical Urdu OCR Systems

Author: Naeem, Muhammad Ferjad, primary, Zia, Noor ul Sehr, additional, Awan, Aqsa Ahmed, additional, Shafait, Faisal, additional, and Hasan, Adnan ul, additional
Published: 2017
Full Text: View/download PDF
