Author: "Kafle, Kushal" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Kafle, Kushal"' showing total 41 results

Start Over Author "Kafle, Kushal"

41 results on '"Kafle, Kushal"'

1. Revisiting Multi-Modal LLM Evaluation

Author: Lu, Jian, Srivastava, Shikhar, Chen, Junyu, Shrestha, Robik, Acharya, Manoj, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: With the advent of multi-modal large language models (MLLMs), datasets used for visual question answering (VQA) and referring expression comprehension have seen a resurgence. However, the most popular datasets used to evaluate MLLMs are some of the earliest ones created, and they have many known problems, including extreme bias, spurious correlations, and an inability to permit fine-grained analysis. In this paper, we pioneer evaluating recent MLLMs (LLaVA 1.5, LLaVA-NeXT, BLIP2, InstructBLIP, GPT-4V, and GPT-4o) on datasets designed to address weaknesses in earlier ones. We assess three VQA datasets: 1) TDIUC, which permits fine-grained analysis on 12 question types; 2) TallyQA, which has simple and complex counting questions; and 3) DVQA, which requires optical character recognition for chart understanding. We also study VQDv1, a dataset that requires identifying all image regions that satisfy a given query. Our experiments reveal the weaknesses of many MLLMs that have not previously been reported. Our code is integrated into the widely used LAVIS framework for MLLM evaluation, enabling the rapid assessment of future MLLMs. Project webpage: https://kevinlujian.github.io/MLLM_Evaluations/
Published: 2024

2. They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias

Author: Magid, Salma Abdel, Wang, Jui-Hsien, Kafle, Kushal, and Pfister, Hanspeter
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
Published: 2024

3. FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Author: Slyman, Eric, Lee, Stefan, Cohen, Scott, and Kafle, Kushal
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, I.4.10, I.2.7, E.0
Abstract: Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks., Comment: Conference paper at CVPR 2024. 6 pages, 8 figures. Project Page: https://ericslyman.com/fairdedup/
Published: 2024

4. FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

Author: Hua, Hang, Shi, Jing, Kafle, Kushal, Jenni, Simon, Zhang, Daoan, Collomosse, John, Cohen, Scott, and Luo, Jiebo
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite the impressive ability to perform complex reasoning for VLMs, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction., Comment: ECCV 2024
Published: 2024

5. FineMatch: Aspect-Based Fine-Grained Image and Text Mismatch Detection and Correction

Author: Hua, Hang, Shi, Jing, Kafle, Kushal, Jenni, Simon, Zhang, Daoan, Collomosse, John, Cohen, Scott, Luo, Jiebo, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

6. SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

Author: Yang, Ziyan, Kafle, Kushal, Lin, Zhe, Cohen, Scott, Ding, Zhihong, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose Subject-Conditional Relation Detection SCoRD, where conditioned on an input subject, the goal is to predict all its relations to other objects in a scene along with their locations. Based on the Open Images dataset, we propose a challenging OIv6-SCoRD benchmark such that the training and testing splits have a distribution shift in terms of the occurrence statistics of $\langle$subject, relation, object$\rangle$ triplets. To solve this problem, we propose an auto-regressive model that given a subject, it predicts its relations, objects, and object locations by casting this output as a sequence of tokens. First, we show that previous scene-graph prediction methods fail to produce as exhaustive an enumeration of relation-object pairs when conditioned on a subject on this benchmark. Particularly, we obtain a recall@3 of 83.8% for our relation-object predictions compared to the 49.75% obtained by a recent scene graph detector. Then, we show improved generalization on both relation-object and object-box predictions by leveraging during training relation-object pairs obtained automatically from textual captions and for which no object-box annotations are available. Particularly, for $\langle$subject, relation, object$\rangle$ triplets for which no object locations are available during training, we are able to obtain a recall@3 of 33.80% for relation-object pairs and 26.75% for their box locations., Comment: WACV 2024
Published: 2023

7. Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations

Author: Yang, Ziyan, Kafle, Kushal, Dernoncourt, Franck, and Ordonez, Vicente
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34% accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model, and can use any type of region annotations., Comment: CVPR 2023. Fix ReferIt results. Code: https://github.com/uvavision/AMC-grounding Project Webpage: https://vislang.ai/amc
Published: 2022

8. OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses

Author: Shrestha, Robik, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Machine Learning
Abstract: Dataset bias and spurious correlations can significantly impair generalization in deep neural networks. Many prior efforts have addressed this problem using either alternative loss functions or sampling strategies that focus on rare patterns. We propose a new direction: modifying the network architecture to impose inductive biases that make the network robust to dataset bias. Specifically, we propose OccamNets, which are biased to favor simpler solutions by design. OccamNets have two inductive biases. First, they are biased to use as little network depth as needed for an individual example. Second, they are biased toward using fewer image locations for prediction. While OccamNets are biased toward simpler hypotheses, they can learn more complex hypotheses if necessary. In experiments, OccamNets outperform or rival state-of-the-art methods run on architectures that do not incorporate these inductive biases. Furthermore, we demonstrate that when the state-of-the-art debiasing methods are combined with OccamNets results further improve., Comment: ECCV 2022
Published: 2022

9. Learning to Predict Visual Attributes in the Wild

Author: Pham, Khoi, Kafle, Kushal, Lin, Zhe, Ding, Zhihong, Cohen, Scott, Tran, Quan, and Shrivastava, Abhinav
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual attributes constitute a large portion of information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to study of attribute prediction in specific domains. In this paper, we introduce a large-scale in-the-wild visual attribute prediction dataset consisting of over 927K attribute annotations for over 260K object instances. Formally, object attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. Our dataset poses significant challenges to existing methods due to large number of attributes, label sparsity, data imbalance, and object occlusion. To this end, we propose several techniques that systematically tackle these challenges, including a base model that utilizes both low- and high-level CNN features with multi-hop attention, reweighting and resampling techniques, a novel negative label expansion scheme, and a novel supervised attribute-aware contrastive learning algorithm. Using these techniques, we achieve near 3.7 mAP and 5.7 overall F1 points improvement over the current state of the art. Further details about the VAW dataset can be found at http://vawdataset.com/., Comment: Accepted to CVPR 2021
Published: 2021

10. Are Bias Mitigation Techniques for Deep Learning Effective?

Author: Shrestha, Robik, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods. All data, code, and results are publicly available at: https://github.com/erobic/bias-mitigators., Comment: Published in WACV 2022 under the title "An Investigation of Critical Issues in Bias Mitigation Techniques"
Published: 2021

11. On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law

Author: Teney, Damien, Kafle, Kushal, Shrestha, Robik, Abbasnejad, Ehsan, Kanan, Christopher, and Hengel, Anton van den
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Out-of-distribution (OOD) testing is increasingly popular for evaluating a machine learning system's ability to generalize beyond the biases of a training set. OOD benchmarks are designed to present a different joint distribution of data and labels between training and test time. VQA-CP has become the standard OOD benchmark for visual question answering, but we discovered three troubling practices in its current use. First, most published methods rely on explicit knowledge of the construction of the OOD splits. They often rely on ``inverting'' the distribution of labels, e.g. answering mostly 'yes' when the common training answer is 'no'. Second, the OOD test set is used for model selection. Third, a model's in-domain performance is assessed after retraining it on in-domain splits (VQA v2) that exhibit a more balanced distribution of labels. These three practices defeat the objective of evaluating generalization, and put into question the value of methods specifically designed for this dataset. We show that embarrassingly-simple methods, including one that generates answers at random, surpass the state of the art on some question types. We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation.
Published: 2020

12. Do We Need Fully Connected Output Layers in Convolutional Networks?

Author: Qian, Zhongchao, Hayes, Tyler L., Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing, Statistics - Machine Learning
Abstract: Traditionally, deep convolutional neural networks consist of a series of convolutional and pooling layers followed by one or more fully connected (FC) layers to perform the final classification. While this design has been successful, for datasets with a large number of categories, the fully connected layers often account for a large percentage of the network's parameters. For applications with memory constraints, such as mobile devices and embedded platforms, this is not ideal. Recently, a family of architectures that involve replacing the learned fully connected output layer with a fixed layer has been proposed as a way to achieve better efficiency. In this paper we examine this idea further and demonstrate that fixed classifiers offer no additional benefit compared to simply removing the output layer along with its parameters. We further demonstrate that the typical approach of having a fully connected final output layer is inefficient in terms of parameter count. We are able to achieve comparable performance to a traditionally learned fully connected classification output layer on the ImageNet-1K, CIFAR-100, Stanford Cars-196, and Oxford Flowers-102 datasets, while not having a fully connected output layer at all.
Published: 2020

13. Visual Grounding Methods for VQA are Working for the Wrong Reasons!

Author: Shrestha, Robik, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2., Comment: Published in ACL 2020 under the title "A negative case analysis of visual grounding methods for VQA"
Published: 2020

14. REMIND Your Neural Network to Prevent Catastrophic Forgetting

Author: Hayes, Tyler L., Kafle, Kushal, Shrestha, Robik, Acharya, Manoj, and Kanan, Christopher
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Neural and Evolutionary Computing
Abstract: People learn throughout life. However, incrementally updating conventional neural networks leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the brain consolidates memory. Replay involves fine-tuning a network on a mixture of new and old instances. While there is neuroscientific evidence that the brain replays compressed memories, existing methods for convolutional networks replay raw images. Here, we propose REMIND, a brain-inspired approach that enables efficient replay with compressed representations. REMIND is trained in an online manner, meaning it learns one example at a time, which is closer to how humans learn. Under the same constraints, REMIND outperforms other methods for incremental class learning on the ImageNet ILSVRC-2012 dataset. We probe REMIND's robustness to data ordering schemes known to induce catastrophic forgetting. We demonstrate REMIND's generality by pioneering online learning for Visual Question Answering (VQA)., Comment: To appear in the European Conference on Computer Vision (ECCV-2020)
Published: 2019

15. Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Author: Kafle, Kushal, Shrestha, Robik, Price, Brian, Cohen, Scott, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Chart question answering (CQA) is a newly proposed visual question answering (VQA) task where an algorithm must answer questions about data visualizations, e.g. bar charts, pie charts, and line graphs. CQA requires capabilities that natural-image VQA algorithms lack: fine-grained measurements, optical character recognition, and handling out-of-vocabulary words in both questions and answers. Without modifications, state-of-the-art VQA algorithms perform poorly on this task. Here, we propose a novel CQA algorithm called parallel recurrent fusion of image and language (PReFIL). PReFIL first learns bimodal embeddings by fusing question and image features and then intelligently aggregates these learned embeddings to answer the given question. Despite its simplicity, PReFIL greatly surpasses state-of-the art systems and human baselines on both the FigureQA and DVQA datasets. Additionally, we demonstrate that PReFIL can be used to reconstruct tables by asking a series of questions about a chart., Comment: Presented at WACV, 2020
Published: 2019

16. Challenges and Prospects in Vision and Language Research

Author: Kafle, Kushal, Shrestha, Robik, and Kanan, Christopher
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: Language grounded image understanding tasks have often been proposed as a method for evaluating progress in artificial intelligence. Ideally, these tasks should test a plethora of capabilities that integrate computer vision, reasoning, and natural language understanding. However, rather than behaving as visual Turing tests, recent studies have demonstrated state-of-the-art systems are achieving good performance through flaws in datasets and evaluation procedures. We review the current state of affairs and outline a path forward.
Published: 2019

17. Answer Them All! Toward Universal Visual Question Answering Models

Author: Shrestha, Robik, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual Question Answering (VQA) research is split into two camps: the first focuses on VQA datasets that require natural image understanding and the second focuses on synthetic datasets that test reasoning. A good VQA algorithm should be capable of both, but only a few VQA algorithms are tested in this manner. We compare five state-of-the-art VQA algorithms across eight VQA datasets covering both domains. To make the comparison fair, all of the models are standardized as much as possible, e.g., they use the same visual features, answer vocabularies, etc. We find that methods do not generalize across the two domains. To address this problem, we propose a new VQA algorithm that rivals or exceeds the state-of-the-art for both domains., Comment: 8 pages
Published: 2019

18. Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers

Author: Pham, Khoi, Kafle, Kushal, Lin, Zhe, Ding, Zhihong, Cohen, Scott, Tran, Quan, Shrivastava, Abhinav, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Avidan, Shai, editor, Brostow, Gabriel, editor, Cissé, Moustapha, editor, Farinella, Giovanni Maria, editor, and Hassner, Tal, editor
Published: 2022
Full Text: View/download PDF

19. TallyQA: Answering Complex Counting Questions

Author: Acharya, Manoj, Kafle, Kushal, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be efficiently used with high-resolution imagery. It yields state-of-the-art results compared to baseline and recent systems on both TallyQA and the HowMany-QA benchmark., Comment: To appear in AAAI 2019 ( To download the dataset please go to http://www.manojacharya.com/ )
Published: 2018

20. DVQA: Understanding Data Visualizations via Question Answering

Author: Kafle, Kushal, Price, Brian, Cohen, Scott, and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Bar charts are an effective way to convey numeric information, but today's algorithms cannot parse them. Existing methods fail when faced with even minor variations in appearance. Here, we present DVQA, a dataset that tests many aspects of bar chart understanding in a question answering framework. Unlike visual question answering (VQA), DVQA requires processing words and answers that are unique to a particular bar chart. State-of-the-art VQA algorithms perform poorly on DVQA, and we propose two strong baselines that perform considerably better. Our work will enable algorithms to automatically extract numeric and semantic information from vast quantities of bar charts found in scientific publications, Internet articles, business reports, and many other areas., Comment: CVPR 2018 Camera Ready Version
Published: 2018

21. An Analysis of Visual Question Answering Algorithms

Author: Kafle, Kushal and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are evaluated on them. As a result, evaluation scores are inflated and predominantly determined by answering easier questions, making it difficult to compare different methods. In this paper, we analyze existing VQA algorithms using a new dataset. It contains over 1.6 million questions organized into 12 different categories. We also introduce questions that are meaningless for a given image to force a VQA system to reason about image content. We propose new evaluation schemes that compensate for over-represented question-types and make it easier to study the strengths and weaknesses of algorithms. We analyze the performance of both baseline and state-of-the-art VQA models, including multi-modal compact bilinear pooling (MCB), neural module networks, and recurrent answering units. Our experiments establish how attention helps certain categories more than others, determine which models work better than others, and explain how simple models (e.g. MLP) can surpass more complex models (MCB) by simply learning to answer large, easy question categories., Comment: To appear in ICCV 2017. Visit http://kushalkafle.com/projects/tdiuc to download the dataset
Published: 2017

22. REMIND Your Neural Network to Prevent Catastrophic Forgetting

Author: Hayes, Tyler L., Kafle, Kushal, Shrestha, Robik, Acharya, Manoj, Kanan, Christopher, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Vedaldi, Andrea, editor, Bischof, Horst, editor, Brox, Thomas, editor, and Frahm, Jan-Michael, editor
Published: 2020
Full Text: View/download PDF

23. Visual Question Answering: Datasets, Algorithms, and Future Challenges

Author: Kafle, Kushal and Kanan, Christopher
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Visual Question Answering (VQA) is a recent problem in computer vision and natural language processing that has garnered a large amount of interest from the deep learning, computer vision, and natural language processing communities. In VQA, an algorithm needs to answer text-based questions about images. Since the release of the first VQA dataset in 2014, additional datasets have been released and many algorithms have been proposed. In this review, we critically examine the current state of VQA in terms of problem formulation, existing datasets, evaluation metrics, and algorithms. In particular, we discuss the limitations of current datasets with regard to their ability to properly train and assess VQA algorithms. We then exhaustively review existing algorithms for VQA. Finally, we discuss possible future directions for VQA and image understanding research.
Published: 2016
Full Text: View/download PDF

24. OccamNets: Mitigating Dataset Bias by Favoring Simpler Hypotheses

Author: Shrestha, Robik, primary, Kafle, Kushal, additional, and Kanan, Christopher, additional
Published: 2022
Full Text: View/download PDF

25. SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

Author: Yang, Ziyan, primary, Kafle, Kushal, additional, Lin, Zhe, additional, Cohen, Scott, additional, Ding, Zhihong, additional, and Ordonez, Vicente, additional
Published: 2024
Full Text: View/download PDF

26. REMIND Your Neural Network to Prevent Catastrophic Forgetting

Author: Hayes, Tyler L., primary, Kafle, Kushal, additional, Shrestha, Robik, additional, Acharya, Manoj, additional, and Kanan, Christopher, additional
Published: 2020
Full Text: View/download PDF

27. Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations

Author: Yang, Ziyan, primary, Kafle, Kushal, additional, Dernoncourt, Franck, additional, and Ordonez, Vicente, additional
Published: 2023
Full Text: View/download PDF

28. Visual question answering: Datasets, algorithms, and future challenges

Author: Kafle, Kushal and Kanan, Christopher
Published: 2017
Full Text: View/download PDF

29. An Investigation of Critical Issues in Bias Mitigation Techniques

Author: Shrestha, Robik, Kafle, Kushal, and Kanan, Christopher
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: A critical problem in deep learning is that systems learn inappropriate biases, resulting in their inability to perform well on minority groups. This has led to the creation of multiple algorithms that endeavor to mitigate bias. However, it is not clear how effective these methods are. This is because study protocols differ among papers, systems are tested on datasets that fail to test many forms of bias, and systems have access to hidden knowledge or are tuned specifically to the test set. To address this, we introduce an improved evaluation protocol, sensible metrics, and a new dataset, which enables us to ask and answer critical questions about bias mitigation algorithms. We evaluate seven state-of-the-art algorithms using the same network architecture and hyperparameter selection policy across three benchmark datasets. We introduce a new dataset called Biased MNIST that enables assessment of robustness to multiple bias sources. We use Biased MNIST and a visual question answering (VQA) benchmark to assess robustness to hidden biases. Rather than only tuning to the test set distribution, we study robustness across different tuning distributions, which is critical because for many applications the test distribution may not be known during development. We find that algorithms exploit hidden biases, are unable to scale to multiple forms of bias, and are highly sensitive to the choice of tuning set. Based on our findings, we implore the community to adopt more rigorous assessment of future bias mitigation methods. All data, code, and results are publicly available at: https://github.com/erobic/bias-mitigators.
Published: 2022

30. Learning to Predict Visual Attributes in the Wild

Author: Pham, Khoi, primary, Kafle, Kushal, additional, Lin, Zhe, additional, Ding, Zhihong, additional, Cohen, Scott, additional, Tran, Quan, additional, and Shrivastava, Abhinav, additional
Published: 2021
Full Text: View/download PDF

31. Answering Questions about Data Visualizations using Efficient Bimodal Fusion

Author: Kafle, Kushal, primary, Shrestha, Robik, additional, Price, Brian, additional, Cohen, Scott, additional, and Kanan, Christopher, additional
Published: 2020
Full Text: View/download PDF

32. A negative case analysis of visual grounding methods for VQA

Author: Shrestha, Robik, primary, Kafle, Kushal, additional, and Kanan, Christopher, additional
Published: 2020
Full Text: View/download PDF

33. Challenges and Prospects in Vision and Language Research

Author: Kafle, Kushal, primary, Shrestha, Robik, additional, and Kanan, Christopher, additional
Published: 2019
Full Text: View/download PDF

34. TallyQA: Answering Complex Counting Questions

Author: Acharya, Manoj, primary, Kafle, Kushal, additional, and Kanan, Christopher, additional
Published: 2019
Full Text: View/download PDF

35. Answer Them All! Toward Universal Visual Question Answering Models

Author: Shrestha, Robik, primary, Kafle, Kushal, additional, and Kanan, Christopher, additional
Published: 2019
Full Text: View/download PDF

36. DVQA: Understanding Data Visualizations via Question Answering

Author: Kafle, Kushal, primary, Price, Brian, additional, Cohen, Scott, additional, and Kanan, Christopher, additional
Published: 2018
Full Text: View/download PDF

37. An Analysis of Visual Question Answering Algorithms

Author: Kafle, Kushal, primary and Kanan, Christopher, additional
Published: 2017
Full Text: View/download PDF

38. Data Augmentation for Visual Question Answering

Author: Kafle, Kushal, primary, Yousefhussien, Mohammed, additional, and Kanan, Christopher, additional
Published: 2017
Full Text: View/download PDF

39. A Bayesian Model of Visual Question Answering

Author: Kanan, Christopher, primary and Kafle, Kushal, additional
Published: 2016
Full Text: View/download PDF

40. Answer-Type Prediction for Visual Question Answering

Author: Kafle, Kushal, primary and Kanan, Christopher, additional
Published: 2016
Full Text: View/download PDF

41. Advancing Multi-Modal Deep Learning: Towards Language-Grounded Visual Understanding

Author: Kafle, Kushal
Subjects: Computer vision, Dataset bias, Deep learning, Natural language processing, Visual question answering
Abstract: Using deep learning, computer vision now rivals people at object recognition and detection, opening doors to tackle new challenges in image understanding. Among these challenges, understanding and reasoning about language grounded visual content is of fundamental importance to advancing artificial intelligence. Recently, multiple datasets and algorithms have been created as proxy tasks towards this goal, with visual question answering (VQA) being the most widely studied. In VQA, an algorithm needs to produce an answer to a natural language question about an image. However, our survey of datasets and algorithms for VQA uncovered several sources of dataset bias and sub-optimal evaluation metrics that allowed algorithms to perform well by merely exploiting superficial statistical patterns. In this dissertation, we describe new algorithms and datasets that address these issues. We developed two new datasets and evaluation metrics that enable a more accurate measurement of abilities of a VQA model, and also expand VQA to include new abilities, such as reading text, handling out-of-vocabulary words, and understanding data-visualization. We also created new algorithms for VQA that have helped advance the state-of-the-art for VQA, including an algorithm that surpasses humans on two different chart question answering datasets about bar-charts, line-graphs and pie charts. Finally, we provide a holistic overview of several yet-unsolved challenges in not only VQA but vision and language research at large. Despite enormous progress, we find that a robust understanding and integration of vision and language is still an elusive goal, and much of the progress may be misleading due to dataset bias, superficial correlations and flaws in standard evaluation metrics. We carefully study and categorize these issues for several vision and language tasks and outline several possible paths towards development of safe, robust and trustworthy AI for language-grounded visual understanding.
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

41 results on '"Kafle, Kushal"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources