Author: "Karanam, Srikrishna" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Karanam, Srikrishna"' showing total 157 results

1. TIDE: Training Locally Interpretable Domain Generalization Models Enables Test-time Correction

Author: Agarwal, Aishwarya, Karanam, Srikrishna, and Gandhi, Vineet
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of single-source domain generalization. Existing methods typically rely on extensive augmentations to synthetically cover diverse domains during training. However, they struggle with semantic shifts (e.g., background and viewpoint changes), as they often learn global features instead of local concepts that tend to be domain invariant. To address this gap, we propose an approach that compels models to leverage such local concepts during prediction. Given that no suitable dataset with per-class concepts and localization maps exists, we first develop a novel pipeline to generate annotations by exploiting the rich features of diffusion and large-language models. Our next innovation is TIDE, a novel training scheme with a concept saliency alignment loss that ensures the model focuses on the right per-concept regions and a local concept contrastive loss that promotes learning domain-invariant concept representations. This not only gives a robust model but also can be visually interpreted using the predicted concept saliency maps. Given these maps at test time, our final contribution is a new correction algorithm that uses the corresponding local concept representations to iteratively refine the prediction until it aligns with prototypical concept representations that we store at the end of model training. We evaluate our approach extensively on four standard DG benchmark datasets and substantially outperform the current state-of-the-art (12% improvement on average) while also demonstrating that our predictions can be visually interpreted. Comment: 14 pages, 11 figures
Published: 2024

2. CoCoNO: Attention Contrast-and-Complete for Initial Noise Optimization in Text-to-Image Synthesis

Author: Sundaram, Aravindan, Pal, Ujjayan, Chauhan, Abhimanyu, Agarwal, Aishwarya, and Karanam, Srikrishna
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite recent advancements in text-to-image models, achieving semantically accurate images in text-to-image diffusion models is a persistent challenge. While existing initial latent optimization methods have demonstrated impressive performance, we identify two key limitations: (a) attention neglect, where the synthesized image omits certain subjects from the input prompt because they do not have a designated segment in the self-attention map despite having a high-response cross-attention, and (b) attention interference, where the generated image has mixed-up properties of multiple subjects because of a conflicting overlap between cross- and self-attention maps of different subjects. To address these limitations, we introduce CoCoNO, a new algorithm that optimizes the initial latent by leveraging the complementary information within self-attention and cross-attention maps. Our method introduces two new loss functions: the attention contrast loss, which minimizes undesirable overlap by ensuring each self-attention segment is exclusively linked to a specific subject's cross-attention map, and the attention complete loss, which maximizes the activation within these segments to guarantee that each subject is fully and distinctly represented. Our approach operates within a noise optimization framework, avoiding the need to retrain base models. Through extensive experiments on multiple benchmarks, we demonstrate that CoCoNO significantly improves text-image alignment and outperforms the current state of the art. Comment: 15 pages, 12 figures
Published: 2024

3. Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

Author: Shukla, Tripti, Karanam, Srikrishna, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of conditional text-to-image synthesis with diffusion models. Most recent works need to either finetune specific parts of the base diffusion model or introduce new trainable parameters, leading to deployment inflexibility due to the need for training. To address this gap in the current literature, we propose our method called TINTIN: Test-time Conditional Text-to-Image Synthesis using Diffusion Models, which is a new training-free, test-time-only algorithm to condition text-to-image diffusion model outputs on conditioning factors such as color palettes and edge maps. In particular, we propose to interpret noise predictions during denoising as gradients of an energy-based model, leading to a flexible approach to manipulate the noise by matching predictions inferred from them to the ground truth conditioning input. This results in, to the best of our knowledge, the first approach to control model outputs with input color palettes, which we realize using a novel color distribution matching loss. We also show this test-time noise manipulation is easily extensible to other types of conditioning, e.g., edge maps. We conduct extensive experiments using a variety of text prompts, color palettes, and edge maps and demonstrate significant improvement over the current state-of-the-art, both qualitatively and quantitatively.
Published: 2024

4. Training-free Color-Style Disentanglement for Constrained Text-to-Image Synthesis

Author: Agarwal, Aishwarya, Karanam, Srikrishna, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of independently, in a disentangled fashion, controlling the outputs of text-to-image diffusion models with color and style attributes of a user-supplied reference image. We present the first training-free, test-time-only method to disentangle and condition text-to-image models on color and style attributes from a reference image. To realize this, we propose two key innovations. Our first contribution is to transform the latent codes at inference time using feature transformations that make the covariance matrix of the current generation follow that of the reference image, helping meaningfully transfer color. Next, we observe that there exists a natural disentanglement between color and style in the LAB image space, which we exploit to transform the self-attention feature maps of the image being generated with respect to those of the reference computed from its L channel. Both these operations happen purely at test time and can be done independently or merged. This results in a flexible method where color and style information can come from the same reference image or two different sources, and a new generation can seamlessly fuse them in either scenario. Comment: 16 pages, 17 figures
Published: 2024

5. SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Author: Nag, Sayan, Goswami, Koustava, and Karanam, Srikrishna
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: Referring Expression Segmentation (RES) aims to provide a segmentation mask of the target object in an image referred to by the text (i.e., referring expression). Existing methods require large-scale mask annotations. Moreover, such approaches do not generalize well to unseen/zero-shot scenarios. To address the aforementioned issues, we propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations. To the best of our knowledge, ours is the first approach that considers only a fraction of both mask and box annotations (shown in Figure 1 and Table 1) for training. To enable principled training of models in such low-annotation settings, improve image-text region-level alignment, and further enhance spatial localization of the target object in the image, we propose a Cross-modal Fusion with Attention Consistency module. For automatic pseudo-labeling of unlabeled samples, we introduce a novel Mask Validity Filtering routine based on a spatially aware zero-shot proposal scoring approach. Extensive experiments show that with just 30% annotations, our model SafaRi achieves 59.31 and 48.26 mIoUs as compared to 58.93 and 48.19 mIoUs obtained by the fully-supervised SOTA method SeqTR on the RefCOCO+testA and RefCOCO+testB datasets, respectively. SafaRi also outperforms SeqTR by 11.7% (on RefCOCO+testA) and 19.6% (on RefCOCO+testB) in a fully-supervised setting and demonstrates strong generalization capabilities in unseen/zero-shot tasks. Comment: Accepted at ECCV 2024
Published: 2024

6. AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

Author: Agarwal, Aishwarya, Karanam, Srikrishna, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, the existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process. The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that key and value outputs from this process differ substantially from those of their corresponding baseline (non-customized) models (e.g., baseline stable diffusion). While this difference does not impact the concept being customized, it leads to other parts of the generated image not being aligned with the prompt. Further, we also observe that these keys and values allow independent control of various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values serve as the basis for our next contribution, where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, AlignIT, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt are unchanged. Our proposed method can be plugged in directly to existing customization methods, leading to a substantial performance improvement in the alignment of the final result with the input prompt while retaining the customization quality. Comment: 10 pages, 9 figures
Published: 2024

7. Crafting Parts for Expressive Object Composition

Author: Rangwani, Harsh, Agarwal, Aishwarya, Kulkarni, Kuldeep, Babu, R. Venkatesh, and Karanam, Srikrishna
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Text-to-image generation from large generative models like Stable Diffusion, DALLE-2, etc., has become a common base for various tasks due to their superior quality and extensive knowledge bases. As image composition and generation are creative processes, artists need control over various parts of the images being generated. We find that just adding details about parts in the base text prompt either leads to an entirely different image (e.g., missing/incorrect identity) or the extra part details simply being ignored. To mitigate these issues, we introduce PartCraft, which enables image generation based on fine-grained part-level details specified for objects in the base text prompt. This allows more control for artists and enables novel object compositions by combining distinctive object parts. PartCraft first localizes object parts by denoising the object region from a specific diffusion process. This enables each part token to be localized to the right object region. After obtaining part masks, we run a localized diffusion process in each of the part regions based on fine-grained part descriptions and combine them to produce the final image. All the stages of PartCraft are based on repurposing a pre-trained diffusion model, which enables it to generalize across various domains without training. We demonstrate the effectiveness of part-level control provided by PartCraft qualitatively through visual examples and quantitatively in comparison to the contemporary baselines. Comment: Project Page Will Be Here: https://rangwani-harsh.github.io/PartCraft
Published: 2024

8. Few Shot Class Incremental Learning using Vision-Language models

Author: Kumar, Anurag, Bharti, Chinmay, Dutta, Saikat, Karanam, Srikrishna, and Banerjee, Biplab
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Recent advancements in deep learning have demonstrated remarkable performance comparable to human capabilities across various supervised computer vision tasks. However, the prevalent assumption of having an extensive pool of training data encompassing all classes prior to model training often diverges from real-world scenarios, where limited data availability for novel classes is the norm. The challenge emerges in seamlessly integrating new classes with few samples into the training data, demanding the model to adeptly accommodate these additions without compromising its performance on base classes. To address this exigency, the research community has introduced several solutions under the realm of few-shot class incremental learning (FSCIL). In this study, we introduce an innovative FSCIL framework that utilizes a language regularizer and a subspace regularizer. During base training, the language regularizer helps incorporate semantic information extracted from a Vision-Language model. The subspace regularizer helps the model acquire the nuanced connections between image and text semantics inherent to base classes during incremental training. Our proposed framework not only empowers the model to embrace novel classes with limited data, but also ensures the preservation of performance on base classes. To substantiate the efficacy of our approach, we conduct comprehensive experiments on three distinct FSCIL benchmarks, where our framework attains state-of-the-art performance.
Published: 2024

9. SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation

Author: Nag, Sayan, Goswami, Koustava, and Karanam, Srikrishna
Editors: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül (volume editors); Goos, Gerhard (Series Editor); Hartmanis, Juris (Founding Editor); Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti (Editorial Board Members)
Published: 2025

10. Approximate Caching for Efficiently Serving Diffusion Models

Author: Agarwal, Shubham, Mitra, Subrata, Chakraborty, Sarthak, Karanam, Srikrishna, Mukherjee, Koyel, and Saini, Shiv
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high quality images adhering to text prompts. However, production-grade diffusion model serving is a resource intensive task that not only requires high-end GPUs, which are expensive, but also incurs considerable latency. In this paper, we introduce a technique called approximate-caching that can reduce such iterative denoising steps for an image generation based on a prompt by reusing intermediate noise states created during a prior image generation for similar prompts. Based on this idea, we present an end-to-end text-to-image system, Nirvana, that uses approximate-caching with a novel cache management policy, Least Computationally Beneficial and Frequently Used (LCBFU), to provide % GPU compute savings, 19.8% end-to-end latency reduction and 19% dollar savings, on average, on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity and reuse of intermediate states in a large production environment. Comment: Accepted at NSDI'24
Published: 2023

11. An Image is Worth Multiple Words: Multi-attribute Inversion for Constrained Text-to-Image Synthesis

Author: Agarwal, Aishwarya, Karanam, Srikrishna, Shukla, Tripti, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of constraining diffusion model outputs with a user-supplied reference image. Our key objective is to extract multiple attributes (e.g., color, object, layout, style) from this single reference image, and then generate new samples with them. One line of existing work proposes to invert the reference images into a single textual conditioning vector, enabling generation of new samples with this learned token. These methods, however, do not learn multiple tokens that are necessary to condition model outputs on the multiple attributes noted above. Another line of techniques expands the inversion space to learn multiple embeddings, but they do this only along the layer dimension (e.g., one per layer of the DDPM model) or the timestep dimension (one for a set of timesteps in the denoising process), leading to suboptimal attribute disentanglement. To address the aforementioned gaps, the first contribution of this paper is an extensive analysis to determine which attributes are captured in which dimension of the denoising process. As noted above, we consider both the time-step dimension (in reverse denoising) as well as the DDPM model layer dimension. We observe that often a subset of these attributes is captured in the same set of model layers and/or across the same denoising timesteps. For instance, color and style are captured across the same U-Net layers, whereas layout and color are captured across the same timestep stages. Consequently, an inversion process that is designed only for the time-step dimension or the layer dimension is insufficient to disentangle all attributes. This leads to our second contribution where we design a new multi-attribute inversion algorithm, MATTE, with associated disentanglement-enhancing regularization losses, that operates across both dimensions and explicitly leads to four disentangled tokens (color, style, layout, and object).
Published: 2023

12. Iterative Multi-granular Image Editing using Diffusion Models

Author: Joseph, K J, Udhayanan, Prateksha, Shukla, Tripti, Agarwal, Aishwarya, Karanam, Srikrishna, Goswami, Koustava, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent advances in text-guided image synthesis have dramatically changed how creative professionals generate artistic and aesthetically pleasing visual assets. To fully support such creative endeavors, the process should possess the ability to: 1) iteratively edit the generations and 2) control the spatial reach of desired changes (global, local or anything in between). We formalize this pragmatic problem setting as Iterative Multi-granular Editing. While there has been substantial progress with diffusion-based models for image synthesis and editing, they are all one shot (i.e., no iterative editing capabilities) and do not naturally yield multi-granular control (i.e., covering the full spectrum of local-to-global edits). To overcome these drawbacks, we propose EMILIE: Iterative Multi-granular Image Editor. EMILIE introduces a novel latent iteration strategy, which re-purposes a pre-trained diffusion model to facilitate iterative editing. This is complemented by a gradient control operation for multi-granular control. We introduce a new benchmark dataset to evaluate our newly proposed setting. We conduct exhaustive quantitative and qualitative evaluation against recent state-of-the-art approaches adapted to our task, to bring out the mettle of EMILIE. We hope our work would attract attention to this newly identified, pragmatic problem setting. Comment: Accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2024
Published: 2023

13. Learning with Multi-modal Gradient Attention for Explainable Composed Image Retrieval

Author: Udhayanan, Prateksha, Karanam, Srikrishna, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of composed image retrieval that takes an input query consisting of an image and a modification text indicating the desired changes to be made on the image and retrieves images that match these changes. Current state-of-the-art techniques that address this problem use global features for the retrieval, resulting in incorrect localization of the regions of interest to be modified because of the global nature of the features, more so in cases of real-world, in-the-wild images. Since modifier texts usually correspond to specific local changes in an image, it is critical that models learn local features to be able to both localize and retrieve better. To this end, our key novelty is a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step. We achieve this by first proposing a new visual image attention computation technique, which we call multi-modal gradient attention (MMGrad) that is explicitly conditioned on the modifier text. We next demonstrate how MMGrad can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text. By training retrieval models with this new loss function, we show improved grounding by means of better visual attention maps, leading to better explainability of the models as well as competitive quantitative retrieval performance on standard benchmark datasets.
Published: 2023

14. CoPL: Contextual Prompt Learning for Vision-Language Understanding

Author: Goswami, Koustava, Karanam, Srikrishna, Udhayanan, Prateksha, Joseph, K J, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalization ability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained based on global image features, which limits them in two aspects: First, by using global features, these prompts could be focusing less on the discriminative foreground image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally whereas intuitively, prompts should be reweighted according to the semantics of the image. We address these as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features as well as aware of local contextual relationships. Our extensive set of experiments on a variety of standard and few-shot datasets show that our method produces substantially improved performance when compared to the current state of the art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features. Comment: Accepted at AAAI 2024
Published: 2023

15. Learning with Difference Attention for Visually Grounded Self-supervised Representations

Author: Agarwal, Aishwarya, Karanam, Srikrishna, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent works in self-supervised learning have shown impressive results on single-object images, but they struggle to perform well on complex multi-object images as evidenced by their poor visual grounding. To demonstrate this concretely, we propose visual difference attention (VDA) to compute visual attention maps in an unsupervised fashion by comparing an image with its salient-regions-masked-out version. We use VDA to derive attention maps for state-of-the-art SSL methods and show they do not highlight all salient regions in an image accurately, suggesting their inability to learn strong representations for downstream tasks like segmentation. Motivated by these limitations, we cast VDA as a differentiable operation and propose a new learning objective, Differentiable Difference Attention (DiDA) loss, which leads to substantial improvements in an SSL model's visual grounding to an image's salient regions. Comment: 15 pages, 14 figures
Published: 2023

16. A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis

Author: Agarwal, Aishwarya, Karanam, Srikrishna, Joseph, K J, Saxena, Apoorv, Goswami, Koustava, and Srinivasan, Balaji Vasan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While recent developments in text-to-image generative models have led to a suite of high-performing methods capable of producing creative imagery from free-form text, there are several limitations. By analyzing the cross-attention representations of these models, we notice two key issues. First, for text prompts that contain multiple concepts, there is a significant amount of pixel-space overlap (i.e., same spatial regions) among pairs of different concepts. This eventually leads to the model being unable to distinguish between the two concepts and one of them being ignored in the final generation. Next, while these models attempt to capture all such concepts during the beginning of denoising (e.g., first few steps) as evidenced by cross-attention maps, this knowledge is not retained by the end of denoising (e.g., last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, our key innovations include two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among various concepts and the eventual capture of all concepts in the generated output. Next, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby leading to reduced information loss and the preservation of all concepts in the generated output. Comment: 15 pages, 16 figures
Published: 2023

17. Audio Retrieval for Multimodal Design Documents: A New Dataset and Algorithms

Author: Singh, Prachi, Karanam, Srikrishna, and Shekhar, Sumit
Subjects: Computer Science - Multimedia, Computer Science - Information Retrieval, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We consider and propose a new problem of retrieving audio files relevant to multimodal design document inputs comprising both textual elements and visual imagery, e.g., birthday/greeting cards. In addition to enhancing user experience, integrating audio that matches the theme/style of these inputs also helps improve the accessibility of these documents (e.g., visually impaired people can listen to the audio instead). While recent work in audio retrieval exists, these methods and datasets are targeted explicitly towards natural images. However, our problem considers multimodal design documents (created by users using creative software) substantially different from a naturally clicked photograph. To this end, our first contribution is collecting and curating a new large-scale dataset called Melodic-Design (or MELON), comprising design documents representing various styles, themes, templates, illustrations, etc., paired with music audio. Given our paired image-text-audio dataset, our next contribution is a novel multimodal cross-attention audio retrieval (MMCAR) algorithm that enables training neural networks to learn a common shared feature space across image, text, and audio dimensions. We use these learned features to demonstrate that our method outperforms existing state-of-the-art methods and produce a new reference benchmark for the research community on our new dataset. Comment: 5 pages including references
Published: 2023

18. Preserving Privacy in Federated Learning with Ensemble Cross-Domain Knowledge Distillation

Author: Gong, Xuan, Sharma, Abhishek, Karanam, Srikrishna, Wu, Ziyan, Chen, Terrence, Doermann, David, and Innanje, Arun
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Federated Learning (FL) is a machine learning paradigm where local nodes collaboratively train a central model while the training data remains decentralized. Existing FL methods typically share model parameters or employ co-distillation to address the issue of unbalanced data distribution. However, they suffer from communication bottlenecks. More importantly, they risk privacy leakage. In this work, we develop a privacy preserving and communication efficient method in an FL framework with one-shot offline knowledge distillation using unlabeled, cross-domain public data. We propose a quantized and noisy ensemble of local predictions from completely trained local models for stronger privacy guarantees without sacrificing accuracy. Based on extensive experiments on image classification and text classification tasks, we show that our privacy-preserving method outperforms baseline FL algorithms with superior performance in both accuracy and communication efficiency. Comment: Accepted by AAAI2022
Published: 2022

19. Self-supervised Human Mesh Recovery with Cross-Representation Alignment

Author: Gong, Xuan, Zheng, Meng, Planche, Benjamin, Karanam, Srikrishna, Chen, Terrence, Doermann, David, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Fully supervised human mesh recovery methods are data-hungry and have poor generalizability due to the limited availability and diversity of 3D-annotated benchmark datasets. Recent progress in self-supervised human mesh recovery has been made using synthetic-data-driven training paradigms where the model is trained from synthetic paired 2D representation (e.g., 2D keypoints and segmentation masks) and 3D mesh. However, synthetic dense correspondence maps (i.e., IUV) have rarely been explored, since the domain gap between synthetic training data and real testing data is hard to address for 2D dense representation. To alleviate this domain gap on IUV, we propose cross-representation alignment utilizing the complementary information from the robust but sparse representation (2D keypoints). Specifically, the alignment errors between the initial mesh estimation and both 2D representations are forwarded into the regressor and dynamically corrected in the following mesh regression. This adaptive cross-representation alignment explicitly learns from the deviations and captures complementary information: robustness from sparse representation and richness from dense representation. We conduct extensive experiments on multiple standard benchmark datasets and demonstrate competitive results, helping take a step towards reducing the annotation effort needed to produce state-of-the-art models in human mesh estimation. Comment: Accepted ECCV2022
Published: 2022

20. PseudoClick: Interactive Image Segmentation with Click Imitation

Author: Liu, Qin, Zheng, Meng, Planche, Benjamin, Karanam, Srikrishna, Chen, Terrence, Niethammer, Marc, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The goal of click-based interactive image segmentation is to obtain precise object segmentation masks with limited user interaction, i.e., by a minimal number of user clicks. Existing methods require users to provide all the clicks: by first inspecting the segmentation mask and then providing points on mislabeled regions, iteratively. We ask the question: can our model directly predict where to click, so as to further reduce the user interaction cost? To this end, we propose PseudoClick, a generic framework that enables existing segmentation networks to propose candidate next clicks. These automatically generated clicks, termed pseudo clicks in this work, serve as an imitation of human clicks to refine the segmentation mask., Comment: 18 pages, 6 figures, 7 tables. ECCV 2022
Published: 2022

21. Learning Hierarchical Attention for Weakly-supervised Chest X-Ray Abnormality Localization and Diagnosis

Author: Ouyang, Xi, Karanam, Srikrishna, Wu, Ziyan, Chen, Terrence, Huo, Jiayu, Zhou, Xiang Sean, Wang, Qian, and Cheng, Jie-Zhi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We consider the problem of abnormality localization for clinical applications. While deep learning has driven much recent progress in medical imaging, many clinical challenges are not fully addressed, limiting its broader usage. While recent methods report high diagnostic accuracies, physicians have concerns trusting these algorithm results for diagnostic decision-making purposes because of a general lack of algorithm decision reasoning and interpretability. One potential way to address this problem is to further train these models to localize abnormalities in addition to just classifying them. However, doing this accurately will require a large amount of disease localization annotations by clinical experts, a task that is prohibitively expensive to accomplish for most applications. In this work, we take a step towards addressing these issues by means of a new attention-driven weakly supervised algorithm comprising a hierarchical attention mining framework that unifies activation- and gradient-based visual attention in a holistic manner. Our key algorithmic innovations include the design of explicit ordinal attention constraints, enabling principled model training in a weakly-supervised fashion, while also facilitating the generation of visual-attention-driven model explanations by means of localization cues. On two large-scale chest X-ray datasets (NIH ChestX-ray14 and CheXpert), we demonstrate significant localization performance improvements over the current state of the art while also achieving competitive classification performance. Our code is available on https://github.com/oyxhust/HAM.
Published: 2021

22. Learning Local Recurrent Models for Human Mesh Recovery

Author: Li, Runze, Karanam, Srikrishna, Li, Ren, Chen, Terrence, Bhanu, Bir, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We consider the problem of estimating frame-level full human body meshes given a video of a person with natural motion dynamics. While much progress in this field has been in single image-based mesh estimation, there has been a recent uptick in efforts to infer mesh dynamics from video given its role in alleviating issues such as depth ambiguity and occlusions. However, a key limitation of existing work is the assumption that all the observed motion dynamics can be modeled using one dynamical/recurrent model. While this may work well in cases with relatively simplistic dynamics, inference with in-the-wild videos presents many challenges. In particular, it is typically the case that different body parts of a person undergo different dynamics in the video, e.g., legs may move in a way that may be dynamically different from hands (e.g., a person dancing). To address these issues, we present a new method for video mesh recovery that divides the human mesh into several local parts following the standard skeletal model. We then model the dynamics of each local part with separate recurrent models, with each model conditioned appropriately based on the known kinematic structure of the human body. This results in a structure-informed local recurrent learning architecture that can be trained in an end-to-end fashion with available annotations. We conduct a variety of experiments on standard video mesh recovery benchmark datasets such as Human3.6M, MPI-INF-3DHP, and 3DPW, demonstrating the efficacy of our design of modeling local dynamics as well as establishing state-of-the-art results based on standard evaluation metrics., Comment: 10 pages, 6 figures, 2 tables
Published: 2021

23. Spatio-Temporal Representation Factorization for Video-based Person Re-Identification

Author: Aich, Abhishek, Zheng, Meng, Karanam, Srikrishna, Chen, Terrence, Roy-Chowdhury, Amit K., and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite much recent progress in video-based person re-identification (re-ID), the current state-of-the-art still suffers from common real-world challenges such as appearance similarity among various people, occlusions, and frame misalignment. To alleviate these problems, we propose Spatio-Temporal Representation Factorization (STRF), a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. The key innovations of STRF over prior work include explicit pathways for learning discriminative temporal and spatial features, with each component further factorized to capture complementary person-specific appearance and motion information. Specifically, temporal factorization comprises two branches, one each for static features (e.g., the color of clothes) that do not change much over time, and dynamic features (e.g., walking patterns) that change over time. Further, spatial factorization also comprises two branches to learn both global (coarse segments) as well as local (finer segments) appearance features, with the local features particularly useful in cases of occlusion or spatial misalignment. These two factorization operations taken together result in a modular architecture for our parameter-wise light STRF unit that can be plugged in between any two 3D convolutional layers, resulting in an end-to-end learning framework. We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results using standard person re-ID evaluation protocols on three benchmarks., Comment: Accepted at IEEE ICCV 2021, Includes Supplementary Material
Published: 2021

24. Everybody Is Unique: Towards Unbiased Human Mesh Recovery

Author: Li, Ren, Zheng, Meng, Karanam, Srikrishna, Chen, Terrence, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning, Computer Science - Robotics, Statistics - Machine Learning
Abstract: We consider the problem of obese human mesh recovery, i.e., fitting a parametric human mesh to images of obese people. Despite obese person mesh fitting being an important problem with numerous applications (e.g., healthcare), much recent progress in mesh recovery has been restricted to images of non-obese people. In this work, we identify this crucial gap in the current literature by presenting and discussing limitations of existing algorithms. Next, we present a simple baseline to address this problem that is scalable and can be easily used in conjunction with existing algorithms to improve their performance. Finally, we present a generalized human mesh optimization algorithm that substantially improves the performance of existing methods on both obese person images as well as community-standard benchmark datasets. A key innovation of this technique is that it does not rely on supervision from expensive-to-create mesh parameters. Instead, starting from widely and cheaply available 2D keypoints annotations, our method automatically generates mesh parameters that can in turn be used to re-train and fine-tune any existing mesh estimation algorithm. This way, we show our method acts as a drop-in to improve the performance of a wide variety of contemporary mesh estimation methods. We conduct extensive experiments on multiple datasets comprising both standard and obese person images and demonstrate the efficacy of our proposed techniques., Comment: 10 pages, 5 figures, 4 tables
Published: 2021

25. A Peek Into the Reasoning of Neural Networks: Interpreting with Structural Visual Concepts

Author: Ge, Yunhao, Xiao, Yao, Xu, Zhi, Zheng, Meng, Karanam, Srikrishna, Chen, Terrence, Itti, Laurent, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Despite substantial progress in applying neural networks (NN) to a wide variety of areas, they still largely suffer from a lack of transparency and interpretability. While recent developments in explainable artificial intelligence attempt to bridge this gap (e.g., by visualizing the correlation between input pixels and final outputs), these approaches are limited to explaining low-level relationships, and crucially, do not provide insights on error correction. In this work, we propose a framework (VRX) to interpret classification NNs with intuitive structural visual concepts. Given a trained classification model, the proposed VRX extracts relevant class-specific visual concepts and organizes them using structural concept graphs (SCG) based on pairwise concept relationships. By means of knowledge distillation, we show VRX can take a step towards mimicking the reasoning process of NNs and provide logical, concept-level explanations for final model decisions. With extensive experiments, we empirically show VRX can meaningfully answer "why" and "why not" questions about the prediction, providing easy-to-understand insights about the reasoning process. We also show that these insights can potentially provide guidance on improving NN's performance., Comment: CVPR 2021
Published: 2021

26. Towards Visually Explaining Similarity Models

Author: Zheng, Meng, Karanam, Srikrishna, Chen, Terrence, Radke, Richard J., and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We consider the problem of visually explaining similarity models, i.e., explaining why a model predicts two images to be similar in addition to producing a scalar score. While much recent work in visual model interpretability has focused on gradient-based attention, these methods rely on a classification module to generate visual explanations. Consequently, they cannot readily explain other kinds of models that do not use or need classification-like loss functions (e.g., similarity models trained with a metric learning loss). In this work, we bridge this crucial gap, presenting a method to generate gradient-based visual attention for image similarity predictors. By relying solely on the learned feature embedding, we show that our approach can be applied to any kind of CNN-based similarity architecture, an important step towards generic visual explainability. We show that our resulting attention maps serve more than just interpretability; they can be infused into the model learning process itself with new trainable constraints. We show that the resulting similarity models perform, and can be visually explained, better than the corresponding baseline models trained without these constraints. We demonstrate our approach using extensive experiments on three different kinds of tasks: generic image retrieval, person re-identification, and low-shot semantic segmentation., Comment: 13 pages, 10 figures, 4 tables. arXiv admin note: substantial text overlap with arXiv:1911.07381
Published: 2020

27. Hierarchical Kinematic Human Mesh Recovery

Author: Georgakis, Georgios, Li, Ren, Karanam, Srikrishna, Chen, Terrence, Kosecka, Jana, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We consider the problem of estimating a parametric model of 3D human mesh from a single image. While there has been substantial recent progress in this area with direct regression of model parameters, these methods only implicitly exploit the human body kinematic structure, leading to sub-optimal use of the model prior. In this work, we address this gap by proposing a new technique for regression of human parametric model that is explicitly informed by the known hierarchical structure, including joint interdependencies of the model. This results in a strong prior-informed design of the regressor architecture and an associated hierarchical optimization that is flexible to be used in conjunction with the current standard frameworks for 3D human mesh recovery. We demonstrate these aspects by means of extensive experiments on standard benchmark datasets, showing how our proposed new design outperforms several existing and popular methods, establishing new state-of-the-art results. By considering joint interdependencies, our method is equipped to infer joints even under data corruptions, which we demonstrate by conducting experiments under varying degrees of occlusion., Comment: 17 pages, 8 figures, 5 tables, ECCV 2020
Published: 2020

28. Towards Robust RGB-D Human Mesh Recovery

Author: Li, Ren, Cai, Changjiang, Georgakis, Georgios, Karanam, Srikrishna, Chen, Terrence, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We consider the problem of human pose estimation. While much recent work has focused on the RGB domain, these techniques are inherently under-constrained since there can be many 3D configurations that explain the same 2D projection. To this end, we propose a new method that uses RGB-D data to estimate a parametric human mesh model. Our key innovations include (a) the design of a new dynamic data fusion module that facilitates learning with a combination of RGB-only and RGB-D datasets, (b) a new constraint generator module that provides SMPL supervisory signals when explicit SMPL annotations are not available, and (c) the design of a new depth ranking learning objective, all of which enable principled model training with RGB-D data. We conduct extensive experiments on a variety of RGB-D datasets to demonstrate efficacy., Comment: 10 pages, 4 figures, 4 tables
Published: 2019

29. Visual Similarity Attention

Author: Zheng, Meng, Karanam, Srikrishna, Chen, Terrence, Radke, Richard J., and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: While there has been substantial progress in learning suitable distance metrics, these techniques in general lack transparency and decision reasoning, i.e., explaining why the input set of images is similar or dissimilar. In this work, we solve this key problem by proposing the first method to generate generic visual similarity explanations with gradient-based attention. We demonstrate that our technique is agnostic to the specific similarity model type, e.g., we show applicability to Siamese, triplet, and quadruplet models. Furthermore, we make our proposed similarity attention a principled part of the learning process, resulting in a new paradigm for learning similarity functions. We demonstrate that our learning mechanism results in more generalizable, as well as explainable, similarity models. Finally, we demonstrate the generality of our framework by means of experiments on a variety of tasks, including image retrieval, person re-identification, and low-shot semantic segmentation., Comment: 10 pages, 7 figures, 4 tables
Published: 2019

30. Towards Visually Explaining Variational Autoencoders

Author: Liu, Wenqian, Li, Runze, Zheng, Meng, Karanam, Srikrishna, Wu, Ziyan, Bhanu, Bir, Radke, Richard J., and Camps, Octavia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g. variational autoencoders (VAE) is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset., Comment: 10 pages, 9 figures, 2 tables, CVPR 2020
Published: 2019

31. Towards Visually Explaining Variational Autoencoders

Author: Liu, Wenqian, Li, Runze, Zheng, Meng, Karanam, Srikrishna, Wu, Ziyan, Bhanu, Bir, Radke, Richard J, and Camps, Octavia
Subjects: Eye Disease and Disorders of Vision, cs.CV, cs.LG
Abstract: Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g. variational autoencoders (VAE) is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset.
Published: 2020

32. Incremental Scene Synthesis

Author: Planche, Benjamin, Rong, Xuejian, Wu, Ziyan, Karanam, Srikrishna, Kosch, Harald, Tian, YingLi, Ernst, Jan, and Hutter, Andreas
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We present a method to incrementally generate complete 2D or 3D scenes with the following properties: (a) it is globally consistent at each step according to a learned scene prior, (b) real observations of a scene can be incorporated while observing global consistency, (c) unobserved regions can be hallucinated locally in consistence with previous observations, hallucinations and global priors, and (d) hallucinations are statistical in nature, i.e., different scenes can be generated from the same observations. To achieve this, we model the virtual scene, where an active agent at each step can either perceive an observed part of the scene or generate a local hallucination. The latter can be interpreted as the agent's expectation at this step through the scene and can be applied to autonomous navigation. In the limit of observing real data at each point, our method converges to solving the SLAM problem. It can otherwise sample entirely imagined scenes from prior distributions. Besides autonomous agents, applications include problems where large data is required for building robust real-world applications, but few samples are available. We demonstrate efficacy on various 2D as well as 3D data.
Published: 2018

33. Re-Identification with Consistent Attentive Siamese Networks

Author: Zheng, Meng, Karanam, Srikrishna, Wu, Ziyan, and Radke, Richard J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets and report competitive performance., Comment: 10 pages, 8 figures, 3 tables, to appear in CVPR 2019
Published: 2018

34. Sharpen Focus: Learning with Attention Separability and Consistency

Author: Wang, Lezi, Wu, Ziyan, Karanam, Srikrishna, Peng, Kuan-Chuan, Singh, Rajat Vikram, Liu, Bo, and Metaxas, Dimitris N.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Recent developments in gradient-based attention modeling have seen attention maps emerge as a powerful tool for interpreting convolutional neural networks. Despite good localization for an individual class of interest, these techniques produce attention maps with substantially overlapping responses among different classes, leading to the problem of visual confusion and the need for discriminative attention. In this paper, we address this problem by means of a new framework that makes class-discriminative attention a principled part of the learning process. Our key innovations include new learning objectives for attention separability and cross-layer consistency, which result in improved attention discriminability and reduced visual confusion. Extensive experiments on image classification benchmarks show the effectiveness of our approach in terms of improved classification accuracy, including CIFAR-100 (+3.33%), Caltech-256 (+1.64%), ILSVRC2012 (+0.92%), CUB-200-2011 (+4.8%) and PASCAL VOC2012 (+5.73%)., Comment: This paper is accepted to ICCV 2019. The supplementary material (appendix) can be found after the main paper
Published: 2018

35. Learning Local RGB-to-CAD Correspondences for Object Pose Estimation

Author: Georgakis, Georgios, Karanam, Srikrishna, Wu, Ziyan, and Kosecka, Jana
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: We consider the problem of 3D object pose estimation. While much recent work has focused on the RGB domain, the reliance on accurately annotated images limits their generalizability and scalability. On the other hand, the easily available CAD models of objects are rich sources of data, providing a large number of synthetically rendered images. In this paper, we solve this key problem of existing methods requiring expensive 3D pose annotations by proposing a new method that matches RGB images to CAD models for object pose estimation. Our key innovations compared to existing work include removing the need for either real-world textures for CAD models or explicit 3D pose annotations for RGB images. We achieve this through a series of objectives that learn how to select keypoints and enforce viewpoint and modality invariance across RGB images and CAD model renderings. We conduct extensive experiments to demonstrate that the proposed method can reliably estimate object pose in RGB images, as well as generalize to object instances not seen during training., Comment: 10 pages, 6 figures, 4 tables, ICCV 2019
Published: 2018

36. Measuring the Temporal Behavior of Real-World Person Re-Identification

Author: Zheng, Meng, Karanam, Srikrishna, and Radke, Richard J.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Designing real-world person re-identification (re-id) systems requires attention to operational aspects not typically considered in academic research. Typically, the probe image or image sequence is matched to a gallery set with a fixed candidate list. On the other hand, in real-world applications of re-id, we would search for a person of interest in a gallery set that is continuously populated by new candidates over time. A key question of interest for the operator of such a system is: how long is a correct match to a probe likely to remain in a rank-k shortlist of candidates? In this paper, we propose to distill this information into what we call a Rank Persistence Curve (RPC), which unlike a conventional cumulative match characteristic (CMC) curve helps directly compare the temporal performance of different re-id algorithms. To carefully illustrate the concept, we collected a new multi-shot person re-id dataset called RPIfield. The RPIfield dataset is constructed using a network of 12 cameras with 112 explicitly time-stamped actor paths among about 4000 distractors. We then evaluate the temporal performance of different re-id algorithms using the proposed RPCs using single and pairwise camera videos from RPIfield, and discuss considerations for future research., Comment: 14 pages, 14 figures
Published: 2018

37. End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching

Author: Georgakis, Georgios, Karanam, Srikrishna, Wu, Ziyan, Ernst, Jan, and Kosecka, Jana
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Finding correspondences between images or 3D scans is at the heart of many computer vision and image retrieval applications and is often enabled by matching local keypoint descriptors. Various learning approaches have been applied in the past to different stages of the matching pipeline, considering detector, descriptor, or metric learning objectives. These objectives were typically addressed separately and most previous work has focused on image data. This paper proposes an end-to-end learning framework for keypoint detection and its representation (descriptor) for 3D depth maps or 3D scans, where the two can be jointly optimized towards task-specific objectives without a need for separate annotations. We employ a Siamese architecture augmented by a sampling layer and a novel score loss function which in turn affects the selection of region proposals. The positive and negative examples are obtained automatically by sampling corresponding region proposals based on their consistency with known 3D pose labels. Matching experiments with depth data on multiple benchmark datasets demonstrate the efficacy of the proposed approach, showing significant improvements over state-of-the-art methods., Comment: 9 pages, 9 figures, 3 tables, CVPR 2018
Published: 2018

38. Learning Compositional Visual Concepts with Mutual Consistency

Author: Gong, Yunye, Karanam, Srikrishna, Wu, Ziyan, Peng, Kuan-Chuan, Ernst, Jan, and Doerschuk, Peter C.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning
Abstract: Compositionality of semantic concepts in image synthesis and analysis is appealing as it can help in decomposing known and generatively recomposing unknown data. For instance, we may learn concepts of changing illumination, geometry or albedo of a scene, and try to recombine them to generate physically meaningful, but unseen data for training and testing. In practice however we often do not have samples from the joint concept space available: We may have data on illumination change in one data set and on geometric change in another one without complete overlap. We pose the following question: How can we learn two or more concepts jointly from different data sets with mutual consistency where we do not have samples from the full joint space? We present a novel answer in this paper based on cyclic consistency over multiple concepts, represented individually by generative adversarial networks (GANs). Our method, ConceptGAN, can be understood as a drop in for data augmentation to improve resilience for real world applications. Qualitative and quantitative evaluations demonstrate its efficacy in generating semantically meaningful images, as well as one shot face verification as an example application., Comment: 10 pages, 8 figures, 4 tables, CVPR 2018
Published: 2017

39. Rank Persistence: Assessing the Temporal Performance of Real-World Person Re-Identification

Author: Karanam, Srikrishna, Lam, Eric, and Radke, Richard J.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Designing useful person re-identification systems for real-world applications requires attention to operational aspects not typically considered in academic research. Here, we focus on the temporal aspect of re-identification; that is, instead of finding a match to a probe person of interest in a fixed candidate gallery, we consider the more realistic scenario in which the gallery is continuously populated by new candidates over a long time period. A key question of interest for an operator of such a system is: how long is a correct match to a probe likely to remain in a rank-k shortlist of possible candidates? We propose to distill this information into a Rank Persistence Curve (RPC), which allows different algorithms' temporal performance characteristics to be directly compared. We present examples to illustrate the RPC using a new long-term dataset with multiple candidate reappearances, and discuss considerations for future re-identification research that explicitly involves temporal aspects., Comment: 8 pages, 7 figures
Published: 2017

40. Towards Visually Interpreting Variational Autoencoders

Author: Li, Runze, Liu, Wenqian, Zheng, Meng, Torop, Max, Rajadhyaksha, Milind, Dy, Jennifer, Kose, Kivanc, Karanam, Srikrishna, Wu, Ziyan, Bhanu, Bir, Radke, Richard, and Camps, Octavia
Published: 2024

41. Robust Multi-modal 3D Patient Body Modeling

Author: Yang, Fan, Li, Ren, Georgakis, Georgios, Karanam, Srikrishna, Chen, Terrence, Ling, Haibin, Wu, Ziyan, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Martel, Anne L., editor, Abolmaesumi, Purang, editor, Stoyanov, Danail, editor, Mateus, Diana, editor, Zuluaga, Maria A., editor, Zhou, S. Kevin, editor, Racoceanu, Daniel, editor, and Joskowicz, Leo, editor
Published: 2020
Full Text: View/download PDF

42. A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets

Author: Karanam, Srikrishna, Gou, Mengran, Wu, Ziyan, Rates-Borras, Angels, Camps, Octavia, and Radke, Richard J.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Person re-identification (re-id) is a critical problem in video analytics applications such as security and surveillance. The public release of several datasets and code for vision algorithms has facilitated rapid progress in this area over the last few years. However, directly comparing re-id algorithms reported in the literature has become difficult since a wide variety of features, experimental protocols, and evaluation metrics are employed. In order to address this need, we present an extensive review and performance evaluation of single- and multi-shot re-id algorithms. The experimental protocol incorporates the most recent advances in both feature extraction and metric learning. To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques. All approaches were evaluated using a new large-scale dataset that closely mimics a real-world problem setting, in addition to 16 other publicly available datasets: VIPeR, GRID, CAVIAR, DukeMTMC4ReID, 3DPeS, PRID, V47, WARD, SAIVT-SoftBio, CUHK01, CUHK02, CUHK03, RAiD, iLIDSVID, HDA+ and Market1501. The evaluation codebase and results will be made publicly available for community use.
Comment: Preliminary work on person Re-Id benchmark. S. Karanam and M. Gou contributed equally. 14 pages, 6 figures, 4 tables. For supplementary material, see http://robustsystems.coe.neu.edu/sites/robustsystems.coe.neu.edu/files/systems/supmat/ReID_benchmark_supp.zip
Published: 2016

43. Self-supervised Human Mesh Recovery with Cross-Representation Alignment

Author: Gong, Xuan, primary, Zheng, Meng, additional, Planche, Benjamin, additional, Karanam, Srikrishna, additional, Chen, Terrence, additional, Doermann, David, additional, and Wu, Ziyan, additional
Published: 2022
Full Text: View/download PDF

44. PseudoClick: Interactive Image Segmentation with Click Imitation

Author: Liu, Qin, primary, Zheng, Meng, additional, Planche, Benjamin, additional, Karanam, Srikrishna, additional, Chen, Terrence, additional, Niethammer, Marc, additional, and Wu, Ziyan, additional
Published: 2022
Full Text: View/download PDF

45. Domain Adaptive 3D Shape Retrieval from Monocular Images

Author: Pal, Harsh, primary, Khandelwal, Ritwik, additional, Pande, Shivam, additional, Banerjee, Biplab, additional, and Karanam, Srikrishna, additional
Published: 2024
Full Text: View/download PDF

46. Iterative Multi-granular Image Editing using Diffusion Models

Author: Joseph, K J, primary, Udhayanan, Prateksha, additional, Shukla, Tripti, additional, Agarwal, Aishwarya, additional, Karanam, Srikrishna, additional, Goswami, Koustava, additional, and Srinivasan, Balaji Vasan, additional
Published: 2024
Full Text: View/download PDF

47. Hierarchical Kinematic Human Mesh Recovery

Author: Georgakis, Georgios, primary, Li, Ren, additional, Karanam, Srikrishna, additional, Chen, Terrence, additional, Košecká, Jana, additional, and Wu, Ziyan, additional
Published: 2020
Full Text: View/download PDF

48. Robust Multi-modal 3D Patient Body Modeling

Author: Yang, Fan, primary, Li, Ren, additional, Georgakis, Georgios, additional, Karanam, Srikrishna, additional, Chen, Terrence, additional, Ling, Haibin, additional, and Wu, Ziyan, additional
Published: 2020
Full Text: View/download PDF

49. Person re-identification with block sparse recovery

Author: Karanam, Srikrishna, Li, Yang, and Radke, Richard J.
Published: 2017
Full Text: View/download PDF

50. Contextual Prompt Learning for Vision-Language Understanding

Author: Goswami, Koustava, Karanam, Srikrishna, J, Joseph K, Udhayanan, Prateksha, and Srinivasan, Balaji Vasan
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in multimodal learning have resulted in powerful vision-language models, whose representations are generalizable across a variety of downstream tasks. Recently, their generalizability has been further extended by incorporating trainable prompts, borrowed from the natural language processing literature. While such prompt learning techniques have shown impressive results, we identify that these prompts are trained based on global image features, which limits them in two respects. First, by using global features, these prompts could be focusing less on the discriminative foreground image, resulting in poor generalization to various out-of-distribution test cases. Second, existing work weights all prompts equally, whereas our intuition is that these prompts are more specific to the type of the image. We address these issues as part of our proposed Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image. Our key innovations over earlier works include using local image features as part of the prompt learning process and, more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand. This gives us dynamic prompts that are both aligned to local image features and aware of local contextual relationships. Our extensive set of experiments on a variety of standard and few-shot datasets shows that our method produces substantially improved performance when compared to the current state-of-the-art methods. We also demonstrate both few-shot and out-of-distribution performance to establish the utility of learning dynamic prompts that are aligned to local image features.
Published: 2023
