Author: "A, Cucchiara" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"A, Cucchiara"' showing total 10,647 results

1. Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Author: Barsellotti, Luca, Bianchi, Lorenzo, Messina, Nicola, Carrara, Fabio, Cornia, Marcella, Baraldi, Lorenzo, Falchi, Fabrizio, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
Published: 2024

2. Maximally Separated Active Learning

Author: Kasarla, Tejaswi, Jha, Abhishek, Tervoort, Faye, Cucchiara, Rita, and Mettes, Pascal
Subjects: Computer Science - Machine Learning
Abstract: Active Learning aims to optimize performance while minimizing annotation costs by selecting the most informative samples from an unlabelled pool. Traditional uncertainty sampling often leads to sampling bias by choosing similar uncertain samples. We propose an active learning method that utilizes fixed equiangular hyperspherical points as class prototypes, ensuring consistent inter-class separation and robust feature representations. Our approach introduces Maximally Separated Active Learning (MSAL) for uncertainty sampling and a combined strategy (MSAL-D) for incorporating diversity. This method eliminates the need for costly clustering steps, while maintaining diversity through hyperspherical uniformity. We demonstrate strong performance over existing active learning techniques across five benchmark datasets, highlighting the method's effectiveness and integration ease. The code is available on GitHub.
Comment: ECCV 2024 Beyond Euclidean Workshop (proceedings)
Published: 2024

3. Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Author: Cocchi, Federico, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.
Published: 2024

4. Is Multiple Object Tracking a Matter of Specialization?

Author: Mancusi, Gianluca, Bernardi, Mattia, Panariello, Aniello, Porrello, Angelo, Cucchiara, Rita, and Calderara, Simone
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: End-to-end transformer-based trackers have achieved remarkable performance on most human-related datasets. However, training these trackers in heterogeneous scenarios poses significant challenges, including negative interference - where the model learns conflicting scene-specific parameters - and limited domain generalization, which often necessitates expensive fine-tuning to adapt the models to new domains. In response to these challenges, we introduce Parameter-efficient Scenario-specific Tracking Architecture (PASTA), a novel framework that combines Parameter-Efficient Fine-Tuning (PEFT) and Modular Deep Learning (MDL). Specifically, we define key scenario attributes (e.g., camera-viewpoint, lighting condition) and train specialized PEFT modules for each attribute. These expert modules are combined in parameter space, enabling systematic generalization to new domains without increasing inference time. Extensive experiments on MOTSynth, along with zero-shot evaluations on MOT17 and PersonPath22 demonstrate that a neural tracker built from carefully selected modules surpasses its monolithic counterpart. We release models and code.
Comment: NeurIPS 2024
Published: 2024

5. TPP-Gaze: Modelling Gaze Dynamics in Space and Time with Neural Temporal Point Processes

Author: D'Amelio, Alessandro, Cartella, Giuseppe, Cuculo, Vittorio, Lucchi, Manuele, Cornia, Marcella, Cucchiara, Rita, and Boccignone, Giuseppe
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Attention guides our gaze to fixate the proper location of the scene and holds it in that location for the deserved amount of time given current processing demands, before shifting to the next one. As such, gaze deployment crucially is a temporal process. Existing computational models have made significant strides in predicting spatial aspects of observer's visual scanpaths (where to look), while often putting on the background the temporal facet of attention dynamics (when). In this paper we present TPP-Gaze, a novel and principled approach to model scanpath dynamics based on Neural Temporal Point Process (TPP), that jointly learns the temporal dynamics of fixations position and duration, integrating deep learning methodologies with point process theory. We conduct extensive experiments across five publicly available datasets. Our results show the overall superior performance of the proposed model compared to state-of-the-art approaches. Source code and trained models are publicly available at: https://github.com/phuselab/tppgaze.
Comment: Accepted at WACV 2025
Published: 2024

6. Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Author: Barsellotti, Luca, Bigazzi, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.
Comment: NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://aimagelab.github.io/pin/
Published: 2024

7. Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Author: Sarto, Sara, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
Published: 2024

8. Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection

Author: Betti, Federico, Baraldi, Lorenzo, Cucchiara, Rita, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models have significantly advanced generative AI, but they encounter difficulties when generating complex combinations of multiple objects. As the final result heavily depends on the initial seed, accurately ensuring the desired output can require multiple iterations of the generation process. This repetition not only leads to a waste of time but also increases energy consumption, echoing the challenges of efficiency and accuracy in complex generative tasks. To tackle this issue, we introduce HEaD (Hallucination Early Detection), a new paradigm designed to swiftly detect incorrect generations at the beginning of the diffusion process. The HEaD pipeline combines cross-attention maps with a new indicator, the Predicted Final Image, to forecast the final outcome by leveraging the information available at early stages of the generation process. We demonstrate that using HEaD saves computational resources and accelerates the generation process to get a complete image, i.e. an image where all requested objects are accurately depicted. Our findings reveal that HEaD can save up to 12% of the generation time on a two objects scenario and underscore the importance of early detection mechanisms in generative models.
Comment: Accepted at ECCV Workshop 2024
Published: 2024

9. KRONC: Keypoint-based Robust Camera Optimization for 3D Car Reconstruction

Author: Di Nucci, Davide, Simoni, Alessandro, Tomei, Matteo, Ciuffreda, Luca, Vezzani, Roberto, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The three-dimensional representation of objects or scenes starting from a set of images has been a widely discussed topic for years and has gained additional attention after the diffusion of NeRF-based approaches. However, an underestimated prerequisite is the knowledge of camera poses or, more specifically, the estimation of the extrinsic calibration parameters. Although excellent general-purpose Structure-from-Motion methods are available as a pre-processing step, their computational load is high and they require a lot of frames to guarantee sufficient overlapping among the views. This paper introduces KRONC, a novel approach aimed at inferring view poses by leveraging prior knowledge about the object to reconstruct and its representation through semantic keypoints. With a focus on vehicle scenes, KRONC is able to estimate the position of the views as a solution to a light optimization problem targeting the convergence of keypoints' back-projections to a singular point. To validate the method, a specific dataset of real-world car scenes has been collected. Experiments confirm KRONC's ability to generate excellent estimates of camera poses starting from very coarse initialization. Results are comparable with Structure-from-Motion methods with huge savings in computation. Code and data will be made publicly available.
Comment: Accepted at ECCVW
Published: 2024

10. Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Author: Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
Comment: ICPR 2024
Published: 2024

11. Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

Author: Quattrini, Fabio, Pippi, Vittorio, Cascianelli, Silvia, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models have become the State-of-the-Art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, which is referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade-off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantical coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrate that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.
Comment: Accepted at ECCV 2024
Published: 2024

12. μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

Author: Quattrini, Fabio, Zaccagnino, Carmine, Cascianelli, Silvia, Righi, Laura, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Digital Libraries
Abstract: Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-paged documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Document parsing Nougat architecture, which can handle elements spanning over the single page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.
Comment: Accepted at ECCV Workshop "AI4DH: Artificial Intelligence for Digital Humanities"
Published: 2024

13. Alfie: Democratising RGBA Image Generation With No $$$

Author: Quattrini, Fabio, Pippi, Vittorio, Cascianelli, Silvia, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers' productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at https://github.com/aimagelab/Alfie.
Comment: Accepted at ECCV AI for Visual Arts Workshop and Challenges
Published: 2024

14. Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Author: Moratelli, Nicholas, Caffagni, Davide, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
Comment: BMVC 2024
Published: 2024

15. UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation

Author: Rawal, Niyati, Bigazzi, Roberto, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Robotics
Abstract: Smart autonomous agents are becoming increasingly important in various real-life applications, including robotics and autonomous vehicles. One crucial skill that these agents must possess is the ability to interact with their surrounding entities, such as other agents or humans. In this work, we aim at building an intelligent agent that can efficiently navigate in an environment while being able to interact with an oracle (or human) in natural language and ask for directions when it is unsure about its navigation performance. The interaction is started by the agent that produces a question, which is then answered by the oracle on the basis of the shortest trajectory to the goal. The process can be performed multiple times during navigation, thus enabling the agent to hold a dialogue with the oracle. To this end, we propose a novel computational model, named UNMuTe, that consists of two main components: a dialogue model and a navigator. Specifically, the dialogue model is based on a GPT-2 decoder that handles multimodal data consisting of both text and images. First, the dialogue model is trained to generate question-answer pairs: the question is generated using the current image, while the answer is produced leveraging future images on the path toward the goal. Subsequently, a VLN model is trained to follow the dialogue predicting navigation actions or triggering the dialogue model if it needs help. In our experimental analysis, we show that UNMuTe achieves state-of-the-art performance on the main navigation tasks implying dialogue, i.e. Cooperative Vision and Dialogue Navigation (CVDN) and Navigation from Dialogue History (NDH), proving that our approach is effective in generating useful questions and answers to guide navigation.
Published: 2024

16. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Author: Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
Comment: ECCV 2024
Published: 2024

17. BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
Comment: ECCV 2024
Published: 2024

18. Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning

Author: Ren, Bin, Mei, Guofeng, Paudel, Danda Pani, Wang, Weijie, Li, Yawei, Liu, Mengyuan, Cucchiara, Rita, Van Gool, Luc, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones. However, in 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant. This raises the question: Can we take the best of both worlds? To answer this question, we first empirically validate that integrating MAE-based point cloud pre-training with the standard contrastive learning paradigm, even with meticulous design, can lead to a decrease in performance. To address this limitation, we reintroduce CL into the MAE-based point cloud pre-training paradigm by leveraging the inherent contrastive properties of MAE. Specifically, rather than relying on extensive data augmentation as commonly used in the image domain, we randomly mask the input tokens twice to generate contrastive input pairs. Subsequently, a weight-sharing encoder and two identically structured decoders are utilized to perform masked token reconstruction. Additionally, we propose that for an input token masked by both masks simultaneously, the reconstructed features should be as similar as possible. This naturally establishes an explicit contrastive constraint within the generative MAE-based pre-training paradigm, resulting in our proposed method, Point-CMAE. Consequently, Point-CMAE effectively enhances the representation quality and transfer performance compared to its MAE counterpart. Experimental evaluations across various downstream applications, including classification, part segmentation, and few-shot learning, demonstrate the efficacy of our framework in surpassing state-of-the-art techniques under standard ViTs and single-modal settings. The source code and trained models are available at: https://github.com/Amazingren/Point-CMAE.
Published: 2024

19. Mask and Compress: Efficient Skeleton-based Action Recognition in Continual Learning

Author: Mosconi, Matteo, Sorokin, Andriy, Panariello, Aniello, Porrello, Angelo, Bonato, Jacopo, Cotogni, Marco, Sabetta, Luigi, Calderara, Simone, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The use of skeletal data allows deep learning models to perform action recognition efficiently and effectively. Herein, we believe that exploring this problem within the context of Continual Learning is crucial. While numerous studies focus on skeleton-based action recognition from a traditional offline perspective, only a handful venture into online approaches. In this respect, we introduce CHARON (Continual Human Action Recognition On skeletoNs), which maintains consistent performance while operating within an efficient framework. Through techniques like uniform sampling, interpolation, and a memory-efficient training stage based on masking, we achieve improved recognition accuracy while minimizing computational overhead. Our experiments on Split NTU-60 and the proposed Split NTU-120 datasets demonstrate that CHARON sets a new benchmark in this domain. The code is available at https://github.com/Sperimental3/CHARON.
Comment: Accepted at ICPR 2024
Published: 2024

20. Creating a Space for Regulation and Reflection

Author: Maia Cucchiara and Mary Beth Hays
Abstract: In response to the growing awareness of trauma and its impact on student learning, schools across the country are implementing trauma-sensitive practices. Authors Maia Cucchiara and Mary Beth Hays describe an approach to trauma sensitivity in schools: the creation of restoration rooms, spaces designed to help students (and adults) return to a state of regulation. Restoration rooms are not for disciplining students. Instead, they are spaces equipped with fidget toys, soft carpets and cushions, flexible seating, and other materials to promote regulation. Schools can use these spaces as part of larger efforts to respond to student and staff trauma and promote overall wellness.
Published: 2024
Full Text: View/download PDF

21. Satisfaction with Social Connectedness Is Associated with Depression and Anxiety Symptoms in Neurodiverse First-Semester College Students

Author: Erin E. McKenney, Jared K. Richards, Talena C. Day, Steven M. Brunwasser, Claudia L. Cucchiara, Bella Kofner, Rachel G. McDonald, Kristen Gillespie-Lynch, Jenna Lamm, Erin Kang, Matthew D. Lerner, and Katherine O. Gotham
Abstract: Social difficulties and mental health are primary behavioral health concerns in autistic young adults, perhaps especially during key life transitions such as entering college. This study evaluated how dissatisfaction with social connectedness may predict and/or maintain depression and anxiety symptoms in neurodiverse, first-semester, undergraduate students (N = 263; n = 105 with diagnosed or suspected autism). Participation included a baseline survey battery, a brief survey completed twice per week across 12 weeks, and an endpoint survey battery. Social dissatisfaction at baseline was prospectively associated with biweekly ratings of depression symptoms, when controlling for baseline depressive symptoms. Social dissatisfaction was synchronously related to elevated sadness, anhedonia, and anxiety throughout the semester. These relationships were generally consistent across levels of baseline social motivation; however, there was one significant moderation effect--the negative relationship between baseline social satisfaction and anxiety was strongest for more socially motivated participants. More autistic traits were related to lower social satisfaction at baseline and greater mood concerns across timepoints. In contrast, greater autistic traits at baseline were related to greater satisfaction with social connectedness throughout the semester. Results support ongoing efforts to address mental health in autistic college students by highlighting the importance of social satisfaction.
Published: 2024
Full Text: View/download PDF

22. Trajectory Forecasting through Low-Rank Adaptation of Discrete Latent Codes

Author: Benaglia, Riccardo, Porrello, Angelo, Buzzega, Pietro, Calderara, Simone, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Trajectory forecasting is crucial for video surveillance analytics, as it enables the anticipation of future movements for a set of agents, e.g. basketball players engaged in intricate interactions with long-term intentions. Deep generative models offer a natural learning approach for trajectory forecasting, yet they encounter difficulties in achieving an optimal balance between sampling fidelity and diversity. We address this challenge by leveraging Vector Quantized Variational Autoencoders (VQ-VAEs), which utilize a discrete latent space to tackle the issue of posterior collapse. Specifically, we introduce an instance-based codebook that allows tailored latent representations for each example. In a nutshell, the rows of the codebook are dynamically adjusted to reflect contextual information (i.e., past motion patterns extracted from the observed trajectories). In this way, the discretization process gains flexibility, leading to improved reconstructions. Notably, instance-level dynamics are injected into the codebook through low-rank updates, which restrict the customization of the codebook to a lower dimension space. The resulting discrete space serves as the basis of the subsequent step, which regards the training of a diffusion-based predictive model. We show that such a two-fold framework, augmented with instance-level discretization, leads to accurate and diverse forecasts, yielding state-of-the-art performance on three established benchmarks.
Comment: 15 pages, 3 figures, 5 tables
Published: 2024

23. Sharing Key Semantics in Transformer Makes Efficient Image Restoration

Author: Ren, Bin, Li, Yawei, Liang, Jingyun, Ranjan, Rakesh, Liu, Mengyuan, Cucchiara, Rita, Van Gool, Luc, Yang, Ming-Hsuan, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image Restoration (IR), a classic low-level vision task, has witnessed significant advancements through deep models that effectively model global information. Notably, the Vision Transformers (ViTs) emergence has further propelled these advancements. When computing, the self-attention mechanism, a cornerstone of ViTs, tends to encompass all global cues, even those from semantically unrelated objects or regions. This inclusivity introduces computational inefficiencies, particularly noticeable with high input resolution, as it requires processing irrelevant information, thereby impeding efficiency. Additionally, for IR, it is commonly noted that small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process, as they contribute essential contextual cues crucial for accurate reconstruction. To address these challenges, we propose boosting IR's performance by sharing the key semantics via Transformer for IR (i.e., SemanIR) in this paper. Specifically, SemanIR initially constructs a sparse yet comprehensive key-semantic dictionary within each transformer stage by establishing essential semantic connections for every degraded patch. Subsequently, this dictionary is shared across all subsequent transformer blocks within the same stage. This strategy optimizes attention calculation within each block by focusing exclusively on semantically related components stored in the key-semantic dictionary. As a result, attention calculation achieves linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed SemanIR's state-of-the-art performance, quantitatively and qualitatively showcasing advancements.
Comment: 9 pages
Published: 2024

24. A Second-Order Perspective on Model Compositionality and Incremental Learning

Author: Porrello, Angelo, Bonicelli, Lorenzo, Buzzega, Pietro, Millunzi, Monica, Calderara, Simone, and Cucchiara, Rita
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The fine-tuning of deep pre-trained models has revealed compositional properties, with multiple specialized modules that can be arbitrarily composed into a single, multi-task model. However, identifying the conditions that promote compositionality remains an open issue, with recent efforts concentrating mainly on linearized networks. We conduct a theoretical study that attempts to demystify compositionality in standard non-linear networks through the second-order Taylor approximation of the loss function. The proposed formulation highlights the importance of staying within the pre-training basin to achieve composable modules. Moreover, it provides the basis for two dual incremental training algorithms: the one from the perspective of multiple models trained individually, while the other aims to optimize the composed model as a whole. We probe their application in incremental classification tasks and highlight some valuable skills. In fact, the pool of incrementally learned modules not only supports the creation of an effective multi-task model but also enables unlearning and specialization in certain tasks.
Published: 2024

25. Towards Retrieval-Augmented Architectures for Image Captioning

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
Comment: ACM Transactions on Multimedia Computing, Communications and Applications (2024)
Published: 2024

26. Binarizing Documents by Leveraging both Space and Frequency

Author: Quattrini, Fabio, Pippi, Vittorio, Cascianelli, Silvia, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.
Comment: Accepted at ICDAR2024
Published: 2024

27. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Author: Caffagni, Davide, Cocchi, Federico, Moratelli, Nicholas, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
Comment: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models
Published: 2024

28. AIGeN: An Adversarial Approach for Instruction Generation in VLN

Author: Rawal, Niyati, Bigazzi, Roberto, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Robotics
Abstract: In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in literature focuses on different ways to augment the available datasets of instructions for improving navigation performance by exploiting synthetic training data. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis of our proposal is conducted on REVERIE and R2R and highlights the promising aspects of our proposal, achieving state-of-the-art performance.
Comment: Accepted to 7th Multimodal Learning and Applications Workshop (MULA 2024) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
Published: 2024

29. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

Author: Barsellotti, Luca, Amoroso, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training.
Comment: CVPR 2024. Project page: https://aimagelab.github.io/freeda/
Published: 2024

30. Defining the optimal target-to-background ratio to identify positive lymph nodes in prostate cancer patients undergoing robot-assisted [99mTc]Tc-PSMA radioguided surgery: updated results and ad interim analyses of a prospective phase II study

Author: Quarta, Leonardo, Mazzone, Elio, Cannoletta, Donato, Stabile, Armando, Scuderi, Simone, Barletta, Francesco, Cucchiara, Vito, Nocera, Luigi, Pellegrino, Antony, Robesti, Daniele, Leni, Riccardo, Zaurito, Paolo, Brembilla, Giorgio, De Cobelli, Francesco, Samanes Gajate, Ana Maria, Picchio, Maria, Chiti, Arturo, Montorsi, Francesco, Briganti, Alberto, and Gandaglia, Giorgio
Published: 2024
Full Text: View/download PDF

31. Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing

Author: Baldrati, Alberto, Morelli, Davide, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
Published: 2024

32. Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

Author: Cartella, Giuseppe, Cuculo, Vittorio, Cornia, Marcella, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate the possibility of being included in frameworks of fake image detection. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed observational pattern observed when viewing genuine images. Our dataset is publicly available at: https://github.com/aimagelab/unveiling-the-truth.
Comment: Accepted to IEEE Signal Processing Letters 2024
Published: 2024

33. Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Author: Bigazzi, Roberto, Baraldi, Lorenzo, Kousik, Shreyas, Cucchiara, Rita, and Pavone, Marco
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region mapping via embodied navigation in indoor environments, generating a high-level representation of the knowledge of the agent. To enable region identification, the method uses a vision-to-language model to provide scene information for mapping. By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location. This mapping procedure is paired with a trained navigation policy to enable autonomous map generation. The proposed method significantly outperforms a variety of baselines, including an object-based system and a pretrained scene classifier, in experiments in a photorealistic simulator.
Comment: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2024)
Published: 2024

34. Trends, Applications, and Challenges in Human Attention Modelling

Author: Cartella, Giuseppe, Cornia, Marcella, Cuculo, Vittorio, D'Amelio, Alessandro, Zanca, Dario, Boccignone, Giuseppe, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Human attention modelling has proven, in recent years, to be particularly useful not only for understanding the cognitive processes underlying visual exploration, but also for providing support to artificial intelligence models that aim to solve problems in various domains, including image and video processing, vision-and-language applications, and language modelling. This survey offers a reasoned overview of recent efforts to integrate human attention mechanisms into contemporary deep learning models and discusses future research directions and challenges. For a comprehensive overview on the ongoing research refer to our dedicated repository available at https://github.com/aimagelab/awesome-human-visual-attention.
Comment: Accepted at IJCAI 2024 Survey Track
Published: 2024

35. The Revolution of Multimodal Large Language Models: A Survey

Author: Caffagni, Davide, Cocchi, Federico, Barsellotti, Luca, Moratelli, Nicholas, Sarto, Sara, Baraldi, Lorenzo, Cornia, Marcella, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.
Comment: ACL 2024 (Findings)
Published: 2024

36. VATr++: Choose Your Words Wisely for Handwritten Text Generation

Author: Vanherle, Bram, Pippi, Vittorio, Cascianelli, Silvia, Michiels, Nick, Van Reeth, Frank, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect - the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This study delves deeper into a cutting-edge Styled-HTG approach, proposing strategies for input preparation and training regularization that allow the model to achieve better performance and generalize better. These aspects are validated through extensive analysis on several different settings and datasets. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research - the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.
Published: 2024

37. Key-Graph Transformer for Image Restoration

Author: Ren, Bin, Li, Yawei, Liang, Jingyun, Ranjan, Rakesh, Liu, Mengyuan, Cucchiara, Rita, Van Gool, Luc, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: While it is crucial to capture global information for effective image restoration (IR), integrating such cues into transformer-based methods becomes computationally expensive, especially with high input resolution. Furthermore, the self-attention mechanism in transformers is prone to considering unnecessary global cues from unrelated objects or regions, introducing computational inefficiencies. In response to these challenges, we introduce the Key-Graph Transformer (KGT) in this paper. Specifically, KGT views patch features as graph nodes. The proposed Key-Graph Constructor efficiently forms a sparse yet representative Key-Graph by selectively connecting essential nodes instead of all the nodes. Then the proposed Key-Graph Attention is conducted under the guidance of the Key-Graph only among selected nodes with linear computational complexity within each window. Extensive experiments across 6 IR tasks confirm the proposed KGT's state-of-the-art performance, showcasing advancements both quantitatively and qualitatively.
Comment: 9 pages, 6 figures
Published: 2024

38. DistFormer: Enhancing Local and Global Features for Monocular Per-Object Distance Estimation

Author: Panariello, Aniello, Mancusi, Gianluca, Ali, Fedy Haj, Porrello, Angelo, Calderara, Simone, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate per-object distance estimation is crucial in safety-critical applications such as autonomous driving, surveillance, and robotics. Existing approaches rely on two scales: local information (i.e., the bounding box proportions) or global information, which encodes the semantics of the scene as well as the spatial relations with neighboring objects. However, these approaches may struggle with long-range objects and in the presence of strong occlusions or unusual visual patterns. In this respect, our work aims to strengthen both local and global cues. Our architecture -- named DistFormer -- builds upon three major components acting jointly: i) a robust context encoder extracting fine-grained per-object representations; ii) a masked encoder-decoder module exploiting self-supervision to promote the learning of useful per-object features; iii) a global refinement module that aggregates object representations and computes a joint, spatially-consistent estimation. To evaluate the effectiveness of DistFormer, we conduct experiments on the standard KITTI dataset and the large-scale NuScenes and MOTSynth datasets. Such datasets cover various indoor/outdoor environments, changing weather conditions, appearances, and camera viewpoints. Our comprehensive analysis shows that DistFormer outperforms existing methods. Moreover, we further delve into its generalization capabilities, showing its regularization benefits in zero-shot synth-to-real transfer.
Published: 2024

39. 'I Feel Like a Hypocrite': School Choice and Teacher Role Identity

Author: Sophia Seifert and Maia B. Cucchiara
Abstract: In recent decades, school choice has become a characteristic feature of urban school systems and, like students, teachers must choose among schools with various characteristics. Such decisions become new sites for teachers to enact their professional identity. This study uses qualitative data to explore the identity negotiations of 26 teachers employed in six choice high schools in Preston, a large Northeastern city. Sample teachers worked in schools that varied based on sector (public vs. charter), enrollment mechanism (neighborhood, lottery, selective), and model (progressive, highly structured). Drawing on the concept of a role identity standard--a goal against which a person judges themselves--we found that teachers in our sample held themselves to a standard that either emphasized instruction or social justice. Some viewed their school as conducive to enacting their role identity standard. These teachers were generally satisfied with their school and conceptually supportive of school choice. However, most teachers in our sample adhering to a justice-based teacher identity standard described incongruence between some aspect of their chosen school and their professional identity. This conflict created stress and drove teachers to re-frame aspects of their chosen school so that remaining there felt more consistent with their professional identity.
Published: 2024
Full Text: View/download PDF

40. Civis Americanus Sum: Mythmaking in the Movement to Reclassify Italian Alien Enemies During the Second World War

Author: Cucchiara, Antonia
Published: 2024
Full Text: View/download PDF

41. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Author: Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, Nicolosi, Alessandro, and Cucchiara, Rita; Series Editor: Goos, Gerhard; Founding Editor: Hartmanis, Juris; Editorial Board Members: Bertino, Elisa, Gao, Wen, Steffen, Bernhard, and Yung, Moti; Editors: Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül
Published: 2025
Full Text: View/download PDF

42. BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

43. Sex Differences in Perihematomal Edema Volume and Outcome After Intracerebral Hemorrhage

Author: Witsch, Jens, Cao, Quy, Song, Jae W., Luo, Yunshi, Sloane, Kelly L., Rothstein, Aaron, Favilla, Christopher G., Cucchiara, Brett L., Kasner, Scott E., Messé, Steve R., Choi, Huimahn A., McCullough, Louise D., Mayer, Stephan A., and Gusdon, Aaron M.
Published: 2024
Full Text: View/download PDF

44. Characterization of patient-derived intestinal organoids for modelling fibrosis in Inflammatory Bowel Disease

Author: Laudadio, Ilaria, Carissimi, Claudia, Scafa, Noemi, Bastianelli, Alex, Fulci, Valerio, Renzini, Alessandra, Russo, Giusy, Oliva, Salvatore, Vitali, Roberta, Palone, Francesca, Cucchiara, Salvatore, and Stronati, Laura
Published: 2024
Full Text: View/download PDF

45. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Author: Poppi, Samuele, Poppi, Tobia, Cocchi, Federico, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip. Comment: ECCV 2024
Published: 2023

46. HWD: A Novel Evaluation Score for Styled Handwritten Text Generation

Author: Pippi, Vittorio, Quattrini, Fabio, Cascianelli, Silvia, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Digital Libraries
Abstract: Styled Handwritten Text Generation (Styled HTG) is an important task in document analysis, aiming to generate text images with the handwriting of given reference images. In recent years, there has been significant progress in the development of deep learning models for tackling this task. Being able to measure the performance of HTG models via a meaningful and representative criterion is key for fostering the development of this research topic. However, despite the current adoption of scores for natural image generation evaluation, assessing the quality of generated handwriting remains challenging. In light of this, we devise the Handwriting Distance (HWD), tailored for HTG evaluation. In particular, it works in the feature space of a network specifically trained to extract handwriting style features from the variable-length input images and exploits a perceptual distance to compare the subtle geometric features of handwriting. Through extensive experimental evaluation on different word-level and line-level datasets of handwritten text images, we demonstrate the suitability of the proposed HWD as a score for Styled HTG. The pretrained model used as backbone will be released to ease the adoption of the score, aiming to provide a valuable tool for evaluating HTG models and thus contributing to advancing this important research area. Comment: Accepted at BMVC2023
Published: 2023

47. Model order reduction by convex displacement interpolation

Author: Cucchiara, Simona, Iollo, Angelo, Taddei, Tommaso, and Telib, Haysam
Subjects: Mathematics - Numerical Analysis
Abstract: We present a nonlinear interpolation technique for parametric fields that exploits optimal transportation of coherent structures of the solution to achieve accurate performance. The approach generalizes the nonlinear interpolation procedure introduced in [Iollo, Taddei, J. Comput. Phys., 2022] to multi-dimensional parameter domains and to datasets of several snapshots. Given a library of high-fidelity simulations, we rely on a scalar testing function and on a point set registration method to identify coherent structures of the solution field in the form of sorted point clouds. Given a new parameter value, we exploit a regression method to predict the new point cloud; then, we resort to a boundary-aware registration technique to define bijective mappings that deform the new point cloud into the point clouds of the neighboring elements of the dataset, while preserving the boundary of the domain; finally, we define the estimate as a weighted combination of modes obtained by composing the neighboring snapshots with the previously-built mappings. We present several numerical examples for compressible and incompressible, viscous and inviscid flows to demonstrate the accuracy of the method. Furthermore, we employ the nonlinear interpolation procedure to augment the dataset of simulations for linear-subspace projection-based model reduction: our data augmentation procedure is designed to reduce offline costs -- which are dominated by snapshot generation -- of model reduction techniques for nonlinear advection-dominated problems.
Published: 2023

48. Dental pulp mesenchymal stem cell (DPSCs)-derived soluble factors, produced under hypoxic conditions, support angiogenesis via endothelial cell activation and generation of M2-like macrophages

Author: Barone, Ludovica, Cucchiara, Martina, Palano, Maria Teresa, Bassani, Barbara, Gallazzi, Matteo, Rossi, Federica, Raspanti, Mario, Zecca, Piero Antonio, De Antoni, Gianluca, Pagiatakis, Christina, Papait, Roberto, Bernardini, Giovanni, Bruno, Antonino, and Gornati, Rosalba
Published: 2024
Full Text: View/download PDF

49. Dental pulp mesenchymal stem cell (DPSCs)-derived soluble factors, produced under hypoxic conditions, support angiogenesis via endothelial cell activation and generation of M2-like macrophages

Author: Ludovica Barone, Martina Cucchiara, Maria Teresa Palano, Barbara Bassani, Matteo Gallazzi, Federica Rossi, Mario Raspanti, Piero Antonio Zecca, Gianluca De Antoni, Christina Pagiatakis, Roberto Papait, Giovanni Bernardini, Antonino Bruno, and Rosalba Gornati
Subjects: Dental pulp stem cells, Mesenchymal stem cells, Angiogenesis, Secretome, Tissue engineering, Macrophage polarization, Medicine
Abstract: Background: Cell therapy has emerged as a revolutionary tool to repair damaged tissues by restoration of an adequate vasculature. Dental Pulp stem cells (DPSC), due to their easy biological access, ex vivo properties, and ability to support angiogenesis have been largely explored in regenerative medicine. Methods: Here, we tested the capability of Dental Pulp Stem Cell-Conditioned medium (DPSC-CM), produced in normoxic (DPSC-CM Normox) or hypoxic (DPSC-CM Hypox) conditions, to support angiogenesis via their soluble factors. CMs were characterized by a secretome protein array, then used for in vivo and in vitro experiments. In in vivo experiments, DPSC-CMs were associated to an Ultimatrix sponge and injected in nude mice. After excision, Ultimatrix were assayed by immunohistochemistry, electron microscopy and flow cytometry, to evaluate the presence of endothelial, stromal, and immune cells. For in vitro procedures, DPSC-CMs were used on human umbilical-vein endothelial cells (HUVECs), to test their effects on cell adhesion, migration, tube formation, and on their capability to recruit human CD14+ monocytes. Results: We found that DPSC-CM Hypox exert stronger pro-angiogenic activities, compared with DPSC-CM Normox, by increasing the frequency of CD31+ endothelial cells, the number of vessels and hemoglobin content in the Ultimatrix sponges. We observed that Ultimatrix sponges associated with DPSC-CM Hypox or DPSC-CM Normox shared similar capability to recruit CD45− stromal cells, CD45+ leukocytes, F4/80+ macrophages, CD80+ M1-macrophages and CD206+ M2-macrophages. We also observed that DPSC-CM Hypox and DPSC-CM Normox have similar capabilities to support HUVEC adhesion, migration, induction of a pro-angiogenic gene signature and the generation of capillary-like structures, together with the ability to recruit human CD14+ monocytes. Conclusions: Our results provide evidence that DPSCs-CM, produced under hypoxic conditions, can be proposed as a tool able to support angiogenesis via macrophage polarization, suggesting its use to overcome the issues and restrictions associated with the use of staminal cells.
Published: 2024
Full Text: View/download PDF

50. OpenFashionCLIP: Vision-and-Language Contrastive Learning with Open-Source Fashion Data

Author: Cartella, Giuseppe, Baldrati, Alberto, Morelli, Davide, Cornia, Marcella, Bertini, Marco, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The inexorable growth of online shopping and e-commerce demands scalable and robust machine learning-based solutions to accommodate customer requirements. In the context of automatic tagging classification and multimodal retrieval, prior works either defined a low generalizable supervised learning approach or more reusable CLIP-based techniques while, however, training on closed source data. In this work, we propose OpenFashionCLIP, a vision-and-language contrastive learning method that only adopts open-source fashion data stemming from diverse domains, and characterized by varying degrees of specificity. Our approach is extensively validated across several tasks and benchmarks, and experimental results highlight a significant out-of-domain generalization capability and consistent improvements over state-of-the-art methods both in terms of accuracy and recall. Source code and trained models are publicly available at: https://github.com/aimagelab/open-fashion-clip. Comment: International Conference on Image Analysis and Processing (ICIAP) 2023
Published: 2023
