Author: "Baraldi, P." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Baraldi, P."' showing total 2,007 results

1. Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Author: Amoroso, Roberto, Zhang, Gengyuan, Koner, Rajat, Baraldi, Lorenzo, Cucchiara, Rita, and Tresp, Volker
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.
Comment: WACV 2025
Published: 2024

2. Causal Graphical Models for Vision-Language Compositional Understanding

Author: Parascandolo, Fiorenzo, Moratelli, Nicholas, Sangineto, Enver, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on compositional tasks, which require a deeper understanding of the different entities of a sentence (subject, verb, etc.) jointly with their mutual relationships in order to be solved. In this paper, we model the dependency relations among textual and visual tokens using a Causal Graphical Model (CGM), built using a dependency parser, and we train a decoder conditioned by the VLM visual encoder. Differently from standard autoregressive or parallel predictions, our decoder's generative process is partially-ordered following the CGM structure. This structure encourages the decoder to learn only the main causal dependencies in a sentence discarding spurious correlations. Using extensive experiments on five compositional benchmarks, we show that our method significantly outperforms all the state-of-the-art compositional approaches by a large margin, and it also improves over methods trained using much larger datasets.
Published: 2024

3. Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

Author: Bucciarelli, Davide, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which extend the capabilities of text-only LLMs to multiple modalities. This paper investigates whether Multimodal LLMs can supplant traditional image captioning networks by evaluating their performance on various image description benchmarks. We explore both the zero-shot capabilities of these models and their adaptability to different semantic domains through fine-tuning methods, including prompt learning, prefix tuning, and low-rank adaptation. Our results demonstrate that while Multimodal LLMs achieve impressive zero-shot performance, fine-tuning for specific domains while maintaining their generalization capabilities intact remains challenging. We discuss the implications of these findings for future research in image captioning and the development of more adaptable Multimodal LLMs.
Comment: ECCV 2024 Workshop on Green Foundation Models
Published: 2024

4. Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Author: Barsellotti, Luca, Bianchi, Lorenzo, Messina, Nicola, Carrara, Fabio, Cornia, Marcella, Baraldi, Lorenzo, Falchi, Fabrizio, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
Published: 2024

5. Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Author: Cocchi, Federico, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (ReflectiVA), utilizes reflective tokens to dynamically determine the need for external knowledge and predict the relevance of information retrieved from an external database. Tokens are trained following a two-stage two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed. Through our experiments, we demonstrate the efficacy of ReflectiVA for knowledge-based visual question answering, highlighting its superior performance compared to existing methods. Source code and trained models are publicly available at https://github.com/aimagelab/ReflectiVA.
Published: 2024

6. Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments

Author: Barsellotti, Luca, Bigazzi, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In the last years, the research interest in visual navigation towards objects in indoor environments has grown significantly. This growth can be attributed to the recent availability of large navigation datasets in photo-realistic simulated environments, like Gibson and Matterport3D. However, the navigation tasks supported by these datasets are often restricted to the objects present in the environment at acquisition time. Also, they fail to account for the realistic scenario in which the target object is a user-specific instance that can be easily confused with similar objects and may be found in multiple locations within the environment. To address these limitations, we propose a new task denominated Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a dedicated new dataset composed of photo-realistic scenes augmented with additional 3D objects. In each episode, the target object is presented to the agent using two modalities: a set of visual reference images on a neutral background and manually annotated textual descriptions. Through comprehensive evaluations and analyses, we showcase the challenges of the PIN task as well as the performance and shortcomings of currently available methods designed for object-driven navigation, considering modular and end-to-end agents.
Comment: NeurIPS 2024 Datasets and Benchmarks Track. Project page: https://aimagelab.github.io/pin/
Published: 2024

7. Domain decomposition for integer optimal control with total variation regularization

Author: Baraldi, Robert and Manns, Paul
Subjects: Mathematics - Optimization and Control, 49K30, 49Q15, 49Q20, 49M37
Abstract: Total variation integer optimal control problems admit solutions and necessary optimality conditions via geometric variational analysis. In spite of the existence of said solutions, algorithms which solve the discretized objective suffer from high numerical cost associated with the combinatorial nature of integer programming. Hence, such methods are often limited to small- and medium-sized problems. We propose a globally convergent, coordinate descent-inspired algorithm that allows tractable subproblem solutions restricted to a partition of the domain. Our decomposition method solves relatively small trust-region subproblems that modify the control variable on a subdomain only. Given sufficient subdomain overlap, we prove that first-order optimality is equivalent to first-order optimality per subdomain. We additionally show that sufficient decrease is achieved on a single subdomain by way of a trust region subproblem solver. We utilize this to prove convergence of our algorithm, which operates via greedy patch selection. Finally, our method demonstrates the capacity to accelerate large PDE-constrained integer control problems.
Published: 2024

8. Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training

Author: Sarto, Sara, Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Despite significant advancements in caption generation, existing evaluation metrics often fail to capture the full quality or fine-grained details of captions. This is mainly due to their reliance on non-specific human-written references or noisy pre-training data. Still, finding an effective metric is crucial not only for captions evaluation but also for the generation phase. Metrics can indeed play a key role in the fine-tuning stage of captioning models, ultimately enhancing the quality of the generated captions. In this paper, we propose PAC-S++, a learnable metric that leverages the CLIP model, pre-trained on both web-collected and cleaned data and regularized through additional pairs of generated visual and textual positive samples. Exploiting this stronger and curated pre-training, we also apply PAC-S++ as a reward in the Self-Critical Sequence Training (SCST) stage typically employed to fine-tune captioning models. Extensive experiments on different image and video datasets highlight the effectiveness of PAC-S++ compared to popular metrics for the task, including its sensitivity to object hallucinations. Furthermore, we show that integrating PAC-S++ into the fine-tuning stage of a captioning model results in semantically richer captions with fewer repetitions and grammatical errors. Evaluations on out-of-domain benchmarks further demonstrate the efficacy of our fine-tuning approach in enhancing model capabilities. Source code and trained models are publicly available at: https://github.com/aimagelab/pacscore.
Published: 2024

9. Design, fabrication, and testing of diamond axicons for X-ray microscopy applications

Author: Samadi, Nazanin, Seiboth, Frank, Dias, Carlos Sato Baraldi, Novikov, Dmitri, Spiers, Kathryn, and Shi, Xianbo
Subjects: Physics - Optics
Abstract: This work presents the design, fabrication, and experimental validation of a refractive diamond axicon for X-ray beam shaping. The diamond axicon was developed to overcome the limitations of polymer-based axicons particularly for application in Transmission X-ray Microscopy (TXM) systems, offering superior mechanical strength, thermal stability, and radiation resistance, making it ideal for synchrotron applications. The axicon was fabricated using femtosecond laser ablation and tested at 11 keV under various coherence conditions. Results demonstrated that the axicon efficiently transformed the X-ray beam into a ring-shaped profile with over 80% transmission. Simulations confirmed the experimental findings and highlighted the potential for further improvements. This work paves the way for the use of diamond axicons in next-generation synchrotron facilities, with future efforts focusing on optimizing fabrication and testing the axicon in full TXM systems.
Published: 2024

10. Optimizing Resource Consumption in Diffusion Models through Hallucination Early Detection

Author: Betti, Federico, Baraldi, Lorenzo, Cucchiara, Rita, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models have significantly advanced generative AI, but they encounter difficulties when generating complex combinations of multiple objects. As the final result heavily depends on the initial seed, accurately ensuring the desired output can require multiple iterations of the generation process. This repetition not only leads to a waste of time but also increases energy consumption, echoing the challenges of efficiency and accuracy in complex generative tasks. To tackle this issue, we introduce HEaD (Hallucination Early Detection), a new paradigm designed to swiftly detect incorrect generations at the beginning of the diffusion process. The HEaD pipeline combines cross-attention maps with a new indicator, the Predicted Final Image, to forecast the final outcome by leveraging the information available at early stages of the generation process. We demonstrate that using HEaD saves computational resources and accelerates the generation process to get a complete image, i.e. an image where all requested objects are accurately depicted. Our findings reveal that HEaD can save up to 12% of the generation time on a two objects scenario and underscore the importance of early detection mechanisms in generative models.
Comment: Accepted at ECCV Workshop 2024
Published: 2024

11. PSZ2 G282.28+49.94, a recently discovered analogue of the famous Bullet Cluster

Author: Bartalucci, I., Rossetti, M., Boschin, W., Girardi, M., Nonino, M., Baraldi, E., Balboni, M., Coe, D., De Grandi, S., Gastaldello, F., Ghizzardi, S., Giacintucci, S., Grillo, C., Harvey, D., Lovisari, L., Molendi, S., Resseguier, T., Riva, G., Venturi, T., and Zitrin, A.
Subjects: Astrophysics - Cosmology and Nongalactic Astrophysics
Abstract: We present a detailed study of the gas and galaxy properties of the cluster PSZ2 G282.28+49.94 detected in the Planck all-sky survey. The intracluster medium (ICM) of this object at z=0.56 exhibits a cometary-like shape. Combining Chandra and TNG observations, we characterised the spatially resolved thermodynamical properties of the gas and the spatial and velocity distribution of 73 galaxy members. The cluster structure is quite complex with an elongated core region containing the two brightest cluster galaxies and one dense group to the south-east. Since there is no velocity difference between the core and the south-east group, we suggest the presence of a merger along the plane of the sky. This structure is related to complex X-ray and radio features, and thus the merger has likely been caught during the post-merger phase. Comparing the distribution of the ICM and of member galaxies, we find a large offset of $\sim 350$ kpc between the position of the X-ray peak and the centre of a concentration of galaxies, preceding it in the likely direction of motion. This configuration is similar to the famous Bullet Cluster, leading us to dub PSZ2 G282.28+49.94 the "Planck bullet", and represents an ideal situation to provide astrophysical constraints to the self-interaction cross-section ($\sigma/m$) of dark matter particles. These results illustrate the power of a multi-wavelength approach to probe the merging scenario of such complex and distant systems.
Comment: Accepted for publication in A&A
Published: 2024

12. Fluent and Accurate Image Captioning with a Self-Trained Reward Model

Author: Moratelli, Nicholas, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fine-tuning image captioning models with hand-crafted rewards like the CIDEr metric has been a classical strategy for promoting caption quality at the sequence level. This approach, however, is known to limit descriptiveness and semantic richness and tends to drive the model towards the style of ground-truth sentences, thus losing detail and specificity. On the contrary, recent attempts to employ image-text models like CLIP as reward have led to grammatically incorrect and repetitive captions. In this paper, we propose Self-Cap, a captioning approach that relies on a learnable reward model based on self-generated negatives that can discriminate captions based on their consistency with the image. Specifically, our discriminator is a fine-tuned contrastive image-text model trained to promote caption correctness while avoiding the aberrations that typically happen when training with a CLIP-based reward. To this end, our discriminator directly incorporates negative samples from a frozen captioner, which significantly improves the quality and richness of the generated captions but also reduces the fine-tuning time in comparison to using the CIDEr score as the sole metric for optimization. Experimental results demonstrate the effectiveness of our training strategy on both standard and zero-shot image captioning datasets.
Comment: ICPR 2024
Published: 2024

13. Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Author: Moratelli, Nicholas, Caffagni, Davide, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
Comment: BMVC 2024
Published: 2024

14. UNMuTe: Unifying Navigation and Multimodal Dialogue-like Text Generation

Author: Rawal, Niyati, Bigazzi, Roberto, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Robotics
Abstract: Smart autonomous agents are becoming increasingly important in various real-life applications, including robotics and autonomous vehicles. One crucial skill that these agents must possess is the ability to interact with their surrounding entities, such as other agents or humans. In this work, we aim at building an intelligent agent that can efficiently navigate in an environment while being able to interact with an oracle (or human) in natural language and ask for directions when it is unsure about its navigation performance. The interaction is started by the agent that produces a question, which is then answered by the oracle on the basis of the shortest trajectory to the goal. The process can be performed multiple times during navigation, thus enabling the agent to hold a dialogue with the oracle. To this end, we propose a novel computational model, named UNMuTe, that consists of two main components: a dialogue model and a navigator. Specifically, the dialogue model is based on a GPT-2 decoder that handles multimodal data consisting of both text and images. First, the dialogue model is trained to generate question-answer pairs: the question is generated using the current image, while the answer is produced leveraging future images on the path toward the goal. Subsequently, a VLN model is trained to follow the dialogue predicting navigation actions or triggering the dialogue model if it needs help. In our experimental analysis, we show that UNMuTe achieves state-of-the-art performance on the main navigation tasks implying dialogue, i.e. Cooperative Vision and Dialogue Navigation (CVDN) and Navigation from Dialogue History (NDH), proving that our approach is effective in generating useful questions and answers to guide navigation.
Published: 2024

15. Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Author: Baraldi, Lorenzo, Cocchi, Federico, Cornia, Marcella, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has only recently surfaced. This prompted the recent exploration of solutions that employ foundation vision-and-language models, like CLIP. However, the CLIP embedding space is optimized for global image-to-text alignment and is not inherently designed for deepfake detection, neglecting the potential benefits of tailored training and local image features. In this study, we propose CoDE (Contrastive Deepfake Embeddings), a novel embedding space specifically designed for deepfake detection. CoDE is trained via contrastive learning by additionally enforcing global-local similarities. To sustain the training of our model, we generate a comprehensive dataset that focuses on images generated by diffusion models and encompasses a collection of 9.2 million images produced by using four different generators. Experimental results demonstrate that CoDE achieves state-of-the-art accuracy on the newly collected dataset, while also showing excellent generalization capabilities to unseen image generators. Our source code, trained models, and collected dataset are publicly available at: https://github.com/aimagelab/CoDE.
Comment: ECCV 2024
Published: 2024

16. BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
Comment: ECCV 2024
Published: 2024

17. Towards Retrieval-Augmented Architectures for Image Captioning

Author: Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, Nicolosi, Alessandro, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
Comment: ACM Transactions on Multimedia Computing, Communications and Applications (2024)
Published: 2024

18. Efficient proximal subproblem solvers for a nonsmooth trust-region method

Author: Baraldi, Robert J. and Kouri, Drew P.
Published: 2025

19. Exogenous dsRNAs against chitin synthase and glucan synthase genes suppress the virulence of the pathogenic fungus Botrytis cinerea

Author: Gebremichael, Daniel Endale, Ciofini, Alice, Sabbadini, Silvia, Mezzetti, Bruno, Baraldi, Elena, Haile, Zeraye Mehari, and Negrini, Francesca
Published: 2024

20. Random forest analysis and lasso regression outperform traditional methods in identifying missing data auxiliary variables when the MAR mechanism is nonlinear (p.s. Stop using Little’s MCAR test)

Author: Hayes, Timothy, Baraldi, Amanda N., and Coxe, Stefany
Published: 2024

21. Blood pressure monitoring in elderly migraineurs starting an anti-CGRP monoclonal antibody: a real-world prospective study

Author: Mascarella, Davide, Andrini, Giorgia, Baraldi, Carlo, Altamura, Claudia, Favoni, Valentina, Lo Castro, Flavia, Pierangeli, Giulia, Vernieri, Fabrizio, Guerzoni, Simona, and Cevoli, Sabina
Published: 2024

22. Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

Author: Caffagni, Davide, Cocchi, Federico, Moratelli, Nicholas, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
Comment: CVPR 2024 Workshop on What is Next in Multimodal Foundation Models
Published: 2024

23. AIGeN: An Adversarial Approach for Instruction Generation in VLN

Author: Rawal, Niyati, Bigazzi, Roberto, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Robotics
Abstract: In the last few years, the research interest in Vision-and-Language Navigation (VLN) has grown significantly. VLN is a challenging task that involves an agent following human instructions and navigating in a previously unknown environment to reach a specified goal. Recent work in literature focuses on different ways to augment the available datasets of instructions for improving navigation performance by exploiting synthetic training data. In this work, we propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. The model is composed of a Transformer decoder (GPT-2) and a Transformer encoder (BERT). During the training phase, the decoder generates sentences for a sequence of images describing the agent's path to a particular point while the encoder discriminates between real and fake instructions. Experimentally, we evaluate the quality of the generated instructions and perform extensive ablation studies. Additionally, we generate synthetic instructions for 217K trajectories using AIGeN on Habitat-Matterport 3D Dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method. The validation analysis of our proposal is conducted on REVERIE and R2R and highlights the promising aspects of our proposal, achieving state-of-the-art performance.
Comment: Accepted to 7th Multimodal Learning and Applications Workshop (MULA 2024) at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
Published: 2024

24. Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation

Author: Barsellotti, Luca, Amoroso, Roberto, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training. Comment: CVPR 2024. Project page: https://aimagelab.github.io/freeda/
Published: 2024

25. Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Author: Bigazzi, Roberto, Baraldi, Lorenzo, Kousik, Shreyas, Cucchiara, Rita, and Pavone, Marco
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region mapping via embodied navigation in indoor environments, generating a high-level representation of the knowledge of the agent. To enable region identification, the method uses a vision-to-language model to provide scene information for mapping. By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location. This mapping procedure is paired with a trained navigation policy to enable autonomous map generation. The proposed method significantly outperforms a variety of baselines, including an object-based system and a pretrained scene classifier, in experiments in a photorealistic simulator. Comment: Accepted by IEEE International Conference on Robotics and Automation (ICRA 2024)
Published: 2024

26. The Revolution of Multimodal Large Language Models: A Survey

Author: Caffagni, Davide, Cocchi, Federico, Barsellotti, Luca, Moratelli, Nicholas, Sarto, Sara, Baraldi, Lorenzo, Cornia, Marcella, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs. Comment: ACL 2024 (Findings)
Published: 2024

27. Many nonnormalities, one simulation: Do different data generation algorithms affect study results?

Author: Fairchild, Amanda J., Yin, Yunhang, Baraldi, Amanda N., Astivia, Oscar L. Olvera, and Shi, Dexin
Published: 2024

28. A two-tiered high-flow nasal cannula approach does not increase intensive care utilization and hospital length of stay in bronchiolitis

Author: Tirelli, Francesca, Todeschini Premuda, Marco, Francaviglia, Giulia, Frigo, Anna Chiara, Baraldi, Eugenio, Da Dalt, Liviana, and Bressan, Silvia
Published: 2024

29. Investigating the influence of varying cobalt doping on the cross-sectional widths and surface composition of MnOx nanowires in the context of battery–supercapacitor systems

Author: da Silva Eduardo, Samuel, de Figueiredo, Patrick Benedito Silva, de Lima, Scarllett Lalesca Santos, Santos, Karolinne Evelin Rodrigues, Ribeiro, Geyse Adriana Correa, Fonseca, Weliton Silva, Letichevsky, Sonia, Gothe, Maitê Lippel, Vidinha, Pedro, Spadotto, Julio, Dourado, André Henrique Baraldi, Connolly, Brian, de Lima, Roberto Batista, da Silva, Anderson Gabriel Marques, and Garcia, Marco Aurélio Suller
Published: 2024

30. Cerebral venous thrombosis and deep medullary vein thrombosis: Padua experience over the last two decades

Author: Cavicchiolo, Maria Elena, Brigiari, Gloria, Nosadini, Margherita, Pin, Jacopo Norberto, Vincenti, Arianna, Toldo, Irene, Ancona, Claudio, Simioni, Paolo, D'Errico, Ignazio, Baraldi, Eugenio, and Sartori, Stefano
Published: 2024

31. Safe-CLIP: Removing NSFW Concepts from Vision-and-Language Models

Author: Poppi, Samuele, Poppi, Tobia, Cocchi, Federico, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concerns in their adoption. Our research introduces a novel approach to enhancing the safety of vision-and-language models by diminishing their sensitivity to NSFW (not safe for work) inputs. In particular, our methodology seeks to sever "toxic" linguistic and visual concepts, unlearning the linkage between unsafe linguistic or visual items and unsafe regions of the embedding space. We show how this can be done by fine-tuning a CLIP model on synthetic data obtained from a large language model trained to convert between safe and unsafe sentences, and a text-to-image generator. We conduct extensive experiments on the resulting embedding space for cross-modal retrieval, text-to-image, and image-to-text generation, where we show that our model can be remarkably employed with pre-trained generative models. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip. Comment: ECCV 2024
Published: 2023

32. Application of the TIDieR checklist to improve the HFNC use in bronchiolitis management

Author: Manti, Sara, Gambadauro, Antonella, Ruggeri, Paolo, and Baraldi, Eugenio
Published: 2025

33. The genomic evolutionary dynamics and global circulation patterns of respiratory syncytial virus.

Author: Langedijk, Annefleur, Vrancken, Bram, Lebbink, Robert, Wilkins, Deidre, Kelly, Elizabeth, Baraldi, Eugenio, Mascareñas de Los Santos, Abiel, Danilenko, Daria, Choi, Eun, Palomino, María, Chi, Hsin, Keller, Christian, Cohen, Robert, Papenburg, Jesse, Pernica, Jeffrey, Greenough, Anne, Richmond, Peter, Martinón-Torres, Federico, Heikkinen, Terho, Stein, Renato, Hosoya, Mitsuaki, Nunes, Marta, Verwey, Charl, Evers, Anouk, Kragten-Tabatabaie, Leyla, Suchard, Marc, Kosakovsky Pond, Sergei, Poletto, Chiara, Colizza, Vittoria, Lemey, Philippe, and Bont, Louis
Subjects: Infant, Child, Humans, Child, Preschool, Respiratory Syncytial Virus Infections, Phylogeny, Respiratory Syncytial Virus, Human, Genomics, Respiratory Tract Infections
Abstract: Respiratory syncytial virus (RSV) is a leading cause of acute lower respiratory tract infection in young children and the second leading cause of infant death worldwide. While global circulation has been extensively studied for respiratory viruses such as seasonal influenza, and more recently also in great detail for SARS-CoV-2, a lack of global multi-annual sampling of complete RSV genomes limits our understanding of RSV molecular epidemiology. Here, we capitalise on the genomic surveillance by the INFORM-RSV study and apply phylodynamic approaches to uncover how selection and neutral epidemiological processes shape RSV diversity. Using complete viral genome sequences, we show similar patterns of site-specific diversifying selection among RSVA and RSVB and recover the imprint of non-neutral epidemic processes on their genealogies. Using a phylogeographic approach, we provide evidence for air travel governing the global patterns of RSVA and RSVB spread, which results in a considerable degree of phylogenetic mixing across countries. Our findings highlight the potential of systematic global RSV genomic surveillance for transforming our understanding of global RSV spread.
Published: 2024

34. Microbial diversity and cover plants in de-sealed urban soil as strategies for mitigating anthropogenic volatile organic compounds

Author: Cucu, Maria Alexandra, Neri, Luisa, Sillo, Fabiano, Zampieri, Elisa, Calvo, Alice, Giovannini, Luca, De Benedictis, Cinzia, Zaldei, Alessandro, Gioli, Beniamino, Baraldi, Rita, and Balestrini, Raffaella
Published: 2024

35. Policy options for sustainable access to off-patent antibiotics in Europe

Author: Panteli, Dimitra, Anderson, Michael, Fieldman, Thomas, Baraldi, Enrico, Tängdén, Thomas, Vogler, Sabine, Årdal, Christine, and Mossialos, Elias
Published: 2024

36. Metabolomic analysis to predict the onset and severity of necrotizing enterocolitis

Author: Moschino, Laura, Verlato, Giovanna, Stocchero, Matteo, Giordano, Giuseppe, Pirillo, Paola, Meneghelli, Marta, Guiducci, Silvia, Duci, Miriam, Fascetti Leon, Francesco, and Baraldi, Eugenio
Published: 2024

37. Expression of human Interferon Regulatory Factor 3 (IRF-3) in alveolar macrophages relates to clinical and functional traits in COPD.

Author: Baraldo, Simonetta, Bonato, Matteo, Cassia, Sebastiano, Casolari, Paolo, De Ferrari, Laura, Tiné, Mariaenrica, Baraldi, Federico, Bigoni, Tommaso, Riccio, Anna Maria, Braido, Fulvio, Saetta, Marina, Papi, Alberto, and Contoli, Marco
Published: 2024

38. Role of serum complement C3 and C4 on kidney outcomes in IgA nephropathy

Author: Tringali, Edoardo, Vetrano, Daniele, Tondolo, Francesco, Maritati, Federica, Fabbrizio, Benedetta, Pasquinelli, Gianandrea, Provenzano, Michele, La Manna, Gaetano, and Baraldi, Olga
Published: 2024

39. Comparison of “IN-REC-SUR-E” and LISA in preterm neonates with respiratory distress syndrome: a randomized controlled trial (IN-REC-LISA trial)

Author: Vento, Giovanni, Paladini, Angela, Aurilia, C., Ozdemir, S. Alkan, Carnielli, V. P., Cools, F., Costa, S., Cota, F., Dani, C., Davis, P. G., Fattore, S., Fè, C., Finer, N., Fusco, F. P., Gizzi, C., Herting, E., Jian, M., Lio, A., Lista, G., Mosca, F., Nobile, S., Perri, A., Picone, S., Pillow, J. J., Polglase, G., Pasciuto, T., Pastorino, R., Tana, M., Tingay, D., Tirone, C., van Kaam, A. H., Ventura, M. L., Aceti, A., Agosti, M., Alighieri, G., Ancora, G., Angileri, V., Ausanio, G., Aversa, S., Balestri, E., Baraldi, E., Barbini, M. C., Barone, C., Beghini, R., Bellan, C., Berardi, A., Bernardo, I., Betta, P., Binotti, M., Bizzarri, B., Borgarello, G., Borgione, S., Borrelli, A., Bottino, R., Bracaglia, G., Bresesti, I., Burattini, I., Cacace, C., Calzolari, F., Campagnoli, M. F., Capasso, L., Capozza, M., Capretti, M. G., Caravetta, J., Carbonara, C., Cardilli, V., Carta, M., Castoldi, F., Castronovo, A., Cavalleri, E., Cavigioli, F., Cecchi, S., Chierici, V., Cimino, C., Cocca, F., Cocca, C., Cogo, P., Coma, M., Comito, V., Condò, V., Consigli, C., Conti, R., Corradi, M., Corsello, G., Corvaglia, L. T., Costa, A., Coscia, A., Cresi, F., Crispino, F., D’Amico, P., De Cosmo, L., De Maio, C., Del Campo, G., Di Credico, S., Di Fabio, S., Di Nicola, P., Di Paolo, A., Di Valerio, S., Distilo, A., Duca, V., Falcone, A., Falsaperla, R., Fasolato, V. A., Fatuzzo, V., Favini, F., Ferrarello, M. P., Ferrari, S., Nastro, F. Fiori, Forcellini, C. A., Fracchiolla, A., Gabriele, A., Galdo, F., Gallini, F., Gangemi, A., Gargano, G., Gazzolo, D., Gentile, M. 
P., Ghirardello, S., Giardina, F., Giordano, L., Gitto, E., Giuffrè, M., Grappone, L., Grasso, F., Greco, I., Grison, A., Guglielmino, R., Guidotti, I., Guzzo, I., La Forgia, N., La Placa, S., La Torre, G., Lago, P., Lanciotti, L., Lavizzari, A., Leo, F., Leonardi, V., Lestingi, D., Li, J., Liberatore, P., Lodin, D., Lubrano, R., Lucente, M., Luciani, S., Luvarà, D., Maffei, G., Maggio, A., Maggio, L., Maiolo, K., Malaigia, L., Mangili, G., Manna, A., Maranella, E., Marciano, A., Marcozzi, P., Marletta, M., Marseglia, L., Martinelli, D., Martinelli, S., Massari, S., Massenzi, L., Matina, F., Mattia, L., Mescoli, G., Migliore, I. V., Minghetti, D., Mondello, I., Montano, S., Morandi, G., Mores, N., Morreale, S., Morselli, I., Motta, M., Napolitano, M., Nardo, D., Nicolardi, A., Nider, S., Nigro, G., Nuccio, M., Orfeo, L., Ottaviano, C., Paganin, P., Palamides, S., Palatta, S., Paolillo, P., Pappalardo, M. G., Pasta, E., Patti, L., Paviotti, G., Perniola, R., Perotti, G., Perrone, S., Petrillo, F., Piazza, M. S., Piccirillo, A., Pierro, M., Piga, E., Pingitore, G. A., Pisu, S., Pittini, C., Pontiggia, F., Pontrelli, G., Primavera, A., Proto, A., Quartulli, L., Raimondi, F., Ramenghi, L., Rapsomaniki, M., Ricotti, A., Rigotti, C., Rinaldi, M., Risso, F. M., Roma, E., Romanini, E., Romano, V., Rosati, E., Rosella, V., Rulli, I., Salvo, V., Sanfilippo, C., Sannia, A., Saporito, A., Sauna, A., Scapillati, E., Schettini, F., Scorrano, A., Mantelli, S. Semeria, Sepporta, V., Sindico, P., Solinas, A., Sorrentino, E., Spaggiari, E., Staffler, A., Stella, M., Termini, D., Terrin, G., Testa, A., Tina, G., Tirantello, M., Tomasini, B., Tormena, F., Travan, L., Trevisanuto, D., Tuling, G., Tulino, V., Valenzano, L., Vedovato, S., Vendramin, S., Villani, P. E., Viola, S., Viola, V., Vitaliti, G., Vitaliti, M., Wanker, P., Yang, Y., Zanetta, S., and Zannin, E.
Published: 2024

40. Clinical and economic burden of respiratory syncytial virus in children aged 0–5 years in Italy

Author: Dovizio, Melania, Veronesi, Chiara, Bartolini, Fausto, Cavaliere, Arturo, Grego, Stefano, Pagliaro, Romina, Procacci, Cataldo, Ubertazzo, Loredana, Bertizzolo, Lorenzo, Muzii, Barbara, Parisi, Salvatore, Perrone, Valentina, Baraldi, Eugenio, Bozzola, Elena, Mosca, Fabio, and Esposti, Luca Degli
Published: 2024

41. Efficacy and feasibility of a novel semi-facial respirator with chitosan nanoparticles on the incidence of SARS-CoV-2 infection in healthcare professionals: randomized controlled trial

Author: Kubota, Aline Midori Adati, Rosa, Mário Fabrício Fleury, Baraldi, Solange, Vale, Janine Araújo Montefusco, da Silva, Joana D'arc Gonçalves, Carneiro, Marcella Lemos Brettas, Padula, Rosimeire Simprini, Haddad, Rodrigo, Joanitti, Graziella Anselmo, da Silva Luz, Glécia Virgolino, Fook, Marcus Vinícius Lia, Zimmermann, Ivan Ricardo, Rosa, Suélia de Siqueira Rodrigues Fleury, Peixoto, Henry Maia, and Luiz Carregaro, Rodrigo
Published: 2024

42. With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Author: Barraco, Manuele, Sarto, Sara, Cornia, Marcella, Baraldi, Lorenzo, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net. Comment: ICCV 2023
Published: 2023

43. A longitudinal study of math skills in heritage bilingual children: profiles of strengths and weaknesses

Author: Bonifacci, Paola, Baraldi, Serena, Codeluppi, Francesca, and Peri, Benedetta
Published: 2025

44. Impact of macronutrients intake on glycemic homeostasis of preterm infants: evidence from continuous glucose monitoring

Author: Guiducci, Silvia, Res, Giulia, Bonadies, Luca, Savio, Federica, Brigadoi, Sabrina, Priante, Elena, Trevisanuto, Daniele, Baraldi, Eugenio, and Galderisi, Alfonso
Published: 2024

45. Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Author: Cornia, Marcella, Baraldi, Lorenzo, Fiameni, Giuseppe, and Cucchiara, Rita
Published: 2024

46. Let's ViCE! Mimicking Human Cognitive Behavior in Image Generation Evaluation

Author: Betti, Federico, Staiano, Jacopo, Baraldi, Lorenzo, Cucchiara, Rita, and Sebe, Nicu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Research in Image Generation has recently made significant progress, particularly boosted by the introduction of Vision-Language models which are able to produce high-quality visual content based on textual inputs. Despite ongoing advancements in terms of generation quality and realism, no methodical frameworks have been defined yet to quantitatively measure the quality of the generated content and the adherence with the prompted requests: so far, only human-based evaluations have been adopted for quality satisfaction and for comparing different generative methods. We introduce a novel automated method for Visual Concept Evaluation (ViCE), i.e. to assess consistency between a generated/edited image and the corresponding prompt/instructions, with a process inspired by the human cognitive behaviour. ViCE combines the strengths of Large Language Models (LLMs) and Visual Question Answering (VQA) into a unified pipeline, aiming to replicate the human cognitive process in quality assessment. This method outlines visual concepts, formulates image-specific verification questions, utilizes the Q&A system to investigate the image, and scores the combined outcome. Although this brave new hypothesis of mimicking humans in the image evaluation process is in its preliminary assessment stage, results are promising and open the door to a new form of automatic evaluation which could have significant impact as the image generation or the image target editing tasks become more and more sophisticated. Comment: Accepted as oral at ACM MultiMedia 2023 (Brave New Ideas track)
Published: 2023

47. Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

Author: Baraldi, Lorenzo, Amoroso, Roberto, Cornia, Marcella, Pilzer, Andrea, and Cucchiara, Rita
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of visual tasks such as image classification. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a backbone by reconstructing visual tokens associated with randomly masked image patches. This masking approach, however, introduces noise into the input data during pre-training, leading to discrepancies that can impair performance during the fine-tuning phase. Furthermore, input masking neglects the dependencies between corrupted patches, increasing the inconsistencies observed in downstream fine-tuning tasks. To overcome these issues, we propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT), that employs autoregressive and permuted predictions to capture intra-patch dependencies. In addition, MaPeT employs auxiliary positional information to reduce the disparity between the pre-training and fine-tuning phases. In our experiments, we employ a fair setting to ensure reliable and meaningful comparisons and conduct investigations on multiple visual tokenizers, including our proposed $k$-CLIP which directly employs discretized CLIP features. Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting. Source code and trained models are publicly available at: https://github.com/aimagelab/MaPeT.
Published: 2023

48. Interfacial two-dimensional oxide enhances photocatalytic activity of graphene/titania via electronic structure modification

Author: De Angelis, Dario, Presel, Francesco, Jabeen, Naila, Bignardi, Luca, Lizzit, Daniel, Lacovig, Paolo, Lizzit, Silvano, Montini, Tiziano, Fornasiero, Paolo, Alfè, Dario, and Baraldi, Alessandro
Subjects: Condensed Matter - Materials Science
Abstract: A two-dimensional layer of oxide reveals itself as an essential element to drive the photocatalytic activity in a nanostructured hybrid material, which combines high-quality epitaxial graphene and titanium dioxide nanoparticles. In particular, it has been revealed that the addition of a 2D Ti oxide layer sandwiched between graphene and metal induces a p-doping of graphene and a consistent shift in the Ti d states. These modifications induced by the interfacial oxide layer reduce the probability of charge carrier recombination and enhance the photocatalytic activity of the heterostructure. This is indicative of a capital role played by thin oxide films in fine-tuning the properties of heterostructures based on graphene and paves the way to new combinations of graphene/oxides for photocatalysis-oriented applications.
Published: 2023

49. Correction to: RNA interference-based strategies to control Botrytis cinerea infection in cultivated strawberry

Author: Capriotti, Luca, Molesini, Barbara, Pandolfini, Tiziana, Jin, Hailing, Baraldi, Elena, Cecchin, Michela, Mezzetti, Bruno, and Sabbadini, Silvia
Published: 2024

50. RNA interference-based strategies to control Botrytis cinerea infection in cultivated strawberry

Author: Capriotti, Luca, Molesini, Barbara, Pandolfini, Tiziana, Jin, Hailing, Baraldi, Elena, Cecchin, Michela, Mezzetti, Bruno, and Sabbadini, Silvia
Published: 2024
