266 results for "Rita Cucchiara"
Search Results
2. Information fusion as an integrative cross-cutting enabler to achieve robust, explainable, and trustworthy medical artificial intelligence
- Author
-
Rita Cucchiara, Javier Del Ser, Wojciech Samek, Matthias Dehmer, Igor Jurisica, Isabelle Augenstein, Natalia Díaz-Rodríguez, Frank Emmert-Streib, and Andreas Holzinger
- Subjects
Artificial intelligence, Computer science, Inference, Trust, Neural-symbolic learning and reasoning, Robustness, Causal model, Explainability, Explainable AI, Graph-based machine learning, Information fusion, Medical AI, Complex network, Transformative learning, Workflow, Hardware and Architecture, Signal Processing, Software, Information Systems - Abstract
Andreas Holzinger acknowledges funding support from the Austrian Science Fund (FWF), Project P-32554 (explainable Artificial Intelligence), and from the European Union's Horizon 2020 research and innovation program under grant agreement 826078 (Feature Cloud). This publication reflects only the authors' view, and the European Commission is not responsible for any use that may be made of the information it contains. Natalia Díaz-Rodríguez is supported by the Spanish Government Juan de la Cierva Incorporación contract (IJC2019-039152-I). Isabelle Augenstein's research is partially funded by a DFF Sapere Aude research leader grant. Javier Del Ser acknowledges funding support from the Basque Government through the ELKARTEK program (3KIA project, KK-2020/00049) and the consolidated research group MATHMODE (ref. T1294-19). Wojciech Samek acknowledges funding support from the European Union's Horizon 2020 research and innovation program under grant agreement No. 965221 (iToBoS) and from the German Federal Ministry of Education and Research (refs. 01IS18025A, 01IS18037I, and 0310L0207C). Igor Jurisica acknowledges funding support from the Ontario Research Fund (RDI 34876), the Natural Sciences and Engineering Research Council (NSERC 203475), a CIHR Research Grant (93579), the Canada Foundation for Innovation (CFI 29272, 225404, 33536), IBM, the Ian Lawson van Toch Fund, and the Schroeder Arthritis Institute via the Toronto General and Western Hospital Foundation.
Medical artificial intelligence (AI) systems have been remarkably successful, even outperforming humans at certain tasks. There is no doubt that AI is important for improving human health in many ways and will disrupt various medical workflows in the future. To use AI to solve problems in medicine beyond the lab, in routine environments, we need to do more than just improve the performance of existing AI methods. Robust AI solutions must be able to cope with imprecise, missing, and incorrect information, and must explain both the result and the process of how it was obtained to a medical expert. Using conceptual knowledge as a guiding model of reality can help to develop more robust, explainable, and less biased machine learning models that can ideally learn from less data. Achieving these goals will require an orchestrated effort that combines three complementary Frontier Research Areas: (1) Complex Networks and their Inference, (2) Graph causal models and counterfactuals, and (3) Verification and Explainability methods. The goal of this paper is to describe these three areas from a unified view and to motivate how information fusion, applied in a comprehensive and integrative manner, can not only help bring these three areas together, but also play a transformative role by bridging the gap between research and practical applications in the context of future trustworthy medical AI. This makes it imperative to include ethical and legal aspects as a cross-cutting discipline, because all future solutions must not only be ethically responsible, but also legally compliant.
- Published
- 2022
3. Working Memory Connections for LSTM
- Author
-
Federico Landi, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
- Subjects
Machine Learning (cs.LG), Computer Vision and Pattern Recognition (cs.CV), Neural and Evolutionary Computing (cs.NE), Computation and Language (cs.CL), Cognitive Neuroscience, Artificial Intelligence, Memory cell, Working memory, Cell-to-gate connections, Gated RNNs, Image captioning, Language modeling, Long Short-Term Memory networks, Recurrent neural network, Language model - Abstract
Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of gating mechanisms to mitigate exploding and vanishing gradients when learning long-term dependencies. For this reason, LSTMs and other gated RNNs are widely adopted, being the de facto standard for many sequence modeling tasks. Although the memory cell inside the LSTM contains essential information, it is not allowed to influence the gating mechanism directly. In this work, we improve the gate potential by including information coming from the internal cell state. The proposed modification, named Working Memory Connection, consists of adding a learnable nonlinear projection of the cell content into the network gates (see the sketch after this entry). This modification can fit into the classical LSTM gates without any assumption on the underlying task, and it is particularly effective when dealing with longer sequences. Previous research efforts in this direction, which date back to the early 2000s, could not bring a consistent improvement over the vanilla LSTM. As part of this paper, we identify a key issue tied to previous connections that heavily limits their effectiveness, hence preventing a successful integration of the knowledge coming from the internal cell state. We show through extensive experimental evaluation that Working Memory Connections consistently improve the performance of LSTMs on a variety of tasks. Numerical results suggest that the cell state contains useful information that is worth including in the gate structure., Accepted for publication in Neural Networks
- Published
- 2021
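A minimal sketch of the gating change described in the entry above: a single LSTM step in which a learnable nonlinear projection of the previous cell state is added to the gate pre-activations. This is an illustration written for this listing, not the authors' code; the tanh projection, the layer sizes, and feeding the output gate from the previous (rather than the updated) cell state are simplifying assumptions.

import torch
import torch.nn as nn

class WMCLSTMCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.x2gates = nn.Linear(input_size, 4 * hidden_size)
        self.h2gates = nn.Linear(hidden_size, 4 * hidden_size)
        # Assumed form of the working memory connection: one projection of the
        # cell content per gate (input, forget, output).
        self.c2gates = nn.Linear(hidden_size, 3 * hidden_size)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = (self.x2gates(x) + self.h2gates(h)).chunk(4, dim=-1)
        # The cell state now influences the gates through a nonlinear projection.
        ci, cf, co = torch.tanh(self.c2gates(c)).chunk(3, dim=-1)
        i = torch.sigmoid(i + ci)
        f = torch.sigmoid(f + cf)
        c_new = f * c + i * torch.tanh(g)
        h_new = torch.sigmoid(o + co) * torch.tanh(c_new)
        return h_new, c_new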
4. Unifying tensor factorization and tensor nuclear norm approaches for low-rank tensor completion
- Author
-
Yuqing Shi, Qingjiang Xiao, Rita Cucchiara, Shiqiang Du, and Yide Ma
- Subjects
Tensor nuclear norm, Tensor factorization, Tensor completion, Low-rank tensor, Karush–Kuhn–Tucker conditions, Rank (linear algebra), Regularization (mathematics), Convex optimization, Singular value decomposition, Matrix norm, Inpainting, Computer science, Cognitive Neuroscience, Computer Science Applications, Artificial Intelligence, Algorithm - Abstract
Low-rank tensor completion (LRTC) has gained significant attention due to its powerful capability of recovering missing entries. However, it has to repeatedly calculate the time-consuming singular value decomposition (SVD). To address this drawback, we propose, based on the tensor-tensor product (t-product), a new LRTC method, the unified tensor factorization (UTF), for 3-way tensor completion. We first integrate tensor factorization (TF) and tensor nuclear norm (TNN) regularization into a framework that inherits the benefits of both TF and TNN: fast calculation and convex optimization (a generic form of such a combined objective is sketched after this entry). The conditions under which TF and TNN are equivalent are analyzed. Then, UTF for tensor completion is presented, an efficient iteratively updated algorithm based on the alternating direction method of multipliers (ADMM) is used for the UTF optimization, and the solution of the proposed alternating minimization algorithm is proven to converge to a Karush–Kuhn–Tucker (KKT) point. Finally, numerical experiments on synthetic data completion and image/video inpainting tasks demonstrate the effectiveness of our method over other state-of-the-art tensor completion methods.
- Published
- 2021
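To illustrate what integrating TF and TNN into one framework can look like, here is a generic completion objective combining a t-product factorization term with a tensor-nuclear-norm penalty, written in LaTeX. This is an illustrative form, not necessarily the exact UTF objective; the weighting lambda and the constraint style are assumptions.

% Illustrative TF + TNN objective (UTF's exact formulation may differ).
% * is the t-product, \Omega indexes the observed entries of \mathcal{M}.
\min_{\mathcal{X},\,\mathcal{A},\,\mathcal{B}}\;
  \tfrac{1}{2}\,\bigl\lVert \mathcal{X} - \mathcal{A} * \mathcal{B} \bigr\rVert_F^2
  \;+\; \lambda\,\lVert \mathcal{X} \rVert_{\mathrm{TNN}}
\quad \text{s.t.} \quad
  \mathcal{P}_{\Omega}(\mathcal{X}) = \mathcal{P}_{\Omega}(\mathcal{M})

An ADMM solver would then alternate updates of the factors and of the completed tensor, followed by a multiplier step.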
5. A computational approach for progressive architecture shrinkage in action recognition
- Author
-
Matteo Tomei, Rita Cucchiara, Simone Bronzin, Lorenzo Baraldi, and Giuseppe Fiameni
- Subjects
Computer science, Action recognition, video understanding, Artificial intelligence, architectural optimization, distributed training, Architecture, Software, Shrinkage - Published
- 2021
6. From Show to Tell: A Survey on Deep Learning-based Image Captioning
- Author
-
Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computation and Language (cs.CL), Computational Theory and Mathematics, Artificial Intelligence, Applied Mathematics, Computer Vision and Pattern Recognition, Software - Abstract
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful sentences. Starting in 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite the impressive results, research in image captioning has not reached a conclusive answer yet. This work aims at providing a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting the future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
- Published
- 2022
7. Explaining digital humanities by aligning images and textual descriptions
- Author
-
Rita Cucchiara, Massimiliano Corsini, Marcella Cornia, Lorenzo Baraldi, and Matteo Stefanini
- Subjects
Information retrieval, Computer science, Semi-supervised learning, Semantics, Cultural heritage, Artificial Intelligence, Digital humanities, Signal Processing, Embedding, Computer Vision and Pattern Recognition, Software - Abstract
Replicating the human ability to connect Vision and Language has recently been gaining a lot of attention in the Computer Vision and Natural Language Processing communities. This research effort has resulted in algorithms that can retrieve images from textual descriptions and vice versa, provided that realistic images and sentences with simple semantics are employed and that paired training data is available. In this paper, we go beyond these limitations and tackle the design of visual-semantic algorithms in the domain of the Digital Humanities. This setting not only involves more complex visual and semantic structures but also features a significant lack of training data, which makes the use of fully-supervised approaches infeasible. With this aim, we propose a joint visual-semantic embedding that can automatically align illustrations and textual elements without paired supervision (a sketch of the ranking loss typically used to train such embeddings follows this entry). This is achieved by transferring the knowledge learned on ordinary visual-semantic datasets to the artistic domain. Experiments, performed on two datasets specifically designed for this domain, validate the proposed strategies and quantify the domain shift between natural images and artworks.
- Published
- 2020
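Below, a minimal sketch of the hinge-based bidirectional ranking loss commonly used to train joint visual-semantic embeddings like the one in the entry above; the paper's unpaired/transfer setting adds further machinery, so the margin value and the sum-over-negatives choice here are assumptions.

import torch

def ranking_loss(img_emb, txt_emb, margin: float = 0.2):
    """img_emb, txt_emb: (B, d) L2-normalized embeddings; row i of each is a pair."""
    scores = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = scores.diag().unsqueeze(1)                     # matching pairs on the diagonal
    cost_txt = (margin + scores - pos).clamp(min=0)      # image vs. wrong captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()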
8. Focus on Impact: Indoor Exploration with Intrinsic Motivation
- Author
-
Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Robotics (cs.RO), Control and Optimization, Mechanical Engineering, Biomedical Engineering, Computer Science Applications, Human-Computer Interaction, Artificial Intelligence, Control and Systems Engineering, Computer Vision and Pattern Recognition - Abstract
Exploration of indoor environments has recently experienced a significant interest, also thanks to the introduction of deep neural agents built in a hierarchical fashion and trained with Deep Reinforcement Learning (DRL) on simulated environments. Current state-of-the-art methods employ a dense extrinsic reward that requires the complete a priori knowledge of the layout of the training environment to learn an effective exploration policy. However, such information is expensive to gather in terms of time and resources. In this work, we propose to train the model with a purely intrinsic reward signal to guide exploration, which is based on the impact of the robot's actions on its internal representation of the environment. So far, impact-based rewards have been employed for simple tasks and in procedurally generated synthetic environments with countable states. Since the number of states observable by the agent in realistic indoor environments is non-countable, we include a neural-based density model and replace the traditional count-based regularization with an estimated pseudo-count of previously visited states. The proposed exploration approach outperforms DRL-based competitors relying on intrinsic rewards and surpasses the agents trained with a dense extrinsic reward computed with the environment layouts. We also show that a robot equipped with the proposed approach seamlessly adapts to point-goal navigation and real-world deployment., Comment: Published in IEEE Robotics and Automation Letters. To appear in ICRA 2022
- Published
- 2022
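A compact sketch of the reward described above: the intrinsic signal grows with the impact of an action (the change in an encoded state representation) and is normalized by a pseudo-count of how often similar states were visited. The encoder and density model are passed in as functions, and the prediction-gain-to-pseudo-count conversion (Bellemare et al., 2016) is an assumption about the implementation, not taken from the paper.

import numpy as np

def intrinsic_reward(obs_t, obs_t1, encode, log_p_before, log_p_after):
    """Impact of the action, discounted for frequently visited states.

    encode: maps an observation to a feature vector (np.ndarray).
    log_p_before / log_p_after: log-density of obs_t1 under the density
    model before and after it is updated on obs_t1 (the prediction gain).
    """
    impact = np.linalg.norm(encode(obs_t1) - encode(obs_t))
    gain = max(log_p_after - log_p_before, 1e-8)
    pseudo_count = 1.0 / (np.expm1(gain) + 1e-8)   # n ~ 1 / (e^PG - 1)
    return impact / np.sqrt(pseudo_count + 1.0)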
9. Anomaly Detection, Localization and Classification for Railway Inspection
- Author
-
Giuseppe Scaglione, Andrea D'Eusanio, Stefano Pini, Guido Borghi, Eugenio Fedeli, Riccardo Gasparini, Simone Calderara, Rita Cucchiara, Riccardo Gasparini, Andrea D'Eusanio, Guido Borghi, Stefano Pini, Giuseppe Scaglione, Simone Calderara, Eugenio Fedeli, and Rita Cucchiara
- Subjects
Computer science, Inference, Railway inspection, Anomaly detection, Drone, Artificial intelligence, Data mining - Abstract
The ability to detect, localize, and classify anomalous objects is a challenging task in the computer vision community. In this paper, we tackle these tasks by developing a framework to automatically inspect the railway during the night. Specifically, the framework is able to predict the presence, the image coordinates, and the class of obstacles. To deal with the low-light environment, it is based on thermal images and consists of three different modules that address the problems of detecting anomalies, predicting their image coordinates, and classifying them. Moreover, due to the absolute lack of publicly-released datasets collected in the railway context for anomaly detection, we introduce a new multi-modal dataset, acquired from a rail drone, used to evaluate the proposed framework. Experimental results confirm the accuracy of the framework and its suitability, in terms of computational load, performance, and inference time, to be implemented on a self-powered inspection system.
- Published
- 2021
10. Multimodal Hand Gesture Classification for the Human–Car Interaction
- Author
-
Andrea D'Eusanio, Alessandro Simoni, Stefano Pini, Guido Borghi, Roberto Vezzani, and Rita Cucchiara
- Subjects
Computer Networks and Communications, Computer science, Automotive industry, Convolutional neural network, natural user interfaces, Computer vision, infrared images, Communication, Deep learning, depth maps, hand gesture recognition, Human-Computer Interaction, RGB color model, automotive, Artificial intelligence, User interface, Gesture - Abstract
The recent spread of low-cost and high-quality RGB-D and infrared sensors has supported the development of Natural User Interfaces (NUIs), in which the interaction is carried out without physical devices such as keyboards and mice. In this paper, we propose a NUI based on dynamic hand gestures, acquired with RGB, depth, and infrared sensors. The system is developed for the challenging automotive context, aiming at reducing the driver's distraction during the driving activity. Specifically, the proposed framework is based on a multimodal combination of Convolutional Neural Networks whose input is represented by depth and infrared images, achieving a good level of light invariance, a key element in vision-based in-car systems. We test our system on a recent multimodal dataset collected in a realistic automotive setting, placing the sensors in an innovative point of view, i.e., in the tunnel console looking upwards. The dataset consists of a large number of labelled frames containing 12 dynamic gestures performed by multiple subjects, making it suitable for deep learning-based approaches. In addition, we test the system on a different well-known public dataset created for the interaction between the driver and the car. Experimental results on both datasets reveal the efficacy and the real-time performance of the proposed method.
- Published
- 2020
11. Face-from-Depth for Head Pose Estimation on Depth Images
- Author
-
Guido Borghi, Matteo Fabbri, Roberto Vezzani, Simone Calderara, and Rita Cucchiara
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computer science, Automated Facial Recognition, Posture, Convolutional neural network, Imaging, Three-Dimensional, Artificial Intelligence, Computer vision, Pose, head pose estimation, depth cameras, depth frames, GAN, CNN, Applied Mathematics, Computational Theory and Mathematics, Face, RGB color model, Neural Networks, Computer, Head, Software, Algorithms - Abstract
Depth cameras make it possible to set up reliable solutions for people monitoring and behavior understanding, especially when unstable or poor illumination conditions make common RGB sensors unusable. We therefore propose a complete framework for the estimation of the head and shoulder pose based on depth images only. A head detection and localization module is also included, in order to develop a complete end-to-end system. The core element of the framework is a Convolutional Neural Network, called POSEidon+, that receives three types of images as input and provides the 3D angles of the pose as output. Moreover, a Face-from-Depth component based on a Deterministic Conditional GAN model is able to hallucinate a face from the corresponding depth image, and we empirically demonstrate that this positively impacts system performance. We test the proposed framework on two public datasets, namely Biwi Kinect Head Pose and ICT-3DHP, and on Pandora, a new challenging dataset mainly inspired by the automotive setup. Experimental results show that our method overcomes several recent state-of-the-art works based on both intensity and depth input data, running in real time at more than 30 frames per second., Comment: Submitted to IEEE Transactions on PAMI, updated version (second round). arXiv admin note: substantial text overlap with arXiv:1611.10195
- Published
- 2020
12. Mercury: A Vision-Based Framework for Driver Monitoring
- Author
-
Guido Borghi, Stefano Pini, Roberto Vezzani, and Rita Cucchiara
- Subjects
Pixel, Vision based, Computer science, Deep learning, Real-time computing, Automotive industry, Monitoring system, Convolutional neural network, driver monitoring, Artificial intelligence - Abstract
In this paper, we propose a complete framework, namely Mercury, that combines Computer Vision and Deep Learning algorithms to continuously monitor the driver during the driving activity. The proposed solution complies with the requirements imposed by the challenging automotive context. The first is light invariance, in order to have a system able to work regardless of the time of day and the weather conditions: therefore, infrared-based images, i.e. depth maps (in which each pixel corresponds to the distance between the sensor and that point in the scene), have been exploited in conjunction with traditional intensity images. The second is the non-invasiveness of the system, since the driver's movements must not be impeded during the driving activity: in this context, the use of cameras and vision-based algorithms is one of the best solutions. Finally, real-time performance is needed, since a monitoring system must react immediately as soon as a situation of potential danger is detected.
- Published
- 2020
13. Anomaly Detection for Vision-based Railway Inspection
- Author
-
Riccardo Gasparini, Stefano Pini, Guido Borghi, Giuseppe Scaglione, Simone Calderara, Eugenio Fedeli, and Rita Cucchiara
- Subjects
Vision based, Computer science, Deep learning, Railway inspection, Anomaly detection, Drone, Self-powered drone, Computer vision, RGB color model, Artificial intelligence - Abstract
The automatic inspection of railways for the detection of obstacles is a fundamental activity in order to guarantee the safety of train transport. Therefore, in this paper, we propose a vision-based framework that is able to detect obstacles during the night, when train circulation is usually suspended, using RGB or thermal images. Acquisition cameras and external light sources are placed on the front of a rail drone, and a new dataset is collected. Experiments show the accuracy of the proposed approach and its suitability, in terms of computational load, to be implemented on a self-powered drone.
- Published
- 2020
14. A Transformer-Based Network for Dynamic Hand Gesture Recognition
- Author
-
Andrea D'Eusanio, Alessandro Simoni, Stefano Pini, Guido Borghi, Roberto Vezzani, and Rita Cucchiara
- Subjects
Artificial neural network, Computer science, Dynamic Hand Gesture Recognition, depth maps, Feature extraction, Pattern recognition, Visualization, Gesture recognition, Artificial intelligence, Transformer (machine learning model), Gesture - Abstract
Transformer-based neural networks represent a successful self-attention mechanism that achieves state-of-the-art results in language understanding and sequence modeling. However, their application to visual data and, in particular, to the dynamic hand gesture recognition task has not yet been deeply investigated. In this paper, we propose a transformer-based architecture for the dynamic hand gesture recognition task. We show that the employment of a single active depth sensor, specifically the usage of depth maps and the surface normals estimated from them (see the sketch after this entry), achieves state-of-the-art results, overcoming all the methods available in the literature on two automotive datasets, namely NVidia Dynamic Hand Gesture and Briareo. Moreover, we test the method with other data types available with common RGB-D devices, such as infrared and color data. We also assess the performance in terms of inference time and number of parameters, showing that the proposed framework is suitable for an online in-car infotainment system.
- Published
- 2020
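Since the entry above relies on surface normals estimated from depth maps, here is one common way to compute them, via finite differences; the authors' exact estimator is not specified in the abstract, so treat this as an illustrative assumption.

import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """depth: (H, W) array; returns per-pixel unit normals, shape (H, W, 3)."""
    dz_dv, dz_du = np.gradient(depth)            # vertical, horizontal gradients
    normals = np.dstack((-dz_du, -dz_dv, np.ones_like(depth)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals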
15. Attentive models in vision: Computing saliency maps in the deep learning era
- Author
-
Simone Calderara, Rita Cucchiara, Andrea Palazzi, Marcella Cornia, Lorenzo Baraldi, and Davide Abati
- Subjects
Closed captioning, Vision, Computer science, Inference, Machine learning, Semantics, Convolutional neural network, Deep Learning, Artificial Intelligence, Computational model, Saliency, Human Attention, Image segmentation, Neuroscience - Abstract
Estimating the focus of attention of a person looking at an image or a video is a crucial step that can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, and autonomous driving are some examples. The early stages of attentive behavior are typically bottom-up; reproducing the same mechanism means finding the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures that outperform any early approach on public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures, which combine both bottom-up cues and higher-level semantics and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Eventually, we present a video-specific architecture based on the C3D network, which extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amounts of data, but represent well-defined architectures that recall very closely the early saliency models, although improved with the semantics learned from human ground truth.
- Published
- 2019
16. VITON-GT: An Image-based Virtual Try-On Model with Geometric Transformations
- Author
-
Marcella Cornia, Matteo Fincato, Rita Cucchiara, Federico Landi, and Fabio Cesari
- Subjects
Generalization, Computer science, Geometric transformation, Machine learning, User experience design, Artificial intelligence, Transformation geometry - Abstract
The large spread of online shopping has led computer vision researchers to develop different solutions for the fashion domain to potentially increase the online user experience and improve the efficiency of preparing fashion catalogs. Among them, image-based virtual try-on has recently attracted a lot of attention, resulting in several architectures that can generate a new image of a person wearing an input try-on garment in a plausible and realistic way. In this paper, we present VITON-GT, a new model for virtual try-on that generates high-quality and photo-realistic images thanks to multiple geometric transformations. In particular, our model is composed of a two-stage geometric transformation module that performs two different projections on the input garment, and a transformation-guided try-on module that synthesizes the new image. We experimentally validate the proposed solution on the most common dataset for this task, containing mainly t-shirts, and we demonstrate its effectiveness compared to different baselines and previous methods. Additionally, we assess the generalization capabilities of our model on a new set of fashion items composed of upper-body clothes from different categories. To the best of our knowledge, we are the first to test virtual try-on architectures in this challenging experimental setting.
- Published
- 2021
17. DAG-Net: Double Attentive Graph Neural Network for Trajectory Forecasting
- Author
-
Rita Cucchiara, Alessio Monti, Simone Calderara, and Alessia Bertugli
- Subjects
Machine Learning (cs.LG), Computer Vision and Pattern Recognition (cs.CV), Social robot, Computer science, Graph neural networks, Autonomous agent, Generative model, Trajectory, Artificial intelligence - Abstract
Understanding human motion behaviour is a critical task for several possible applications, like self-driving cars or social robots, and in general for all those settings where an autonomous agent has to navigate inside a human-centric environment. This is non-trivial because human motion is inherently multi-modal: given a history of human motion paths, there are many plausible ways in which people could move in the future. Additionally, people's activities are often driven by goals, e.g. reaching particular locations or interacting with the environment. We address the aforementioned aspects by proposing a new recurrent generative model that considers both single agents' future goals and interactions between different agents. The model exploits a double attention-based graph neural network to collect information about the mutual influences among different agents and to integrate it with data about agents' possible future objectives. Our proposal is general enough to be applied to different scenarios: the model achieves state-of-the-art results in both urban environments and sports applications., Comment: Accepted at ICPR 2020
- Published
- 2021
18. Watch Your Strokes: Improving Handwritten Text Recognition with Deformable Convolutions
- Author
-
Silvia Cascianelli, Rita Cucchiara, Iulian Cojocaru, Lorenzo Baraldi, and Massimiliano Corsini
- Subjects
Pixel, Orientation (computer vision), Computer science, Pattern recognition, Convolutional neural network, Convolution, Kernel (image processing), Handwriting recognition, Artificial intelligence - Abstract
Handwritten Text Recognition (HTR) in free-layout pages is a valuable yet challenging task that aims to automatically understand handwritten texts. State-of-the-art approaches in this field usually encode input images with Convolutional Neural Networks, whose kernels are typically defined on a fixed grid and focus on all input pixels independently. However, this is in contrast with the sparse nature of handwritten pages, in which only the pixels representing the ink of the writing are useful for the recognition task. Furthermore, the standard convolution operator is not explicitly designed to take into account the great variability in shape, scale, and orientation of handwritten characters. To overcome these limitations, we investigate the use of deformable convolutions for handwriting recognition (see the sketch after this entry). The kernel of this type of convolution deforms according to the content of the neighborhood and can therefore be more adaptable to geometric variations and other deformations of the text. Experiments conducted on the IAM and RIMES datasets demonstrate that deformable convolutions are a promising direction for the design of novel architectures for handwritten text recognition.
- Published
- 2021
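A minimal sketch of a deformable-convolution block in the spirit of the entry above, built on torchvision's DeformConv2d; the offset-predictor design and channel sizes are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # One (dy, dx) offset pair per kernel position, predicted from the input,
        # so the sampling grid deforms with the content (e.g., follows ink strokes).
        self.offsets = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=k // 2)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offsets(x))

# e.g., on a grayscale text-line image:
# feats = DeformableBlock(1, 64)(torch.randn(2, 1, 64, 512))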
19. Estimating (and fixing) the Effect of Face Obfuscation in Video Recognition
- Author
-
Lorenzo Baraldi, Matteo Tomei, Simone Bronzin, and Rita Cucchiara
- Subjects
Contextual image classification, Computer science, Obfuscation, Object recognition, Pattern recognition, Artificial intelligence, Video recognition, Facial recognition system, Visualization - Abstract
Recent research has shown that faces can be obfuscated in large-scale datasets with a minimal performance impact on image classification and downstream tasks like object recognition. In this paper, we investigate the role of face obfuscation in video classification datasets and quantify a more significant reduction in performance caused by face blurring. To mitigate this performance drop, we propose a generalized distillation approach in which a privacy-preserving action recognition network is trained with privileged information given by face identities. We show, through experiments performed on Kinetics-400, that the proposed approach can fully close the performance gap caused by face anonymization.
- Published
- 2021
20. Learning to Read L'Infinito: Handwritten Text Recognition with Synthetic Training Data
- Author
-
Maria Ludovica Piazzi, Lorenzo Baraldi, Marcella Cornia, Silvia Cascianelli, Rita Cucchiara, and Rosiana Schiuma
- Subjects
Training set, Computer science, Deep learning, Synthetic data, Handwriting, Learning to read, Artificial intelligence, Natural language processing - Abstract
Deep learning-based approaches to Handwritten Text Recognition (HTR) have shown remarkable results on publicly available large datasets, both modern and historical. However, historical manuscripts are often preserved in small collections, most of the time with unique characteristics in terms of paper support, author handwriting style, and language. State-of-the-art HTR approaches struggle to obtain good performance on such small manuscript collections, for which few training samples are available. In this paper, we focus on HTR for small historical datasets and propose a new dataset, which we call Leopardi, with the typical characteristics of small manuscript collections, consisting of letters by the poet Giacomo Leopardi, and we devise strategies to deal with this training data scarcity. In particular, we explore the use of carefully designed but cost-effective synthetic data for pre-training HTR models to be applied to small single-author manuscripts. Extensive experiments validate the suitability of the proposed approach, and both the Leopardi dataset and the synthetic data will be made available to foster further research in this direction.
- Published
- 2021
21. Improving Indoor Semantic Segmentation with Boundary-level Objectives
- Author
-
Roberto Amoroso, Lorenzo Baraldi, and Rita Cucchiara
- Subjects
Computer science, Boundary losses, Geometric distance, Robotics, Machine learning, Indoor scene understanding, Segmentation, Augmented reality, Artificial intelligence, Image retrieval - Abstract
While most of the recent literature on semantic segmentation has focused on outdoor scenarios, the generation of accurate indoor segmentation maps has been comparatively under-investigated, despite being a relevant task with applications in augmented reality, image retrieval, and personalized robotics. With the goal of increasing the accuracy of semantic segmentation in indoor scenarios, we develop and propose two novel boundary-level training objectives, which foster the generation of accurate boundaries between different semantic classes. In particular, we take inspiration from the Boundary and Active Boundary losses, two recent proposals which deal with the prediction of semantic boundaries, and propose modified geometric distance functions that improve predictions at the boundary level (see the sketch after this entry). Through experiments on the NYUDv2 dataset, we assess the appropriateness of our proposal in terms of accuracy and quality of boundary prediction and demonstrate its accuracy gain.
- Published
- 2021
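To make the boundary-level idea above concrete, here is one standard way to extract soft semantic boundaries with pooling (a morphological gradient), on top of which boundary losses are usually computed; the paper's modified geometric distance functions are not reproduced here, and the kernel size is an assumption.

import torch
import torch.nn.functional as F

def soft_boundary(class_maps: torch.Tensor, k: int = 3) -> torch.Tensor:
    """class_maps: (B, C, H, W) one-hot targets or soft predictions in [0, 1].

    Returns per-class boundary maps as dilation minus erosion.
    """
    eroded = -F.max_pool2d(-class_maps, k, stride=1, padding=k // 2)  # min-pool
    dilated = F.max_pool2d(class_maps, k, stride=1, padding=k // 2)
    return dilated - eroded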
22. Driver Face Verification with Depth Maps
- Author
-
Guido Borghi, Stefano Pini, Roberto Vezzani, and Rita Cucchiara
- Subjects
Computer science, fully-convolutional network, Computer vision, Deep learning, depth maps, driver face verification, automotive, Siamese model, Artificial intelligence, Electrical and Electronic Engineering, Instrumentation, Atomic and Molecular Physics, and Optics - Abstract
Face verification is the task of checking whether two provided images contain the face of the same person. In this work, we propose a fully-convolutional Siamese architecture to tackle this task (see the sketch after this entry), achieving state-of-the-art results on three publicly-released datasets, namely Pandora, the High-Resolution Range-based Face Database (HRRFaceD), and CurtinFaces. The proposed method takes depth maps as input, since depth cameras have been proven to be more reliable under different illumination conditions. Thus, the system is able to work even in the case of total or partial absence of external light sources, which is a key feature for automotive applications. From the algorithmic point of view, we propose a fully-convolutional architecture with a limited number of parameters, capable of dealing with the small amount of depth data available for training and able to run in real time even on a CPU and on embedded boards. The experimental results show accuracy sufficient to allow exploitation in real-world applications with on-board cameras. Finally, exploiting the presence of faces occluded by various head garments and extreme head poses available in the Pandora dataset, we successfully test the proposed system under strong visual occlusions as well. The excellent results obtained confirm the efficacy of the proposed method.
- Published
- 2019
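A minimal sketch of a fully-convolutional Siamese verifier on depth maps as described above: one shared encoder embeds both inputs, and a similarity score decides the match. The layer layout, embedding size, and cosine scoring are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class SiameseDepthVerifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(              # fully convolutional: no FC layers
            nn.Conv2d(1, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, depth_a: torch.Tensor, depth_b: torch.Tensor) -> torch.Tensor:
        ea = self.encoder(depth_a).flatten(1)      # (B, 128) embeddings
        eb = self.encoder(depth_b).flatten(1)
        # Cosine similarity as the verification score, thresholded downstream.
        return nn.functional.cosine_similarity(ea, eb)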
23. Domain Translation with Conditional GANs: from Depth to RGB Face-to-Face
- Author
-
Matteo Fabbri, Guido Borghi, Fabio Lanzi, Roberto Vezzani, Simone Calderara, and Rita Cucchiara
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computer science, Pattern recognition, Luminance, image translation, face analysis, RGB color model, Artificial intelligence - Abstract
Can faces acquired by low-cost depth sensors be useful to catch some characteristic details of the face? Typically the answer is no. However, new deep architectures can generate RGB images from data acquired in a different modality, such as depth data. In this paper, we propose a new Deterministic Conditional GAN, trained on annotated RGB-D face datasets, effective for a face-to-face translation from depth to RGB. Although the network cannot reconstruct the exact somatic features of unknown individual faces, it is capable of reconstructing plausible faces whose appearance is accurate enough to be used in many pattern recognition tasks. In fact, we test the network's capability to hallucinate with some Perceptual Probes, such as face aspect classification and landmark detection. Depth faces can be used in place of the corresponding RGB images, which are often unavailable due to difficult luminance conditions. Experimental results are very promising and far better than previously proposed approaches: this domain translation can constitute a new way to exploit depth data in future applications., Comment: Accepted at ICPR 2018
- Published
- 2019
24. Hand Gestures for the Human-Car Interaction: The Briareo Dataset
- Author
-
Fabio Manganaro, Stefano Pini, Guido Borghi, Roberto Vezzani, and Rita Cucchiara
- Subjects
Computer science, Deep learning, Natural User Interfaces, Gesture recognition, RGB color model, Computer vision, Artificial intelligence, User interface, Gesture - Abstract
Natural User Interfaces can be an effective way to reduce the driver's inattention during the driving activity. To this end, in this paper we propose a new dataset, called Briareo, specifically collected for the hand gesture recognition task in the automotive context. The dataset is acquired from an innovative point of view, exploiting different kinds of cameras, i.e. RGB, infrared stereo, and depth, that provide various types of images and 3D hand joints. Moreover, the dataset contains a significant number of hand gesture samples, performed by several subjects, allowing the use of deep learning-based approaches. Finally, a framework for hand gesture segmentation and classification is presented, together with a method introduced to assess the quality of the proposed dataset.
- Published
- 2019
25. Future Urban Scenes Generation Through Vehicles Synthesis
- Author
-
Simone Calderara, Rita Cucchiara, Andrea Palazzi, Alessandro Simoni, and Luca Bergamini
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computational Geometry (cs.CG), Computer science, Machine learning, Deep learning, Visualization, View synthesis, Artificial intelligence - Abstract
In this work we propose a deep learning pipeline to predict the future visual appearance of an urban scene. Despite recent advances, generating the entire scene in an end-to-end fashion is still far from being achieved. Instead, here we follow a two-stage approach, where interpretable information is included in the loop and each actor is modelled independently. We leverage a per-object novel view synthesis paradigm, i.e. generating a synthetic representation of an object undergoing a geometric roto-translation in 3D space. Our model can be easily conditioned with constraints (e.g. input trajectories) provided by state-of-the-art tracking methods or by the user. This allows us to generate a set of diverse, realistic futures starting from the same input in a multi-modal fashion. We visually and quantitatively show the superiority of this approach over traditional end-to-end scene-generation methods on CityFlow, a challenging real-world dataset., Accepted at ICPR2020
- Published
- 2020
26. AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction
- Author
-
Simone Calderara, Alessia Bertugli, Lamberto Ballan, Rita Cucchiara, and Pasquale Coscia
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), Time series, Computer science, Graph attention networks, Multi-future prediction, Trajectory forecasting, Variational recurrent neural networks, Trajectory Prediction, Recurrent neural network, Signal Processing, Computer Vision and Pattern Recognition, Artificial intelligence, Software - Abstract
Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots, and advanced video surveillance applications. A key component of this task is the inherently multi-modal nature of human paths, which admits multiple socially acceptable futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory prediction based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing the most likely moving directions and forcing the model to consider past observed dynamics when generating future positions. Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden-state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset, and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods., Accepted at Computer Vision and Image Understanding (CVIU)
- Published
- 2020
27. Compressed Volumetric Heatmaps for Multi-Person 3D Pose Estimation
- Author
-
Matteo Fabbri, Stefano Alletto, Simone Calderara, Rita Cucchiara, and Fabio Lanzi
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Monocular, Computer science, 3D pose estimation, Autoencoder, RGB color model, Computer vision, Artificial intelligence, Pose - Abstract
In this paper we present a novel approach for bottom-up multi-person 3D human pose estimation from monocular RGB images. We propose to use high-resolution volumetric heatmaps to model joint locations, devising a simple and effective compression method to drastically reduce the size of this representation. At the core of the proposed method lies our Volumetric Heatmap Autoencoder, a fully-convolutional network tasked with the compression of ground-truth heatmaps into a dense intermediate representation. A second model, the Code Predictor, is then trained to predict these codes, which can be decompressed at test time to re-obtain the original representation. Our experimental evaluation shows that our method performs favorably when compared to the state of the art on both multi-person and single-person 3D human pose estimation datasets and, thanks to our novel compression strategy, can process full-HD images at a constant runtime of 8 fps regardless of the number of subjects in the scene. Code and models are available at https://github.com/fabbrimatteo/LoCO., Comment: CVPR 2020
- Published
- 2020
28. RMS-Net: Regression and Masking for Soccer Event Spotting
- Author
-
Rita Cucchiara, Lorenzo Baraldi, Simone Calderara, Simone Bronzin, and Matteo Tomei
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Computer science, Pattern recognition, Spotting, Discriminative model, Test set, Timestamp, Artificial intelligence - Abstract
The recently proposed action spotting task consists in finding the exact timestamp at which an event occurs. This task fits particularly well with soccer videos, where events correspond to salient actions strictly defined by soccer rules (e.g., a goal occurs when the ball crosses the goal line). In this paper, we devise a lightweight and modular network for action spotting that can simultaneously predict the event label and its temporal offset using the same underlying features (see the sketch after this entry). We enrich our model with two training strategies: the first for data balancing and uniform sampling, the second for masking ambiguous frames and keeping the most discriminative visual cues. When tested on the SoccerNet dataset using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points. Additionally, it reaches a gain of more than 10 Average-mAP points on the test set when fine-tuned in combination with a strong 2D backbone.
- Published
- 2020
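Below, a minimal sketch of the two-head design mentioned above: the same clip features feed a classification head (event label, including a background class) and a regression head (temporal offset of the event within the clip). Feature and class dimensions are placeholders, not SoccerNet's actual configuration.

import torch
import torch.nn as nn

class SpottingHead(nn.Module):
    def __init__(self, feat_dim: int = 512, n_classes: int = 17):
        super().__init__()
        self.cls = nn.Linear(feat_dim, n_classes + 1)  # +1 for "no event"
        self.reg = nn.Linear(feat_dim, 1)              # offset within the clip

    def forward(self, clip_feats: torch.Tensor):
        """clip_feats: (B, feat_dim) pooled features of a video clip."""
        return self.cls(clip_feats), self.reg(clip_feats)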
29. A Novel Attention-based Aggregation Function to Combine Vision and Language
- Author
-
Rita Cucchiara, Matteo Stefanini, Lorenzo Baraldi, and Marcella Cornia
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), Computation and Language (cs.CL), Closed captioning, Computer science, Machine learning, Question answering, Image retrieval, Deep learning, Artificial intelligence - Abstract
The joint understanding of vision and language has recently been gaining a lot of attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements, like regions and words, proper reduction functions are needed to transform a set of encoded elements into a single response, such as a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality employing a novel variant of cross-attention, and performs a learnable and cross-modal reduction, which can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices, on both the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach., ICPR 2020
- Published
- 2020
30. Meshed-Memory Transformer for Image Captioning
- Author
-
Lorenzo Baraldi, Rita Cucchiara, Marcella Cornia, and Matteo Stefanini
- Subjects
FOS: Computer and information sciences ,Closed captioning ,Training set ,Computer Science - Computation and Language ,Machine translation ,Computer science ,business.industry ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,Machine learning ,01 natural sciences ,Visualization ,0202 electrical engineering, electronic engineering, information engineering ,020201 artificial intelligence & image processing ,Artificial intelligence ,Language translation ,business ,Computation and Language (cs.CL) ,computer ,Decoding methods ,0105 earth and related environmental sciences ,Transformer (machine learning model) - Abstract
Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M$^2$ - a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions, integrating learned a priori knowledge, and uses a mesh-like connectivity at the decoding stage to exploit both low- and high-level features. Experimentally, we investigate the performance of the M$^2$ Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer., CVPR 2020
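The memory mechanism can be pictured as learned slots appended to the keys and values of self-attention, so that attention can retrieve a priori knowledge not present in the input regions. The PyTorch sketch below illustrates the idea under assumed dimensions; it is a simplification, not the released implementation.

import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    def __init__(self, dim=512, n_mem=40):       # number of memory slots is illustrative
        super().__init__()
        self.mem_k = nn.Parameter(torch.randn(n_mem, dim) / dim ** 0.5)
        self.mem_v = nn.Parameter(torch.randn(n_mem, dim) / dim ** 0.5)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, regions):                  # (B, N, D) image region features
        b = regions.shape[0]
        k = torch.cat([regions, self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([regions, self.mem_v.expand(b, -1, -1)], dim=1)
        out, _ = self.attn(regions, k, v)        # queries attend to regions plus memory
        return out

x = torch.randn(2, 36, 512)
y = MemoryAugmentedAttention()(x)                # (2, 36, 512)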
- Published
- 2019
31. A Systematic Comparison of Depth Map Representations for Face Recognition
- Author
-
Stefano Pini, Guido Borghi, Roberto Vezzani, Davide Maltoni, and Rita Cucchiara
- Subjects
Databases, Factual ,Computer science ,Point cloud ,02 engineering and technology ,lcsh:Chemical technology ,computer.software_genre ,Biochemistry ,Convolutional neural network ,Facial recognition system ,Analytical Chemistry ,Voxel ,0202 electrical engineering, electronic engineering, information engineering ,Dataset ,lcsh:TP1-1185 ,Instrumentation ,Depth sensors ,Depth map representations ,Atomic and Molecular Physics, and Optics ,020201 artificial intelligence & image processing ,Depth map ,Face recognition ,Surface normal ,Algorithms ,Normalization (statistics) ,Article ,Electrical and Electronic Engineering ,business.industry ,020207 software engineering ,Pattern recognition ,Neural Networks, Computer ,Artificial intelligence ,business ,computer
Nowadays, we are witnessing the wide diffusion of active depth sensors. However, the generalization capabilities and performance of deep face recognition approaches based on depth data are hindered by the different sensor technologies and by the currently available depth-based datasets, which are limited in size and acquired with the same device. In this paper, we present an analysis of the use of depth maps, as obtained by active depth sensors, and of deep neural architectures for the face recognition task. We compare different depth data representations (depth and normal images, voxels, point clouds), deep models (two-dimensional and three-dimensional Convolutional Neural Networks, PointNet-based networks), and pre-processing and normalization techniques in order to determine the configuration that maximizes the recognition accuracy and is capable of generalizing better on unseen data and novel acquisition settings. Extensive intra- and cross-dataset experiments, performed on four public databases, suggest that representations and methods based on normal images and point clouds perform and generalize better than other 2D and 3D alternatives. Moreover, we propose a novel challenging dataset, namely MultiSFace, in order to specifically analyze the influence of the depth map quality and of the acquisition distance on the face recognition accuracy.
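For reference, a normal image can be derived from a depth map with simple finite differences, as in the NumPy sketch below; camera intrinsics and units are deliberately ignored, so this is only a didactic approximation of one of the compared representations.

import numpy as np

def depth_to_normals(depth):
    # Per-pixel surface normals from depth gradients (orthographic approximation).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float32))
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth, dtype=np.float32)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return (normals + 1.0) / 2.0                 # map [-1, 1] to [0, 1] for a normal image

depth = np.random.rand(480, 640) * 1000.0        # placeholder depth map (e.g. millimetres)
normal_image = depth_to_normals(depth)           # (480, 640, 3)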
- Published
- 2021
32. Pattern Recognition. ICPR International Workshops and Challenges : Virtual Event, January 10–15, 2021, Proceedings, Part VI
- Author
-
Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani
- Subjects
- Computer vision, Application software, Artificial intelligence, Computers
- Abstract
This 8-volume set constitutes the refereed proceedings of the workshops of the 25th International Conference on Pattern Recognition, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10–15, 2021 due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.
- Published
- 2021
33. Pattern Recognition. ICPR International Workshops and Challenges : Virtual Event, January 10–15, 2021, Proceedings, Part III
- Author
-
Alberto Del Bimbo, Rita Cucchiara, Stan Sclaroff, Giovanni Maria Farinella, Tao Mei, Marco Bertini, Hugo Jair Escalante, and Roberto Vezzani
- Subjects
- Computer vision, Application software, Artificial intelligence, Computers
- Abstract
This 8-volume set constitutes the refereed proceedings of the workshops of the 25th International Conference on Pattern Recognition, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10–15, 2021 due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.
- Published
- 2021
34. Learning to Generate Facial Depth Maps
- Author
-
Stefano Pini, Filippo Grazioli, Guido Borghi, Roberto Vezzani, and Rita Cucchiara
- Subjects
FOS: Computer and information sciences ,Monocular ,facial depth map estimation ,business.industry ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Supervised learning ,Computer Science - Computer Vision and Pattern Recognition ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Pattern recognition ,02 engineering and technology ,010501 environmental sciences ,Visual appearance ,01 natural sciences ,Task (project management) ,Depth map ,Face verification ,Face (geometry) ,0202 electrical engineering, electronic engineering, information engineering ,Task analysis ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,0105 earth and related environmental sciences - Abstract
In this paper, an adversarial architecture for facial depth map estimation from monocular intensity images is presented. Following an image-to-image approach, we combine the advantages of supervised learning and adversarial training, proposing a conditional Generative Adversarial Network that effectively learns to translate intensity face images into the corresponding depth maps. Two public datasets, namely the Biwi database and the Pandora dataset, are exploited to demonstrate that the proposed model generates high-quality synthetic depth images, both in terms of visual appearance and informative content. Furthermore, we show that the model is capable of predicting distinctive facial details by testing the generated depth maps through a deep model trained on authentic depth maps for the face verification task.
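A conditional GAN of this kind is typically trained by alternating a discriminator step on (intensity, depth) pairs and a generator step that combines an adversarial term with a reconstruction term. The PyTorch sketch below shows one such pix2pix-style iteration with toy networks; the architectures, weights and loss mix are assumptions, not the paper's exact setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
D = nn.Sequential(nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
                  nn.Flatten(), nn.LazyLinear(1))
opt_g, opt_d = torch.optim.Adam(G.parameters()), torch.optim.Adam(D.parameters())
gray, depth = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)   # toy batch

fake = G(gray)
# Discriminator step: real pairs -> 1, generated pairs -> 0.
d_loss = F.binary_cross_entropy_with_logits(D(torch.cat([gray, depth], 1)), torch.ones(4, 1)) \
       + F.binary_cross_entropy_with_logits(D(torch.cat([gray, fake.detach()], 1)), torch.zeros(4, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: fool D while staying close to the ground-truth depth (L1 term).
g_loss = F.binary_cross_entropy_with_logits(D(torch.cat([gray, fake], 1)), torch.ones(4, 1)) \
       + 100.0 * F.l1_loss(fake, depth)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()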
- Published
- 2018
35. Fully Convolutional Network for Head Detection with Depth Images
- Author
-
Diego Ballotta, Guido Borghi, Roberto Vezzani, and Rita Cucchiara
- Subjects
head detection, depth maps ,business.industry ,Computer science ,Deep learning ,Detector ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,02 engineering and technology ,010501 environmental sciences ,Object (computer science) ,01 natural sciences ,Face (geometry) ,0202 electrical engineering, electronic engineering, information engineering ,Key (cryptography) ,RGB color model ,020201 artificial intelligence & image processing ,Computer vision ,Artificial intelligence ,business ,0105 earth and related environmental sciences - Abstract
Head detection and localization are among the most investigated and demanding tasks of the Computer Vision community. They are also a key element for many disciplines, like Human Computer Interaction, Human Behavior Understanding, Face Analysis and Video Surveillance. In recent decades, many efforts have been conducted to develop accurate and reliable head or face detectors on standard RGB images, but only a few solutions concern other types of images, such as depth maps. In this paper, we propose a novel method for head detection on depth images, based on a deep learning approach. In particular, the presented system overcomes the classic sliding-window approach, which is often the main computational bottleneck of many object detectors, through a Fully Convolutional Network. Two public datasets, namely Pandora and Watch-n-Patch, are exploited to train and test the proposed network. Experimental results confirm the effectiveness of the method, which is able to exceed all the state-of-the-art works based on depth images and to run with real-time performance.
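The appeal of a fully convolutional detector is that one forward pass yields a dense head-probability map, from which detections are read off directly instead of re-scoring thousands of sliding windows. A minimal PyTorch sketch with toy layers, not the paper's network:

import torch
import torch.nn as nn

fcn = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1),                       # 1x1 conv: per-location head score
)

depth = torch.rand(1, 1, 240, 320)             # toy depth image
heatmap = torch.sigmoid(fcn(depth))[0, 0]      # (240, 320) probability map
y, x = divmod(int(heatmap.argmax()), heatmap.shape[1])
print(f"most likely head location: ({x}, {y})")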
- Published
- 2018
36. Hands on the wheel: a Dataset for Driver Hand Detection and Tracking
- Author
-
Guido Borghi, Elia Frigieri, Roberto Vezzani, and Rita Cucchiara
- Subjects
050210 logistics & transportation ,Point (typography) ,Computer science ,business.industry ,Hand detection ,Automotive ,Dataset ,05 social sciences ,Automotive industry ,020206 networking & telecommunications ,Context (language use) ,02 engineering and technology ,Interaction systems ,Steering wheel ,Tracking (particle physics) ,Leap motion ,0502 economics and business ,0202 electrical engineering, electronic engineering, information engineering ,Computer vision ,Artificial intelligence ,business ,Gesture - Abstract
The ability to detect, localize and track the hands is crucial in many applications that require an understanding of a person's behavior, attitude and interactions. This is particularly true in the automotive context, in which hand analysis makes it possible to predict preparatory movements for maneuvers or to investigate the driver's attention level. Moreover, thanks to the recent diffusion of cameras inside new car cockpits, it is feasible to use hand gestures to develop new Human-Car Interaction systems that are more user-friendly and safe. In this paper, we propose a new dataset, called Turms, that consists of infrared images of the driver's hands, collected from the back of the steering wheel, an innovative point of view. The Leap Motion device was selected for the recordings, thanks to its stereo capabilities and wide view angle. In addition, we introduce a method to detect the presence and location of the driver's hands on the steering wheel during driving tasks.
- Published
- 2018
37. Recognizing and Presenting the Storytelling Video Structure With Deep Multimodal Networks
- Author
-
Rita Cucchiara, Costantino Grana, and Lorenzo Baraldi
- Subjects
FOS: Computer and information sciences ,Semantic feature ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Feature vector ,Feature extraction ,Computer Science - Computer Vision and Pattern Recognition ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,02 engineering and technology ,Semantics ,computer.software_genre ,scene detection ,temporal video segmentation ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Computer vision ,Electrical and Electronic Engineering ,business.industry ,Deep networks ,020207 software engineering ,Semantic property ,performance evaluation ,Computer Science Applications ,Visualization ,Euclidean distance ,Feature (computer vision) ,Signal Processing ,Embedding ,020201 artificial intelligence & image processing ,Deep networks, performance evaluation, scene detection, temporal video segmentation ,Artificial intelligence ,business ,computer ,Natural language processing - Abstract
This paper presents a novel approach for temporal and semantic segmentation of edited videos into meaningful segments, from the point of view of the storytelling structure. The objective is to decompose a long video into more manageable sequences, which can in turn be used to retrieve the most significant parts of it given a textual query and to provide an effective summarization. Previous video decomposition methods mainly employed perceptual cues, tackling the problem either as a story change detection, or as a similarity grouping task, and the lack of semantics limited their ability to identify story boundaries. Our proposal connects together perceptual, audio and semantic cues in a specialized deep network architecture designed with a combination of CNNs which generate an appropriate embedding, and clusters shots into connected sequences of semantic scenes, i.e. stories. A retrieval presentation strategy is also proposed, by selecting the semantically and aesthetically "most valuable" thumbnails to present, considering the query in order to improve the storytelling presentation. Finally, the subjective nature of the task is considered, by conducting experiments with different annotators and by proposing an algorithm to maximize the agreement between automatic results and human annotators.
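While the paper clusters shots with a learned multimodal embedding, the gist of temporally constrained grouping can be conveyed with a much simpler greedy rule: start a new story whenever the next shot drifts too far from the running scene centroid. The NumPy sketch below is only that didactic stand-in, with an arbitrary threshold, not the paper's clustering algorithm.

import numpy as np

def shots_to_scenes(embeddings, threshold=0.5):
    scenes, current = [], [0]
    for i in range(1, len(embeddings)):
        centroid = embeddings[current].mean(axis=0)
        cos = embeddings[i] @ centroid / (np.linalg.norm(embeddings[i]) * np.linalg.norm(centroid))
        if 1.0 - cos > threshold:                # too dissimilar: close the current scene
            scenes.append(current); current = [i]
        else:
            current.append(i)
    scenes.append(current)
    return scenes                                # lists of consecutive shot indices

shot_embeddings = np.random.randn(20, 128)       # one embedding per shot
print(shots_to_scenes(shot_embeddings))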
- Published
- 2017
38. Segmentation models diversity for object proposals
- Author
-
Rita Cucchiara, Marco Manfredi, Costantino Grana, Arnold W. M. Smeulders, and Intelligent Sensory Information Systems (IVI, FNWI)
- Subjects
Computer science ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Scale-space segmentation ,02 engineering and technology ,Machine learning ,computer.software_genre ,Segmentation ,Object proposals ,0202 electrical engineering, electronic engineering, information engineering ,computer.programming_language ,business.industry ,Segmentation-based object categorization ,Supervised learning ,020207 software engineering ,Image segmentation ,Pascal (programming language) ,Signal Processing ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,computer ,Software - Abstract
Highlights: we present an efficient segmentation proposal method; starting from bounding boxes, we obtain precise segmentation masks; we diversify segmentation strategies using class-agnostic features; we demonstrate how diversifying segmentation strategies greatly boosts accuracy. In this paper we present a segmentation proposal method which employs a box-hypotheses generation step followed by a lightweight segmentation strategy. Inspired by interactive segmentation, for each automatically placed bounding box we compute a precise segmentation mask. We introduce diversity in segmentation strategies, enhancing the performance of a generic model by exploiting class-independent regional appearance features. Foreground probability scores are learned from groups of objects with peculiar characteristics to specialize segmentation models. We demonstrate results comparable to the state of the art on PASCAL VOC 2012 and a further improvement by merging our proposals with those of a recent solution. The ability to generalize to unseen object categories is demonstrated on Microsoft COCO 2014.
- Published
- 2017
39. Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions
- Author
-
Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara
- Subjects
FOS: Computer and information sciences ,Closed captioning ,Sequence ,Computer Science - Computation and Language ,Theoretical computer science ,Computer science ,business.industry ,Computer Vision and Pattern Recognition (cs.CV) ,Deep learning ,Deep Learning ,Vision + Language ,Visual Reasoning ,Computer Science - Computer Vision and Pattern Recognition ,Context (language use) ,02 engineering and technology ,Visual reasoning ,010501 environmental sciences ,01 natural sciences ,Set (abstract data type) ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Computation and Language (cs.CL) ,0105 earth and related environmental sciences - Abstract
Current captioning approaches can describe images using black-box architectures whose behavior is hardly controllable and explainable from the outside. As an image can be described in infinite ways depending on the goal and the context at hand, a higher degree of controllability is needed to apply captioning algorithms in complex scenarios. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by allowing both grounding and controllability. Given a control signal in the form of a sequence or set of image regions, we generate the corresponding caption through a recurrent architecture which predicts textual chunks explicitly grounded on regions, following the constraints of the given control. Experiments are conducted on Flickr30k Entities and on COCO Entities, an extended version of COCO in which we add grounding annotations collected in a semi-automatic manner. Results demonstrate that our method achieves state-of-the-art performance on controllable image captioning, in terms of caption quality and diversity. Code and annotations are publicly available at: https://github.com/aimagelab/show-control-and-tell., Comment: CVPR 2019
- Published
- 2019
40. M-VAD Names: a Dataset for Video Captioning with Naming
- Author
-
Federico Bolelli, Stefano Pini, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
- Subjects
Closed captioning ,FOS: Computer and information sciences ,Computer Networks and Communications ,business.industry ,Computer science ,Character (computing) ,Deep learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,020207 software engineering ,Economic shortage ,02 engineering and technology ,computer.software_genre ,Task (project management) ,Annotation ,Hardware and Architecture ,0202 electrical engineering, electronic engineering, information engineering ,Media Technology ,Proper noun ,Artificial intelligence ,business ,computer ,Software ,Natural language processing - Abstract
Current movie captioning architectures are not capable of mentioning characters with their proper name, replacing them with a generic "someone" tag. The lack of movie description datasets with characters' visual annotations surely plays a relevant role in this shortage. Recently, we proposed to extend the M-VAD dataset by introducing such information. In this paper, we present an improved version of the dataset, namely M-VAD Names, and its semi-automatic annotation procedure. The resulting dataset contains 63k visual tracks and 34k textual mentions, all associated with character identities. To showcase the features of the dataset and quantify the complexity of the naming task, we investigate multimodal architectures to replace the "someone" tags with proper character names in existing video captions. The evaluation is further extended by testing this application on videos outside of the M-VAD Names dataset., Source Code: https://github.com/aimagelab/mvad-names-dataset - Video Demo: https://youtu.be/dOvtAXbOOH4
- Published
- 2019
41. Can adversarial networks hallucinate occluded people with a plausible aspect?
- Author
-
Federico Fulgeri, Stefano Alletto, Rita Cucchiara, Matteo Fabbri, and Simone Calderara
- Subjects
FOS: Computer and information sciences ,Attribute recognition ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,02 engineering and technology ,Image (mathematics) ,Computer graphics ,Occlusions ,Discriminative model ,GAN ,0202 electrical engineering, electronic engineering, information engineering ,Computer vision ,Video game ,Pixel ,Artificial neural network ,business.industry ,Deep learning ,020207 software engineering ,Hallucinating ,Signal Processing ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,Software - Abstract
When you see a person in a crowd, occluded by other people, you miss visual information that could be used to recognize, re-identify or simply classify him or her; you can only imagine their appearance given your experience. Similarly, AI solutions can try to hallucinate the missing information with specific deep learning architectures, suitably trained with people with and without occlusions. The goal of this work is to generate a complete image of a person, given an occluded version as input, that is a) without occlusion, b) similar at the pixel level to a completely visible person shape, and c) able to preserve similar visual attributes (e.g. male/female) of the original one. For this purpose, we propose a new approach integrating state-of-the-art neural network architectures, namely U-Nets and GANs, as well as discriminative attribute classification nets, with an architecture specifically designed to de-occlude people shapes. The network is trained to optimize a loss function which takes into account the aforementioned objectives. We also propose two datasets for testing our solution: the first, occluded RAP, created automatically by occluding real shapes of the RAP dataset (which also collects attributes of people's appearance); the second, AiC, a large synthetic dataset generated in computer graphics with data extracted from the GTA video game, which contains 3D data of occluded objects by construction. Results are impressive and outperform any previous proposal. This result could be an initial step for much further research on recognizing people and their behavior in an open crowded world., Under review at CVIU
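The three objectives map naturally onto a composite loss: a pixel term against the visible shape, an adversarial term, and an attribute-preservation term. A PyTorch sketch under assumed inputs and illustrative weights, not the published formulation:

import torch
import torch.nn.functional as F

def deocclusion_loss(generated, target, d_logits, attr_logits, attr_labels,
                     w_adv=0.01, w_attr=0.1):    # weights are illustrative
    pixel = F.l1_loss(generated, target)                                   # b) pixel similarity
    adv = F.binary_cross_entropy_with_logits(d_logits, torch.ones_like(d_logits))  # a) realism
    attr = F.binary_cross_entropy_with_logits(attr_logits, attr_labels)    # c) attribute match
    return pixel + w_adv * adv + w_attr * attr

gen, tgt = torch.rand(2, 3, 128, 64), torch.rand(2, 3, 128, 64)
loss = deocclusion_loss(gen, tgt, torch.randn(2, 1), torch.randn(2, 5),
                        torch.randint(0, 2, (2, 5)).float())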
- Published
- 2019
42. Predicting the Driver's Focus of Attention: the DR(eye)VE Project
- Author
-
Simone Calderara, Davide Abati, Rita Cucchiara, Andrea Palazzi, and Francesco Solera
- Subjects
FOS: Computer and information sciences ,Focus (computing) ,Matching (statistics) ,Computer science ,business.industry ,Computer Vision and Pattern Recognition (cs.CV) ,Applied Mathematics ,Computer Science - Computer Vision and Pattern Recognition ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Context (language use) ,02 engineering and technology ,Object detection ,Task (project management) ,Visualization ,Computational Theory and Mathematics ,Artificial Intelligence ,Human–computer interaction ,0202 electrical engineering, electronic engineering, information engineering ,Task analysis ,020201 artificial intelligence & image processing ,Computer Vision and Pattern Recognition ,Artificial intelligence ,business ,Software - Abstract
In this work we aim to predict the driver's focus of attention. The goal is to estimate what a person would pay attention to while driving, and which part of the scene around the vehicle is more critical for the task. To this end we propose a new computer vision model based on a multi-branch deep architecture that integrates three sources of information: raw video, motion and scene semantics. We also introduce DR(eye)VE, the largest dataset of driving scenes for which eye-tracking annotations are available. This dataset features more than 500,000 registered frames, matching ego-centric views (from glasses worn by drivers) and car-centric views (from roof-mounted camera), further enriched by other sensors measurements. Results highlight that several attention patterns are shared across drivers and can be reproduced to some extent. The indication of which elements in the scene are likely to capture the driver's attention may benefit several applications in the context of human-vehicle interaction and driver attention analysis., Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence
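The multi-branch idea, reduced to its skeleton, is three parallel streams (appearance, motion, semantics) whose per-pixel predictions are fused into one fixation map. The toy PyTorch sketch below assumes three-channel inputs and tiny convolutional stacks for every branch, unlike the paper's deeper video branches:

import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))
        self.rgb, self.flow, self.sem = branch(), branch(), branch()

    def forward(self, rgb, flow, sem):
        s = self.rgb(rgb) + self.flow(flow) + self.sem(sem)   # late fusion by summation
        return torch.sigmoid(s)                               # per-pixel fixation probability

inputs = [torch.rand(1, 3, 112, 112) for _ in range(3)]       # frame, flow, semantics
saliency = MultiBranchAttention()(*inputs)                    # (1, 1, 112, 112)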
- Published
- 2019
43. Towards Cycle-Consistent Models for Text and Image Retrieval
- Author
-
Marcella Cornia, Hamed R. Tavakoli, Lorenzo Baraldi, and Rita Cucchiara
- Subjects
Computer science ,business.industry ,020207 software engineering ,02 engineering and technology ,Space (commercial competition) ,Translation (geometry) ,Machine learning ,computer.software_genre ,Domain (software engineering) ,Development (topology) ,Feature (computer vision) ,0202 electrical engineering, electronic engineering, information engineering ,Embedding ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Joint (audio engineering) ,computer ,Image retrieval - Abstract
Cross-modal retrieval has recently become a hot research topic, thanks to the development of deeply-learnable architectures. Such architectures generally learn a joint multi-modal embedding space in which text and images can be projected and compared. Here we investigate a different approach, and reformulate the problem of cross-modal retrieval as that of learning a translation between the textual and visual domains. In particular, we propose an end-to-end trainable model which can translate text into image features and vice versa, and regularizes this mapping with a cycle-consistency criterion. Preliminary experimental evaluations show promising results with respect to ordinary visual-semantic models.
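The translation-plus-cycle idea can be written down compactly: two mappings between the feature spaces, a translation loss on matched pairs, and a cycle term requiring a round trip to return to the start. The PyTorch sketch below uses linear placeholders and random "matched" features purely for shape-checking; the actual networks and loss weighting are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512
T = nn.Linear(dim, dim)     # text features -> image feature space (placeholder)
I = nn.Linear(dim, dim)     # image features -> text feature space (placeholder)

text_feat, img_feat = torch.randn(8, dim), torch.randn(8, dim)  # pretend matched pairs
translation = F.mse_loss(T(text_feat), img_feat) + F.mse_loss(I(img_feat), text_feat)
cycle = F.mse_loss(I(T(text_feat)), text_feat) + F.mse_loss(T(I(img_feat)), img_feat)
loss = translation + 0.1 * cycle                                # cycle weight is illustrative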
- Published
- 2019
44. What was Monet seeing while painting? Translating artworks to photo-realistic images
- Author
-
Matteo Tomei, Lorenzo Baraldi, Marcella Cornia, and Rita Cucchiara
- Subjects
060201 languages & linguistics ,Painting ,Similarity (geometry) ,Pixel ,Computer science ,business.industry ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Context (language use) ,06 humanities and the arts ,010501 environmental sciences ,Real image ,01 natural sciences ,Domain (software engineering) ,0602 languages and literature ,Computer vision ,Artificial intelligence ,business ,0105 earth and related environmental sciences - Abstract
State-of-the-art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This leads to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images into realistic copies. In this paper, we present a model capable of translating paintings to photo-realistic images, trained without paired examples. The idea is to enforce a patch-level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on Monet, Cezanne and Van Gogh painting translation tasks, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.
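One way to picture the patch-level constraint is a loss that pushes every generated patch toward its nearest neighbour in a bank of real patches. The PyTorch sketch below implements that simplified reading, with cosine similarity over unfolded patches; the bank construction and the paper's full criterion are more elaborate.

import torch
import torch.nn.functional as F

def patch_memory_loss(generated, bank, patch=8):
    patches = F.unfold(generated, patch, stride=patch)        # (B, C*p*p, L) patch columns
    patches = F.normalize(patches.transpose(1, 2), dim=-1)    # (B, L, D)
    bank = F.normalize(bank, dim=-1)                          # (M, D) real patches
    sim = patches @ bank.t()                                  # similarity to every real patch
    return (1.0 - sim.max(dim=-1).values).mean()              # distance to the best match

gen = torch.rand(2, 3, 64, 64, requires_grad=True)            # generated images (toy)
real_bank = torch.rand(1000, 3 * 8 * 8)                       # memory bank of real patches
loss = patch_memory_loss(gen, real_bank)
loss.backward()                                               # gradients reach the generator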
- Published
- 2019
45. Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain
- Author
-
Massimiliano Corsini, Lorenzo Baraldi, Rita Cucchiara, Marcella Cornia, and Matteo Stefanini
- Subjects
business.industry ,Computer science ,02 engineering and technology ,010501 environmental sciences ,computer.software_genre ,01 natural sciences ,Domain (software engineering) ,Image (mathematics) ,Cultural heritage ,0202 electrical engineering, electronic engineering, information engineering ,Contextual information ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,computer ,Natural language processing ,0105 earth and related environmental sciences - Abstract
As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset that contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.
- Published
- 2019
46. End-to-end 6-DoF Object Pose Estimation through Differentiable Rasterization
- Author
-
Simone Calderara, Rita Cucchiara, Luca Bergamini, and Andrea Palazzi
- Subjects
Computer science ,business.industry ,Deep learning ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Rendering (computer graphics) ,Silhouette ,End-to-end principle ,deep learning, differentiable rendering, 6 degrees of freedom pose estimation ,0202 electrical engineering, electronic engineering, information engineering ,Leverage (statistics) ,020201 artificial intelligence & image processing ,Computer vision ,Artificial intelligence ,Differentiable function ,business ,Encoder ,Pose ,ComputingMethodologies_COMPUTERGRAPHICS ,0105 earth and related environmental sciences - Abstract
Here we introduce an approximated differentiable renderer to refine a 6-DoF pose prediction using only 2D alignment information. To this end, a two-branched convolutional encoder network is employed to jointly estimate the object class and its 6-DoF pose in the scene. We then propose a new formulation of an approximated differentiable renderer to re-project the 3D object onto the image according to its predicted pose; in this way, the alignment error between the observed and the re-projected object silhouettes can be measured. Since the renderer is differentiable, it is possible to back-propagate through it to correct the estimated pose at test time in an online learning fashion. Finally, we show how to leverage the classification branch to profitably re-project a representative model of the predicted class (i.e. a medoid) instead. Each object in the scene is processed independently, and novel viewpoints in which both the objects' arrangement and their mutual poses are preserved can be rendered.
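The online-refinement loop is the key trick: because the render-and-compare pipeline is differentiable, a silhouette error can be back-propagated to the pose parameters at test time. The PyTorch sketch below shrinks this to a toy 2D-translation-only "renderer" that splats points as Gaussians; the real system handles full 6-DoF poses and meshes.

import torch

def soft_silhouette(points, tx, ty, size=32, sigma=3.0):
    # Toy differentiable renderer: orthographic projection of a point cloud
    # shifted by (tx, ty), splatted as Gaussians on a size x size grid.
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing='ij')
    px, py = points[:, 0] + tx, points[:, 1] + ty
    d2 = (xs[None] - px[:, None, None]) ** 2 + (ys[None] - py[:, None, None]) ** 2
    return torch.exp(-d2 / sigma ** 2).amax(dim=0)            # soft silhouette in [0, 1]

points = torch.rand(50, 3) * 8                                # toy object model
target = soft_silhouette(points, torch.tensor(12.0), torch.tensor(9.0)).detach()

tx = torch.tensor(9.0, requires_grad=True)                    # imperfect initial pose
ty = torch.tensor(14.0, requires_grad=True)
opt = torch.optim.Adam([tx, ty], lr=0.3)
for _ in range(200):                                          # test-time refinement loop
    loss = ((soft_silhouette(points, tx, ty) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(tx.item(), ty.item())                                   # should drift toward (12, 9)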
- Published
- 2019
47. Image-to-Image Translation to Unfold the Reality of Artworks: an Empirical Analysis
- Author
-
Matteo Tomei, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara
- Subjects
Closed captioning ,business.industry ,Computer science ,02 engineering and technology ,010501 environmental sciences ,Real image ,01 natural sciences ,Cultural heritage ,Memory bank ,0202 electrical engineering, electronic engineering, information engineering ,Entropy (information theory) ,Image translation ,020201 artificial intelligence & image processing ,Computer vision ,Segmentation ,Artificial intelligence ,Architecture ,business ,0105 earth and related environmental sciences - Abstract
State-of-the-art Computer Vision pipelines show poor performance on artworks and data coming from the artistic domain, thus limiting the applicability of current architectures to the automatic understanding of the cultural heritage. This is mainly due to the difference in texture and low-level feature distribution between artistic and real images, on which state-of-the-art approaches are usually trained. To enhance the applicability of pre-trained architectures on artistic data, we have recently proposed an unpaired domain translation approach which can translate artworks to photo-realistic visualizations. Our approach leverages semantically-aware memory banks of real patches, which are used to drive the generation of the translated image while improving its realism. In this paper, we provide additional analyses and experimental results which demonstrate the effectiveness of our approach. In particular, we evaluate the quality of generated results for the translation of landscapes, portraits and paintings from four different styles using automatic distance metrics. Also, we analyze the response of pre-trained architectures for classification, detection and segmentation, both in terms of feature distribution and entropy of prediction, and show that our approach effectively reduces the domain shift of paintings. As an additional contribution, we also provide a qualitative analysis of the reduction of the domain shift for detection, segmentation and image captioning.
- Published
- 2019
48. Art2Real: Unfolding the Reality of Artworks via Semantically-Aware Image-to-Image Translation
- Author
-
Rita Cucchiara, Lorenzo Baraldi, Matteo Tomei, and Marcella Cornia
- Subjects
FOS: Computer and information sciences ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Deep Learning ,Image and Video Synthesis ,Vision Applications and Systems ,Computer Science - Computer Vision and Pattern Recognition ,02 engineering and technology ,010501 environmental sciences ,01 natural sciences ,Image (mathematics) ,Domain (software engineering) ,0202 electrical engineering, electronic engineering, information engineering ,Code (cryptography) ,Segmentation ,0105 earth and related environmental sciences ,Painting ,Information retrieval ,business.industry ,Deep learning ,Process (computing) ,Image translation ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Realism - Abstract
The applicability of computer vision to real paintings and artworks has been rarely investigated, even though a vast heritage would greatly benefit from techniques which can understand and process data from the artistic domain. This is partially due to the small amount of annotated artistic data, which is not even comparable to that of natural images captured by cameras. In this paper, we propose a semantic-aware architecture which can translate artworks to photo-realistic visualizations, thus reducing the gap between visual features of artistic and realistic data. Our architecture can generate natural images by retrieving and learning details from real photos through a similarity matching strategy which leverages a weakly-supervised semantic understanding of the scene. Experimental results show that the proposed technique leads to increased realism and to a reduction in domain shift, which improves the performance of pre-trained architectures for classification, detection, and segmentation. Code is publicly available at: https://github.com/aimagelab/art2real., CVPR 2019
- Published
- 2018
49. Aligning Text and Document Illustrations: Towards Visually Explainable Digital Humanities
- Author
-
Costantino Grana, Rita Cucchiara, Marcella Cornia, and Lorenzo Baraldi
- Subjects
Information retrieval ,business.industry ,Computer science ,020207 software engineering ,02 engineering and technology ,Semantics ,Domain (software engineering) ,Visualization ,Annotation ,Pattern recognition (psychology) ,0202 electrical engineering, electronic engineering, information engineering ,Task analysis ,020201 artificial intelligence & image processing ,Artificial intelligence ,business ,Historical document - Abstract
While several approaches to bring vision and language together are emerging, none of them has yet addressed the digital humanities domain, which, nevertheless, is a rich source of visual and textual data. To foster research in this direction, we investigate the learning of visual-semantic embeddings for historical document illustrations, devising both supervised and semi-supervised approaches. We exploit the joint visual-semantic embeddings to automatically align illustrations and textual elements, thus providing an automatic annotation of the visual content of a manuscript. Experiments are performed on the Borso d'Este Holy Bible, one of the most sophisticated illuminated manuscripts of the Renaissance, which we manually annotate by aligning every illustration with textual commentaries written by experts. Experimental results quantify the domain shift between ordinary visual-semantic datasets and the proposed one, validate the proposed strategies, and suggest future work along the same line.
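Alignment models of this kind are usually trained with a hinge-based triplet loss over a shared embedding space, in which matched illustration/commentary pairs must outscore mismatched ones by a margin. The PyTorch sketch below shows that generic recipe, not the paper's exact objective:

import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, margin=0.2):
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                      # matched pairs on the diagonal
    cost_txt = (margin + scores - pos).clamp(min=0)      # rank texts for each image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # rank images for each text
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).sum() + cost_img.masked_fill(mask, 0).sum()

loss = contrastive_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))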
- Published
- 2018
50. Latent Space Autoregression for Novelty Detection
- Author
-
Simone Calderara, Angelo Porrello, Rita Cucchiara, and Davide Abati
- Subjects
FOS: Computer and information sciences ,business.industry ,Computer science ,Computer Vision and Pattern Recognition (cs.CV) ,Deep learning ,020208 electrical & electronic engineering ,Computer Science - Computer Vision and Pattern Recognition ,02 engineering and technology ,Machine learning ,computer.software_genre ,Novelty detection ,Autoencoder ,Autoregressive model ,0202 electrical engineering, electronic engineering, information engineering ,Probability distribution ,020201 artificial intelligence & image processing ,Anomaly detection ,Artificial intelligence ,business ,Feature learning ,computer ,Parametric statistics - Abstract
Novelty detection is commonly referred to as the discrimination of observations that do not conform to a learned model of regularity. Despite its importance in different application settings, designing a novelty detector is utterly complex due to the unpredictable nature of novelties and their inaccessibility during the training procedure, factors which expose the unsupervised nature of the problem. In our proposal, we design a general framework in which we equip a deep autoencoder with a parametric density estimator that learns the probability distribution underlying its latent representations through an autoregressive procedure. We show that a maximum likelihood objective, optimized in conjunction with the reconstruction of normal samples, effectively acts as a regularizer for the task at hand, by minimizing the differential entropy of the distribution spanned by latent vectors. In addition to providing a very general formulation, extensive experiments of our model on publicly available datasets deliver on-par or superior performance compared to state-of-the-art methods in one-class and video anomaly detection settings. Differently from prior works, our proposal does not make any assumption about the nature of the novelties, making our work readily applicable to diverse contexts., Accepted by CVPR 2019
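The two-part objective (reconstruct the input while making its latent code likely under an autoregressive density) can be sketched with a linear, causally masked Gaussian estimator, as below. This is a deliberately minimal PyTorch stand-in for the paper's deeper autoencoder and estimator.

import torch
import torch.nn as nn

class CausalGaussian(nn.Module):
    # Linear autoregressive density: the mean of each latent coordinate z_i is
    # predicted from z_{<i} via a strictly lower-triangular weight matrix.
    def __init__(self, d):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(d, d))
        self.register_buffer('mask', torch.tril(torch.ones(d, d), diagonal=-1))

    def neg_log_prob(self, z):
        mu = z @ (self.w * self.mask).t()
        return 0.5 * ((z - mu) ** 2).mean()      # unit-variance Gaussian NLL, up to constants

enc, dec = nn.Linear(784, 64), nn.Linear(64, 784)
ar = CausalGaussian(64)
x = torch.rand(32, 784)
z = enc(x)
loss = ((dec(z) - x) ** 2).mean() + 0.1 * ar.neg_log_prob(z)  # reconstruction + likelihood
# At test time, a high reconstruction error or a low latent likelihood flags a novelty.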
- Published
- 2018