Author: "Silvio Savarese" / Topic: computer vision - Searchworks@Jio Institute Digital Library Search Results

Your search for "Silvio Savarese" returned 83 results



1. JRDB: A Dataset and Benchmark of Egocentric Robot Visual Perception of Humans in Built Environments

Author: Eric H. Frankel, JunYoung Gwak, Amir Sadeghian, Mihir Patel, Roberto Martín-Martín, Silvio Savarese, Hamid Rezatofighi, and Abhijeet Shenoi
Subjects: FOS: Computer and information sciences, Visual perception, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, 02 engineering and technology, Computer Science - Robotics, Artificial Intelligence, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, Social robot, Audio signal, Mobile manipulator, business.industry, Applied Mathematics, Computational Theory and Mathematics, Robot, RGB color model, 020201 artificial intelligence & image processing, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO), Encoder, Software
Abstract: We present JRDB, a novel egocentric dataset collected from our social mobile manipulator JackRabbot. The dataset includes 64 minutes of annotated multimodal sensor data including stereo cylindrical 360° RGB video at 15 fps, 3D point clouds from two Velodyne 16 Lidars, line 3D point clouds from two Sick Lidars, audio signal, RGB-D video at 30 fps, 360° spherical image from a fisheye camera and encoder values from the robot's wheels. Our dataset incorporates data from traditionally underrepresented scenes such as indoor environments and pedestrian areas, all from the ego-perspective of the robot, both stationary and navigating. The dataset has been annotated with over 2.3 million bounding boxes spread over 5 individual cameras and 1.8 million associated 3D cuboids around all people in the scenes totaling over 3500 time consistent trajectories. Together with our dataset and the annotations, we launch a benchmark and metrics for 2D and 3D person detection and tracking. With this dataset, which we plan on extending with further types of annotation in the future, we hope to provide a new source of data and a test-bench for research in the areas of egocentric robot vision, autonomous navigation, and all perceptual tasks around social robotics in human environments.
Published: 2023
Full Text: View/download PDF

2. Model-Based Object Recognition: Traditional Approach

Author: Min Sun and Silvio Savarese
Subjects: Computer science, business.industry, Cognitive neuroscience of visual object recognition, Computer vision, Artificial intelligence, business
Published: 2021
Full Text: View/download PDF

3. Localizing Against Drawn Maps via Spline-Based Registration

Author: Silvio Savarese, Marynel Vázquez, and Kevin Chen
Subjects: Computer science, business.industry, 05 social sciences, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, Task (project management), Spline (mathematics), Lidar, 0502 economics and business, Robot, Computer vision, Artificial intelligence, 050207 economics, business, Thin plate spline, Baseline (configuration management), Rigid transformation, ComputingMethodologies_COMPUTERGRAPHICS, 0105 earth and related environmental sciences
Abstract: We propose a method to facilitate robot navigation relative to sketched maps of human environments. Our main contribution centers around using thin plate splines for registering the robot’s LIDAR observation with the hand-drawn maps. Thin plate splines are particularly effective for this task because they are able to handle many of the nonrigid deformations commonly seen in sketches of maps, which render traditional rigid transformations inappropriate. Our proposed approach uses a convolutional neural network to efficiently predict the control points which define the spline transform, from which we then compute the pose of the robot on the hand-drawn map for navigation purposes. Our systematic evaluations in simulation using a synthetic dataset and real, hand-drawn sketches show that the proposed spline-based registration approach outperforms baseline methods.
Published: 2020
Full Text: View/download PDF

4. Visuomotor Mechanical Search: Learning to Retrieve Target Objects in Clutter

Author: Silvio Savarese, Animesh Garg, Roberto Martín-Martín, Rohun Kulkarni, Andrey Kurenkov, Marcus Dominguez-Kuhne, and Joseph Taglic
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Machine Learning, Extremely hard, business.industry, Computer Science - Artificial Intelligence, Sample (statistics), 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Outcome (probability), Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence (cs.AI), Clutter, Reinforcement learning, Computer vision, Artificial intelligence, business, Robotics (cs.RO), 0105 earth and related environmental sciences, Heap (data structure)
Abstract: When searching for objects in cluttered environments, it is often necessary to perform complex interactions in order to move occluding objects out of the way and fully reveal the object of interest and make it graspable. Due to the complexity of the physics involved and the lack of accurate models of the clutter, planning and controlling precise predefined interactions with accurate outcome is extremely hard, if not impossible. In problems where accurate (forward) models are lacking, Deep Reinforcement Learning (RL) has been shown to be a viable solution to map observations (e.g. images) to good interactions in the form of closed-loop visuomotor policies. However, Deep RL is sample inefficient and fails when applied directly to the problem of unoccluding objects based on images. In this work we present a novel Deep RL procedure that combines i) teacher-aided exploration, ii) a critic with privileged information, and iii) mid-level representations, resulting in sample efficient and effective learning for the problem of uncovering a target object occluded by a heap of unknown objects. Our experiments show that our approach trains faster and converges to more efficient uncovering solutions than baselines and ablations, and that our uncovering policies lead to an average improvement in the graspability of the target object, facilitating downstream retrieval applications.
Published: 2020

5. 3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

Author: Amir Roshan Zamir, JunYoung Gwak, Zhi-Yang He, Iro Armeni, Silvio Savarese, Martin Fischer, and Jitendra Malik
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Graph, Visualization, Computer Science - Robotics, 3d space, Framing (construction), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Scene graph, Polygon mesh, Computer vision, Artificial intelligence, business, Robotics (cs.RO)
Abstract: A comprehensive semantic understanding of a scene is important for many applications - but in what space should diverse semantic information (e.g., objects, scene categories, material types, texture, etc.) be grounded and what should be its structure? Aspiring to have one unified structure that hosts diverse types of semantics, we follow the Scene Graph paradigm in 3D, generating a 3D Scene Graph. Given a 3D mesh and registered panoramic images, we construct a graph that spans the entire building and includes semantics on objects (e.g., class, material, and other attributes), rooms (e.g., scene category, volume, etc.) and cameras (e.g., location, etc.), as well as the relationships among these entities. However, this process is prohibitively labor heavy if done manually. To alleviate this we devise a semi-automatic framework that employs existing detection methods and enhances them using two main constraints: I. framing of query images sampled on panoramas to maximize the performance of 2D detectors, and II. multi-view consistency enforcement across 2D detections that originate in different camera locations. Comment: ICCV 2019
Published: 2019
Full Text: View/download PDF

6. Mechanical Search: Multi-Step Retrieval of a Target Object Occluded by Clutter

Author: Roberto Martín-Martín, Andrey Kurenkov, Ken Goldberg, Animesh Garg, Silvio Savarese, Ashwin Balakrishna, David Wang, Michael Danielczuk, and Matthew Matl
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, GRASP, 02 engineering and technology, Image segmentation, Visualization, Computer Science - Robotics, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Task analysis, Robot, Clutter, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Robotics (cs.RO), Heap (data structure)
Abstract: When operating in unstructured environments such as warehouses, homes, and retail centers, robots are frequently required to interactively search for and retrieve specific objects from cluttered bins, shelves, or tables. Mechanical Search describes the class of tasks where the goal is to locate and extract a known target object. In this paper, we formalize Mechanical Search and study a version where distractor objects are heaped over the target object in a bin. The robot uses an RGBD perception system and control policies to iteratively select, parameterize, and perform one of 3 actions -- push, suction, grasp -- until the target object is extracted, or either a time limit is exceeded, or no high confidence push or grasp is available. We present a study of 5 algorithmic policies for mechanical search, with 15,000 simulated trials and 300 physical trials for heaps ranging from 10 to 20 objects. Results suggest that success can be achieved in this long-horizon task with algorithmic policies in over 95% of instances and that the number of actions required scales approximately linearly with the size of the heap. Code and supplementary material can be found at http://ai.stanford.edu/mech-search. Comment: To appear in IEEE International Conference on Robotics and Automation (ICRA), 2019. 9 pages with 4 figures
Published: 2019
Full Text: View/download PDF

7. Deep Visual MPC-Policy Learning for Navigation

Author: Amir Sadeghian, Noriaki Hirose, Roberto Martín-Martín, Silvio Savarese, and Fei Xia
Subjects: FOS: Computer and information sciences, Control and Optimization, business.industry, Computer science, Mechanical Engineering, Biomedical Engineering, Computer Science Applications, Visualization, Human-Computer Interaction, Model predictive control, Computer Science - Robotics, Artificial Intelligence, Control and Systems Engineering, Position (vector), Obstacle avoidance, Path (graph theory), Trajectory, Robot, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO)
Abstract: Humans can routinely follow a trajectory defined by a list of images/landmarks. However, traditional robot navigation methods require accurate mapping of the environment, localization, and planning. Moreover, these methods are sensitive to subtle changes in the environment. In this paper, we propose a Deep Visual MPC-policy learning method that can perform visual navigation while avoiding collisions with unseen objects on the navigation path. Our model PoliNet takes in as input a visual trajectory and the image of the robot's current view and outputs velocity commands for a planning horizon of N steps that optimally balance between trajectory following and obstacle avoidance. PoliNet is trained using a strong image predictive model and traversability estimation model in an MPC setup, with minimal human supervision. Different from prior work, PoliNet can be applied to new scenes without retraining. We show experimentally that the robot can follow a visual trajectory when varying start position and in the presence of previously unseen obstacles. We validated our algorithm with tests both in a realistic simulation environment and in the real world. We also show that we can generate visual trajectories in simulation and execute the corresponding path in the real environment. Our approach outperforms classical approaches as well as previous learning-based baselines in success rate of goal reaching, sub-goal coverage rate, and computational load. Comment: 11 pages, 11 figures, 5 tables
Published: 2019

8. 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints

Author: Li Fei-Fei, Chen Wang, Jun Lv, Danfei Xu, Yuke Zhu, Silvio Savarese, Cewu Lu, and Roberto Martín-Martín
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, Deep learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Inter frame, 02 engineering and technology, Visualization, Computer Science - Robotics, 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, RGB color model, Robot, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Pose, Robotics (cs.RO)
Abstract: We present 6-PACK, a deep learning approach to category-level 6D object pose tracking on RGB-D data. Our method tracks in real-time novel object instances of known object categories such as bowls, laptops, and mugs. 6-PACK learns to compactly represent an object by a handful of 3D keypoints, based on which the interframe motion of an object instance can be estimated through keypoint matching. These keypoints are learned end-to-end without manual supervision in order to be most effective for tracking. Our experiments show that our method substantially outperforms existing methods on the NOCS category-level 6D pose estimation benchmark and supports a physical robot to perform simple vision-based closed-loop manipulation tasks. Our code and video are available at https://sites.google.com/view/6packtracking.
Published: 2019
Full Text: View/download PDF

9. DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

Author: Roberto Martín-Martín, Li Fei-Fei, Yuke Zhu, Silvio Savarese, Danfei Xu, Cewu Lu, and Chen Wang
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer science, business.industry, Deep learning, Computer Vision and Pattern Recognition (cs.CV), GRASP, Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Object (computer science), Computer Science - Robotics, 020901 industrial engineering & automation, Feature (computer vision), 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Robot, RGB color model, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Pose, Robotics (cs.RO)
Abstract: A key technical challenge in performing 6D object pose estimation from RGB-D images is to fully leverage the two complementary data sources. Prior works either extract information from the RGB image and depth separately or use costly post-processing steps, limiting their performance in highly cluttered scenes and real-time applications. In this work, we present DenseFusion, a generic framework for estimating 6D pose of a set of known objects from RGB-D images. DenseFusion is a heterogeneous architecture that processes the two data sources individually and uses a novel dense fusion network to extract pixel-wise dense feature embedding, from which the pose is estimated. Furthermore, we integrate an end-to-end iterative pose refinement procedure that further improves the pose estimation while achieving near real-time inference. Our experiments show that our method outperforms state-of-the-art approaches in two datasets, YCB-Video and LineMOD. We also deploy our proposed method to a real robot to grasp and manipulate objects based on the estimated pose.
Published: 2019
Full Text: View/download PDF

10. Machine vision for natural gas methane emissions detection using an infrared camera

Author: Jingfan Wang, Mike McGuire, Lyne P. Tchapmi, Daniel Zimmerle, Silvio Savarese, Adam R. Brandt, Arvind P. Ravikumar, and Clay S. Bell
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Leak, Offset (computer science), Computer science, Machine vision, Computer Vision and Pattern Recognition (cs.CV), 020209 energy, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Management, Monitoring, Policy and Law, Convolutional neural network, Methane, Machine Learning (cs.LG), chemistry.chemical_compound, 020401 chemical engineering, Natural gas, FOS: Electrical engineering, electronic engineering, information engineering, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, 0204 chemical engineering, Background subtraction, business.industry, Mechanical Engineering, Deep learning, Image and Video Processing (eess.IV), Building and Construction, Electrical Engineering and Systems Science - Image and Video Processing, General Energy, chemistry, Artificial intelligence, business
Abstract: It is crucial to reduce natural gas methane emissions, which can potentially offset the climate benefits of replacing coal with gas. Optical gas imaging (OGI) is a widely-used method to detect methane leaks, but is labor-intensive and cannot provide leak detection results without operators' judgment. In this paper, we develop a computer vision approach to OGI-based leak detection using convolutional neural networks (CNN) trained on methane leak images to enable automatic detection. First, we collect ~1 M frames of labeled video of methane leaks from different leaking equipment to build the CNN model, covering a wide range of leak sizes (5.3-2051.6 gCH4/h) and imaging distances (4.6-15.6 m). Second, we examine different background subtraction methods to extract the methane plume in the foreground. Third, we then test three CNN model variants, collectively called GasNet, to detect plumes in videos taken at other pieces of leaking equipment. We assess the ability of GasNet to perform leak detection by comparing it to a baseline method that uses an optical-flow-based change detection algorithm. We explore the sensitivity of results to the CNN structure, with a moderate-complexity variant performing best across distances. We find that the detection accuracy can reach as high as 99%, and the overall detection accuracy can exceed 95% for a case across all leak sizes and imaging distances. Binary detection accuracy exceeds 97% for large leaks (~710 gCH4/h) imaged closely (~5-7 m). At closer imaging distances (~5-10 m), CNN-based models have greater than 94% accuracy across all leak sizes. At the farthest distances (~13-16 m), performance degrades rapidly, but it can achieve above 95% accuracy to detect large leaks (>950 gCH4/h). The GasNet-based computer vision approach could be deployed in OGI surveys to allow automatic vigilance of methane leak detection with high detection accuracy in the real world. Comment: This paper was submitted to Applied Energy
Published: 2020
Full Text: View/download PDF

11. VUNet: Dynamic Scene View Synthesis for Traversability Estimation using an RGB Camera

Author: Amir Sadeghian, Noriaki Hirose, Roberto Martín-Martín, Silvio Savarese, and Fei Xia
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Control and Optimization, Computer science, Computer Vision and Pattern Recognition (cs.CV), Biomedical Engineering, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Computer Science - Robotics, 020901 industrial engineering & automation, Artificial Intelligence, Computer vision, 0105 earth and related environmental sciences, Network architecture, business.industry, Mechanical Engineering, Deep learning, Mobile robot, Computer Science Applications, View synthesis, Human-Computer Interaction, Control and Systems Engineering, Virtual image, Teleoperation, Robot, RGB color model, Computer Vision and Pattern Recognition, Artificial intelligence, business, Robotics (cs.RO)
Abstract: We present VUNet, a novel view (VU) synthesis method for mobile robots in dynamic environments, and its application to the estimation of future traversability. Our method predicts future images for given virtual robot velocity commands using only RGB images at previous and current time steps. The future images result from applying two types of image changes to the previous and current images: 1) changes caused by different camera pose, and 2) changes due to the motion of the dynamic obstacles. We learn to predict these two types of changes disjointly using two novel network architectures, SNet and DNet. We combine SNet and DNet to synthesize future images that we pass to our previously presented method GONet to estimate the traversable areas around the robot. Our quantitative and qualitative evaluations indicate that our approach for view synthesis predicts accurate future images in both static and dynamic environments. We also show that these virtual images can be used to estimate future traversability correctly. We apply our view synthesis-based traversability estimation method to two applications for assisted teleoperation. Website: http://svl.stanford.edu/projects/vunet/
Published: 2018

12. Robust real-time tracking combining 3D shape, color, and motion

Author: Jesse Levinson, David Held, Silvio Savarese, and Sebastian Thrun
Subjects: 0209 industrial biotechnology, business.industry, Applied Mathematics, Mechanical Engineering, Posterior probability, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Object (computer science), Tracking (particle physics), Motion (physics), Tracking error, 020901 industrial engineering & automation, Artificial Intelligence, Laser tracker, Robustness (computer science), Modeling and Simulation, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, business, Real time tracking, Software, Mathematics
Abstract: Real-time tracking algorithms often suffer from low accuracy and poor robustness when confronted with difficult, real-world data. We present a tracker that combines 3D shape, color (when available), and motion cues to accurately track moving objects in real-time. Our tracker allocates computational effort based on the shape of the posterior distribution. Starting with a coarse approximation to the posterior, the tracker successively refines this distribution, increasing in tracking accuracy over time. The tracker can thus be run for any amount of time, after which the current approximation to the posterior is returned. Even at a minimum runtime of 0.37 ms per object, our method outperforms all of the baseline methods of similar speed by at least 25% in root-mean-square (RMS) tracking error. If our tracker is allowed to run for longer, the accuracy continues to improve, and it continues to outperform all baseline methods. Our tracker is thus anytime, allowing the speed or accuracy to be optimized based on the needs of the application. By combining 3D shape, color (when available), and motion cues in a probabilistic framework, our tracker is able to robustly handle changes in viewpoint, occlusions, and lighting variations for moving objects of a variety of shapes, sizes, and distances.
Published: 2015
Full Text: View/download PDF

13. Group and Crowd Behavior for Computer Vision

Author: Vittorio Murino, Marco Cristani, Shishir Shah, and Silvio Savarese
Subjects: Computer vision
Abstract: Group and Crowd Behavior for Computer Vision provides a multidisciplinary perspective on how to solve the problem of group and crowd analysis and modeling, combining insights from the social sciences with technological ideas in computer vision and pattern recognition. The book answers many unresolved issues in group and crowd behavior, with Part One providing an introduction to the problems of analyzing groups and crowds that stresses that they should not be considered as completely diverse entities, but as an aggregation of people. Part Two focuses on features and representations with the aim of recognizing the presence of groups and crowds in image and video data. It discusses low level processing methods to individuate when and where a group or crowd is placed in the scene, spanning from the use of people detectors toward more ad-hoc strategies to individuate group and crowd formations. Part Three discusses methods for analyzing the behavior of groups and the crowd once they have been detected, showing how to extract semantic information, predicting/tracking the movement of a group, the formation or disaggregation of a group/crowd and the identification of different kinds of groups/crowds depending on their behavior. The final section focuses on identifying and promoting datasets for group/crowd analysis and modeling, presenting and discussing metrics for evaluating the pros and cons of the various models and methods. This book gives computer vision researchers techniques for segmentation and grouping, tracking and reasoning for solving group and crowd modeling and analysis, as well as more general problems in computer vision and machine learning.
- Presents the first book to cover the topic of modeling and analysis of groups in computer vision
- Discusses the topics of group and crowd modeling from a cross-disciplinary perspective, using social science anthropological theories translated into computer vision algorithms
- Focuses on group and crowd analysis metrics
- Discusses real industrial systems dealing with the problem of analyzing groups and crowds
Published: 2017

14. Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos

Author: Silvio Savarese, Ozan Sener, and Rachel Luo
Subjects: Data stream, 0209 industrial biotechnology, Computer science, business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Semantic property, Observer (special relativity), 020901 industrial engineering & automation, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, RGB color model, Leverage (statistics), 020201 artificial intelligence & image processing, Segmentation, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: In this paper we focus on the problem of inferring geometric and semantic properties of a complex scene where humans interact with objects from egocentric views. Unlike most previous work, our goal is to leverage a multimodal sensory stream composed of RGB, depth, and thermal (RGB-D-T) signals and use this data stream as an input to a new framework for joint 6 DOF camera localization, 3D reconstruction, and semantic segmentation. As our extensive experimental evaluation shows, the combination of different sensing modalities allows us to achieve greater robustness in situations where both the observer and the objects in the scene move rapidly (a challenging situation for traditional semantic reconstruction methods). Moreover, we contribute a new dataset that includes a large number of egocentric RGB-D-T videos of humans performing daily real-world activities as well as a new demonstration hardware platform for acquiring such a dataset.
Published: 2017
Full Text: View/download PDF

15. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies

Author: Alexandre Alahi, Amir Sadeghian, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 020207 software engineering, 02 engineering and technology, Sensor fusion, Tracking (particle physics), Object detection, Term (time), Recurrent neural network, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: The majority of existing solutions to the Multi-Target Tracking (MTT) problem do not combine cues in a coherent end-to-end fashion over a long period of time. In contrast, we present an online method that encodes long-term temporal dependencies across multiple cues. One key challenge of tracking methods is to accurately track occluded targets or those which share similar appearance properties with surrounding objects. To address this challenge, we present a structure of Recurrent Neural Networks (RNN) that jointly reasons on multiple cues over a temporal window. We are able to correct many data association errors and recover observations from an occluded state. We demonstrate the robustness of our data-driven approach by tracking multiple targets using their appearance, motion, and even interactions. Our method outperforms previous works on multiple publicly available datasets including the challenging MOT benchmark.
Published: 2017
Full Text: View/download PDF

16. Deep View Morphing

Author: Junghyun Kwon, Max E. McFarland, Dinghuang Ji, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Pixel, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Visibility (geometry), Interpolation (computer graphics), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, View synthesis, Morphing, Rectification, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, 0105 earth and related environmental sciences, Interpolation
Abstract: Recently, convolutional neural networks (CNN) have been successfully applied to view synthesis problems. However, such CNN-based methods can suffer from lack of texture details, shape distortions, or high computational complexity. In this paper, we propose a novel CNN architecture for view synthesis called "Deep View Morphing" that does not suffer from these issues. To synthesize a middle view of two input images, a rectification network first rectifies the two input images. An encoder-decoder network then generates dense correspondences between the rectified images and blending masks to predict the visibility of pixels of the rectified images in the middle view. A view morphing network finally synthesizes the middle view using the dense correspondences and blending masks. We experimentally show the proposed method significantly outperforms the state-of-the-art CNN-based view synthesis method. Comment: Accepted to CVPR 2017
Published: 2017
Full Text: View/download PDF

17. Unsupervised camera localization in crowded spaces

Author: Silvio Savarese, Li Fei-Fei, Alexandre Alahi, and Judson Wilson
Subjects: Matching (statistics), Social robot, Optimization problem, business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020206 networking & telecommunications, 02 engineering and technology, Tracking (particle physics), Motion (physics), 0202 electrical engineering, electronic engineering, information engineering, Unsupervised learning, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Existing camera networks in public spaces such as train terminals or malls can help social robots to navigate crowded scenes. However, the localization of the cameras is required, i.e., the positions and poses of all cameras in a unique reference. In this work, we estimate the relative location of any pair of cameras by solely using noisy trajectories observed from each camera. We propose a fully unsupervised learning technique using unlabelled pedestrian motion patterns captured in crowded scenes. We first estimate the pairwise camera parameters by optimally matching single-view pedestrian tracks using social awareness. Then, we show the impact of jointly estimating the network parameters. This is done by formulating a nonlinear least-squares optimization problem, leveraging a continuous approximation of the matching function. We evaluate our approach in real-world environments such as train terminals, where several hundreds of individuals need to be tracked across dozens of cameras every second.
Published: 2017
Full Text: View/download PDF

18. Subcategory-Aware Convolutional Neural Networks for Object Proposals and Detection

Author: Wongun Choi, Silvio Savarese, Yu Xiang, and Yuanqing Lin
Subjects: FOS: Computer and information sciences, Subcategory, 050210 logistics & transportation, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), 05 social sciences, Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, 02 engineering and technology, Object (computer science), 3D pose estimation, Convolutional neural network, Object detection, Object-class detection, 0502 economics and business, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Viola–Jones object detection framework, Computer vision, Artificial intelligence, business, Pose
Abstract: In CNN-based object detection methods, region proposal becomes a bottleneck when objects exhibit significant scale variation, occlusion or truncation. In addition, these methods mainly focus on 2D object detection and cannot estimate detailed properties of objects. In this paper, we propose subcategory-aware CNNs for object detection. We introduce a novel region proposal network that uses subcategory information to guide the proposal generating process, and a new detection network for joint detection and subcategory classification. By using subcategories related to object pose, we achieve state-of-the-art performance on both detection and pose estimation on commonly used benchmarks. Comment: Published in WACV 2017.
Published: 2017
Full Text: View/download PDF

19. Multi-Task Domain Adaptation for Deep Learning of Instance Grasping from Simulation

Author: Silvio Savarese, Stefan Hinterstoisser, Kuan Fang, Yunfei Bai, and Mrinal Kalakrishnan
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, Computer Science - Artificial Intelligence, Computer science, Computer Vision and Pattern Recognition (cs.CV), Feature extraction, Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Machine Learning (cs.LG), Computer Science - Robotics, 020901 industrial engineering & automation, 0202 electrical engineering, electronic engineering, information engineering, Computer vision, Artificial neural network, business.industry, Deep learning, Image segmentation, Object (computer science), Computer Science - Learning, Artificial Intelligence (cs.AI), Robot, 020201 artificial intelligence & image processing, Artificial intelligence, Transfer of learning, business, Robotics (cs.RO)
Abstract: Learning-based approaches to robotic manipulation are limited by the scalability of data collection and accessibility of labels. In this paper, we present a multi-task domain adaptation framework for instance grasping in cluttered scenes by utilizing simulated robot experiments. Our neural network takes monocular RGB images and the instance segmentation mask of a specified target object as inputs, and predicts the probability of successfully grasping the specified object for each candidate motor command. The proposed transfer learning framework trains a model for instance grasping in simulation and uses a domain-adversarial loss to transfer the trained model to real robots using indiscriminate grasping data, which is available both in simulation and the real world. We evaluate our model in real-world robot experiments, comparing it with alternative model architectures as well as an indiscriminate grasping baseline. Comment: ICRA 2018.
Published: 2017
Full Text: View/download PDF

20. Indoor Scene Understanding with Geometric and Semantic Contexts

Author: Silvio Savarese, Wongun Choi, Yu-Wei Chao, and Caroline Pantofaru
Subjects: Parsing, business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, Scene statistics, Pattern recognition, Object (computer science), computer.software_genre, Object detection, Artificial Intelligence, Pattern recognition (psychology), Core (graph theory), Graph (abstract data type), Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, business, computer, Software, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Truly understanding a scene involves integrating information at multiple levels as well as studying the interactions between scene elements. Individual object detectors, layout estimators and scene classifiers are powerful but ultimately confounded by complicated real-world scenes with high variability, different viewpoints and occlusions. We propose a method that can automatically learn the interactions among scene elements and apply them to the holistic understanding of indoor scenes from a single image. This interpretation is performed within a hierarchical interaction model which describes an image by a parse graph, thereby fusing together object detection, layout estimation and scene classification. At the root of the parse graph is the scene type and layout while the leaves are the individual detections of objects. In between is the core of the system, our 3D Geometric Phrases (3DGP). We conduct extensive experimental evaluations on single image 3D scene understanding using both 2D and 3D metrics. The results demonstrate that our model with 3DGPs can provide robust estimation of scene type, 3D space, and 3D objects by leveraging the contextual relationships among the visual elements.
Published: 2014
Full Text: View/download PDF

21. Automatic Extrinsic Calibration of Vision and Lidar by Maximizing Mutual Information

Author: James R. McBride, Silvio Savarese, Gaurav Pandey, and Ryan M. Eustice
Subjects: Mean squared error, Calibration (statistics), Computer science, business.industry, Mutual information, Computer Science Applications, Lidar, Minimum-variance unbiased estimator, Omnidirectional camera, Control and Systems Engineering, Sample variance, Computer vision, Artificial intelligence, business, Cramér–Rao bound
Abstract: This paper reports on an algorithm for automatic, targetless, extrinsic calibration of a lidar and optical camera system based upon the maximization of mutual information between the sensor-measured surface intensities. The proposed method is completely data-driven and does not require any fiducial calibration targets, making in situ calibration easy. We calculate the Cramér-Rao lower bound (CRLB) of the estimated calibration parameter variance, and we show experimentally that the sample variance of the estimated parameters empirically approaches the CRLB when the amount of data used for calibration is sufficiently large. Furthermore, we compare the calibration results to independent ground-truth where available and observe that the mean error empirically approaches zero as the amount of data used for calibration is increased, thereby suggesting that the proposed estimator is a minimum variance unbiased estimate of the calibration parameters. Experimental results are presented for three different lidar-camera systems: (i) a three-dimensional (3D) lidar and omnidirectional camera, (ii) a 3D time-of-flight sensor and monocular camera, and (iii) a 2D lidar and monocular camera.
Published: 2014
Full Text: View/download PDF

22. Understanding Collective Activities of People from Videos

Author: Silvio Savarese and Wongun Choi
Subjects: Property (programming), Computer science, Posture, Context (language use), Machine learning, computer.software_genre, Pattern Recognition, Automated, Artificial Intelligence, Image Processing, Computer-Assisted, Humans, Human Activities, Computer vision, Isolation (database systems), Hidden Markov model, business.industry, Applied Mathematics, Videotape Recording, Object detection, Computational Theory and Mathematics, Video tracking, Key (cryptography), Computer Vision and Pattern Recognition, Artificial intelligence, business, computer, Algorithms, Software
Abstract: This paper presents a principled framework for analyzing collective activities at different levels of semantic granularity from videos. Our framework is capable of jointly tracking multiple individuals, recognizing activities performed by individuals in isolation (i.e., atomic activities such as walking or standing), recognizing the interactions between pairs of individuals (i.e., interaction activities) as well as understanding the activities of groups of individuals (i.e., collective activities). A key property of our work is that it can coherently combine bottom-up information stemming from detections or fragments of tracks (or tracklets) with top-down evidence. Top-down evidence is provided by a newly proposed descriptor that captures the coherent behavior of groups of individuals in a spatial-temporal neighborhood of the sequence. Top-down evidence provides contextual information for establishing accurate associations between detections or tracklets across frames and, thus, for obtaining more robust tracking results. Bottom-up evidence percolates upwards so as to automatically infer collective activity labels. Experimental results on two challenging data sets demonstrate our theoretical claims and indicate that our model achieves enhanced tracking results and the best collective classification results to date.
Published: 2014
Full Text: View/download PDF

23. Object detection, shape recovery, and 3D modelling by depth-encoded hough voting

Author: Shyam Sunder Kumar, Silvio Savarese, Gary Bradski, and Min Sun
Subjects: business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, Pattern recognition, Object (computer science), Object detection, Image (mathematics), Depth map, Signal Processing, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, Scale (map), business, Pose, Software, Mathematics
Abstract: Detecting objects, estimating their pose, and recovering their 3D shape are critical problems in many vision and robotics applications. This paper addresses the above needs using a two-stage approach. In the first stage, we propose a new method called DEHV - Depth-Encoded Hough Voting. DEHV jointly detects objects, infers their categories, estimates their pose, and infers/decodes objects' depth maps from either a single image (when no depth maps are available in testing) or a single image augmented with a depth map (when one is available in testing). Inspired by the Hough voting scheme introduced in [1], DEHV incorporates depth information into the process of learning distributions of image features (patches) representing an object category. DEHV takes advantage of the interplay between the scale of each object patch in the image and its distance (depth) from the corresponding physical patch attached to the 3D object. Once the depth map is given, a full reconstruction is achieved in a second (3D modelling) stage, where modified or state-of-the-art 3D shape and texture completion techniques are used to recover the complete 3D model. Extensive quantitative and qualitative experimental analysis on existing datasets [2-4] and a newly proposed 3D table-top object category dataset shows that our DEHV scheme obtains competitive detection and pose estimation results. Finally, the quality of 3D modelling in terms of both shape completion and texture completion is evaluated on a 3D modelling dataset containing both indoor and outdoor object categories. We demonstrate that our overall algorithm can obtain convincing 3D shape reconstruction from just one single uncalibrated image.
Published: 2013
Full Text: View/download PDF

24. DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes

Author: Kevin Chen, Silvio Savarese, Saumitro Dasgupta, and Kuan Fang
Subjects: Cuboid, Artificial neural network, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Convolutional neural network, Margin (machine learning), Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Clutter, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Vanishing point, business, Projection (set theory), 0105 earth and related environmental sciences
Abstract: We consider the problem of estimating the spatial layout of an indoor scene from a monocular RGB image, modeled as the projection of a 3D cuboid. Existing solutions to this problem often rely strongly on hand-engineered features and vanishing point detection, which are prone to failure in the presence of clutter. In this paper, we present a method that uses a fully convolutional neural network (FCNN) in conjunction with a novel optimization framework for generating layout estimates. We demonstrate that our method is robust in the presence of clutter and handles a wide range of highly challenging scenes. We evaluate our method on two standard benchmarks and show that it achieves state-of-the-art results, outperforming previous methods by a wide margin.
Published: 2016
Full Text: View/download PDF

25. 3D Semantic Parsing of Large-Scale Indoor Spaces

Author: Ioannis Brilakis, Silvio Savarese, Ozan Sener, Martin Fischer, Helen Jiang, Amir Roshan Zamir, and Iro Armeni
Subjects: Parsing, business.industry, Coordinate system, Point cloud, 020207 software engineering, 02 engineering and technology, Image segmentation, Notation, computer.software_genre, 4013 Geomatic Engineering, 46 Information and Computing Sciences, Robustness (computer science), Histogram, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Segmentation, Computer vision, Data mining, Artificial intelligence, business, computer, 40 Engineering, Mathematics
Abstract: In this paper, we propose a method for semantic parsing of the 3D point cloud of an entire building using a hierarchical approach: first, the raw data is parsed into semantically meaningful spaces (e.g. rooms, etc.) that are aligned into a canonical reference coordinate system. Second, the spaces are parsed into their structural and building elements (e.g. walls, columns, etc.). Performing these with a strong notion of global 3D space is the backbone of our method. The alignment in the first step injects strong 3D priors from the canonical coordinate system into the second step for discovering elements. This allows handling diverse challenging scenarios, as man-made indoor spaces often show recurrent geometric patterns while the appearance features can change drastically. We also argue that identification of structural elements in indoor spaces is essentially a detection problem, rather than segmentation which is commonly used. We evaluated our method on a new dataset of several buildings with a covered area of over 6,000 m² and over 215 million points, demonstrating robust results readily useful for practical applications.
Published: 2016
Full Text: View/download PDF

26. Robust single-view instance recognition

Author: Sebastian Thrun, David Held, and Silvio Savarese
Subjects: Artificial neural network, Computer science, business.industry, 3D single-object recognition, 02 engineering and technology, 010501 environmental sciences, Object (computer science), 01 natural sciences, Method, Robustness (computer science), 0202 electrical engineering, electronic engineering, information engineering, Robot, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Set (psychology), business, Neural coding, 0105 earth and related environmental sciences
Abstract: Some robots must repeatedly interact with a fixed set of objects in their environment. To operate correctly, it is helpful for the robot to be able to recognize the object instances that it repeatedly encounters. However, current methods for recognizing object instances require that, during training, many pictures are taken of each object from a large number of viewing angles. This procedure is slow and requires much manual effort before the robot can begin to operate in a new environment. We have developed a novel procedure for training a neural network to recognize a set of objects from just a single training image per object. To obtain robustness to changes in viewpoint, we take advantage of a supplementary dataset in which we observe a separate (non-overlapping) set of objects from multiple viewpoints. After pre-training the network in a novel multi-stage fashion, the network can robustly recognize new object instances given just a single training image of each object. If more images of each object are available, the performance improves. We perform a thorough analysis comparing our novel training procedure to traditional neural network pre-training techniques as well as previous state-of-the-art approaches including keypoint-matching, template-matching, and sparse coding, and we demonstrate that our method significantly outperforms these previous approaches. Our method can thus be used to easily teach a robot to recognize a novel set of object instances from unknown viewpoints.
Published: 2016
Full Text: View/download PDF

27. Learning Social Etiquette: Human Trajectory Understanding In Crowded Scenes

Author: Silvio Savarese, Alexandre Alahi, Alexandre Robicquet, Amir Sadeghian, Leibe, Bastian, Matas, Jiri, Sebe, Nicu, and Welling, Max
Subjects: Social sensitivity, business.industry, Computer science, media_common.quotation_subject, 020206 networking & telecommunications, Common sense, 02 engineering and technology, Etiquette, Order (exchange), Human–computer interaction, 0202 electrical engineering, electronic engineering, information engineering, Trajectory, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, Tracking (education), business, media_common
Abstract: Humans navigate crowded spaces such as a university campus by following common sense rules based on social etiquette. In this paper, we argue that in order to enable the design of new target tracking or trajectory forecasting methods that can take full advantage of these rules, we need to have access to better data in the first place. To that end, we contribute a new large-scale dataset that collects videos of various types of targets (not just pedestrians, but also bikers, skateboarders, cars, buses, golf carts) that navigate in a real world outdoor environment such as a university campus. Moreover, we introduce a new characterization that describes the “social sensitivity” at which two targets interact. We use this characterization to define “navigation styles” and improve both forecasting models and state-of-the-art multi-target tracking, whereby the learnt forecasting models help the data association step.
Published: 2016
Full Text: View/download PDF

28. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

Author: Kevin Chen, Silvio Savarese, JunYoung Gwak, Christopher Choy, and Danfei Xu
Subjects: Occupancy grid mapping, Artificial neural network, Computer science, business.industry, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, Pattern recognition, 02 engineering and technology, Object (computer science), Image (mathematics), Recurrent neural network, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data [13]. Our network takes in one or more images of an object instance from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most of the previous works, our network does not require any image annotations or object class labels for training or testing. Our extensive experimental analysis shows that our reconstruction framework (i) outperforms the state-of-the-art methods for single view reconstruction, and (ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).
Published: 2016
Full Text: View/download PDF

29. Knowledge transfer for scene-specific motion prediction

Author: Alexandre Alahi, Silvio Savarese, Francesco Palmieri, Lamberto Ballan, and Francesco Castaldo
Subjects: FOS: Computer and information sciences, 0209 industrial biotechnology, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science (all), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 02 engineering and technology, Motion (physics), Theoretical Computer Science, 020901 industrial engineering & automation, Dynamics (music), 0202 electrical engineering, electronic engineering, information engineering, Key (cryptography), 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Knowledge transfer, Dynamic Bayesian network, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: When given a single frame of the video, humans can not only interpret the content of the scene, but are also able to forecast the near future. This ability is mostly driven by their rich prior knowledge about the visual world, both in terms of (i) the dynamics of moving agents, as well as (ii) the semantics of the scene. In this work we exploit the interplay between these two key elements to predict scene-specific motion patterns. First, we extract patch descriptors encoding the probability of moving to the adjacent patches, and the probability of being in that particular patch or changing behavior. Then, we introduce a Dynamic Bayesian Network which exploits this scene-specific knowledge for trajectory prediction. Experimental results demonstrate that our method is able to accurately predict trajectories and transfer predictions to a novel scene characterized by similar elements. Accepted to ECCV 2016.
Published: 2016

30. Learning to Track at 100 FPS with Deep Regression Networks

Author: David Held, Sebastian Thrun, and Silvio Savarese
Subjects: Training set, Artificial neural network, Computer science, business.industry, Deep learning, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, 020207 software engineering, 02 engineering and technology, Regression, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for offline training of neural networks that can track novel objects at test-time at 100 fps. Our tracker is significantly faster than previous methods that use neural networks for tracking, which are typically very slow to run and not practical for real-time applications. Our tracker uses a simple feed-forward network with no online training required. The tracker learns a generic relationship between object motion and appearance and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker’s state-of-the-art performance. Further, our performance improves as we add more videos to our offline training set. To the best of our knowledge, our tracker (Our tracker is available at http://davheld.github.io/GOTURN/GOTURN.html) is the first neural-network tracker that learns to track generic objects at 100 fps.
Published: 2016
Full Text: View/download PDF

31. Object Detection using Geometrical Context Feedback

Author: Min Sun, Silvio Savarese, and Sid Yingze Bao
Subjects: business.industry, Computer science, Detector, 3D reconstruction, Cognitive neuroscience of visual object recognition, Estimator, Pattern recognition, Object detection, Artificial Intelligence, Region of interest, Segmentation, Computer vision, Viola–Jones object detection framework, Computer Vision and Pattern Recognition, Artificial intelligence, business, Software
Abstract: We propose a new coherent framework for joint object detection, 3D layout estimation, and object supporting region segmentation from a single image. Our approach is based on the mutual interactions among three novel modules: (i) object detector; (ii) scene 3D layout estimator; (iii) object supporting region segmenter. The interactions between such modules capture the contextual geometrical relationship between objects, the physical space including these objects, and the observer. An important property of our algorithm is that the object detector module is capable of adaptively changing its confidence in establishing whether a certain region of interest contains an object (or not) as new evidence is gathered about the scene layout. This enables an iterative estimation procedure where the detector becomes more and more accurate as additional evidence about a specific scene becomes available. Extensive quantitative and qualitative experiments are conducted on the table-top dataset (Sun et al. in ECCV, 2010b) and two publicly available datasets (Hoiem et al. in CVPR, 2006; Sudderth et al. in IJCV, 2008), and demonstrate competitive object detection, 3D layout estimation, and segmentation results.
Published: 2012
Full Text: View/download PDF

32. Multimodal Video Indexing and Retrieval Using Directed Information

Author: Xu Chen, Alfred O. Hero, and Silvio Savarese
Subjects: Computer science, business.industry, Feature extraction, Search engine indexing, Scale-invariant feature transform, Pattern recognition, Mutual information, Similarity measure, Computer Science Applications, Support vector machine, Feature (computer vision), Signal Processing, Media Technology, Computer vision, Artificial intelligence, Electrical and Electronic Engineering, Hidden Markov model, business
Abstract: We propose a novel framework for multimodal video indexing and retrieval using shrinkage optimized directed information assessment (SODA) as a similarity measure. The directed information (DI) is a variant of the classical mutual information which attempts to capture the direction of information flow that videos naturally possess. It is applied directly to the empirical probability distributions of both audio and visual features over successive frames. We utilize RASTA-PLP features for audio feature representation and SIFT features for visual feature representation. We compute the joint probability density functions of audio and visual features in order to fuse features from different modalities. With SODA, we further estimate the DI in a manner that is suitable for high dimensional features p and small sample size n (large p, small n) between pairs of video-audio modalities. We demonstrate the superiority of the SODA approach in video indexing, retrieval, and activity recognition as compared to the state-of-the-art methods such as hidden Markov models (HMM), support vector machine (SVM), cross-media indexing space (CMIS), and other noncausal divergence measures such as mutual information (MI). We also demonstrate the success of SODA in audio and video localization and indexing/retrieval of data with misaligned modalities.
Published: 2012
Full Text: View/download PDF

33. Semantic Cross-View Matching

Author: Francesco Palmieri, Francesco Castaldo, Amir Roshan Zamir, Roland Angst, and Silvio Savarese
Subjects: FOS: Computer and information sciences, Geographic information system, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, 02 engineering and technology, Image segmentation, 020204 information systems, 11. Sustainability, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business
Abstract: Matching cross-view images is challenging because the appearance and viewpoints are significantly different. While low-level features based on gradient orientations or filter responses can drastically vary with such changes in viewpoint, the semantic information of images shows an invariant characteristic in this respect. Consequently, semantically labeled regions can be used for performing cross-view matching. In this paper, we therefore explore this idea and propose an automatic method for detecting and representing the semantic information of an RGB image with the goal of performing cross-view matching with a (non-RGB) geographic information system (GIS). A segmented image forms the input to our system with segments assigned to semantic concepts such as traffic signs, lakes, roads, foliage, etc. We design a descriptor to robustly capture both the presence of semantic concepts and the spatial layout of those segments. Pairwise distances between the descriptors extracted from the GIS map and the query image are then used to generate a shortlist of the most promising locations with similar semantic concepts in a consistent spatial layout. An experimental evaluation with challenging query images and a large urban area shows promising results.
Published: 2015
Full Text: View/download PDF

34. Toward coherent object detection and scene layout understanding

Author: Silvio Savarese, Sid Yingze Bao, and Min Sun
Subjects: Optimization problem, business.industry, Computer science, 3D single-object recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, LabelMe, Probabilistic logic, Deep-sky object, 3D pose estimation, Object detection, Computer graphics, Object-class detection, Signal Processing, Focal length, Viola–Jones object detection framework, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, False alarm, business, Pose
Abstract: Detecting objects in complex scenes while recovering the scene layout is a critical functionality in many vision-based applications. Inspired by the work of [18], we advocate the importance of geometric contextual reasoning for object recognition. We start from the intuition that objects' location and pose in the 3D space are not arbitrarily distributed but rather constrained by the fact that objects must lie on one or multiple supporting surfaces. We model such supporting surfaces by means of hidden parameters (i.e. not explicitly observed) and formulate the problem of joint scene reconstruction and object recognition as that of finding the set of parameters that maximizes the joint probability of having a number of detected objects on K supporting planes given the observations. As a key ingredient for solving this optimization problem, we have demonstrated a novel relationship between object location and pose in the image, and the scene layout parameters (i.e. normal of one or more supporting planes in 3D and camera pose, location and focal length). Using the probabilistic formulation and the above relationship, our method has the unique ability to jointly: i) reduce the false alarm and false negative object detection rates; ii) recover object location and supporting planes within the 3D camera reference system; iii) infer camera parameters (viewpoint and focal length) from a single uncalibrated image. Quantitative and qualitative experimental evaluation on a number of datasets (a novel in-house dataset and LabelMe [28] on car and pedestrian) demonstrates our theoretical claims.
Published: 2011
Full Text: View/download PDF

35. Extrinsic Calibration of a 3D Laser Scanner and an Omnidirectional Camera

Author: Gaurav Pandey, Ryan M. Eustice, James R. McBride, and Silvio Savarese
Subjects: Laser scanning, Orientation (computer vision), business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, General Medicine, Laser, law.invention, Structured-light 3D scanner, Omnidirectional camera, law, Camera auto-calibration, Computer vision, Artificial intelligence, business, Camera resectioning
Abstract: We propose an approach for external calibration of a 3D laser scanner with an omnidirectional camera system. The utility of an accurate calibration is that it allows for precise co-registration between the camera imagery and the 3D point cloud. This association can be used to enhance various state of the art algorithms in computer vision and robotics. The extrinsic calibration technique used here is similar to the calibration of a 2D laser range finder and a single camera as proposed by Zhang (2004), but has been extended to the case where we have a 3D laser scanner and an omnidirectional camera system. The procedure requires a planar checkerboard pattern to be observed simultaneously from the laser scanner and the camera system from a minimum of 3 views. The normal of the planar surface and 3D points lying on the surface constrain the relative position and orientation of the laser scanner and the omnidirectional camera system. These constraints can be used to form a non-linear optimization problem that is solved for the extrinsic calibration parameters and the covariance associated with the estimated parameters. Results are presented for a real world data set collected by a vehicle mounted with a 3D laser scanner and an omnidirectional camera system.
Published: 2010
Full Text: View/download PDF

36. Enriching object detection with 2D-3D registration and continuous viewpoint estimation

Author: Sam Corbett-Davies, Silvio Savarese, Christopher Choy, and Michael Stark
Subjects: Discriminative model, Computer science, business.industry, Computer vision, Artificial intelligence, Object (computer science), business, Pose, Object detection
Abstract: A large body of recent work on object detection has focused on exploiting 3D CAD model databases to improve detection performance. Many of these approaches work by aligning exact 3D models to images using templates generated from renderings of the 3D models at a set of discrete viewpoints. However, the training procedures for these approaches are computationally expensive and require gigabytes of memory and storage, while the viewpoint discretization hampers pose estimation performance. We propose an efficient method for synthesizing templates from 3D models that runs on the fly - that is, it quickly produces detectors for an arbitrary viewpoint of a 3D model without expensive dataset-dependent training or template storage. Given a 3D model and an arbitrary continuous detection viewpoint, our method synthesizes a discriminative template by extracting features from a rendered view of the object and decorrelating spatial dependences among the features. Our decorrelation procedure relies on a gradient-based algorithm that is more numerically stable than standard decomposition-based procedures, and we efficiently search for candidate detections by computing FFT-based template convolutions. Due to the speed of our template synthesis procedure, we are able to perform joint optimization of scale, translation, continuous rotation, and focal length using the Metropolis-Hastings algorithm. We provide an efficient GPU implementation of our algorithm, and we validate its performance on the 3D Object Classes and PASCAL3D+ datasets.
Published: 2015
Full Text: View/download PDF

37. Data-driven 3D Voxel Patterns for object category recognition

Author: Silvio Savarese, Yu Xiang, Wongun Choi, and Yuanqing Lin
Subjects: Computer science, business.industry, Cognitive neuroscience of visual object recognition, Pattern recognition, computer.software_genre, 3D pose estimation, Object-class detection, Voxel, Segmentation, Computer vision, Viola–Jones object detection framework, Artificial intelligence, Representation (mathematics), business, Pose, computer
Abstract: Despite the great progress achieved in recognizing objects as 2D bounding boxes in images, it is still very challenging to detect occluded objects and estimate the 3D properties of multiple objects from a single image. In this paper, we propose a novel object representation, 3D Voxel Pattern (3DVP), that jointly encodes the key properties of objects including appearance, 3D shape, viewpoint, occlusion and truncation. We discover 3DVPs in a data-driven way, and train a bank of specialized detectors for a dictionary of 3DVPs. The 3DVP detectors are capable of detecting objects with specific visibility patterns and transferring the meta-data from the 3DVPs to the detected objects, such as 2D segmentation mask, 3D pose as well as occlusion or truncation boundaries. The transferred meta-data allows us to infer the occlusion relationship among objects, which in turn provides improved object recognition results. Experiments are conducted on the KITTI detection benchmark [17] and the outdoor-scene dataset [41]. We improve state-of-the-art results on car detection and pose estimation with notable margins (6% in difficult data of KITTI). We also verify the ability of our method in accurately segmenting objects from the background and localizing them in 3D.
Published: 2015
Full Text: View/download PDF

38. 3D Reconstruction by Shadow Carving: Theory and Practical Evaluation

Author: Fausto Bernardini, Holly Rushmeier, Pietro Perona, Silvio Savarese, and M. Andreetto
Subjects: Carving, business.industry, Computer science, Carve out, 3D reconstruction, Image processing, Iterative reconstruction, Silhouette, Rendering (computer graphics), Artificial Intelligence, Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, business, Error detection and correction, Software
Abstract: Cast shadows are an informative cue to the shape of objects. They are particularly valuable for discovering an object's concavities, which are not available from other cues such as occluding boundaries. We propose a new method for recovering shape from shadows which we call shadow carving. Given a conservative estimate of the volume occupied by an object, it is possible to identify and carve away regions of this volume that are inconsistent with the observed pattern of shadows. We prove a theorem that guarantees that when these regions are carved away from the shape, the shape still remains conservative. Shadow carving overcomes limitations of previous studies on shape from shadows because it is robust with respect to errors in shadow detection and it allows the reconstruction of objects in the round, rather than just bas-reliefs. We propose a reconstruction system to recover shape from silhouettes and shadow carving. The silhouettes are used to reconstruct the initial conservative estimate of the object's shape and shadow carving is used to carve out the concavities. We have simulated our reconstruction system with a commercial rendering package to explore the design parameters and assess the accuracy of the reconstruction. We have also implemented our reconstruction scheme in a table-top system and present the results of scanning several objects.
Published: 2006
Full Text: View/download PDF

39. Combining 3D Shape, Color, and Motion for Robust Anytime Tracking

Author: Silvio Savarese, Sebastian Thrun, David Held, and Jesse Levinson
Subjects: Millisecond, Computer science, business.industry, Posterior probability, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Tracking (particle physics), Motion cues, Motion (physics), Robustness (computer science), Video tracking, Computer vision, Artificial intelligence, Baseline (configuration management), business
Abstract: Although object tracking has been studied for decades, real-time tracking algorithms often suffer from low accuracy and poor robustness when confronted with difficult, real-world data. We present a tracker that combines 3D shape, color (when available), and motion cues to accurately track moving objects in real time. Our tracker allocates computational effort based on the shape of the posterior distribution. Starting with a coarse approximation to the posterior, the tracker successively refines this distribution, increasing in tracking accuracy over time. The tracker can thus be run for any amount of time, after which the current approximation to the posterior is returned. Even at a minimum runtime of 0.7 milliseconds, our method outperforms all of the baseline methods of similar speed by at least 10%. If our tracker is allowed to run for longer, the accuracy continues to improve, and it continues to outperform all baseline methods. Our tracker is thus anytime, allowing the speed or accuracy to be optimized based on the needs of the application.
Published: 2014
Full Text: View/download PDF

40. Learning an Image-Based Motion Context for Multiple People Tracking

Author: Silvio Savarese, Alina Kuznetsova, Laura Leal-Taixé, Michele Fenzi, and Bodo Rosenhahn
Subjects: Feature (computer vision), business.industry, Social force model, Context (language use), Computer vision, Artificial intelligence, Representation (mathematics), business, Tracking (particle physics), Motion (physics), Mathematics, Random forest, Feature detection (computer vision)
Abstract: We present a novel method for multiple people tracking that leverages a generalized model for capturing interactions among individuals. At the core of our model lies a learned dictionary of interaction feature strings which capture relationships between the motions of targets. These feature strings, created from low-level image features, lead to a much richer representation of the physical interactions between targets compared to hand-specified social force models that previous works have introduced for tracking. One disadvantage of using social forces is that all pedestrians must be detected in order for the forces to be applied, while our method is able to encode the effect of undetected targets, making the tracker more robust to partial occlusions. The interaction feature strings are used in a Random Forest framework to track targets according to the features surrounding them. Results on six publicly available sequences show that our method outperforms state-of-the-art approaches in multiple people tracking.
Published: 2014
Full Text: View/download PDF

41. Beyond PASCAL: A benchmark for 3D object detection in the wild

Author: Silvio Savarese, Roozbeh Mottaghi, and Yu Xiang
Subjects: 2d images, Computer science, business.industry, Testbed, Solid modeling, Pascal (programming language), Object detection, Computer vision, Electronic design automation, Artificial intelligence, business, Pose, computer, computer.programming_language
Abstract: 3D object detection and pose estimation methods have become popular in recent years since they can handle ambiguities in 2D images and also provide a richer description for objects compared to 2D object detectors. However, most of the datasets for 3D recognition are limited to a small number of images per category or are captured in controlled environments. In this paper, we contribute the PASCAL3D+ dataset, which is a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 [4] with 3D annotations. Furthermore, more images are added for each category from ImageNet [3]. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area. We provide the results of variations of DPM [6] on our new dataset for object detection and viewpoint estimation in different scenarios, which can be used as baselines for the community. Our benchmark is available online at http://cvgl.stanford.edu/projects/pascal3d
Published: 2014
Full Text: View/download PDF

42. Understanding the 3D layout of a cluttered room from multiple images

Author: Li Fei-Fei, Axel Furlan, Sid Yingze Bao, and Silvio Savarese
Subjects: Structure (mathematical logic), business.industry, Computer science, Cognitive neuroscience of visual object recognition, Image segmentation, Viewpoints, Semantics, ING-INF/05 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI, Scene understanding, indoor, clutter, moving camera, robotics, computer vision, Image (mathematics), Computer vision, Ceiling (aeronautics), Augmented reality, Artificial intelligence, business
Abstract: We present a novel framework for robustly understanding the geometrical and semantic structure of a cluttered room from a small number of images captured from different viewpoints. The tasks we seek to address include: i) estimating the 3D layout of the room - that is, the 3D configuration of floor, walls and ceiling; ii) identifying and localizing all the foreground objects in the room. We jointly use multiview geometry constraints and image appearance to identify the best room layout configuration. Extensive experimental evaluation demonstrates that our estimation results are more complete and accurate in estimating 3D room structure and recognizing objects than alternative state-of-the-art algorithms. In addition, we show an augmented reality mobile application to highlight the high accuracy of our method, which may be beneficial to many computer vision applications.
Published: 2014
Full Text: View/download PDF

43. Monocular Multiview Object Tracking with 3D Aspect Parts

Author: Silvio Savarese, Roozbeh Mottaghi, Yu Xiang, and Changkyu Song
Subjects: Monocular, Computer science, business.industry, Video tracking, Computer vision, Artificial intelligence, Particle filter, Object (computer science), Focus (optics), business, Tracking (particle physics)
Abstract: In this work, we focus on the problem of tracking objects under significant viewpoint variations, which poses a big challenge to traditional object tracking methods. We propose a novel method to track an object and estimate its continuous pose and part locations under severe viewpoint change. In order to handle the change in topological appearance introduced by viewpoint transformations, we represent objects with 3D aspect parts and model the relationship between viewpoint and 3D aspect parts in a part-based particle filtering framework. Moreover, we show that instance-level online-learned part appearance can be incorporated into our model, which makes it more robust in difficult scenarios with occlusions. Experiments are conducted on a new dataset of challenging YouTube videos and a subset of the KITTI dataset [14] that include significant viewpoint variations, as well as a standard sequence for car tracking. We demonstrate that our method is able to track the 3D aspect parts and the viewpoint of objects accurately despite significant changes in viewpoint.
Published: 2014
Full Text: View/download PDF

44. 3D Scene Understanding by Voxel-CRF

Author: Silvio Savarese, Pushmeet Kohli, and Byung-soo Kim
Subjects: Conditional random field, business.industry, Computer science, 3D reconstruction, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Pattern recognition, Iterative reconstruction, computer.software_genre, Object (computer science), Voxel, RGB color model, Computer vision, Artificial intelligence, business, computer, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel, even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Version 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
Published: 2013
Full Text: View/download PDF

45. Label transfer exploiting three-dimensional structure for semantic segmentation

Author: Valeria Garro, Andrea Fusiello, and Silvio Savarese
Subjects: Markov random field, business.industry, Scale-space segmentation, Pattern recognition, Image segmentation, Image (mathematics), Term (time), Computer Science::Computer Vision and Pattern Recognition, Structure from motion, Segmentation, Computer vision, Pairwise comparison, Artificial intelligence, business, Mathematics
Abstract: This paper deals with the problem of computing a semantic segmentation of an image via label transfer from an already labeled image set. In particular it proposes a method that takes advantage of sparse 3D structure to infer the category of superpixel in the novel image. The label assignment is computed by a Markov random field that has the superpixels of the image as nodes. The data term combines labeling proposals from the appearance of the superpixel and from the 3D structure, while the pairwise term incorporates spatial context, both in the image and in 3D space. Exploratory results indicate that 3D structure, albeit sparse, improves the process of label transfer.
Published: 2013
Full Text: View/download PDF

46. Understanding Indoor Scenes Using 3D Geometric Phrases

Author: Caroline Pantofaru, Yu-Wei Chao, Silvio Savarese, and Wongun Choi
Subjects: Phrase, Computer science, business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scene statistics, Pattern recognition, Solid modeling, Object (computer science), Semantics, Object detection, Core (graph theory), Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Visual scene understanding is a difficult problem interleaving object detection, geometric reasoning and scene classification. We present a hierarchical scene model for learning and reasoning about complex indoor scenes which is computationally tractable, can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase Model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
Published: 2013
Full Text: View/download PDF

47. Dense Object Reconstruction with Semantic Priors

Author: Yuanqing Lin, Manmohan Chandraker, Silvio Savarese, and Sid Yingze Bao
Subjects: business.industry, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Point cloud, Pattern recognition, Iterative reconstruction, Object (computer science), Object detection, Set (abstract data type), Active shape model, Prior probability, Structure from motion, Computer vision, Artificial intelligence, business, ComputingMethodologies_COMPUTERGRAPHICS, Mathematics
Abstract: We present a dense reconstruction approach that overcomes the drawbacks of traditional multiview stereo by incorporating semantic information in the form of learned category-level shape priors and object detection. Given training data comprised of 3D scans and images of objects from various viewpoints, we learn a prior comprised of a mean shape and a set of weighted anchor points. The former captures the commonality of shapes across the category, while the latter encodes similarities between instances in the form of appearance and spatial consistency. We propose robust algorithms to match anchor points across instances that enable learning a mean shape for the category, even with large shape variations across instances. We model the shape of an object instance as a warped version of the category mean, along with instance-specific details. Given multiple images of an unseen instance, we collate information from 2D object detectors to align the structure from motion point cloud with the mean shape, which is subsequently warped and refined to approach the actual shape. Extensive experiments demonstrate that our model is general enough to learn semantic priors for different object categories, yet powerful enough to reconstruct individual shapes with large variations. Qualitative and quantitative evaluations show that our framework can produce more accurate reconstructions than alternative state-of-the-art multiview stereo systems.
Published: 2013
Full Text: View/download PDF

48. Accurate Localization of 3D Objects from RGB-D Data Using Segmentation Hypotheses

Author: Silvio Savarese, Byung-soo Kim, and Shili Xu
Subjects: Computer science, Segmentation-based object categorization, business.industry, 3D single-object recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Cognitive neuroscience of visual object recognition, Pattern recognition, Image segmentation, Object detection, Support vector machine, Object-class detection, RGB color model, Viola–Jones object detection framework, Computer vision, Segmentation, Artificial intelligence, business
Abstract: In this paper we focus on the problem of detecting objects in 3D from RGB-D images. We propose a novel framework that explores the compatibility between segmentation hypotheses of the object in the image and the corresponding 3D map. Our framework allows us to discover the optimal location of the object using a generalization of the structural latent SVM formulation in 3D as well as the definition of a new loss function defined over the 3D space in training. We evaluate our method using two existing RGB-D datasets. Extensive quantitative and qualitative experimental results show that our proposed approach outperforms state-of-the-art methods as well as a number of baseline approaches for both 3D and 2D object recognition tasks.
Published: 2013
Full Text: View/download PDF

49. Layout Estimation of Highly Cluttered Indoor Scenes Using Geometric and Semantic Cues

Author: Wongun Choi, Yu-Wei Chao, Silvio Savarese, and Caroline Pantofaru
Subjects: Structure (mathematical logic), business.industry, Computer science, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Scene statistics, Vanishing point estimation, Data set, 3d space, Clutter, Computer vision, Artificial intelligence, Vanishing point, Focus (optics), business, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: Recovering the spatial layout of cluttered indoor scenes is a challenging problem. Current methods generate layout hypotheses from vanishing point estimates produced using 2D image features. This method fails in highly cluttered scenes in which most of the image features come from clutter instead of the room’s geometric structure. In this paper, we propose to use human detections as cues to more accurately estimate the vanishing points. Our method is built on top of the fact that people are often the focus of indoor scenes, and that the scene and the people within the scene should have consistent geometric configurations in 3D space. We contribute a new data set of highly cluttered indoor scenes containing people, on which we provide baselines and evaluate our method. This evaluation shows that our approach improves 3D interpretation of scenes.
Published: 2013
Full Text: View/download PDF

50. Free your Camera: 3D Indoor Scene Understanding from Arbitrary Camera Motion

Author: Axel Furlan, Li Fei-Fei, Domenico G. Sorrenti, Silvio Savarese, and Stephen Miller
Subjects: business.industry, Computer science, Computation, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Probabilistic logic, Monocular video, scene understanding, Observer (special relativity), Limiting, robot vision, 3D scene layout, computer vision, ING-INF/05 - SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI, Motion estimation, Computer vision, Artificial intelligence, business
Abstract: Many works have been presented for indoor scene understanding, yet few of them combine structural reasoning with full motion estimation in a real-time oriented approach. In this work we address the problem of estimating the 3D structural layout of complex and cluttered indoor scenes from monocular video sequences, where the observer can freely move in the surrounding space. We propose an effective probabilistic formulation that allows us to generate, evaluate and optimize layout hypotheses by integrating new image evidence as the observer moves. Compared to state-of-the-art work, our approach makes significantly less limiting hypotheses about the scene and the observer (e.g., Manhattan world assumption, known camera motion). We introduce a new challenging dataset and present an extensive experimental evaluation, which demonstrates that our formulation reaches near-real-time computation time and outperforms state-of-the-art methods while operating in significantly less constrained conditions.
Published: 2013
Full Text: View/download PDF
