52 results on '"P Bischof"'
Search Results
2. Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection
- Author
-
Fruhwirth-Reisinger, Christian, Lin, Wei, Malić, Dušan, Bischof, Horst, and Possegger, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Accurate 3D object detection in LiDAR point clouds is crucial for autonomous driving systems. To achieve state-of-the-art performance, the supervised training of detectors requires large amounts of human-annotated data, which is expensive to obtain and restricted to predefined object categories. To mitigate manual labeling efforts, recent unsupervised object detection approaches generate class-agnostic pseudo-labels for moving objects, subsequently serving as supervision signal to bootstrap a detector. Despite promising results, these approaches do not provide class labels or generalize well to static objects. Furthermore, they are mostly restricted to data containing multiple drives from the same scene or images from a precisely calibrated and synchronized camera setup. To overcome these limitations, we propose a vision-language-guided unsupervised 3D detection approach that operates exclusively on LiDAR point clouds. We transfer CLIP knowledge to classify point clusters of static and moving objects, which we discover by exploiting the inherent spatio-temporal information of LiDAR point clouds for clustering, tracking, as well as box and label refinement. Our approach outperforms state-of-the-art unsupervised 3D object detectors on the Waymo Open Dataset ($+23~\text{AP}_{3D}$) and Argoverse 2 ($+7.9~\text{AP}_{3D}$) and provides class labels not solely based on object size assumptions, marking a significant advancement in the field., Comment: Accepted to BMVC 2024
- Published
- 2024
3. KerasCV and KerasNLP: Vision and Language Power-Ups
- Author
-
Watson, Matthew, Sreepathihalli, Divyashree Shivakumar, Chollet, Francois, Gorner, Martin, Sodhia, Kiranbir, Sampath, Ramesh, Patel, Tirth, Jin, Haifeng, Kovelamudi, Neel, Rasskin, Gabriel, Saadat, Samaneh, Wood, Luke, Qian, Chen, Bischof, Jonathan, Stenbit, Ian, Sharma, Abheesht, and Mishra, Anshuman
- Subjects
Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Software Engineering, I.2.5, I.2.7, I.2.10 - Abstract
We present the Keras domain packages KerasCV and KerasNLP, extensions of the Keras API for Computer Vision and Natural Language Processing workflows, capable of running on either JAX, TensorFlow, or PyTorch. These domain packages are designed to enable fast experimentation, with a focus on ease-of-use and performance. We adopt a modular, layered design: at the library's lowest level of abstraction, we provide building blocks for creating models and data preprocessing pipelines, and at the library's highest level of abstraction, we provide pretrained "task" models for popular architectures such as Stable Diffusion, YOLOv8, GPT2, BERT, Mistral, CLIP, Gemma, T5, etc. Task models have built-in preprocessing, pretrained weights, and can be fine-tuned on raw inputs. To enable efficient training, we support XLA compilation for all models, and run all preprocessing via a compiled graph of TensorFlow operations using the tf.data API. The libraries are fully open-source (Apache 2.0 license) and available on GitHub., Comment: Submitted to Journal of Machine Learning Open Source Software
- Published
- 2024
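As a quick illustration of the "task" model workflow summarized in the entry above, here is a minimal usage sketch. It is not taken from the paper; the preset name and the `keras_nlp` import path are assumptions based on the public KerasNLP releases.

```python
# Hypothetical sketch: fine-tuning a KerasNLP "task" model directly on raw strings.
# The preset name "bert_base_en_uncased" is an assumption, not from the paper.
import keras_nlp

classifier = keras_nlp.models.BertClassifier.from_preset(
    "bert_base_en_uncased", num_classes=2
)
# Task models bundle preprocessing and pretrained weights, so raw text works as input.
classifier.fit(
    x=["a great movie", "a terrible movie"],  # toy raw-text inputs
    y=[1, 0],
    batch_size=2,
)
print(classifier.predict(["an excellent film"]))
```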
4. Occlusion Handling in 3D Human Pose Estimation with Perturbed Positional Encoding
- Author
-
Azizi, Niloofar, Fayyaz, Mohsen, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Understanding human behavior fundamentally relies on accurate 3D human pose estimation. Graph Convolutional Networks (GCNs) have recently shown promising advancements, delivering state-of-the-art performance with rather lightweight architectures. In the context of graph-structured data, leveraging the eigenvectors of the graph Laplacian matrix for positional encoding is effective. Yet, the approach does not specify how to handle scenarios where edges in the input graph are missing. To this end, we propose a novel positional encoding technique, PerturbPE, that extracts consistent and regular components from the eigenbasis. Our method involves applying multiple perturbations and taking their average to extract the consistent and regular component from the eigenbasis. PerturbPE leverages the Rayleigh-Schrödinger Perturbation Theorem (RSPT) for calculating the perturbed eigenvectors. Employing this labeling technique enhances the robustness and generalizability of the model. Our results support our theoretical findings, e.g., our experimental analysis shows a performance enhancement of up to $12\%$ on the Human3.6M dataset in instances where occlusion resulted in the absence of one edge. Furthermore, our novel approach significantly enhances performance in scenarios where two edges are missing, setting a new state-of-the-art benchmark.
- Published
- 2024
5. Into the Fog: Evaluating Robustness of Multiple Object Tracking
- Author
-
Kirillova, Nadezda, Mirza, M. Jehanzeb, Bischof, Horst, and Possegger, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
State-of-the-art Multiple Object Tracking (MOT) approaches have shown remarkable performance when trained and evaluated on current benchmarks. However, these benchmarks primarily consist of clear weather scenarios, overlooking adverse atmospheric conditions such as fog, haze, smoke and dust. As a result, the robustness of trackers against these challenging conditions remains underexplored. To address this gap, we introduce a physics-based volumetric fog simulation method for arbitrary MOT datasets, utilizing frame-by-frame monocular depth estimation and a fog formation optical model. We enhance our simulation by rendering both homogeneous and heterogeneous fog and propose to use the dark channel prior method to estimate atmospheric light, showing promising results even in night and indoor scenes. We present the leading benchmark MOTChallenge (third release) augmented with fog (smoke for indoor scenes) of various intensities and conduct a comprehensive evaluation of MOT methods, revealing their limitations under fog and fog-like challenges.
- Published
- 2024
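The "fog formation optical model" mentioned in the entry above is, in its standard form, a per-pixel attenuation of the scene radiance combined with airlight. The snippet below sketches that standard (Koschmieder-style) model only; the paper's simulation, including heterogeneous fog and the dark channel prior for atmospheric light, is more involved, and the attenuation coefficient here is an arbitrary illustrative value.

```python
import numpy as np

def add_homogeneous_fog(image, depth, atmospheric_light, beta=0.05):
    """Standard optical model: I = J * t + A * (1 - t), with t = exp(-beta * depth).

    image: HxWx3 array in [0, 1], depth: HxW array in metres,
    atmospheric_light: scalar or RGB triple, beta: attenuation coefficient (assumed).
    """
    t = np.exp(-beta * depth)[..., None]   # transmission decreases with distance
    return image * t + np.asarray(atmospheric_light) * (1.0 - t)
```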
6. MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection
- Author
-
Micorek, Jakub, Possegger, Horst, Narnhofer, Dominik, Bischof, Horst, and Kozinski, Mateusz
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a novel approach to video anomaly detection: we treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution and model this distribution with a neural network. This lets us estimate the likelihood of test videos and detect video anomalies by thresholding the likelihood estimates. We train our video anomaly detector using a modification of denoising score matching, a method that injects training data with noise to facilitate modeling its distribution. To eliminate hyperparameter selection, we model the distribution of noisy video features across a range of noise levels and introduce a regularizer that tends to align the models for different levels of noise. At test time, we combine anomaly indications at multiple noise scales with a Gaussian mixture model. Running our video anomaly detector induces minimal delays as inference requires merely extracting the features and forward-propagating them through a shallow neural network and a Gaussian mixture model. Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance, both in the object-centric and in the frame-centric setup.
- Published
- 2024
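The entry above combines denoising score matching with multiple noise levels on pre-extracted video features. The following is a rough, self-contained sketch of such an objective; the network size, noise scales, conditioning, and weighting are placeholders, not the paper's choices.

```python
import torch
import torch.nn as nn

feature_dim = 512                                    # assumed feature size
score_net = nn.Sequential(nn.Linear(feature_dim + 1, 256), nn.ReLU(),
                          nn.Linear(256, feature_dim))
sigmas = torch.tensor([0.05, 0.1, 0.2, 0.4])         # assumed noise scales

def denoising_score_matching_loss(features):         # features: (B, feature_dim)
    sigma = sigmas[torch.randint(len(sigmas), (features.size(0),))].unsqueeze(1)
    noise = torch.randn_like(features) * sigma
    noisy = features + noise
    score = score_net(torch.cat([noisy, sigma], dim=1))  # condition on the noise scale
    target = -noise / sigma ** 2                      # score of the noised density
    return ((score - target) ** 2 * sigma ** 2).mean()
```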
7. GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data
- Author
-
Schinagl, David, Krispel, Georg, Fruhwirth-Reisinger, Christian, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Widely-used LiDAR-based 3D object detectors often neglect fundamental geometric information readily available from the object proposals in their confidence estimation. This is mostly due to architectural design choices, which were often adopted from the 2D image domain, where geometric context is rarely available. In 3D, however, considering the object properties and its surroundings in a holistic way is important to distinguish between true and false positive detections, e.g. occluded pedestrians in a group. To address this, we present GACE, an intuitive and highly efficient method to improve the confidence estimation of a given black-box 3D object detector. We aggregate geometric cues of detections and their spatial relationships, which enables us to properly assess their plausibility and consequently, improve the confidence estimation. This leads to consistent performance gains over a variety of state-of-the-art detectors. Across all evaluated detectors, GACE proves to be especially beneficial for the vulnerable road user classes, i.e. pedestrians and cyclists., Comment: ICCV 2023, code is available at https://github.com/dschinagl/gace
- Published
- 2023
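As a rough illustration of the idea in the entry above (re-scoring black-box detections from geometric cues), here is a minimal sketch; the input features and network size are illustrative placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn

class GeometricConfidenceRefiner(nn.Module):
    """Maps per-detection geometric cues to a refined confidence in [0, 1]."""

    def __init__(self, in_dim=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, geom):
        # geom: (N, in_dim) cues per detection, e.g. box dimensions, distance to
        # the sensor, number of points inside the box, original detector score.
        return self.mlp(geom).squeeze(-1)
```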
8. Counterfactual Image Generation for adversarially robust and interpretable Classifiers
- Author
-
Bischof, Rafael, Scheidegger, Florian, Kraus, Michael A., and Malossi, A. Cristiano I.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing - Abstract
Neural Image Classifiers are effective but inherently hard to interpret and susceptible to adversarial attacks. Solutions to both problems exist, among others, in the form of counterfactual examples generation to enhance explainability or adversarially augment training datasets for improved robustness. However, existing methods exclusively address only one of the issues. We propose a unified framework leveraging image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples that highlight salient regions for interpretability and act as adversarial samples to augment the dataset for more robustness. This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake". We assess the method's effectiveness by evaluating (i) the produced explainability masks on a semantic segmentation task for concrete cracks and (ii) the model's resilience against the Projected Gradient Descent (PGD) attack on a fruit defects detection problem. Our produced saliency maps are highly descriptive, achieving competitive IoU values compared to classical segmentation models despite being trained exclusively on classification labels. Furthermore, the model exhibits improved robustness to adversarial attacks, and we show how the discriminator's "fakeness" value serves as an uncertainty measure of the predictions.
- Published
- 2023
9. TAP: Targeted Prompting for Task Adaptive Generation of Textual Training Instances for Visual Classification
- Author
-
Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Possegger, Horst, Feris, Rogerio, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Vision and Language Models (VLMs), such as CLIP, have enabled visual recognition of a potentially unlimited set of categories described by text prompts. However, for the best visual recognition performance, these models still require tuning to better fit the data distributions of the downstream tasks, in order to overcome the domain shift from the web-based pre-training data. Recently, it has been shown that it is possible to effectively tune VLMs without any paired data, and in particular to effectively improve VLMs' visual recognition performance using text-only training data generated by Large Language Models (LLMs). In this paper, we dive deeper into this exciting text-only VLM training approach and explore ways it can be further improved by taking the specifics of the downstream task into account when sampling text data from LLMs. In particular, compared to the SOTA text-only VLM training approach, we demonstrate up to 8.4% performance improvement in (cross) domain-specific adaptation, up to 8.7% improvement in fine-grained recognition, and 3.1% overall average improvement in zero-shot classification compared to strong baselines., Comment: Code is available at: https://github.com/jmiemirza/TAP
- Published
- 2023
10. Sit Back and Relax: Learning to Drive Incrementally in All Weather Conditions
- Author
-
Leitner, Stefan, Mirza, M. Jehanzeb, Lin, Wei, Micorek, Jakub, Masana, Marc, Kozinski, Mateusz, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In autonomous driving scenarios, current object detection models show strong performance when tested in clear weather. However, their performance deteriorates significantly when tested in degrading weather conditions. In addition, even when adapted to perform robustly in a sequence of different weather conditions, they are often unable to perform well in all of them and suffer from catastrophic forgetting. To efficiently mitigate forgetting, we propose Domain-Incremental Learning through Activation Matching (DILAM), which employs unsupervised feature alignment to adapt only the affine parameters of a clear weather pre-trained network to different weather conditions. We propose to store these affine parameters as a memory bank for each weather condition and plug in the weather-specific parameters during driving (i.e. test time) when the respective weather conditions are encountered. Our memory bank is extremely lightweight, since affine parameters account for less than 2% of a typical object detector. Furthermore, contrary to previous domain-incremental learning approaches, we do not require the weather label when testing and propose to automatically infer the weather condition by a majority voting linear classifier., Comment: Intelligent Vehicle Conference (oral presentation)
- Published
- 2023
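A minimal sketch of the memory-bank mechanism described in the entry above, assuming a PyTorch detector with BatchNorm layers: only the affine (scale/shift) parameters are stored per weather condition and swapped in at test time. The helper names and the restriction to `BatchNorm2d` are hypothetical simplifications.

```python
import torch.nn as nn

def extract_affine(model):
    """Collect the affine (scale/shift) parameters of all BatchNorm layers."""
    return {name: (m.weight.data.clone(), m.bias.data.clone())
            for name, m in model.named_modules() if isinstance(m, nn.BatchNorm2d)}

def plug_in_affine(model, bank_entry):
    """Load one weather condition's affine parameters back into the model."""
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and name in bank_entry:
            weight, bias = bank_entry[name]
            m.weight.data.copy_(weight)
            m.bias.data.copy_(bias)

# memory_bank = {"fog": extract_affine(fog_adapted_model)}   # after adapting to fog
# plug_in_affine(detector, memory_bank["fog"])               # at test time
```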
11. LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections
- Author
-
Mirza, M. Jehanzeb, Karlinsky, Leonid, Lin, Wei, Kozinski, Mateusz, Possegger, Horst, Feris, Rogerio, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language - Abstract
Recently, large-scale pre-trained Vision and Language (VL) models have set a new state-of-the-art (SOTA) in zero-shot visual classification enabling open-vocabulary recognition of a potentially unlimited set of categories defined as simple language prompts. However, despite these great advances, the performance of these zero-shot classifiers still falls short of the results of dedicated (closed category set) classifiers trained with supervised fine-tuning. In this paper we show, for the first time, how to reduce this gap without any labels and without any paired VL data, using an unlabeled image collection and a set of texts auto-generated using a Large Language Model (LLM) describing the categories of interest and effectively substituting labeled visual instances of those categories. Using our label-free approach, we are able to attain significant performance improvements over the zero-shot performance of the base VL model and other contemporary methods and baselines on a wide variety of datasets, demonstrating absolute improvement of up to 11.7% (3.8% on average) in the label-free setting. Moreover, despite our approach being label-free, we observe 1.3% average gains over leading few-shot prompting baselines that do use 5-shot supervision., Comment: NeurIPS 2023 (Camera Ready) - Project Page: https://jmiemirza.github.io/LaFTer/
- Published
- 2023
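To make the text-only tuning idea in the entry above concrete, here is a highly simplified sketch: a linear classifier is trained on CLIP text embeddings of generated sentences and can later be applied to CLIP image embeddings thanks to the shared embedding space. The `clip` package calls follow the public OpenAI CLIP API; the example sentences stand in for LLM-generated text, and the rest of the actual LaFTer pipeline (pseudo-labeling, adapters) is omitted.

```python
import torch
import torch.nn.functional as F
import clip

model, _ = clip.load("ViT-B/32")
# Stand-ins for LLM-generated category descriptions (class 0: trucks, class 1: cats).
texts = ["a photo of a rusty pickup truck", "a photo of a sleeping tabby cat"]
labels = torch.tensor([0, 1])

with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(texts)).float()
    text_feats = F.normalize(text_feats, dim=-1)

classifier = torch.nn.Linear(text_feats.shape[1], 2)
loss = F.cross_entropy(classifier(text_feats), labels)
loss.backward()   # one illustrative optimization step on text embeddings only
```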
12. AIR-DA: Adversarial Image Reconstruction for Unsupervised Domain Adaptive Object Detection
- Author
-
Sun, Kunyang, Lin, Wei, Shi, Haoqin, Zhang, Zhengming, Huang, Yongming, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Unsupervised domain adaptive object detection is a challenging vision task where object detectors are adapted from a label-rich source domain to an unlabeled target domain. Recent advances prove the efficacy of the adversarial based domain alignment where the adversarial training between the feature extractor and domain discriminator results in domain-invariance in the feature space. However, due to the domain shift, domain discrimination, especially on low-level features, is an easy task. This results in an imbalance of the adversarial training between the domain discriminator and the feature extractor. In this work, we achieve a better domain alignment by introducing an auxiliary regularization task to improve the training balance. Specifically, we propose Adversarial Image Reconstruction (AIR) as the regularizer to facilitate the adversarial training of the feature extractor. We further design a multi-level feature alignment module to enhance the adaptation performance. Our evaluations across several datasets of challenging domain shifts demonstrate that the proposed method outperforms all previous methods, of both one- and two-stage, in most settings., Comment: Accepted at IEEE Robotics and Automation Letters 2023
- Published
- 2023
13. MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
- Author
-
Lin, Wei, Karlinsky, Leonid, Shvetsova, Nina, Possegger, Horst, Kozinski, Mateusz, Panda, Rameswar, Feris, Rogerio, Kuehne, Hilde, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI., Comment: Accepted at ICCV 2023
- Published
- 2023
14. TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering
- Author
-
Lin, Wei, Kukleva, Anna, Possegger, Horst, Kuehne, Hilde, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results., Comment: Computer Vision Winter Workshop 2023
- Published
- 2023
15. MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds
- Author
-
Krispel, Georg, Schinagl, David, Fruhwirth-Reisinger, Christian, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The sensing process of large-scale LiDAR point clouds inevitably causes large blind spots, i.e. regions not visible to the sensor. We demonstrate how these inherent sampling properties can be effectively utilized for self-supervised representation learning by designing a highly effective pre-training framework that considerably reduces the need for tedious 3D annotations to train state-of-the-art object detectors. Our Masked AutoEncoder for LiDAR point clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both the encoder and decoder during reconstruction. This results in more expressive and useful initialization, which can be directly applied to downstream perception tasks, such as 3D object detection or semantic segmentation for autonomous driving. In a novel reconstruction approach, MAELi distinguishes between empty and occluded space and employs a new masking strategy that targets the LiDAR's inherent spherical projection. Thereby, without any ground truth whatsoever and trained on single frames only, MAELi obtains an understanding of the underlying 3D scene geometry and semantics. To demonstrate the potential of MAELi, we pre-train backbones in an end-to-end manner and show the effectiveness of our unsupervised pre-trained weights on the tasks of 3D object detection and semantic segmentation., Comment: Accepted to WACV 2024, 16 pages
- Published
- 2022
16. Sparse Message Passing Network with Feature Integration for Online Multiple Object Tracking
- Author
-
Wang, Bisheng, Possegger, Horst, Bischof, Horst, and Cao, Guo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Existing Multiple Object Tracking (MOT) methods design complex architectures for better tracking performance. However, without a proper organization of input information, they still fail to perform tracking robustly and suffer from frequent identity switches. In this paper, we propose two novel methods together with a simple online Message Passing Network (MPN) to address these limitations. First, we explore different integration methods for the graph node and edge embeddings and put forward a new IoU (Intersection over Union) guided function, which improves long-term tracking and handles identity switches. Second, we introduce a hierarchical sampling strategy to construct sparser graphs which allows us to focus the training on more difficult samples. Experimental results demonstrate that a simple online MPN with these two contributions can perform better than many state-of-the-art methods. In addition, our association method generalizes well and can also improve the results of private detection based methods., Comment: 8 pages, 2 figures
- Published
- 2022
17. Video Test-Time Adaptation for Action Recognition
- Author
-
Lin, Wei, Mirza, Muhammad Jehanzeb, Kozinski, Mateusz, Possegger, Horst, Kuehne, Hilde, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. Code will be available at https://github.com/wlin-at/ViTTA., Comment: Accepted at CVPR 2023
- Published
- 2022
18. ActMAD: Activation Matching to Align Distributions for Test-Time-Training
- Author
-
Mirza, Muhammad Jehanzeb, Soneira, Pol Jané, Lin, Wei, Kozinski, Mateusz, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state-of-the-art performance on CIFAR-100C and ImageNet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance., Comment: CVPR 2023 - Project Page: https://jmiemirza.github.io/ActMAD/
- Published
- 2022
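The core alignment objective sketched in the entry above can be written compactly: per-feature means and variances of intermediate activations on the test batch are pulled towards statistics precomputed on the training data. The snippet below is an illustrative reconstruction, not the authors' code; how activations are hooked and which layers are used is left out.

```python
import torch

def activation_alignment_loss(test_acts, train_means, train_vars):
    """test_acts / train_means / train_vars: dicts keyed by layer name."""
    loss = torch.zeros(())
    for name, act in test_acts.items():          # act: (B, C, H, W) test activations
        mu = act.mean(dim=0)                     # per-feature mean over the batch
        var = act.var(dim=0)                     # per-feature variance over the batch
        loss = loss + (mu - train_means[name]).abs().mean() \
                    + (var - train_vars[name]).abs().mean()
    return loss
```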
19. MATE: Masked Autoencoders are Online 3D Test-Time Learners
- Author
-
Mirza, M. Jehanzeb, Shin, Inkyu, Lin, Wei, Schriebl, Andreas, Sun, Kunyang, Choe, Jaesung, Possegger, Horst, Kozinski, Mateusz, Kweon, In So, Yoon, Kun-Jin, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications., Comment: Code is available at this repository: https://github.com/jmiemirza/MATE
- Published
- 2022
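A conceptual sketch of one test-time-training step as described in the entry above: a small visible subset of the test point cloud is kept, the network reconstructs the full cloud, the weights are updated on that reconstruction loss, and only then is the sample classified. The `reconstruct`/`classify` methods and the crude one-sided Chamfer term are placeholders, not the paper's actual interface or loss.

```python
import torch

def masked_autoencoder_tta_step(model, points, optimizer, keep_ratio=0.05):
    """points: (B, N, 3) test point cloud; model is assumed to expose
    reconstruct() and classify() heads (hypothetical interface)."""
    n_keep = max(1, int(keep_ratio * points.size(1)))
    idx = torch.randperm(points.size(1), device=points.device)[:n_keep]
    visible = points[:, idx]                       # feed only the visible subset
    recon = model.reconstruct(visible)             # (B, M, 3) reconstructed points
    # Crude one-sided Chamfer distance as a stand-in reconstruction loss.
    loss = torch.cdist(recon, points).min(dim=-1).values.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model.classify(points)                  # classify after the update
```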
20. SAILOR: Scaling Anchors via Insights into Latent Object Representation
- Author
-
Malić, Dušan, Fruhwirth-Reisinger, Christian, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
LiDAR 3D object detection models are inevitably biased towards their training dataset. The detector clearly exhibits this bias when employed on a target dataset, particularly towards object sizes. However, object sizes vary heavily between domains due to, for instance, different labeling policies or geographical locations. State-of-the-art unsupervised domain adaptation approaches outsource methods to overcome the object size bias. Mainstream size adaptation approaches exploit target domain statistics, contradicting the original unsupervised assumption. Our novel unsupervised anchor calibration method addresses this limitation. Given a model trained on the source data, we estimate the optimal target anchors in a completely unsupervised manner. The main idea stems from an intuitive observation: by varying the anchor sizes for the target domain, we inevitably introduce noise or even remove valuable object cues. The latent object representation, perturbed by the anchor size, is closest to the learned source features only under the optimal target anchors. We leverage this observation for anchor size optimization. Our experimental results show that, without any retraining, we achieve competitive results even compared to state-of-the-art weakly-supervised size adaptation approaches. In addition, our anchor calibration can be combined with such existing methods, making them completely unsupervised., Comment: WACV 2023; code is available at https://github.com/malicd/sailor
- Published
- 2022
21. An Efficient Domain-Incremental Learning Approach to Drive in All Weather Conditions
- Author
-
Mirza, M. Jehanzeb, Masana, Marc, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Although deep neural networks enable impressive visual perception performance for autonomous driving, their robustness to varying weather conditions still requires attention. When adapting these models for changed environments, such as different weather conditions, they are prone to forgetting previously learned information. This catastrophic forgetting is typically addressed via incremental learning approaches which usually re-train the model by either keeping a memory bank of training samples or keeping a copy of the entire model or model parameters for each scenario. While these approaches show impressive results, they can be prone to scalability issues and their applicability for autonomous driving in all weather conditions has not been shown. In this paper we propose DISC -- Domain Incremental through Statistical Correction -- a simple online zero-forgetting approach which can incrementally learn new tasks (i.e. weather conditions) without requiring re-training or expensive memory banks. The only information we store for each task are the statistical parameters as we categorize each domain by the change in first and second order statistics. Thus, as each task arrives, we simply 'plug and play' the statistical vectors for the corresponding task into the model and it immediately starts to perform well on that task. We show the efficacy of our approach by testing it for object detection in a challenging domain-incremental autonomous driving scenario where we encounter different adverse weather conditions, such as heavy rain, fog, and snow., Comment: Accepted to CVPR Workshops - Camera Ready Version
- Published
- 2022
22. OccAM's Laser: Occlusion-based Attribution Maps for 3D Object Detectors on LiDAR Data
- Author
-
Schinagl, David, Krispel, Georg, Possegger, Horst, Roth, Peter M., and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
While 3D object detection in LiDAR point clouds is well-established in academia and industry, the explainability of these models is a largely unexplored field. In this paper, we propose a method to generate attribution maps for the detected objects in order to better understand the behavior of such models. These maps indicate the importance of each 3D point in predicting the specific objects. Our method works with black-box models: We do not require any prior knowledge of the architecture nor access to the model's internals, like parameters, activations or gradients. Our efficient perturbation-based approach empirically estimates the importance of each point by testing the model with randomly generated subsets of the input point cloud. Our sub-sampling strategy takes into account the special characteristics of LiDAR data, such as the depth-dependent point density. We show a detailed evaluation of the attribution maps and demonstrate that they are interpretable and highly informative. Furthermore, we compare the attribution maps of recent 3D object detection architectures to provide insights into their decision-making processes., Comment: CVPR 2022, code is available at https://github.com/dschinagl/occam
- Published
- 2022
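The perturbation-based attribution described in the entry above boils down to repeatedly scoring the detector on random subsets of the input cloud and crediting the kept points. The loop below is a strongly simplified sketch; the paper's sampling additionally accounts for LiDAR characteristics such as depth-dependent point density, and the scoring function here is a hypothetical placeholder.

```python
import numpy as np

def attribution_map(score_fn, points, n_samples=1000, keep_prob=0.5, seed=0):
    """score_fn(subset) -> similarity of the detection of interest on the subset;
    points: (N, 3+) LiDAR points. Returns a per-point importance estimate."""
    rng = np.random.default_rng(seed)
    importance = np.zeros(len(points))
    counts = np.full(len(points), 1e-6)
    for _ in range(n_samples):
        mask = rng.random(len(points)) < keep_prob   # random sub-sampling
        importance[mask] += score_fn(points[mask])   # credit the kept points
        counts[mask] += 1
    return importance / counts
```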
23. CycDA: Unsupervised Cycle Domain Adaptation from Image to Video
- Author
-
Lin, Wei, Kukleva, Anna, Sun, Kunyang, Possegger, Horst, Kuehne, Hilde, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Although action recognition has achieved impressive results over recent years, both collection and annotation of video training data are still time-consuming and cost intensive. Therefore, image-to-video adaptation has been proposed to exploit labeling-free web image source for adapting on unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and video frames; (2) modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation by leveraging the joint spatial information in images and videos on the one hand and, on the other hand, training an independent spatio-temporal model to bridge the modality gap. We alternate between the spatial and spatio-temporal learning with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as for mixed-source domain adaptation achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation. Code is available at https://github.com/wlin-at/CycDA., Comment: Accepted at ECCV2022. Supplementary included
- Published
- 2022
24. 3D Human Pose Estimation Using Möbius Graph Convolutional Networks
- Author
-
Azizi, Niloofar, Possegger, Horst, Rodolà, Emanuele, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
3D human pose estimation is fundamental to understanding human behavior. Recently, promising results have been achieved by graph convolutional networks (GCNs), which achieve state-of-the-art performance and provide rather light-weight architectures. However, a major limitation of GCNs is their inability to encode all the transformations between joints explicitly. To address this issue, we propose a novel spectral GCN using the Möbius transformation (MöbiusGCN). In particular, this allows us to directly and explicitly encode the transformation between joints, resulting in a significantly more compact representation. Compared to even the lightest architectures so far, our novel approach requires 90-98% fewer parameters, i.e. our lightest MöbiusGCN uses only 0.042M trainable parameters. Besides the drastic parameter reduction, explicitly encoding the transformation of joints also enables us to achieve state-of-the-art results. We evaluate our approach on the two challenging pose estimation benchmarks, Human3.6M and MPI-INF-3DHP, demonstrating both state-of-the-art results and the generalization capabilities of MöbiusGCN.
- Published
- 2022
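For reference, the Möbius transformation referred to in the entry above is the complex linear-fractional map shown below (general form only; how it is parametrized inside the spectral GCN is specific to the paper).

```latex
w = \frac{az + b}{cz + d}, \qquad a, b, c, d \in \mathbb{C}, \quad ad - bc \neq 0
```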
25. The Norm Must Go On: Dynamic Unsupervised Domain Adaptation by Normalization
- Author
-
Mirza, M. Jehanzeb, Micorek, Jakub, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Domain adaptation is crucial to adapt a learned model to new scenarios, such as domain shifts or changing data distributions. Current approaches usually require a large amount of labeled or unlabeled data from the shifted domain. This can be a hurdle in fields which require continuous dynamic adaptation or suffer from scarcity of data, e.g. autonomous driving in challenging weather conditions. To address this problem of continuous adaptation to distribution shifts, we propose Dynamic Unsupervised Adaptation (DUA). By continuously adapting the statistics of the batch normalization layers we modify the feature representations of the model. We show that by sequentially adapting a model with only a fraction of unlabeled data, a strong performance gain can be achieved. With even less than 1% of unlabeled data from the target domain, DUA already achieves competitive results to strong baselines. In addition, the computational overhead is minimal in contrast to previous approaches. Our approach is simple, yet effective and can be applied to any architecture which uses batch normalization as one of its components. We show the utility of DUA by evaluating it on a variety of domain adaptation datasets and tasks including object recognition, digit recognition and object detection., Comment: Accepted to CVPR 2022 - Camera Ready Version - Code: https://github.com/jmiemirza/DUA
- Published
- 2021
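A minimal sketch of adapting batch-normalization statistics on unlabeled target data, in the spirit of the entry above; this simplified version uses a fixed momentum and a fixed number of batches, whereas the paper adapts sequentially with its own momentum scheme.

```python
import torch
import torch.nn as nn

def adapt_bn_statistics(model, unlabeled_loader, momentum=0.1, max_batches=16):
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.train()                    # let running mean/var follow the new domain
            m.momentum = momentum
    with torch.no_grad():                # forward passes only; no labels, no gradients
        for i, x in enumerate(unlabeled_loader):
            model(x)
            if i + 1 >= max_batches:
                break
    return model
```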
26. Efficient Multi-Organ Segmentation Using SpatialConfiguration-Net with Low GPU Memory Requirements
- Author
-
Thaler, Franz, Payer, Christian, Bischof, Horst, and Stern, Darko
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Even though many semantic segmentation methods exist that are able to perform well on many medical datasets, often, they are not designed for direct use in clinical practice. The two main concerns are generalization to unseen data with a different visual appearance, e.g., images acquired using a different scanner, and efficiency in terms of computation time and required Graphics Processing Unit (GPU) memory. In this work, we employ a multi-organ segmentation model based on the SpatialConfiguration-Net (SCN), which integrates prior knowledge of the spatial configuration among the labelled organs to resolve spurious responses in the network outputs. Furthermore, we modified the architecture of the segmentation model to reduce its memory footprint as much as possible without drastically impacting the quality of the predictions. Lastly, we implemented a minimal inference script for which we optimized both execution time and required GPU memory.
- Published
- 2021
27. FAST3D: Flow-Aware Self-Training for 3D Object Detectors
- Author
-
Fruhwirth-Reisinger, Christian, Opitz, Michael, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics - Abstract
In the field of autonomous driving, self-training is widely applied to mitigate distribution shifts in LiDAR-based 3D object detectors. This eliminates the need for expensive, high-quality labels whenever the environment changes (e.g., geographic location, sensor setup, weather condition). State-of-the-art self-training approaches, however, mostly ignore the temporal nature of autonomous driving data. To address this issue, we propose a flow-aware self-training method that enables unsupervised domain adaptation for 3D object detectors on continuous LiDAR point clouds. In order to get reliable pseudo-labels, we leverage scene flow to propagate detections through time. In particular, we introduce a flow-based multi-target tracker, that exploits flow consistency to filter and refine resulting tracks. The emerged precise pseudo-labels then serve as a basis for model re-training. Starting with a pre-trained KITTI model, we conduct experiments on the challenging Waymo Open Dataset to demonstrate the effectiveness of our approach. Without any prior target domain knowledge, our results show a significant improvement over the state-of-the-art., Comment: Accepted to BMVC 2021
- Published
- 2021
28. Robustness of Object Detectors in Degrading Weather Conditions
- Author
-
Mirza, Muhammad Jehanzeb, Buerkle, Cornelius, Jarquin, Julio, Opitz, Michael, Oboril, Fabian, Scholl, Kay-Ulrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such autonomous safety critical systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes. In this paper we address this issue and perform one of the most detailed evaluations of single and dual modality architectures on data captured in real weather conditions. We analyse the performance degradation of these architectures in degrading weather conditions. We demonstrate that an object detection architecture performing well in clear weather might not be able to handle degrading weather conditions. We also perform ablation studies on the dual modality architectures and show their limitations., Comment: Accepted for publication at ITSC 2021
- Published
- 2021
29. FuseSeg: LiDAR Point Cloud Segmentation Fusing Multi-Modal Data
- Author
-
Krispel, Georg, Opitz, Michael, Waltner, Georg, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics, I.4.6, I.4.8, I.2.9, I.2.10 - Abstract
We introduce a simple yet effective fusion method of LiDAR and RGB data to segment LiDAR point clouds. Utilizing the dense native range representation of a LiDAR sensor and the setup calibration, we establish point correspondences between the two input modalities. Subsequently, we are able to warp and fuse the features from one domain into the other. Therefore, we can jointly exploit information from both data sources within one single network. To show the merit of our method, we extend SqueezeSeg, a point cloud segmentation network, with an RGB feature branch and fuse it into the original structure. Our extension called FuseSeg leads to an improvement of up to 18% IoU on the KITTI benchmark. In addition to the improved accuracy, we also achieve real-time performance at 50 fps, five times as fast as the KITTI LiDAR data recording speed., Comment: Accepted for publication in WACV 2020
- Published
- 2019
30. Integrating Spatial Configuration into Heatmap Regression Based CNNs for Landmark Localization
- Author
-
Payer, Christian, Štern, Darko, Bischof, Horst, and Urschler, Martin
- Subjects
Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition - Abstract
In many medical image analysis applications, often only a limited amount of training data is available, which makes training of convolutional neural networks (CNNs) challenging. In this work on anatomical landmark localization, we propose a CNN architecture that learns to split the localization task into two simpler sub-problems, reducing the need for large training datasets. Our fully convolutional SpatialConfiguration-Net (SCN) dedicates one component to locally accurate but ambiguous candidate predictions, while the other component improves robustness to ambiguities by incorporating the spatial configuration of landmarks. In our experimental evaluation, we show that the proposed SCN outperforms related methods in terms of landmark localization error on size-limited datasets., Comment: MIDL 2019 [arXiv:1907.08612]
- Published
- 2019
31. MURAUER: Mapping Unlabeled Real Data for Label AUstERity
- Author
-
Poier, Georg, Opitz, Michael, Schinagl, David, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, I.2.6, I.2.10, I.4.5, I.4.8, I.4.10, I.5.4 - Abstract
Data labeling for learning 3D hand pose estimation models is a huge effort. Readily available, accurately labeled synthetic data has the potential to reduce the effort. However, to successfully exploit synthetic data, current state-of-the-art methods still require a large amount of labeled real data. In this work, we remove this requirement by learning to map from the features of real data to the features of synthetic data mainly using a large amount of synthetic and unlabeled real data. We exploit unlabeled data using two auxiliary objectives, which enforce that (i) the mapped representation is pose specific and (ii) at the same time, the distributions of real and synthetic data are aligned. While pose specificity is enforced by a self-supervisory signal requiring that the representation is predictive for the appearance from different views, distributions are aligned by an adversarial term. In this way, we can significantly improve the results of the baseline system, which does not use unlabeled data and outperform many recent approaches already with about 1% of the labeled real data. This presents a step towards faster deployment of learning based hand pose estimation, making it accessible for a larger range of applications., Comment: WACV 2019; Project page at https://poier.github.io/murauer
- Published
- 2018
32. Instance Segmentation and Tracking with Cosine Embeddings and Recurrent Hourglass Networks
- Author
-
Payer, Christian, Štern, Darko, Neff, Thomas, Bischof, Horst, and Urschler, Martin
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Different to semantic segmentation, instance segmentation assigns unique labels to each individual instance of the same class. In this work, we propose a novel recurrent fully convolutional network architecture for tracking such instance segmentations over time. The network architecture incorporates convolutional gated recurrent units (ConvGRU) into a stacked hourglass network to utilize temporal video information. Furthermore, we train the network with a novel embedding loss based on cosine similarities, such that the network predicts unique embeddings for every instance throughout videos. Afterwards, these embeddings are clustered among subsequent video frames to create the final tracked instance segmentations. We evaluate the recurrent hourglass network by segmenting left ventricles in MR videos of the heart, where it outperforms a network that does not incorporate video information. Furthermore, we show applicability of the cosine embedding loss for segmenting leaf instances on still images of plants. Finally, we evaluate the framework for instance segmentation and tracking on six datasets of the ISBI celltracking challenge, where it shows state-of-the-art performance., Comment: Accepted for ORAL presentation at MICCAI 2018
- Published
- 2018
33. Deep 2.5D Vehicle Classification with Sparse SfM Depth Prior for Automated Toll Systems
- Author
-
Waltner, Georg, Maurer, Michael, Holzmann, Thomas, Ruprecht, Patrick, Opitz, Michael, Possegger, Horst, Fraundorfer, Friedrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Automated toll systems rely on proper classification of the passing vehicles. This is especially difficult when the images used for classification only cover parts of the vehicle. To obtain information about the whole vehicle, we reconstruct the vehicle as a 3D object and exploit this additional information within a Convolutional Neural Network (CNN). However, when using deep networks for 3D object classification, large amounts of dense 3D models are required for good accuracy, which are often neither available nor feasible to process due to memory requirements. Therefore, in our method we reproject the 3D object onto the image plane using the reconstructed points, lines or both. We utilize this sparse depth prior within an auxiliary network branch that acts as a regularizer during training. We show that this auxiliary regularizer helps to improve accuracy compared to 2D classification on a real-world dataset. Furthermore, due to the design of the network, at test time only the 2D camera images are required for classification which enables the usage in portable computer vision systems., Comment: Submitted to the IEEE International Conference on Intelligent Transportation Systems 2018 (ITSC), 6 pages, 4 figures; changed format in compliance with adapted IEEE template
- Published
- 2018
34. Learning Pose Specific Representations by Predicting Different Views
- Author
-
Poier, Georg, Schinagl, David, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Learning, I.2.6, I.2.10, I.4.5, I.4.8, I.4.10, I.5.4 - Abstract
The labeled data required to learn pose estimation for articulated objects is difficult to provide in the desired quantity, realism, density, and accuracy. To address this issue, we develop a method to learn representations, which are very specific for articulated poses, without the need for labeled training data. We exploit the observation that the object pose of a known object is predictive for the appearance in any known view. That is, given only the pose and shape parameters of a hand, the hand's appearance from any viewpoint can be approximated. To exploit this observation, we train a model that -- given input from one view -- estimates a latent representation, which is trained to be predictive for the appearance of the object when captured from another viewpoint. Thus, the only necessary supervision is the second view. The training process of this model reveals an implicit pose representation in the latent space. Importantly, at test time the pose representation can be inferred using only a single view. In qualitative and quantitative experiments we show that the learned representations capture detailed pose information. Moreover, when training the proposed method jointly with labeled and unlabeled data, it consistently surpasses the performance of its fully supervised counterpart, while reducing the amount of needed labeled samples by at least one order of magnitude., Comment: CVPR 2018 (Spotlight); Project Page at https://poier.github.io/PreView/
- Published
- 2018
35. Prioritized Multi-View Stereo Depth Map Generation Using Confidence Prediction
- Author
-
Mostegel, Christian, Fraundorfer, Friedrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work, we propose a novel approach to prioritize the depth map computation of multi-view stereo (MVS) to obtain compact 3D point clouds of high quality and completeness at low computational cost. Our prioritization approach operates before the MVS algorithm is executed and consists of two steps. In the first step, we aim to find a good set of matching partners for each view. In the second step, we rank the resulting view clusters (i.e. key views with matching partners) according to their impact on the fulfillment of desired quality parameters such as completeness, ground resolution and accuracy. Additional to geometric analysis, we use a novel machine learning technique for training a confidence predictor. The purpose of this confidence predictor is to estimate the chances of a successful depth reconstruction for each pixel in each image for one specific MVS algorithm based on the RGB images and the image constellation. The underlying machine learning technique does not require any ground truth or manually labeled data for training, but instead adapts ideas from depth map fusion for providing a supervision signal. The trained confidence predictor allows us to evaluate the quality of image constellations and their potential impact to the resulting 3D reconstruction and thus builds a solid foundation for our prioritization approach. In our experiments, we are thus able to reach more than 70% of the maximal reachable quality fulfillment using only 5% of the available images as key views. For evaluating our approach within and across different domains, we use two completely different scenarios, i.e. cultural heritage preservation and reconstruction of single family houses., Comment: This paper was accepted to ISPRS Journal of Photogrammetry and Remote Sensing (https://www.journals.elsevier.com/isprs-journal-of-photogrammetry-and-remote-sensing) on March 21, 2018. The official version will be made available on ScienceDirect (https://www.sciencedirect.com)
- Published
- 2018
36. Deep Metric Learning with BIER: Boosting Independent Embeddings Robustly
- Author
-
Opitz, Michael, Waltner, Georg, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Learning similarity functions between image pairs with deep neural networks yields highly correlated activations of embeddings. In this work, we show how to improve the robustness of such embeddings by exploiting the independence within ensembles. To this end, we divide the last embedding layer of a deep network into an embedding ensemble and formulate training this ensemble as an online gradient boosting problem. Each learner receives a reweighted training sample from the previous learners. Further, we propose two loss functions which increase the diversity in our ensemble. These loss functions can be applied either for weight initialization or during training. Together, our contributions leverage large embedding sizes more effectively by significantly reducing correlation of the embedding and consequently increase retrieval accuracy of the embedding. Our method works with any differentiable loss function and does not introduce any additional parameters during test time. We evaluate our metric learning method on image retrieval tasks and show that it improves over state-of-the-art methods on the CUB 200-2011, Cars-196, Stanford Online Products, In-Shop Clothes Retrieval and VehicleID datasets., Comment: Extension to our paper BIER: Boosting Independent Embeddings Robustly (ICCV 2017 oral) - submitted to PAMI
- Published
- 2018
37. Scalable Surface Reconstruction from Point Clouds with Extreme Scale and Density Diversity
- Author
-
Mostegel, Christian, Prettenthaler, Rudolf, Fraundorfer, Friedrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics - Abstract
In this paper we present a scalable approach for robustly computing a 3D surface mesh from multi-scale multi-view stereo point clouds that can handle extreme jumps of point density (in our experiments three orders of magnitude). The backbone of our approach is a combination of octree data partitioning, local Delaunay tetrahedralization and graph cut optimization. Graph cut optimization is used twice, once to extract surface hypotheses from local Delaunay tetrahedralizations and once to merge overlapping surface hypotheses even when the local tetrahedralizations do not share the same topology. This formulation allows us to obtain a constant memory consumption per sub-problem while at the same time retaining the density independent interpolation properties of the Delaunay-based optimization. On multiple public datasets, we demonstrate that our approach is highly competitive with the state-of-the-art in terms of accuracy, completeness and outlier resilience. Further, we demonstrate the multi-scale potential of our approach by processing a newly recorded dataset with 2 billion points and a point density variation of more than four orders of magnitude - requiring less than 9GB of RAM per process., Comment: This paper was accepted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. The copyright was transferred to IEEE (ieee.org). The official version of the paper will be made available on IEEE Xplore (R) (ieeexplore.ieee.org). This version of the paper also contains the supplementary material, which will not appear on IEEE Xplore (R)
- Published
- 2017
38. OctNetFusion: Learning Depth Fusion from Data
- Author
-
Riegler, Gernot, Ulusoy, Ali Osman, Bischof, Horst, and Geiger, Andreas
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we present a learning-based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images. The most common approach to depth fusion is based on averaging truncated signed distance functions, which was originally proposed by Curless and Levoy in 1996. While this method is simple and provides great results, it is not able to reconstruct (partially) occluded surfaces and requires a large number of frames to filter out sensor noise and outliers. Motivated by the availability of large 3D model repositories and recent advances in deep learning, we present a novel 3D CNN architecture that learns to predict an implicit surface representation from the input depth maps. Our learning-based method significantly outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. By learning the structure of real-world 3D objects and scenes, our approach is further able to reconstruct occluded regions and to fill in gaps in the reconstruction. We demonstrate that our learning-based approach outperforms both vanilla TSDF fusion as well as TV-L1 fusion on the task of volumetric fusion. Further, we demonstrate state-of-the-art 3D shape completion results. (A minimal sketch of the TSDF-averaging baseline follows this entry.), Comment: 3DV 2017, https://github.com/griegler/octnetfusion
- Published
- 2017
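The entry above contrasts learned fusion with the classic averaging of truncated signed distance functions. Below is a minimal numpy sketch of that vanilla TSDF-fusion baseline, not of OctNetFusion itself; the voxel-grid extent, truncation distance, camera-to-world pose convention and all names are illustrative assumptions.

```python
# Hedged sketch of vanilla TSDF fusion (running average of truncated SDFs).
import numpy as np

def fuse_tsdf(depth_maps, poses, K, res=64, extent=1.0, trunc=0.05):
    """depth_maps: list of (H, W) z-depth arrays; poses: camera-to-world 4x4; K: 3x3 intrinsics."""
    lin = np.linspace(-extent / 2, extent / 2, res)
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    pts = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    tsdf = np.zeros(res ** 3)
    weight = np.zeros(res ** 3)
    for depth, pose in zip(depth_maps, poses):
        cam = (np.linalg.inv(pose) @ pts.T).T[:, :3]          # world -> camera
        z = cam[:, 2]
        z_safe = np.where(z > 0, z, 1.0)                      # avoid division warnings
        proj = (K @ cam.T).T
        u = np.round(proj[:, 0] / z_safe).astype(int)
        v = np.round(proj[:, 1] / z_safe).astype(int)
        h, w = depth.shape
        valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        d = np.full(len(pts), np.nan)
        d[valid] = depth[v[valid], u[valid]]
        sdf = d - z                                           # signed distance along the viewing ray
        keep = valid & np.isfinite(d) & (sdf > -trunc)        # skip voxels far behind the surface
        obs = np.clip(sdf, -trunc, trunc) / trunc
        tsdf[keep] = (tsdf[keep] * weight[keep] + obs[keep]) / (weight[keep] + 1)
        weight[keep] += 1
    return tsdf.reshape(res, res, res), weight.reshape(res, res, res)
```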
39. PetroSurf3D - A Dataset for high-resolution 3D Surface Segmentation
- Author
-
Poier, Georg, Seidl, Markus, Zeppelzauer, Matthias, Reinbacher, Christian, Schaich, Martin, Bellandi, Giovanna, Marretta, Alberto, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The development of powerful 3D scanning hardware and reconstruction algorithms has strongly promoted the generation of 3D surface reconstructions in different domains. An area of special interest for such 3D reconstructions is the cultural heritage domain, where surface reconstructions are generated to digitally preserve historical artifacts. While reconstruction quality nowadays is sufficient in many cases, the robust analysis (e.g. segmentation, matching, and classification) of reconstructed 3D data is still an open topic. In this paper, we target the automatic and interactive segmentation of high-resolution 3D surface reconstructions from the archaeological domain. To foster research in this field, we introduce a fully annotated and publicly available large-scale 3D surface dataset, including high-resolution meshes, depth maps and point clouds, as a novel benchmark for the community. We provide baseline results for our existing random forest-based approach and for the first time investigate segmentation with convolutional neural networks (CNNs) on the data. Results show that both approaches have complementary strengths and weaknesses and that the provided dataset represents a challenge for future research., Comment: CBMI Submission; Dataset and more information can be found at http://lrs.icg.tugraz.at/research/petroglyphsegmentation/
- Published
- 2016
- Full Text
- View/download PDF
40. Grid Loss: Detecting Occluded Faces
- Author
-
Opitz, Michael, Waltner, Georg, Poier, Georg, Possegger, Horst, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Detection of partially occluded objects is a challenging computer vision problem. Standard Convolutional Neural Network (CNN) detectors fail if parts of the detection window are occluded, since not every sub-part of the window is discriminative on its own. To address this issue, we propose a novel loss layer for CNNs, named grid loss, which minimizes the error rate on sub-blocks of a convolution layer independently rather than over the whole feature map. This results in parts being more discriminative on their own, enabling the detector to recover if the detection window is partially occluded. By mapping our loss layer back to a regular fully connected layer, no additional computational cost is incurred at runtime compared to standard CNNs. We demonstrate our method for face detection on several public face detection benchmarks and show that our method outperforms regular CNNs, is suitable for real-time applications and achieves state-of-the-art performance. (An illustrative sketch of the grid-loss idea follows this entry.), Comment: accepted to ECCV 2016
- Published
- 2016
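As a rough illustration of the grid-loss idea above, the sketch below scores a detection window both as a whole and on an n x n grid of feature sub-blocks, penalizing each part independently so that an occluded part cannot dominate the objective. The logistic loss, the linear per-block scoring and all parameter names are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch: independent losses on feature-map sub-blocks plus a holistic term.
import numpy as np

def logistic_loss(score, label):
    # label in {0, 1}; stable log(1 + exp(-margin)) with margin = +/- score
    margin = score if label == 1 else -score
    return np.log1p(np.exp(-margin))

def grid_loss(feature_map, label, w_full, w_parts, n=2):
    """feature_map: (C, H, W) with H, W divisible by n; w_full: (C*H*W,) weights;
    w_parts: list of n*n weight vectors, one per sub-block."""
    c, h, w = feature_map.shape
    loss = logistic_loss(feature_map.ravel() @ w_full, label)   # holistic term
    bh, bw = h // n, w // n
    for i in range(n):
        for j in range(n):
            block = feature_map[:, i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            loss += logistic_loss(block.ravel() @ w_parts[i * n + j], label)
    return loss / (1 + n * n)
```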
41. A Deep Primal-Dual Network for Guided Depth Super-Resolution
- Author
-
Riegler, Gernot, Ferstl, David, Rüther, Matthias, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper we present a novel method to increase the spatial resolution of depth images. We combine a deep fully convolutional network with a non-local variational method in a deep primal-dual network. The joint network computes a noise-free, high-resolution estimate from a noisy, low-resolution input depth map. Additionally, a high-resolution intensity image is used to guide the reconstruction in the network. By unrolling the optimization steps of a first-order primal-dual algorithm and formulating it as a network, we can train our joint method end-to-end. This not only enables us to learn the weights of the fully convolutional network, but also to optimize all parameters of the variational method and its optimization procedure. The training of such a deep network requires a large dataset for supervision. Therefore, we generate high-quality depth maps and corresponding color images with a physically based renderer. In an exhaustive evaluation we show that our method outperforms the state-of-the-art on multiple benchmarks., Comment: BMVC 2016
- Published
- 2016
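The unrolling idea in the entry above can be illustrated on a simpler model. The sketch below runs a fixed number of first-order primal-dual (Chambolle-Pock) iterations for the plain TV-L2 (ROF) model, so the optimizer becomes a fixed-depth computation whose step sizes and regularization weight could in principle be treated as learnable parameters; the paper itself unrolls a non-local variational model and trains it jointly with a CNN, which is not reproduced here.

```python
# Hedged sketch: unrolled first-order primal-dual iterations for TV-L2 denoising.
import numpy as np

def grad(u):
    # forward differences with Neumann boundary conditions
    gx = np.zeros_like(u)
    gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]
    gy[:-1, :] = u[1:, :] - u[:-1, :]
    return gx, gy

def div(px, py):
    # negative adjoint of grad (discrete divergence)
    dx = np.zeros_like(px)
    dy = np.zeros_like(py)
    dx[:, 0] = px[:, 0]
    dx[:, 1:] = px[:, 1:] - px[:, :-1]
    dy[0, :] = py[0, :]
    dy[1:, :] = py[1:, :] - py[:-1, :]
    return dx + dy

def unrolled_tv_l2(f, lam=10.0, iters=50, tau=0.25, sigma=0.25):
    """A fixed number of Chambolle-Pock iterations; iters is the unrolling depth."""
    f = np.asarray(f, dtype=float)
    u, u_bar = f.copy(), f.copy()
    px, py = np.zeros_like(f), np.zeros_like(f)
    for _ in range(iters):
        gx, gy = grad(u_bar)
        px, py = px + sigma * gx, py + sigma * gy
        norm = np.maximum(1.0, np.sqrt(px ** 2 + py ** 2))   # project dual onto unit ball
        px, py = px / norm, py / norm
        u_old = u
        u = (u + tau * div(px, py) + tau * lam * f) / (1.0 + tau * lam)
        u_bar = 2 * u - u_old                                # over-relaxation step
    return u
```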
42. ATGV-Net: Accurate Depth Super-Resolution
- Author
-
Riegler, Gernot, Rüther, Matthias, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work we present a novel approach for single depth map super-resolution. Modern consumer depth sensors, especially Time-of-Flight sensors, produce dense depth measurements, but are affected by noise and have a low lateral resolution. We propose a method that combines the benefits of recent advances in machine learning based single image super-resolution, i.e. deep convolutional networks, with a variational method to recover accurate high-resolution depth maps. In particular, we integrate a variational method that models the piecewise affine structures apparent in depth data via an anisotropic total generalized variation regularization term on top of a deep network. We call our method ATGV-Net and train it end-to-end by unrolling the optimization procedure of the variational method. To train deep networks, a large corpus of training data with accurate ground-truth is required. We demonstrate that it is feasible to train our method solely on synthetic data that we generate in large quantities for this task. Our evaluations show that we achieve state-of-the-art results on three different benchmarks, as well as on a challenging Time-of-Flight dataset, all without utilizing an additional intensity image as guidance., Comment: ECCV 2016
- Published
- 2016
43. UAV-based Autonomous Image Acquisition with Multi-View Stereo Quality Assurance by Confidence Prediction
- Author
-
Mostegel, Christian, Rumpler, Markus, Fraundorfer, Friedrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Robotics - Abstract
In this paper we present an autonomous system for acquiring close-range high-resolution images that maximize the quality of a subsequent 3D reconstruction with respect to coverage, ground resolution and 3D uncertainty. In contrast to previous work, our system uses the already acquired images to predict the confidence in the output of a dense multi-view stereo approach without executing it. This confidence encodes the likelihood of a successful reconstruction with respect to the observed scene and potential camera constellations. Our prediction module runs in real-time and can be trained without any externally recorded ground truth. We use the confidence prediction for on-site quality assurance and for planning further views that are tailored for a specific multi-view stereo approach with respect to the given scene. We demonstrate the capabilities of our approach with an autonomous Unmanned Aerial Vehicle (UAV) in a challenging outdoor scenario., Comment: This paper was accepted to the 7th International Workshop on Computer Vision in Vehicle Technology (CVVT 2016) and will appear in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2016. The copyright was transferred to IEEE (ieee.org). The paper will be available on IEEE Xplore (ieeexplore.ieee.org). This version of the paper also contains the supplementary material
- Published
- 2016
44. Using Self-Contradiction to Learn Confidence Measures in Stereo Vision
- Author
-
Mostegel, Christian, Rumpler, Markus, Fraundorfer, Friedrich, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Learned confidence measures gain increasing importance for outlier removal and quality improvement in stereo vision. However, acquiring the necessary training data is typically a tedious and time-consuming task that involves manual interaction, active sensing devices and/or synthetic scenes. To overcome this problem, we propose a new, flexible, and scalable way of generating training data that only requires a set of stereo images as input. The key idea of our approach is to use different viewpoints for reasoning about contradictions and consistencies between multiple depth maps generated with the same stereo algorithm. This enables us to generate a huge amount of training data in a fully automated manner. Among other experiments, we demonstrate the potential of our approach by boosting the performance of three learned confidence measures on the KITTI2012 dataset by simply training them on a vast amount of automatically generated training data rather than a limited amount of laser ground truth data. (An illustrative sketch of the consistency-based labeling follows this entry.), Comment: This paper was accepted to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. The copyright was transferred to IEEE (https://www.ieee.org). The official version of the paper will be made available on IEEE Xplore (R) (http://ieeexplore.ieee.org). This version of the paper also contains the supplementary material, which will not appear on IEEE Xplore (R)
- Published
- 2016
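A rough sketch of the label-generation idea above: a depth estimate becomes a positive training sample when depth maps from other viewpoints agree with it after reprojection, and a negative one when other views consistently see free space where the point should be. The vote thresholds, relative tolerance and pinhole projection below are illustrative assumptions, not the paper's exact rules.

```python
# Hedged sketch: depth-map consistency/contradiction voting as a supervision signal.
import numpy as np

def label_from_consistency(depth_ref, pose_ref, depth_others, poses_others, K, rel_tol=0.01):
    """Poses are camera-to-world 4x4 matrices; depths are (H, W) z-depth arrays.
    Returns per-pixel labels: 1 (confirmed), 0 (contradicted), -1 (unknown)."""
    h, w = depth_ref.shape
    v, u = np.mgrid[0:h, 0:w]
    rays = (np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)]).reshape(3, -1)).T
    pts_ref = rays * depth_ref.reshape(-1, 1)                         # reference camera coords
    pts_w = (pose_ref @ np.c_[pts_ref, np.ones(len(pts_ref))].T).T    # world coords

    votes_ok = np.zeros(h * w, dtype=int)
    votes_bad = np.zeros(h * w, dtype=int)
    for depth_o, pose_o in zip(depth_others, poses_others):
        cam = (np.linalg.inv(pose_o) @ pts_w.T).T[:, :3]
        z = cam[:, 2]
        z_safe = np.where(z > 0, z, 1.0)
        uv = (K @ cam.T).T
        uo = np.round(uv[:, 0] / z_safe).astype(int)
        vo = np.round(uv[:, 1] / z_safe).astype(int)
        valid = (z > 0) & (uo >= 0) & (uo < w) & (vo >= 0) & (vo < h)
        d_o = np.full(h * w, np.nan)
        d_o[valid] = depth_o[vo[valid], uo[valid]]
        votes_ok += valid & (np.abs(d_o - z) < rel_tol * z)           # other view agrees
        votes_bad += valid & (d_o > (1 + rel_tol) * z)                # other view sees through the point

    labels = np.full(h * w, -1)
    labels[votes_ok >= 2] = 1
    labels[(votes_bad >= 2) & (votes_ok == 0)] = 0
    return labels.reshape(h, w)
```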
45. Hybrid One-Shot 3D Hand Pose Estimation by Exploiting Uncertainties
- Author
-
Poier, Georg, Roditakis, Konstantinos, Schulter, Samuel, Michel, Damien, Bischof, Horst, and Argyros, Antonis A.
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,I.2.10 ,I.4.8 ,I.5.4 - Abstract
Model-based approaches to 3D hand tracking have been shown to perform well in a wide range of scenarios. However, they require initialisation and cannot recover easily from tracking failures that occur due to fast hand motions. Data-driven approaches, on the other hand, can quickly deliver a solution, but the results often suffer from lower accuracy or a lack of anatomical validity compared to those obtained from model-based approaches. In this work we propose a hybrid approach for hand pose estimation from a single depth image. First, a learned regressor is employed to deliver multiple initial hypotheses for the 3D position of each hand joint. Subsequently, the kinematic parameters of a 3D hand model are found by deliberately exploiting the inherent uncertainty of the inferred joint proposals. This way, the method provides anatomically valid and accurate solutions without requiring manual initialisation or suffering from track losses. Quantitative results on several standard datasets demonstrate that the proposed method outperforms state-of-the-art representatives of the model-based, data-driven and hybrid paradigms., Comment: BMVC 2015 (oral); see also http://lrs.icg.tugraz.at/research/hybridhape/
- Published
- 2015
- Full Text
- View/download PDF
46. Performance Evaluation of Vision-Based Algorithms for MAVs
- Author
-
Holzmann, T., Prettenthaler, R., Pestana, J., Muschick, D., Graber, G., Mostegel, C., Fraundorfer, F., and Bischof, H.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
An important focus of current research in the field of Micro Aerial Vehicles (MAVs) is to increase the safety of their operation in general unstructured environments. Especially indoors, where GPS cannot be used for localization, reliable algorithms for localization and mapping of the environment are necessary in order to keep an MAV airborne safely. In this paper, we compare vision-based real-time capable methods for localization and mapping and point out their strengths and weaknesses. Additionally, we describe algorithms for state estimation, control and navigation, which use the localization and mapping results of our vision-based algorithms as input., Comment: Presented at OAGM Workshop, 2015 (arXiv:1505.01065)
- Published
- 2015
47. Building with Drones: Accurate 3D Facade Reconstruction using MAVs
- Author
-
Daftry, Shreyansh, Hoppe, Christof, and Bischof, Horst
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Automatic reconstruction of 3D models from images using multi-view Structure-from-Motion methods has been one of the most fruitful outcomes of computer vision. These advances, combined with the growing popularity of Micro Aerial Vehicles as an autonomous imaging platform, have made 3D vision tools ubiquitous for a large number of Architecture, Engineering and Construction applications among audiences mostly unskilled in computer vision. However, to obtain high-resolution and accurate reconstructions of a large-scale object using SfM, many critical constraints are placed on the quality of the image data, which often become sources of inaccuracy because current 3D reconstruction pipelines do not enable users to determine the fidelity of the input data during image acquisition. In this paper, we present and advocate a closed-loop interactive approach that performs incremental reconstruction in real-time and gives users online feedback about quality parameters such as Ground Sampling Distance (GSD) and image redundancy on a surface mesh. We also propose a novel multi-scale camera network design to prevent scene drift caused by incremental map building, and release the first multi-scale image sequence dataset as a benchmark. Further, we evaluate our system on real outdoor scenes, and show that our interactive pipeline combined with a multi-scale camera network approach provides compelling accuracy in multi-view reconstruction tasks when compared against the state-of-the-art methods., Comment: 8 Pages, 2015 IEEE International Conference on Robotics and Automation (ICRA '15), Seattle, WA, USA
- Published
- 2015
- Full Text
- View/download PDF
48. Indoor Activity Detection and Recognition for Sport Games Analysis
- Author
-
Waltner, Georg, Mauthner, Thomas, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Activity recognition in sport is an attractive field for computer vision research. Game, player and team analysis are of great interest, and research topics within this field emerge with the goal of automated analysis. The very specific underlying rules of sports can be used as prior knowledge for the recognition task and present a constrained environment for evaluation. This paper describes recognition of single-player activities in sport with special emphasis on volleyball. Starting from a per-frame player-centered activity recognition, we incorporate geometry and contextual information via an activity context descriptor that collects information about all players' activities over a certain timespan relative to the investigated player. The benefit of this context information on single-player activity recognition is evaluated on our new real-life dataset comprising almost 36k annotated frames with 7 activity classes from 6 videos of professional volleyball games. Incorporating the contextual information improves the average player-centered classification performance of 77.56% by up to 18.35% on specific classes, showing that spatio-temporal context is an important cue for activity recognition. (An illustrative sketch of the context descriptor follows this entry.), Comment: Part of the OAGM 2014 proceedings (arXiv:1404.3538)
- Published
- 2014
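As an illustration of the activity context descriptor mentioned above, the sketch below accumulates a normalized histogram of the other players' activity labels within a temporal window around the current frame. The frame data layout, window length, class count and normalization are assumptions for illustration, not the paper's exact descriptor.

```python
# Hedged sketch: histogram of other players' activities around frame t.
import numpy as np

def activity_context(frame_activities, player_id, t, window=15, n_classes=7):
    """frame_activities: list over frames of {player_id: activity_class} dicts."""
    hist = np.zeros(n_classes)
    lo, hi = max(0, t - window), min(len(frame_activities), t + window + 1)
    for f in range(lo, hi):
        for pid, act in frame_activities[f].items():
            if pid != player_id:                  # context comes from the other players
                hist[act] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist    # normalized context descriptor
```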
49. Geometric Abstraction from Noisy Image-Based 3D Reconstructions
- Author
-
Holzmann, Thomas, Hoppe, Christof, Kluckner, Stefan, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Creating geometrically abstracted models from image-based scene reconstructions is difficult due to noise and irregularities in the reconstructed model. In this paper, we present a geometric modeling method for noisy reconstructions dominated by planar horizontal and orthogonal vertical structures. We partition the scene into horizontal slices and create an inside/outside labeling, represented by a floor plan for each slice, by solving an energy minimization problem. Subsequently, we create an irregular discretization of the volume according to the individual floor plans and again label each cell as inside/outside by minimizing an energy function. By adjusting the smoothness parameter, we introduce different levels of detail. In our experiments, we show results with varying regularization levels using synthetically generated and real-world data. (An illustrative sketch of the per-slice energy follows this entry.), Comment: Part of the OAGM 2014 proceedings (arXiv:1404.3538)
- Published
- 2014
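The per-slice labeling in the entry above is posed as an energy minimization. The sketch below only evaluates such an energy, a unary inside/outside data term plus a smoothness term that charges for label changes between neighboring cells, with lambda_smooth playing the role of the regularization level mentioned in the abstract; the cost definitions are assumptions, and the paper minimizes this kind of energy with graph cuts rather than evaluating it directly.

```python
# Hedged sketch: the kind of per-slice labeling energy being minimized.
import numpy as np

def slice_energy(labels, unary_inside, unary_outside, lambda_smooth=1.0):
    """labels: (H, W) array in {0, 1}; unary_*: (H, W) per-cell costs for each label."""
    data = np.where(labels == 1, unary_inside, unary_outside).sum()
    # 4-connected smoothness: count horizontal and vertical label transitions.
    changes = (labels[:, 1:] != labels[:, :-1]).sum() + (labels[1:, :] != labels[:-1, :]).sum()
    return data + lambda_smooth * changes
```

Increasing lambda_smooth suppresses isolated label flips and yields coarser floor plans, which corresponds to the different levels of detail mentioned in the abstract.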
50. Revisiting loss-specific training of filter-based MRFs for image restoration
- Author
-
Chen, Yunjin, Pock, Thomas, Ranftl, René, and Bischof, Horst
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
It is now well known that Markov random fields (MRFs) are particularly effective for modeling image priors in low-level vision. Recent years have seen the emergence of two main approaches for learning the parameters in MRFs: (1) probabilistic learning using sampling-based algorithms and (2) loss-specific training based on the MAP estimate. Investigating existing training approaches, we find that the performance of loss-specific training has been significantly underestimated in existing work. In this paper, we revisit this approach and use techniques from bi-level optimization to solve it. We show that we can get a substantial gain in the final performance by solving the lower-level problem in the bi-level framework with high accuracy using our newly proposed algorithm. As a result, our trained model is on par with highly specialized image denoising algorithms and clearly outperforms probabilistically trained MRF models. Our findings suggest that for the loss-specific training scheme, solving the lower-level problem with higher accuracy is beneficial. Our trained model comes with the additional advantage that inference is extremely efficient; our GPU-based implementation takes less than 1s to produce state-of-the-art performance., Comment: 10 pages, 2 figures, appears at the 35th German Conference, GCPR 2013, Saarbrücken, Germany, September 3-6, 2013. Proceedings
- Published
- 2014
- Full Text
- View/download PDF