1. Towards designing inherently interpretable deep neural networks for image classification
- Author
- Böhle, Moritz
- Abstract
Over the last decade, Deep Neural Networks (DNNs) have proven successful in a wide range of applications and hold the promise of a positive impact on our lives, especially in high-stakes applications. For example, given their outstanding performance (by now they regularly outperform humans), DNNs could make state-of-the-art medical diagnostics more easily accessible to many and lessen the strain on often overworked medical professionals. Yet it is exactly in such high-stakes situations that a wrong decision can be disastrous, potentially putting human lives at risk. In these settings it is therefore imperative that we can understand and obtain an explanation for a model's 'decision'. This thesis studies this problem for image classification models from three directions.

First, we evaluate methods that explain DNNs in a post-hoc fashion and highlight the promises and shortcomings of existing approaches. In particular, we study a popular importance attribution technique for explaining a model trained to identify brain scans of patients suffering from Alzheimer's disease (AD) and find that its attributions correlate with known biomarkers of AD. However, we do not know for certain which patterns in the input signals a given model is actually using to classify its inputs. To address this, we additionally design a novel evaluation scheme for explanation methods: by controlling which input regions a model was certainly not using, we can detect instances in which explanation methods are provably not model-faithful, i.e., in which they do not adequately represent the underlying classification model.

Second, we study how to design inherently interpretable DNNs. In contrast to explaining models post hoc, this approach not only takes the training procedure and the DNN architecture into account, but modifies them so that the decision process becomes inherently more transparent. In particular, we propose two novel DNN architectures, the CoDA and the B-cos Networks. These architectures are designed such that they can easily and faithfully be summarised by a single linear transformation, and they are optimised during training such that these transformations align with the task-relevant input features. As a result, we find that the linear summaries exhibit a great amount of detail and accurately localise task-relevant features; as such, they lend themselves well to being used as explanations for humans.

Third, we investigate how to leverage explanations to guide models during training, e.g., to suppress reliance on spuriously correlated features or to increase the fidelity of knowledge distillation approaches. In particular, we show that regularising a model's explanations to align with human annotations, or with the explanations of another model, can be a powerful and efficient tool to, for example, improve robustness under distribution shift or better leverage limited training data during knowledge distillation.

Finally, in the last part of this thesis, we analyse a popular self-supervised representation learning paradigm: contrastive learning. In particular, we study how a single parameter influences the learning dynamics on imbalanced data and show that it can significantly impact the learnt representations. While not directly linked to model explanations, this work highlights the importance of taking even seemingly minor aspects of the optimisation procedure into account when trying to understand and explain DNNs.
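To make the B-cos idea from the second part of the abstract more concrete: each linear unit is replaced by a transform whose output is scaled by the alignment (cosine similarity) between its input and its weight vector, so that every unit, and by composition the whole network, can be written as an input-dependent linear map that serves as the explanation. The following is a minimal PyTorch sketch of a single such unit, assuming the published formulation f(x; w) = (ŵᵀx) · |cos(x, w)|^(B-1); the class name BcosLinear, the default B = 2 and the dynamic_weights helper are illustrative choices rather than code from the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BcosLinear(nn.Module):
    """Sketch of a single B-cos unit: f(x; w) = (w_hat^T x) * |cos(x, w)|**(B - 1).

    The scaling factor depends only on the angle between x and w, so the unit
    acts as a dynamic linear transform w(x)^T x; a network of such units can be
    collapsed into one input-dependent linear map W(x) x, which is what the
    abstract means by summarising the model with a single linear transformation.
    """

    def __init__(self, in_features: int, out_features: int, b: float = 2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.b = b

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_hat = F.normalize(self.weight, dim=1)                     # unit-norm weight rows
        lin = F.linear(x, w_hat)                                    # w_hat^T x
        cos = lin.abs() / x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        return lin * cos.pow(self.b - 1.0)                          # B = 1: plain linear unit

    def dynamic_weights(self, x: torch.Tensor) -> torch.Tensor:
        """Per-input matrix W(x) with forward(x) == (W(x) @ x.unsqueeze(-1)).squeeze(-1)."""
        w_hat = F.normalize(self.weight, dim=1)
        cos = F.linear(x, w_hat).abs() / x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        return cos.pow(self.b - 1.0).unsqueeze(-1) * w_hat          # shape (N, out, in)
```

For B = 1 the unit reduces to an ordinary linear layer with unit-norm weights; larger values of B increasingly suppress inputs that are poorly aligned with the weight vector, which is what encourages the weight-input alignment described in the abstract.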
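The final part of the abstract does not name the single parameter it studies; presumably it is the softmax temperature of the standard InfoNCE (NT-Xent) objective used in contrastive methods such as SimCLR. The sketch below shows that loss with the temperature made explicit; the function name info_nce and the default temperature of 0.1 are illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent / InfoNCE loss for two batches of augmented views, each of shape (N, dim).

    The temperature divides all pairwise cosine similarities before the softmax;
    this is the kind of single scalar whose effect on the learnt representations
    the abstract refers to.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)             # (2N, dim)
    sim = z @ z.t() / temperature              # temperature-scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))          # a sample is never its own positive or negative
    n = z1.shape[0]
    # The positive for view i is the other augmentation of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Lower temperatures sharpen the similarity distribution and concentrate the gradient on the hardest negatives, while higher temperatures spread it more uniformly; on imbalanced (long-tailed) data, this single scalar can therefore noticeably change which features the encoder learns, as the abstract describes.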
- Published
- 2024