1. Assessing generalisability of deep learning-based polyp detection and segmentation methods through a computer vision challenge
- Author
- Ali, S, Ghatwary, N, Jha, D, Isik-Polat, E, Polat, G, Yang, C, Li, W, Galdran, A, Ballester, M-ÁG, Thambawita, V, Hicks, S, Poudel, S, Lee, S-W, Jin, Z, Gan, T, Yu, C, Yan, J, Yeo, D, Lee, H, Tomar, NK, Haithmi, M, Ahmed, A, Riegler, MA, Daul, C, Halvorsen, P, Rittscher, J, Salem, OE, Lamarque, D, Cannizzaro, R, Realdon, S, Lange, TD, and East, JE
- Subjects
FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
- Abstract
Polyps are well-known cancer precursors identified by colonoscopy. However, variability in their size, location, and surface largely affects identification, localisation, and characterisation. Moreover, colonoscopic surveillance and removal of polyps (referred to as polypectomy) are highly operator-dependent procedures. Missed detections and incomplete removal of colonic polyps are common due to their variable nature, the difficulty of delineating the abnormality, the high recurrence rates, and the anatomical topography of the colon. There have been several developments in realising automated methods for both detection and segmentation of these polyps using machine learning. However, the major drawback of most of these methods is their limited ability to generalise to out-of-sample unseen datasets that come from different centres, modalities, and acquisition systems. To test this hypothesis rigorously, we curated a multi-centre and multi-population dataset acquired from multiple colonoscopy systems and challenged teams comprising machine learning experts to develop robust automated detection and segmentation methods as part of our crowd-sourcing Endoscopic computer vision challenge (EndoCV) 2021. In this paper, we analyse the detection results of the top four teams (among seven) and the segmentation results of the top five teams (among 16). Our analyses demonstrate that the top-ranking teams concentrated on accuracy (i.e., an overall Dice score above 80% on different validation sets) over the real-time performance required for clinical applicability. We further dissect the methods and provide an experiment-based hypothesis that reveals the need for improved generalisability to tackle the diversity present in multi-centre datasets. (26 pages)
- Published
- 2022