1. MVTamperBench: Evaluating Robustness of Vision-Language Models
- Authors
Amit Agarwal, Srikant Panda, Angeline Charles, Bhargava Kumar, Hitesh Patel, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, and Dong-Kyu Chae
- Subjects
Computer Science - Computer Vision and Pattern Recognition; 68T37, 68T05, 68Q32, 68T45, 94A08, 68T40, 68Q85; I.2.10; I.2.7; I.5.4; I.4.9; I.4.8; H.5.1
- Abstract
Multimodal Large Language Models (MLLMs) have driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. Built from 3.4K original videos and expanded to over 17K tampered clips spanning 19 video tasks, MVTamperBench challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding. Code: https://amitbcp.github.io/MVTamperBench/ Data: https://huggingface.co/datasets/Srikant86/MVTamperBench (An illustrative sketch of the five tampering operations follows this entry.)
- Published
2024
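
The abstract describes expanding 3.4K source videos into 17K+ tampered clips via five tampering types. The sketch below is a minimal illustration of how such operations could be applied to a clip represented as a list of frames; the function name `tamper_segment`, the fixed central-segment choice, and all parameters are assumptions for illustration, not the benchmark's actual pipeline.

```python
# Illustrative sketch only: applying one of the five tampering types named in
# the abstract (rotation, masking, substitution, repetition, dropping) to the
# central portion of a clip. Not the MVTamperBench implementation.
import numpy as np

def tamper_segment(frames, kind, fraction=0.25, donor=None):
    """Tamper the central `fraction` of `frames` (a list of HxWx3 arrays)."""
    n = len(frames)
    seg_len = max(1, int(n * fraction))
    start = (n - seg_len) // 2
    end = start + seg_len
    out = [f.copy() for f in frames]

    if kind == "rotation":          # rotate the segment's frames by 180 degrees
        for i in range(start, end):
            out[i] = np.rot90(out[i], k=2)
    elif kind == "masking":         # black out the segment's frames
        for i in range(start, end):
            out[i] = np.zeros_like(out[i])
    elif kind == "substitution":    # swap in frames from another clip
        assert donor is not None and len(donor) >= seg_len
        out[start:end] = [donor[j].copy() for j in range(seg_len)]
    elif kind == "repetition":      # freeze on the frame preceding the segment
        out[start:end] = [out[max(start - 1, 0)].copy() for _ in range(seg_len)]
    elif kind == "dropping":        # remove the segment entirely
        out = out[:start] + out[end:]
    else:
        raise ValueError(f"unknown tampering type: {kind}")
    return out

# Usage example with a dummy 16-frame clip of 8x8 RGB frames.
clip = [np.full((8, 8, 3), i, dtype=np.uint8) for i in range(16)]
donor_clip = [np.full((8, 8, 3), 200, dtype=np.uint8) for _ in range(16)]
for kind in ["rotation", "masking", "substitution", "repetition", "dropping"]:
    tampered = tamper_segment(clip, kind, donor=donor_clip)
    print(kind, "->", len(tampered), "frames")
```

Note that rotation, masking, and substitution disturb spatial coherence while repetition and dropping disturb temporal coherence, which matches the abstract's framing of manipulations in spatial and temporal coherence.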