
MVTamperBench: Evaluating Robustness of Vision-Language Models

Authors:
Agarwal, Amit
Panda, Srikant
Charles, Angeline
Kumar, Bhargava
Patel, Hitesh
Pattnayak, Priyaranjan
Rafi, Taki Hasan
Kumar, Tejaswini
Chae, Dong-Kyu
Publication Year:
2024

Abstract

Multimodal Large Language Models (MLLMs) have driven major advances in video understanding, yet their vulnerability to adversarial tampering and manipulation remains underexplored. To address this gap, we introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques: rotation, masking, substitution, repetition, and dropping. Built from 3.4K original videos and expanded to over 17K tampered clips spanning 19 video tasks, MVTamperBench challenges models to detect manipulations in spatial and temporal coherence. We evaluate 45 recent MLLMs from 15+ model families, revealing substantial variability in resilience across tampering types and showing that larger parameter counts do not necessarily guarantee robustness. MVTamperBench sets a new benchmark for developing tamper-resilient MLLMs in safety-critical applications, including detecting clickbait, preventing harmful content distribution, and enforcing policies on media platforms. We release all code and data to foster open research in trustworthy video understanding. Code: https://amitbcp.github.io/MVTamperBench/ Data: https://huggingface.co/datasets/Srikant86/MVTamperBench
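The abstract does not specify how the five tampering operations are implemented; the sketch below is a minimal illustration of how such transformations might be applied to a clip represented as a list of frame arrays. All function names, parameters, and segment boundaries are assumptions for illustration, not the authors' actual pipeline.

```python
# Illustrative sketch (not the MVTamperBench code): five tampering-style
# operations on a video clip stored as a list of NumPy frame arrays.
# Segment indices and parameters below are arbitrary assumptions.
import numpy as np

def rotate(frames, k=1):
    # Spatial tampering: rotate every frame by k * 90 degrees.
    return [np.rot90(f, k) for f in frames]

def mask(frames, start, end):
    # Temporal masking: black out a contiguous segment of frames.
    out = [f.copy() for f in frames]
    for i in range(start, min(end, len(out))):
        out[i][:] = 0
    return out

def substitute(frames, donor_frames, start):
    # Substitution: overwrite a segment with frames from another clip.
    out = list(frames)
    for i, df in enumerate(donor_frames):
        if start + i < len(out):
            out[start + i] = df
    return out

def repeat(frames, start, end):
    # Repetition: duplicate a segment immediately after itself.
    return frames[:end] + frames[start:end] + frames[end:]

def drop(frames, start, end):
    # Dropping: remove a segment entirely.
    return frames[:start] + frames[end:]

# Example: apply each operation to a dummy 16-frame clip of 64x64 RGB frames.
clip = [np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8) for _ in range(16)]
tampered = {
    "rotation": rotate(clip),
    "masking": mask(clip, 4, 8),
    "substitution": substitute(clip, clip[::-1][:4], start=4),
    "repetition": repeat(clip, 4, 8),
    "dropping": drop(clip, 4, 8),
}
for name, t in tampered.items():
    print(name, len(t), "frames")
```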

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2412.19794
Document Type:
Working Paper