On the limitations of adversarial training for robust image classification with convolutional neural networks

Carletti M.; Sinigaglia A.; Terzi M.; Susto G. A.
2024

Abstract

Adversarial Training has proved to be an effective training paradigm for enforcing robustness against adversarial examples in modern neural network architectures. Despite many efforts, explanations of the foundational principles underpinning the effectiveness of Adversarial Training remain limited and far from widely accepted by the Deep Learning community. Moreover, very few research works have investigated the limitations of robust Convolutional Neural Networks beyond the well-known accuracy drop on natural images. In this paper, we describe surprising properties of these models, shedding light on the mechanisms through which robustness against adversarial attacks is implemented. We also highlight limitations and failure modes that were not discussed in prior works. Through extensive analyses on a wide range of architectures and datasets, we empirically demonstrate that adversarially trained Convolutional Neural Networks do not exploit model capacity efficiently and that the simplicity biases induced by Adversarial Training may lead to undesired behaviors.
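
For context, the Adversarial Training referred to in the abstract is commonly formulated as a min-max problem: an inner maximization crafts worst-case perturbations of each input (typically with Projected Gradient Descent), and an outer minimization updates the network on those perturbed inputs. The sketch below illustrates this standard PGD-based loop (in the style of Madry et al., 2018) using PyTorch. It is a generic, minimal illustration under common default hyperparameters, not the authors' exact experimental setup; the names `model`, `loader`, and `optimizer` are hypothetical placeholders.

    # Minimal sketch of standard PGD-based Adversarial Training (not the paper's exact setup).
    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
        """Craft L-infinity bounded adversarial examples via projected gradient ascent."""
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()            # ascent step on the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                               # keep a valid image
        return x_adv.detach()

    def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
        """One epoch of training on adversarially perturbed inputs (outer minimization)."""
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x_adv = pgd_attack(model, x, y)           # inner maximization
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)   # outer minimization on adversarial inputs
            loss.backward()
            optimizer.step()
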
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3531191
Citations
  • PubMed Central (PMC): ND
  • Scopus: 1
  • Web of Science (ISI): 1
  • OpenAlex: ND