On the limitations of adversarial training for robust image classification with convolutional neural networks

Carletti M.; Sinigaglia A.; Terzi M.; Susto G. A.
2024

Abstract

Adversarial Training has proved to be an effective training paradigm for enforcing robustness against adversarial examples in modern neural network architectures. Despite many efforts, explanations of the foundational principles underpinning the effectiveness of Adversarial Training remain limited and far from widely accepted by the Deep Learning community. Moreover, very few research works have investigated the limitations of robust Convolutional Neural Networks beyond the well-known accuracy drop on natural images. In this paper, we describe surprising properties of these models, shedding light on the mechanisms through which robustness against adversarial attacks is implemented. We also highlight limitations and failure modes that were not discussed in prior works. Through extensive analyses on a wide range of architectures and datasets, we empirically demonstrate that adversarially trained Convolutional Neural Networks do not exploit model capacity efficiently and that the simplicity biases induced by Adversarial Training may lead to undesired behaviors.
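
For context, the Adversarial Training referred to in the abstract is commonly formulated as a min-max problem: an inner maximization crafts worst-case perturbations of each input (typically with Projected Gradient Descent), and an outer minimization updates the network on those perturbed inputs. The sketch below illustrates this standard PGD-based loop (in the style of Madry et al., 2018) using PyTorch. It is a generic, minimal illustration under common default hyperparameters, not the authors' exact experimental setup; the names `model`, `loader`, and `optimizer` are hypothetical placeholders.

    # Minimal sketch of standard PGD-based Adversarial Training (not the paper's exact setup).
    import torch
    import torch.nn.functional as F

    def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
        """Craft L-infinity bounded adversarial examples via projected gradient ascent."""
        x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()  # random start
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x_adv.detach() + alpha * grad.sign()            # ascent step on the loss
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
            x_adv = x_adv.clamp(0, 1)                               # keep a valid image
        return x_adv.detach()

    def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
        """One epoch of training on adversarially perturbed inputs (outer minimization)."""
        model.train()
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x_adv = pgd_attack(model, x, y)           # inner maximization
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x_adv), y)   # outer minimization on adversarial inputs
            loss.backward()
            optimizer.step()
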
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11577/3531191
Citations
  • PubMed Central (PMC): ND
  • Scopus: 1
  • Web of Science (ISI): 1
  • OpenAlex: ND