Gradient-based Adversarial Attacks against Text Transformers

October 20, 2021


We propose the first general-purpose gradient-based attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.

Download the Paper


Written by

Alexandre Sablayrolles

Chuan Guo

Douwe Kiela

Hervé Jegou



Research Topics

Natural Language Processing (NLP)

Core Machine Learning

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.