Gradient-based Adversarial Attacks against Text Transformers

October 20, 2021

Abstract

We propose the first general-purpose gradient-based attack against transformer models. Instead of searching for a single adversarial example, we search for a distribution of adversarial examples parameterized by a continuous-valued matrix, hence enabling gradient-based optimization. We empirically demonstrate that our white-box attack attains state-of-the-art attack performance on a variety of natural language tasks. Furthermore, we show that a powerful black-box transfer attack, enabled by sampling from the adversarial distribution, matches or exceeds existing methods, while only requiring hard-label outputs.
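
To make the core idea concrete, here is a minimal, illustrative PyTorch sketch of optimizing a distribution over token sequences rather than a single adversarial example. It uses a Gumbel-softmax relaxation to make sampling from the per-position token distribution differentiable, and assumes a HuggingFace-style classifier that accepts `inputs_embeds`; the names `model`, `embedding_matrix`, `input_ids`, and `target_label` are hypothetical placeholders, not the authors' released code.

```python
# Sketch: parameterize a distribution of adversarial token sequences with a
# continuous matrix theta and optimize it by gradient descent.
# Assumption: `model` is a HuggingFace-style classifier accepting inputs_embeds.
import torch
import torch.nn.functional as F

def distributional_attack(model, embedding_matrix, input_ids, target_label,
                          steps=100, lr=0.3, tau=1.0):
    vocab_size = embedding_matrix.size(0)
    seq_len = input_ids.size(0)
    # theta parameterizes one categorical distribution per token position;
    # initialize it close to the one-hot encoding of the original tokens.
    theta = torch.zeros(seq_len, vocab_size)
    theta.scatter_(1, input_ids.unsqueeze(1), 10.0)
    theta.requires_grad_(True)
    opt = torch.optim.Adam([theta], lr=lr)

    for _ in range(steps):
        # Differentiable sample: a relaxed one-hot vector per position.
        pi = F.gumbel_softmax(theta, tau=tau, hard=False)  # (seq_len, vocab)
        # Feed the relaxed tokens' expected embeddings to the model.
        inputs_embeds = pi @ embedding_matrix  # (seq_len, embed_dim)
        logits = model(inputs_embeds=inputs_embeds.unsqueeze(0)).logits
        # Targeted variant: drive the prediction toward target_label.
        loss = F.cross_entropy(logits, target_label.unsqueeze(0))
        opt.zero_grad()
        loss.backward()
        opt.step()

    return theta  # discrete adversarial examples: sample from softmax(theta)
```

Sampling discrete sequences from the optimized distribution `softmax(theta)` then yields many candidate adversarial examples, which is what enables the black-box transfer setting described above: candidates can be tried against a target model using only its hard-label outputs.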

Authors

Alexandre Sablayrolles

Chuan Guo

Douwe Kiela

Hervé Jégou

Publisher

EMNLP

Research Topics

Natural Language Processing (NLP)

Core Machine Learning
