CONVERSATIONAL AI

NLP

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming

April 05, 2024

Abstract

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses. While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them. In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM. Specifically, an adversarial LLM and a target LLM interplay with each other in an iterative manner, where the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves itself through safety fine-tuning. On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART, achieving comparable performance to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.

Download the Paper

AUTHORS

Written by

Suyu Ge

Chunting Zhou

Rui Hou

Madian Khabsa

Yi-Chia Wang

Qifan Wang

Jiawei Han

Yuning Mao

Publisher

ACL Rolling Review

Related Publications

May 24, 2024

SPEECH & AUDIO

NLP

DOC-RAG: ASR Language Model Personalization with Domain-Distributed Co-occurrence Retrieval Augmentation

Zhe Liu

May 24, 2024

April 23, 2024

CONVERSATIONAL AI

GRAPHICS

Generating Illustrated Instructions

Sachit Menon, Ishan Misra, Rohit Girdhar

April 23, 2024

April 22, 2024

NLP

Text Quality-Based Pruning for Efficient Training of Language Models

Vasu Sharma *, Karthik Padthe *, Newsha Ardalani, Kushal Tirumala, Russ Howes, Hu Xu, Bernie Huang, Daniel Li (FAIR), Armen Aghajanyan, Gargi Ghosh, Luke Zettlemoyer

April 22, 2024

April 14, 2024

SPEECH & AUDIO

NLP

Multi-task Learning for Front-end Text Processing in TTS

Yun Wang (Speech), Arthur Hinsvark, Qing He, Shun Zhang, Wonjune Kang

April 14, 2024

Help Us Pioneer The Future of AI

We share our open source frameworks, tools, libraries, and models for everything from research exploration to large-scale production deployment.