JUNE 14, 2022

DISC21: Dataset for the Image Similarity Challenge 2021

The DISC21 dataset is designed to help researchers evaluate the accuracy of their image copy detection models. It was released as part of the Image Similarity Challenge.


We designed the Image Similarity Challenge dataset to serve as a benchmark for work in image copy detection. It provides a reference collection of 1 million images, a development set of 50,000 query images, and a test set of 50,000 additional query images. The query images are versions of reference images that have been transformed by both human and automated edits, including various types of image editing, collages, and re-encoding. We also provide a collection of 1 million untransformed images that can be used for training.

The underlying images for DISC21 were sourced from the YFCC100M dataset as well as AI at Meta’s own Casual Conversations dataset. The YFCC100M source images were filtered to keep only those with the broadest licenses and to exclude any with recognizable people. Some of the images from Casual Conversations had the CIAGAN deepfake effect applied before transformation.

Please follow the download link at the top of the page if you would like to use DISC21. The data totals about 350 GB, split into 42 zip files containing 50,000 images each.
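
As a quick sanity check before downloading, the image counts and archive layout stated above can be compared with a few lines of arithmetic. This is only a sketch: it assumes that the 42 zip files cover all four image collections and that the size is quoted in decimal gigabytes, neither of which is confirmed on this page, and the variable names are illustrative.

```python
# Image counts as stated in the dataset description.
reference_images = 1_000_000
dev_queries = 50_000
test_queries = 50_000
training_images = 1_000_000

total_images = reference_images + dev_queries + test_queries + training_images

# Archive layout as stated: 42 zip files of 50,000 images each.
zip_files = 42
images_per_zip = 50_000
archived_images = zip_files * images_per_zip

# Approximate total size as stated (assumption: decimal gigabytes).
total_bytes = 350 * 10**9
avg_kb_per_image = total_bytes / total_images / 1000
avg_gb_per_zip = 350 / zip_files

print(total_images, archived_images)   # both come to 2100000
print(round(avg_kb_per_image))         # roughly 167 KB per image
print(round(avg_gb_per_zip, 1))        # roughly 8.3 GB per zip file
```

Under these assumptions the four collections add up to exactly the 2.1 million images implied by the archive layout, so plan for roughly 8 GB per zip file and about 350 GB of free disk space before extraction.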