Reverse engineering generative models from a single deepfake image

June 16, 2021

Deepfakes have become more believable in recent years. In some cases, humans can no longer easily tell them apart from genuine images. Although detecting deepfakes remains a compelling challenge, their increasing sophistication opens up more potential lines of inquiry, such as: What happens when deepfakes are produced not just for amusement and awe, but for malicious intent on a grand scale? Today, in partnership with Michigan State University (MSU), we are presenting a research method for detecting and attributing deepfakes that relies on reverse engineering from a single AI-generated image to the generative model used to produce it. Our method will facilitate deepfake detection and tracing in real-world settings, where the deepfake image itself is often the only information detectors have to work with.

Why reverse engineering?

Current approaches to deepfakes focus on telling whether an image is real or a deepfake (detection), or on identifying whether an image was generated by a model seen during training (image attribution via “closed-set” classification). But solving the problem of proliferating deepfakes requires going one step further and working out how to extend image attribution beyond the limited set of models present in training. It’s important to go beyond closed-set image attribution because a deepfake can be created using a generative model that was not seen in training.

Reverse engineering is a different way of approaching the problem of deepfakes, but it’s not a new concept in machine learning. Prior work on reverse engineering ML models arrives at the model by examining its input/output pairs, treating the model itself as a black box. Another approach assumes that hardware information, such as CPU and memory usage, is available during model inference. Both of these approaches depend on preexisting knowledge about the model itself, which limits their usability in real-world cases, where such information is often unavailable.

Our reverse engineering method relies on uncovering the unique patterns behind the AI model used to generate a single deepfake image. We begin with image attribution and then work on discovering properties of the model that was used to generate the image. By generalizing image attribution to open-set recognition, we can go beyond recognizing that a generative model has not been seen before and infer more information about it. And by tracing similarities among the patterns of a collection of deepfakes, we can also tell whether a series of images originated from a single source. This ability to detect which deepfakes were generated by the same AI model can be useful for uncovering instances of coordinated disinformation or other malicious attacks launched using deepfakes.
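To make the idea of tracing similarities among fingerprints concrete, here is a minimal sketch that groups images whose estimated fingerprints point in nearly the same direction. Everything in it is an assumption made for illustration: the fingerprints are short hand-written vectors, and the cosine measure, the greedy grouping rule, the 0.9 threshold, and the names `cosine_similarity` and `group_by_fingerprint` are hypothetical and not taken from the research.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_fingerprint(fingerprints, threshold=0.9):
    """Greedy grouping: an image joins the first group whose first member
    has a sufficiently similar fingerprint, else it starts a new group."""
    groups = []  # each group is a list of image indices
    for i, fp in enumerate(fingerprints):
        for group in groups:
            representative = fingerprints[group[0]]
            if cosine_similarity(fp, representative) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


# Toy fingerprints (hypothetical): images 0-1 and 2-3 share a pattern.
fingerprints = [
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 1.1, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.1, 1.0, 0.0, 0.9],
]
groups = group_by_fingerprint(fingerprints)
print(groups)
```

With these toy vectors the first two images land in one group and the last two in another, which is the kind of signal that could suggest a shared source.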

How it works

We begin by running a deepfake image through a fingerprint estimation network (FEN) to estimate details about the fingerprint left by the generative model. Device fingerprints are subtle but unique patterns left on each image produced by a particular device because of imperfections in the manufacturing process. In digital photography, fingerprints are used to identify the digital camera used to produce an image. Similar to device fingerprints, image fingerprints are unique patterns that a generative model leaves on the images it generates, and they can likewise be used to identify the generative model that an image came from.

Before the deep learning era, researchers typically used a small, handcrafted, and well-known set of tools to generate photos. The fingerprints of these generative models were estimated by their handcrafted features. Deep learning has made the set of tools that can be used to generate images limitless, making it impossible for researchers to identify specific “signals” or fingerprint properties by handcrafted features.

To overcome the limitation of working within an endless sea of possibilities, we used the properties of fingerprints as the basis for developing constraints to perform unsupervised training. Put differently, we estimated fingerprints using different constraints based on the general properties of a fingerprint, including its magnitude, repetitive nature, frequency range, and symmetrical frequency response. We then used different loss functions to apply these constraints to the FEN, forcing the estimated fingerprints to have these desired properties. Once fingerprint estimation is complete, the fingerprints can be used as inputs for model parsing.
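As a rough illustration of how fingerprint properties can be turned into loss terms, the sketch below scores a tiny fingerprint on two of the listed properties: magnitude and frequency range. It is not the FEN training objective. It assumes, purely for illustration, that a good fingerprint has small magnitude and little low-frequency energy, and the function names, the weights, and the `cutoff` parameter are all hypothetical. A naive discrete Fourier transform keeps the example dependency-free.

```python
import cmath


def spectrum_magnitude(fp):
    """Naive 2D DFT magnitude of a small square list-of-lists."""
    n = len(fp)
    mags = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0j
            for x in range(n):
                for y in range(n):
                    angle = -2 * cmath.pi * (u * x + v * y) / n
                    total += fp[x][y] * cmath.exp(1j * angle)
            mags[u][v] = abs(total)
    return mags


def magnitude_loss(fp):
    """Mean squared value: penalizes fingerprints with large magnitude."""
    n = len(fp)
    return sum(value * value for row in fp for value in row) / (n * n)


def low_frequency_ratio(fp, cutoff=1):
    """Share of spectral magnitude within `cutoff` bins of frequency zero."""
    n = len(fp)
    mags = spectrum_magnitude(fp)
    total = sum(sum(row) for row in mags)
    if total == 0:
        return 0.0
    low = sum(
        mags[u][v]
        for u in range(n)
        for v in range(n)
        if min(u, n - u) <= cutoff and min(v, n - v) <= cutoff
    )
    return low / total


def fingerprint_loss(fp, w_mag=1.0, w_low=1.0):
    """Weighted sum of the two illustrative constraints (weights assumed)."""
    return w_mag * magnitude_loss(fp) + w_low * low_frequency_ratio(fp)


n = 4
flat = [[0.1] * n for _ in range(n)]  # all energy at frequency zero
checker = [[0.1 * (-1) ** (x + y) for y in range(n)] for x in range(n)]
print(fingerprint_loss(flat), fingerprint_loss(checker))
```

Both toy patterns have the same magnitude, but the checkerboard concentrates its energy at the highest frequency, so under these assumed constraints it receives the lower loss.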

Model parsing is a novel problem that uses estimated generative model fingerprints to predict a model’s hyperparameters, that is, the properties that define its architecture and training, including the number of network layers, the number of blocks, and the types of operations used in each block. Another example of a hyperparameter that affects the types of deepfakes a model produces is its training loss functions, which guide how the model is trained. Both a model’s network architecture and its training loss function types have an impact on its weights and thus influence the way it generates images. To understand hyperparameters better, think of a generative model as a type of car and its hyperparameters as its various specific engine components. Different cars can look similar, but under the hood they can have very different engines with vastly different components. Our reverse engineering technique is somewhat like recognizing the components of a car based on how it sounds, even if this is a new car we've never heard of before.

Through our model parsing approach, we estimate both the network architecture of the model used to create a deepfake and its training loss functions. We normalized some continuous parameters of the network architecture to make training easier, and we also performed hierarchical learning to classify the loss function types. Since generative models mostly differ from each other in their network architectures and training loss functions, mapping from the deepfake or generative image to the hyperparameter space allows us to gain critical understanding of the features of the model used to create it.
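The normalization step can be pictured with a small sketch. The text above does not say which normalization was used, so min-max scaling to the range [0, 1] is an assumption here, and the model names, hyperparameter names, and values are invented for illustration.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical architecture hyperparameters for three generative models.
models = {
    "model_a": {"num_layers": 8, "num_blocks": 2},
    "model_b": {"num_layers": 16, "num_blocks": 4},
    "model_c": {"num_layers": 24, "num_blocks": 10},
}

names = list(models)
param_names = ["num_layers", "num_blocks"]

# Normalize each hyperparameter across models so that all regression
# targets share a comparable scale.
normalized = {name: {} for name in names}
for param in param_names:
    scaled = min_max_normalize([models[name][param] for name in names])
    for name, value in zip(names, scaled):
        normalized[name][param] = value

print(normalized["model_b"])
```

Putting every continuous target on the same scale keeps a hyperparameter with a large numeric range from dominating a regression loss.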

To test the approach, the MSU research team put together a fake image data set with 100,000 synthetic images generated from 100 publicly available generative models. Each of the 100 generative models corresponds to one open-source project developed and shared by researchers from throughout the scientific community. Some of the open-source projects already had fake images released, in which case the MSU research team randomly selected 1,000 images. In cases where the open-source project did not have any available fake images, the research team ran their released code to generate 1,000 synthetic images. Given that testing images may come from an unseen generative model in the real world, the research team mimicked real-world applications by performing cross-validation to train and evaluate our models on different splits of our data sets.
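One way to mimic testing on unseen generative models is to split by model rather than by image, so that no model contributes images to both training and testing. The sketch below shows that idea only. The number of folds, the identifiers such as `gm_000`, and the function name `leave_models_out_splits` are assumptions, and the actual splits used in the research may differ.

```python
def leave_models_out_splits(model_ids, num_folds):
    """Yield (train_models, test_models) pairs with disjoint model sets."""
    folds = [model_ids[i::num_folds] for i in range(num_folds)]
    for k in range(num_folds):
        test_models = folds[k]
        train_models = [
            model
            for j, fold in enumerate(folds)
            if j != k
            for model in fold
        ]
        yield train_models, test_models


# 100 hypothetical generative model identifiers, 1,000 images each.
model_ids = [f"gm_{i:03d}" for i in range(100)]
splits = list(leave_models_out_splits(model_ids, num_folds=4))

for train_models, test_models in splits:
    # Every test image then comes from a model never seen in training.
    print(len(train_models), "train models,", len(test_models), "test models")
```

Splitting at the model level matters because a random split over images would let the same generative model appear on both sides and overstate how well the approach generalizes.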

Our results

As we are the first to conduct model parsing, there are no existing baselines for comparison. We formed a baseline termed random ground-truth by randomly shuffling each hyperparameter in the ground-truth set. These random ground-truth vectors kept the original distribution. The results showed that our approach performs substantially better than the random ground-truth baseline. This indicated that there was indeed a much stronger and generalized correlation between generated images and the embedding space of meaningful architecture hyperparameters and loss function types, compared with a random vector of the same length and distribution. We also conducted ablation studies to demonstrate the effectiveness of fingerprint estimation and hierarchical learning.
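The random ground-truth baseline can be sketched as follows, assuming the ground truth is stored as one hyperparameter vector per generative model. Shuffling each hyperparameter independently across models breaks the link between a model and its true values while leaving every hyperparameter's distribution unchanged. The function name, the seed, and the toy values are hypothetical.

```python
import random


def random_ground_truth(ground_truth, seed=0):
    """Shuffle each hyperparameter (column) independently across models."""
    rng = random.Random(seed)
    num_params = len(ground_truth[0])
    shuffled_columns = []
    for j in range(num_params):
        column = [row[j] for row in ground_truth]
        rng.shuffle(column)  # in-place shuffle of one hyperparameter
        shuffled_columns.append(column)
    # Transpose back to one vector per model.
    return [list(row) for row in zip(*shuffled_columns)]


# Toy ground truth: rows are models, columns are hyperparameters.
ground_truth = [
    [8, 2, 0],
    [16, 4, 1],
    [24, 10, 1],
    [32, 6, 0],
]
baseline = random_ground_truth(ground_truth)
print(baseline)
```

Because only the assignment of values to models changes, any gap between the learned predictor and this baseline reflects information carried by the fingerprint rather than by the overall distribution of hyperparameters.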

One generated image from each of the 100 generative models (GMs) produces an estimated fingerprint on the left and a corresponding frequency spectrum on the right. Many frequency spectra show distinct high-frequency signals, while some appear to be similar to each other.

In addition to model parsing, our FEN can be used for deepfake detection and image attribution. For both tasks, we add a shallow network that takes the estimated fingerprint as input and performs binary classification (deepfake detection) or multi-class classification (image attribution). Although our fingerprint estimation is not tailored for these tasks, we still achieve results competitive with the state of the art, indicating the superior generalization ability of our fingerprint estimation.

As we continue to prioritize responsible AI, we are mindful of taking human-centered approaches to our research whenever possible. A diverse collection of deepfake images from 100 generative models means that our model was built with a representative selection and has a better ability to generalize across both human and non-human representations. Even though some of the original images used to generate the deepfakes are those of real individuals in publicly available face data sets, the MSU research team began the forensic-style analysis with the deepfakes rather than the original images used to create them. Because the method involves deconstructing a deepfake into its fingerprint, the MSU research team analyzed whether the model could map the fingerprint back to the original image content. The results showed this does not occur, which confirms that the fingerprint mainly contains the trace left by the generative models rather than the content of the original deepfakes.

All fake face images used for this research were generated at MSU. All experiments on the reverse engineering process were also conducted at MSU. MSU will open-source the data set, code, and trained models to the wider research community to facilitate research in various domains, including deepfake detection, image attribution, and reverse engineering of generative models.

Why it matters

Our research pushes the boundaries of understanding in deepfake detection, introducing the concept of model parsing that is more suited to real-world deployment. This work will give researchers and practitioners tools to better investigate incidents of coordinated disinformation using deepfakes, as well as open up new directions for future research.

MSU’s code, data set, and trained models

Model parsing was developed in collaboration with Vishal Asnani and Xiaoming Liu from Michigan State University.

Written By

Xi Yin

Research Scientist
