June 11, 2021
We’re introducing TextStyleBrush, an AI research project that can copy the style of text in a photo using just a single word. With this AI model, you can edit and replace text in images.
While most AI systems can do this only for well-defined, specialized tasks, TextStyleBrush is the first self-supervised AI model that replaces text in images of both handwriting and scenes — in one shot — using a single example word.
Although this is a research project, it could one day unlock new potential for creative self-expression like personalized messaging and captions, and lays the groundwork for future innovations like photo-realistic translation of languages in augmented reality (AR).
By publishing the capabilities, methods, and results of this research, we hope to spur dialogue and research into detecting potential misuse of this type of technology, such as deepfake text attacks — a critical, emerging challenge in the AI field.
AI-generated images have been advancing at breakneck speed — capable of synthetically reconstructing historical scenes or changing a photo to resemble the style of Van Gogh or Renoir. Now, we’ve built a system that can replace text both in scenes and handwriting — using only a single word example as input.
While most AI systems can do this for well-defined, specialized tasks, building an AI system that’s flexible enough to understand the nuances of both text in real-world scenes and handwriting is a much harder challenge. It means understanding unlimited text styles: not just different typography and calligraphy, but also different transformations, like rotations, curved text, and the deformations that arise between pen and paper when handwriting, as well as background clutter and image noise. Because of these complexities, it’s not possible to neatly segment text from its background, nor is it reasonable to create annotated examples for every possible appearance of the entire alphabet, as well as digits.
Today, we’re introducing TextStyleBrush, the first self-supervised AI model that replaces text in existing images of both scenes and handwriting — in one shot — using just a single example word. The work will also be submitted to a peer-reviewed journal.
It works similarly to the way style brush tools work in word processors, but applied to text aesthetics in images. It surpasses state-of-the-art accuracy in both automated tests and user studies for any type of text. Unlike previous approaches, which define specific parameters such as typeface or rely on target-style supervision, we take a more holistic training approach and disentangle the content of a text image from all aspects of the appearance of the entire word box. The resulting representation of the overall appearance can then be applied to new content as a one-shot style transfer, without retraining on the novel source style samples.
TextStyleBrush transferring text style on handwritten signs.
By openly publishing this research, we hope to spur additional research and dialogue to preempt deepfake text attacks, in the same way that we do with deepfake faces. If AI researchers and practitioners can get ahead of adversaries in building this technology, we can learn to better detect this new style of deepfakes and build robust systems to combat them. While this technology is still at the research stage, it could power a variety of useful applications in the future, like translating text in images into different languages, creating personalized messaging and captions, and maybe one day facilitating real-world translation of street signs using AR.
Typically, transferring text styles involves training a model with supervised data (source and target content in similar styles) and explicit segmentation of text (that is, classifying each pixel as either foreground or background). But it’s hard to build an efficient text segmentation method for images captured in the real world. The stroke of handwritten text, for instance, is often one pixel wide or even less. Plus, collecting good training data for segmentation involves the added complexity of labeling both foreground and background.
To train on real-world images directly, we take a different approach, one that is self-supervised in how it learns both styles and segmentation. We don’t assume any supervision of how styles are represented, nor the availability of segmented text labels (for either individual characters or whole words). We also don’t assume that the source style example and the new content share the same length. Given a detected text box containing a source style, we extract an opaque latent style representation, and we optimize that representation to allow photo-realistic rendering of new content in the source style using a single source sample.
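The one-shot transfer described above can be sketched as a small pipeline: encode the style of a single detected word box once, then reuse that latent to render arbitrary new content. The `style_encoder`, `content_encoder`, and `generator` below are hypothetical stand-ins (simple random projections and an outer product) for illustration only, not the actual TextStyleBrush networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def style_encoder(word_box: np.ndarray) -> np.ndarray:
    """Hypothetical style encoder: maps a detected word-box image to an
    opaque 64-dim latent style vector via a fixed random projection."""
    proj = rng.standard_normal((word_box.size, 64))
    return word_box.flatten() @ proj

def content_encoder(text: str) -> np.ndarray:
    """Hypothetical content representation: a bag-of-characters embedding.
    Note the new content need not match the source word's length."""
    vec = np.zeros(64)
    for i, ch in enumerate(text):
        vec[(ord(ch) + i) % 64] += 1.0
    return vec

def generator(style: np.ndarray, content: np.ndarray) -> np.ndarray:
    """Hypothetical generator: combines the style and content latents
    into an image-shaped output (an outer product, shape only)."""
    return np.outer(style, content)

# One-shot transfer: a single source word box supplies the style latent,
# which is then reused for new content without any retraining.
source_box = rng.random((32, 128))       # detected word box (H x W)
style = style_encoder(source_box)        # extracted exactly once
img_a = generator(style, content_encoder("hello"))
img_b = generator(style, content_encoder("world"))
```

The key design point this sketch mirrors is that only the style latent is extracted from the example image; swapping the content vector changes what is rendered, not how it looks.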
Our generator architecture is based on the StyleGAN2 model. For our goal of generating photo-realistic text images, however, the design of StyleGAN2 has two important limitations.
First, StyleGAN2 is an unconditional model, meaning it generates images by sampling a random latent vector. We, however, need to control the output based on two separate sources: our desired text content and style.
Second, we face the added challenge of the unique nature of stylized text images. Representing text styles involves a combination of global information (e.g., color palette and spatial transformation) along with detailed, fine-scale information, like the minute variations of individual penmanship.
We address these limitations jointly, by conditioning the generator on our content and style representations. We handle the multiscale nature of text styles by extracting layer-specific style information and injecting it at each layer of the generator. In addition to generating the target image in the desired style, the generator also generates a soft mask image that denotes the foreground pixels (text region). This way, the generator controls both low- and high-resolution details of the text appearance to match a desired input style.
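The layer-specific style injection above can be illustrated with the weight modulation/demodulation mechanism from StyleGAN2: a per-layer affine maps the global style latent to per-channel scales, which modulate that layer’s convolution weights. This is a simplified numpy sketch of the general mechanism under assumed shapes, not the actual TextStyleBrush code.

```python
import numpy as np

def modulated_conv_weights(weights, style, eps=1e-8):
    """StyleGAN2-style modulation: scale each input channel of the conv
    kernel by the per-layer style vector, then demodulate so each output
    filter keeps (approximately) unit norm.

    weights: (out_ch, in_ch, k, k) convolution kernel
    style:   (in_ch,) per-layer style scales
    """
    w = weights * style[None, :, None, None]                 # modulate
    demod = 1.0 / np.sqrt((w ** 2).sum(axis=(1, 2, 3)) + eps)
    return w * demod[:, None, None, None]                    # demodulate

rng = np.random.default_rng(0)
latent = rng.standard_normal(64)  # global style latent

# Each generator layer gets its own slice of style information: coarse
# layers capture global properties (color, geometry), fine layers capture
# penmanship-level detail. The affine per layer is a hypothetical A_l.
for out_ch, in_ch in [(32, 16), (64, 32)]:
    affine = rng.standard_normal((64, in_ch)) * 0.1
    layer_style = 1.0 + latent @ affine        # layer-specific scales
    w = rng.standard_normal((out_ch, in_ch, 3, 3))
    w_mod = modulated_conv_weights(w, layer_style)
    norms = np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3)))
```

Injecting a freshly transformed style at every layer, rather than once at the input, is what lets the generator control low- and high-resolution appearance details independently.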
Of course, when training our system, we can’t assume that we have style labels for real photos. In fact, because of the essentially limitless variability of text styles, it’s not clear what high-level parameters could be used to capture these styles. To address this challenge, we introduce a novel self-supervised training criterion that preserves both source style and target content using a typeface classifier, a text recognizer, and an adversarial discriminator. We first measure how well our generator captures the style of the input text by using a pretrained typeface classification network. Separately, we use a pretrained text recognition network to evaluate the content of a generated image, reflecting how well the generator captured the target content. Taken together, these signals allow for effective self-supervision of our training.
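The combined criterion can be sketched as a weighted sum of three terms: a style loss from a frozen typeface classifier, a content loss from a frozen text recognizer, and an adversarial loss from a discriminator. The networks below are stubbed as random linear maps over the flattened image, and the loss weights are illustrative assumptions; only the overall structure reflects the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_xent(logits, target):
    """Cross-entropy of a softmax over logits against a class index."""
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

# Hypothetical frozen, pretrained networks, stubbed as random linear
# maps for illustration (10 typeface classes, 26 character classes).
W_typeface = rng.standard_normal((1024, 10))
W_recog = rng.standard_normal((1024, 26))
W_disc = rng.standard_normal(1024)

def total_loss(gen_image, style_label, char_label,
               w_style=1.0, w_content=1.0, w_adv=0.1):
    x = gen_image.flatten()
    # Style: the typeface classifier should see the source style.
    style_loss = softmax_xent(x @ W_typeface, style_label)
    # Content: the recognizer should read out the target text.
    content_loss = softmax_xent(x @ W_recog, char_label)
    # Adversarial: non-saturating generator loss, softplus(-D(x)).
    adv_loss = np.log1p(np.exp(-(x @ W_disc)))
    return w_style * style_loss + w_content * content_loss + w_adv * adv_loss

loss = total_loss(rng.random((32, 32)), style_label=3, char_label=7)
```

Because all three scoring networks are trained (or pretrained) without manual style annotations, the generator receives a fully self-supervised training signal.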
TextStyleBrush shows that it’s possible to build AI systems that learn to transfer text aesthetics with more flexibility and accuracy than was possible before — using a one-word example. We’re continuing to improve the system by addressing limitations we’ve run into, like text written on metallic objects or characters rendered in different colors.
We hope this work will continue to lower barriers to photorealistic translation, creative self-expression, and the study of deepfake text attacks.
As the ongoing self-supervised revolution continues to progress, we see it as imperative that the AI field openly facilitate research into detecting misuse of technology. This includes moving beyond fake faces to text and sharing benchmark data sets, such as the Deepfake Detection Challenge data set. We hope that by openly publishing our work and methods for synthetically generated text styles, the broader AI field will be able to build on this work and make cumulative forward progress.