Making Transformer networks simpler and more efficient


Transformer networks have brought big improvements to many areas of deep learning, including machine translation, text understanding, and speech and image processing. As powerful as these networks are, they are quite hungry for computational resources during both training and inference, which limits their usage at scale, especially for sequences with long-term dependencies. New research from Facebook AI is looking at ways to make the Transformer model simpler and more efficient.

To enable wider use of this powerful deep learning architecture, we propose two new methods. The first, adaptive attention span, is a way to make Transformer networks more efficient for longer sequences. With this method, we were able to increase the attention span of a Transformer to over 8,000 tokens without significantly increasing computation time or memory footprint. The second, the all-attention layer, is a way to simplify the model architecture of Transformer networks. Even with a much simpler architecture, our all-attention network matched the state-of-the-art performance of Transformer networks. We believe that this work to improve the efficiency of Transformer networks is an important step toward wider adoption.

Adaptive attention span

The goal of this research is to make Transformer networks more computationally efficient, especially when processing very long sequences. Discovering long-term relations in data requires a longer attention span. However, increasing attention span also increases the computation time and memory footprint of a Transformer.

In our experiments with Transformers, we observed that not all the attention heads utilize their attention span to the fullest. In fact, in a task of character-level language modeling, most of the heads were using only a small portion of their attention span. If we can take advantage of this property during training, we can reduce the computation time and memory footprint significantly, as both depend on the length of attention span. Unfortunately, we don’t know how much attention span each head requires. After many attempts to set attention spans heuristically, we realized that it’s best if we can learn that from the data itself.

As an attention span is an integer (and thus non-differentiable), we cannot directly learn it with back-propagation like the other parameters of the model. However, we can convert it to a continuous value using a soft-masking function. The value of this function goes from 1 to 0 smoothly, which makes it possible to differentiate it with respect to the masking length. We simply insert this masking function into each attention head so that each head can have a different attention span, determined by the data.
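To make the soft-masking idea concrete, here is a minimal sketch in plain Python. It assumes one particular piecewise-linear mask that stays at 1 up to the learned span and then falls to 0 over a fixed ramp. The function names and the ramp parameter are illustrative choices, not taken from the released code.

```python
import math


def soft_mask(distance, span, ramp=32.0):
    """Assumed mask shape: 1 up to `span`, then a linear fall to 0 over `ramp` steps."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def masked_attention(scores, span, ramp=32.0):
    """Attention weights for one head, reweighted by the soft mask.

    scores[d] is the raw attention score for the token d steps in the past.
    Because `span` is a real number, the result varies smoothly with it.
    """
    peak = max(scores)  # subtract the max for numerical stability
    weights = [
        soft_mask(d, span, ramp) * math.exp(s - peak)
        for d, s in enumerate(scores)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical head: uniform scores over 20 past tokens, span of 10, ramp of 4.
attention = masked_attention([0.0] * 20, span=10.0, ramp=4.0)
```

In this sketch, tokens beyond span + ramp receive exactly zero weight, so they never need to be computed or stored. That is where the savings would come from once a head has learned a short span.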

With our adaptive attention span mechanism, we managed to increase the attention span of a Transformer to over 8,000 tokens without significantly increasing its computation time and memory footprint. On character-level language modeling tasks, this led to performance that improves the state of the art, with fewer parameters.

While the longest attention span in the model is over 8,000 steps, the average attention span is only around 200 steps, which makes the models much more efficient to run. This is reflected in the number of FLOPs per step, which is significantly smaller for these models. In the figure below, we show one such set of learned attention spans for a 12-layer model with eight heads in each layer. We can see that only five out of 96 heads have a span over 1,000 steps.
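As a rough back-of-the-envelope check on these numbers, the sketch below assumes that the attention cost of a head grows linearly with its span and ignores every other cost in the network. The function name is illustrative, and the result is an estimate under that assumption, not a measured figure.

```python
def attention_cost_ratio(average_span, fixed_span):
    """Attention cost with learned spans, relative to every head using `fixed_span`.

    Assumes per-head cost is proportional to span length.
    """
    return average_span / fixed_span


# Figures quoted above: average span around 200, longest span around 8,000.
ratio = attention_cost_ratio(200, 8000)
savings_factor = 1 / ratio
```

Under this assumption, the attention-related work is about one fortieth of what it would be if every head attended over the full 8,000 steps.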

We have released the code for running the experiments in our paper. As the adaptive attention span mechanism is implemented as a PyTorch nn.Module, it can be easily integrated into other neural models.

All-attention layer

Next, we focused on simplifying the architecture of Transformer networks. A Transformer layer is composed of two sublayers: self-attention and feedforward. Although the self-attention layer is considered to be the main component, the feedforward sublayer is important for strong performance, which is why its size is often set at four times larger than the rest of the network.

On the surface, self-attention and feedforward sublayers look very different from each other. However, with one simple change, a feedforward sublayer can be turned into an attention layer. By replacing the ReLU nonlinear function with a softmax function, we can interpret its activations as attention weights. Furthermore, we can view the first linear transformation as key vectors and the second linear transformation as value vectors.
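The correspondence can be sketched in a few lines of plain Python. This is a simplified illustration that omits bias terms and uses hypothetical 2-dimensional weights. Here keys[i] stands for row i of the first linear transformation and values[i] for the matching column of the second.

```python
import math


def relu(xs):
    return [max(x, 0.0) for x in xs]


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_layer(x, keys, values, activation):
    """Score x against each key, apply the activation, then mix the values."""
    weights = activation([dot(k, x) for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Hypothetical weights and input, chosen only for illustration.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, -1.0]

feedforward_out = two_layer(x, keys, values, relu)       # standard feedforward
attention_like_out = two_layer(x, keys, values, softmax)  # same weights, read as attention
```

The only difference between the two calls is the activation. With softmax, the weights are non-negative and sum to 1, so the output is a weighted average of the value vectors, exactly as in attention.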

Taking advantage of this interpretation, we merge the feedforward sublayer into the self-attention sublayer, creating a unified attention layer, which we call the all-attention layer. All we have to do is add an extra set of vectors to the keys and values of a self-attention sublayer. Those extra vectors are like the weights of a feedforward sublayer: fixed, trainable, and context independent. In contrast, the keys and values computed from the context change dynamically, depending on the current context.
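A minimal single-head, single-query sketch of this merge is shown below in plain Python. The scaling by the square root of the dimension is the usual dot-product attention convention and is an assumption here. The names and the toy vectors are hypothetical, and details such as positional information and multiple heads are left out.

```python
import math


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def all_attention(query, context_keys, context_values,
                  persistent_keys, persistent_values):
    """Attend jointly over context vectors and extra, context-independent vectors."""
    # The extra vectors are simply appended to the context keys and values.
    keys = context_keys + persistent_keys
    values = context_values + persistent_values
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


# Hypothetical example: one context token and one persistent vector.
output = all_attention(
    query=[1.0, 0.0],
    context_keys=[[1.0, 0.0]],
    context_values=[[1.0, 1.0]],
    persistent_keys=[[1.0, 0.0]],     # would be trainable parameters in a real model
    persistent_values=[[3.0, 5.0]],   # would be trainable parameters in a real model
)
```

In a trained model, the persistent keys and values would be learned parameters shared across all positions, while the context keys and values are recomputed for every input.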

Since the extra vectors can act like a feedforward sublayer and capture general knowledge about the task, we can remove all feedforward sublayers from the network. In the end, our all-attention network is just a stack of all-attention layers. On language modeling benchmark tasks, our all-attention network matched the state-of-the-art performance of Transformer networks, with a much simpler architecture. We hope this simplified architecture will open a path for better understanding and improving Transformer networks.