Transformers
The Transformer Network
This is the diagram of the Transformer network presented in the Attention is All You Need paper. We will go through all the different pieces of this network throughout this notebook.
Transformers Explained Step by Step
Tokenization
The first step in processing text is to cut it into pieces called tokens. There are many variations of how to do this, and we won’t go into details, but BERT uses WordPiece tokenization. This means that tokens correspond roughly to words and punctuation, although a word can also be split into several tokens if it contains a common prefix or suffix. These are called sub-word tokens, and the continuation pieces are marked with leading `##` characters. Words can even be spelled out character by character if they have never been seen before.
Embedding
The second step is to associate each token with an embedding, which is nothing more than a vector of real numbers. Again, there are many ways to create embedding vectors. Fortunately, already trained embeddings are often provided by research groups, and we can just use an existing dictionary to convert the WordPiece tokens into embedding vectors.
The embedding of tokens into vectors is an achievement in itself. The values inside an embedding carry information about the meaning of the token, but they are also arranged in such a way that one can perform mathematical operations on them which correspond to semantic changes, like changing the gender of a noun, the tense of a verb, or even the country a capital city belongs to.
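As a rough sketch of what “converting tokens into embedding vectors” looks like in code (sizes are BERT-base-like; the token ids below are made up for illustration):

```python
import torch
import torch.nn as nn

# A lookup table mapping each token id to a 768-dimensional vector (BERT-base-like sizes).
vocab_size, embed_dim = 30522, 768
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[2003, 2011, 2314, 2924]])  # made-up ids standing in for real WordPiece ids
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([1, 4, 768]) -- one embedding per token
```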

Context
However, embeddings are associated with tokens by a straight dictionary look-up, which means that the same token always gets the same embedding, regardless of its context. This is where the attention mechanism comes in, and specifically for BERT, the scaled dot-product self-attention. Attention transforms the default embeddings by analyzing the whole sequence of tokens, so that the resulting values better represent each token in the context of the sentence.

Self Attention Mechanism
Let’s have a look at this process with the sequence of tokens `walk`, `by`, `river`, `bank`. Each token is initially replaced by its default embedding, which in this case is a vector with 768 components.
Let’s color the embedding of the first token to follow what happens to it. We start by calculating the scalar product between pairs of embeddings. Here we have the first embedding with itself. When the two vectors are more correlated, or aligned, meaning that they are generally more similar, the scalar product is higher (darker in the image), and we consider that they have a strong relationship. If they had less similar content, the scalar product would be lower (brighter in the image) and we would consider that they don’t relate to each other.


Then comes the only non-linear operation in the attention mechanism: the scalar products are first scaled, by dividing them by the square root of the vector dimension (hence “scaled dot-product”), and then passed through a softmax activation function, by groups corresponding to each input token. So in this illustration, we apply the softmax column by column. What the softmax does is exponentially amplify large values while crushing low and negative values towards zero. It also normalizes the result, so that each column sums up to 1.
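To make the effect concrete, here is the softmax applied to a small made-up column of scores:

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1, -1.0])   # made-up scaled dot products for one column
weights = F.softmax(scores, dim=0)
print(weights)          # approximately [0.64, 0.23, 0.10, 0.03] -- large values are amplified
print(weights.sum())    # 1.0 -- the column is normalized
```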

Finally, we create a new embedding vector for each token by linear combination of the input embeddings, in proportions given by the softmax results. We can say that the new embedding vectors are contextualized, since they contain a fraction of every input embedding for this particular sequence of tokens. In particular, if a token has a strong relationship with another one, a large fraction of its new contextualized embedding will be made of the related embedding. If a token doesn’t relate much to any other, as measured by the scalar product between their input embeddings, its contextualized embedding will be nearly identical to the input embedding.
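Putting the three steps together (scalar products, scaling, softmax, weighted sum), a bare version of the mechanism, before any Key/Query/Value projections, looks roughly like this (the text above normalizes column by column; here each row plays that role, which is the same idea with the score matrix transposed):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 768)                 # default embeddings for "walk", "by", "river", "bank"

scores = x @ x.T / (768 ** 0.5)         # pairwise scaled dot products
weights = F.softmax(scores, dim=-1)     # normalize so that each token's weights sum to 1
contextualized = weights @ x            # each new embedding is a mix of all input embeddings
print(contextualized.shape)             # torch.Size([4, 768])
```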

For instance, one can imagine that the vector space has a direction that corresponds to the idea of nature. The input embeddings of the tokens `river` and `bank` should both have large values in that direction, so that they are more similar and have a strong relationship. As a result, the new contextualized embeddings of the `river` and `bank` tokens would combine both input embeddings in roughly equal parts. On the other hand, the preposition `by` sounds quite neutral, so its embedding should have a weak relationship with every other one, and little modification of its embedding vector would occur. So there we have the mechanism that lets scaled dot-product attention make use of context.

Keys, Queries and Values
However, that’s not the whole story. Most importantly, we don’t have to use the input embedding vectors as is. We can first project them using 3 linear projections to create the so-called Key, Query, and Value vectors. Typically, the projections are also mapping the input embeddings onto a space of lower dimension. In the case of BERT, the Key, Query, and Value vectors all have 64 components.
Each projection can be thought of as focusing on different directions of the vector space, which would represent different semantic aspects. One can imagine that a Key is the projection of an embedding onto the direction of “prepositions”, and a Query is the projection of an embedding along the direction of “locations”. In this case, the Key of the token `by` should have a strong relationship with every other Query, since `by` should have strong components in the direction of “prepositions”, and every other token should have strong components in the direction of “locations”. The Values can come from yet another relevant projection, for example the direction of “physical places”. It’s these Values that are combined to create the contextualized embeddings. In practice, the meaning of each projection may not be so clear, and the model is free to learn whatever projections let it solve language tasks most efficiently.
Multi-head Attention
In addition, the same process can be repeated many times with different Key, Query, and Value projections, forming what is called multi-head attention. Each head can focus on different projections of the input embeddings. For instance, one head could calculate the preposition/location relationships, while another head could calculate subject/verb relationships, simply by using different projections to create the Key, Query, and Value vectors. The outputs from each head are concatenated back into a single large vector. BERT uses 12 such heads of 64 components each, so the final output contains one 768-component contextualized embedding vector per token, the same size as the input.
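Shape-wise, the concatenation step looks like this (12 heads of 64 components each give back a 768-component vector per token):

```python
import torch

num_heads, head_dim, seq_len = 12, 64, 4
head_outputs = [torch.randn(seq_len, head_dim) for _ in range(num_heads)]  # stand-ins for 12 head outputs
concatenated = torch.cat(head_outputs, dim=-1)
print(concatenated.shape)   # torch.Size([4, 768]) -- the same width as the input embeddings
```

In practice, a final linear layer is usually applied afterwards to mix the concatenated heads back together.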
Positional Encoding
We can also kickstart the process by adding the input embeddings to positional embeddings. Positional embeddings are vectors that contain information about a position in the sequence, rather than about the meaning of a token. This adds information about the sequence even before attention is applied, and it allows attention to calculate relationships knowing the relative order of the tokens.

A detailed explanation of how it works can be found here, but a quick explanation is that we create a vector for each element representing its position with regard to every other element in the sequence. Positional encoding follows this very complicated-looking formula which, in practice, we won’t really need to understand:
\[p_{i,j} = \left\{ \begin{array}{@{}ll@{}} \sin \left(\frac{i}{10000^{\frac{j}{dim\:embed}}} \right), & \text{if}\ j\ \text{is even} \\ \cos \left(\frac{i}{10000^{\frac{j-1}{dim\:embed}}} \right), & \text{if}\ j\ \text{is odd} \\ \end{array}\right.\]
Here \(i\) is the position of the token in the sequence and \(j\) is the index of the component within the embedding vector.
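For reference, here is one way to compute these sinusoidal positional embeddings directly from the formula (note that the implementation later in this notebook learns its positional embeddings instead of using this formula):

```python
import torch

def sinusoidal_positional_encoding(seq_len, dim_embed):
    i = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)    # positions i: (seq_len, 1)
    j = torch.arange(dim_embed, dtype=torch.float32).unsqueeze(0)  # dimensions j: (1, dim_embed)
    # Each sin/cos pair shares the same frequency, hence the floor to the nearest even j.
    angles = i / (10000 ** (2 * torch.div(j, 2, rounding_mode="floor") / dim_embed))
    pe = torch.zeros(seq_len, dim_embed)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = torch.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

print(sinusoidal_positional_encoding(4, 768).shape)   # torch.Size([4, 768])
```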
BERT
Finally, thanks to the non-linearity introduced by the softmax function, we can achieve even more complex transformations of the embeddings by applying attention again and again, with a couple of helpful steps between each application. A complete model like BERT uses 12 layers of attention, each with its own set of projections. So when you search for suggestions for a “walk by the river bank”, the computer doesn’t just get a chance to recognize the keyword “river”: even the numerical values given to “bank” indicate that you’re interested in enjoying the waterside, and not in need of the nearest cash machine.
Importing Libraries
Multi Head Attention
Attention is a mechanism that allows neural networks to assign a different amount of weight or attention to each element in a sequence. For text sequences, the elements are token embeddings, where each token is mapped to a vector of some fixed dimension. For example, in BERT each token is represented as a 768-dimensional vector. The “self” part of self-attention refers to the fact that these weights are computed for all hidden states in the same set—for example, all the hidden states of the encoder. By contrast, the attention mechanism associated with recurrent models involves computing the relevance of each encoder hidden state to the decoder hidden state at a given decoding timestep.
The main idea behind self-attention is that instead of using a fixed embedding for each token, we can use the whole sequence to compute a weighted average of each embedding. Another way to formulate this is to say that given a sequence of token embeddings \(x_{1}, x_{2}, ..., x_{n}\), self-attention produces a sequence of new embeddings \(x^{'}_{1}, x^{'}_{2}, ..., x^{'}_{n}\) where each \(x^{'}_{i}\) is a linear combination of all the \(x_{j}\):
\[x^{'}_{i} = \sum^{n}_{j=1} w_{ji} \cdot x_{j}\]
MultiHeadAttention
MultiHeadAttention (embed_size, heads)
Inherited documentation from `torch.nn.Module`:

Base class for all neural network modules. Your models should also subclass this class. Modules can contain other Modules, allowing them to be nested in a tree structure. You can assign the submodules as regular attributes:

```python
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))
```

Submodules assigned in this way will be registered, and their parameters will be converted too when you call `.to()`, etc. Note that, as per the example above, an `__init__()` call to the parent class must be made before assignment on the child. The `training` attribute (bool) indicates whether the module is in training or evaluation mode.
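A hypothetical usage sketch for this module. The constructor arguments come from the signature above; the `forward` call below assumes a `(values, keys, query, mask)` layout, which is common in from-scratch implementations but should be checked against the actual code:

```python
import torch

attention = MultiHeadAttention(embed_size=768, heads=12)

x = torch.randn(2, 4, 768)            # (batch, sequence length, embedding size)
out = attention(x, x, x, None)        # self-attention: values, keys and queries are all x; no mask
print(out.shape)                      # expected: torch.Size([2, 4, 768])
```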
Encoder Layer

We will be referring to the encoder layer. The encoder layer/block consists of:

1. Multi-Head Attention
2. Add & Norm
3. Feed Forward
4. Add & Norm again
- `nn.LayerNorm()` is used for the Add & Norm steps.
- `forward_expansion` is a parameter from the “Attention is All You Need” paper which simply adds nodes to the Linear layer of the feed-forward block. Since it is used in two stacked linear layers, it doesn’t affect the shape of the output (same as the input); it just adds some extra computation (see the sketch below). Its default value is 4.
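A sketch of the position-wise feed-forward sub-layer this describes: the first linear layer expands the width by `forward_expansion`, the second projects it back, so the output shape matches the input:

```python
import torch
import torch.nn as nn

embed_size, forward_expansion = 768, 4
feed_forward = nn.Sequential(
    nn.Linear(embed_size, forward_expansion * embed_size),  # 768 -> 3072
    nn.ReLU(),
    nn.Linear(forward_expansion * embed_size, embed_size),  # 3072 -> 768, back to the input shape
)

x = torch.randn(2, 4, embed_size)
print(feed_forward(x).shape)   # torch.Size([2, 4, 768])
```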
TransformerLayer
TransformerLayer (embed_size, heads, dropout, forward_expansion=4)
Encoder

We will be referring to the encoder. The encoder consists of:

1. Embedding
2. Positional Encoding
3. Transformer Blocks (a stack of the encoder layers above)
Encoder
Encoder (src_vocab_size, embed_size, num_layers, heads, device, forward_expansion, dropout, max_length)
Decoder Layer

We will be referring to the decoder layer. The decoder layer/block consists of:

1. Masked Multi-Head Attention
2. Add & Norm
3. Multi-Head Attention (cross-attention over the encoder output)
4. Add & Norm
5. Feed Forward
6. Add & Norm
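The “masked” part of step 1 refers to a look-ahead (causal) mask: each target position may only attend to itself and earlier positions. A minimal sketch of such a mask:

```python
import torch

trg_len = 5
trg_mask = torch.tril(torch.ones(trg_len, trg_len))   # lower-triangular: 1 = may attend, 0 = masked out
print(trg_mask)
# tensor([[1., 0., 0., 0., 0.],
#         [1., 1., 0., 0., 0.],
#         [1., 1., 1., 0., 0.],
#         [1., 1., 1., 1., 0.],
#         [1., 1., 1., 1., 1.]])
```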
DecoderLayer
DecoderLayer (embed_size, heads, forward_expansion, dropout, device)
Decoder

We will be referring to the decoder. The decoder consists of:

1. Output Embedding
2. Decoder Block
3. Linear
4. Softmax

Notes:

- In this implementation the Token Embeddings are learned. Normally, we would use the output of the model’s tokenizer.
- In this implementation the Positional Embeddings are learned; we don’t use the sinusoidal formula shown earlier.
Decoder
Decoder (trg_vocab_size, embed_size, num_layers, heads, forward_expansion, dropout, device, max_length)
Transformer

Transformer
Transformer (src_vocab_size, trg_vocab_size, src_pad_idx, trg_pad_idx, embed_size=512, num_layers=6, forward_expansion=4, heads=8, dropout=0, device='cpu', max_length=100)
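A hypothetical end-to-end usage sketch, based on the signature above. It assumes the model’s `forward` takes `(src, trg)` and returns one row of target-vocabulary logits per target position; the toy token ids are made up, and the exact call should be checked against the notebook’s forward method:

```python
import torch

device = "cpu"
model = Transformer(src_vocab_size=10, trg_vocab_size=10, src_pad_idx=0, trg_pad_idx=0, device=device)

src = torch.tensor([[1, 5, 6, 4, 3, 9, 5, 2, 0]], device=device)   # made-up source token ids
trg = torch.tensor([[1, 7, 4, 3, 5, 9, 2, 0]], device=device)      # made-up target token ids

out = model(src, trg[:, :-1])   # feed the target shifted right (teacher forcing)
print(out.shape)                # expected: torch.Size([1, 7, 10]) -- logits over the target vocabulary
```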