
Bidirectional Encoder Representations from Transformers (BERT)

BERT is a variation of the original Transformer model. Its main feature is that it aims to be easily transferable between different domain tasks, with no architectural changes and only minimal fine-tuning.

BERT uses only Encoder modules

The original Transformer used encoder and decoder modules, as in the classical seq2seq architecture for machine translation. It turns out that to solve the NLP tasks the Transformer solves, each encoder/decoder module needs to learn how the input or output language works, along with all the complexities involved in understanding that language. For this reason, and to stay general, BERT uses only the encoder part, while other architectures like GPT-3 use decoder-only modules. What is important here is that these models can handle language tasks with encoder-only or decoder-only modules.
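A minimal sketch of what "encoder-only" means in practice, using the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (neither is mentioned in the post, they are just a convenient way to look inside a BERT model):

```python
from transformers import BertModel

# Load a pre-trained BERT and inspect its structure.
model = BertModel.from_pretrained("bert-base-uncased")

# BERT-base is simply a stack of 12 identical Transformer encoder layers...
print(len(model.encoder.layer))   # 12

# ...and there is no decoder / cross-attention at all.
print(model.config.is_decoder)    # False
```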

BERT's main goal is to be easily transferable between domain tasks. To achieve this, BERT's training needs a special set of characteristics:

Pre-Training

Multiple training objectives for robust learned representations

BERT pre-training has two loss objectives to optimize

  • Masked Language Model (MLM)

  • Next Sentence Prediction (NSP)

It turns out that both of these tasks are very hard, so if the model has to optimize both of them at the same time, its final learned representations will be very robust and highly generalizable across language tasks. In other words, the internal representations BERT ends up with after optimizing both objectives can be reused in many more NLP tasks with a little fine-tuning and no architectural changes at all.

Masked Language Model

Some of the input sentence's words are randomly masked, and the model must predict what the masked words are based on the context of the unmasked words. To mask a word, the special [MASK] token is used.
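Here is a small illustration of the MLM objective using the Hugging Face `transformers` library (my own example, not from the original post): one word is hidden behind the tokenizer's mask token and a pre-trained BERT fills it back in.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hide one word behind [MASK].
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the highest-scoring vocabulary entry.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # expected: something like "paris"
```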

Next Sentence Prediction

The same input can contain the special token [SEP] to mark the separation between the first sentence and the second sentence. At the beginning of the input, a special [CLS] token is added, and its output representation is used for the binary task of predicting whether the second sentence follows the first one. At BERT's output, the prediction made on top of the [CLS] position is 1 or 0.
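A sketch of the NSP head, again with Hugging Face `transformers` and my own example sentences: the two sentences are joined with [SEP], and the output at the [CLS] position is classified as "B follows A" or not.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "She opened the fridge."
sentence_b = "She took out a bottle of milk."

# The tokenizer builds: [CLS] sentence_a [SEP] sentence_b [SEP]
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# In this implementation, index 0 means "sentence B follows sentence A".
print(logits.softmax(dim=-1))
```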

As you can see, both MLM and NSP use the same input, and the Transformer network optimizes both tasks at the same time during training. The reason these two tasks are used is that they are challenging and require understanding the entire input; this is what makes BERT's learned representations robust enough to tackle other NLP domain tasks.
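To make the "one input, two losses" idea concrete, here is a hedged sketch using `BertForPreTraining` from Hugging Face `transformers`; the sentences, the masked word, and the labels below are illustrative placeholders, not the actual pre-training pipeline.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# One input carrying both objectives: a masked word and a sentence pair.
inputs = tokenizer("The cat sat on the [MASK].", "It fell asleep there.",
                   return_tensors="pt")

# MLM labels: -100 means "ignore this position"; the masked position gets
# the id of the true word ("mat" here). NSP label 0 means "B follows A".
labels = torch.full_like(inputs["input_ids"], -100)
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
labels[mask_pos] = tokenizer.convert_tokens_to_ids("mat")

outputs = model(**inputs, labels=labels, next_sentence_label=torch.tensor([0]))
print(outputs.loss)  # single scalar: MLM loss + NSP loss, optimized together
```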

Composed Embeddings

The original Transformer model adds positional encodings to the input embeddings. This is necessary because Transformers process their inputs in parallel, which makes them very fast to train and is one of the main reasons to use them. Because the whole input enters in one shot, the positions are lost (the Transformer does not know whether the first word is the first word), so the positional information is encoded in an embedding vector and combined with the original input embedding. BERT enhances this representation by combining the token embeddings with:

  • Positional Encodings
  • Sentence Encodings

Sentence encodings are just a way to tell the model whether a certain token belongs to the first sentence or the second sentence.
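A minimal sketch (Hugging Face `transformers`, `bert-base-uncased`) of how these three pieces are composed: token, positional, and sentence embeddings are simply summed before entering the encoder stack.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
emb = model.embeddings

inputs = tokenizer("How are you?", "I am fine.", return_tensors="pt")
input_ids = inputs["input_ids"]
token_type_ids = inputs["token_type_ids"]            # 0 = first sentence, 1 = second
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

combined = (emb.word_embeddings(input_ids)            # token embeddings
            + emb.position_embeddings(positions)      # positional encodings
            + emb.token_type_embeddings(token_type_ids))  # sentence encodings

# The real module also applies LayerNorm and dropout after this sum.
print(combined.shape)  # (1, sequence_length, 768)
```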

Fine-Tuning

BERT can learn many language tasks different from the ones it was originally trained on. Fine-tuning is cheap to compute and usually does not require any changes to the model architecture, aside from the final output classification layer. BERT surpasses some task-specific state-of-the-art models, which is very impressive considering that most of BERT's training did not use the task-specific data.
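A hedged fine-tuning sketch: `BertForSequenceClassification` reuses the pre-trained encoder as-is and only adds a small, randomly initialized classification layer on top of the [CLS] representation. The texts, labels, and hyperparameters below are toy placeholders, not from the original post.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

texts = ["great movie, loved it", "a complete waste of time"]
labels = torch.tensor([1, 0])  # toy sentiment labels

inputs = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few fine-tuning steps on the toy batch
    optimizer.zero_grad()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
```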

Some studies show that BERT performs well on domain-specific tasks even with little data. This is because the pre-trained BERT representations already carry a good understanding of the world; from that starting point, it is just a matter of learning simple mappings to more specific tasks.

Written on April 7, 2021