Simplifying Transformers (in ML)

Jed Lee
9 min read · Aug 18, 2023

My own version (attempt) of telling this story

Left Photo by Arseny Togulev & Right Photo by Iqram-O-dowla Shawon on Unsplash

Content of this Article

  1. Introduction
  2. What are Transformers?
  3. What came before?
  4. How does a Transformer work?
  5. Transformers in BERT & GPT
  6. What’s next?

Introduction

In the realm of AI and NLP, the term “transformer” has become synonymous with breakthroughs and state-of-the-art performance. It has greatly redefined the paradigms of NLP applications.

This article wants to help answer questions like:

  • What exactly makes the transformer so transformative? (pun not intended)
  • What are the missing pieces that the transformer is able to address?
  • How exactly does a transformer work?
  • What are its current applications and what’s next?

What are Transformers?

At its core, a transformer is a deep learning model introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017.

What exactly makes the transformer so transformative?

  1. Parallel Processing & Scalability: Unlike its predecessors that process data sequentially, a transformer can process all the inputs (words in a sentence) simultaneously. This is made possible by the attention mechanism. This parallelism led to faster training times and better scalability.
    On the WMT 2014 English-to-German translation task, the transformer model trained on eight P100 GPUs took 3.5 days to achieve a BLEU score of 28.4, while comparable RNN models took more than a week to achieve similar performance.
  2. Attention Mechanism: The heart of the transformer is its attention mechanism. This “attention” enables the model to capture long-range dependencies and relationships in the data, something that was challenging for previous models. This advancement greatly eases the modeling of long sequences.
  3. Flexibility across Tasks: The transformer architecture is highly modular, making it adaptable to a wide range of tasks without significant modifications.
    Several models like BERT, GPT, T5, and RoBERTa, have been developed based on the transformer architecture.

Around 2017, deep learning was experiencing rapid advancements. With a surge in available computational power and larger, more diverse datasets, the transformer model was able to set new performance benchmarks across various NLP tasks.

What came before?

Image from The A.I. Hacker — Michael Phi

Before the advent of transformers, the deep learning landscape was dominated by Recurrent Neural Networks (RNNs) and their advanced variant, Long Short-Term Memory (LSTM) networks.

While revolutionary in their time, they had certain limitations:

What were the missing pieces that the transformer is able to address?

  1. Sequential Processing & Scalability: Traditional models, such as RNNs, processed data sequentially. For example, when processing a sentence, an RNN would address each word consecutively, updating a hidden state with every word. This sequential approach not only slowed down processing, particularly for extended sequences, but also made these models hard to parallelize, posing scalability challenges.
    Models like GPT-3, with 175 billion parameters, would be incredibly challenging to train using sequential architectures.
  2. Vanishing Gradient Problem & Difficulty in Capturing Long-Term Dependencies: RNNs, while innovative, encountered a persistent issue known as the vanishing gradient problem, a consequence of training them with backpropagation through time. This challenge made it difficult for them to remember or give significance to older inputs in a sequence. LSTMs and GRUs were introduced as solutions, designed specifically to tackle this problem. Yet, they were not always successful, especially with very long sequences. Transformers have consistently demonstrated superior performance, especially in tasks that require recognizing distant relationships in data, such as coreference resolution.
    Sidebar: What is Coreference Resolution?
    In NLP, coreference resolution is the task of finding all expressions in a text that refer to the same entity. For instance, in “Anna told her sister that she would call later,” the words “Anna” and “she” refer to the same person. Coreference resolution aims to identify these relationships.
  3. Contextual Depth/Understanding: While LSTMs and GRUs possess gating mechanisms enabling them to recall previous information, their ability to deeply understand context has limits. Transformers, on the other hand, excel in this area. Their architecture allows them to weigh the importance of each word in relation to others, providing a richer, more nuanced understanding of context.
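The vanishing gradient problem described above can be illustrated with a toy numerical sketch. This is not a real RNN; the recurrent matrix with spectral norm 0.9 is an assumption chosen purely to show how repeated multiplication during backpropagation shrinks gradients exponentially:

```python
import numpy as np

# Toy illustration of the vanishing gradient problem (not a real RNN):
# backpropagating through T time steps multiplies the gradient by the
# recurrent weight's Jacobian repeatedly. With a spectral norm < 1,
# the product shrinks exponentially with distance.
W = 0.9 * np.eye(4)          # recurrent weight with spectral norm 0.9
grad = np.ones(4)            # gradient arriving at the last time step

norms = []
for t in range(50):          # propagate back through 50 time steps
    grad = W.T @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])   # gradient norm after 1 step vs. after 50
```

After 50 steps the gradient norm has decayed by a factor of roughly 0.9⁵⁰ ≈ 0.005, so signals from early inputs barely influence learning.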

How does a Transformer work?

The Transformer Architecture from “Attention Is All You Need”

To understand the Transformer, you must first understand the Attention Mechanism, followed by the Encoder-Decoder structure.

Attention Mechanism

The Attention Mechanism allows the transformer to focus on different parts of the input data based on their relevance.

Imagine you are in a noisy bar, and you are trying to engage in a conversation with your friend. Even though there are many people talking around you, you focus mainly on your friend’s voice. The transformer does something similar. Instead of treating all words or pieces of information equally, it focuses more on the important bits — your friend’s voice.

In more technical terms, the Multi-Headed Attention module in the diagram above applies a mechanism called Self-Attention that allows the model to associate each individual word in the input to other words in the input.

Consider the phrase: “How are you doing today?” Here, the model might associate the word “you” with “how”, recognizing the typical structure of a question and responding accordingly.

The beauty of the Attention Mechanism lies in its dynamic computation of attention scores. This score determines how much each word in the input should influence the current word being processed in the output. For instance, in the sentence “The cat, which was brown, sat on the mat”, when translating or processing the word “sat”, the words “cat” and “mat” might have higher attention scores because they are contextually relevant to the action “sat”.

To achieve this, the input undergoes processing through 3 distinct fully connected layers to create the Query, Key, Value (Q, K, V) vectors. These vectors serve as representations of the input data. The Q, K, V paradigm can be likened to retrieval systems. For instance, when searching on YouTube, the engine compares your query (the search text) with a set of keys (like video titles or descriptions) linked to potential videos in its database. It then showcases the most relevant videos (values) based on this comparison.

I recommend reading up on this StackExchange thread to further your understanding of Self-Attention. The contributors did an amazing job explaining the concept.

Under the hood, the Attention Mechanism assigns a weight to each input word by computing a scaled dot product between the Query and the Key. Scaling ensures stability in training. A subsequent softmax operation converts these raw attention scores into probabilities, indicating each word’s significance in relation to others. Using the softmax-ed attention scores as weights, the model then computes a weighted sum of the Value vectors; this combination provides a context-rich representation of the data.
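The scaled dot-product attention just described can be sketched in a few lines. This is a minimal single-head version with toy dimensions; the random matrices stand in for the learned Q, K, V projection layers:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project input into Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scaled dot product
    weights = softmax(scores, axis=-1)        # attention probabilities
    return weights @ V, weights               # weighted sum of Values

np.random.seed(0)
T, d_model, d_k = 5, 8, 4                     # 5 "words", toy dimensions
X = np.random.randn(T, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (5, 4)
print(weights.sum(axis=-1))                   # each row sums to 1
```

Each row of `weights` tells you how much every other word contributes to that word’s new representation, which is exactly the “cat”/“mat” vs. “sat” intuition above.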

In essence, the Attention Mechanism enables transformers to dynamically weigh input relevance for each output word, preserving the context’s “memory.” This capability allows for understanding relationships between distant words in a sequence.

While I have provided an overview, delving into every nuance of this concept would be quite extensive. For a more comprehensive understanding, I highly recommend this YouTube video, which succinctly captures the essence of the topic.

Encoder-Decoder Structure

The transformer architecture comprises an encoder and a decoder. At their core, encoders and decoders are components of neural network architectures.

At a high level, the encoder maps an input sequence into an abstract continuous representation that holds all the learned information of the input. The decoder then takes this continuous representation and generates the output step by step, one element at a time, while also being fed its own previous outputs.

Simply put, the encoder processes the input data, creating a context for each word. Think of the encoder as someone reading a book, understanding all of the words and meaning inside it.

The decoder then uses this context to produce the desired output, be it a translated sentence, a summary, or an answer to a question regarding the book.

The Encoder’s primary function is to capture and distil information from the input data. This compressed representation, often referred to as embeddings or features, serves as a bridge between raw input and meaningful context. By extracting and preserving the most relevant features in a compact form, we can train the encoder on a large and diverse dataset, enabling it to identify and internalize intricate patterns and relationships within the data.

The Decoder’s primary function is to take the compressed representation (or embeddings) produced by the encoder and translate or transform it into the desired output format. This can be a sequence of words, a classification label, or any other type of target output. In essence, while the encoder captures and compresses the information, the decoder acts as a translator, decompressing and interpreting that information to produce a meaningful result.
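The encode-then-decode data flow above can be sketched as follows. This is a deliberately simplified, hypothetical illustration: `encode` and `decode_step` are stand-ins that show only the shape of the computation, not the stacked attention and feed-forward layers a real transformer uses:

```python
import numpy as np

def encode(input_ids, embedding):
    """Map token ids to a continuous representation (the 'memory')."""
    return embedding[input_ids]               # (T, d) matrix of embeddings

def decode_step(memory, prev_outputs, embedding):
    """Produce the next token id from the memory and previous outputs."""
    query = embedding[prev_outputs[-1]]       # embed the last generated token
    scores = memory @ query                   # attend over the encoder memory
    probs = np.exp(scores) / np.exp(scores).sum()
    context = memory.T @ probs                # soft lookup into the memory
    return int(np.argmax(embedding @ context))  # closest vocabulary token

np.random.seed(0)
vocab_size, d = 10, 4
embedding = np.random.randn(vocab_size, d)
src = np.array([3, 1, 4])                     # toy "input sentence"
memory = encode(src, embedding)               # encoder runs once

generated = [0]                               # start token
for _ in range(3):                            # decoder runs step by step
    generated.append(decode_step(memory, generated, embedding))
print(generated)
```

Note the asymmetry: the encoder processes the whole input in one pass, while the decoder loops, feeding each output back in as context for the next step.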

Transformers in BERT & GPT

What are its current applications and what’s next?

Transformers have paved the way for a plethora of applications in the NLP domain. Two of the most prominent offspring of the transformer era are BERT (Bidirectional Encoder Representations from Transformers) by Google and the GPT (Generative Pre-trained Transformer) series by OpenAI.

Image from AITechTrend

BERT vs. GPT

The inception of both BERT and GPT can be traced back to the transformer blueprint. Their architecture fundamentally relies on the self-attention mechanism pioneered by the transformer concept.

  • BERT: Developed by Google, BERT is primarily an Encoder-Only Transformer. It is crafted to discern the relationships between different words and their meanings in various contexts. It doesn’t generate sequences like GPT but produces embeddings that capture the context of words in their surrounding environment. For instance, in “He went to the bank to withdraw money,” BERT can ascertain that “bank” implies a financial establishment rather than a river’s edge.
  • GPT: From OpenAI, GPT leverages only the Decoder aspect of the transformer. It is architected to generate text rather than to map one sequence to another (seq2seq). GPT is trained to predict the next word in a sentence based on the context provided by all the preceding words. It’s akin to predicting the next note in a melody after hearing the initial notes. Feed it a prompt like “Once upon a time,” and it can weave a narrative for you.

Applications of Transformers:

  1. Language Translation: Real-time translations are now possible.
  2. Text Summarization: Able to process extensive articles and churn out coherent, concise summaries that retain the main points.
  3. Question Answering: Given a text corpus or passage, the system identifies and extracts the segment that answers a posed question.
    Example: For the passage “Elephants are the largest land animals on Earth and they primarily eat grasses, fruits, and bark.” A question “What do elephants eat?” might yield the answer “grasses, fruits, and bark.”
  4. Sentiment Analysis: Analyzing a given text can determine its emotional tone, like positive, negative, or neutral. Commercial enterprises can potentially discern customer sentiments from reviews, thereby refining their offerings.
  5. Named Entity Recognition (NER): Detect and classify entities in text into predefined categories such as names of persons, organizations, locations, quantities, monetary values, etc.
    Example: In the sentence “Elon Musk is the CEO of Tesla”, “Tesla” might be tagged as an ‘organization’, “Elon Musk” as a ‘person’, and “CEO” as an ‘occupation’.

What’s next?

As mentioned earlier, what makes the Transformer so transformative is its highly modular architecture. After the introduction of the transformer architecture, the NLP community witnessed a series of innovations. Some of the prominent ones include:

  • T5 (Text-to-Text Transfer Transformer): Introduced by Google Research, T5 views every NLP problem as a text-to-text problem. Instead of having different model architectures for different tasks, T5 adopts a unified text-to-text framework where every NLP task (be it translation, summarization, question answering, etc.) is cast as a text-to-text problem.
  • XLNet: As a generalized autoregressive (AR) pretraining model, XLNet combines the best of BERT and autoregressive models like GPT by using a permutation-based training strategy to learn bidirectional context, addressing some of BERT’s pretraining-finetuning discrepancy.
  • RoBERTa (Robustly optimized BERT approach): Reintroduced by Facebook AI, RoBERTa rethinks BERT’s training approach and makes changes like removing the next sentence prediction objective, training with much larger mini-batches and learning rates, and using more data. As a result, it often outperforms BERT in several benchmarks.
  • ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately): Instead of the traditional masked language modeling approach used by BERT, ELECTRA trains a discriminator to tell if a token in a sentence is original or if it’s been replaced by a generator (another small model). This way, it leverages all the tokens in the input for prediction rather than just the masked ones, making pretraining more efficient.

Conclusion

In conclusion, transformers have reshaped the landscape of NLP. As research progresses, we can only anticipate even more groundbreaking applications to this already revolutionary model.

Thanks so much for reading my article!!! Feel free to drop me any comments, suggestions, and follow me on LinkedIn!


Jed Lee

Passionate about AI & NLP. Based in Singapore. Currently a Data Scientist at PatSnap.