Encoder-Only Models: Workhorses of Practical Language Processing + ModernBERT Case Study

Jed Lee
7 min read · Jan 3, 2025


Why Encoder-Only Models Deserve More Attention in AI


Contents of this Article

  1. Introduction
  2. Reintroducing The Transformer Architecture
  3. Why Decoder Models Get All the Hype
  4. The Overlooked Prowess of Encoder-Only Models
  5. ModernBERT: An Encoder-Only Model Case Study
  6. Conclusion

Introduction

In the current AI landscape, particularly within natural language processing (NLP), decoder-only models like GPT-4 and LLaMA dominate the headlines with their remarkable generative capabilities. Everyone is hyped about how the next iteration of GPT will transcend its predecessors or finally achieve the elusive AGI.

However, overshadowed by this hype are encoder-only models, the workhorses that power countless practical applications behind the scenes.

This article explores the role of encoder-only models, compares them with their decoder counterparts, and highlights specific scenarios and applications where encoder-only models, as exemplified by ModernBERT, not only meet but exceed expectations.

Reintroducing The Transformer Architecture

At the heart of most modern NLP models lies the Transformer architecture, which comprises two primary components: encoders and decoders.

Transformer encoder-decoder architecture. Image from Sebastian Raschka, PhD

Encoders: The encoder processes the input data and builds a context for each word. Think of it as someone reading a book and understanding all of the words and the meaning behind them. Its job is to represent the input by converting it into rich, numerical embeddings.

Decoders: Conversely, a decoder acts like a storyteller, using the context provided by the encoder to generate coherent and contextually relevant output, whether that is a translated sentence, a summary, or an answer to a question about the book.
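
To make this division of labour concrete, here is a minimal sketch using Hugging Face transformers pipelines; bert-base-uncased and gpt2 are just familiar illustrative checkpoints, not recommendations:

```python
from transformers import pipeline

# Encoder-style model: reads the whole sentence and fills in the blank using
# context from BOTH sides of the [MASK] token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# Decoder-style model: continues the prompt left to right, one token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```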

Why Decoder Models Get All the Hype

Decoder-only models, such as GPT, LLaMA, and Claude, have garnered significant attention due to their remarkable ability to generate human-like content. Their generative prowess has opened up new application areas like AI-generated art and interactive chatbots, attracting significant investment and hype.

A layman user of these ‘new AI tools’ might naturally gravitate toward decoder-only models, viewing them as the cornerstone of all practical applications. For technical practitioners and engineers, however, this perspective can be misleading: a huge number of practical applications simply need a model that is lean, fast, and cost-effective, and that model does not have to be generative.

I will quote the ModernBERT blog post on decoder models here:

“More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most corporations to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.”

The Overlooked Prowess of Encoder-Only Models

Encoder-only models, like BERT and its successors, transform input data into numerical embeddings instead of generating text like decoder models.

You might say that instead of answering with text, an encoder model literally encodes its “answer” as a vector: a compressed, numerical representation of its input. This is why encoder-only models are sometimes referred to as representational models.
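
Here is a rough sketch of what that looks like in code, assuming the Hugging Face transformers library and simple mean pooling (production systems often use dedicated sentence-embedding models instead):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is used purely as a familiar example of an encoder-only model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Encoder models turn text into vectors.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-token embeddings into a single sentence-level vector.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768]) -- this vector is the model's "answer"
```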

Key Advantages of Encoder-Only Models over Decoder-Only Models:

  1. Bi-Directional Context Understanding: A decoder-only model is mathematically “not allowed” to “peek” at later tokens; it can only ever look backwards. Encoder models, by contrast, process the input in both directions, forward and backward, capturing the full context around each token (see the toy attention-mask sketch after this list). This bi-directional understanding makes them exceptionally effective for tasks like classification, retrieval, and entity extraction.
  2. Efficiency and Speed: Compared to their decoder counterparts, encoder-only models are typically smaller and faster. This allows them to run on more affordable hardware and handle high-volume, low-latency applications. For instance, content moderation systems can quickly scan millions of posts daily without needing expensive resources.
  3. Cost-Effectiveness: Running encoder models is generally more affordable, especially when deployed at scale. Their smaller size and efficiency translate to lower operational costs, which is crucial for businesses handling millions of inferences daily. For example, FineWeb-Edu needed to perform quality filtering on 15 trillion tokens. The team used a decoder-only model, Llama-3-70B-Instruct, to generate annotations, then performed the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, costing approximately $60,000 at Hugging Face’s rate of $10 per hour. In contrast, processing the same 15 trillion tokens with a decoder-only model like Google’s Gemini Flash, even at its low cost of $0.075 per million tokens, would exceed one million dollars!
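
To illustrate the first point, here is a toy sketch (not taken from any particular model's code) of the attention masks that enforce "no peeking" in a decoder versus full visibility in an encoder:

```python
import torch

seq_len = 5

# Decoder-style (causal) mask: token i may only attend to positions j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Encoder-style (bi-directional) mask: every token attends to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len)

print(causal_mask)
print(bidirectional_mask)
```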

Real-World Applications:

  • Retrieval Augmented Generation (RAG): RAG supplements an LLM with information relevant to the query, pulled from a document store. To know which documents are relevant, the system needs a model that is fast and cost-effective enough to embed both the query and vast amounts of documents, so that the right passages can be retrieved and handed to the decoder model for generation (see the retrieval sketch after this list). That model is very often an encoder-only model.
  • Content Moderation: Encoder models can quickly and accurately classify content, ensuring platforms remain safe and compliant without the overhead of larger generative models.
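
As a sketch of the retrieval step only (generation omitted), assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint as an illustrative encoder:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "ModernBERT extends the context window to 8,192 tokens.",
    "The Eiffel Tower is located in Paris.",
    "Encoder-only models produce embeddings rather than text.",
]
query = "What is the maximum sequence length of ModernBERT?"

# The encoder embeds both the query and the document store.
doc_embeddings = encoder.encode(documents, convert_to_tensor=True)
query_embedding = encoder.encode(query, convert_to_tensor=True)

# Cosine similarity picks the passage(s) handed to the generative model.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
print(documents[scores.argmax().item()])
```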

In essence, whenever you see a decoder-only model in deployment, there is a reasonable chance an encoder-only model is also part of the system. But the converse is not true.

ModernBERT: A Case Study in Encoder Excellence

Released in 2018, BERT is still widely used today. It is currently the second most downloaded model on the Hugging Face Hub, with more than 68 million monthly downloads. That is because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).

Building upon BERT’s foundation, a team from Answer.AI and LightOn introduced ModernBERT. Released in December 2024, ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy.

Key Summarised Features of ModernBERT

Enhanced Context Length:

  • ModernBERT extends the maximum sequence length from 512 to 8,192 tokens. This expanded context window allows the model to process longer documents and more complex queries, making it ideal for applications like large-scale code search and comprehensive document retrieval (a minimal loading sketch follows).
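
A minimal loading sketch, assuming a transformers release recent enough to include ModernBERT support; the repeated dummy text is only there to show the longer window being used:

```python
from transformers import AutoModel, AutoTokenizer

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

long_document = "word " * 5000  # far beyond the original BERT's 512-token limit
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=8192)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, up to 8192 tokens, hidden_size)
```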

Incorporation of Code Data:

  • Unlike traditional encoder models trained primarily on natural language, ModernBERT includes a substantial amount of code in its training data. This specialization enables it to excel in programming-related tasks, such as code similarity detection and AI-assisted development tools, opening new avenues for AI applications in software development.

Architectural Improvements over the Original Transformer Architecture:

  • Rotary Positional Embeddings (RoPE): Replace the original absolute positional embeddings, improving the model’s ability to track the position of tokens, especially in longer sequences (a toy sketch follows this list).
  • GeGLU Activation Layers: Improve performance and efficiency compared to the traditional GeLU activation function.
  • Alternating Attention Mechanism: Combines global and local attention to handle long sequences more efficiently, reducing computational overhead.
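
For intuition, here is a toy, single-head implementation of the RoPE idea (a simplified "rotate-half" formulation, not ModernBERT's actual code):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, decaying across feature pairs.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by a position-dependent angle; relative
    # positions then show up naturally in query-key dot products.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

queries = torch.randn(8, 64)   # 8 positions, 64 features for one attention head
print(rope(queries).shape)     # torch.Size([8, 64])
```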

Efficiency Optimizations:

  • Alternating Attention: Instead of full global attention, ModernBERT uses global attention every few layers and local attention in others, significantly speeding up processing for long inputs.
  • Unpadding and Sequence Packing: Eliminates wasted computation on padding tokens by removing them and efficiently packing sequences, resulting in a 10-20% speedup (a toy illustration follows this list).
  • Hardware-Aware Design: Optimized to run efficiently on consumer-grade GPUs like the NVIDIA RTX 4090, making it accessible for a wider range of applications without requiring specialized hardware.
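
The unpadding idea can be illustrated with a toy example (a conceptual sketch only, not ModernBERT's actual implementation):

```python
import torch

pad_id = 0
# Two sequences padded to the same length (toy token IDs).
batch = torch.tensor([
    [101, 2023, 2003, 102, 0, 0],  # 4 real tokens, 2 padding tokens
    [101, 7592, 102, 0, 0, 0],     # 3 real tokens, 3 padding tokens
])
attention_mask = batch != pad_id

# Unpadding: keep only the real tokens, packed into one flat sequence, plus
# the per-sequence lengths needed to recover the boundaries afterwards.
flat_tokens = batch[attention_mask]
seq_lengths = attention_mask.sum(dim=1)

print(flat_tokens)   # tensor([ 101, 2023, 2003,  102,  101, 7592,  102])
print(seq_lengths)   # tensor([4, 3])
```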

Training Process:

ModernBERT employs a three-phase training process to ensure robust performance across various tasks:

  • Phase 1: Trained on 1.7 trillion tokens with a sequence length of 1,024.
  • Phase 2: Adapted to longer contexts by training on 250 billion tokens with a sequence length of 8,192.
  • Phase 3: Fine-tuned with 50 billion tokens using a decaying learning rate to solidify performance across diverse tasks.
  • Weight Initialization Trick: The ModernBERT-large variant initializes its weights by tiling the base model’s weights, speeding up training and enhancing stability (a simplified sketch of the tiling idea follows).
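
A deliberately simplified sketch of the tiling idea; the real recipe is more involved, this just shows reusing smaller trained weights to seed a larger matrix:

```python
import torch

# Stand-in sizes; the real hidden sizes of ModernBERT-base/large are much larger.
base_hidden, large_hidden = 4, 6
base_weight = torch.randn(base_hidden, base_hidden)

# Tile the trained base weights and crop to the larger model's shape,
# instead of starting the larger model from random initialization.
large_weight = base_weight.tile((2, 2))[:large_hidden, :large_hidden]
print(large_weight.shape)  # torch.Size([6, 6])
```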

Size Variants:

It is available in two sizes: ModernBERT-base, with 22 layers and 149 million parameters, and ModernBERT-large, with 28 layers and 395 million parameters.

For a more in-depth description of ModernBERT, give the official ModernBERT blog post on Hugging Face (huggingface.co/blog/modernbert) a read, and check out the ModernBERT model cards on the Hugging Face Hub under the answerdotai organization.

Super amazing work and all kudos to them!

Conclusion

The buzz around generative AI has, to some extent, overshadowed the critical role of encoder-only models. Yet these are the workhorses of practical language processing, quietly powering countless workloads in scientific and commercial applications right now.

As the AI landscape evolves, a balanced appreciation and utilisation of both encoder and decoder models will be essential for the next stage of AI development.


Written by Jed Lee

Passionate about AI & NLP. Based in Singapore. Currently a Data Scientist at PatSnap.
