Bayesian & Laplace Smoothing — Applications in Modern Machine Learning
Dealing with Sparse Data & Uncertainty
Contents of this Article
- Introduction
- Why Bother With Smoothing?
- Bayesian Smoothing: A Guided Recalibration
- Laplace Smoothing: Guarding Against Zeroes
- Conclusion
Introduction
In the world of machine learning and data-driven decision-making, we often find ourselves dealing with uncertainty. Sometimes we have too little data to be confident in our predictions. Other times, the data seems skewed because a few records behave very differently from the rest. Both situations are common, and collecting all the data you would ideally need is expensive and time-consuming.
That is why smoothing techniques step in as guardrails, helping ensure our models are not misled by sparse or skewed evidence.
Smoothing, at its core, is about making our models more resilient. Instead of taking raw frequencies or probabilities at face value, we gently adjust them — “smooth” them — so they better reflect a more realistic underlying pattern.
Think of it like baking bread: you knead the dough (data) to work out lumps and air pockets (noise), resulting in an even, well-risen loaf (model).
Why Bother With Smoothing?
We have all had that moment where we try to learn something new but our sample of experiences is too small to draw any firm conclusions. In technical terms, this is often called the “sparse data” problem. If you base a strong opinion on just a few observations, you risk overfitting your worldview. A common rule of thumb, borrowed from the central limit theorem, is that a sample needs at least 30 observations before it can be considered “sufficiently large.”
For instance, imagine you are working for a large online marketplace. You want to predict how likely a newly launched product is to sell well. But there is an issue: this product is so new that you have only a handful of transactions. Do you really want to trust the numbers at face value? If after 5 purchases, all 5 customers were delighted, can you confidently say there is a 100% satisfaction rate? It seems too good to be true, right?
Smoothing techniques help address this. By borrowing ‘insights’ from broader distributions or historical averages, smoothing prevents overconfidence in tiny samples and protects against extreme conclusions. In machine learning setups — such as recommender systems, rating predictions, or probability estimations — these techniques keep our models on a steady track and prevent overfitting due to limited data.
Bayesian Smoothing: A Guided Recalibration
Let’s understand Bayesian smoothing through a scenario. Imagine you run a popular food delivery platform. Your job: decide which delivery person to assign to the next incoming order. One piece of data you consider is the “acceptance rate” — how often a delivery person accepts and completes an order.
- Delivery Person A: Completed 3 out of 3 orders (100% success rate).
- Delivery Person B: Completed 88 out of 90 orders (~97.8% success rate).
Naively, Delivery Person A looks more reliable. But a tiny sample (3 orders) is not as trustworthy as a large one (90 orders). Bayesian smoothing incorporates a “prior belief” about the average acceptance rate across the entire fleet of delivery partners. If the global average completion rate is around 90%, Person A’s 3/3 success rate gets nudged down (penalized) from 100% to something more like 92%, while Person B’s more established record is “smoothed” far less and would stay near 97.8%. In this way, Bayesian smoothing prevents us from overvaluing a tiny sample and makes our model’s estimates more reasonable.
How to utilize Bayesian smoothing:
- Calculate a posterior probability by combining the observed data (e.g., number of completed vs. total orders) with a prior (e.g., historical or global averages), as sketched in the example after this list.
- Choose a reasonable prior that fits your domain knowledge. The goal is to balance the observed data with what you already “believe” is typical, ensuring that small samples are not overly influential.
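As a rough illustration, here is a minimal Python sketch of the delivery-platform example above, using the posterior mean of a Beta prior. The prior strength (how many “pseudo-orders” the prior counts for, set to 10 here) is an assumption chosen purely for illustration; in practice you would pick it from domain knowledge or tuning.

```python
def bayesian_smooth(successes, total, prior_mean, prior_strength):
    """Posterior mean of a success rate under a Beta prior.

    The prior behaves like `prior_strength` pseudo-observations
    whose success rate equals `prior_mean`.
    """
    return (successes + prior_strength * prior_mean) / (total + prior_strength)

# Prior belief: the fleet-wide average completion rate is about 90%.
PRIOR_MEAN = 0.90
# Illustrative assumption: the prior is worth 10 pseudo-orders.
PRIOR_STRENGTH = 10

# Delivery Person A: 3 out of 3 orders completed.
print(bayesian_smooth(3, 3, PRIOR_MEAN, PRIOR_STRENGTH))    # ~0.92, not 1.00
# Delivery Person B: 88 out of 90 orders completed.
print(bayesian_smooth(88, 90, PRIOR_MEAN, PRIOR_STRENGTH))  # ~0.97, close to 0.978
```

With this weak prior, Person A drops from 100% to roughly 92%, while Person B’s large sample keeps the estimate close to the raw 97.8%.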
What About Categorical Data?
If you’re working with categorical data, target mean encoding with smoothing applies a similar principle to Bayesian smoothing. Instead of estimating probabilities, it transforms categorical variables into numerical features by combining the global target mean (a prior belief) with the category-specific mean (observed data).
The key difference is in the application: while Bayesian smoothing adjusts probabilities, target mean encoding is used to create smoother, less biased representations of categorical features for machine learning models.
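To make this concrete, here is a minimal sketch of smoothed target mean encoding with pandas. The toy data and the smoothing weight m (how many observations the global mean counts for) are assumptions made up for this example.

```python
import pandas as pd

# Toy data: product category and whether the item sold (the target).
df = pd.DataFrame({
    "category": ["toys", "toys", "toys", "books", "books", "games"],
    "sold":     [1,      1,      0,      1,       0,       1],
})

def smoothed_target_mean(df, cat_col, target_col, m=10):
    """Blend each category's target mean with the global target mean.

    m is the smoothing weight: the number of observations the global
    mean counts for, so rare categories get pulled toward it.
    """
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["sum", "count"])
    return (stats["sum"] + m * global_mean) / (stats["count"] + m)

encoding = smoothed_target_mean(df, "category", "sold", m=10)
df["category_encoded"] = df["category"].map(encoding)
print(df)
```

In practice you would fit the encoding on training folds only (for example via cross-validation) to avoid leaking the target into the features.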
Laplace Smoothing: Guarding Against Zeroes
Another common smoothing technique is Laplace smoothing (also known as additive smoothing). The idea is simple: rather than ever assigning a probability of zero to an unseen event, you “pretend” you have seen it occur a small number of times. This prevents your model from confidently concluding that something is impossible just because it has not appeared yet.
Example:
Suppose you are running an app store and analyzing user ratings for a new game. Your data shows the game has only a handful of user ratings, and all of them happen to be positive. You might be tempted to conclude that the probability of receiving a low rating is zero. But does this reflect reality? Of course not. You likely just have not observed a low rating yet.
This is where Laplace smoothing steps in. It assigns each possible event (in this case, every rating category) a small positive count, say 1, before observing the actual data. These pre-loaded values act as if you have already seen each category at least once, even if you have not. After adding these counts, you normalize the probabilities as usual. This small adjustment ensures that no event has a zero probability, acknowledging that while an event may not have occurred yet, it still could in the future.
The elegance of Laplace smoothing lies in its balance. The amount of smoothing depends on the value of the added constant, lambda (λ). A larger λ pulls probabilities closer to a uniform distribution, while a smaller λ allows the data to retain more influence. As you collect more data, these pre-loaded counts diminish in importance, and the actual data takes over.
For example, if you eventually collect thousands of ratings, the influence of λ becomes negligible. This dynamic ensures that Laplace smoothing is most impactful when data is sparse but gracefully fades into the background as the dataset grows.
Feel free to give this Stanford video on Laplace smoothing a watch!
How to utilize Laplace smoothing:
- Add a small constant λ (commonly 1) to each possible event count before calculating probabilities. Normalize the adjusted counts to get smoothed probabilities, as shown in the sketch after this list.
- When implementing Laplace smoothing, choose λ carefully: a larger λ moves probabilities closer to uniform, while a smaller λ retains the influence of observed data. Use it when you want to avoid assigning zero probabilities to unseen events, especially in sparse datasets.
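Here is a minimal sketch of the app-store rating example above. The five rating categories and the handful of observed ratings are made up for illustration.

```python
from collections import Counter

def laplace_smooth(counts, categories, lam=1.0):
    """Additive (Laplace) smoothing: add lam to every category's count,
    then normalize so the probabilities sum to 1."""
    total = sum(counts.get(c, 0) for c in categories) + lam * len(categories)
    return {c: (counts.get(c, 0) + lam) / total for c in categories}

# Five possible star ratings; only positive ratings observed so far.
categories = [1, 2, 3, 4, 5]
observed = Counter([5, 5, 4, 5, 4])  # 5 ratings, all 4- or 5-star

print(laplace_smooth(observed, categories, lam=1.0))
# A 1-star rating now gets probability (0 + 1) / 10 = 0.10 instead of 0,
# while a 5-star rating gets (3 + 1) / 10 = 0.40.
```

With thousands of ratings, the added λ counts barely move the probabilities, which is exactly the graceful fade-out described above.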
Conclusion
Bayesian and Laplace smoothing techniques are about doing justice to the uncertainty in your data.
Bayesian smoothing lets you incorporate well-reasoned priors to avoid overvaluing small samples. Laplace smoothing ensures you never proclaim something impossible just because you have not seen it yet.
Rather than painting a falsely confident picture, these methods inject a healthy dose of skepticism, acknowledging that reality is often richer and more varied than your current dataset suggests.