How to Identify and Handle Outliers?
Content of this Article
- Brief Introduction to Outliers
- Effects of Outliers
- Causes of different types of Outliers
- Univariate and Multivariate Outliers
- How do we identify Outliers?
- How do we treat Outliers? (Covered in Part 2 — Link Here)
- Conclusion
Brief Introduction to Outliers
I like to compare the process of working with data to a chef preparing a dish. From selecting the necessary ingredients to preparing them, a lot of work has to be done before the chef can turn on the stove and start cooking.
The same goes for working with data. As much as we want to jump straight into feeding that data into our favourite forecasting or machine learning algorithm, it is crucial to make sure the data are first filtered and cleaned.
One of the most important preparation steps is dealing with outliers.
Definition
Outliers are data values that are unusually large or small compared to the other values of the same construct (in a random sample of a population).
In this article, let me walk you through each step of understanding outliers and learning to deal with them using Python.
Effects of Outliers
As we make use of data to draw inferences and make conclusions, many of the statistical tools used to analyze that data are very sensitive to the presence of outliers. If we ignore the outliers or use the wrong statistical tools, we may end up drawing the wrong conclusions.
Here is one example of what an outlier can do. Suppose you have a small dataset — [10, 7, 11, 9, 5, 16, 18, 21, 13, 15, 15, 101]. A quick look at it and it is obvious that ‘101’ is the outlier.
From this simple example, a single outlier can badly distort your dataset’s mean, variance, and standard deviation, leading to erroneous inferences. Do note that this is a somewhat exaggerated example.
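To see the effect concretely, here is a quick check, using only Python's standard library, of how much the single value 101 shifts the summary statistics:

```python
import statistics

data = [10, 7, 11, 9, 5, 16, 18, 21, 13, 15, 15, 101]
trimmed = [x for x in data if x != 101]  # the same data without the outlier

mean_with, mean_without = statistics.mean(data), statistics.mean(trimmed)
stdev_with, stdev_without = statistics.stdev(data), statistics.stdev(trimmed)

print(round(mean_with, 2), round(mean_without, 2))    # 20.08 vs 12.73
print(round(stdev_with, 2), round(stdev_without, 2))  # the outlier inflates the spread ~5x
```

One value out of twelve is enough to pull the mean up by more than 50% and to inflate the standard deviation roughly fivefold.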
While it is important for us to understand how to handle outliers, not all outliers have to be handled or treated the same way, and that has got to do with what caused the outliers in the first place.
Causes of different types of Outliers
There are 3 main causes of Outliers:
Systematic Error
Systematic errors lie at a distance from the other data points because they are the result of inaccuracies, typically measurement or data-entry errors.
For example, suppose you are supposed to record a US$1 million sale you just closed. However, you accidentally enter an extra ‘0’, resulting in an outlier value of $10 million.
Sampling Error
When you draw a random sample of a population for your study, your sampling process might have included a participant or a data entry that actually belongs to a different population.
For example, you are conducting a study on Bone Density Growth. In the process, you discovered that a patient with an outlier bone density growth had diabetes, which affected his/her bone health.
Natural Variation / Random Outliers
Unlike the above 2 cases, outliers produced by natural variation are not necessarily a problem. These outliers are not caused by any error; they are simply unlikely values. If your sample size is large enough, you are bound to obtain some unusual observations. In a normal distribution, roughly 1 in 370 observations will be at least three standard deviations away from the mean.
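That figure can be verified directly with the standard library's NormalDist:

```python
from statistics import NormalDist

# Two-sided probability of landing at least 3 standard deviations from the mean
p = 2 * (1 - NormalDist().cdf(3))
print(round(1 / p))  # roughly 1 in 370 observations
```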
For example, let’s look at IQ scores. The intelligence quotient (IQ) is a measure of human intelligence. Most people (about 68 percent) have an IQ between 85 and 115. However, a tiny fraction of the population, about 0.00015%, has an IQ above 170.
While such scores are very unlikely, they are still possible, and they remain a normal part of the data distribution.
Univariate and Multivariate Outliers
One must distinguish between univariate and multivariate outliers.
Univariate outliers are unusually large or small values in the distribution of a specific variable, whereas multivariate outliers are a combination of values in an observation that is unusual.
Robert Wadlow was the tallest person in recorded history, standing at 8 ft 11 in (2.72 meters). Given his unusually extreme value on a single dimension (his height), he can be considered a univariate outlier.
A multivariate outlier could be an observation of a person with a height of 2 meters but a weight of only 40 kg. Values that become surprising only when several dimensions are taken into account are called multivariate outliers.
For the rest of the article, I will be focusing on Univariate Outliers. I strongly recommend reading Sergen Cansiz's article, which explains succinctly how to detect and deal with Multivariate Outliers.
How do we identify Outliers?
I will be using the Pima Indians Diabetes dataset from Kaggle for this outlier analysis! The dataset is used to predict the onset of diabetes based on diagnostic measures; I will be working specifically with the ‘BMI’ variable.
Visualizing Outliers
A first and essential step in detecting univariate outliers is the visualization of their distribution. I will introduce 2 visualization plots that are most commonly used to identify outliers.
1. Box and Whisker Plot (Box Plot)
The Box and Whisker Plot, introduced by John Tukey in the 1970s, divides the data into sections that each contain approximately 25% of the values, extended by whiskers. In the common convention, the whiskers reach the most extreme data points that still lie within 1.5 × IQR of the box, and points beyond the whiskers are drawn individually.
Box plots are useful as they provide a visual summary of the data, enabling us to quickly identify the median, the dispersion of the data set, signs of skewness, and outliers that lie beyond the whiskers.
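As a sketch of how such a plot could be produced with matplotlib (the BMI values below are made-up stand-ins for the Pima data, chosen only to mimic its shape):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs in scripts
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values for the Pima 'BMI' column
bmi = pd.Series([0, 0, 18.2, 22.5, 25.1, 27.8, 30.4, 33.6, 36.8, 52.3, 80.0])

fig, ax = plt.subplots()
parts = ax.boxplot(bmi)  # whiskers at 1.5 * IQR by default
ax.set_ylabel("BMI")
fig.savefig("bmi_boxplot.png")
plt.close(fig)

# Points beyond the whiskers are returned as 'fliers'
print(parts["fliers"][0].get_ydata())
```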
2. Scatter Plot
A scatter plot uses dots to represent values for two numeric variables and is useful for identifying patterns. Data points that do not fit the pattern or lie far from the main cluster are likely to be outliers.
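A scatter plot of each observation's BMI against its row index (again using made-up stand-in values rather than the real Pima data) makes the stray points easy to spot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in values for the Pima 'BMI' column
bmi = pd.Series([0, 0, 18.2, 22.5, 25.1, 27.8, 30.4, 33.6, 36.8, 52.3, 80.0])

fig, ax = plt.subplots()
ax.scatter(bmi.index, bmi)
ax.set_xlabel("Observation index")
ax.set_ylabel("BMI")
fig.savefig("bmi_scatter.png")
plt.close(fig)
```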
Looking at the 2 plots above, you should be able to identify 2 types of outliers. First, we have outliers caused by Systematic Errors: a couple of records have a BMI of 0, which is not physically possible. Second, we have outliers due to natural variation: a BMI above 50, while very unlikely, is still possible, and falls into the category considered morbidly obese.
Statistical Approach
After visualizing the distribution of your data, we can employ statistical approaches to pin down the outliers. I will introduce the 3 most commonly used.
1. Tukey’s Box-and-Whisker Plot (aka Box-Plot)
Besides its visual benefits, the box plot provides useful statistics to identify outliers. Tukey distinguishes between possible and probable outliers. The inner fences lie 1.5 × IQR below the first quartile and above the third quartile, and the outer fences lie 3 × IQR beyond the quartiles. A value between the inner and outer fences is a possible outlier, whereas a value falling outside the outer fences is a probable outlier. Removing all possible and probable outliers is referred to as the Interquartile (IQ) method, while Tukey’s method discards only the probable outliers.
The Python Code below takes in the dataset and the column name that you are detecting for outliers. The first code chunk detects possible outliers while the second code chunk detects probable outliers.
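The original code chunks are not reproduced here, so below is a minimal sketch of the same idea, assuming a pandas DataFrame with the column of interest (the function name is my own); k=1.5 gives the inner fences (possible outliers) and k=3.0 the outer fences (probable outliers):

```python
import pandas as pd

def tukey_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Return the rows whose value in `col` falls outside the Tukey fences."""
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[col] < lower) | (df[col] > upper)]

# possible = tukey_outliers(df, "BMI", k=1.5)   # outside the inner fences
# probable = tukey_outliers(df, "BMI", k=3.0)   # outside the outer fences
```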
The great advantage of Tukey’s box plot method is that the inner and outer fences are themselves robust to outliers: detecting one outlier does not depend on the values of the others. Furthermore, this method does not require a normal distribution of the data, which is often not guaranteed in real-life situations.
If you have a distribution that is highly skewed, the Tukey method can be extended to the Interquartile Range with Log-Normal Distribution method (aka log-IQ method), where each value is transformed to its logarithm before calculating the inner and outer fences.
2. Z-Score (aka Internally Studentized Residuals)
Another commonly used method for detecting univariate outliers is the Z-Score, also known as internally studentized residuals.
Data values are considered to be outliers whenever they are more extreme than the mean plus or minus the standard deviation multiplied by a constant, where this constant is usually 3, or 3.29 (Tabachnick & Fidell, 2013). These cutoffs are based on the fact that when the data are normally distributed, 99.7% of the observations fall within 3 standard deviations around the mean, and 99.9% fall within 3.29 standard deviations.
Similarly, the Python Code below takes in the dataset and the column name that you are detecting for outliers.
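Again, the original code is not shown, so here is a minimal sketch of the idea (the function name and pandas setup are my own):

```python
import pandas as pd

def zscore_outliers(df: pd.DataFrame, col: str, threshold: float = 3.0) -> pd.DataFrame:
    """Return the rows whose value in `col` lies more than `threshold`
    standard deviations from the mean (3 or 3.29 are common cutoffs)."""
    z = (df[col] - df[col].mean()) / df[col].std()
    return df[z.abs() > threshold]
```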
However, this method is highly limited as the mean and standard deviation are sensitive to outliers. This means that finding one outlier is dependent on other outliers as every observation directly affects the mean. Moreover, the z-score method assumes the variable of interest to be normally distributed.
3. Robust Z-Score (aka Median Absolute Deviation Method)
The Robust Z-Score method is also called the Median Absolute Deviation (MAD) method. It replaces the mean and standard deviation with more robust statistics: the median and the median absolute deviation.
Robust Zᵢ = 0.6745 × (xᵢ − median(x)) / MAD, where MAD = median(|xᵢ − median(x)|), xᵢ is each value in the variable column, and 0.6745 is the constant that makes the MAD consistent with the standard deviation for normally distributed data.
Since the mean and standard deviation are heavily influenced by outliers, Leys et al. (2013) recommend this method, as the MAD is calculated from deviations around the median.
This makes the method far more robust to outliers, and unlike the Z-Score it does not require the variable to be normally distributed, although the consistency constant is calibrated against the normal distribution.
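A sketch of the same idea in pandas (function name mine):

```python
import pandas as pd

def robust_zscore_outliers(df: pd.DataFrame, col: str, threshold: float = 3.0) -> pd.DataFrame:
    """Return the rows whose robust z-score in `col` exceeds `threshold`."""
    median = df[col].median()
    mad = (df[col] - median).abs().median()       # median absolute deviation
    robust_z = 0.6745 * (df[col] - median) / mad  # robust z-score
    return df[robust_z.abs() > threshold]
```

Note that a single extreme value barely moves the median or the MAD, which is why this method stays stable where the plain Z-Score breaks down.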
How do we treat Outliers?
I will cover this in the second part of my Medium Article. It will be a heavier topic, as I will extend beyond treating outliers to dealing with missing data. This is directly relevant to our Pima Indians Diabetes dataset as we consider how to handle records with a BMI of 0.
Conclusion
Outliers are not discussed often in testing, but, depending on your business and the metric you’re optimizing, they could affect your results.
One or two high values in a small sample can totally skew a test, leading you to make a decision based on faulty data.
There are different ways to detect univariate outliers as discussed. Tukey’s Box-and-Whisker Plot offers robust results and can be easily extended when the data is highly skewed. The Z-Score method needs to be applied critically due to its sensitivity to mean and standard deviation and its assumption of a normally distributed variable. The Robust Z-Score (MAD) method is often used instead and serves as a more robust alternative.
Most importantly, to decide on the right approach for your own data, closely examine your variables’ distribution, and use your domain knowledge.
I hope you enjoyed this article; do check out the second part as well. Here is the GitHub file that I used for this article.