[Code in Python] Treating Outliers & Missing Data — Using the scipy and sklearn.impute Libraries
Exploring Winsorization, K-Nearest Neighbors, and Multiple Imputation techniques.
Content of this Article
- Brief Introduction to Outliers and Missing Data
- Classification of Missing Data
- Keep them
- Remove them
- Recode them — Winsorization & Imputation
- Conclusion
Brief Introduction to Outliers and Missing Data
We all know that data cleaning is one of the most time-consuming stages in the data analysis process. In my previous article on Working with Outliers, I gave an introduction to outliers, explaining in detail their effects, causes, types, and how we can identify them.
Now that you have identified the outliers in your data, how are you going to handle them? In this article, I explore how we can treat outliers, and I have extended the treatment to missing data as well, since in real-world data outliers and missing data often go hand in hand.
How you identify outliers and missing-data patterns, and how you impute them, will influence all further analysis. Let us take a look at the different strategies for dealing with them.
As before, I will be using the Pima Indians Diabetes dataset from Kaggle for this analysis! The dataset is used to predict the onset of diabetes based on diagnostic measurements.
Classification of Missing Data
Let us first understand the reason why data goes missing. There are primarily 3 classifications of missing data.
- Missing Completely At Random (MCAR): If the probability of being missing is the same for all cases, then the data are said to be missing completely at random. This implies that the causes of the missing data are unrelated to the data. An example of MCAR is a weighing scale that ran out of batteries. Some of the data will be missing simply due to bad luck.
- Missing At Random (MAR): If the probability of being missing is the same only within groups defined by the observed data, then the data are missing at random. MAR is a much broader class than MCAR. For example, when placed on a soft surface, a weighing scale may produce more missing values than when placed on a hard surface.
- Missing Not At Random (MNAR): If neither MCAR nor MAR holds, then we speak of missing not at random (MNAR). MNAR means that the probability of being missing varies for reasons that are unknown to us. For example, the weighing scale mechanism may wear out over time, producing more missing data as time progresses.
I have taken some excerpts from Stef van Buuren’s book on Flexible Imputation of Missing Data, which I strongly recommend reading. Stef van Buuren pioneered quantitative algorithms for “filling up the missing data” (imputation). He is the originator of the MICE algorithm for multiple imputation of multivariate data, and co-developed the mice package in R.
Keep Them
Keeping outliers is a good decision if they rightfully belong to the distribution of interest. Outlier values that are unlikely are still possible, and they should be kept when they are representative of that distribution.
However, if you cannot easily tell whether an extreme value belongs to the population of interest, be aware that keeping it will most likely distort the results of your actual task: e.g. leading to a rejection of the null hypothesis or an under/over-optimistic prediction.
In those cases, I would recommend looking at the next two strategies.
Remove Them
This is a very intuitive and straightforward strategy. Removing outliers is effective if the outliers corrupt the estimation of the distribution parameters.
However, the biggest issue with removing outliers is the loss of information. Looking at the nullity matrix of our Pima Indians Diabetes dataset (after converting 0 to NaN), we see that removing all the outliers (assuming they are the 0 values) would potentially discard a large number of observations from our dataset.
This is what happens when all univariate outliers are removed for each variable. When you are working with a dataset like this, it would not be wise to drop entire rows or columns just because they contain outliers or missing data.
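As a sketch of how that nullity matrix can be produced, the snippet below replaces the physiologically impossible zeros with NaN and plots the missingness pattern. The missingno package and the file path are assumptions on my part; the article's original notebook may do this differently.

```python
import numpy as np
import pandas as pd
import missingno as msno

# Load the Kaggle Pima Indians Diabetes dataset (file path is an assumption).
df = pd.read_csv("diabetes.csv")

# Zero is not a plausible reading for these columns, so treat it as missing.
cols_with_zeros = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Nullity matrix: white gaps mark the missing values in each column.
msno.matrix(df)
```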
Recode Them
In my opinion, and for the purpose of this article, the best treatment for outliers and missing data is to recode them.
Recoding outliers avoids the loss of a large amount of data. However, bear in mind that recoding should rely on reasonable and convincing arguments.
A common approach to recoding outliers is Winsorization (Tukey & McLaughlin, 1963), where all outliers are transformed to a value at a certain percentile of the data. Another approach is imputation. Imputation is a method that uses information and relationships among the non-missing predictors to replace outliers and missing data with estimates using other existing data.
Winsorization
With winsorization, all outliers are transformed to a value at a certain percentile of the data. Every observation below the k-th percentile is recoded to the value at the k-th percentile, and every observation above the (100-k)-th percentile is recoded to the value at the (100-k)-th percentile (generally k = 5).
SciPy has a winsorize function that is able to perform the above transformation for you. The function is scipy.stats.mstats.winsorize.
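A minimal sketch of the call on a toy array, so the effect is easy to see. The 10% limits are chosen only so that exactly one value is clipped at each end of this tiny sample; for real data you would use limits matching the k = 5 mentioned above.

```python
import numpy as np
from scipy.stats.mstats import winsorize

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Clip the lowest 10% and highest 10% of values to the nearest retained values.
winsorized = winsorize(data, limits=[0.1, 0.1])
print(winsorized)  # [2 2 3 4 5 6 7 8 9 9] -- the 1 and the 100 are pulled in
```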
Instead of entirely removing (or trimming away) the outliers, winsorization is an effective method of lessening the extreme effects of outliers and achieving a less skewed distribution.
Imputation
There are several imputation techniques. One common technique used is Common Value Imputation. It is very intuitive as it simply replaces outliers or missing data with common values like mean or median or mode.
However, in doing so, we might under- or overestimate the true values. In other words, we could be introducing bias into our dataset, which defeats the purpose of treating outliers and missing data in the first place.
For example, suppose a diabetic person did not want to reveal their weight, so the value of the BMI variable is missing for that person. If we impute it with the median of the variable, we could be underestimating that person's weight and thus introduce bias into our analysis.
Imputation does beg the question of how much missing data are too much to impute? Although not a general rule in any sense, 20% missing data within a column might be a good “line of dignity” to observe. Of course, this depends on the situation and the patterns of missing values in the training set.
— From Stef van Buuren’s book on Flexible Imputation of Missing Data
I will introduce two more robust alternatives to Common Value Imputation: K-Nearest Neighbors and Multiple Imputation.
K-Nearest Neighbors
KNN is an algorithm that matches a point with its k closest neighbors in a multi-dimensional space, where the similarity between points is defined by a distance metric.
I will be using the KNNImputer class in sklearn.impute. KNNImputer is a slightly modified application of the KNN algorithm: a missing numeric value is replaced by the (optionally distance-weighted) average of the corresponding values from its k nearest neighbors.
Different distance metrics suit numerical and categorical data. KNNImputer finds the nearest neighbors using the nan_euclidean distance metric, a Euclidean distance that ignores missing coordinates.
When using KNN, the most important parameters to consider are the number of neighbors k and the distance metric.
A low k will increase the influence of noise and make the results less generalizable. On the other hand, a high k will tend to blur the local effects that we are looking for. It is also recommended to use an odd k for binary classes to avoid ties.
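Putting that together, here is a minimal sketch of KNNImputer applied to the DataFrame from earlier (df with the zeros already recoded to NaN); the choice of k = 3 is illustrative, not the article's original setting.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# k = 3 (odd, to avoid ties) and uniform weights; the default metric is
# nan_euclidean, which ignores coordinates that are missing.
imputer = KNNImputer(n_neighbors=3, weights="uniform")

# fit_transform returns a NumPy array, so wrap it back into a DataFrame.
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```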
Now, let us visualise the results of using different numbers of neighbors:
In the plot above, I compared the different KNN imputations for the Insulin feature using probability density plots, as it has the highest number of missing values. The closer the imputed distribution comes to the original, the better the imputation. Here, it seems that having 2 or 3 neighbors is the best choice.
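A sketch of how such a comparison can be drawn, overlaying kernel density estimates of the Insulin feature for a few candidate values of k; the exact values of k and the plot styling are my assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import KNNImputer

fig, ax = plt.subplots(figsize=(10, 6))

# Density of the observed (non-missing) Insulin values as the reference.
df["Insulin"].dropna().plot(kind="kde", ax=ax, label="Original (observed only)")

# Overlay the density of the imputed feature for several choices of k.
for k in [2, 3, 5, 10]:
    imputed = KNNImputer(n_neighbors=k).fit_transform(df)
    pd.DataFrame(imputed, columns=df.columns)["Insulin"].plot(
        kind="kde", ax=ax, label=f"KNN imputed, k={k}"
    )

ax.set_xlabel("Insulin")
ax.legend()
plt.show()
```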
Limitations of using K-Nearest Neighbors
It is not as straightforward as it seems…
- The first drawback of this function is that it works only on numerical data. Since categorical data are mostly strings, they need to be encoded before imputing. You can still use KNNImputer on such data if you take only the nearest neighbor (k=1); using more than one neighbor will just produce a meaningless average.
- Since KNNImputer is a distance-based imputation method, it is very sensitive to the scale of the data, which often requires us to normalize the data first. This is especially true for datasets with large numerical values; otherwise, the differing scales will lead the imputer to generate biased replacements for the missing values. After imputing, you have to transform the data back to its original scale (see the sketch after this list).
- Determining the right number of neighbors and the right distance metric can be tricky. While this flexibility allows for customization, it can make the process a lot more complicated at times.
- KNN is termed a lazy algorithm, as it does not learn a discriminative function from the data but “memorizes” the training dataset instead, making it computationally expensive for datasets with a large number of variables/features.
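As noted in the second point above, scaling before imputation and un-scaling afterwards is straightforward to sketch; MinMaxScaler is just one reasonable choice here, not necessarily the one used in the article.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1] so no single column dominates the distances.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # NaNs are passed through untouched

# Impute on the scaled data, then map the values back to their original units.
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)
df_knn_scaled = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=df.columns
)
```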
Multiple Imputation
Another algorithm that is equally robust as K-Nearest Neighbors, if not more so, is Multiple Imputation by Chained Equations (MICE), formalized in Stef van Buuren's mice package (van Buuren & Groothuis-Oudshoorn, 2011). MICE imputes missing data through an iterative series of predictive models: in each iteration, each specified variable in the dataset is imputed using the other variables, and these iterations are run until convergence appears to have been reached.
However, that implementation is only available in R. It was not long before a similar implementation was born in Python: inspired by MICE, the fancyimpute package was developed for the same purpose, estimating each feature from all the others. It was then merged into scikit-learn and renamed sklearn.impute.IterativeImputer.
The user guide states that IterativeImputer can be used for multiple imputations by applying it repeatedly to the same dataset with different random seeds when sample_posterior=True.
While it is still in an experimental phase, IterativeImputer offers a variety of estimators, one of which, ExtraTreesRegressor, can replicate MissForest, an approach that is able to deal with both numeric and categorical missing values (Stekhoven and Buhlmann, 2012).
Here is how it works:
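Below is a minimal sketch of that loop using IterativeImputer with an ExtraTreesRegressor estimator; the hyperparameters are illustrative, not the article's original settings.

```python
import pandas as pd
# IterativeImputer is experimental, so this enabling import must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Each feature with missing values is regressed on the other features,
# round-robin, for up to max_iter rounds or until the imputations stabilise.
mf_imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    max_iter=10,
    random_state=0,
)
df_missforest = pd.DataFrame(mf_imputer.fit_transform(df), columns=df.columns)
```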
Do note that ExtraTreesRegressor cannot be used for multiple imputations, as it does not support return_std in its predict method. Since BayesianRidge and ExtraTreesRegressor yield the best results, I have included the BayesianRidge application, which can be used for multiple imputations, in my GitHub. If you leave the estimator set to None, IterativeImputer defaults to BayesianRidge.
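As a sketch of the multiple-imputation recipe from the user guide (BayesianRidge with sample_posterior=True, repeated under different seeds); the number of repetitions and the pooling step shown here are illustrative, not the code from my GitHub.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Draw several imputed datasets; sample_posterior=True adds the random noise
# that distinguishes multiple imputation from a single deterministic fill.
imputed_sets = []
for seed in range(5):
    imp = IterativeImputer(
        estimator=BayesianRidge(),
        sample_posterior=True,
        max_iter=10,
        random_state=seed,
    )
    imputed_sets.append(pd.DataFrame(imp.fit_transform(df), columns=df.columns))

# In practice you would fit your analysis on each dataset and pool the results;
# as a simple illustration, average the imputed Insulin column across the draws.
pooled_insulin = np.mean([d["Insulin"] for d in imputed_sets], axis=0)
```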
Limitations of using Multiple Imputation
Might be the best, only if you can run it…
- It is much more computationally heavy and expensive as compared to other imputation methods. Since this algorithm is running an iterative series of predictive models across all variables, it takes a lot more time and computational power to execute.
- It is still in an experimental phase and there are several limitations as I have just highlighted above. However, I believe by the time you read this article, there might be relevant changes/improvements that have been made.
Conclusion
Phew… this article ended up much longer than I thought it would be when I first started out. There you have it, you can either keep, delete or recode outliers or missing data. I shared a few recoding methods that I hope will come in useful as you handle your own outliers and missing data.
Compared to Common Value Imputation, both KNNImputer and IterativeImputer maintain the values and variability of your dataset, and they are much more precise than simply filling in common values.
However, Common Value Imputation continues to be widely used due to its simplicity and low computational power required to run those imputations, especially when you are handling a huge dataset.
Missing data is a problem that should be taken seriously. Always remember that a model is only as good as the data it was trained on. Above all, the treatment of outlying data values is a highly subjective task, as there is no mathematically right or wrong solution, so use your own best judgement.
Thanks so much for reading my article!!! Here is the Github File that I used for this article.