[Code in Python] Principal Component Analysis — Using sklearn & pca Library

Jed Lee
6 min read · Mar 1, 2022


Let us explore using 2 Python Libraries to apply PCA.

Sunset captured at Edge Sky Deck, Hudson Yards. Image by Author

Content of this Article

  1. Brief Introduction to Principal Component Analysis
  2. Intuition behind Principal Component Analysis
  3. Code in Python: sklearn Library
  4. Code in Python: pca Library
  5. Conclusion

Introduction

Principal Component Analysis (PCA) is an exploratory approach for reducing a data set's dimensionality, used in data preprocessing and/or exploratory data analysis.

PCA is a great tool for transforming a large dataset with many variables into a smaller one through dimensionality reduction, with the intention that the lower-dimensional space still captures as much of the dynamics of the original space as possible.

Most importantly, you should still apply business/common sense when selecting features instead of leaving that decision entirely to PCA.

This article is a continuation of my previous article that gives a High-Level Overview of PCA.

Intuition behind Principal Component Analysis

Principal Component Analysis is a linear transformation of a dataset that defines a new coordinate system such that:

  • The first axis captures the highest variance of any projection of the data set.
  • The second axis captures the second highest variance, and so on.

I will be using a dataset from Kaggle on Pizza for this analysis! This data set contains measurements that capture the kind of things that make a pizza tasty. Yums~

A quick peek at the dataset reveals a total of 7 different variables ranging from mois to cal. (Fyi: mois = Amount of water per 100 grams in the sample)

Image by Author
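Since the peek itself was shown only as an image, here is a minimal sketch of how it could look with pandas; the file name Pizza.csv is an assumption based on the Kaggle download.

import pandas as pd

# Load the pizza dataset (file name assumed from the Kaggle download)
df = pd.read_csv("Pizza.csv")

# Peek at the first few rows and the overall shape
print(df.head())
print(df.shape)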

There seem to be quite a few variables that you have to consider, and intuitively, it is almost impossible to figure out which are the most important variables unless you are a Pizza expert!

In this case, we can apply PCA!

Let me preface this by saying that, in this particular scenario, I am treating 7 variables as a large number. In reality, 7 variables is not that many; PCA is generally employed when dealing with 20, 30, 50 or more variables.

First and foremost, you want to check the object type of your variables!

Image by Author
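The check itself is a one-liner, assuming the dataframe is named df as in the sketch above:

# Inspect the data type of every column; PCA needs numeric, continuous features
print(df.dtypes)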

The purpose of this check is to make sure all of your variables are continuous. PCA only works with continuous variables, so you have to make sure you do not have any categorical variables like race, sex, age group, educational level, etc.

Secondly, do check for outliers in your dataset. PCA is not robust to outliers, and it will be biased in datasets with strong outliers.

Python Library: sklearn

The sklearn library contains a lot of efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction.

To apply PCA, you have to import PCA from sklearn.decomposition
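In code, that import is simply:

from sklearn.decomposition import PCA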

Let us start with some preprocessing of the data.

The first preprocessing step is to divide the dataset into a feature set and corresponding labels. Since PCA depends only upon the feature set and not the label data, PCA can be considered as an unsupervised machine learning technique. The following script performs this task:

Image by Author
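Since the script was shown as an image, here is one possible reconstruction; the column names brand and id are assumptions about this Kaggle dataset.

# Split the dataframe into a feature set (df2) and the corresponding labels (df1_label);
# 'brand' and 'id' are assumed column names for this Kaggle dataset
df1_label = df["brand"]
df2 = df.drop(columns=["brand", "id"])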

The script above stores the feature set in the df2 dataframe and the corresponding labels in the df1_label series.

Do note that PCA performs worse when the features are less correlated: if the features are not very correlated, the eigenvalues of the principal components will be lower. You can still use PCA with highly uncorrelated features, but your scree plot will not show the usual elbow.

Image by Author
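A heatmap like the one above can be drawn along these lines with seaborn (a sketch; the original figure may have been produced differently):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the 7 continuous features, drawn as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(df2.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()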

From the Correlation Heatmap above, you can observe that our features are correlated to one another to differing degrees.

Next, we have to normalize our features. PCA is sensitive to unscaled data. PCA is a dimensionality reduction technique based on variance: if features are unscaled, those with higher magnitudes may have a higher variance, and PCA may end up giving them more importance. We will perform standard scaler normalization to normalize our feature set.
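A minimal sketch of that scaling step, using the feature dataframe df2 from above:

from sklearn.preprocessing import StandardScaler

# Standardise every feature to zero mean and unit variance before applying PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df2)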

Performing PCA using sklearn library is a two-step process:

  1. Initialize the PCA class by passing the number of components to the constructor.
  2. Call the fit and then transform methods by passing the feature set to these methods. The transform method returns the specified number of principal components.

In the code below, we create a PCA object named sklearn_pca. You can choose to specify the number of components in the constructor. If left empty, it will consider all of the features in the feature set.
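A sketch of those two steps, using the scaled feature matrix X_scaled from the previous step:

# Step 1: initialise PCA; leaving n_components unset keeps all 7 components
sklearn_pca = PCA()

# Step 2: fit on the scaled features, then transform them into principal components
sklearn_pca.fit(X_scaled)
principal_components = sklearn_pca.transform(X_scaled)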

The PCA class exposes explained_variance_ratio_, which returns the fraction of the variance explained by each of the principal components.

A limitation of the sklearn library is that it does not attach feature names to each component's explained variance. The following code can easily do that for you.

Image by Author
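The original script was shown only as an image; one possible reconstruction, which labels each component by the feature with the largest absolute loading on it, is:

import numpy as np

# Pair each principal component's explained variance with the feature that has the
# largest absolute loading on it (one possible reconstruction of the screenshotted script)
feature_names = df2.columns
for i, (comp, ratio) in enumerate(
        zip(sklearn_pca.components_, sklearn_pca.explained_variance_ratio_), start=1):
    dominant = feature_names[np.argmax(np.abs(comp))]
    print(f"PC{i}: dominated by {dominant}, explains {ratio:.1%} of the variance")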

It can be seen that the first principal component, which is dominated by carb, is responsible for 59.6% of the variance. The second principal component, which is dominated by mois, is responsible for 32.7% of the variance in the dataset. Collectively, (59.6 + 32.7) = 92.3% of the classification information contained in the feature set is captured by the first two principal components.

Now, let us plot out the Cumulative Curve of Explained Variance.

Image by Author
Image by Author
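A sketch of how such a cumulative curve can be plotted (the original figure may have been produced differently):

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across the principal components
cum_var = np.cumsum(sklearn_pca.explained_variance_ratio_)

plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.show()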

From the plot above, you can see that after the second or third principal component, the change in variance becomes insignificant. Thus, keeping the first 2 or 3 principal components will suffice.

Python Library: pca

This is a relatively new Python library, first released in early 2020 by Erdogan Taskesen, that performs Principal Component Analysis and makes insightful plots.

To apply PCA, you have to import pca from pca
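That is:

from pca import pca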

The code below is all that is needed to derive the principal components using the pca package. In addition, this package incorporates statistical tests to detect outliers across the multi-dimensional space of PCA.

Image by Author
Image by Author
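Since the code was shown as images, here is a sketch of what it might look like; the exact arguments in the original may differ, and normalize=True is an assumption (the model could also be given pre-scaled data instead).

from pca import pca

# Initialise the model; n_components=0.95 keeps enough components to explain 95% of the variance
model = pca(n_components=0.95, normalize=True)

# Fit on the feature dataframe; column and row labels are picked up from df2
results = model.fit_transform(df2)

# results is a dictionary holding, among other things, the transformed data,
# the loadings and the top feature per component
print(results['topfeat'])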

From the resulting output, we can conclude that the first 2 Principal Components can capture 95.0% of the explained variance as we wanted.

Another feature of the pca package is plotting the loadings. You can visually plot the cumulative explained variance as well as biplots and scatter plots.

Image by Author
Image by Author
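A sketch of the corresponding calls on the fitted model from above:

# Cumulative explained variance plot
model.plot()

# Biplot: samples projected onto the first two PCs together with the feature loadings
model.biplot()

# Plain scatter of the samples in PC space
model.scatter()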

This is an amazing package, and do read the PyPI page for more functions such as detecting and/or selecting outliers. The core of pca is built on sklearn functionality for maximum compatibility when combined with other packages. However, this package can do a lot more: besides regular PCA, it can also perform SparsePCA and TruncatedSVD, and depending on your input data, the best approach will be chosen.

Conclusion

There you have it! Using the 2 Python libraries above, you managed to transform a large dataset with many variables into a smaller one, with the intention that the lower-dimensional space still captures as much of the dynamics of the original space as possible.

You might observe that the results of the two packages differ slightly; my best guess is that this comes down to how the pca library handles outliers while the sklearn library does not.

I hope this is informative and here is the Github Code that I used for this article.


Written by Jed Lee

Passionate about AI & NLP. Based in Singapore. Currently a Data Scientist at PatSnap.
