The Good, The Bad, and the Ugly of pd.get_dummies | by Adam Ross Nelson | Jul, 2023

This is for the pd.get_dummies diehards

Howdy people

Okay, I get it. One of the best ways to convert a categorical to an array of dummies in Python is with Pandas' pd.get_dummies(). Why would you take the time to import OneHotEncoder from sklearn, execute a .fit_transform(), etc., etc., etc.? Talk about tedious!

This article will first introduce a simple data set for demonstration purposes, in which the testing set contains categoricals not found in the training set. Then, it will demonstrate how using pd.get_dummies() can lead to problems with the demonstration data. And, lastly, it will show how to avoid that problem with sklearn's OneHotEncoder.

Here we have a simple dataset that includes a categorical feature called OS. The OS column lists computer operating systems. We will use this fictional data for purposes of demonstration. train_df holds the fictional demonstration training data, while test_df holds the fictional demonstration testing data.

In our fictional demonstration case, the testing set contains categorical values not present in the training set. This mismatch will cause problems.

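import pandas as pd

# Training data: three distinct operating systems
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                                'Linux', 'Windows', 'MacOS']})
# Testing data: includes categories absent from the training data
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS',
                               'Android', 'Unix', 'iOS']})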

In our training data, we have three operating systems: Windows, MacOS, and Linux. But in our testing data, we have additional categories, including Android, Unix, and iOS.
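One quick way to confirm the mismatch is to compare the sets of category values in the two frames. The snippet below is a minimal sketch, not part of the original walkthrough: the variable names unseen_in_train and missing_from_test are made up for illustration, and it assumes the test list holds five separate values with 'Unix' and 'iOS' as distinct entries.

```python
import pandas as pd

# Same demonstration data as above ('Unix' and 'iOS' as separate values)
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Android', 'Unix', 'iOS']})

# Distinct category values in each frame
train_categories = set(train_df['OS'].unique())
test_categories = set(test_df['OS'].unique())

# Hypothetical helper names: categories that appear on only one side
unseen_in_train = sorted(test_categories - train_categories)
missing_from_test = sorted(train_categories - test_categories)

print(unseen_in_train)    # categories the training data never saw
print(missing_from_test)  # categories the testing data lacks
```

A check like this is cheap to run before encoding and makes the problem visible before any model is fit.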

A model fit on pd.get_dummies(train_df) will not work with testing data from pd.get_dummies(test_df). The resulting columns don't match.
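The column mismatch can also be checked programmatically rather than by eye. The following is a rough sketch under the same assumptions about the demonstration data; the names train_dummies, test_dummies, and columns_match are illustrative and do not come from the original post.

```python
import pandas as pd

# Same demonstration data as above ('Unix' and 'iOS' as separate values)
train_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Linux', 'Windows', 'MacOS']})
test_df = pd.DataFrame({'OS': ['Windows', 'MacOS', 'Android', 'Unix', 'iOS']})

# Encode each frame independently, as a pd.get_dummies() diehard might
train_dummies = pd.get_dummies(train_df, columns=['OS'])
test_dummies = pd.get_dummies(test_df, columns=['OS'])

# Each call only knows the categories in the frame it was given,
# so the two results end up with different columns
train_cols = list(train_dummies.columns)
test_cols = list(test_dummies.columns)
columns_match = train_cols == test_cols

print(train_cols)
print(test_cols)
print(columns_match)
```

Because pd.get_dummies() keeps no memory of the categories it saw in an earlier call, nothing ties the testing columns to the training columns.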

When applying the pd.get_dummies() function to both our training and testing datasets, here is what you'll get.