Handling Categorical Data in Python
Categorical data is data that takes only a limited set of values, known as categories or levels. It comes in two kinds: nominal and ordinal. Nominal data has no intrinsic order, such as colors, genders, or animal species. Ordinal data is naturally ranked or ordered, such as customer satisfaction levels or educational attainment.
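The nominal/ordinal distinction can be expressed directly in pandas. The following is a minimal sketch, assuming pandas is installed; the `low`/`medium`/`high` level names are illustrative and not taken from any particular dataset.

```python
import pandas as pd

# Nominal: the categories have no order
colors = pd.Series(["red", "blue", "green", "red"], dtype="category")

# Ordinal: the order is stated explicitly (level names are illustrative)
levels = ["low", "medium", "high"]
satisfaction = pd.Series(
    pd.Categorical(["medium", "low", "high", "medium"],
                   categories=levels, ordered=True)
)

print(colors.cat.ordered)        # False
print(satisfaction.cat.ordered)  # True

# Comparisons and min/max are only meaningful for ordered categories
print(satisfaction.min())
print((satisfaction > "low").tolist())
```

Marking a column as ordered lets pandas compare and sort values by rank instead of alphabetically.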
Setup
Install the required libraries for handling categorical data:
pip install pandas
pip install scikit-learn
pip install category_encoders
Categorical data is often represented as text labels, and many machine learning algorithms require numerical input data. Therefore, it is important to convert categorical data into a numerical format before feeding it to a machine learning algorithm. This process is known as encoding.
One-Hot Encoding
One-Hot Encoding creates a binary vector for each category in the dataset. The vector contains a 1 for the category it represents and 0s for all other categories:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Create a pandas DataFrame with categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})
print("Original data:")
print(df)
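# Create an instance of OneHotEncoder that returns a dense array
# (sparse_output replaced the older sparse argument in scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)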
# Fit and transform the DataFrame using the encoder
encoded_data = encoder.fit_transform(df)
# Convert the encoded data into a pandas DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())
print("\nOne-hot encoded data:")
print(encoded_df)
Original data:
   color
0    red
1   blue
2  green
3  green
4    red

One-hot encoded data:
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
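For quick exploration, pandas can produce the same columns without scikit-learn. This is a minimal sketch using `pd.get_dummies` on the same `color` data; it is an alternative shown for comparison, not a replacement for the encoder above.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "green", "red"]})

# One indicator column per distinct color; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df, columns=["color"], dtype=int)
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember which categories it saw, so a fitted encoder is usually the safer choice when training and test data are encoded separately.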
Ordinal Encoding
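Ordinal encoding assigns each category an integer based on its rank or order. This strategy is useful when categories have a natural ordering, like ratings (poor, fair, good, excellent). The example below reuses color labels only to show the mechanics; because no explicit mapping is passed, the integers in its output follow the order in which the categories first appear rather than any meaningful rank: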
import pandas as pd
import category_encoders as ce
# Create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
print("Original data:")
print(df)
# Initialize the encoder
encoder = ce.OrdinalEncoder()
# Encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'])
print("\nOrdinal encoded data:")
print(df)
Original data:
  category
0      red
1    green
2     blue
3      red
4    green

Ordinal encoded data:
  category  category_encoded
0      red                 1
1    green                 2
2     blue                 3
3      red                 1
4    green                 2
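When the categories do have a real rank, the order should be supplied explicitly. The sketch below uses scikit-learn's `OrdinalEncoder` with its `categories` parameter and the rating levels mentioned earlier (poor, fair, good, excellent); the sample rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative sample of ratings
ratings = pd.DataFrame({"rating": ["good", "poor", "excellent", "fair", "good"]})

# State the rank explicitly, from lowest to highest
order = ["poor", "fair", "good", "excellent"]
encoder = OrdinalEncoder(categories=[order])

# fit_transform expects 2-D input and returns a 2-D float array
ratings["rating_encoded"] = encoder.fit_transform(ratings[["rating"]]).ravel()
print(ratings)
```

With the order given, `poor` maps to 0.0 and `excellent` to 3.0, so larger numbers always mean better ratings.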
Target Encoding
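Target Encoding replaces each category with the average target value for that category. It's useful when there is a strong relationship between the categorical feature and the target variable. `ce.TargetEncoder` also applies smoothing toward the overall target mean, so the exact numbers can vary with the library version and its `smoothing` and `min_samples_leaf` settings: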
import pandas as pd
import category_encoders as ce
# Create a sample dataset with target variable
data = {'category': ['red', 'green', 'blue', 'red', 'green'],
'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
print("Original data with target:")
print(df)
# Initialize the encoder
encoder = ce.TargetEncoder()
# Encode the categorical feature based on target
df['category_encoded'] = encoder.fit_transform(df['category'], df['target'])
print("\nTarget encoded data:")
print(df)
Original data with target:
  category  target
0      red       1
1    green       0
2     blue       1
3      red       0
4    green       1

Target encoded data:
  category  target  category_encoded
0      red       1          0.500000
1    green       0          0.500000
2     blue       1          1.000000
3      red       0          0.500000
4    green       1          0.500000
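To see what the core idea computes, the raw per-category means can be reproduced with plain pandas. This is a minimal sketch without any smoothing, using the same sample data; the `category_mean` column name is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["red", "green", "blue", "red", "green"],
    "target": [1, 0, 1, 0, 1],
})

# Mean target value for each category
category_means = df.groupby("category")["target"].mean()

# Replace each category label with its mean
df["category_mean"] = df["category"].map(category_means)
print(df)
```

In a real project the means would normally be computed on the training split only and then mapped onto validation or test rows, since computing them on the full dataset leaks target information into the feature.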
Comparison of Encoding Methods
| Method | Best For | Pros | Cons |
|---|---|---|---|
| One-Hot | Nominal data | No assumptions about order | Creates many columns |
| Ordinal | Ordered categories | Preserves ordinality | Assumes equal spacing |
| Target | High cardinality features | Captures relationship with target | Risk of overfitting |
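The "creates many columns" drawback in the table is easy to check. The sketch below builds a hypothetical `city` column with 50 distinct values and compares the width of a one-hot encoding with a single integer-coded column.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 200 rows, 50 distinct city labels
cities = pd.DataFrame({"city": [f"city_{i % 50}" for i in range(200)]})

# One-hot encoding adds one column per distinct value
one_hot = pd.get_dummies(cities, columns=["city"])
print(one_hot.shape)

# Integer codes keep the feature in a single column
codes = cities["city"].astype("category").cat.codes
print(codes.nunique())
```

The one-hot frame grows with the number of distinct values, which is why single-column encodings such as target encoding are often considered for high-cardinality features.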
Conclusion
Choosing the right encoding method depends on your data type and machine learning task. Use one-hot encoding for nominal data, ordinal encoding when categories have natural order, and target encoding for high-cardinality features with strong target relationships.
