Handling Categorical Data in Python

Categorical data is data whose values come from a small, fixed set of possibilities, known as categories or levels. It comes in two kinds: nominal and ordinal. Nominal data has no intrinsic order, such as colors, genders, or animal species. Ordinal data is naturally ranked or ordered, such as customer satisfaction levels or educational attainment.
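The nominal/ordinal distinction can be made explicit in pandas. The following is a minimal sketch, assuming pandas is installed; the rating scale and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Nominal: the categories have no order
colors = pd.Series(pd.Categorical(["red", "blue", "green", "red"]))

# Ordinal: an explicit ranking from lowest to highest (hypothetical scale)
ratings = pd.Series(
    pd.Categorical(
        ["good", "poor", "excellent", "fair"],
        categories=["poor", "fair", "good", "excellent"],
        ordered=True,
    )
)

print(colors.cat.ordered)            # False
print(ratings.cat.ordered)           # True
print(ratings.min(), ratings.max())  # poor excellent
print((ratings > "fair").tolist())   # [True, False, True, False]
```

Comparisons and min/max are only meaningful for the ordered column, which is the practical difference between the two kinds of data.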

Setup

Install the required libraries for handling categorical data:

pip install pandas
pip install scikit-learn
pip install category_encoders

Categorical data is often represented as text labels, and many machine learning algorithms require numerical input data. Therefore, it is important to convert categorical data into a numerical format before feeding it to a machine learning algorithm. This process is known as encoding.

One-Hot Encoding

One-hot encoding creates a binary vector for each category in the dataset. The vector contains a 1 for the category it represents and 0s for all other categories:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a pandas DataFrame with categorical data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})
print("Original data:")
print(df)

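# Create an instance of OneHotEncoder that returns a dense array
# (sparse_output replaces the old `sparse` argument in scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)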

# Fit and transform the DataFrame using the encoder
encoded_data = encoder.fit_transform(df)

# Convert the encoded data into a pandas DataFrame
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out())
print("\nOne-hot encoded data:")
print(encoded_df)

Output:

Original data:
   color
0    red
1   blue
2  green
3  green
4    red

One-hot encoded data:
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
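For quick exploration, pandas can produce the same columns without a separate encoder object. This is a sketch of an alternative, not the approach used above; it assumes the same sample data.

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# dtype=int yields 0/1 columns instead of booleans
dummies = pd.get_dummies(df, columns=['color'], dtype=int)
print(dummies)
```

Unlike a fitted OneHotEncoder, get_dummies does not remember which categories it saw, so the encoder is usually the safer choice when the same transformation must later be applied to new data.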

Ordinal Encoding

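Ordinal encoding assigns each category an integer based on its rank or order. This is useful when categories have a natural ordering, such as ratings (poor, fair, good, excellent). The example below only shows the mechanics: the colors have no real ranking, and because no mapping is supplied, the encoder simply numbers the categories as it encounters them: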

import pandas as pd
import category_encoders as ce

# Create a sample dataset
data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)
print("Original data:")
print(df)

# Initialize the encoder
encoder = ce.OrdinalEncoder()

# Encode the categorical feature
df['category_encoded'] = encoder.fit_transform(df['category'])
print("\nOrdinal encoded data:")
print(df)

Output:

Original data:
  category
0      red
1    green
2     blue
3      red
4    green

Ordinal encoded data:
  category  category_encoded
0      red                 1
1    green                 2
2     blue                 3
3      red                 1
4    green                 2
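When the categories do have a real ranking, the order should be passed in explicitly so that the integers reflect it. The sketch below uses scikit-learn's OrdinalEncoder with its categories parameter rather than category_encoders; the rating column and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature
df = pd.DataFrame({'rating': ['good', 'poor', 'excellent', 'fair', 'good']})

# Explicit ranking, lowest to highest; one list per encoded column
order = ['poor', 'fair', 'good', 'excellent']
encoder = OrdinalEncoder(categories=[order])

# fit_transform expects 2D input and returns a 2D float array
df['rating_encoded'] = encoder.fit_transform(df[['rating']]).ravel()
print(df)
```

With this mapping, poor becomes 0.0, fair 1.0, good 2.0, and excellent 3.0, so larger numbers always mean better ratings.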

Target Encoding

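Target encoding replaces each category with the average target value for that category. It is useful when there is a strong relationship between the categorical feature and the target variable. Note that TargetEncoder in category_encoders smooths each category mean toward the overall target mean by default, so the exact numbers it prints depend on the library version and its smoothing settings; the output shown below corresponds to the plain per-category means: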

import pandas as pd
import category_encoders as ce

# Create a sample dataset with target variable
data = {'category': ['red', 'green', 'blue', 'red', 'green'], 
        'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)
print("Original data with target:")
print(df)

# Initialize the encoder
encoder = ce.TargetEncoder()

# Encode the categorical feature based on target
df['category_encoded'] = encoder.fit_transform(df['category'], df['target'])
print("\nTarget encoded data:")
print(df)

Output:

Original data with target:
  category  target
0      red       1
1    green       0
2     blue       1
3      red       0
4    green       1

Target encoded data:
  category  target  category_encoded
0      red       1          0.500000
1    green       0          0.500000
2     blue       1          1.000000
3      red       0          0.500000
4    green       1          0.500000
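The underlying idea can be reproduced by hand with a group-by. This is a minimal sketch of unsmoothed target encoding using only pandas, not a reimplementation of the library's encoder.

```python
import pandas as pd

df = pd.DataFrame({
    'category': ['red', 'green', 'blue', 'red', 'green'],
    'target': [1, 0, 1, 0, 1],
})

# Mean target value per category
category_means = df.groupby('category')['target'].mean()

# Replace each category label with its mean
df['category_encoded'] = df['category'].map(category_means)
print(df)
```

Here red and green both map to 0.5 and blue maps to 1.0. Because blue appears only once, its encoding is just that single row's target, which illustrates why target encoding can overfit; in practice the means are typically computed on training data only.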

Comparison of Encoding Methods

| Method | Best For | Pros | Cons |
|---|---|---|---|
| One-Hot | Nominal data | No assumptions about order | Creates many columns |
| Ordinal | Ordered categories | Preserves ordinality | Assumes equal spacing |
| Target | High cardinality features | Captures relationship with target | Risk of overfitting |
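The "creates many columns" drawback is easy to see on a high-cardinality feature. The sketch below uses a hypothetical city column with 1,000 distinct values.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 distinct city labels
df = pd.DataFrame({'city': [f'city_{i}' for i in range(1000)]})

one_hot = pd.get_dummies(df['city'])

print(df['city'].nunique())  # 1000
print(one_hot.shape)         # (1000, 1000)
```

One-hot encoding adds one column per distinct value, whereas ordinal or target encoding would keep this feature as a single numeric column.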

Conclusion

Choosing the right encoding method depends on your data type and machine learning task. Use one-hot encoding for nominal data, ordinal encoding when categories have natural order, and target encoding for high-cardinality features with strong target relationships.

Updated on: 2026-03-27T01:17:03+05:30
