How to convert categorical data to binary data in Python?

Categorical data is data whose values fall into a limited set of discrete categories or groups. When the categories have no inherent order, the data is called nominal; when they have a natural order (for example low, medium, high), it is called ordinal. The categories carry no numerical value and are usually represented by words, labels, or symbols. Categorical data is commonly used to describe characteristics or attributes of objects, people, or events, and it appears in fields such as social sciences, marketing, and medical research.

In Python, categorical data can be represented using various data structures, such as lists, tuples, dictionaries, and arrays. For analysis, it is most commonly stored in a pandas DataFrame, a two-dimensional, table-like data structure that can store and manipulate large amounts of data.

Example of Categorical Data

Suppose you have a dataset containing information about the type of vehicles people own. The dataset includes the following categorical variables:

Vehicle Type: Car, Truck, SUV, Van, Motorcycle
Fuel Type: Gasoline, Diesel, Electric, Hybrid
Color: Red, Blue, Green, Black, White

You can represent this dataset in Python using a pandas DataFrame as follows:

import pandas as pd

data = {'Vehicle Type': ['Car', 'Truck', 'SUV', 'Van', 'Motorcycle'],
        'Fuel Type': ['Gasoline', 'Diesel', 'Electric', 'Hybrid', 'Gasoline'],
        'Color': ['Red', 'Blue', 'Green', 'Black', 'White']}

df = pd.DataFrame(data)
print(df)

  Vehicle Type Fuel Type  Color
0          Car  Gasoline    Red
1        Truck    Diesel   Blue
2          SUV  Electric  Green
3          Van    Hybrid  Black
4   Motorcycle  Gasoline  White

As you can see, each categorical variable is a column in the DataFrame, and each category is stored as a string value in that column. You can use various pandas functions and methods to manipulate and analyze this data, such as groupby(), count(), value_counts(), and crosstab().
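As a brief illustration of two of these methods, the sketch below rebuilds the vehicle DataFrame from the example above, counts how often each fuel type occurs, and cross-tabulates vehicle type against fuel type. The variable names fuel_counts and table are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Vehicle Type': ['Car', 'Truck', 'SUV', 'Van', 'Motorcycle'],
    'Fuel Type': ['Gasoline', 'Diesel', 'Electric', 'Hybrid', 'Gasoline'],
    'Color': ['Red', 'Blue', 'Green', 'Black', 'White'],
})

# Frequency of each fuel type (Gasoline appears twice in this sample)
fuel_counts = df['Fuel Type'].value_counts()
print(fuel_counts)

# Contingency table: rows are vehicle types, columns are fuel types
table = pd.crosstab(df['Vehicle Type'], df['Fuel Type'])
print(table)
```

Both results are ordinary pandas objects, so they can be indexed by category label, for example fuel_counts['Gasoline'] or table.loc['Car', 'Gasoline'].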

Characteristics of Categorical Data

Below are some key characteristics of categorical data:

Categorical data takes a limited number of distinct categories.
Nominal categories have no inherent order or ranking; ordinal categories do.
Categorical data is therefore measured on a nominal or an ordinal scale.
Categorical data is often summarized using counts or frequency distributions.
Fewer statistical operations apply to categorical data than to numerical data; for example, a mean is not defined for nominal categories.
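The difference between the nominal and ordinal scales can be sketched with the pandas category dtype. The colors and sizes values below are hypothetical sample data, and the order Small < Medium < Large is an assumption made for the illustration.

```python
import pandas as pd

# Nominal: categories with no order among them
colors = pd.Series(['Red', 'Blue', 'Red', 'Green'], dtype='category')

# Ordinal: categories with an explicitly declared order
sizes = pd.Series(pd.Categorical(
    ['Small', 'Large', 'Medium', 'Small'],
    categories=['Small', 'Medium', 'Large'],
    ordered=True,
))

print(colors.cat.ordered)   # False
print(sizes.cat.ordered)    # True

# Order-based operations are only meaningful for the ordinal series
print(sizes.min(), sizes.max())

# Frequency counts work for both kinds of categorical data
print(colors.value_counts())
```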

Converting Categorical Data to Binary Data

Conversion of categorical data into binary data involves transforming categorical variables into binary (0 or 1) values that can be used for analysis or modeling purposes. This transformation is useful because many machine learning algorithms and statistical methods require numerical inputs, rather than categorical inputs.

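One-hot encoding is a common approach that converts each unique category in a categorical variable into a separate binary column, where a value of 1 indicates the presence of the category and 0 indicates its absence. The term "binary encoding" is sometimes reserved for a different scheme that writes each category's index in binary digits; this article uses one-hot encoding throughout.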
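To make the idea concrete before turning to a library helper, the following minimal sketch builds the binary columns by hand. The fuel series and the Fuel_ column prefix are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

fuel = pd.Series(['Gasoline', 'Diesel', 'Electric', 'Gasoline'], name='Fuel')

# One column per unique category: 1 where the row matches it, 0 otherwise
manual = pd.DataFrame({
    f'Fuel_{category}': (fuel == category).astype(int)
    for category in sorted(fuel.unique())
})
print(manual)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.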

Using pandas get_dummies()

The simplest way to convert categorical data to binary format is the pd.get_dummies() function. Let's see how it works:

import pandas as pd

# Create a sample DataFrame with categorical data
data = {'Gender': ['Male', 'Female', 'Male', 'Female'],
        'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles'],
        'Marital Status': ['Single', 'Married', 'Single', 'Divorced']}

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Convert categorical variables to binary values.
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
encoded_df = pd.get_dummies(df, dtype=int)
print("\nBinary Encoded DataFrame:")
print(encoded_df)

Original DataFrame:
   Gender         City Marital Status
0    Male     New York         Single
1  Female      Chicago        Married
2    Male      Chicago         Single
3  Female  Los Angeles       Divorced

Binary Encoded DataFrame:
   Gender_Female  Gender_Male  City_Chicago  City_Los Angeles  City_New York  Marital Status_Divorced  Marital Status_Married  Marital Status_Single
0              0            1             0                 0              1                        0                       0                      1
1              1            0             1                 0              0                        0                       1                      0
2              0            1             1                 0              0                        0                       0                      1
3              1            0             0                 1              0                        1                       0                      0

Using One-Hot Encoding with Specific Columns

You can also apply one-hot encoding to specific columns only:

import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female'],
        'Age': [25, 30, 35, 28],
        'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles']}

df = pd.DataFrame(data)

# Encode only the listed categorical columns; numeric columns stay as-is.
# dtype=int gives 0/1 columns; pandas 2.0+ returns True/False by default.
encoded_df = pd.get_dummies(df, columns=['Gender', 'City'], dtype=int)
print(encoded_df)

   Age  Gender_Female  Gender_Male  City_Chicago  City_Los Angeles  City_New York
0   25              0            1             0                 0              1
1   30              1            0             1                 0              0
2   35              0            1             1                 0              0
3   28              1            0             0                 1              0

Key Points

get_dummies() creates a new column for each unique category.
Binary values: 1 indicates presence, 0 indicates absence of the category (pandas 2.0 and later show True/False unless dtype=int is passed).
This process is also called one-hot encoding.
Use the columns parameter to specify which columns to encode.
The original categorical columns are replaced with multiple binary columns.
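When a variable has only two categories, such as Gender in the examples above, a single binary column already carries all the information. One possible way to get that compact form is the drop_first parameter of get_dummies(), which drops the first category of each encoded column. This is a sketch of an optional variation, not a step the examples above require.

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male', 'Female'],
                   'City': ['New York', 'Chicago', 'Chicago', 'Los Angeles']})

# drop_first=True removes the first (alphabetically sorted) category of each
# column, so Gender becomes a single Gender_Male column of 0/1 values
compact = pd.get_dummies(df, drop_first=True, dtype=int)
print(compact)
```

A row with 0 in every remaining column of a variable belongs to the dropped category (Female for Gender, Chicago for City).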

Conclusion

Converting categorical data to binary format is essential for machine learning algorithms that require numerical inputs. The pd.get_dummies() function provides a simple and effective way to perform this transformation, creating binary columns that represent the presence or absence of each category.

Mukul Latiyan

Updated on: 2026-03-27T01:19:37+05:30
