How to Create Simulated Data for Classification in Python

In this tutorial we will learn how to create simulated data for classification in Python using popular libraries like scikit-learn and Faker.

Introduction

Simulated data is data that is not collected from a real phenomenon but generated synthetically from chosen parameters and constraints. It mimics real-world patterns and relationships while remaining completely controllable.

When and Why Do We Need Simulated Data?

When prototyping a Machine Learning or Deep Learning algorithm, good real-world data is often scarce, and for some tasks no data is available at all. In such scenarios, we can fall back on synthetically generated data, which may also come from lab simulations.

Advantages of Simulated Data

Closely approximates how the data might look in its real form
Contains less noise than real data, so it can serve as a clean, idealized dataset
Useful for quick prototyping and POCs
Complete control over data distribution and class balance
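
The last point can be made concrete with a minimal sketch. The 90/10 split, the sample count, and the variable names below are illustrative assumptions and not part of the original tutorial; flip_y=0 is set so that random label noise does not blur the requested proportions.

```python
from collections import Counter

from sklearn.datasets import make_classification

# Request a 90/10 class split. flip_y=0 disables random label noise,
# so the generated counts follow the requested weights closely.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[0.9, 0.1],
    flip_y=0,
    random_state=0,
)

# Count how many samples ended up in each class
class_counts = Counter(y.tolist())
print(class_counts)
```

Changing the weights list is enough to simulate a balanced or a heavily imbalanced problem without touching the rest of the pipeline.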

Using scikit-learn's make_classification

The most common approach is using scikit-learn's make_classification function, which generates a random n-class classification problem:

from sklearn.datasets import make_classification
import pandas as pd

# Creating a simulated feature matrix and output vector with 100 samples
features, output = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=3,
    weights=[.2, .3, .5],
    random_state=42
)

# Create a DataFrame for easier inspection; column names are derived
# from the actual number of features instead of a hard-coded 10
df_features = pd.DataFrame(features,
    columns=[f"Feature_{i+1}" for i in range(features.shape[1])])
output_series = pd.Series(output, name='label')
df = pd.concat([df_features, output_series], axis=1)

print("Simulated Classification Dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Class distribution:\n{df['label'].value_counts().sort_index()}")

Running this prints output of the following form. Exact values may differ across scikit-learn versions, and class counts can deviate slightly from the requested weights because flip_y defaults to 0.01:

Simulated Classification Dataset:
   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5  Feature_6  Feature_7  Feature_8  Feature_9  Feature_10  label
0   0.374540   0.950714  -0.151357  -0.103219   0.410599   0.144044   1.454274   0.761038   0.121675    0.443863      2
1   0.333674   1.494079  -0.205158   0.313068  -0.854096  -2.552990   0.653619   0.864436  -0.742165    2.269755      1
2  -1.142071  -2.153013  -0.252288   0.045759  -1.057711   0.822545  -1.220844  -1.959670  -1.328186    0.196861      0
3   0.567528   0.755611   2.269755  -1.454366   0.045759  -0.187184   1.532779   1.469359   0.154947    0.378163      2
4  -0.887786  -1.980796  -0.347912   0.156349  -1.367230   0.906098  -0.601707  -1.094685  -0.441652   -0.365886      0

Dataset shape: (100, 11)
Class distribution:
0    20
1    30
2    50
Name: label, dtype: int64
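
Since the point of simulated data is quick prototyping, a short sketch of fitting a model on it may help. The choice of LogisticRegression, the 70/30 split, and the variable names are assumptions made for illustration; any scikit-learn classifier could be swapped in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Regenerate the same simulated dataset as in the tutorial
features, output = make_classification(
    n_samples=100, n_features=10, n_informative=5, n_redundant=5,
    n_classes=3, weights=[0.2, 0.3, 0.5], random_state=42,
)

# Stratify so each split keeps roughly the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    features, output, test_size=0.3, stratify=output, random_state=42
)

# Fit a simple baseline classifier and report held-out accuracy
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

With only 100 samples the accuracy estimate is noisy, so it is best read as a sanity check on the pipeline rather than a benchmark.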

Visualizing the Generated Data

Let's create a simple scatter plot to visualize the relationship between two of the features and the target classes:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']
for i, class_label in enumerate(sorted(df['label'].unique())):
    class_data = df[df['label'] == class_label]
    plt.scatter(class_data['Feature_1'], class_data['Feature_2'],
                c=colors[i], label=f'Class {class_label}', alpha=0.7)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Simulated Classification Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
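
In the 10-feature dataset, Feature_1 and Feature_2 are just two arbitrary columns (make_classification shuffles the feature order by default), so the classes may overlap heavily in that plot. A hedged alternative is to generate a dataset with only two informative features, which can then be passed to the same plotting loop. The parameter values and the df_2d name below are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Two informative features, no redundant ones, one cluster per class,
# so every column that gets plotted actually carries class signal.
X2, y2 = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    n_classes=3,
    class_sep=2.0,   # larger values push the class clusters further apart
    random_state=42,
)

# Same column names as before, so the scatter-plot loop works unchanged
df_2d = pd.DataFrame(X2, columns=["Feature_1", "Feature_2"])
df_2d["label"] = y2
print(df_2d.groupby("label").mean())
```

Lowering class_sep makes the problem harder, which is a convenient way to stress-test a classifier on progressively less separable data.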

Using Faker Library for Realistic Data

Another method is using the Faker library to generate more realistic-looking data with meaningful column names:

# pip install faker
from faker import Faker
from faker.providers import DynamicProvider
import pandas as pd

# Create custom provider for medical professions
medical_professions_provider = DynamicProvider(
    provider_name="medical_profession",
    elements=["doctor", "nurse", "surgeon", "clerk", "radiologist"],
)

fake = Faker()
fake.add_provider(medical_professions_provider)

def create_medical_data(num_samples):
    data = []
    for _ in range(num_samples):
        record = {
            # .unique avoids duplicate patient IDs within this Faker instance
            'patient_id': fake.unique.random_int(min=1000, max=9999),
            'name': fake.name(),
            'age': fake.random_int(min=18, max=90),
            'city': fake.city(),
            'profession': fake.medical_profession()
        }
        data.append(record)
    
    return pd.DataFrame(data)

# Generate sample data (values differ on every run unless Faker.seed() is called first)
medical_df = create_medical_data(10)
print("Simulated Medical Data:")
print(medical_df)

Simulated Medical Data:
   patient_id             name  age          city   profession
0        3672     Ashley Smith   45      New York       doctor
1        8291  Michael Johnson   32       Chicago        nurse
2        5743   Sarah Williams   67   Los Angeles      surgeon
3        9184      David Brown   28       Houston        clerk
4        4627    Lisa Martinez   54       Phoenix  radiologist
5        6395    John Anderson   41  Philadelphia       doctor
6        2768   Jennifer Davis   33   San Antonio        nurse
7        7539    Robert Miller   76        Dallas      surgeon
8        8162      Mary Wilson   29      San Jose        clerk
9        4951  Christopher Lee   58        Austin  radiologist
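
The Faker records above have no target column, so they are not yet a classification dataset. One possible way to add a label is sketched below. The high_risk column, the age threshold of 60, and the 10% noise rate are all hypothetical choices made for illustration, and the rows are a hand-written stand-in with the same column layout as medical_df so that the snippet runs without Faker installed.

```python
import numpy as np
import pandas as pd

# Hand-written stand-in rows shaped like medical_df (assumption:
# in practice this would be the DataFrame returned by Faker).
records = pd.DataFrame({
    "patient_id": [3672, 8291, 5743, 9184, 4627, 7539],
    "age": [45, 32, 67, 28, 54, 76],
    "profession": ["doctor", "nurse", "surgeon", "clerk", "radiologist", "surgeon"],
})

rng = np.random.default_rng(0)

# Hypothetical rule: patients aged 60 or older are labelled high risk (1)
base_label = (records["age"] >= 60).astype(int)

# Flip each label with 10% probability to simulate label noise
flip = rng.random(len(records)) < 0.1
records["high_risk"] = np.where(flip, 1 - base_label, base_label)

# One-hot encode the categorical column so a classifier can consume it
X = pd.get_dummies(records[["age", "profession"]], columns=["profession"])
y = records["high_risk"]
print(X.shape, y.tolist())
```

Because the label is derived from a known rule, it is easy to check whether a model recovers that rule, which is one of the main benefits of simulated data.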

Parameters in make_classification

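Parameter	Description	Default
`n_samples`	Number of samples to generate	100
`n_features`	Total number of features	20
`n_informative`	Number of informative features	2
`n_redundant`	Number of redundant features (linear combinations of the informative ones)	2
`n_classes`	Number of classes	2
`weights`	Proportion of samples assigned to each class	None (balanced)
`random_state`	Seed for reproducible output	None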
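
The defaults in the table can be checked with a short sketch that calls make_classification with no arguments other than a seed. The variable names are illustrative.

```python
from sklearn.datasets import make_classification

# Only the seed is set; every other parameter keeps its default value
X_default, y_default = make_classification(random_state=0)

print(X_default.shape)                  # (100, 20)
print(sorted(set(y_default.tolist())))  # [0, 1]
```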

Conclusion

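Simulated data is highly useful in day-to-day Machine Learning work for prototyping and small POCs. Use make_classification for controlled numerical datasets with class labels, and Faker for realistic-looking records with meaningful column names. Note that Faker generates each field independently, so any relationship between columns, or any class label, has to be added separately.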

Mithilesh Pradhan

Updated on: 2026-03-26T22:51:27+05:30
