How to Create Simulated Data for Classification in Python
In this tutorial we will learn how to create simulated data for classification in Python using popular libraries like scikit-learn and Faker.
Introduction
Simulated data is data that is not collected from a real phenomenon but is generated synthetically from chosen parameters and constraints. It can mimic real-world patterns and relationships while remaining fully controllable.
When and Why Do We Need Simulated Data?
When prototyping a Machine Learning or Deep Learning algorithm, good real-world data is often scarce, and for some tasks no suitable data exists at all. In such scenarios we can fall back on synthetically generated data. Data produced by lab simulations can serve the same purpose.
Advantages of Simulated Data
- Approximates the form the data might take in the real world
- Typically contains less noise than real data, so it can serve as an idealized dataset
- Useful for quick prototyping and proofs of concept (POCs)
- Gives complete control over the data distribution and class balance
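To make the last two points concrete, here is a minimal sketch of how noise and class balance can be controlled with scikit-learn's make_classification. The specific values (1000 samples, a 90/10 split, random_state=0) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification

# weights requests roughly a 90/10 class split.
# flip_y=0 turns off random label flipping, one of the noise sources
# make_classification adds by default (flip_y=0.01).
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[0.9, 0.1],
    flip_y=0,
    random_state=0,
)

class_counts = np.bincount(y)  # number of samples per class label
print("Samples per class:", class_counts)
```

Raising flip_y or lowering class_sep makes the problem noisier and harder, which can be handy for checking how a model degrades.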
Using scikit-learn's make_classification
The most common approach is to use scikit-learn's make_classification function, which generates a random n-class classification problem:
from sklearn.datasets import make_classification
import pandas as pd

# Create a simulated feature matrix and output vector with 100 samples
features, output = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=3,
    weights=[0.2, 0.3, 0.5],
    random_state=42
)

# Build a DataFrame for easier inspection
df_features = pd.DataFrame(features,
                           columns=[f"Feature_{i+1}" for i in range(10)])
output_series = pd.Series(output, name='label')
df = pd.concat([df_features, output_series], axis=1)

print("Simulated Classification Dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"Class distribution:\n{df['label'].value_counts().sort_index()}")
The output has the following form (the exact numbers may differ on your machine):

Simulated Classification Dataset:
   Feature_1  Feature_2  Feature_3  Feature_4  Feature_5  Feature_6  Feature_7  Feature_8  Feature_9  Feature_10  label
0   0.374540   0.950714  -0.151357  -0.103219   0.410599   0.144044   1.454274   0.761038   0.121675    0.443863      2
1   0.333674   1.494079  -0.205158   0.313068  -0.854096  -2.552990   0.653619   0.864436  -0.742165    2.269755      1
2  -1.142071  -2.153013  -0.252288   0.045759  -1.057711   0.822545  -1.220844  -1.959670  -1.328186    0.196861      0
3   0.567528   0.755611   2.269755  -1.454366   0.045759  -0.187184   1.532779   1.469359   0.154947    0.378163      2
4  -0.887786  -1.980796  -0.347912   0.156349  -1.367230   0.906098  -0.601707  -1.094685  -0.441652   -0.365886      0

Dataset shape: (100, 11)
Class distribution:
0    20
1    30
2    50
Name: label, dtype: int64
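Since quick prototyping is the main use case, the sketch below shows one way such a dataset might be fed into a model. The choice of LogisticRegression, the 75/25 split, and max_iter=1000 are assumptions made for illustration, not part of the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same generator settings as in the example above
X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=3,
    weights=[0.2, 0.3, 0.5],
    random_state=42,
)

# Stratify so that the imbalanced class ratio is kept in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit a simple baseline classifier and score it on the held-out split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

With only 25 test samples the accuracy estimate is noisy, so for anything beyond a smoke test it is probably worth increasing n_samples.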
Visualizing the Generated Data
Let's create a simple scatter plot of two features, colored by target class:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
colors = ['red', 'blue', 'green']

# Plot each class separately so it gets its own color and legend entry
for i, class_label in enumerate(sorted(df['label'].unique())):
    class_data = df[df['label'] == class_label]
    plt.scatter(class_data['Feature_1'], class_data['Feature_2'],
                c=colors[i], label=f'Class {class_label}', alpha=0.7)

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Simulated Classification Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
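Feature_1 and Feature_2 are only two of the ten columns, and because make_classification shuffles the feature order by default, they are not guaranteed to be the most informative pair. One possible alternative, sketched here under the assumption that a linear projection is good enough for a first look, is to reduce all ten features to two dimensions with PCA before plotting.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Same generator settings as in the example above
X, y = make_classification(
    n_samples=100,
    n_features=10,
    n_informative=5,
    n_redundant=5,
    n_classes=3,
    weights=[0.2, 0.3, 0.5],
    random_state=42,
)

# Project the 10-dimensional feature matrix onto its first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Projected shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The columns X_2d[:, 0] and X_2d[:, 1] can then be passed to plt.scatter in place of the two raw feature columns.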
Using Faker Library for Realistic Data
Another method is to use the Faker library to generate more realistic-looking data with meaningful column names:
# pip install faker
from faker import Faker
from faker.providers import DynamicProvider
import pandas as pd

# Create a custom provider for medical professions
medical_professions_provider = DynamicProvider(
    provider_name="medical_profession",
    elements=["doctor", "nurse", "surgeon", "clerk", "radiologist"],
)

fake = Faker()
fake.add_provider(medical_professions_provider)

def create_medical_data(num_samples):
    """Return a DataFrame with num_samples fake records."""
    data = []
    for _ in range(num_samples):
        record = {
            'patient_id': fake.random_int(min=1000, max=9999),
            'name': fake.name(),
            'age': fake.random_int(min=18, max=90),
            'city': fake.city(),
            'profession': fake.medical_profession()
        }
        data.append(record)
    return pd.DataFrame(data)

# Generate sample data
medical_df = create_medical_data(10)
print("Simulated Medical Data:")
print(medical_df)
Sample output (Faker is not seeded here, so the names and values change on every run):

Simulated Medical Data:
   patient_id             name  age          city   profession
0        3672     Ashley Smith   45      New York       doctor
1        8291  Michael Johnson   32       Chicago        nurse
2        5743   Sarah Williams   67   Los Angeles      surgeon
3        9184      David Brown   28       Houston        clerk
4        4627    Lisa Martinez   54       Phoenix  radiologist
5        6395    John Anderson   41  Philadelphia       doctor
6        2768   Jennifer Davis   33   San Antonio        nurse
7        7539    Robert Miller   76        Dallas      surgeon
8        8162      Mary Wilson   29      San Jose        clerk
9        4951  Christopher Lee   58        Austin  radiologist
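The Faker records above have no target column, so they are not yet a classification dataset. One possible way to add a label is to derive it from an existing field with some randomness. The sketch below uses a small stand-in DataFrame built with NumPy in place of the Faker output, so it runs without Faker installed. The high_risk column name and the age-based logistic rule are hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the Faker output: only the columns needed for this illustration
records = pd.DataFrame({
    "age": rng.integers(18, 91, size=200),  # ages 18 to 90, as in the Faker example
    "profession": rng.choice(
        ["doctor", "nurse", "surgeon", "clerk", "radiologist"], size=200
    ),
})

# Hypothetical rule: the probability of label 1 rises with age
# (logistic curve centered at age 55)
prob_positive = 1.0 / (1.0 + np.exp(-(records["age"] - 55) / 10.0))

# Draw the label at random according to that probability
records["high_risk"] = (rng.random(len(records)) < prob_positive).astype(int)

print(records.head())
print(records["high_risk"].value_counts().sort_index())
```

Because the label depends on age only through a probability, the resulting classes overlap, which is closer to real data than a hard threshold would be.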
Parameters in make_classification
| Parameter | Description | Default |
|---|---|---|
| n_samples | Number of samples to generate | 100 |
| n_features | Total number of features | 20 |
| n_informative | Number of informative features | 2 |
| n_classes | Number of classes | 2 |
| weights | Class distribution weights | None (balanced) |
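As a quick check of the defaults in the table, the sketch below calls make_classification with only random_state set (the seed value 0 is an arbitrary assumption). It also shows that the function rejects a configuration in which the informative and redundant features together exceed n_features.

```python
import numpy as np
from sklearn.datasets import make_classification

# With all defaults: 100 samples, 20 features, 2 classes
X, y = make_classification(random_state=0)
print(X.shape)  # (100, 20)
print("Classes:", np.unique(y))

# n_informative + n_redundant (+ n_repeated) must not exceed n_features
rejected = False
try:
    make_classification(
        n_features=4, n_informative=3, n_redundant=3, random_state=0
    )
except ValueError as err:
    rejected = True
    print(f"Rejected: {err}")
```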
Conclusion
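Simulated data is useful in day-to-day Machine Learning work for prototyping and small POCs. Use make_classification for controlled numerical datasets with a known class structure, and Faker for realistic-looking records with meaningful field names. Note that Faker generates each field independently, so any relationship between fields, including a target label, has to be added separately.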
