How to create a sample dataset using Python Scikit-learn?

In this tutorial, we will learn how to create sample datasets using Python Scikit-learn for machine learning experiments and testing.

Scikit-learn ships with several built-in datasets that we can use directly in our ML models, but sometimes we need custom toy datasets with a known structure. For this purpose, scikit-learn provides sample dataset generators that create synthetic data with specific patterns.

Creating Sample Blob Dataset using make_blobs

To create a sample blob dataset, we use sklearn.datasets.make_blobs, which generates isotropic Gaussian blobs for clustering tasks:

Example

# Importing libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Creating Blob dataset
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1, n_features=2, random_state=42)

# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Blob Dataset")
plt.show()

print(f"Dataset shape: {X.shape}")
print(f"Number of clusters: {len(set(y))}")

Output

Dataset shape: (500, 2)
Number of clusters: 3

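The make_blobs function draws each cluster from an isotropic Gaussian around its center, and cluster_std controls how far the points spread. With the settings above, the three clusters are well separated, which makes the dataset a convenient test case for clustering algorithms such as K-means.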
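As a quick check of that claim, the sketch below fits K-means to the same blob dataset and compares the recovered clusters with the generated labels. The choice of KMeans settings (n_init=10) and of the adjusted Rand index as the agreement measure are assumptions made for illustration; they are not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same blob dataset as in the example above
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1, n_features=2, random_state=42)

# Fit K-means with the known number of clusters (n_init=10 is an illustrative choice)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# The adjusted Rand index ignores how cluster ids are numbered;
# 1.0 means the clustering matches the generated labels exactly
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to 1 indicates that K-means recovers the generated clusters. Raising cluster_std makes the blobs overlap and should lower the score.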

Creating Sample Moon Dataset using make_moons

To create a sample moon dataset, we use sklearn.datasets.make_moons, which generates a two-dimensional dataset in the shape of two interleaving crescent moons:

Example

# Importing libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Creating Moon dataset
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)

# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Moon Dataset")
plt.show()

print(f"Dataset shape: {X.shape}")
print(f"Classes: {set(y)}")

Output

Dataset shape: (500, 2)
Classes: {0, 1}

The moon dataset is ideal for testing non-linear classification algorithms as the data is not linearly separable.
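To illustrate the point about linear separability, the sketch below compares a linear classifier with a non-linear one on the same moon dataset. The specific models (LogisticRegression and an RBF-kernel SVC with default settings) and the 70/30 train/test split are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same moon dataset as in the example above
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)

# Hold out 30% of the samples for evaluation (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Linear decision boundary
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Non-linear decision boundary via an RBF kernel
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic regression accuracy: {linear_acc:.3f}")
print(f"RBF SVM accuracy: {rbf_acc:.3f}")
```

The linear model can only draw a straight line between the two crescents, so the kernel model would be expected to score higher here.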

Creating Sample Circle Dataset using make_circles

To create a sample circle dataset, we use sklearn.datasets.make_circles, which generates a binary classification dataset of two concentric circles:

Example

# Importing libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt

# Creating Circle dataset
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='plasma', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Circle Dataset")
plt.show()

print(f"Dataset shape: {X.shape}")
# make_circles labels the outer circle 0 and the inner circle 1
print(f"Outer circle samples: {sum(y == 0)}")
print(f"Inner circle samples: {sum(y == 1)}")

Output

Dataset shape: (500, 2)
Outer circle samples: 250
Inner circle samples: 250

The circle dataset tests algorithms on non-linearly separable data where classes form concentric patterns.
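One way to see why the concentric pattern defeats a linear model is to add the squared distance from the origin as an extra feature. The sketch below does this with LogisticRegression; the model choice, the extra feature, and the use of training accuracy (rather than a held-out split) are illustrative assumptions to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

# Same circle dataset as in the example above
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

# A linear model on the raw coordinates cannot separate concentric circles
raw_acc = LogisticRegression().fit(X, y).score(X, y)

# Add the squared radius x1^2 + x2^2 as a third feature;
# the two circles differ mainly in this quantity
radius_sq = (X ** 2).sum(axis=1, keepdims=True)
X_aug = np.hstack([X, radius_sq])
aug_acc = LogisticRegression().fit(X_aug, y).score(X_aug, y)

print(f"Training accuracy, raw features: {raw_acc:.3f}")
print(f"Training accuracy, with squared radius: {aug_acc:.3f}")
```

On the raw features the accuracy should stay near chance level, while the added radius feature lets the same linear model separate the classes.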

Comparison of Dataset Generators

Function      | Pattern            | Best For                  | Key Parameters
make_blobs    | Gaussian clusters  | Clustering algorithms     | centers, cluster_std
make_moons    | Crescent shapes    | Non-linear classification | noise
make_circles  | Concentric circles | Non-linear classification | noise, factor
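The effect of two of these key parameters can be measured directly. The sketch below varies cluster_std for make_blobs and factor for make_circles and prints the resulting spread and inner-circle radius. The fixed centers and the parameter values are hypothetical, chosen only so that one parameter changes at a time.

```python
import numpy as np
from sklearn.datasets import make_blobs, make_circles

# Hypothetical fixed centers so that only cluster_std changes between runs
centers = [(-5, 0), (0, 5), (5, 0)]

blob_spread = {}
for std in (0.5, 2.0):
    X, y = make_blobs(n_samples=600, centers=centers, cluster_std=std, random_state=0)
    # Average per-feature standard deviation within each cluster
    blob_spread[std] = np.mean([X[y == k].std(axis=0).mean() for k in range(3)])

inner_radius = {}
for factor in (0.3, 0.8):
    # No noise, so every inner-circle point lies exactly at radius = factor
    X, y = make_circles(n_samples=400, factor=factor, random_state=0)
    # Label 1 marks the inner circle
    inner_radius[factor] = np.linalg.norm(X[y == 1], axis=1).mean()

print("Within-cluster spread by cluster_std:", blob_spread)
print("Inner circle radius by factor:", inner_radius)
```

The measured spread tracks cluster_std, and the inner radius equals factor, since the outer circle has radius 1.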

Conclusion

Scikit-learn's dataset generators provide a convenient way to create synthetic data for testing ML algorithms. Use make_blobs for clustering problems, and make_moons or make_circles for non-linear classification problems.

Updated on: 2026-03-26T22:10:44+05:30
