How to create a sample dataset using Python Scikit-learn?
In this tutorial, we will learn how to create sample datasets with Python's Scikit-learn library for machine learning experiments and testing.
Scikit-learn ships with several built-in datasets that are easy to load for ML models, but sometimes we need custom toy datasets. For this purpose, scikit-learn provides sample dataset generators in the sklearn.datasets module that create synthetic data with specific patterns.
Creating Sample Blob Dataset using make_blobs
To create a sample blob dataset, we use sklearn.datasets.make_blobs, which generates isotropic Gaussian blobs for clustering tasks:
Example
# Importing libraries
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Creating Blob dataset
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1, n_features=2, random_state=42)
# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Blob Dataset")
plt.show()
print(f"Dataset shape: {X.shape}")
print(f"Number of clusters: {len(set(y))}")
Dataset shape: (500, 2)
Number of clusters: 3
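With these settings, make_blobs produces compact Gaussian clusters that are typically well separated, which makes the dataset a convenient test case for clustering algorithms such as K-means. Increasing cluster_std makes the clusters overlap more.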
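As a quick check of this, the following sketch fits K-means to the same blob dataset and compares the recovered clusters with the generated labels. The choice of adjusted_rand_score as the metric and of n_init=10 are assumptions made for illustration; they are not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same dataset as in the example above
X, y = make_blobs(n_samples=500, centers=3, cluster_std=1, n_features=2, random_state=42)

# Fit K-means with the known number of clusters (illustrative settings)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# The adjusted Rand index ignores label permutations:
# 1.0 means the clustering matches the generated labels exactly
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to 1 indicates that K-means recovered the generated groups; rerunning with a larger cluster_std should lower the score as the blobs start to overlap.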
Creating Sample Moon Dataset using make_moons
To create a sample moon dataset, we use sklearn.datasets.make_moons, which generates a two-dimensional dataset in the shape of two interleaving crescent moons:
Example
# Importing libraries
from sklearn.datasets import make_moons
import matplotlib.pyplot as plt
# Creating Moon dataset
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)
# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Moon Dataset")
plt.show()
print(f"Dataset shape: {X.shape}")
print(f"Classes: {set(y)}")
Dataset shape: (500, 2)
Classes: {0, 1}
The moon dataset is ideal for testing non-linear classification algorithms as the data is not linearly separable.
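One way to see why the moon dataset suits non-linear classifiers is to compare a linear model with a kernel-based one on the same split. The sketch below is only an illustration: the choice of LogisticRegression and an RBF-kernel SVC, the 70/30 split, and the default hyperparameters are all assumptions rather than part of the original tutorial.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same dataset as in the example above
X, y = make_moons(n_samples=500, noise=0.1, random_state=42)

# Hold out 30% of the samples for evaluation (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Linear decision boundary
linear_acc = LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

# Non-linear decision boundary via an RBF kernel
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"Logistic regression accuracy: {linear_acc:.3f}")
print(f"RBF SVM accuracy: {rbf_acc:.3f}")
```

The linear model can only draw a straight line through the two crescents, so the kernel-based model is expected to score at least as well on this data.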
Creating Sample Circle Dataset using make_circles
To create a sample circle dataset, we use sklearn.datasets.make_circles, which generates a binary classification dataset made of two concentric circles:
Example
# Importing libraries
from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
# Creating Circle dataset
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)
# Plotting the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='plasma', s=30)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Sample Circle Dataset")
plt.show()
print(f"Dataset shape: {X.shape}")
# make_circles labels the outer circle 0 and the inner circle 1
print(f"Outer circle samples: {sum(y == 0)}")
print(f"Inner circle samples: {sum(y == 1)}")
Dataset shape: (500, 2)
Outer circle samples: 250
Inner circle samples: 250
The circle dataset tests algorithms on non-linearly separable data where classes form concentric patterns.
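The concentric structure can be checked numerically by measuring each point's distance from the origin. The sketch below does this for the same dataset; the radius feature and the 0.75 threshold are illustrative assumptions, chosen as the midpoint between the outer radius of 1 and the inner radius set by factor=0.5.

```python
import numpy as np
from sklearn.datasets import make_circles

# Same dataset as in the example above
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=42)

# Distance of every point from the origin
radius = np.linalg.norm(X, axis=1)

# Label 0 is the outer circle, label 1 the inner circle
mean_outer = radius[y == 0].mean()
mean_inner = radius[y == 1].mean()
print(f"Mean radius, label 0 (outer): {mean_outer:.2f}")
print(f"Mean radius, label 1 (inner): {mean_inner:.2f}")

# A single threshold on the radius separates the classes,
# even though no straight line in (Feature 1, Feature 2) can
predicted = (radius < 0.75).astype(int)
threshold_acc = (predicted == y).mean()
print(f"Accuracy of radius threshold: {threshold_acc:.3f}")
```

This shows why the data is hard for linear models in the original two features but becomes simple once a suitable non-linear feature, here the radius, is available.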
Comparison of Dataset Generators
| Function | Pattern | Best For | Key Parameters |
|---|---|---|---|
| make_blobs | Gaussian clusters | Clustering algorithms | centers, cluster_std |
| make_moons | Crescent shapes | Non-linear classification | noise |
| make_circles | Concentric circles | Non-linear classification | noise, factor |
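To make the role of a key parameter concrete, the following sketch generates blobs with two different cluster_std values and measures the spread actually observed within each cluster. The specific values 0.5 and 2.0, the sample count, and the random_state are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

spreads = {}
for std in (0.5, 2.0):
    X, y = make_blobs(n_samples=600, centers=3, cluster_std=std, random_state=0)

    # Average the per-feature standard deviation inside each cluster
    per_cluster = [X[y == k].std(axis=0).mean() for k in np.unique(y)]
    spreads[std] = float(np.mean(per_cluster))

    print(f"cluster_std={std}: observed within-cluster spread = {spreads[std]:.2f}")
```

The observed spread should track the requested cluster_std closely. The noise parameter of make_moons and make_circles plays a similar role by adding Gaussian jitter to the points.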
Conclusion
Scikit-learn's dataset generators provide a convenient way to create synthetic data for testing ML algorithms. Use make_blobs for clustering problems, and make_moons or make_circles for non-linear classification problems.
