Dimensionality_Reduction

Index

The Curse of Dimensionality
What is Dimensionality Reduction?
Dimensionality Reduction benefits
1. Feature selection
2. Feature extraction
Principal Component Analysis (PCA)

The Curse of Dimensionality

The curse of dimensionality is a problem that arises when we work with high-dimensional data, that is, data with a large number of features.

The term covers all the phenomena that appear with high-dimensional data and that most often have unfortunate consequences for the behavior and performance of learning algorithms.

◼️ Dimension

The dimension of the data means the number of features or columns in our dataset.

◼️ The Problem

In machine learning, a small increase in the dimensionality would require a large increase in the volume of the data in order to maintain a similar level of performance.

Issue 01: Increase in computation time

Many machine learning algorithms rely on distance calculations to build a model, and as the number of dimensions increases, building the model becomes more and more computation-intensive.

For one dimension (1D): to calculate the distance between two points in just one dimension, like two points on the number line, we subtract the coordinate of one point from the other and then take the magnitude.

The formula is: $|x_1 - x_2|$

For two dimensions (2D): to calculate the distance between two points in two dimensions, we square the difference along each axis, add the squares, and take the square root.

The formula is: $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$

For n dimensions (nD): to calculate the distance between two points in n dimensions, we add one squared difference per dimension before taking the square root.

The formula is: $\sqrt{(x_1-x_2)^2+(y_1-y_2)^2+(z_1-z_2)^2+\cdots+(n_1-n_2)^2}$

That is the effort for a single pair of points. Now imagine the number of calculations needed for every pair of points in the dataset.
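To make the cost concrete, here is a minimal sketch of the distance formulas above as one function that works for any number of dimensions, plus a count of how many pairs a dataset produces. The function names `euclidean_distance` and `pairwise_distance_count` are made up for this illustration.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # One subtraction, one squaring and one addition per dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distance_count(n_points):
    """Number of distinct point pairs, i.e. distances to compute."""
    return n_points * (n_points - 1) // 2


print(euclidean_distance([3], [7]))        # 1D: 4.0
print(euclidean_distance([0, 0], [3, 4]))  # 2D: 5.0
print(pairwise_distance_count(10_000))     # 49995000 pairs for 10,000 points
```

The work per pair grows linearly with the number of dimensions, while the number of pairs grows quadratically with the number of points, so both factors add up quickly.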

One more point to consider is that as the number of dimensions increases, points move farther away from each other. This means that any new point that arrives when we test the model is likely to be farther away from our training points. This makes the model's predictions less reliable and makes the model more prone to overfitting the training data.
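The claim that points drift apart as dimensions are added can be checked with a small simulation. This is only a sketch: it assumes points drawn uniformly at random from the unit hypercube (an assumption made for illustration, not something the notes specify), and `mean_pairwise_distance` is a hypothetical helper name.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


# The average distance keeps growing as dimensions are added.
for d in (1, 2, 10, 100, 1000):
    print(d, round(mean_pairwise_distance(50, d), 3))
```

With the same number of points, the average distance between them rises steadily with the number of dimensions, which is the sparsity effect described above.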

Hard (or almost impossible) to visualise the relationship between features: humans cannot easily comprehend more than three dimensions. So, if we have an n-dimensional dataset, the only option left is to create 2-D or 3-D graphs out of it. Let’s say, for simplicity, that we create 2-D graphs, one per pair of features. Suppose we have 1000 features in the dataset. That results in a total of (1000*999)/2 = 499500 possible feature pairs, each needing its own 2-D graph.

Is it humanly possible to analyse all those graphs to understand the relationship between the variables?
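The pair count above is simply "n choose 2". A minimal sketch, with `pair_plot_count` as a hypothetical helper name:

```python
import math


def pair_plot_count(n_features):
    """Number of distinct 2-D scatter plots, one per unordered feature pair."""
    return math.comb(n_features, 2)


print(pair_plot_count(1000))  # 499500
print(pair_plot_count(10))    # 45
```

Even a modest dataset with 10 features already yields 45 plots, so the count becomes unmanageable long before 1000 features.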

The questions that we need to ask at this point are:

Are all the features really contributing to decision making?
Is there a way to come to the same conclusion using fewer features?
Is there a way to combine features to create a new feature and drop the old ones?
Is there a way to remodel features in a way to make them visually comprehensible?

The answer to all of the above questions is dimensionality reduction.

Issue 1

The first issue is data sparsity and the loss of meaningful “closeness” between data points.

It becomes very challenging to identify meaningful patterns while analyzing and visualizing the data. High dimensionality also degrades the machine learning model’s accuracy and reduces computation speed, i.e. training the model becomes much slower as the dimensions increase.

Infinite features require infinite training.

Issue 2

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances.

Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
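The exponential growth can be illustrated with a back-of-the-envelope sketch. It assumes that "density" means a fixed number of evenly spaced training instances along each axis of the feature space, which is a simplification chosen for this example; `samples_for_density` is a hypothetical name.

```python
def samples_for_density(points_per_axis, n_dims):
    """Grid points needed to keep the same spacing along every axis."""
    return points_per_axis ** n_dims


# Keeping 10 points per axis: 10, 100, 1000, ... instances.
for d in (1, 2, 3, 10, 100):
    print(d, samples_for_density(10, d))
```

Under this assumption, every added dimension multiplies the required number of training instances by the same factor, which is why simply collecting more data stops being practical very quickly.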

How to combat the curse of dimensionality (CoD)

Dimensionality Reduction
Regularisation
Principal Component Analysis (PCA)

What is Dimensionality Reduction?

In simple words, dimensionality reduction refers to techniques for reducing the number of features (dimensions) in a dataset.
Machine learning datasets often contain hundreds of columns (i.e., features). As an analogy, picture the data as an array of points that form a sphere in three-dimensional space.
By applying dimensionality reduction, you bring the number of columns down to a manageable count, much like flattening the three-dimensional sphere into a two-dimensional object (a circle).

Approaches

There are two main approaches to reducing dimensionality:
- Projection.
- Manifold Learning.

Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions.
Many features are almost constant, while others are highly correlated (in MNIST, for example, neighboring pixels often have identical values).
As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space (a 2D or 3D space, for example).
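The idea of projection can be sketched with 3D points that lie close to a 2D plane. In this illustration the plane is known in advance: the basis vectors `U` and `V`, the noise level and all function names are assumptions made for the example. A technique such as PCA would instead estimate a suitable subspace from the data.

```python
import math
import random

# Hypothetical orthonormal basis of a 2D plane inside 3D space,
# plus the unit normal that points out of the plane.
U = (1.0, 0.0, 0.0)
V = (0.0, 1 / math.sqrt(2), 1 / math.sqrt(2))
NORMAL = (0.0, -1 / math.sqrt(2), 1 / math.sqrt(2))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(point):
    """Coordinates of a 3D point within the plane spanned by U and V."""
    return (dot(point, U), dot(point, V))


def reconstruct(coords):
    """Map 2D plane coordinates back into 3D space."""
    a, b = coords
    return tuple(a * u + b * v for u, v in zip(U, V))


def make_points(n, noise=0.01, seed=0):
    """3D points on the plane, nudged slightly along the normal."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        e = rng.uniform(-noise, noise)
        points.append(
            tuple(a * u + b * v + e * w for u, v, w in zip(U, V, NORMAL))
        )
    return points


points = make_points(100)
reduced = [project(p) for p in points]  # 3 features -> 2 features
errors = [math.dist(p, reconstruct(r)) for p, r in zip(points, reduced)]
print(max(errors))  # stays within the noise level
```

Each point is now described by two numbers instead of three, and mapping it back into 3D loses only the small off-plane noise.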

Manifold Learning