- The Curse of Dimensionality
- What is Dimensionality Reduction?
- Dimensionality Reduction benefits
- Principal Component Analysis (PCA)
The curse of dimensionality is a problem that arises when we work with high-dimensional data, that is, data with a large number of features.
The term covers all the phenomena that appear with high-dimensional data, which most often have unfortunate consequences for the behaviour and performance of learning algorithms.
The dimensionality of a dataset is simply its number of features, i.e., the number of columns.
In machine learning, even a small increase in dimensionality requires a large increase in the volume of data in order to maintain a similar level of performance.
The majority of machine learning algorithms rely on distance calculations for model building, and as the number of dimensions increases, building a model becomes more and more computation-intensive.
For one dimension (1D): to calculate the distance between two points in a single dimension, like two points on the number line, we subtract the coordinate of one point from the other and take the magnitude: d = |x2 − x1|.
For two dimensions (2D): the distance between two points in the plane follows the Pythagorean formula: d = √((x2 − x1)² + (y2 − y1)²).
For n dimensions (nD): the same formula generalises, summing the squared differences over all n coordinates: d = √(Σᵢ (qᵢ − pᵢ)²).
That is the effort required to compute the distance between a single pair of points. Now imagine the number of calculations involved across all pairs of data points in a dataset.
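The three cases above are really one formula. As a minimal sketch (the function name `euclidean` is just an illustrative choice), the same code handles 1D, 2D, and nD points:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of any dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 1D: reduces to the absolute difference of coordinates
print(euclidean([3], [7]))        # 4.0
# 2D: the familiar Pythagorean formula
print(euclidean([0, 0], [3, 4]))  # 5.0
# nD: the same formula, just with more squared terms to sum
print(euclidean([1, 2, 3], [1, 2, 3]))  # 0.0
```

Each extra dimension adds one more squared difference to the sum, which is exactly why distance computations grow more expensive as dimensionality increases.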
One more point to consider: as the number of dimensions increases, points move farther away from each other. This means that any new point we see at test time is going to be farther from our training points, which makes the model less reliable and prone to overfitting the training data.
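This "spreading out" of points can be demonstrated numerically. The sketch below (using uniform random points in a unit hypercube, a common illustration rather than anything specific to this text) measures the relative contrast of distances, i.e. how different the farthest and nearest points are; it shrinks sharply as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """Ratio (max - min) / min of distances from the origin to n random points."""
    points = rng.random((n, dim))          # uniform points in the unit hypercube
    dists = np.linalg.norm(points, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# In low dimensions, nearest and farthest points differ a lot;
# in high dimensions, all distances concentrate around the same value.
print(relative_contrast(2))
print(relative_contrast(1000))
```

When all distances look nearly the same, notions of "nearest neighbour" lose their meaning, which is exactly why distance-based models degrade in high dimensions.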
- Hard (or almost impossible) to visualise the relationships between features: as stated above, humans cannot comprehend things beyond three dimensions. So, if we have an n-dimensional dataset, the only option left is to create 2-D or 3-D graphs from it. Suppose, for simplicity, we restrict ourselves to 2-D graphs and the dataset has 1000 features. That gives a total of (1000 × 999)/2 = 499,500 possible feature pairs for 2-D graphs.
Is it humanly possible to analyse all those graphs to understand the relationship between the variables?
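The 499,500 figure is just the number of ways to choose 2 features out of 1000, which a quick check confirms:

```python
from math import comb

# Number of distinct 2-D scatter plots for n features: C(n, 2) = n*(n-1)/2
n_features = 1000
n_plots = comb(n_features, 2)
print(n_plots)  # 499500
```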
The questions that we need to ask at this point are:
- Are all the features really contributing to decision making?
- Is there a way to come to the same conclusion using a lesser number of features?
- Is there a way to combine features to create a new feature and drop the old ones?
- Is there a way to remodel features in a way to make them visually comprehensible?
The answer to all the above questions is dimensionality reduction.
- Specifically, the issues of data sparsity and the “closeness” of data points.
- It becomes very challenging to identify meaningful patterns when analyzing and visualizing the data. High dimensionality also degrades a machine learning model’s accuracy and reduces computation speed, i.e., training the model becomes much slower as the number of dimensions increases.
- Infinite features require infinite training data.
- In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances.
- Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions.
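The exponential growth is easy to see with a back-of-the-envelope calculation. Assuming (purely for illustration) that we want roughly 10 training instances per interval along each axis of a regular grid, the required sample count is 10 raised to the number of dimensions:

```python
# Samples needed to keep ~10 points per axis interval in a regular grid:
# the count grows as 10**d, i.e. exponentially in the number of dimensions d.
points_per_axis = 10
for d in (1, 2, 3, 10):
    print(d, points_per_axis ** d)
# 1 -> 10, 2 -> 100, 3 -> 1000, 10 -> 10,000,000,000
```

Already at 10 dimensions the required dataset is ten billion points, which is why "just collect more data" stops being a viable cure.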
- Dimensionality Reduction
- Regularisation
- Principal Component Analysis (PCA)
- In simple words, dimensionality reduction refers to the technique of reducing the dimension of a data feature set.
- Usually, machine learning datasets (feature sets) contain hundreds of columns (i.e., features). Picture such a dataset as a massive cloud of points, say a sphere in three-dimensional space.
- By applying dimensionality reduction, you can bring the number of columns down to a manageable count, for example transforming the three-dimensional sphere into a two-dimensional object (a circle).
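As a minimal sketch of this column-reduction idea (using plain NumPy SVD rather than any particular library's PCA implementation, and with all data synthetic), the snippet below takes 3-D points that happen to lie close to a plane and projects them down to 2 columns:

```python
import numpy as np

rng = np.random.default_rng(42)

# 200 samples in 3-D that actually lie close to a 2-D plane:
# z is (almost) a linear combination of x and y.
xy = rng.normal(size=(200, 2))
z = xy @ np.array([0.5, -0.3]) + rng.normal(scale=0.01, size=200)
X = np.column_stack([xy, z])

# Centre the data and project onto the top-2 principal directions (via SVD).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T          # 3 columns reduced to 2

print(X.shape, X2.shape)    # (200, 3) -> (200, 2)
# The discarded third singular value is tiny: little information is lost.
print(S)
```

This is exactly the PCA idea discussed later in the outline: keep the directions along which the data varies most, and drop the rest.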
- There are two main approaches to reducing dimensionality:
- Projection.
- Manifold Learning.
- In most real-world problems, training instances are not spread out uniformly across all dimensions.
- Many features are almost constant, while others are highly correlated (in MNIST, for example, neighbouring pixels are often identical or highly correlated).
- As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space (a 2-D or 3-D subspace, for example).
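The claim above can be made concrete with a small synthetic example (all column choices here are illustrative assumptions): a dataset with six columns, several of which are constant, near-duplicates, or linear combinations of others, really only spans about two independent directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Build a 6-feature dataset where only 2 features carry real information:
a = rng.normal(size=300)
b = rng.normal(size=300)
X = np.column_stack([
    a,                                        # informative
    b,                                        # informative
    2 * a + 0.001 * rng.normal(size=300),     # almost perfectly correlated with a
    a - b,                                    # exact linear combination of a and b
    np.full(300, 5.0),                        # constant feature
    np.full(300, 5.0) + 0.0001 * rng.normal(size=300),  # almost constant
])

# Despite 6 columns, the centred data has numerical rank ~2:
Xc = X - X.mean(axis=0)
print(np.linalg.matrix_rank(Xc, tol=0.1))  # 2 (only two independent directions)
```

Projection-based methods exploit exactly this structure: the instances live near a 2-D subspace of the 6-D feature space.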
- It eliminates noise and redundant features.
- Handles multicollinearity.
- It helps improve the model’s accuracy and performance.
- It facilitates the use of algorithms that are unsuited to high-dimensional data.
- It reduces the amount of storage space required (less data needs lesser storage space).
- It compresses the data, which reduces computation time and enables faster model training.
- It mainly benefits machine learning algorithms that are built on statistical concepts (such as distance-based methods).
- Random forests generally do not need dimensionality reduction.
- Neural networks may not perform well on very high-dimensional data, but they do not require dimensionality reduction either.
- Dimensionality reduction techniques can be categorized into two broad categories: feature selection and feature extraction.
The feature selection method aims to find a subset of the input variables (those that are most relevant) from the original dataset. Feature selection includes three strategies, namely:
- Filter strategy
- Wrapper strategy
- Embedded strategy
The feature extraction method, in contrast, transforms the original features into a smaller set of new features. Common feature extraction techniques include:
- Principal Component Analysis (PCA)
- Non-negative matrix factorization (NMF)
- Linear discriminant analysis (LDA)
- Generalized discriminant analysis (GDA)
Other commonly used dimensionality reduction techniques include:
- Missing Values Ratio
- Low Variance Filter
- High Correlation Filter
- Backward Feature Elimination
- Forward Feature Construction
- Random Forests
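Two of the simplest techniques in the list, the low variance filter and the high correlation filter, can be sketched in a few lines of NumPy (function names and thresholds here are illustrative choices, not a standard API):

```python
import numpy as np

def low_variance_filter(X, threshold=1e-3):
    """Drop columns whose variance falls below the threshold."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

def high_correlation_filter(X, threshold=0.95):
    """Drop one column from each pair with |correlation| above the threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)            # consider each column pair once
    drop = (upper > threshold).any(axis=0)
    return X[:, ~drop], ~drop

# Toy dataset: an informative column, a near-duplicate of it,
# an independent column, and a constant column.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X = np.column_stack([a,
                     a + 0.001 * rng.normal(size=100),
                     rng.normal(size=100),
                     np.full(100, 7.0)])

Xv, _ = low_variance_filter(X)           # removes the constant column
Xr, _ = high_correlation_filter(Xv)      # removes one of the near-duplicates
print(X.shape[1], Xv.shape[1], Xr.shape[1])  # 4 -> 3 -> 2
```

Running the variance filter first also avoids computing correlations on a constant column, which would be undefined.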

