- 1. Handling Missing Values
- 2. Data visualization
- 3. Handling Outliers
- 4. Multicollinearity Detection & Handling
- Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques.
- It is used to discover trends, patterns, or to check assumptions with the help of statistical summary and graphical representations.
- It can occur when no information is provided for one or more items or for a whole unit.
- For example, suppose different users being surveyed choose not to share their income, or not to share their address; in this way, many datasets end up with missing values.
- Missing Data is a very big problem in real-life scenarios.
- Missing data is also referred to as NA (Not Available) values in pandas.
- The source of missing data can be very different and here are just a few examples:
- A value is missing because it was forgotten or lost or not stored properly.
- For a certain observation, the value of the variable does not exist.
- The value can't be known or identified.
- One of the most important questions you can ask yourself to help figure this out is: is a value missing because it wasn't recorded, or because it doesn't exist?
- If a value is missing because it doesn't exist (like the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try and guess what it might be.
- These values you probably do want to keep as NaN.
- On the other hand, if a value is missing because it wasn't recorded, then you can try to guess what it might have been based on the other values in that column and row.
- This is called imputation.
- There are several useful functions for detecting, removing, and replacing null values in a Pandas DataFrame:
isnull(), notnull(), dropna(), fillna(), replace(), interpolate()
import numpy as np
import pandas as pd
# passing a dictionary in order to make a DataFrame
df = pd.DataFrame({'age': [6, 7, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1998-04-25'),
                            pd.Timestamp('1940-05-27')],
                   'name': ['Alfred', 'Spiderman', ''],
                   'toy': [None, 'Spidertoy', 'Joker']})
df.head()

   age       born       name        toy
0  6.0        NaT     Alfred       None
1  7.0 1998-04-25  Spiderman  Spidertoy
2  NaN 1940-05-27                 Joker
- The isna() function is used to detect missing values; the name is short for "is NA" (Not Available), and it is an alias of isnull().
Series.isna(self)
- Returns: Series - Mask of bool values for each element in Series that indicates whether an element is an NA value.
df.isna()

     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
- How many missing(NA) values each column has
df.isna().sum()

age     1
born    1
name    0
toy     1
dtype: int64
- Alternatively, you can call the mean() method after isnull() to see the fraction of missing values in each column.
df.isnull().mean()

age     0.333333
born    0.333333
name    0.000000
toy     0.333333
dtype: float64
- The notna() function is used to detect existing (non-missing) values.
Series.notna(self)
- Returns: Series- Mask of bool values for each element in Series that indicates whether an element is not an NA value.
# Continuation of above DataFrame
df.notna()

     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
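A common use of notna() is boolean filtering: keeping only the rows where a given column has a value. A minimal sketch on the same toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [6, 7, np.nan],
                   'born': [pd.NaT, pd.Timestamp('1998-04-25'),
                            pd.Timestamp('1940-05-27')],
                   'name': ['Alfred', 'Spiderman', ''],
                   'toy': [None, 'Spidertoy', 'Joker']})

# keep only the rows where 'age' is present
with_age = df[df['age'].notna()]
print(with_age['name'].tolist())  # ['Alfred', 'Spiderman']
```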
- If you're in a hurry or don't have a reason to figure out why your values are missing, one option you have is to just remove any rows or columns that contain missing values.
- To drop rows with missing values, Pandas does have a handy function, dropna() to help you do this.
- The dropna() function is used to return a new object with missing values removed.
- Returns: Series or DataFrame - the caller with "NA" entries dropped from it.
df.dropna()

   age       born       name        toy
1  7.0 1998-04-25  Spiderman  Spidertoy

df.dropna(axis=1)

        name
0     Alfred
1  Spiderman
2
- Note: an empty string ('') or a blank space is not considered 'NaN', 'None', or 'NA'.
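Because of this, blank strings survive dropna(). One workaround, assuming blanks really do mean "missing" in your data, is to convert them to NaN first with replace():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Alfred', 'Spiderman', '']})

# '' is a real string, so dropna() keeps all three rows
print(len(df.dropna()))   # 3

# convert blanks to NaN first, then drop
cleaned = df.replace('', np.nan).dropna()
print(len(cleaned))       # 2
```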
- We can use the Pandas fillna() function to fill in missing values in a dataframe.
- One option we have is to specify what we want the "NaN" values to be replaced with.
- The fillna() function is used to fill NA/NaN values using the specified method.
- Returns: Series- Object with missing values filled.
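The fillna() examples below use a different frame with columns P through S. Its construction is not shown in the original, so the following is a hypothetical reconstruction that matches the printed output:

```python
import numpy as np
import pandas as pd

# hypothetical reconstruction of the frame used in the fillna() examples
df = pd.DataFrame({'P': [np.nan, 3, 5, np.nan],
                   'Q': [2, 4, np.nan, 4],
                   'R': [np.nan] * 4,
                   'S': [0, 1, 6, 5]})
print(df)
```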
df.head()

     P    Q   R  S
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  5.0  NaN NaN  6
3  NaN  4.0 NaN  5

df.fillna(0)

     P    Q    R  S
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  5.0  0.0  0.0  6
3  0.0  4.0  0.0  5

df.fillna({'P': 0, 'Q': 1, 'R': 2, 'S': 3}, limit=2)

     P    Q    R  S
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  5.0  1.0  NaN  6
3  0.0  4.0  NaN  5
# Forward
df.fillna(method='ffill')

     P    Q   R  S
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  5.0  4.0 NaN  6
3  5.0  4.0 NaN  5

# Backward
df.fillna(method='bfill')

     P    Q   R  S
0  3.0  2.0 NaN  0
1  3.0  4.0 NaN  1
2  5.0  4.0 NaN  6
3  NaN  4.0 NaN  5
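Note that in recent pandas versions fillna(method=...) is deprecated; DataFrame.ffill() and DataFrame.bfill() do the same job directly. A sketch on the same P-S frame (reconstructed here, since its construction is not shown):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'P': [np.nan, 3, 5, np.nan],
                   'Q': [2, 4, np.nan, 4],
                   'R': [np.nan] * 4,
                   'S': [0, 1, 6, 5]})

forward = df.ffill()    # same result as fillna(method='ffill')
backward = df.bfill()   # same result as fillna(method='bfill')
print(forward['P'].tolist())   # [nan, 3.0, 5.0, 5.0]
print(backward['P'].tolist())  # [3.0, 3.0, 5.0, nan]
```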
- The interpolate() function is used to interpolate values according to different methods.
- Returns: Series or DataFrame - Returns the same object type as the caller, interpolated at some or all NaN values.
- Notes: The 'krogh', 'piecewise_polynomial', 'spline', 'pchip' and 'akima' methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index.
s.head()

0    0.0
1    2.0
2    NaN
3    5.0
dtype: float64

s.interpolate()

0    0.0
1    2.0
2    3.5
3    5.0
dtype: float64
s

0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6             3.71
7              NaN
dtype: object

s.interpolate(method='pad', limit=2)

0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6             3.71
7             3.71
dtype: object
- The Pandas dataframe.replace() function is used to replace values in a dataframe - strings, regexes, lists, dictionaries, Series, numbers, etc.
- This is a very rich function as it has many variations.
- The most powerful thing about this function is that it can work with Python regex (regular expressions).
Example 1: Replace team “Boston Celtics” with “Omega Warrior” in the nba.csv file
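The examples in this part run on the nba.csv dataset. The loading step is not shown in the original; presumably something like `df = pd.read_csv("nba.csv")`. For a self-contained sketch, the rows below stand in for the file (values copied from the printed output):

```python
import pandas as pd
from io import StringIO

# stand-in for nba.csv; in the original, likely: df = pd.read_csv("nba.csv")
csv = StringIO("""Name,Team,Number,Position,Age,Height,Weight,College,Salary
Avery Bradley,Boston Celtics,0.0,PG,25.0,6-2,180.0,Texas,7730337.0
Jae Crowder,Boston Celtics,99.0,SF,25.0,6-6,235.0,Marquette,6796117.0
Shelvin Mack,Utah Jazz,8.0,PG,26.0,6-3,203.0,Butler,2433333.0
""")
df = pd.read_csv(csv)
print(df['Team'].unique())
```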
df[:10]

  Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
5 Amir Johnson Boston Celtics 90.0 PF 29.0 6-9 240.0 NaN 12000000.0
6 Jordan Mickey Boston Celtics 55.0 PF 21.0 6-8 235.0 LSU 1170960.0
7 Kelly Olynyk Boston Celtics 41.0 C 25.0 7-0 238.0 Gonzaga 2165160.0
8 Terry Rozier Boston Celtics 12.0 PG 22.0 6-2 190.0 Louisville 1824360.0
9 Marcus Smart Boston Celtics 36.0 PG 22.0 6-4 220.0 Oklahoma State 3431040.0
- We are going to replace team “Boston Celtics” with “Omega Warrior” in the ‘df’ data frame
# this will replace "Boston Celtics" with "Omega Warrior"
df.replace(to_replace="Boston Celtics", value="Omega Warrior")

  Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Omega Warrior 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Omega Warrior 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Omega Warrior 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Omega Warrior 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Omega Warrior 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
... ... ... ... ... ... ... ... ... ...
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN NaN NaN NaN NaN NaN
458 rows × 9 columns
- Example 2: Replacing more than one value at a time, using a Python list as the argument
- We are going to replace team “Boston Celtics” and “Texas” with “Omega Warrior” in the ‘df’ dataframe.
# this will replace "Boston Celtics" and "Texas" with "Omega Warrior"
df.replace(to_replace=["Boston Celtics", "Texas"],
           value="Omega Warrior")

  Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Omega Warrior 0.0 PG 25.0 6-2 180.0 Omega Warrior 7730337.0
1 Jae Crowder Omega Warrior 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Omega Warrior 30.0 SG 27.0 6-5 205.0 Boston University NaN
3 R.J. Hunter Omega Warrior 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Omega Warrior 8.0 PF 29.0 6-10 231.0 NaN 5000000.0
... ... ... ... ... ... ... ... ... ...
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 NaN 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 NaN 2900000.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 Kansas 947276.0
457 NaN NaN NaN NaN NaN NaN NaN NaN NaN
458 rows × 9 columns
- Example 3: Replace the NaN values in the data frame with the value -99999
# will replace NaN values in the dataframe with the value -99999
df.replace(to_replace=np.nan, value=-99999)

  Name Team Number Position Age Height Weight College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0 Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0 Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston University -99999.0
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0 -99999 5000000.0
... ... ... ... ... ... ... ... ... ...
453 Shelvin Mack Utah Jazz 8.0 PG 26.0 6-3 203.0 Butler 2433333.0
454 Raul Neto Utah Jazz 25.0 PG 24.0 6-1 179.0 -99999 900000.0
455 Tibor Pleiss Utah Jazz 21.0 C 26.0 7-3 256.0 -99999 2900000.0
456 Jeff Withey Utah Jazz 24.0 C 26.0 7-0 231.0 Kansas 947276.0
457 -99999 -99999 -99999.0 -99999 -99999.0 -99999 -99999.0 -99999 -99999.0
458 rows × 9 columns
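As noted above, replace() also accepts regular expressions. A minimal sketch on a small hypothetical frame (the second team name is invented for illustration): with regex=True, a pattern replaces every matching string while non-matching columns are untouched.

```python
import pandas as pd

df = pd.DataFrame({'Team': ['Boston Celtics', 'Boston Bruins', 'Utah Jazz'],
                   'Height': ['6-2', '6-6', '6-3']})

# regex=True makes replace() match patterns rather than exact values
out = df.replace(to_replace=r'^Boston.*', value='Omega Warrior', regex=True)
print(out['Team'].tolist())  # ['Omega Warrior', 'Omega Warrior', 'Utah Jazz']
```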
- Data Visualization is the process of analyzing data in the form of graphs or maps, making it a lot easier to understand the trends or patterns in the data.
- There are various types of visualizations –
- Univariate analysis: This type of data consists of only one variable. The analysis of univariate data is thus the simplest form of analysis since the information deals with only one quantity that changes. It does not deal with causes or relationships and the main purpose of the analysis is to describe the data and find patterns that exist within it.
- Bi-Variate analysis: This type of data involves two different variables. The analysis of this type of data deals with causes and relationships and the analysis is done to find out the relationship among the two variables.
- Multi-Variate analysis: When the data involves three or more variables, it is categorized under multivariate.
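As a minimal numerical sketch of these three levels (a plotting library such as matplotlib or seaborn would render the corresponding histogram, scatter plot, and pair plot; the values below are illustrative, loosely modeled on the NBA data above):

```python
import pandas as pd

df = pd.DataFrame({'age':    [25, 25, 27, 22, 29],
                   'height': [74, 78, 77, 77, 82],
                   'weight': [180, 235, 205, 185, 231]})

# univariate: describe one variable on its own
print(df['age'].describe())

# bivariate: relationship between two variables
print(df['height'].corr(df['weight']))

# multivariate: relationships among all variables at once
print(df.corr())
```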
- Outliers are data values that are far outside the rest of the observations in your dataset.
- Depending on the context, you sometimes might hear outliers referred to as anomalies.
- For example, if the age of most college-going students in a dataset is between 18 and 25, an observation of 60 for a student's age would be considered an outlier.
- Generally, outliers distort statistical results during the EDA process. A quick example: a single extreme value inflates the MEAN of a dataset, misleadingly suggesting that typical values are higher than they really are.
- The CORRELATION COEFFICIENT is highly sensitive to outliers.
- Since it measures the strength of a linear relationship between two variables, a single extreme point can distort the apparent relationship.
- Correlation is a non-resistant measure: r (the correlation coefficient) is strongly affected by outliers.
- Positive relationship: the correlation coefficient is close to 1
- Negative relationship: the correlation coefficient is close to -1
- Independence: when X and Y are independent, the correlation coefficient is close to zero (0)
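A quick sketch of that sensitivity: appending a single outlier to an otherwise perfectly linear dataset drags r well below 1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x                       # perfectly linear: r = 1.0
r_clean = np.corrcoef(x, y)[0, 1]

# append one outlier far from the 2*x line
x_out = np.append(x, 6.0)
y_out = np.append(y, 60.0)
r_out = np.corrcoef(x_out, y_out)[0, 1]

print(round(r_clean, 3), round(r_out, 3))  # r drops noticeably with the outlier
```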
- Outliers in some cases can be useful for detection of abnormal activities.
- For instance, if a person accesses her online bank account from a specific location 95% of the time and then suddenly her bank account is accessed from a geographical location far from her previous login, the new login will be treated as an outlier and can be helpful in fraud detection.
- Outliers can also occur in your dataset due to human mistakes while entering data or even a failure of a data recording device.
- In such cases, outliers can distort the distribution of data and convey erroneous information. If not handled, this can affect the performance of statistical algorithms like machine learning models.
- Before dropping outliers, we must analyze the dataset with and without them to better understand their impact on the results.
- If an outlier is obviously due to incorrect entry or measurement, you can certainly drop it; there is no issue in that case.
- If your assumptions are affected but the results do not change, you may drop the outlier straight away.
- If the outlier affects both your assumptions and your results, then without question drop it and proceed with your further steps.
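The guidance above presupposes that you can find the outliers first. One common detection rule (a sketch, not the only option) is the 1.5×IQR fence: flag any value more than 1.5 interquartile ranges outside the middle 50% of the data.

```python
import pandas as pd

# hypothetical student ages, with one outlier
ages = pd.Series([18, 19, 20, 21, 22, 23, 24, 25, 60])

q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
kept = ages[(ages >= lower) & (ages <= upper)]
print(outliers.tolist())  # [60]
```

Whether to actually drop `outliers` then follows the decision rules above.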

