Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
Creating a Dataframe using Excel files
A DataFrame is a two-dimensional tabular data structure in Python used for data analysis. One of the most common ways to create a DataFrame is by importing data from Excel files using the pandas library.
Why DataFrames are Important for Data Analysis
DataFrames are essential for data analysis because they provide ?
Easy data manipulation Two-dimensional table-like structure for organizing and manipulating complex data with missing values or different data types.
Efficient processing Support vectorized operations that perform computations on entire arrays rather than iterating row by row.
Library integration Seamless integration with pandas, NumPy, Matplotlib, and Scikit-learn for data cleaning, visualization, and machine learning.
Data exploration Built-in tools for filtering, sorting, grouping, and creating various visualizations.
Prerequisites
Before starting, ensure you have ?
Python 3.x installed
Required libraries:
pip install pandas openpyxlBasic Python knowledge
Reading Excel Files with pandas
The read_excel() function reads data from both .xls and .xlsx formats ?
Basic Excel File Reading
First, let's create a sample Excel file and read it ?
import pandas as pd
# Create sample data
data = {'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
'Age': [25, 30, 27, 22, 32],
'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']}
df = pd.DataFrame(data)
# Save to Excel file
df.to_excel('sample.xlsx', sheet_name='Sheet1', index=False)
# Read the Excel file
df_read = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
print(df_read)
Name Age City
0 John 25 New York
1 Smith 30 London
2 Alex 27 Paris
3 James 22 Tokyo
4 Peter 32 Sydney
Reading Specific Columns and Rows
You can specify which columns and rows to read using parameters ?
import pandas as pd
# Read only specific columns (A and B = Name and Age)
df_subset = pd.read_excel('sample.xlsx', sheet_name='Sheet1', usecols='A:B')
print("First two columns:")
print(df_subset)
print("\nSkipping first row as header:")
# Read with different header row
df_no_header = pd.read_excel('sample.xlsx', sheet_name='Sheet1', header=None)
print(df_no_header)
First two columns:
Name Age
0 John 25
1 Smith 30
2 Alex 27
3 James 22
4 Peter 32
Skipping first row as header:
0 1 2
0 Name Age City
1 John 25 New York
2 Smith 30 London
3 Alex 27 Paris
4 James 22 Tokyo
5 Peter 32 Sydney
Reading Multiple Sheets
To read multiple sheets from an Excel file ?
import pandas as pd
# Create multi-sheet Excel file
df1 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 200, 300]})
df2 = pd.DataFrame({'Product': ['D', 'E', 'F'], 'Price': [400, 500, 600]})
with pd.ExcelWriter('multi_sheet.xlsx') as writer:
df1.to_excel(writer, sheet_name='Quarter1', index=False)
df2.to_excel(writer, sheet_name='Quarter2', index=False)
# Read multiple sheets
sheets_dict = pd.read_excel('multi_sheet.xlsx', sheet_name=['Quarter1', 'Quarter2'])
print("Quarter1 data:")
print(sheets_dict['Quarter1'])
print("\nQuarter2 data:")
print(sheets_dict['Quarter2'])
Quarter1 data: Product Price 0 A 100 1 B 200 2 C 300 Quarter2 data: Product Price 0 D 400 1 E 500 2 F 600
Data Cleaning Operations
Removing Duplicates
Remove duplicate rows using drop_duplicates() ?
import pandas as pd
# Create DataFrame with duplicates
data_with_dupes = {'Name': ['John', 'Smith', 'Alex', 'John', 'Smith'],
'Age': [25, 30, 27, 25, 30]}
df_dupes = pd.DataFrame(data_with_dupes)
print("Original data:")
print(df_dupes)
# Remove duplicates
df_clean = df_dupes.drop_duplicates()
print("\nAfter removing duplicates:")
print(df_clean)
Original data:
Name Age
0 John 25
1 Smith 30
2 Alex 27
3 John 25
4 Smith 30
After removing duplicates:
Name Age
0 John 25
1 Smith 30
2 Alex 27
Handling Missing Values
Detect and handle missing values using pandas functions ?
import pandas as pd
import numpy as np
# Create DataFrame with missing values
data_missing = {'Name': ['John', 'Smith', None, 'James'],
'Age': [25, None, 27, 22],
'Salary': [50000, 60000, 55000, None]}
df_missing = pd.DataFrame(data_missing)
print("Data with missing values:")
print(df_missing)
print("\nMissing values count:")
print(df_missing.isnull().sum())
# Fill missing values
df_filled = df_missing.fillna({'Name': 'Unknown', 'Age': df_missing['Age'].mean(), 'Salary': 0})
print("\nAfter filling missing values:")
print(df_filled)
Data with missing values:
Name Age Salary
0 John 25.0 50000.0
1 Smith NaN 60000.0
2 None 27.0 55000.0
3 James 22.0 NaN
Missing values count:
Name 1
Age 1
Salary 1
dtype: int64
After filling missing values:
Name Age Salary
0 John 25.0 50000.0
1 Smith 24.7 60000.0
2 Unknown 27.0 55000.0
3 James 22.0 0.0
Common Parameters for read_excel()
| Parameter | Description | Example |
|---|---|---|
sheet_name |
Specify sheet to read |
'Sheet1' or 0
|
header |
Row to use as column names |
0 (first row) |
usecols |
Columns to read |
'A:C' or [0,1,2]
|
skiprows |
Rows to skip |
2 (skip first 2 rows) |
Conclusion
Creating DataFrames from Excel files is straightforward using pandas' read_excel() function. This approach enables efficient data import, cleaning, and analysis for real-world data science projects.
