Creating a Dataframe using Excel files

A DataFrame is a two-dimensional tabular data structure in Python used for data analysis. One of the most common ways to create a DataFrame is by importing data from Excel files using the pandas library.

Why DataFrames are Important for Data Analysis

DataFrames are essential for data analysis because they provide ?

  • Easy data manipulation Two-dimensional table-like structure for organizing and manipulating complex data with missing values or different data types.

  • Efficient processing Support vectorized operations that perform computations on entire arrays rather than iterating row by row.

  • Library integration Seamless integration with pandas, NumPy, Matplotlib, and Scikit-learn for data cleaning, visualization, and machine learning.

  • Data exploration Built-in tools for filtering, sorting, grouping, and creating various visualizations.

Prerequisites

Before starting, ensure you have ?

  • Python 3.x installed

  • Required libraries: pip install pandas openpyxl

  • Basic Python knowledge

Reading Excel Files with pandas

The read_excel() function reads data from both .xls and .xlsx formats ?

Basic Excel File Reading

First, let's create a sample Excel file and read it ?

import pandas as pd

# Create sample data
data = {'Name': ['John', 'Smith', 'Alex', 'James', 'Peter'],
        'Age': [25, 30, 27, 22, 32],
        'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney']}

df = pd.DataFrame(data)

# Save to Excel file
df.to_excel('sample.xlsx', sheet_name='Sheet1', index=False)

# Read the Excel file
df_read = pd.read_excel('sample.xlsx', sheet_name='Sheet1')
print(df_read)
    Name  Age      City
0   John   25  New York
1  Smith   30    London
2   Alex   27     Paris
3  James   22     Tokyo
4  Peter   32    Sydney

Reading Specific Columns and Rows

You can specify which columns and rows to read using parameters ?

import pandas as pd

# Read only specific columns (A and B = Name and Age)
df_subset = pd.read_excel('sample.xlsx', sheet_name='Sheet1', usecols='A:B')
print("First two columns:")
print(df_subset)

print("\nSkipping first row as header:")
# Read with different header row
df_no_header = pd.read_excel('sample.xlsx', sheet_name='Sheet1', header=None)
print(df_no_header)
First two columns:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27
3  James   22
4  Peter   32

Skipping first row as header:
       0    1         2
0   Name  Age      City
1   John   25  New York
2  Smith   30    London
3   Alex   27     Paris
4  James   22     Tokyo
5  Peter   32    Sydney

Reading Multiple Sheets

To read multiple sheets from an Excel file ?

import pandas as pd

# Create multi-sheet Excel file
df1 = pd.DataFrame({'Product': ['A', 'B', 'C'], 'Price': [100, 200, 300]})
df2 = pd.DataFrame({'Product': ['D', 'E', 'F'], 'Price': [400, 500, 600]})

with pd.ExcelWriter('multi_sheet.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Quarter1', index=False)
    df2.to_excel(writer, sheet_name='Quarter2', index=False)

# Read multiple sheets
sheets_dict = pd.read_excel('multi_sheet.xlsx', sheet_name=['Quarter1', 'Quarter2'])

print("Quarter1 data:")
print(sheets_dict['Quarter1'])
print("\nQuarter2 data:")
print(sheets_dict['Quarter2'])
Quarter1 data:
  Product  Price
0       A    100
1       B    200
2       C    300

Quarter2 data:
  Product  Price
0       D    400
1       E    500
2       F    600

Data Cleaning Operations

Removing Duplicates

Remove duplicate rows using drop_duplicates() ?

import pandas as pd

# Create DataFrame with duplicates
data_with_dupes = {'Name': ['John', 'Smith', 'Alex', 'John', 'Smith'],
                   'Age': [25, 30, 27, 25, 30]}
df_dupes = pd.DataFrame(data_with_dupes)

print("Original data:")
print(df_dupes)

# Remove duplicates
df_clean = df_dupes.drop_duplicates()
print("\nAfter removing duplicates:")
print(df_clean)
Original data:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27
3   John   25
4  Smith   30

After removing duplicates:
    Name  Age
0   John   25
1  Smith   30
2   Alex   27

Handling Missing Values

Detect and handle missing values using pandas functions ?

import pandas as pd
import numpy as np

# Create DataFrame with missing values
data_missing = {'Name': ['John', 'Smith', None, 'James'],
                'Age': [25, None, 27, 22],
                'Salary': [50000, 60000, 55000, None]}
df_missing = pd.DataFrame(data_missing)

print("Data with missing values:")
print(df_missing)
print("\nMissing values count:")
print(df_missing.isnull().sum())

# Fill missing values
df_filled = df_missing.fillna({'Name': 'Unknown', 'Age': df_missing['Age'].mean(), 'Salary': 0})
print("\nAfter filling missing values:")
print(df_filled)
Data with missing values:
    Name   Age   Salary
0   John  25.0  50000.0
1  Smith   NaN  60000.0
2   None  27.0  55000.0
3  James  22.0      NaN

Missing values count:
Name      1
Age       1
Salary    1
dtype: int64

After filling missing values:
      Name   Age   Salary
0     John  25.0  50000.0
1    Smith  24.7  60000.0
2  Unknown  27.0  55000.0
3    James  22.0      0.0

Common Parameters for read_excel()

Parameter Description Example
sheet_name Specify sheet to read 'Sheet1' or 0
header Row to use as column names 0 (first row)
usecols Columns to read 'A:C' or [0,1,2]
skiprows Rows to skip 2 (skip first 2 rows)

Conclusion

Creating DataFrames from Excel files is straightforward using pandas' read_excel() function. This approach enables efficient data import, cleaning, and analysis for real-world data science projects.

Updated on: 2026-03-27T05:58:24+05:30

5K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements