Creating a Dataframe using CSV files
A DataFrame is a two-dimensional data structure in Python's pandas library, similar to a spreadsheet. CSV files are one of the most common ways to store tabular data. This article demonstrates how to create DataFrames from CSV files and perform essential data operations.
What are DataFrames and CSV Files?
A DataFrame is a two-dimensional, size-mutable, tabular data structure with columns of potentially different types. It's similar to a spreadsheet or SQL table, commonly used for data analysis in Python.
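To make "columns of potentially different types" concrete, the short sketch below builds a small DataFrame from a dictionary and prints the type pandas assigns to each column. The Watched column and its values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Illustrative data: each column holds a different kind of value
df_types = pd.DataFrame({
    "Title": ["The Godfather", "12 Angry Men"],  # text
    "Year": [1972, 1957],                         # integers
    "Watched": [True, False],                     # booleans (hypothetical column)
})

# One dtype per column
print(df_types.dtypes)

# shape is a (rows, columns) tuple
print(df_types.shape)
```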
A CSV (Comma-Separated Values) file stores data in tabular format, with each row representing a record and columns separated by commas. CSV files are widely supported and easy to work with across different applications.
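Because a CSV file is plain text, it can be inspected without pandas at all. As a rough illustration of "one row per record", the sketch below parses a small in-memory CSV string with Python's standard csv module. The sample text is made up for this example.

```python
import csv
import io

# A tiny CSV document: the first line is the header, each later line is one record
raw_text = "Title,Year\nThe Godfather,1972\n12 Angry Men,1957\n"

# csv.DictReader maps every row to a dict keyed by the header names
records = list(csv.DictReader(io.StringIO(raw_text)))

for record in records:
    # Values are plain strings here; pandas would infer numeric types instead
    print(record["Title"], record["Year"])
```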
Reading CSV Files into DataFrames
Use pandas' read_csv() function to load CSV data into a DataFrame:
import pandas as pd
import io
# Create sample CSV data
csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96"""
# Read CSV from string (simulating file read)
df = pd.read_csv(io.StringIO(csv_data))
print(df)
Title Year Genre Runtime
0 The Shawshank Redemption 1994 Drama 142
1 The Godfather 1972 Crime 175
2 The Dark Knight 2008 Action 152
3 12 Angry Men 1957 Drama 96
Syntax
import pandas as pd
df = pd.read_csv('filename.csv')
The read_csv() function accepts many optional parameters, such as sep (also available under the alias delimiter), encoding, and header, to customize how the file is parsed.
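As a minimal sketch of how these parameters interact, the example below reads semicolon-separated text that has no header row. The input string and the column names passed to names are assumptions made for this illustration. The encoding parameter is omitted because it only matters when reading bytes from a file or URL, not from an in-memory string.

```python
import io
import pandas as pd

# Hypothetical input: semicolon-separated values, no header row
raw = "The Godfather;1972;175\n12 Angry Men;1957;96\n"

df_custom = pd.read_csv(
    io.StringIO(raw),
    sep=";",                             # field separator (delimiter is an alias)
    header=None,                         # the first line is data, not column names
    names=["Title", "Year", "Runtime"],  # supply the column names ourselves
)
print(df_custom)
```

When reading a real file, an argument such as encoding="utf-8" can be passed to the same call.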
Exploring DataFrames
Basic DataFrame Information
import pandas as pd
import io
csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))
# View first few rows
print("First 3 rows:")
print(df.head(3))
print("\nDataFrame shape:")
print(df.shape)
print("\nSummary statistics:")
print(df.describe())
First 3 rows:
Title Year Genre Runtime
0 The Shawshank Redemption 1994 Drama 142
1 The Godfather 1972 Crime 175
2 The Dark Knight 2008 Action 152
DataFrame shape:
(5, 4)
Summary statistics:
Year Runtime
count 5.000000 5.000000
mean 1985.000000 143.800000
std 20.273135 29.295051
min 1957.000000 96.000000
25% 1972.000000 142.000000
50% 1994.000000 152.000000
75% 1994.000000 154.000000
max 2008.000000 175.000000
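By default, describe() summarizes only the numeric columns, which is why Title and Genre do not appear above. A few other inspection calls can fill that gap. The sketch below, using the same sample data, checks the inferred column types, counts missing values, and tallies the genres.

```python
import io
import pandas as pd

csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))

# Column types inferred by read_csv
print(df.dtypes)

# Number of missing values in each column
missing = df.isna().sum()
print(missing)

# How often each genre appears
genre_counts = df["Genre"].value_counts()
print(genre_counts)
```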
Selecting Columns
# Select specific columns
subset = df[['Title', 'Genre']]
print(subset)
Title Genre
0 The Shawshank Redemption Drama
1 The Godfather Crime
2 The Dark Knight Action
3 12 Angry Men Drama
4 Pulp Fiction Crime
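The number of brackets changes what comes back. As a small sketch on the same sample data: a single column name returns a Series, a list of names returns a DataFrame, and loc selects rows and columns by label in one step.

```python
import io
import pandas as pd

csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))

# A single column name returns a one-dimensional Series
titles = df["Title"]

# A list of column names returns a DataFrame
subset = df[["Title", "Genre"]]

# loc selects by label; the row slice 0:1 includes both endpoints
first_two = df.loc[0:1, ["Title", "Year"]]

print(type(titles).__name__)
print(type(subset).__name__)
print(first_two)
```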
Manipulating DataFrames
Sorting Data
# Sort by Year in descending order
sorted_df = df.sort_values('Year', ascending=False)
print(sorted_df)
Title Year Genre Runtime
2 The Dark Knight 2008 Action 152
0 The Shawshank Redemption 1994 Drama 142
4 Pulp Fiction 1994 Crime 154
1 The Godfather 1972 Crime 175
3 12 Angry Men 1957 Drama 96
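Two movies share the year 1994, so their relative order depends on the sort alone. One way to make the order explicit is to sort on a second column as a tie-breaker. The sketch below sorts by Year descending and then by Runtime ascending, and resets the index so that row labels follow the new order.

```python
import io
import pandas as pd

csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))

# Newest first; within the same year, shortest runtime first
sorted_df = df.sort_values(["Year", "Runtime"], ascending=[False, True])

# Replace the original row labels with 0..n-1 in the new order
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```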
Filtering Data
# Filter movies by genre
crime_movies = df[df['Genre'] == 'Crime']
print(crime_movies)
Title Year Genre Runtime
1 The Godfather 1972 Crime 175
4 Pulp Fiction 1994 Crime 154
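Filters can also combine several conditions. The sketch below, on the same sample data, uses & for a logical AND (each condition needs its own parentheses) and isin() to match any value from a list. The 160-minute threshold is an arbitrary choice for illustration.

```python
import io
import pandas as pd

csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))

# Crime movies longer than 160 minutes (both conditions must hold)
long_crime = df[(df["Genre"] == "Crime") & (df["Runtime"] > 160)]
print(long_crime)

# Movies whose genre is either Drama or Action
drama_or_action = df[df["Genre"].isin(["Drama", "Action"])]
print(drama_or_action)
```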
Grouping Data
# Group by Genre and calculate mean runtime
genre_stats = df.groupby('Genre')['Runtime'].mean()
print(genre_stats)
Genre
Action 152.0
Crime 164.5
Drama 119.0
Name: Runtime, dtype: float64
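A group can be summarized with more than one statistic at a time. The sketch below uses named aggregation in agg(); the result column names movies, mean_runtime, and newest are chosen for this example only.

```python
import io
import pandas as pd

csv_data = """Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
12 Angry Men,1957,Drama,96
Pulp Fiction,1994,Crime,154"""
df = pd.read_csv(io.StringIO(csv_data))

# Each keyword is an output column: (source column, aggregation function)
genre_summary = df.groupby("Genre").agg(
    movies=("Title", "count"),
    mean_runtime=("Runtime", "mean"),
    newest=("Year", "max"),
)
print(genre_summary)
```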
Writing DataFrames to CSV Files
Save your processed DataFrame back to a CSV file using to_csv():
# Create a modified DataFrame
df_modified = df[df['Runtime'] > 140]
# Convert to CSV string (simulating file write)
csv_output = df_modified.to_csv(index=False)
print("CSV Output:")
print(csv_output)
CSV Output:
Title,Year,Genre,Runtime
The Shawshank Redemption,1994,Drama,142
The Godfather,1972,Crime,175
The Dark Knight,2008,Action,152
Pulp Fiction,1994,Crime,154
Syntax
# Write to CSV file
df.to_csv('output.csv', index=False)
# Write with custom separator
df.to_csv('output.csv', sep=';', index=False)
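A quick way to check a write is to read the file back. The sketch below writes a small DataFrame to a temporary directory with a semicolon separator and reloads it with the same separator. The file name and the use of tempfile are choices made for this example so that nothing is left on disk afterwards.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    "Title": ["The Godfather", "Pulp Fiction"],
    "Year": [1972, 1994],
    "Runtime": [175, 154],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")

    # index=False keeps the row labels out of the file
    df.to_csv(path, sep=";", index=False)

    # The reader must use the same separator as the writer
    restored = pd.read_csv(path, sep=";")

print(restored)
```

If the separator is not passed to read_csv(), each line would be parsed as a single column, so the reading and writing calls should agree on it.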
Common Operations Summary
| Operation | Function | Purpose |
|---|---|---|
| Read CSV | pd.read_csv() | Load CSV into DataFrame |
| View Data | df.head() | Show first few rows |
| Get Info | df.shape | Get dimensions |
| Sort | df.sort_values() | Sort by column(s) |
| Filter | df[condition] | Filter rows |
| Save CSV | df.to_csv() | Export DataFrame |
Conclusion
DataFrames provide a powerful way to work with CSV data in Python. Use pd.read_csv() to load data, explore it with head() and describe(), and manipulate it with sorting, filtering, and grouping operations. Save your results with to_csv() for future use.
