Working with compressed archives like zip files is an integral part of software development. Whether it is distributing code libraries, sharing datasets or config files, leveraging compression allows faster transfers and reduced storage needs. This is where the ubiquitous zip format plays a major role.
In this comprehensive guide, we will go deep into programmatically extracting zip archives using Python.
Why Unzip Files with Python?
Here are some common use cases where a Python developer would need to unzip files:
1. Extract Configuration Data
Applications often bundle their config files and templates in a zipped folder which needs to be extracted on first run to customize the app. Python scripts can automate this unzipping process.
2. decompress Machine Learning Datasets
With data-driven technologies like machine learning getting popular, developers often deal with exchanging compressed datasets. Unzipping them in Python can facilitate easier ingestion and training.
3. Distribute Libraries and Dependencies
Python frameworks rely heavily on external libraries and packages. These are commonly distributed as compressed files to aid faster downloads. Unzipping them to integrate into projects is essential.
4. Unpack Downloaded Code Repositories
Open source libraries hosted on GitHub and similar platforms are shared as zip archives to allow cloning complete repositories with version history intact.
Let‘s now get into actually extracting archives programmatically in Python.
Unzipping Archives in Python
Python has a dedicated zipfile module to parse and extract zip files. It gives programmatic access to zip files similar to how you would open text files.
The shutil module also provides some high-level utilities to unzip files.
Let‘s go through the key functions one by one.
Opening Zip Files with ZipFile
The zipfile.ZipFile class allows you to work with a zip archive. You initalize an instance by passing the path to the target zip file:
import zipfile
zf = zipfile.ZipFile(‘files.zip‘)
This gives you a handle to the zip archive through which you can extract files or interrogate metadata. Make sure to close the handle once done:
zf.close()
Even better is to leverage Python‘s context manager syntax which auto-closes it for you:
with zipfile.ZipFile(‘files.zip‘) as zf:
# zip operations go here
# closed automatically
Extracting All Files
Once you have a ZipFile instance, you can extract all contents using:
zf.extractall(‘/extracted_files‘)
This decompresses all files retaining the original directory structures in the zip archive.
For a zip file containing:
files.zip
|__ folder1
|__ one.txt
|__ folder2
|__ two.txt
It will restore the paths:
/extracted_files
|__ folder1
|__ one.txt
|__ folder2
|__ two.txt
By default it overwrites existing files silently. You can change that behavior as discussed later.
Extracting a Single File
For surgical extraction of a specific file, use the extract() method:
zf.extract(‘folder1/one.txt‘, ‘/extracted‘)
This unpacks only the specified file from the archive to the target directory.
Understanding Zip Compression Methods
Did you know zip format supports a variety of compression algorithms under the hood? Some methods favor fast compression while others focus on maximizing space savings.
The key algorithms are:
- Store: No compression, just archiving files as is. Fastest.
- Deflate: Default method using combination of LZ77 and Huffman coding. Balances speed and compression ratio.
- BZip2: More memory intensive but achieves upto 10% better compression than Deflate. Slower.
- LZMA: Optimized for decompression speed and the best compression ratio but extremely slow to compress initially.
When unzipping zips in Python, it is helpful to know the underlying compression type used to set your expectations on extraction time, memory usage and where the bottlenecks can happen.
Fortunately, the ZipFile class allows checking the method used:
with ZipFile(‘foo.zip‘) as zf:
print(zf.compression) # Deflate or Store or BZip2 etc.
Extracting Based on User Input
Instead of hardcoding file paths for extraction, you can design interactive scripts that prompt users to choose files to unzip:
zf = ZipFile(‘archive.zip‘)
filename = input(‘Enter name of file to extract: ‘)
target_folder = input(‘Extraction directory: ‘)
zf.extract(filename, target_folder)
This adds flexibility to run extraction procedurally or select which archives to unpack.
Getting Zip File Information
The Pythonic zipfile module enables you to peek inside archives and access meta information without full extraction:
file_info = zf.getinfo(‘foo.txt‘)
print(file_info.file_size) # size in bytes
print(file_info.compress_size) # compressed size
print(file_info.date_time) # modification timestamp
You can query individual file details like this prior to extracting relevant entries.
Analyzing Compression Performance
An interesting analysis you can do programmatically is compare compression ratios across file types. The compress_size attribute gives compressed size for each file. Calculate the compression percentage as:
real_size = file_info.file_size
compressed_size = file_info.compress_size
compression_ratio = compressed_size / real_size * 100 # percentage
cRun this in a loop across all files in the archive to get average compression percentage.
This allows you to benchmark compression algorithms across datasets and filetypes.
For example, text-based files like XML and JSON compress better than media formats. You can generate nice visual plots to demonstrate this as well.
Unzipping in Memory Without Writing to Disk
So far we discussed extracting archives to physical storage. An advanced technique made possible by Python is in-memory decompression without needing temporary disk space.
The ZipFile class offers a open() method that takes a file inside the zip and returns a file-like object. You can leverage this to unzip contents into RAM through buffering:
import io
stream = io.BytesIO()
with ZipFile(‘files.zip‘) as zf:
with zf.open(‘file.txt‘) as zipped_file:
stream.write(zipped_file.read())
text = stream.getvalue() # unzipped content available Here
This unzips directly into the memory stream without needing gigabytes of temp space!
The approach comes in handy for containers and serverless environments like AWS Lambda with ephemeral storage.
Unzip Progress Tracking
For large archives, it can be useful to track extraction progress to monitor performance. Here is a snippet to get a progress indicator as files get unpacked:
import progressbar
progress = progressbar.ProgressBar(max_value=100)
with ZipFile(‘big_file.zip‘) as zf:
for index, filename in enumerate(zf.namelist()):
zf.extract(filename)
percent = index / len(zf.namelist()) * 100
progress.update(percent)
print(‘Done!‘)
Unzipping Password Protected Zips
Zips provide the ability to encrypt contents using passwords for secure transfers.
Python makes it straightforward to unzip these as well.
Here is an example to unpack a password protected archive:
password = ‘Pa$5w0rd‘
with ZipFile(‘secret.zip‘) as zf:
zf.setpassword(password)
zf.extractall()
The setpassword() method before extraction provides the passphrase and allows decompressing the files.
Encrypted zips add considerable overhead though. Avoid them if encryption is not mandatory.
Handling Collisions While Extracting
Python by default overwrites existing files silently when extracting zips. You can modify this to handle naming conflicts more gracefully.
For example, to append underscores if target paths already exist:
from pathlib import Path
with ZipFile(‘files.zip‘) as zf:
for fpath in zf.namelist():
extracted_path = Path(‘/target_dir‘, fpath)
if extracted_path.exists():
extracted_path = Path(str(extracted_path) + ‘_1‘)
zf.extract(fpath, ‘/target_dir‘)
Now files like report.pdf will become report_1.pdf avoiding accidental overwrites.
Recursively Searching Unzipped Archives
Once extraction finishes, you may need to search or index the inflated files programatically.
Here is some sample code to recursively traverse all extracted files while printing filenames that match a pattern:
import fnmatch
import os
for root, dirs, files in os.walk(‘/extracted_files‘):
for name in files:
if fnmatch.fnmatch(name, ‘*.jpg‘):
print(os.path.join(root, name))
This can help generate a manifest or audit unpacked contents as needed in your workflows.
Leveraging Python‘s shutil Module
The shutil module provides handy utilities to work with archives. Let‘s discuss its unzipping functions specifically.
Extracting Archives with unpack_archive()
The shutil library has a unpack_archive() method that can decompress multiple formats including zip:
import shutil
shutil.unpack_archive(‘project.zip‘, ‘/extract_here‘)
It automatically recognizes the zip format, opens it and extracts all files to the specified folder.
You can explicitly pass the format too:
shutil.unpack_archive(‘files.zip‘, ‘/extracted‘, ‘zip‘)
But Python auto detects zip extensions so it is unncessary.
This utility abstracts away the lower level details and gives a clean one-liner to unzip archives.
shutil vs zipfile: Which is Better?
So should you prefer shutil over zipfile?
Here is a comparison:
- zipfile offers richer functionality like per-file info, password protection etc. But needs more code.
- shutil is easier to use but lacks advanced operations.
In essence:
- Use zipfile when you need more control programmatically.
- Use shutil for basic everyday extraction convenience.
So leverage both as per your specific requirements.
Real World Examples
Let‘s go through some real world code snippets showcasing usage of Python‘s unzipping capabilities.
1. Unzipping Downloaded Datasets
Machine learning models rely on large datasets for training. It is common to obtain compressed archives of sample data online hosted on sites like Kaggle.
Here is how you can fetch and extract such datasets automatically:
import requests
import zipfile
import shutil
url = ‘https://filesamples.com/samples/code/sample.zip‘
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(‘data/‘)
The script:
- Downloads the zip file from URL using requests
- Creates an in-memory ZipFile object from response content
- Extracts to local folder for consumption
This unzips datasets from internet to disk without writing intermediate zip archives saving storage.
2. Distributing Apps as Zips
Python apps and binaries are often zipped for distribution. Users then need to inflate them to local folders to launch.
Here is an app skeleton with unzipping built-in:
import sys
import shutil
import custom_app
zip_name = sys.argv[0] # get executable ZIP filename
# extract self to site-packages folder
shutil.unpack_archive(zip_name, ‘$SITE_PACKAGES‘, ‘zip‘)
# now launch app from extracted files
custom_app.main()
Save it as a self-extracting executable .pyz zip file. When launched, it automatically unzips itself to standard site-packages path and runs!
3. Unzipping Configs and Dependencies
Complex programs rely on external configs and libraries. A common pattern is the distributed zip bundle containing:
app.zip
|__ config_files/
|__ library_dependencies/
|__ main.py
Here is how main.py can handle setup on first run:
import os
import zipfile
if not os.path.exists(‘config‘):
with zipfile.ZipFile(‘app.zip‘) as z:
z.extractall() # unzip bundle
import config
import libraries
# continue execution
It examines paths, unzips if missing and loads dependencies dynamically!
Going Beyond Zip Files
While zip is the most popular, Python can also handle other archive formats easily. This includes:
Tar Files:
import tarfile
tar = tarfile.open(‘files.tar‘, ‘r‘)
tar.list() # print files
tar.extractall() # unzip
tar.close()
GZip:
import gzip
with gzip.open(‘file.txt.gz‘, ‘rb‘) as f:
text = f.read() # uncompressed content
print(text)
BZip2, 7Z etc work similarly. Look into Python‘s bz2, lzma modules for those.
So your skills with zip translate well across archive formats!
Conclusion
We took a comprehensive look into the various approaches to programmatically unzip files in Python. The key takeways are:
- Use
zipfilemodule for handling zip archives - Leverage
ZipFileclass to extract contents - Apply
extractall()andextract()methods to decompress files - Fetch metadata like compression ratio without full extraction
- Track unzip progress for large archives
- Password protect zips files using
setpassword() - The
shutilmodule offers easier utilities throughunpack_archive() - Balance between features and convenience choosing between
zipfileandshutil
Unzipping compressed files is indispensable whether you are distributing code packages, exchanging data or configuring applications. Python makes it pleasantly easy.
I hope you enjoyed this detailed guide! Let me know if you have any other questions.


