Unzipping Files in Python: Practical Patterns for 2026

I still remember the first time a production job failed because a ZIP file contained a nested folder with the same filename as an existing log. The extraction code “worked,” but the output was wrong, and we didn’t catch it until a customer sent screenshots. That moment taught me two things: unzipping is not just a trivial utility step, and the defaults in Python can be both helpful and a little dangerous. If you move files around in pipelines, build tools, data ingestion, ML training, or even game assets, you will eventually need a reliable unzip layer.

In this guide, I’ll walk you through how I unzip files in Python today. I’ll show you how the standard library’s zipfile module behaves, how to extract everything or just the files you need, and where things go sideways. I’ll also show you modern practices in 2026: safety checks, performance limits, predictable output paths, and lightweight testing. You’ll leave with runnable examples, a clear mental model of extractall() and extract(), and a short checklist you can apply on your next project.

ZIP files and why they behave the way they do

A ZIP file is a container that can hold one or many files and folders, along with metadata like timestamps and paths. It uses lossless compression, which means you get the exact same bytes back after extraction. That sounds boring, but it matters because you can store binaries, images, and datasets without corruption.

What makes ZIPs tricky is that they store paths. A ZIP can contain entries like:

  • data/2026/January/metrics.csv
  • images/icons/close.svg
  • ../oops.txt

Those paths are just strings inside the archive. If you extract without checking, you might get files written outside your target directory. That’s called path traversal. I’ve seen it used accidentally and maliciously. So even if all you need is a quick unzip, you should treat the file as untrusted unless you made it yourself.
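You can see this for yourself by building a tiny archive with a hostile-looking member name (the path and filename below are purely illustrative). Recent CPython versions sanitize such names during extraction, but the names themselves are stored verbatim:

```python
from zipfile import ZipFile

# Member names are stored as arbitrary strings; nothing stops "../" from appearing
evil_zip = "/tmp/evil-demo.zip"
with ZipFile(evil_zip, "w") as zf:
    zf.writestr("../oops.txt", "this name points outside the target")

with ZipFile(evil_zip, "r") as zf:
    names = zf.namelist()
print(names)  # → ['../oops.txt']
```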

The Python standard library gives you zipfile.ZipFile, which is a thin wrapper around the ZIP format. It can list contents, inspect entries, read data without extracting, or extract to disk. The two most-used methods are:

  • extractall(): unpack everything.
  • extract(): unpack a single member.

You can do a surprising amount with just those two if you add a safety wrapper around them.
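Before reaching for either method, it helps to look inside the archive. A minimal sketch (the archive contents here are made up for the demo):

```python
from zipfile import ZipFile

# Build a small archive so the example is self-contained
demo_zip = "/tmp/list-demo.zip"
with ZipFile(demo_zip, "w") as zf:
    zf.writestr("data/metrics.csv", "a,b\n1,2\n")
    zf.writestr("images/icon.svg", "<svg/>")

with ZipFile(demo_zip, "r") as zf:
    names = zf.namelist()              # plain member paths
    for info in zf.infolist():         # richer per-member metadata
        print(info.filename, info.file_size, info.date_time)
```

namelist() is enough for a quick look; infolist() gives you ZipInfo objects with sizes and timestamps, which the later filtering examples rely on.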

The simplest path: extract all files with extractall()

When I know a ZIP is trusted, I start with extractall(). It unpacks every file and folder and preserves the folder structure. If you don’t pass a path, it extracts into the current working directory. I almost always pass an explicit path to avoid surprises.

Here’s a full example you can run:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/databundle.zip")
output_dir = Path("/tmp/databundle")

# Ensure the output directory exists
output_dir.mkdir(parents=True, exist_ok=True)

with ZipFile(zip_path, "r") as zf:
    zf.extractall(path=output_dir)

That’s it for the simplest case. But remember: extractall() will happily write whatever paths the archive contains. If the ZIP is from outside your system, you need a safety check. I’ll show a safe version later.

The main things to notice:

  • extractall() does not delete the ZIP. You decide when to clean up.
  • It preserves nested folders.
  • It will overwrite files without warning if names collide.

If you want to guard against overwrites, you need a check before extraction or a separate temp directory. I often extract into a new timestamped directory, then move or merge what I need after validation.
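The timestamped-directory idea can be sketched in a few lines (the base path is illustrative):

```python
from datetime import datetime, timezone
from pathlib import Path

# A fresh, timestamped target keeps separate runs from colliding
stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
run_dir = Path("/tmp/extract-runs") / stamp
run_dir.mkdir(parents=True, exist_ok=False)  # fail loudly if the name collides
print(run_dir)
```

With exist_ok=False, a collision raises FileExistsError instead of silently merging two runs' outputs.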

Targeted extraction with extract() and why it matters

Sometimes you only need one file inside a ZIP. For example, a data vendor might ship a bundle with multiple CSVs, but your pipeline only needs the day’s summary. In those cases, extract() is the right tool.

extract() expects a single member, either a name or a ZipInfo object. You can list names with namelist() or richer ZipInfo entries with infolist(). I prefer infolist() because it gives me file size and timestamp metadata.

Here’s a focused example:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/databundle.zip")
output_dir = Path("/tmp/extracted")
output_dir.mkdir(parents=True, exist_ok=True)

with ZipFile(zip_path, "r") as zf:
    # Pick a member by exact name
    target_member = "reports/2026-01-10-summary.csv"
    zf.extract(member=target_member, path=output_dir)

A few practical notes I’ve learned:

  • Member names use forward slashes, even on Windows. Don’t join with Path for the input.
  • If you pass a member name that does not exist, zipfile raises KeyError.
  • It preserves the internal path, so the file will be at output_dir/reports/2026-01-10-summary.csv.

If you want to flatten the folder structure, extract() won’t do it for you. You can read the file as bytes and write it yourself, which I’ll show in the practical patterns section.

A safe extraction wrapper you can reuse

The most important part of unzipping is safety. Here’s the threat model: a ZIP can contain paths like ../../etc/passwd or absolute paths like /var/log/app.log. Recent CPython versions strip leading slashes and “..” components inside extract() and extractall(), but that behavior is easy to forget about, and it doesn’t help older runtimes or other ZIP readers in your stack. An explicit check on the resolved path of each member is cheap and makes the guarantee visible in your code.

This pattern is short, readable, and safe:

from zipfile import ZipFile
from pathlib import Path

def safe_extractall(zip_path: Path, output_dir: Path) -> list[Path]:
    """Extract all files while blocking path traversal.

    Returns a list of extracted file paths.
    """
    output_dir = output_dir.resolve()
    extracted = []
    with ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            member_name = info.filename
            target_path = (output_dir / member_name).resolve()
            if not target_path.is_relative_to(output_dir):
                raise ValueError(f"Unsafe path in zip: {member_name}")
            zf.extract(info, path=output_dir)
            # Directories are created implicitly; report files only
            if not info.is_dir():
                extracted.append(target_path)
    return extracted

# Usage
zip_path = Path("/tmp/databundle.zip")
output_dir = Path("/tmp/databundle_safe")
output_dir.mkdir(parents=True, exist_ok=True)

files = safe_extractall(zip_path, output_dir)
print(f"Extracted {len(files)} files")

This is the simplest strong safety check I’ve found. It does not detect overwrites, but it guarantees that every extracted path stays inside output_dir.

If you want to also prevent overwrites, add a pre-check:

  • If target_path exists, raise an error or rename.
  • If you need “update” behavior, extract into a temp directory and perform a controlled merge.
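A minimal sketch of the first option, refusing to overwrite anything (the function name is mine, not a zipfile API):

```python
from pathlib import Path
from zipfile import ZipFile

def extract_no_overwrite(zip_path: Path, output_dir: Path) -> None:
    """Extract all members, raising instead of silently overwriting existing files."""
    with ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            target = output_dir / info.filename
            if target.exists():
                raise FileExistsError(f"Refusing to overwrite: {target}")
            zf.extract(info, path=output_dir)
```

Checking before each extract() keeps the failure mode loud; a half-finished extraction is still possible, which is another argument for extracting into a temp directory first.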

In regulated environments, I also log the list of extracted files and their sizes so we can audit outputs later.

Modern workflows: temporary directories, streaming, and large files

In 2026, I see unzip steps show up inside data pipelines, serverless functions, and AI workflows. The common issue is file size. ZIP files can be small or massive. You should decide upfront if you plan to extract to disk or read members directly.

Here are three patterns I use depending on the case:

1) Temporary extraction for batch processing

If the archive is small-to-medium and you want file system access:

from zipfile import ZipFile
from pathlib import Path
from tempfile import TemporaryDirectory

zip_path = Path("/tmp/batch.zip")

with TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)
    with ZipFile(zip_path, "r") as zf:
        zf.extractall(tmp_path)

    # Process files inside tmp_path
    for path in tmp_path.rglob("*.csv"):
        # Replace with real processing logic
        print(path.name)

# Temporary directory is cleaned up automatically

This is safe and tidy. It also reduces the risk of leaving stray files on build agents or production hosts.

2) Streaming a file without extracting

If you only need one file and you want to avoid disk writes, read it directly:

import csv
import io
from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/batch.zip")

with ZipFile(zip_path, "r") as zf:
    with zf.open("reports/2026-01-10-summary.csv") as f:
        # Wrap the binary stream in text mode for csv
        text = io.TextIOWrapper(f, encoding="utf-8")
        reader = csv.DictReader(text)
        for row in reader:
            # Process row
            pass

This avoids extraction entirely. It’s fast and safe, but you lose the benefit of file system tools and repeated access is slower because the file is read from the ZIP each time.

3) Chunked extraction for very large archives

If the archive has thousands of files or gigabytes of data, I treat it like a batch job. I extract only the file types I need and process them in chunks. I also capture metadata so I can resume after failure.

A quick idea: filter by suffix and size before extraction. You can read info.file_size from infolist().
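One way to capture that metadata is a small manifest of completed members, so a crashed job can pick up where it left off. This is a sketch under my own naming, not a zipfile feature:

```python
import json
from pathlib import Path
from zipfile import ZipFile

def extract_with_manifest(zip_path: Path, output_dir: Path, manifest: Path) -> int:
    """Extract members one at a time, recording progress so a failed run can resume.

    Returns the number of members extracted in this run.
    """
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    output_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    with ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if info.is_dir() or info.filename in done:
                continue
            zf.extract(info, path=output_dir)
            done.add(info.filename)
            manifest.write_text(json.dumps(sorted(done)))
            count += 1
    return count
```

Rewriting the manifest after every member is deliberately simple; for very large archives you might batch the writes.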

Traditional vs modern approach: a quick comparison

Here’s a short table I use when explaining unzipping practices to teams. It frames the differences clearly.

Approach           Traditional                  Modern (2026)
Extract location   Current working directory    Explicit, isolated output path
Safety checks      None                         Path traversal guard + overwrite rules
File selection     Extract all                  Filter by names and size, often streaming
Temp usage         Rare                         Temporary directories by default
Testing            Manual checks                Small automated tests around extraction

If you take one thing from this, it’s that safety checks and explicit output paths should be the default. The rest is a matter of project scale.

Common mistakes I see and how I avoid them

I review a lot of unzip code for teams. Here are the most common mistakes and how to fix them fast.

1) Extracting into the current working directory

This leads to hard-to-debug file placement. I always pass an explicit output path and create it if needed.

2) Trusting the ZIP contents

If you unzip a file from the internet or a customer, you need a traversal check. Otherwise, a path like ../../secrets.txt can escape your output directory. Use the safe extraction wrapper earlier.

3) Ignoring name collisions

ZIPs can include duplicate filenames in different folders, or even the same full path twice. zipfile will overwrite silently. If that matters, you need to check for existing paths before extraction or use a temp directory and validate.
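A quick pre-check for exact duplicate paths can be sketched like this (the helper name is mine):

```python
from collections import Counter
from zipfile import ZipFile

def duplicate_members(zip_path: str) -> list[str]:
    """Return member names that appear more than once in the archive."""
    with ZipFile(zip_path, "r") as zf:
        counts = Counter(zf.namelist())
    return [name for name, n in counts.items() if n > 1]
```

This only catches identical full paths; collisions that appear after flattening folders need the existence checks shown in the flattening example.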

4) Forgetting that names use forward slashes

On Windows, you still need to reference members with forward slashes. I keep member names as strings from namelist() or infolist() and never attempt to build them with Path.

5) Trying to unzip archives with unknown encoding

Some ZIPs use non-UTF-8 filenames. When an entry’s UTF-8 flag is not set, Python’s zipfile decodes the name as CP437, so names written in another encoding come out garbled. You can inspect and re-encode them, but it’s rarely worth the effort unless the ZIP is internal and you control it.
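If you do need to repair a name, the usual trick is to encode the garbled string back to CP437 bytes (CP437 covers all 256 byte values, so this is lossless) and decode with the encoding you believe was actually used. A sketch, with a made-up helper name:

```python
def reencode_name(garbled: str, encoding: str) -> str:
    """Recover a filename that zipfile decoded as CP437 but was really `encoding`.

    Encoding back to CP437 restores the original bytes; decoding with the
    right codec then yields the intended name.
    """
    return garbled.encode("cp437").decode(encoding, errors="replace")

# Round trip: a UTF-8 name mis-decoded as CP437, then repaired
original = "café.txt"
garbled = original.encode("utf-8").decode("cp437")
print(reencode_name(garbled, "utf-8"))  # → café.txt
```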

6) Assuming compressed size equals extracted size

ZIPs can compress very well. A 50 MB ZIP might expand to 500 MB. If you’re on serverless or limited storage, check total file size with infolist().
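That size check is a one-liner over infolist(), wrapped here in a hypothetical helper:

```python
from zipfile import ZipFile

def uncompressed_total(zip_path: str) -> int:
    """Sum of the extracted sizes of all members, without extracting anything."""
    with ZipFile(zip_path, "r") as zf:
        return sum(info.file_size for info in zf.infolist())
```

Compare the result against your available storage before calling extractall(); info.compress_size gives you the on-disk side of the same comparison.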

7) Forgetting to close the file

When you use ZipFile as a context manager, it closes cleanly. Don’t call close() manually if you’re already using with.

These mistakes are easy to fix once you know they’re there. I keep a small helper module with safe extraction and size checks and import it wherever I need to unzip.

Practical patterns and edge cases in real projects

Beyond the core methods, there are a few patterns that show up regularly.

Selective extraction by suffix and size

If I only need CSV and JSON files under 50 MB each, I filter before extraction:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/bundle.zip")
output_dir = Path("/tmp/filtered")
output_dir.mkdir(parents=True, exist_ok=True)

allowed_suffixes = {".csv", ".json"}
max_size = 50 * 1024 * 1024  # 50 MB

with ZipFile(zip_path, "r") as zf:
    for info in zf.infolist():
        name = info.filename
        if name.endswith("/"):
            continue
        if Path(name).suffix.lower() not in allowed_suffixes:
            continue
        if info.file_size > max_size:
            continue
        zf.extract(info, path=output_dir)

This keeps disk usage predictable and prevents accidental extraction of large binary data you don’t need.

Flattening the directory structure

Sometimes you want a flat output folder. In that case, don’t use extract(). Instead, open and write the file yourself:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/bundle.zip")
output_dir = Path("/tmp/flat")
output_dir.mkdir(parents=True, exist_ok=True)

with ZipFile(zip_path, "r") as zf:
    for info in zf.infolist():
        if info.is_dir():
            continue
        original_name = Path(info.filename).name
        target = output_dir / original_name
        # Avoid overwriting by adding a numbered suffix
        stem, suffix = target.stem, target.suffix
        counter = 1
        while target.exists():
            target = output_dir / f"{stem}-copy{counter}{suffix}"
            counter += 1
        with zf.open(info) as src, open(target, "wb") as dst:
            dst.write(src.read())

This pattern is useful when you’re building a simple import folder for non-technical users or when you are preparing files for another system that expects flat inputs.

Dealing with passwords

zipfile supports password-protected ZIPs, but only the legacy encryption method. Many modern ZIPs use stronger encryption schemes that Python’s standard library doesn’t handle. If you control the ZIP creation, use the legacy scheme if you must, or switch to a third-party library.

If you do have a ZIP that is compatible, you can pass pwd:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/secure.zip")
output_dir = Path("/tmp/secure")
output_dir.mkdir(parents=True, exist_ok=True)

password = b"my_password"  # Must be bytes

with ZipFile(zip_path, "r") as zf:
    zf.extractall(path=output_dir, pwd=password)

If you see a RuntimeError about bad password, it’s either wrong or the archive is using an encryption format zipfile can’t read. In that case, I use a specialized library or ask for a different archive format.

Reading large files efficiently

If you stream a large file out of a ZIP, reading it all at once is not ideal. Use a buffer:

from zipfile import ZipFile
from pathlib import Path

zip_path = Path("/tmp/bigbundle.zip")
output_path = Path("/tmp/output.bin")
buffer_size = 1024 * 1024  # 1 MB

with ZipFile(zip_path, "r") as zf:
    with zf.open("big/file.bin") as src, open(output_path, "wb") as dst:
        while True:
            chunk = src.read(buffer_size)
            if not chunk:
                break
            dst.write(chunk)

This keeps memory usage low even for very large files.

Performance notes you can actually use

I try to keep performance advice practical. Here’s what I see most often in real systems:

  • For small archives, extraction typically takes 10–50 ms on modern SSDs.
  • Medium archives (tens of MB, hundreds of files) often land in the 200–800 ms range.
  • Large archives (hundreds of MB and thousands of files) can take multiple seconds, sometimes 5–20 seconds depending on disk and CPU.

The biggest factors are:

  • File count: many small files cost more than a few large ones.
  • Disk speed: SSD vs networked storage is a huge gap.
  • Compression ratio: heavier compression costs more CPU to decompress.

If you want to speed things up, the best change is often “extract fewer files” or “stream what you need.” I’ve seen pipelines cut their unzip time in half just by filtering to the relevant suffixes before extraction.

I also avoid parallel extraction unless I’ve tested it. A shared ZipFile handle isn’t safe to read from multiple threads without locking, and disk I/O becomes the bottleneck quickly anyway.

When you should not unzip

There are times when unzipping is not the right call.

  • If you only need a single file from a large archive, read it with zf.open().
  • If you’re in a serverless function with tight storage limits, streaming is safer.
  • If the archive is untrusted and you don’t have time to implement safety checks, avoid extraction and instead ask the provider for a different delivery method.
  • If you need to process the same archive many times, consider caching extracted files and validating checksums.

I’ve worked on systems where we kept the original ZIP and extracted files into a cache keyed by a hash of the archive. That way, repeated jobs were fast and predictable.
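A cache key like that is just a streamed hash of the archive bytes. A sketch with an illustrative helper name:

```python
import hashlib
from pathlib import Path

def archive_cache_key(zip_path: Path) -> str:
    """SHA-256 of the archive's bytes, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(zip_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Extract into a directory named after the key; if that directory already exists and validates, skip the unzip entirely.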

Testing your unzip logic without heavy setup

You don’t need a full test suite to validate extraction. I often do this lightweight approach:

  • Create a ZIP with a nested directory and a known file.
  • Run your extraction code into a temp directory.
  • Assert the file exists and has the expected content.

Here’s a small example that can live as a script:

from zipfile import ZipFile
from pathlib import Path
from tempfile import TemporaryDirectory

# Create a tiny test zip
zip_path = Path("/tmp/testbundle.zip")
with ZipFile(zip_path, "w") as zf:
    zf.writestr("data/hello.txt", "hello world")

# Extract and validate
with TemporaryDirectory() as tmp_dir:
    tmp_path = Path(tmp_dir)
    with ZipFile(zip_path, "r") as zf:
        zf.extractall(tmp_path)

    extracted = tmp_path / "data" / "hello.txt"
    assert extracted.exists()
    assert extracted.read_text() == "hello world"

It’s simple, but it catches the most common mistakes. If you already use pytest or another test runner, you can wrap the same logic inside a test function.

Key takeaways and your next steps

Unzipping files in Python is easy when the archive is trusted and small, but the moment you bring in external data or large bundles, you need a little discipline. I default to zipfile.ZipFile, and I always choose explicit output directories. When I’m working with untrusted archives, I add path traversal checks and refuse to write files outside the target folder. If the archive is large, I filter before extraction or stream the file I need instead of unpacking everything.

If you’re starting today, here’s the practical path I recommend: build a small helper module with safe_extractall(), a function that extracts a single member by name, and a size filter. Use TemporaryDirectory for short-lived jobs. Keep member names as strings from the ZIP rather than constructing them yourself. And if performance becomes a problem, reduce the number of files you extract before reaching for parallel tricks.

The best part is that these patterns are small, readable, and easy to reuse. Once you adopt them, unzipping becomes a dependable step rather than a risk. If you want, I can also help you build a tiny library around these patterns or review your current extraction code for safety and edge cases.
