frictionless[parquet] raises ModuleNotFoundError: pandas on first parquet read #1773

@dsmedia

Summary

Installing only the parquet extra is insufficient to read or write Parquet files. The parser hits ModuleNotFoundError: No module named 'pandas' the first time read_rows() is called.

Reproduction

Fully self-contained — the parquet extra already installs pyarrow, so we use it to write the test file. No other data or deps required.

python -m venv /tmp/repro && source /tmp/repro/bin/activate
pip install 'frictionless[parquet]==5.19.0'

# 1. Write a tiny parquet file using pyarrow (which the extra does provide):
python - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq
pq.write_table(pa.table({"id": [1, 2], "name": ["alice", "bob"]}), "/tmp/repro.parquet")
PY

# 2. Try to read it via frictionless — triggers the bug:
cd /tmp && python -c "from frictionless import Resource; print(Resource('/tmp/repro.parquet').read_rows())"

Output (confirmed on a clean Python 3.12 venv):

Traceback (most recent call last):
  ...
  File ".../frictionless/formats/parquet/parser.py", line 42, in read_cell_stream_create
    df = table.to_pandas(categories=control.categories or None)
         ^^^^^^^^^^^^^^^
  File "pyarrow/pandas-shim.pxi", line 50, in pyarrow.lib._PandasAPIShim._import_pandas
ModuleNotFoundError: No module named 'pandas'

frictionless.exception.FrictionlessException: [source-error] The data source has not supported or has inconsistent contents: No module named 'pandas'

Root cause

frictionless/formats/parquet/parser.py uses pandas unconditionally:

  • table.to_pandas(...) (line 42) on every read
  • TableResource(data=df, format="pandas") (line 43) on every read
  • platform.pandas.io.common.get_handle(...) (lines 32–35) on remote reads
  • source.to_pandas() (line 55) on every write

But pyproject.toml declares parquet = ["pyarrow>=14.0"], so pandas is not pulled in. Both pandas entry points are lazy (platform.pandas is resolved through @extras(name="pandas"), and PyArrow imports pandas at runtime inside to_pandas()), which defers the error to first use rather than import time. CI doesn't catch this because the hatch default env installs all extras together (e.g., frictionless[...,pandas,parquet,...]), masking the packaging gap.

The bug has been latent since PR #1260 (Oct 2022), which introduced the remote-read pandas dependency and the pandas-dataframe conversion.

Workaround

Install with both extras explicitly: pip install 'frictionless[parquet,pandas]'.
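
Because the failure only surfaces on first read or write, a small preflight check can make the missing dependency visible earlier. The following is a minimal sketch using only the standard library; the helper name missing_modules and the PARQUET_RUNTIME_MODULES tuple are hypothetical and not part of frictionless.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names from `names` that are not importable."""
    return [name for name in names if find_spec(name) is None]


# Assumption: these are the modules the parquet code path needs at runtime,
# based on the root-cause analysis in this issue.
PARQUET_RUNTIME_MODULES = ("pyarrow", "pandas")

# An empty list means both modules were found in the current environment.
print(missing_modules(PARQUET_RUNTIME_MODULES))
```

In a venv created with only the parquet extra, the printed list would be expected to contain pandas; after installing both extras it should be empty.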

Proposed fix

Option A (quick fix): Add pandas>=1.0 to the parquet extra in pyproject.toml so it matches the actual runtime surface of ParquetParser. One-line change; makes the parquet and pandas extras strictly redundant, which honestly reflects today's runtime coupling.

Option B (architectural fix): Keep the parquet extra lightweight and avoid pulling in pandas entirely by rewriting ParquetParser to read natively from PyArrow — e.g., iterating via table.to_batches() or table.to_pylist() instead of delegating to TableResource(format="pandas"). Larger change; decouples the two extras for good.
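
To make Option B more concrete, here is a rough sketch of how rows could be produced from a PyArrow table without going through pandas. It is not the actual ParquetParser code: the function name iter_cell_rows is hypothetical, and it assumes the parser wants a header row followed by one list of cells per record. It relies only on column_names, to_batches(), and to_pylist(), and it assumes column names are unique.

```python
def iter_cell_rows(table):
    """Yield a header row, then one list of cell values per record.

    `table` is expected to behave like a pyarrow.Table: it exposes
    `column_names` and `to_batches()`, and each batch exposes `to_pylist()`,
    which returns one dict per record.
    """
    names = list(table.column_names)
    yield names
    # Iterating batch by batch avoids materializing the whole table at once.
    for batch in table.to_batches():
        for record in batch.to_pylist():
            yield [record[name] for name in names]


# Usage with the repro file from this issue (requires pyarrow):
#   import pyarrow.parquet as pq
#   for row in iter_cell_rows(pq.read_table("/tmp/repro.parquet")):
#       print(row)
```

Whether this preserves the current type mapping (for example, what control.categories does today) would need to be checked against the existing parser before adopting it.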

PR coming — opens with Option A as the minimal, low-risk fix; Option B is left as a follow-up for maintainers to weigh.

Environment

  • frictionless 5.19.0
  • Python 3.12
  • Linux (WSL2), but not OS-specific — reproduces anywhere pip installs the parquet extra without pandas.
