frictionless[parquet] raises ModuleNotFoundError: pandas on first parquet read #1773

## Summary

Installing only the `parquet` extra is not enough to read or write Parquet files: the parser raises `ModuleNotFoundError: No module named 'pandas'` on the first `read_rows()` call.

## Reproduction

Fully self-contained — the `parquet` extra already installs `pyarrow`, so we use it to write the test file. No other data or deps required.

```bash
python -m venv /tmp/repro && source /tmp/repro/bin/activate
pip install 'frictionless[parquet]==5.19.0'

# 1. Write a tiny parquet file using pyarrow (which the extra does provide):
python - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq
pq.write_table(pa.table({"id": [1, 2], "name": ["alice", "bob"]}), "/tmp/repro.parquet")
PY

# 2. Try to read it via frictionless — triggers the bug:
cd /tmp && python -c "from frictionless import Resource; print(Resource('/tmp/repro.parquet').read_rows())"
```

Output (confirmed on a clean Python 3.12 venv):

```text
Traceback (most recent call last):
  ...
  File ".../frictionless/formats/parquet/parser.py", line 42, in read_cell_stream_create
    df = table.to_pandas(categories=control.categories or None)
         ^^^^^^^^^^^^^^^
  File "pyarrow/pandas-shim.pxi", line 50, in pyarrow.lib._PandasAPIShim._import_pandas
ModuleNotFoundError: No module named 'pandas'

frictionless.exception.FrictionlessException: [source-error] The data source has not supported or has inconsistent contents: No module named 'pandas'
```

## Root cause

`frictionless/formats/parquet/parser.py` uses `pandas` unconditionally:

- `table.to_pandas(...)` (line 42) on every read
- `TableResource(data=df, format="pandas")` (line 43) on every read
- `platform.pandas.io.common.get_handle(...)` (lines 32–35) on remote reads
- `source.to_pandas()` (line 55) on every write

But `pyproject.toml` declares `parquet = ["pyarrow>=14.0"]`, so `pandas` is not pulled in. Both pandas imports are lazy: `platform.pandas` is resolved through `@extras(name="pandas")`, and PyArrow imports pandas at runtime inside `to_pandas()`. As a result, the error surfaces on first use rather than at import time. CI doesn't catch this because the hatch default env installs all extras together (e.g., `frictionless[...,pandas,parquet,...]`), which masks the packaging gap.
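For readers less familiar with lazy imports, here is a toy illustration of why the failure shows up only on first use. It is not frictionless code: `read_rows_lazily` and the module name `hypothetical_missing_dependency` are made up for the example.

```python
def read_rows_lazily():
    """Toy stand-in for a parser that imports its dependency at call time."""
    # Hypothetical module name, chosen so that the import is sure to fail.
    import hypothetical_missing_dependency  # noqa: F401

    return []


# Defining (or importing) the function succeeds: the import statement inside
# the body has not executed yet.
reader = read_rows_lazily

try:
    reader()  # the import runs here, so the error surfaces on first use
except ModuleNotFoundError as exc:
    print(f"deferred failure: {exc}")
```

The same shape applies to the parser: importing `frictionless` works, and the missing dependency is only noticed once a read or write actually reaches the pandas code path.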

The bug has been latent since PR #1260 (Oct 2022), which introduced the remote-read pandas dependency and the pandas-dataframe conversion.

## Workaround

Install with both extras explicitly: `pip install 'frictionless[parquet,pandas]'`.
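To confirm whether a given environment is affected, a small preflight check along these lines may help. It is only a sketch: `missing_modules` is a hypothetical helper, and the module list is taken from the root-cause analysis above rather than from any frictionless API.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules the Parquet parser needs at runtime, per the root-cause analysis.
missing = missing_modules(["pyarrow", "pandas"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("pyarrow and pandas are both importable")
```

In a venv created with only `frictionless[parquet]`, this would be expected to report `pandas` as missing; after installing both extras it should report nothing missing.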

## Proposed fix

**Option A (quick fix):** Add `pandas>=1.0` to the `parquet` extra in `pyproject.toml` so it matches the actual runtime surface of `ParquetParser`. One-line change; makes the `parquet` and `pandas` extras strictly redundant, which honestly reflects today's runtime coupling.

**Option B (architectural fix):** Keep the `parquet` extra lightweight and avoid pulling in pandas entirely by rewriting `ParquetParser` to read natively from PyArrow — e.g., iterating via `table.to_batches()` or `table.to_pylist()` instead of delegating to `TableResource(format="pandas")`. Larger change; decouples the two extras for good.
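A rough sketch of what the Option B read path could look like is below. It is an assumption about the shape of the change, not the actual parser code: `iter_cell_rows` is a hypothetical helper, and it relies only on `Table.column_names`, `Table.to_batches()` and `RecordBatch.to_pylist()` from PyArrow. Because `to_pylist()` returns dicts keyed by column name, this sketch also assumes column names are unique.

```python
from typing import Any, Iterator, List


def iter_cell_rows(table: Any) -> Iterator[List[Any]]:
    """Yield the header row, then one list of cells per record.

    `table` is expected to behave like a `pyarrow.Table`: it must expose
    `column_names` and `to_batches()`, and each batch must expose
    `to_pylist()` returning a list of dicts.
    """
    names = list(table.column_names)
    yield names
    for batch in table.to_batches():
        for record in batch.to_pylist():
            yield [record[name] for name in names]


# Demo that only runs when pyarrow is available; no pandas involved.
try:
    import pyarrow as pa
except ImportError:
    pa = None

if pa is not None:
    demo = pa.table({"id": [1, 2], "name": ["alice", "bob"]})
    for row in iter_cell_rows(demo):
        print(row)
```

Iterating batch by batch keeps memory bounded for large files, whereas calling `table.to_pylist()` once would be simpler but materializes every row at the same time.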

PR coming — opens with Option A as the minimal, low-risk fix; Option B is left as a follow-up for maintainers to weigh.

## Environment

- frictionless 5.19.0
- Python 3.12
- Linux (WSL2), but not OS-specific — reproduces anywhere pip installs the `parquet` extra without pandas.

