Skip to content

fix: parallel file validation for a datapackage#1722

Merged
pierrecamilleri merged 10 commits into
mainfrom
fix/parallel-datapackage
Mar 25, 2025
Merged

fix: parallel file validation for a datapackage#1722
pierrecamilleri merged 10 commits into
mainfrom
fix/parallel-datapackage

Conversation

@pierrecamilleri

@pierrecamilleri pierrecamilleri commented Dec 13, 2024

Copy link
Copy Markdown
Collaborator

Context

  • In validator.py, Resource was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).
  • The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
  • Importing Resource (not as a type) into validator.py leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.

Suggested changes and fixes

  • I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
  • I moved the implementation of Validator.validate_resource and Validator.validate_package methods to the Resource.validate and Package.validate methods. This solves the circular import problem, and removes duplication (method signatures).
  • The tests have been dispatched as well.
  • I soft deprecated the Validator class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.

Note

I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).

Archive of explorations
  • Test that it works correctly
    • Works without error, however, I could not produce any significant performance gain in my tests.
  • Check why the tests did not intercept this error
    • The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the validate_parallel function was actually never executed in the tests.
  • See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to Resource and DataPackage directly, but I may have missed something.

@pierrecamilleri pierrecamilleri marked this pull request as ready for review January 27, 2025 14:14

@pdelboca pdelboca left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @pierrecamilleri !

Just one small comment but feel free to merge :)

Comment thread frictionless/resource/resource.py Outdated
@pierrecamilleri pierrecamilleri merged commit c48cc79 into main Mar 25, 2025
@pierrecamilleri pierrecamilleri deleted the fix/parallel-datapackage branch March 25, 2025 21:19
dsmedia added a commit to dsmedia/vega-datasets that referenced this pull request Apr 19, 2026
Adds scripts/validate_datapackage.py for verifying data files against
the descriptor. Optional, local, not run in CI — the existing workflow
(taplo + ruff + npm build) is unchanged.

Runs two phases in sequence. Phase 1 is pure-Python: recomputes each
file's size and git-blob SHA-1 and compares against datapackage.json.
We do this ourselves because frictionless-py can't cover either case
reliably — its byte-count check returns None for tabular JSON / arrow /
parquet resources, and its hash check only supports md5 and sha256
(our descriptor uses sha1). Phase 2 hands each resource to
frictionless-py for schema and row validation, with byte-count and
hash-count skipped (phase 1 covered them) and serial execution because
frictionless's parallel path silently ignores Checklist.skip_errors.

Known-expected phase-2 failures — movies.json's intentional schema
mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser
upstream) — are declared in _data/validate_datapackage.toml and marked
with a yellow warning in the output without tripping the exit code.
Removing an entry from that file re-enables strict checking and
surfaces any regression in a PR.

Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate
fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime
error on --parallel for non-FK packages. The script's PEP 723 deps
also include the pandas extra as a workaround for
frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone
doesn't pull pandas, which the parquet parser needs); that extra will
be redundant once #1774 releases.
dsmedia added a commit to vega/vega-datasets that referenced this pull request Apr 30, 2026
)

* feat: add optional local datapackage.json validator

Adds scripts/validate_datapackage.py for verifying data files against
the descriptor. Optional, local, not run in CI — the existing workflow
(taplo + ruff + npm build) is unchanged.

Runs two phases in sequence. Phase 1 is pure-Python: recomputes each
file's size and git-blob SHA-1 and compares against datapackage.json.
We do this ourselves because frictionless-py can't cover either case
reliably — its byte-count check returns None for tabular JSON / arrow /
parquet resources, and its hash check only supports md5 and sha256
(our descriptor uses sha1). Phase 2 hands each resource to
frictionless-py for schema and row validation, with byte-count and
hash-count skipped (phase 1 covered them) and serial execution because
frictionless's parallel path silently ignores Checklist.skip_errors.

Known-expected phase-2 failures — movies.json's intentional schema
mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser
upstream) — are declared in _data/validate_datapackage.toml and marked
with a yellow warning in the output without tripping the exit code.
Removing an entry from that file re-enables strict checking and
surfaces any regression in a PR.

Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate
fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime
error on --parallel for non-FK packages. The script's PEP 723 deps
also include the pandas extra as a workaround for
frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone
doesn't pull pandas, which the parquet parser needs); that extra will
be redundant once #1774 releases.

* fix: unpin taplo from local aarch64 wheel registry in uv.lock

The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }`
and only the aarch64 wheel — leaked from a local WSL2 workaround when the
frictionless bump triggered a re-lock. CI on x86_64 couldn't install it.

Restore taplo's PyPI source with all platform wheels so CI passes.

* fix: add pandas to default deps for parquet reading

frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for
parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but
pyarrow doesn't declare it as a transitive dep, so build_datapackage.py
failed in CI with "No module named 'pandas'" when inferring parquet
schemas. On main this worked only because frictionless[parquet]>=5.18.0
pulled fastparquet, which in turn pulled pandas.

Add pandas as an explicit top-level dep so the default install tree
satisfies frictionless's parquet backend.

* refactor: convert datapackage validator from script to pytest

@domoritz observed on PR #782 that the script reimplements pytest's
reporting layer (custom rich panels, manual progress bars, custom
expected-failures filtering). vega-datasets is a metadata + data hub —
every line of tracked code should pay rent — so move the validator into
pytest where pytest already provides reporting, parametrization, xfail
strict, and standard CLI ergonomics.

Two tiers map onto pytest's fast/slow split:

* Default — file existence, declared bytes, git-blob SHA-1. Stdlib only,
  sub-second over 73 resources. Covers what frictionless-py doesn't
  today (byte-count returns None for tabular JSON / arrow / parquet;
  hash-count supports only md5 / sha256; descriptor uses sha1).

* Slow (`pytest --run-slow`) — frictionless schema and row validation
  per resource. Default is full read (matching the script); pass
  `--limit-rows N` to cap during iteration. flights-3m.parquet is
  ~minutes at full read.

The expected-failures allowlist (movies — intentional pedagogy;
flights_200k_arrow — no upstream parser) stays in
_data/validate_datapackage.toml. xfail marks are emitted at parametrize
time via pytest.param(resource, marks=[xfail(strict=True)]) — not via
brittle ID-string matching in pytest_collection_modifyitems. Strict
mode flips XFAIL to FAIL the moment an upstream issue resolves,
prompting allowlist removal.

Tests assert descriptor contracts, they don't skip on missing fields.
Every resource has path / bytes / sha1: hash today; if a future
descriptor regression drops one, the test fails loudly rather than
silently SKIPping.

parallel=False is preserved with a code comment explaining the
load-bearing rationale: frictionless's parallel path silently ignores
Checklist.skip_errors, which would re-surface byte-count and
hash-count errors phase 1 already covers.

Net: -330 lines from deleting scripts/validate_datapackage.py
(replaced by 213 lines under tests/), pytest>=9 promoted from
transitive in uv.lock to declared dev-group dep. The rich PEP 723
inline dep on the script goes away; rich itself stays in uv.lock
transitively via frictionless -> typer -> rich.

* docs(test): point basepath workaround at #758

Replace the misattributed reference to frictionlessdata/frictionless-py#1435
with a pointer to #758, the in-repo issue tracking
the actual cause: descriptor resource paths are bare filenames under
data/ rather than relative to the descriptor location.

#1435 is about remote descriptors with local data (CLI --basepath
silently ignored on remote URLs); our descriptor is local and fixing
#1435 upstream would not remove this workaround. #758 is the correct
upstream tracker and chains to vega/altair#3946 for the cross-repo
coordination required before the descriptor paths can be migrated.

* fix(test): correct invalid return annotation on _slow_param

`pytest.param` is a function (factory), not a type, so `-> pytest.param`
is an invalid type annotation (flagged by ty as `invalid-type-form`).
Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so
fall back to `-> Any` with an inline comment naming the actual type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Adding --parallel CLI flag breaks validation

2 participants