fix: parallel file validation for a datapackage by pierrecamilleri · Pull Request #1722 · frictionlessdata/frictionless-py

pierrecamilleri · 2024-12-13T16:24:24Z

fixes Adding --parallel CLI flag breaks validation #1644

Context

In validator.py, Resource was imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).
The tests would not intercept this error, as they were performed on a schema with a foreign key, and the validation would fall back on synchronous validation in this case.
Importing Resource (not as a type) into validator.py leads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.

Suggested changes and fixes

I repaired the tests, using datapackages with no foreign keys to test parallel validation. I checked that these tests would have failed because with the bug.
I moved the implementation of Validator.validate_resource and Validator.validate_package methods to the Resource.validate and Package.validate methods. This solves the circular import problem, and removes duplication (method signatures).
The tests have been dispatched as well.
I soft deprecated the Validator class, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.

Note

I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).

Archive of explorations

Test that it works correctly
- Works without error, however, I could not produce any significant performance gain in my tests.
Check why the tests did not intercept this error
- The tests for parallel processing are run for a schema with a foreign key - which fall backs to single-core processing. So the validate_parallel function was actually never executed in the tests.
See if I can find a more elegant way to deal with the circular import. I actually don't see the benefit of having this validator.py class instead of dispatching the methods to Resource and DataPackage directly, but I may have missed something.

pdelboca

Thanks for this @pierrecamilleri !

Just one small comment but feel free to merge :)

Adds scripts/validate_datapackage.py for verifying data files against the descriptor. Optional, local, not run in CI — the existing workflow (taplo + ruff + npm build) is unchanged. Runs two phases in sequence. Phase 1 is pure-Python: recomputes each file's size and git-blob SHA-1 and compares against datapackage.json. We do this ourselves because frictionless-py can't cover either case reliably — its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1). Phase 2 hands each resource to frictionless-py for schema and row validation, with byte-count and hash-count skipped (phase 1 covered them) and serial execution because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected phase-2 failures — movies.json's intentional schema mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser upstream) — are declared in _data/validate_datapackage.toml and marked with a yellow warning in the output without tripping the exit code. Removing an entry from that file re-enables strict checking and surfaces any regression in a PR. Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime error on --parallel for non-FK packages. The script's PEP 723 deps also include the pandas extra as a workaround for frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone doesn't pull pandas, which the parquet parser needs); that extra will be redundant once #1774 releases.

@domoritz

) * feat: add optional local datapackage.json validator Adds scripts/validate_datapackage.py for verifying data files against the descriptor. Optional, local, not run in CI — the existing workflow (taplo + ruff + npm build) is unchanged. Runs two phases in sequence. Phase 1 is pure-Python: recomputes each file's size and git-blob SHA-1 and compares against datapackage.json. We do this ourselves because frictionless-py can't cover either case reliably — its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1). Phase 2 hands each resource to frictionless-py for schema and row validation, with byte-count and hash-count skipped (phase 1 covered them) and serial execution because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected phase-2 failures — movies.json's intentional schema mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser upstream) — are declared in _data/validate_datapackage.toml and marked with a yellow warning in the output without tripping the exit code. Removing an entry from that file re-enables strict checking and surfaces any regression in a PR. Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime error on --parallel for non-FK packages. The script's PEP 723 deps also include the pandas extra as a workaround for frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone doesn't pull pandas, which the parquet parser needs); that extra will be redundant once #1774 releases. * fix: unpin taplo from local aarch64 wheel registry in uv.lock The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }` and only the aarch64 wheel — leaked from a local WSL2 workaround when the frictionless bump triggered a re-lock. CI on x86_64 couldn't install it. Restore taplo's PyPI source with all platform wheels so CI passes. * fix: add pandas to default deps for parquet reading frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but pyarrow doesn't declare it as a transitive dep, so build_datapackage.py failed in CI with "No module named 'pandas'" when inferring parquet schemas. On main this worked only because frictionless[parquet]>=5.18.0 pulled fastparquet, which in turn pulled pandas. Add pandas as an explicit top-level dep so the default install tree satisfies frictionless's parquet backend. * refactor: convert datapackage validator from script to pytest @domoritz observed on PR #782 that the script reimplements pytest's reporting layer (custom rich panels, manual progress bars, custom expected-failures filtering). vega-datasets is a metadata + data hub — every line of tracked code should pay rent — so move the validator into pytest where pytest already provides reporting, parametrization, xfail strict, and standard CLI ergonomics. Two tiers map onto pytest's fast/slow split: * Default — file existence, declared bytes, git-blob SHA-1. Stdlib only, sub-second over 73 resources. Covers what frictionless-py doesn't today (byte-count returns None for tabular JSON / arrow / parquet; hash-count supports only md5 / sha256; descriptor uses sha1). * Slow (`pytest --run-slow`) — frictionless schema and row validation per resource. Default is full read (matching the script); pass `--limit-rows N` to cap during iteration. flights-3m.parquet is ~minutes at full read. The expected-failures allowlist (movies — intentional pedagogy; flights_200k_arrow — no upstream parser) stays in _data/validate_datapackage.toml. xfail marks are emitted at parametrize time via pytest.param(resource, marks=[xfail(strict=True)]) — not via brittle ID-string matching in pytest_collection_modifyitems. Strict mode flips XFAIL to FAIL the moment an upstream issue resolves, prompting allowlist removal. Tests assert descriptor contracts, they don't skip on missing fields. Every resource has path / bytes / sha1: hash today; if a future descriptor regression drops one, the test fails loudly rather than silently SKIPping. parallel=False is preserved with a code comment explaining the load-bearing rationale: frictionless's parallel path silently ignores Checklist.skip_errors, which would re-surface byte-count and hash-count errors phase 1 already covers. Net: -330 lines from deleting scripts/validate_datapackage.py (replaced by 213 lines under tests/), pytest>=9 promoted from transitive in uv.lock to declared dev-group dep. The rich PEP 723 inline dep on the script goes away; rich itself stays in uv.lock transitively via frictionless -> typer -> rich. * docs(test): point basepath workaround at #758 Replace the misattributed reference to frictionlessdata/frictionless-py#1435 with a pointer to #758, the in-repo issue tracking the actual cause: descriptor resource paths are bare filenames under data/ rather than relative to the descriptor location. #1435 is about remote descriptors with local data (CLI --basepath silently ignored on remote URLs); our descriptor is local and fixing #1435 upstream would not remove this workaround. #758 is the correct upstream tracker and chains to vega/altair#3946 for the cross-repo coordination required before the descriptor paths can be migrated. * fix(test): correct invalid return annotation on _slow_param `pytest.param` is a function (factory), not a type, so `-> pytest.param` is an invalid type annotation (flagged by ty as `invalid-type-form`). Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so fall back to `-> Any` with an inline comment naming the actual type.

pierrecamilleri and others added 6 commits December 13, 2024 17:17

fix: parallel file validation for a datapackage

bef5fdc

🔵 rename local variable

b0554a2

🔵 improve documentation

f20aab1

Mv validate methods & soft deprecate Validator

bc27f9b

fix: repair tests of parallel validation

0498e9b

Dispatching tests according to method change

e2f2a33

pierrecamilleri marked this pull request as ready for review January 27, 2025 14:14

pierrecamilleri requested review from pdelboca and roll January 27, 2025 14:15

pdelboca approved these changes Jan 29, 2025

View reviewed changes

Comment thread frictionless/resource/resource.py Outdated

pierrecamilleri added 4 commits March 25, 2025 21:43

doc: detail on depreciation

ff8e89d

fix: unintentional format

9408d6c

fix: unintentional format change

dd527d5

version bump

910fb80

pierrecamilleri merged commit c48cc79 into main Mar 25, 2025

pierrecamilleri deleted the fix/parallel-datapackage branch March 25, 2025 21:19

dsmedia mentioned this pull request Apr 19, 2026

feat: add tests to validate dataset files against datapackage.json vega/vega-datasets#782

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: parallel file validation for a datapackage#1722

fix: parallel file validation for a datapackage#1722
pierrecamilleri merged 10 commits into
mainfrom
fix/parallel-datapackage

pierrecamilleri commented Dec 13, 2024 •

edited

Loading

Uh oh!

pdelboca left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pierrecamilleri commented Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

Suggested changes and fixes

Note

Uh oh!

pdelboca left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

pierrecamilleri commented Dec 13, 2024 •

edited

Loading