fix: parallel file validation for a datapackage#1722
Merged
Conversation
pdelboca
approved these changes
Jan 29, 2025
pdelboca
left a comment
Contributor
There was a problem hiding this comment.
Thanks for this @pierrecamilleri !
Just one small comment but feel free to merge :)
dsmedia
added a commit
to dsmedia/vega-datasets
that referenced
this pull request
Apr 19, 2026
Adds scripts/validate_datapackage.py for verifying data files against the descriptor. Optional, local, not run in CI — the existing workflow (taplo + ruff + npm build) is unchanged. Runs two phases in sequence. Phase 1 is pure-Python: recomputes each file's size and git-blob SHA-1 and compares against datapackage.json. We do this ourselves because frictionless-py can't cover either case reliably — its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1). Phase 2 hands each resource to frictionless-py for schema and row validation, with byte-count and hash-count skipped (phase 1 covered them) and serial execution because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected phase-2 failures — movies.json's intentional schema mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser upstream) — are declared in _data/validate_datapackage.toml and marked with a yellow warning in the output without tripping the exit code. Removing an entry from that file re-enables strict checking and surfaces any regression in a PR. Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime error on --parallel for non-FK packages. The script's PEP 723 deps also include the pandas extra as a workaround for frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone doesn't pull pandas, which the parquet parser needs); that extra will be redundant once #1774 releases.
dsmedia
added a commit
to vega/vega-datasets
that referenced
this pull request
Apr 30, 2026
) * feat: add optional local datapackage.json validator Adds scripts/validate_datapackage.py for verifying data files against the descriptor. Optional, local, not run in CI — the existing workflow (taplo + ruff + npm build) is unchanged. Runs two phases in sequence. Phase 1 is pure-Python: recomputes each file's size and git-blob SHA-1 and compares against datapackage.json. We do this ourselves because frictionless-py can't cover either case reliably — its byte-count check returns None for tabular JSON / arrow / parquet resources, and its hash check only supports md5 and sha256 (our descriptor uses sha1). Phase 2 hands each resource to frictionless-py for schema and row validation, with byte-count and hash-count skipped (phase 1 covered them) and serial execution because frictionless's parallel path silently ignores Checklist.skip_errors. Known-expected phase-2 failures — movies.json's intentional schema mismatch (documented pedagogy) and flights-200k.arrow (no arrow parser upstream) — are declared in _data/validate_datapackage.toml and marked with a yellow warning in the output without tripping the exit code. Removing an entry from that file re-enables strict checking and surfaces any regression in a PR. Bumps frictionless>=5.18.0 to >=5.18.1 to pick up the parallel-validate fix in frictionlessdata/frictionless-py#1722; v5.18.0 threw a runtime error on --parallel for non-FK packages. The script's PEP 723 deps also include the pandas extra as a workaround for frictionlessdata/frictionless-py#1773 / #1774 (the parquet extra alone doesn't pull pandas, which the parquet parser needs); that extra will be redundant once #1774 releases. * fix: unpin taplo from local aarch64 wheel registry in uv.lock The lockfile incorrectly recorded taplo with `source = { registry = ".wheels" }` and only the aarch64 wheel — leaked from a local WSL2 workaround when the frictionless bump triggered a re-lock. CI on x86_64 couldn't install it. Restore taplo's PyPI source with all platform wheels so CI passes. * fix: add pandas to default deps for parquet reading frictionless[parquet]>=5.18.1 switched from fastparquet to pyarrow for parquet support. pyarrow.Table.to_pandas() needs pandas at runtime but pyarrow doesn't declare it as a transitive dep, so build_datapackage.py failed in CI with "No module named 'pandas'" when inferring parquet schemas. On main this worked only because frictionless[parquet]>=5.18.0 pulled fastparquet, which in turn pulled pandas. Add pandas as an explicit top-level dep so the default install tree satisfies frictionless's parquet backend. * refactor: convert datapackage validator from script to pytest @domoritz observed on PR #782 that the script reimplements pytest's reporting layer (custom rich panels, manual progress bars, custom expected-failures filtering). vega-datasets is a metadata + data hub — every line of tracked code should pay rent — so move the validator into pytest where pytest already provides reporting, parametrization, xfail strict, and standard CLI ergonomics. Two tiers map onto pytest's fast/slow split: * Default — file existence, declared bytes, git-blob SHA-1. Stdlib only, sub-second over 73 resources. Covers what frictionless-py doesn't today (byte-count returns None for tabular JSON / arrow / parquet; hash-count supports only md5 / sha256; descriptor uses sha1). * Slow (`pytest --run-slow`) — frictionless schema and row validation per resource. Default is full read (matching the script); pass `--limit-rows N` to cap during iteration. flights-3m.parquet is ~minutes at full read. The expected-failures allowlist (movies — intentional pedagogy; flights_200k_arrow — no upstream parser) stays in _data/validate_datapackage.toml. xfail marks are emitted at parametrize time via pytest.param(resource, marks=[xfail(strict=True)]) — not via brittle ID-string matching in pytest_collection_modifyitems. Strict mode flips XFAIL to FAIL the moment an upstream issue resolves, prompting allowlist removal. Tests assert descriptor contracts, they don't skip on missing fields. Every resource has path / bytes / sha1: hash today; if a future descriptor regression drops one, the test fails loudly rather than silently SKIPping. parallel=False is preserved with a code comment explaining the load-bearing rationale: frictionless's parallel path silently ignores Checklist.skip_errors, which would re-surface byte-count and hash-count errors phase 1 already covers. Net: -330 lines from deleting scripts/validate_datapackage.py (replaced by 213 lines under tests/), pytest>=9 promoted from transitive in uv.lock to declared dev-group dep. The rich PEP 723 inline dep on the script goes away; rich itself stays in uv.lock transitively via frictionless -> typer -> rich. * docs(test): point basepath workaround at #758 Replace the misattributed reference to frictionlessdata/frictionless-py#1435 with a pointer to #758, the in-repo issue tracking the actual cause: descriptor resource paths are bare filenames under data/ rather than relative to the descriptor location. #1435 is about remote descriptors with local data (CLI --basepath silently ignored on remote URLs); our descriptor is local and fixing #1435 upstream would not remove this workaround. #758 is the correct upstream tracker and chains to vega/altair#3946 for the cross-repo coordination required before the descriptor paths can be migrated. * fix(test): correct invalid return annotation on _slow_param `pytest.param` is a function (factory), not a type, so `-> pytest.param` is an invalid type annotation (flagged by ty as `invalid-type-form`). Pytest 9.0.3 doesn't expose `ParameterSet` at the public top level, so fall back to `-> Any` with an inline comment naming the actual type.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Context
Resourcewas imported as a type, but then used as a class to retrieve a class method (Resource.from_descriptor).Resource(not as a type) intovalidator.pyleads to a circular import. A small workaround would have been to import Resource locally, but this was not very satisfactory.Suggested changes and fixes
Validator.validate_resourceandValidator.validate_packagemethods to theResource.validateandPackage.validatemethods. This solves the circular import problem, and removes duplication (method signatures).Validatorclass, keeping the methods (forwarding input parameters) as they are part of the public API. No warning, only a deprecation message in the docstring, and a mention that there is no plan to remove the class in the future.Note
I could not produce any significant performance gain in my tests, I wonder if there is some shared resources' lock that prevents the validation from effectively running the validation in parallel (but I have not spent much time on this yet).
Archive of explorations
validate_parallelfunction was actually never executed in the tests.ResourceandDataPackagedirectly, but I may have missed something.