Skip to content

feat(integration-tests): Introduce IntegrationTestDataset fixtures; move test data to central directory.#2181

Merged
quinntaylormitchell merged 14 commits into
y-scope:mainfrom
quinntaylormitchell:testing-redesign-2
Apr 28, 2026
Merged

feat(integration-tests): Introduce IntegrationTestDataset fixtures; move test data to central directory.#2181
quinntaylormitchell merged 14 commits into
y-scope:mainfrom
quinntaylormitchell:testing-redesign-2

Conversation

@quinntaylormitchell

@quinntaylormitchell quinntaylormitchell commented Apr 7, 2026

Copy link
Copy Markdown
Collaborator

Description

This PR stores test datasets in a centralized /data directory, and makes them available to all tests through session-scoped fixtures in datasets.py. Each test dataset directory carries a metadata.json file which specifies certain information about the dataset needed for testing. The information in this file is loaded into a pydantic object (IntegrationTestDatasetMetadata) which validates the format of the file.

NOTE: the old test datasets located at integration-tests/tests/package_tests/clp_json/data/json-multifile and integration-tests/tests/package_tests/clp_text/data/text-multifile are not removed in this PR, and they are still referenced in code. They will be removed when appropriate as the rest of the package testing code is updated and developed.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

Tested on development branch; all tests pass.

Summary by CodeRabbit

  • Tests

    • Added reusable session-scoped dataset fixtures and extensive sample log files (JSON/text) for integration testing
    • Introduced metadata-driven validation for datasets and stronger runtime checks of dataset contents
  • Documentation

    • Added a README describing dataset layout and metadata schema for integration tests
  • Chores

    • Test configuration now auto-loads the fixtures module for smoother test runs

@coderabbitai

coderabbitai Bot commented Apr 7, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds two integration-test datasets (JSON and text) and session-scoped fixtures; refactors dataset model to load and validate dataset metadata from each dataset root and verifies listed log files exist.

Changes

Cohort / File(s) Summary
Pytest configuration
integration-tests/tests/conftest.py
Injects tests.fixtures.datasets into pytest_plugins so dataset fixtures are auto-loaded.
Dataset fixtures
integration-tests/tests/fixtures/datasets.py
New module adding session-scoped json_multifile and text_multifile pytest fixtures that construct IntegrationTestDataset from a dataset root path.
Dataset class & metadata model
integration-tests/tests/utils/classes.py
Refactors IntegrationTestDataset to accept path_to_dataset_root, loads metadata.json into a new IntegrationTestDatasetMetadata Pydantic model, validates logs subdir, begin_ts ≤ end_ts, and that file_names exist; adds computed properties and __post_init__.
JSON dataset files
integration-tests/tests/data/json_multifile/metadata.json, integration-tests/tests/data/json_multifile/logs/...
integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-08.jsonl, ...-07-09.jsonl, ...-07-11.jsonl, ...-07-19.jsonl, ...-07-21.jsonl
Adds json_multifile metadata and five .jsonl log files (8 records each) enumerated in metadata.
Text dataset files
integration-tests/tests/data/text_multifile/metadata.json, integration-tests/tests/data/text_multifile/logs/...
integration-tests/tests/data/text_multifile/logs/apollo-17_day01.txt, ..._day04.txt, ..._day07.txt, ..._day10.txt, ..._day13.txt
Adds text_multifile metadata and five text log files (10 lines each) enumerated in metadata.
Documentation
integration-tests/tests/data/README.md
New README describing dataset directory layout, required metadata.json schema, and how to access datasets via fixtures.

Sequence Diagram

sequenceDiagram
    participant Pytest as pytest
    participant Conftest as conftest.py
    participant Fixture as datasets.py (fixture)
    participant Dataset as IntegrationTestDataset
    participant Metadata as metadata.json
    participant Logs as logs/

    Pytest->>Conftest: load plugins (pytest_plugins)
    Conftest->>Fixture: auto-load fixtures module
    Pytest->>Fixture: request json_multifile fixture
    Fixture->>Dataset: instantiate with path_to_dataset_root
    Dataset->>Metadata: read metadata.json
    Metadata-->>Dataset: parsed IntegrationTestDatasetMetadata
    Dataset->>Dataset: validate begin_ts ≤ end_ts
    Dataset->>Logs: check logs_subdir exists and file_names present
    Logs-->>Dataset: file existence confirmed
    Dataset-->>Fixture: return configured dataset
    Fixture-->>Pytest: provide fixture to tests
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the main changes: introducing IntegrationTestDataset fixtures and centralizing test data in a /data directory.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@quinntaylormitchell quinntaylormitchell changed the title refactor(integration-tests): Introduce IntegrationTestDataset fixtures and move test data to central directory. refactor(integration-tests): Introduce IntegrationTestDataset fixtures; move test data to central directory. Apr 7, 2026
@Bill-hbrhbr Bill-hbrhbr self-requested a review April 7, 2026 17:56
@quinntaylormitchell quinntaylormitchell marked this pull request as ready for review April 7, 2026 18:15
@quinntaylormitchell quinntaylormitchell requested a review from a team as a code owner April 7, 2026 18:15

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-11.jsonl`:
- Around line 1-8: The JSONL records for mission "STS-135" (each object with
"line_index" 0–7) currently have CRLF endings; change the file so every record
uses LF-only line endings (remove trailing \r from each line) ensuring no CR
characters remain at end-of-line and re-save the JSONL with Unix/LF EOL to
prevent byte-level comparison flakes.

In `@integration-tests/tests/data/text_multifile/logs/apollo-17_day10.txt`:
- Line 4: Replace the misspelled phrase "ground based" with the hyphenated form
"ground-based" in the fixture text line that reads "navigation sightings compare
well with ground based tracking solutions" so it becomes "navigation sightings
compare well with ground-based tracking solutions"; update the test fixture file
content accordingly to preserve surrounding punctuation and spacing.

In `@integration-tests/tests/fixtures/datasets.py`:
- Around line 17-23: The fixtures duplicate dataset-loading logic and access
nested metadata without validation, causing brittle failures; create a shared
loader/validator (e.g., a helper function like load_and_validate_dataset or
load_json_to_dataset) that centralizes path resolution using path_to_dataset and
load_json_to_dict, validates required keys such as
metadata_dict["file_structure"]["logs_subdir"] (and any schema fields), raises
clear errors on missing/invalid fields, and returns an IntegrationTestDataset
populated via dataset_name, path_to_dataset_logs and metadata_dict; replace the
duplicated blocks (including the other occurrence at lines 31-37) to call this
helper so failures are fast and diagnostics are meaningful.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 59db8514-97df-4fbd-a009-57b2044ce639

📥 Commits

Reviewing files that changed from the base of the PR and between 2f48844 and ce89570.

📒 Files selected for processing (15)
  • integration-tests/tests/conftest.py
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-08.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-09.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-11.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-19.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-21.jsonl
  • integration-tests/tests/data/json_multifile/metadata.json
  • integration-tests/tests/data/text_multifile/logs/apollo-17_day01.txt
  • integration-tests/tests/data/text_multifile/logs/apollo-17_day04.txt
  • integration-tests/tests/data/text_multifile/logs/apollo-17_day07.txt
  • integration-tests/tests/data/text_multifile/logs/apollo-17_day10.txt
  • integration-tests/tests/data/text_multifile/logs/apollo-17_day13.txt
  • integration-tests/tests/data/text_multifile/metadata.json
  • integration-tests/tests/fixtures/datasets.py
  • integration-tests/tests/utils/utils.py

Comment thread integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-11.jsonl Outdated
Comment thread integration-tests/tests/fixtures/datasets.py Outdated
@quinntaylormitchell quinntaylormitchell changed the title refactor(integration-tests): Introduce IntegrationTestDataset fixtures; move test data to central directory. feat(integration-tests): Introduce IntegrationTestDataset fixtures; move test data to central directory. Apr 8, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration-tests/tests/data/text_multifile/metadata.json`:
- Line 16: The metadata file sets "columns_file_name": null but the
IntegrationTestDatasetMetadata model currently types columns_file_name as str;
update the model (IntegrationTestDatasetMetadata) to accept None by changing the
type to Optional[str] or using the union syntax (str | None) for
columns_file_name so Pydantic validation succeeds when the field is null.

In `@integration-tests/tests/utils/classes.py`:
- Line 75: The type annotation for columns_file_name must allow nulls; change
its declaration in the class to be nullable (use "str | None" or
"Optional[str]") so Pydantic accepts null from text_multifile/metadata.json;
also add or update the typing import (Optional) if you choose that form. Ensure
this aligns with the existing columns_file_path property which already handles
None.
- Line 96: The code calls validate_dir_exists on a file path
(metadata_file_path); replace this with a file-specific check: add or reuse a
validate_file_exists utility that accepts the same input type as
validate_dir_exists, verifies Path(...).exists() and Path(...).is_file(), and
raises the same error semantics as the directory validator on failure, then
change the call in the class from validate_dir_exists(self.metadata_file_path)
to validate_file_exists(self.metadata_file_path) (or directly use
Path(self.metadata_file_path).is_file() with identical error handling if you
prefer not to add a helper).
- Around line 103-104: The code incorrectly calls validate_dir_exists on
columns_file_path (a file, not a directory); replace that call with the
file-level check (e.g., validate_file_exists(self.columns_file_path)) or
otherwise validate that self.columns_file_path is an existing file (or use
os.path.isdir/os.path.isfile) inside the same class/method where
columns_file_path is set so the validator matches the resource type; update the
reference to validate_dir_exists to the appropriate file validator for
columns_file_path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f644ae71-5762-4c9e-9cfb-2956cb47710d

📥 Commits

Reviewing files that changed from the base of the PR and between ce89570 and 8330ec6.

📒 Files selected for processing (5)
  • integration-tests/tests/data/json_multifile/columns.json
  • integration-tests/tests/data/json_multifile/metadata.json
  • integration-tests/tests/data/text_multifile/metadata.json
  • integration-tests/tests/fixtures/datasets.py
  • integration-tests/tests/utils/classes.py

Comment thread integration-tests/tests/data/text_multifile/metadata.json Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration-tests/tests/utils/classes.py`:
- Around line 98-99: The code reads JSON metadata with
self.metadata_file_path.read_text() which uses the system default encoding;
change the read to explicitly use UTF-8 (e.g., call read_text(encoding="utf-8")
or open the path with encoding="utf-8") before passing the string to
IntegrationTestDatasetMetadata.model_validate_json so self.metadata is parsed
consistently across platforms; update the read that assigns raw_metadata and
keep the subsequent
IntegrationTestDatasetMetadata.model_validate_json(raw_metadata) call unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f46b0472-5f7b-465a-be0f-686da657ca34

📥 Commits

Reviewing files that changed from the base of the PR and between 8330ec6 and 597435f.

📒 Files selected for processing (1)
  • integration-tests/tests/utils/classes.py

Comment thread integration-tests/tests/utils/classes.py
Comment thread integration-tests/tests/utils/classes.py
Comment thread integration-tests/tests/utils/classes.py
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py
Comment thread integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-08.jsonl Outdated
Comment thread integration-tests/tests/data/json_multifile/columns.json Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration-tests/tests/data/README.md`:
- Line 48: The markdown line uses straight (regular) quotes instead of inline
code formatting for boolean literals; update the text so the boolean values are
wrapped in backticks (e.g., replace "True" and "False" with `True` and `False`)
in the README entry that currently reads '`True` if logs are unstructured, else
`False`.' to ensure proper inline code formatting for the boolean values.

In `@integration-tests/tests/utils/classes.py`:
- Around line 103-114: The metadata validation currently accepts absolute paths
or '..' traversal from metadata (e.g., logs_subdir and file_names) which can
escape the dataset root; update the validation after
validate_dir_exists(self.logs_path) and before calling validate_file_exists to
resolve candidate paths (use Path.resolve or os.path.realpath) for both the
logs_subdir-derived path and each file_path_abs, then assert the resolved path
is contained within the dataset root (self.logs_path.resolve()) — if not, fail
the test (pytest.fail) with a clear message; reference functions/attributes:
validate_dir_exists, validate_file_exists, self.logs_path,
self.metadata.file_names, and any logs_subdir handling so the containment check
covers both directory and file entries.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: af9f042f-69b6-49a6-adc3-0997266b7d83

📥 Commits

Reviewing files that changed from the base of the PR and between 597435f and 4a23581.

📒 Files selected for processing (10)
  • integration-tests/tests/data/README.md
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-08.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-09.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-11.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-19.jsonl
  • integration-tests/tests/data/json_multifile/logs/sts-135-2011-07-21.jsonl
  • integration-tests/tests/data/json_multifile/metadata.json
  • integration-tests/tests/data/text_multifile/metadata.json
  • integration-tests/tests/fixtures/datasets.py
  • integration-tests/tests/utils/classes.py

Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/utils/classes.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@integration-tests/tests/data/README.md`:
- Around line 73-83: The fixture example incorrectly passes dataset_name to the
IntegrationTestDataset constructor even though IntegrationTestDataset derives
its name from metadata; update the dataset_name fixture to only construct
IntegrationTestDataset with the path (e.g. return
IntegrationTestDataset(path_to_dataset_root=integration_test_path_config.test_data_path
/ dataset_name)) and keep the fixture name/IntegrationTestPathConfig usage the
same so the runtime won't fail due to an unexpected keyword argument.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 970428ed-e7b2-489d-aa97-488e961e7651

📥 Commits

Reviewing files that changed from the base of the PR and between 4a23581 and d7e9a40.

📒 Files selected for processing (1)
  • integration-tests/tests/data/README.md

Comment thread integration-tests/tests/data/README.md
Comment thread integration-tests/tests/data/text_multifile/metadata.json Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kirkrodrigues

shall we start creating integration-test-specific docs in <project-root>/docs/ or do you think this README.md is fine to be kept here for now?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw I have already started the docs but #1686 is on hold until the bulk of the testing suite is finished.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We have integration specific docs inside docs/src/dev-docs/testing/integration-tests.md.
  2. I think we can move this README.md to the docs directory in a follow-up PR. Preferably after the current refactoring operation is complete.

Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/data/README.md Outdated
quinntaylormitchell and others added 2 commits April 20, 2026 15:41
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Updated the documentation format for dataset fields.

@Bill-hbrhbr Bill-hbrhbr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Polish README
  2. Use *_dir instead of *_path for directory properties.

Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/utils/classes.py Outdated
Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/data/README.md Outdated
Comment thread integration-tests/tests/data/README.md Outdated
| `end_ts` | The latest timestamp present in the dataset (ms). |
| `logs_subdir` | The name of the subdirectory containing logs. |
| `file_names` | A list of the files within `logs_subdir`. |
| `single_match_wildcard_query` | A wildcard query which, when searched, will match a single log message in the dataset. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| `single_match_wildcard_query` | A wildcard query which, when searched, will match a single log message in the dataset. |
| `single_match_wildcard_query` | A wildcard query that should match exactly one log message in the dataset. |

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about

Suggested change
| `single_match_wildcard_query` | A wildcard query which, when searched, will match a single log message in the dataset. |
| `single_match_wildcard_query` | A wildcard query that matches exactly one log message in the dataset. |

@Bill-hbrhbr Bill-hbrhbr left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return self.dataset_root_dir / "metadata.json"

@property
def logs_path(self) -> Path:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def logs_path(self) -> Path:
def logs_dir(self) -> Path:

Still prefer using _dir suffix but let's wait for @junhaoliao

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use logs_path - i agree it's more clear in representing an absolute path. we should probably document this convention somewhere in our developer guidelines though

return self.dataset_root_dir / "metadata.json"

@property
def logs_path(self) -> Path:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use logs_path - i agree it's more clear in representing an absolute path. we should probably document this convention somewhere in our developer guidelines though

@quinntaylormitchell quinntaylormitchell merged commit 1925bc8 into y-scope:main Apr 28, 2026
23 checks passed
@junhaoliao junhaoliao added this to the Mid-May 2026 milestone May 5, 2026
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
… move test data to central directory. (y-scope#2181)

Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Co-authored-by: Junhao Liao <junhao@junhao.ca>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants