refactor(clp-package): Upgrade Pydantic to V2; Update existing code to match V2 syntax; Fix various typing errors. by junhaoliao · Pull Request #1338 · y-scope/clp

junhaoliao · 2025-09-25T22:17:22Z

Description

Upgrade

Upgrade Pydantic to 2.11.9
Run bump-pydantic to auto-migrate existing code, changes include:
- Migrate @validator → @field_validator
- Migrate @root_validator(pre=True) → @model_validator(mode="before")
- Migrate class Config: ... syntax to model_config = ConfigDict(...)

Pydantic v2 Migration

Migrate @root_validator to @model_validator(mode="after")
Replace root_validator sweep interval validation with Field
Migrate parse_obj → model_validate
Migrate dict() → model_dump()
Migrate copy() → model_copy()
Migrate json() → model_dump_json()
Migrate parse_raw() → model_validate_json()
Add missing type annotations required by Pydantic v2
Rename validator parameters (values → data, field → value) for Pydantic v2 conventions

Fix other typing errors

Remove unnecessary parentheses in type annotations and return statements
Remove unused imports (Optional, ArchiveOutput)
Remove unused variables and initializations (clp_site_packages_dir, s3_utils variables)
Convert container_paths_to_extract_file_path to string before appending
Add explicit StorageType and StorageEngine casting for config values
Fix type hints for cmd parameters ([str] → List[str])
Remove unused exception variables in except blocks

Checklist

The PR satisfies the contribution guidelines.
This is a breaking change and that has been indicated in the PR title, OR this isn't a
breaking change.
Necessary docs have been updated, OR no docs need to be updated.

Validation performed

1. `task` to build the package.

2. Tested below

Ran all scripts in sbin:
1. admin-tools/archive-manager.sh
2. admin-tools/dataset-manager.sh
3. compress.sh
4. decompress.sh
5. search.sh
6. start-clp.sh
7. stop-clp.sh
And the webui:
1. Viewing ingestion stats / jobs
2. Submitting and viewing queries
3. Extracting stream files and opening them in the log viewer
With below config / use-case variations:
1. storage_engine: clp / clp-s
2. with clp-s:
  1. s3 input
  2. s3 archive output
  3. s3 stream output
  4. non-default dataset
3. garbage collector enablement

Summary by CodeRabbit

New Features
- None.
Refactor
- Migrated validation/serialization to Pydantic v2 across configs, jobs, tasks and scripts; switched parse/dict usages to model_validate/model_dump/model_copy/model_dump_json.
Bug Fixes
- Fixed extraction argument formatting, made storage/type values explicit via enums, tightened optional/default handling, improved background service error logging and exit paths, and adjusted some type annotations.
Chores
- Upgraded/added pydantic dependency to ^2.11.9 across affected packages.

coderabbitai · 2025-09-25T22:17:28Z

Walkthrough

Migrates codebase Pydantic v1→v2: replaces parse_obj/parse_raw/copy/dict/json → model_validate/model_validate_json/model_copy/model_dump/model_dump_json; swaps validator/root_validator → field_validator/model_validator; updates pydantic dependency to ^2.11.9; plus small typing, enum wrapping, and minor cleanup changes across components.

Changes

Cohort / File(s)	Summary of changes
Core CLP config & validators `components/clp-py-utils/clp_py_utils/clp_config.py`	Replaced `validator`/`root_validator` with `field_validator`/`model_validator`; added `ConfigDict`/`Field`/`PrivateAttr`; adjusted field types/defaults; switched serialization from `dict()`/`json()` to `model_dump()`/`model_dump_json()`.
Config loading (model_validate) `components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py`, `components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py`, `components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py`, `components/job-orchestration/job_orchestration/reducer/reducer.py`	Replaced `CLPConfig.parse_obj(...)` with `CLPConfig.model_validate(...)`; `garbage_collector` added broader exception logging and returns `1` on parse/validation failure.
Package utils: CLP model API updates `components/clp-package-utils/clp_package_utils/general.py`	Replaced `parse_obj`→`model_validate` and `copy(deep=True)`→`model_copy(deep=True)` for `CLPConfig` and nested models when generating container/worker configs.
Package utils scripts: enums, args, typing cleanup `components/clp-package-utils/clp_package_utils/scripts/archive_manager.py`, `components/clp-package-utils/clp_package_utils/scripts/decompress.py`, `components/clp-package-utils/clp_package_utils/scripts/start_clp.py`	Wraps config values with enum constructors (`StorageType(...)`, `StorageEngine(...)`), stringifies extraction-list path argument, tightens type annotations to `List[str]`, and removes unused local `clp_site_packages_dir`.
Native compression scripts & utils `components/clp-package-utils/clp_package_utils/scripts/native/compress.py`, `components/clp-package-utils/clp_package_utils/scripts/native/utils.py`	Replaced `dict`/`parse_obj`/`json` usages with `model_dump`/`model_validate`/`model_dump_json`/`model_validate_json`; constructs `OutputConfig` via `model_validate(...)`; uses `model_dump()` for job_config before msgpack; minor typing cleanup.
Executors: parsing/dumping changes `components/job-orchestration/job_orchestration/executor/compress/compression_task.py`, `components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py`, `components/job-orchestration/job_orchestration/executor/query/fs_search_task.py`, `components/job-orchestration/job_orchestration/executor/query/utils.py`, `components/job-orchestration/job_orchestration/executor/utils.py`	Replaced `parse_obj`/`parse_raw` with `model_validate`/`model_validate_json` and `dict()` → `model_dump()` for config/result serialization/deserialization.
Schedulers & job payloads `components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py`, `components/job-orchestration/job_orchestration/scheduler/compress/partition.py`, `components/job-orchestration/job_orchestration/scheduler/job_config.py`, `components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py`, `components/job-orchestration/job_orchestration/scheduler/scheduler_data.py`	Switched parsing/serialization to `model_validate`/`model_dump`/`model_dump_json`; replaced `validator`→`field_validator`; introduced `ConfigDict`/`model_config` where needed; set defaults for several optional scheduler data fields.
S3 utils typing refinements `components/clp-py-utils/clp_py_utils/s3_utils.py`	Adjusted local variable typing to use `Optional[...]` without initial `None` assignments; no behaviour or public API changes.
Pyproject: pydantic v2 update `components/clp-package-utils/pyproject.toml`, `components/clp-py-utils/pyproject.toml`, `components/job-orchestration/pyproject.toml`	Adds/updates `pydantic` dependency to `^2.11.9` in affected packages' pyproject.toml files.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Component
  participant Reader as read_yaml_config_file()
  participant PydModel as PydanticModel

  Note over Caller,Reader: Load raw config (YAML/dict)
  Caller->>Reader: read config
  Reader->>PydModel: model_validate(raw_config)
  alt validation succeeds
    PydModel-->>Caller: validated model instance
  else validation fails
    PydModel-->>Caller: raises ValidationError
  end

  Note over Caller,PydModel: Serialization / copy API change
  Caller->>PydModel: previously .dict() / .copy(deep=True)
  Caller->>PydModel: now .model_dump() / .model_copy(deep=True)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Refactor validators to use Field() instead of @field_validator for simple gt/lt constraints #1342 — touches validators and ArchiveOutput/StreamOutput in clp_config.py; this PR migrated validators to Pydantic v2 and field definitions that overlap the issue's refactor scope.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 2.53% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title clearly summarizes the primary change—upgrading to Pydantic V2—and notes the associated syntax updates and typing fixes in a single, coherent sentence that reflects the PR’s main objective.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.13.1)

components/clp-py-utils/clp_py_utils/clp_config.py

�[1;31mruff failed�[0m
�[1mCause:�[0m Failed to load configuration /ruff.toml
�[1mCause:�[0m Failed to parse /ruff.toml
�[1mCause:�[0m TOML parse error at line 26, column 3
|
26 | "RSE100", # Use of assert detected
| ^^^^^^^^
Unknown rule selector: RSE100

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

Public repositories are always opted into early access features.
You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

… to `@model_validator(mode="after")` and replace `root_validator` with `Field` for sweep interval validation.

…e annotation.

…antic v2) to QueryScheduler fields.

…compatibility with Pydantic v2.

…tibility with Pydantic v2.

…compatibility with Pydantic v2.

…ine casting for config values.

…ist[str] instead of [str].

… to string before appending to command list

…zations in s3_utils.py

…ement of s3_utils.py

…o `model_validate_json()` for compatibility with Pydantic v2.

…xcept block

… validators for clarity and consistency with Pydantic v2 conventions.

…`value` for consistency with Pydantic v2 conventions.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (13)

components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (1)
45-52: Add the config path to the validation failure log

Right now the log only prints the exception message, so operators can’t easily tell which config file blew up when several are in play. Including the file path will make diagnosis much quicker.
-        logger.error(err)
+        logger.error("Failed to load config from %s: %s", config_path, err)
components/clp-py-utils/clp_py_utils/s3_utils.py (1)
77-98: Fail fast when credentials are missing

Now that aws_credentials is typed as optional, we should check for None immediately. Otherwise a misconfigured AwsAuthentication instance (for example, one constructed programmatically and bypassing the Pydantic validator) will raise a confusing AttributeError when we access aws_credentials.access_key_id. Raising a clear ValueError keeps the error handling consistent with the other branches.
     elif AwsAuthType.credentials == auth.type:
         aws_credentials = auth.credentials
+        if aws_credentials is None:
+            raise ValueError("credentials must be set when type is AwsAuthType.credentials")
components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (1)
34-40: Fix enum join bug in validator and tighten the signature.

The error path will raise a TypeError due to joining Enum objects. Also prefer value as the parameter name.

Apply:
-    def valid_status(cls, field):
-        supported_status = [CompressionTaskStatus.SUCCEEDED, CompressionTaskStatus.FAILED]
-        if field not in supported_status:
-            raise ValueError(f'must be one of the following {"|".join(supported_status)}')
-        return field
+    def valid_status(cls, value):
+        supported_status = [CompressionTaskStatus.SUCCEEDED, CompressionTaskStatus.FAILED]
+        if value not in supported_status:
+            allowed = "|".join(s.name for s in supported_status)
+            raise ValueError(f"must be one of: {allowed}")
+        return value
Optional follow-up (cleaner type-level restriction; outside selected lines):
# Replace the field type to remove the need for manual validation:
from typing import Literal
status: Literal[CompressionTaskStatus.SUCCEEDED, CompressionTaskStatus.FAILED]
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (5)
49-60: Parameterize UPDATE to prevent SQL injection and quoting bugs

Directly interpolating values into SQL is unsafe and can break on quotes; use placeholders so the driver escapes/serializes values (including datetimes) correctly.

Apply this diff:
 def update_compression_task_metadata(db_cursor, task_id, kv):
     if not len(kv):
         raise ValueError("Must specify at least one field to update")

-    field_set_expressions = [f'{k}="{v}"' for k, v in kv.items()]
-    query = f"""
-    UPDATE {COMPRESSION_TASKS_TABLE_NAME}
-    SET {", ".join(field_set_expressions)}
-    WHERE id={task_id}
-    """
-    db_cursor.execute(query)
+    field_set_expressions = [f"{k}=%s" for k in kv.keys()]
+    query = f"""
+    UPDATE {COMPRESSION_TASKS_TABLE_NAME}
+    SET {", ".join(field_set_expressions)}
+    WHERE id=%s
+    """
+    params = [*kv.values(), task_id]
+    db_cursor.execute(query, params)
62-73: Parameterize incremental UPDATE to prevent SQL injection

Same issue here; build k=k+%s expressions and pass values separately.

Apply this diff:
 def increment_compression_job_metadata(db_cursor, job_id, kv):
     if not len(kv):
         raise ValueError("Must specify at least one field to update")

-    field_set_expressions = [f"{k}={k}+{v}" for k, v in kv.items()]
-    query = f"""
-    UPDATE {COMPRESSION_JOBS_TABLE_NAME}
-    SET {", ".join(field_set_expressions)}
-    WHERE id={job_id}
-    """
-    db_cursor.execute(query)
+    field_set_expressions = [f"{k}={k}+%s" for k in kv.keys()]
+    query = f"""
+    UPDATE {COMPRESSION_JOBS_TABLE_NAME}
+    SET {", ".join(field_set_expressions)}
+    WHERE id=%s
+    """
+    params = [*kv.values(), job_id]
+    db_cursor.execute(query, params)
550-561: Good Pydantic v2 migration; include exception detail in error output

model_validate is correct. Consider returning the underlying exception message to aid debugging.

[Suggest diff below keeps the same logging while surfacing the cause in the task result.]
-    except Exception as ex:
-        error_msg = "Failed to load worker config"
+    except Exception as ex:
+        error_msg = f"Failed to load worker config: {ex}"
         logger.exception(error_msg)
         return CompressionTaskResult(
             task_id=task_id,
             status=CompressionTaskStatus.FAILED,
             duration=0,
             error_message=error_msg,
         )
396-405: Ensure stderr log file is closed on exceptions

Open file handle isn’t protected by a context manager; a mid-function exception can leak the descriptor.
-    stderr_log_file = open(stderr_log_path, "w")
+    with open(stderr_log_path, "w") as stderr_log_file:
+        # Start compression
+        logger.debug("Compressing...")
+        compression_successful = False
+        proc = subprocess.Popen(
+            compression_cmd, stdout=subprocess.PIPE, stderr=stderr_log_file, env=compression_env
+        )
+        # ... move the rest of the function contents that use stderr_log_file inside this with-block ...
502-504: Guard unlink to avoid spurious FileNotFoundError

Only unlink the temp file if it exists.
-        if logs_list_path:
-            logs_list_path.unlink()
+        if logs_list_path.exists():
+            logs_list_path.unlink()
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)
254-259: Guard against unexpected extras when validating OutputConfig.

If OutputConfig doesn’t ignore extras, this may raise. Safer to exclude fields not in OutputConfig.
-    clp_output_config = OutputConfig.model_validate(clp_config.archive_output.model_dump())
+    clp_output_config = OutputConfig.model_validate(
+        clp_config.archive_output.model_dump(exclude={"storage", "retention_period"})
+    )
If OutputConfig is configured with extra="ignore", this is optional. Please confirm.
components/clp-package-utils/clp_package_utils/general.py (1)
315-321: Deep‑copy stream_output for consistency and safety.

Avoids accidental mutation of the root config by workers.
-    worker_config.stream_output = clp_config.stream_output
+    worker_config.stream_output = clp_config.stream_output.model_copy(deep=True)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (3)
514-526: Use exclude_none=True to keep task payloads minimal.

Shrinks serialized job config sent to workers.
-    job_config = job.get_config().model_dump()
+    job_config = job.get_config().model_dump(exclude_none=True)
999-1012: Bug: Comparing task status to job status enum.

This causes successful extraction tasks to be treated as failures.
-        if not QueryJobStatus.SUCCEEDED == task_result.status:
+        if task_result.status != QueryTaskStatus.SUCCEEDED:
1162-1169: Handle empty config files before validation.

read_yaml_config_file can return None; guard to avoid confusing errors.
-        clp_config = CLPConfig.model_validate(read_yaml_config_file(config_path))
+        raw = read_yaml_config_file(config_path)
+        clp_config = CLPConfig() if raw is None else CLPConfig.model_validate(raw)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 173755f and d11e291.

⛔ Files ignored due to path filters (3)

components/clp-package-utils/poetry.lock is excluded by !**/*.lock
components/clp-py-utils/poetry.lock is excluded by !**/*.lock
components/job-orchestration/poetry.lock is excluded by !**/*.lock

📒 Files selected for processing (25)

components/clp-package-utils/clp_package_utils/general.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/decompress.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3 hunks)
components/clp-package-utils/clp_package_utils/scripts/native/utils.py (1 hunks)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
components/clp-package-utils/pyproject.toml (1 hunks)
components/clp-py-utils/clp_py_utils/clp_config.py (25 hunks)
components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (1 hunks)
components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (1 hunks)
components/clp-py-utils/clp_py_utils/s3_utils.py (4 hunks)
components/clp-py-utils/pyproject.toml (1 hunks)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3 hunks)
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (4 hunks)
components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (3 hunks)
components/job-orchestration/job_orchestration/executor/query/utils.py (1 hunks)
components/job-orchestration/job_orchestration/executor/utils.py (1 hunks)
components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (2 hunks)
components/job-orchestration/job_orchestration/reducer/reducer.py (1 hunks)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3 hunks)
components/job-orchestration/job_orchestration/scheduler/compress/partition.py (2 hunks)
components/job-orchestration/job_orchestration/scheduler/job_config.py (2 hunks)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (8 hunks)
components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (3 hunks)
components/job-orchestration/pyproject.toml (1 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-01-16T16:58:43.190Z

Learnt from: haiqi96
PR: y-scope/clp#651
File: components/clp-package-utils/clp_package_utils/scripts/compress.py:0-0
Timestamp: 2025-01-16T16:58:43.190Z
Learning: In the clp-package compression flow, path validation and error handling is performed at the scheduler level rather than in the compress.py script to maintain simplicity and avoid code duplication.

Applied to files:

components/clp-package-utils/clp_package_utils/scripts/decompress.py
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py

🧬 Code graph analysis (16)

components/job-orchestration/job_orchestration/reducer/reducer.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/job-orchestration/job_orchestration/executor/utils.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

WorkerConfig (1041-1058)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/clp-package-utils/clp_package_utils/scripts/native/compress.py (1)

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

OutputConfig (37-43)

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/clp-py-utils/clp_py_utils/s3_utils.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (4)

S3Credentials (506-523)

S3Storage (608-639)

FsStorage (588-605)

AwsAuthentication (526-556)

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (3)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

WorkerConfig (1041-1058)

Database (177-281)

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

ClpIoConfig (46-48)

PathsToCompress (16-20)

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

SQL_Adapter (60-131)

components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (3)

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

SearchJobConfig (75-93)

QueryJobConfig (59-60)

components/job-orchestration/job_orchestration/scheduler/query/reducer_handler.py (1)

ReducerHandlerMessageQueues (26-45)

components/job-orchestration/job_orchestration/scheduler/constants.py (2)

QueryJobType (77-80)

QueryTaskStatus (64-74)

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

StorageType (114-116)

StorageEngine (103-105)

components/clp-package-utils/clp_package_utils/general.py (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (4)

components/job-orchestration/job_orchestration/scheduler/job_config.py (4)

ExtractIrJobConfig (63-67)

ExtractJsonJobConfig (70-72)

QueryJobConfig (59-60)

SearchJobConfig (75-93)

components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (5)

get_config (59-59)

get_config (68-69)

get_config (78-79)

get_config (96-97)

QueryTaskResult (100-104)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (4)

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

ClpIoConfig (46-48)

components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (1)

CompressionTaskResult (28-40)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (3)

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

SQL_Adapter (60-131)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

Database (177-281)

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

SearchJobConfig (75-93)

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (3)

components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

ExtractIrJobConfig (63-67)

ExtractJsonJobConfig (70-72)

components/clp-py-utils/clp_py_utils/sql_adapter.py (1)

SQL_Adapter (60-131)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

Database (177-281)

components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (2)

components/clp-py-utils/clp_py_utils/clp_config.py (1)

CLPConfig (848-1038)

components/clp-py-utils/clp_py_utils/core.py (1)

read_yaml_config_file (56-62)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

components/clp-py-utils/clp_py_utils/clp_logging.py (2)

is_valid_logging_level (32-33)

get_valid_logging_level (28-29)

components/clp-py-utils/clp_py_utils/core.py (1)

make_config_path_absolute (42-53)

🔇 Additional comments (43)

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

152-154: Conversion avoids shlex.join type errors

Casting the mounted path to str keeps the command list homogeneous and prevents shlex.join(cmd) from blowing up on a Path. Nice catch.

components/clp-package-utils/clp_package_utils/scripts/native/utils.py (1)

66-69: Pydantic v2 serialisation update looks good

Switching to model_dump() aligns with Pydantic v2 expectations and keeps the msgpack payload unchanged, so no further action needed.

components/job-orchestration/job_orchestration/executor/utils.py (1)

20-20: Upgrade to model_validate looks correct

Migration from parse_obj to model_validate keeps the loader behaviour intact while aligning with Pydantic v2.

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

117-124: Type hint now matches how cmd is used

The List[str] annotation reflects the in-place list mutation pattern we rely on. Looks good.

141-158: Updated signature stays consistent with call sites

Changing cmd_to_run to List[str] keeps the contract aligned with how we build the docker exec commands. No issues spotted.

components/job-orchestration/job_orchestration/executor/query/utils.py (1)

34-38: Migration to model_dump() looks correct

Replacing dict() with model_dump() keeps the return value as a plain dict while aligning with the Pydantic v2 API used elsewhere in the PR. Callers expecting the serialised payload will continue to work.

components/job-orchestration/job_orchestration/executor/query/fs_search_task.py (3)

187-188: Accurate switch to model_validate for Database.

The migration away from parse_obj keeps the validation semantics intact while aligning with the Pydantic v2 API surface—no issues here.

201-202: SearchJobConfig validation updated correctly.

Adopting model_validate ensures the config payload still round-trips through Pydantic with the expected constraints; looking good.

231-231: model_dump is the right replacement.

Serializing the task results through model_dump() mirrors the old behaviour and is V2-compliant—no further changes needed.

components/job-orchestration/job_orchestration/scheduler/compress/partition.py (2)

37-44: Pydantic v2 JSON serialisation swap looks good

Using model_dump_json(exclude_none=True) keeps the payload identical to the previous json() call while aligning with the v2 API, so the task argument remains a JSON string as expected.

83-94: model_dump usage maintains the brotli/msgpack payload

Switching to model_dump/model_dump_json preserves the dict and JSON structures that msgpack and downstream consumers expect, so the compressed payload and the task hand-off remain compatible.

components/job-orchestration/job_orchestration/garbage_collector/garbage_collector.py (1)

87-88: LGTM on the task failure logging

Dropping the unused exception binding still preserves the traceback via logger.exception, so this cleanup looks good to me.

components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (4)

57-75: Migration to model_validate reads clean.
Switching IR job validation to model_validate(...) keeps the semantics intact while aligning this path with the Pydantic v2 API.

91-145: JSON job parsing looks spot on.
Adopting ExtractJsonJobConfig.model_validate(...) ensures we stay consistent with the new validator surface without altering downstream behaviour.

202-203: Database config re-validation checks out.
Rehydrating the DB config with Database.model_validate(...) maintains the same guarantees under Pydantic v2 before handing it to SQL_Adapter.

295-295: Result serialisation change LGTM.
model_dump() is the right v2 replacement for the previous .dict() call and preserves the response contract.

components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (6)

19-19: Pydantic v2 imports look correct.

Using BaseModel, ConfigDict, and field_validator aligns with v2.

32-32: LGTM: Optional default for error_message.

This matches Pydantic v2 semantics for truly optional fields.

52-53: LGTM: Optional defaults added.

Relaxing these to None avoids required-field friction in v2.

83-85: LGTM: model_config with arbitrary_types_allowed.

Necessary for asyncio.Task and custom queues on this model only.

90-91: LGTM: Optional async/task fields defaulting to None.

Matches the model_config change and prevents validation errors.

104-104: LGTM: Optional error_log_path.

Consistent with v2 optional field handling.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

566-567: LGTM on Database.model_validate and adapter construction

Correct v2 migration; types align with SQL_Adapter expectations.

612-613: LGTM on returning model_dump()

Correct v2 replacement for dict(); ensures a plain dict for Celery serialization.

components/clp-package-utils/clp_package_utils/general.py (2)

239-241: Good migration to Pydantic v2 copy API.

422-423: LGTM: config loading via model_validate.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (5)

130-138: LGTM: ExtractIrJobConfig migration to model_validate.

176-184: LGTM: ExtractJsonJobConfig migration to model_validate.

666-670: LGTM: Validation to access dataset via QueryJobConfig.

688-698: LGTM: SearchJobConfig migration to model_validate.

905-919: LGTM: QueryTaskResult migration to model_validate.

components/job-orchestration/job_orchestration/reducer/reducer.py (1)

46-49: LGTM: model_validate for config load with error handling.
components/clp-py-utils/clp_py_utils/clp_config.py (1)
613-619: Same issue for staging_directory.

Apply the same fix as for directory.
-    def validate_staging_directory(cls, value):
-        if "" == value:
+    def validate_staging_directory(cls, value):
+        if isinstance(value, pathlib.Path) and value.as_posix() == "":
             raise ValueError("staging_directory cannot be empty")
         return value
components/clp-package-utils/pyproject.toml (1)

16-16: Pydantic v2 dependency bump looks solid.
The new caret constraint lines up with the rest of the migration and keeps Poetry’s view of the dependency tree consistent.

components/job-orchestration/pyproject.toml (1)

21-21: Version bump aligns with the orchestration stack.
Great to see the scheduler package tracking the same 2.11.9 floor so runtime workers and schedulers stay in lock-step.

components/clp-py-utils/pyproject.toml (1)

18-18: Dependency update keeps shared utils consistent.
Matching the shared utils’ requirement with the rest of the codebase avoids subtle resolver mismatches.

components/clp-package-utils/clp_package_utils/scripts/archive_manager.py (1)

184-190: Explicit enum casting is the right call.
Normalizing the config-derived strings back into StorageType/StorageEngine keeps the downstream comparisons resilient to serialisation changes in v2.

components/clp-py-utils/clp_py_utils/initialize-orchestration-db.py (1)

47-48: The model_validate switch is on point.
Swapping out parse_obj for model_validate is the compatible v2 path and keeps the error handling logic unchanged.

components/clp-py-utils/clp_py_utils/initialize-clp-metadata-db.py (1)

48-49: Good migration to model_validate.
This keeps the metadata initialiser in sync with the rest of the config loaders while preserving the existing failure modes.

components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py (3)

243-245: Nice use of model_validate on packed configs.
Validating the msgpack payloads through the v2 API maintains the prior guarding without introducing extra conversions.

417-418: Task result validation remains robust.
Re-validating each Celery result with CompressionTaskResult.model_validate keeps the scheduler protected against malformed payloads under the new API.

487-488: Config loading migration looks good.
Using CLPConfig.model_validate here mirrors the other entry points and maintains consistent credential loading.

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

87-93: Field validator update is clean.
Recasting the network-address check with field_validator keeps the port guard intact while complying with the v2 decorator pattern.

coderabbitai · 2025-09-26T00:09:50Z

+                msgpack.packb(clp_io_config.model_dump(exclude_none=True, exclude_unset=True)), quality=4
            )


🧹 Nitpick | 🔵 Trivial

Use msgpack with explicit use_bin_type=True for safety.

Prevents potential string/bytes ambiguity across environments.

- msgpack.packb(clp_io_config.model_dump(exclude_none=True, exclude_unset=True)), quality=4 + msgpack.packb( + clp_io_config.model_dump(exclude_none=True, exclude_unset=True), + use_bin_type=True + ), + quality=4

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

msgpack.packb(clp_io_config.model_dump(exclude_none=True, exclude_unset=True)), quality=4

)

msgpack.packb(

clp_io_config.model_dump(exclude_none=True, exclude_unset=True),

use_bin_type=True

),

quality=4

🤖 Prompt for AI Agents

In components/clp-package-utils/clp_package_utils/scripts/native/compress.py around lines 120-121, the msgpack.packb call is missing use_bin_type=True which can cause string/bytes ambiguity across environments; update the call to include use_bin_type=True (i.e., msgpack.packb(clp_io_config.model_dump(...), use_bin_type=True, quality=4) or add use_bin_type=True alongside the existing args) so msgpack serializes binary data reliably.

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab83316 and c059324.

📒 Files selected for processing (1)

components/clp-py-utils/clp_py_utils/clp_config.py (26 hunks)

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-09-26T21:12:21.615Z

Learnt from: junhaoliao
PR: y-scope/clp#1338
File: components/clp-py-utils/clp_py_utils/clp_config.py:753-753
Timestamp: 2025-09-26T21:12:21.615Z
Learning: Python linting rules that catch YODA conditions exist but are not enabled for Python modules in the CLP package due to time constraints. The team prefers to avoid YODA conditions in Python for consistency.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2024-10-10T05:49:36.543Z

Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/SchemaTreeNode.hpp:0-0
Timestamp: 2024-10-10T05:49:36.543Z
Learning: Yoda conditions are part of the project's coding standards for C++ code and should be used in conditional statements.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2025-07-28T23:14:57.751Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2025-08-17T15:26:22.790Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

🧬 Code graph analysis (1)

components/clp-py-utils/clp_py_utils/clp_config.py (2)

components/clp-py-utils/clp_py_utils/clp_logging.py (2)

is_valid_logging_level (32-33)

get_valid_logging_level (28-29)

components/clp-py-utils/clp_py_utils/core.py (1)

make_config_path_absolute (42-53)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: package-image
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (41)

components/clp-py-utils/clp_py_utils/clp_config.py (41)
4-14: LGTM: Pydantic v2 imports correctly added.

The new imports align with Pydantic v2 API requirements and support the migration from v1 decorators and methods.

134-142: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod is properly implemented according to Pydantic v2 patterns.

144-152: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod is properly implemented according to Pydantic v2 patterns.

154-174: LGTM: Pydantic v2 model validator correctly applied.

The migration from @root_validator to @model_validator(mode="after") is correctly implemented. The validator receives the model instance (self) and returns it, which aligns with Pydantic v2 expectations.

189-197: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod is properly implemented according to Pydantic v2 patterns.

199-204: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod is properly implemented according to Pydantic v2 patterns.

206-217: LGTM: Validator correctly uses shared helper.

The migration to @field_validator and delegation to _validate_port follows the established pattern for consistent validation across the codebase.

268-268: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

290-295: LGTM: Helper function correctly updated with class parameter.

The helper function signature change to accept cls parameter supports the shared validation pattern used by field validators.

298-300: LGTM: Helper function correctly updated with class parameter.

The helper function signature change to accept cls parameter supports the shared validation pattern used by field validators.

316-320: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod and delegation to shared helper is properly implemented.

324-347: LGTM: Pydantic v2 field validators correctly applied.

All field validators have been properly migrated from @validator to @field_validator with @classmethod, and the port validator correctly delegates to the shared helper.

353-357: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod and delegation to shared helper is properly implemented.

363-367: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod and delegation to shared helper is properly implemented.

378-389: LGTM: Pydantic v2 field validators correctly applied.

Both field validators have been properly migrated from @validator to @field_validator with @classmethod, and the port validator correctly delegates to the shared helper.

392-392: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

418-442: LGTM: Pydantic v2 field validators correctly applied.

All three field validators have been properly migrated from @validator to @field_validator with @classmethod. The host and base_port validators correctly delegate to shared helpers.

452-486: LGTM: Pydantic v2 field validators correctly applied.

All field validators have been properly migrated from @validator to @field_validator with @classmethod, with appropriate validation logic for each field.

499-510: LGTM: Pydantic v2 field validators correctly applied.

Both field validators have been properly migrated from @validator to @field_validator with @classmethod, with the port validator correctly delegating to the shared helper.

513-513: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

538-552: LGTM: Pydantic v2 field validators correctly applied.

Both field validators have been properly migrated from @validator to @field_validator with @classmethod with appropriate validation logic.

565-585: LGTM: Pydantic v2 model validator correctly applied.

The migration from @root_validator(pre=True) to @model_validator(mode="before") is correctly implemented. The validator receives raw data and returns it, which aligns with Pydantic v2 "before" mode expectations.

594-606: LGTM: Pydantic v2 field validators correctly applied.

Both field validators have been properly migrated from @validator to @field_validator with @classmethod with appropriate validation logic.

614-614: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

621-625: LGTM: Directory validator updated for Pydantic v2.

The migration to @field_validator with mode="before" ensures the validator runs before Pydantic's type conversion to pathlib.Path, allowing string validation to work properly.

631-633: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

641-645: LGTM: Directory validator updated for Pydantic v2.

The migration to @field_validator with mode="before" ensures the validator runs before Pydantic's type conversion to pathlib.Path, allowing string validation to work properly.

647-658: Simplify S3 key_prefix validation; remove unnecessary hasattr check.

Since self.s3_config is validated by Pydantic, the hasattr check is unnecessary and can be removed.
     @model_validator(mode="after")
     def validate_key_prefix(self):
-        s3_config = self.s3_config
-        if not hasattr(s3_config, "key_prefix"):
-            raise ValueError("s3_config must have field key_prefix")
-        key_prefix = s3_config.key_prefix
+        key_prefix = self.s3_config.key_prefix
         if "" == key_prefix:
             raise ValueError("s3_config.key_prefix cannot be empty")
         if not key_prefix.endswith("/"):
             raise ValueError('s3_config.key_prefix must end with "/"')

         return self
664-666: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

722-762: LGTM: Pydantic v2 field validators correctly applied.

All field validators have been properly migrated from @validator to @field_validator with @classmethod with appropriate validation logic for each field.

771-773: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

780-785: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod is properly implemented according to Pydantic v2 patterns.

794-796: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

805-831: LGTM: Pydantic v2 field validators correctly applied.

All field validators have been properly migrated from @validator to @field_validator with @classmethod, with host and port validators correctly delegating to shared helpers.

835-838: LGTM: Pydantic v2 configuration and Field usage correctly applied.

The migration from class Config to model_config = ConfigDict() and the use of Field with constraints (gt=0) follows Pydantic v2 best practices.

845-849: LGTM: Pydantic v2 field validator correctly applied.

The migration from @validator to @field_validator with @classmethod and delegation to shared helper is properly implemented.

856-866: LGTM: Pydantic v2 field validators correctly applied.

Both field validators have been properly migrated from @validator to @field_validator with @classmethod, correctly delegating to shared helpers.

885-885: LGTM: Default instance correctly specified.

The explicit default assignment maintains the existing behaviour after Pydantic v2 migration.

1034-1034: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

1079-1079: LGTM: Pydantic v2 model method correctly updated.

The migration from dict() to model_dump() aligns with Pydantic v2 API.

1098-1110: Fix directory validation to handle both string and Path inputs.

The current implementation only accepts strings, but the directory fields are typed as pathlib.Path. This will cause validation failures when Path objects (including default values) are passed.
 def _validate_directory(value: Any):
     """
     Validates that the given value represents a directory path.

     :param value:
-    :raise ValueError: if the value is not of type str.
-    :raise ValueError: if the value is an empty string.
+    :raise ValueError: if the value is not a string or pathlib.Path, or if it is empty.
     """
-    if not isinstance(value, str):
-        raise ValueError("must be a string.")
-
-    if "" == value.strip():
-        raise ValueError("cannot be empty")
+    if isinstance(value, pathlib.Path):
+        s = str(value)
+    elif isinstance(value, str):
+        s = value
+    else:
+        raise ValueError("must be a string or pathlib.Path")
+    
+    if "" == s.strip():
+        raise ValueError("cannot be empty")

coderabbitai · 2025-09-26T21:39:43Z

+def _validate_port(cls, value):
    min_valid_port = 0
    max_valid_port = 2**16 - 1
-    if min_valid_port > field or max_valid_port < field:
+    if min_valid_port > value or max_valid_port < value:
        raise ValueError(
            f"{cls.__name__}.port is not within valid range " f"{min_valid_port}-{max_valid_port}."
        )


🧹 Nitpick | 🔵 Trivial

Simplify comparison; avoid negated inequality.

Use direct comparison for clarity instead of negated inequality.

- if min_valid_port > value or max_valid_port < value: + if value < min_valid_port or value > max_valid_port: raise ValueError( f"{cls.__name__}.port is not within valid range " f"{min_valid_port}-{max_valid_port}." )

Based on learnings

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _validate_port(cls, value):

min_valid_port = 0

max_valid_port = 2**16 - 1

if min_valid_port > field or max_valid_port < field:

if min_valid_port > value or max_valid_port < value:

raise ValueError(

f"{cls.__name__}.port is not within valid range " f"{min_valid_port}-{max_valid_port}."

)

def _validate_port(cls, value):

min_valid_port = 0

max_valid_port = 2**16 - 1

if value < min_valid_port or value > max_valid_port:

raise ValueError(

f"{cls.__name__}.port is not within valid range " f"{min_valid_port}-{max_valid_port}."

)

🤖 Prompt for AI Agents

In components/clp-py-utils/clp_py_utils/clp_config.py around lines 303 to 309, replace the negated/indirect range check ("if min_valid_port > value or max_valid_port < value") with a direct comparison such as "if value < min_valid_port or value > max_valid_port" to improve clarity; keep the existing ValueError message and bounds variables unchanged.

junhaoliao · 2025-09-26T21:50:53Z

-
-        query_engine = package.get("query_engine")
-        presto = values.get("presto")
+    @model_validator(mode="after")


cc @davemarco / @hoophalab for awareness of the change

this probably will break the original desired behaviour. read the comment above

i tried for hours to switch branches and come up with different inputs, but i found no behavioural difference with or without pre=True. i can't be sure if i understand the We set pre=True ... presto config itself comment correctly. (it seems the comment was proposed by @kirkrodrigues ?)

could you share some config example to help understand the difference?

i did dig up the conversations in #1198 and found this from Haiqi:

#1198 (review)
One note to add:
The reason we put pre=true in root_validator is to prevent validating Presto config when it is not allowed.

For example, if user specifies a Presto config with invalid port when query_engine is NOT presto. Without pre=true, the script will first complain about invalid port, and only if user fixes the port, it will then let user know that presto is not allowed when query_engine is not presto.

so i tested with this clp-config.yaml

package: storage_engine: "clp" query_engine: "presto" presto: host: 1 port: -1

with the existing code on main@a040dfb4f4a14ef22f135dac13d90ff2b6984e07:
junhao@ASUS-X870E:~/workspace/2-clp/build/clp-package$ ./sbin/start-clp.sh 2025-09-28T21:48:13.377 ERROR [start_clp] Failed to load config. Traceback (most recent call last): File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/scripts/start_clp.py", line 1185, in main clp_config = load_config_file(config_file_path, default_config_file_path, clp_home) File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/general.py", line 422, in load_config_file clp_config = CLPConfig.parse_obj(raw_clp_config) File "pydantic/main.py", line 532, in pydantic.main.BaseModel.parse_obj File "pydantic/main.py", line 347, in pydantic.main.BaseModel.__init__ pydantic.error_wrappers.ValidationError: 2 validation errors for CLPConfig package -> __root__ query_engine 'presto' is only compatible with storage_engine 'clp-s'. (type=value_error) presto -> port Presto.port is not within valid range 0-65535. (type=value_error)

I modified the code to use @root_validator without pre=True:
diff --git a/components/clp-py-utils/clp_py_utils/clp_config.py b/components/clp-py-utils/clp_py_utils/clp_config.py --- a/components/clp-py-utils/clp_py_utils/clp_config.py (revision a040dfb4f4a14ef22f135dac13d90ff2b6984e07) +++ b/components/clp-py-utils/clp_py_utils/clp_config.py (date 1759110179774) @@ -989,7 +989,7 @@ # We set `pre=True` so that we print errors about a mismatch between the query engine and presto # config before errors about the presto config itself. - @root_validator(pre=True) + @root_validator def validate_presto_config(cls, values): package = values.get("package") if not isinstance(package, Package): @@ -997,7 +997,7 @@ # later and throw an error). return values - query_engine = package.get("query_engine") + query_engine = package.query_engine presto = values.get("presto") if query_engine == QueryEngine.PRESTO and presto is None: raise ValueError(
->
junhao@ASUS-X870E:~/workspace/2-clp/build/clp-package$ ./sbin/start-clp.sh 2025-09-28T21:50:54.323 ERROR [start_clp] Failed to load config. Traceback (most recent call last): File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/scripts/start_clp.py", line 1185, in main clp_config = load_config_file(config_file_path, default_config_file_path, clp_home) File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/general.py", line 422, in load_config_file clp_config = CLPConfig.parse_obj(raw_clp_config) File "pydantic/main.py", line 532, in pydantic.main.BaseModel.parse_obj File "pydantic/main.py", line 347, in pydantic.main.BaseModel.__init__ pydantic.error_wrappers.ValidationError: 2 validation errors for CLPConfig package -> __root__ query_engine 'presto' is only compatible with storage_engine 'clp-s'. (type=value_error) presto -> port Presto.port is not within valid range 0-65535. (type=value_error)

with the code on this PR branch where i changed it to @model_validate(mode="after) with Pydantic V2, which is similar to @root_validate in Pydantic V1: https://github.com/y-scope/clp/blob/2ca7cb95ecc25ca9ecd844177d267df54153cc74/components/clp-py-utils/clp_py_utils/clp_config.py
junhao@ASUS-X870E:~/workspace/2-clp/build/clp-package$ ./sbin/start-clp.sh 2025-09-28T22:00:01.268 ERROR [start_clp] Failed to load config. Traceback (most recent call last): File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/scripts/start_clp.py", line 1183, in main clp_config = load_config_file(config_file_path, default_config_file_path, clp_home) File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/clp_package_utils/general.py", line 422, in load_config_file clp_config = CLPConfig.model_validate(raw_clp_config) File "/home/junhao/workspace/2-clp/build/clp-package/lib/python3/site-packages/pydantic/main.py", line 705, in model_validate return cls.__pydantic_validator__.validate_python( pydantic_core._pydantic_core.ValidationError: 2 validation errors for CLPConfig package Value error, query_engine 'presto' is only compatible with storage_engine 'clp-s'. [type=value_error, input_value={'storage_engine': 'clp',...query_engine': 'presto'}, input_type=dict] For further information visit https://errors.pydantic.dev/2.11/v/value_error presto.port Value error, Presto.port is not within valid range 0-65535. [type=value_error, input_value=-1, input_type=int] For further information visit https://errors.pydantic.dev/2.11/v/value_error

The query_engine mismatch error was reported in all above cases.

in case we no longer need to replicate the previous handling, i have removed the outdated comment in 167f2a6. If mode="before" is indeed needed, i can revert and update the comment.

… field_validator

… for s3_config key_prefix.

…config.

gibber9809

Nice work! Just a few minor comments/clarifications, then this change should be good to go.

gibber9809 · 2025-09-29T15:06:45Z

+    env_vars: Optional[Dict[str, str]]
+    aws_credentials: Optional[S3Credentials]


Just for my own understanding, why do we remove the explicit = None for these optional arguments here, but add it at clp_config.py:538? Also seems like there are several other places with explicit = None that we've left untouched.

why do we remove the explicit = None for these optional arguments here

the None value is never used because new values will be assigned by one of the if-branches. the "unused variable error" is actually marked by PyCharm

add it at clp_config.py:538

Pydantic V2 requires that all fields with optional types be assigned None values at their declarations inside the models.

Also seems like there are several other places with explicit = None that we've left untouched.

Likely in some later branches the variable is untouched, so the None value can be used.

Eden-D-Zhang

No comments from me, just existing suggestion from Devin.

@gibber9809

…)` - Apply suggestion from @gibber9809 Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa6a22d and 925f878.

📒 Files selected for processing (1)

components/clp-py-utils/clp_py_utils/clp_config.py (27 hunks)

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-09-26T21:12:21.655Z

Learnt from: junhaoliao
PR: y-scope/clp#1338
File: components/clp-py-utils/clp_py_utils/clp_config.py:753-753
Timestamp: 2025-09-26T21:12:21.655Z
Learning: Python linting rules that catch YODA conditions exist but are not enabled for Python modules in the CLP package due to time constraints. The team prefers to avoid YODA conditions in Python for consistency.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2024-10-10T05:49:36.543Z

Learnt from: LinZhihao-723
PR: y-scope/clp#554
File: components/core/src/clp/ffi/SchemaTreeNode.hpp:0-0
Timestamp: 2024-10-10T05:49:36.543Z
Learning: Yoda conditions are part of the project's coding standards for C++ code and should be used in conditional statements.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2025-07-28T23:14:57.751Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:42-45
Timestamp: 2025-07-28T23:14:57.751Z
Learning: In the CLP project, the team prefers Yoda-style conditions (e.g., `0 == len(dataset_name)`) over standard Python idioms to help catch accidental assignment operators (writing `=` instead of `==`). Additionally, the team finds using `not` for string emptiness checks ambiguous and prefers explicit length comparisons.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

📚 Learning: 2025-08-17T15:26:22.790Z

Learnt from: Bill-hbrhbr
PR: y-scope/clp#1100
File: integration-tests/tests/utils/config.py:0-0
Timestamp: 2025-08-17T15:26:22.790Z
Learning: In the CLP project, Yoda-style conditions (literal on the left) are only used for `== 0` checks, not for other comparison operators like `>` or `<`. This is specifically to catch accidental assignment operators (writing `=` instead of `==`) in equality comparisons against zero.

Applied to files:

components/clp-py-utils/clp_py_utils/clp_config.py

🧬 Code graph analysis (1)

components/clp-py-utils/clp_py_utils/clp_config.py (3)

components/clp-package-utils/clp_package_utils/general.py (1)

validate_port (195-207)

components/clp-py-utils/clp_py_utils/clp_logging.py (2)

is_valid_logging_level (32-33)

get_valid_logging_level (28-29)

components/clp-py-utils/clp_py_utils/core.py (1)

make_config_path_absolute (42-53)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: package-image
GitHub Check: lint-check (macos-15)
GitHub Check: lint-check (ubuntu-24.04)

🔇 Additional comments (1)

components/clp-py-utils/clp_py_utils/clp_config.py (1)
303-309: Avoid Yoda-style comparison for consistency.

Line 306 uses min_valid_port > value or max_valid_port < value. Per project guidelines, Yoda conditions should be avoided in Python for consistency.

Based on learnings

Apply this diff:
 def _validate_port(cls, value):
     min_valid_port = 0
     max_valid_port = 2**16 - 1
-    if min_valid_port > value or max_valid_port < value:
+    if value < min_valid_port or value > max_valid_port:
         raise ValueError(
             f"{cls.__name__}.port is not within valid range " f"{min_valid_port}-{max_valid_port}."
         )

coderabbitai · 2025-09-29T22:32:54Z

+def _validate_directory(value: Any):
+    """
+    Validates that the given value represents a directory path.
+
+    :param value:
+    :raise ValueError: if the value is not of type str.
+    :raise ValueError: if the value is an empty string.
+    """
+    if not isinstance(value, str):
+        raise ValueError("must be a string.")
+
+    if "" == value.strip():
+        raise ValueError("cannot be empty")


🧹 Nitpick | 🔵 Trivial

Clarify type handling in _validate_directory.

The helper only accepts str but validators calling it (lines 626-630, 646-650) use mode="before" with fields typed as pathlib.Path. While this works because YAML parsing produces strings before Pydantic converts them to Path objects, the implementation and docstring could be clearer:

The docstring states "if value is not of type str" but doesn't explain why Path objects would be invalid

The error message "must be a string" is confusing for a field declared as pathlib.Path

Consider either:

Updating the docstring to clarify this runs in "before" mode on raw string input

Or handling both str and Path types if the helper might be reused in "after" validators

Apply this diff to improve clarity:

def _validate_directory(value: Any): """ Validates that the given value represents a directory path. + Expected to run in 'before' mode validators on raw string input. :param value: - :raise ValueError: if the value is not of type str. + :raise ValueError: if the value is not a string (raw config input). :raise ValueError: if the value is an empty string. """ if not isinstance(value, str): - raise ValueError("must be a string.") + raise ValueError("directory must be a string in raw config.") if "" == value.strip(): raise ValueError("cannot be empty")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _validate_directory(value: Any):

"""

Validates that the given value represents a directory path.

:param value:

:raise ValueError: if the value is not of type str.

:raise ValueError: if the value is an empty string.

"""

if not isinstance(value, str):

raise ValueError("must be a string.")

if "" == value.strip():

raise ValueError("cannot be empty")

def _validate_directory(value: Any):

"""

Validates that the given value represents a directory path.

Expected to run in 'before' mode validators on raw string input.

:param value:

:raise ValueError: if the value is not a string (raw config input).

:raise ValueError: if the value is an empty string.

"""

if not isinstance(value, str):

raise ValueError("directory must be a string in raw config.")

if "" == value.strip():

raise ValueError("cannot be empty")

🤖 Prompt for AI Agents

components/clp-py-utils/clp_py_utils/clp_config.py around lines 1100 to 1112: the _validate_directory helper currently only accepts str and its docstring/error message are misleading when used as a Pydantic "before" validator for fields typed as pathlib.Path; update the function to either (a) clarify the docstring to state it runs in "before" mode on raw YAML/string input and that Path objects are not expected here, and make the error messages clearer (e.g., "directory path must be a non-empty string"), or (b) broaden the implementation to accept both str and pathlib.Path (convert Path to str via str(value) before validation) and update the docstring to reflect both accepted types and behavior. Ensure consistent, user-friendly error messages and keep behavior unchanged for current callers.

gibber9809

LGTM! I'm fine with the PR title as well.

junhaoliao · 2025-10-01T18:38:41Z

-    @validator("network_address")
+    @field_validator("network_address")
+    @classmethod
    def validate_network_address(cls, field):


this is purely stylistic - we should change the second argument's name (field) to value to match the convention in the latest Pydantic docs

similarly, we should do the same for valid_status in components/job-orchestration/job_orchestration/scheduler/scheduler_data.py

@coderabbitai create an issue to track this

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Create a GitHub issue to track the stylistic parameter rename gh issue create \ --title "Rename field validator parameter from 'field' to 'value' for Pydantic v2 convention" \ --body "## Description Update field validator parameter names to match Pydantic v2 documentation conventions by renaming \`field\` to \`value\`. ## Locations to update 1. \`validate_network_address\` in \`components/job-orchestration/job_orchestration/scheduler/job_config.py\` 2. \`valid_status\` in \`components/job-orchestration/job_orchestration/scheduler/scheduler_data.py\` ## Context This is a stylistic change to align with the latest Pydantic v2 documentation conventions where field validators use \`value\` as the parameter name for the field being validated. ## References - PR #1338: https://github.com/y-scope/clp/pull/1338 - Comment: https://github.com/y-scope/clp/pull/1338#discussion_r2395502615 - Requested by: @junhaoliao" \ --assignee junhaoliao

Length of output: 42

@junhaoliao Issue created! I've opened #1362 to track renaming the field parameter to value in both validate_network_address and valid_status field validators to match Pydantic v2 conventions. The issue has been assigned to you with references back to this PR and comment.

junhaoliao · 2025-10-01T18:43:44Z


 class QueryTaskResult(BaseModel):
    status: QueryTaskStatus
    task_id: str


Pydantic V2 seems to enforce stricter type checks, and creating a QueryTaskResult with the task_id being an int actually causes failure in IR extraction jobs (i.e., the "Link to log-viewer" feature in the webui).

@coderabbitai file an issue to track the bug

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Create a GitHub issue to track the Pydantic V2 type enforcement bug gh issue create \ --title "Pydantic V2 strict type checking causes IR extraction job failure with integer task_id" \ --body "## Description Pydantic V2 enforces stricter type checks than V1. The \`QueryTaskResult\` class defines \`task_id: str\`, but somewhere in the codebase an integer is being passed for \`task_id\` when creating a \`QueryTaskResult\` instance. This causes IR extraction jobs to fail, breaking the \"Link to log-viewer\" feature in the web UI. ## Context - Reported in PR #1338: https://github.com/y-scope/clp/pull/1338 - Comment: https://github.com/y-scope/clp/pull/1338#discussion_r2395517772 - Affected component: \`components/job-orchestration/job_orchestration/scheduler/scheduler_data.py\` - Affected class: \`QueryTaskResult\` - Field: \`task_id: str\` ## Impact IR extraction jobs fail when using Pydantic V2, preventing users from accessing the log-viewer link feature in the web UI. ## Reported by @junhaoliao" \ --assignee junhaoliao \ --repo y-scope/clp

Length of output: 42

@junhaoliao I've created issue #1363 to track this bug: #1363

The issue documents the Pydantic V2 strict type checking problem where QueryTaskResult is being instantiated with an integer task_id instead of a string, causing IR extraction jobs to fail and breaking the log-viewer link feature in the web UI.

…o match V2 syntax; Fix various typing errors. (y-scope#1338) Co-authored-by: Eden Zhang <49173122+Eden-D-Zhang@users.noreply.github.com> Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>

junhaoliao added 2 commits September 25, 2025 16:26

chore(clp-package): Upgrade Pydantic to 2.11.9.

0cafc12

Run bump-pydantic to auto migrate existing code.

6ad899a

junhaoliao added 4 commits September 25, 2025 18:23

lint

46d667b

refactor(clp-py-utils): Migrate Pydantic @root_validator validators…

f2e0f0f

… to `@model_validator(mode="after")` and replace `root_validator` with `Field` for sweep interval validation.

refactor(clp-py-utils): Remove unnecessary parentheses in reducer typ…

28109d4

…e annotation.

refactor(clp-py-utils): Add missing type annotations (required by Pyd…

838d79c

…antic v2) to QueryScheduler fields.

junhaoliao force-pushed the upgrade-pydantic branch from 3620bd1 to 838d79c Compare September 25, 2025 23:16

junhaoliao added 17 commits September 25, 2025 19:22

refactor: Migrate Pydantic parse_obj calls to model_validate for …

8503ac2

…compatibility with Pydantic v2.

refactor: Migrate Pydantic dict() calls to model_dump() for compa…

4cfb508

…tibility with Pydantic v2.

refactor: Migrate Pydantic copy() calls to model_copy() for compa…

8818c33

…tibility with Pydantic v2.

refactor: Migrate Pydantic json() calls to model_dump_json() for …

55a55de

…compatibility with Pydantic v2.

refactor(compress.py): Remove unused Optional import.

ac5f006

refactor(archive_manager.py): Add explicit StorageType and StorageEng…

58cc656

…ine casting for config values.

refactor(start_clp.py): Update type hints for cmd parameters to use L…

294e424

…ist[str] instead of [str].

refactor(start_clp.py): Remove unused clp_site_packages_dir variables

2b94150

refactor(decompress.py): Convert container_paths_to_extract_file_path…

aa18d38

… to string before appending to command list

refactor(clp-py-utils): Remove unnecessary & unused variable initiali…

51a8dc5

…zations in s3_utils.py

refactor(clp-py-utils): Remove unnecessary parentheses in return stat…

3de4db9

…ement of s3_utils.py

refactor(compression_task.py): Migrate Pydantic parse_raw() calls t…

5c8700d

…o `model_validate_json()` for compatibility with Pydantic v2.

refactor: Move ConfigDict declarations to class beginning

bbb70ea

refactor(compression_scheduler.py): Remove unused ArchiveOutput import

7d32235

refactor(garbage_collector.py): Remove unused exception variable in e…

7fab7c1

…xcept block

refactor(clp_config.py): Rename values parameter to data in model…

5afbd14

… validators for clarity and consistency with Pydantic v2 conventions.

refactor(clp_config.py): Rename validator parameters from field to …

d11e291

…`value` for consistency with Pydantic v2 conventions.

junhaoliao changed the title ~~refactor(clp-package): Upgrade Pydantic to V2; Update existing code to match V2 syntax.~~ refactor(clp-package): Upgrade Pydantic to V2; Update existing code to match V2 syntax; Fix various typing errors. Sep 25, 2025

junhaoliao marked this pull request as ready for review September 25, 2025 23:59

junhaoliao requested a review from a team as a code owner September 25, 2025 23:59

coderabbitai Bot reviewed Sep 26, 2025

View reviewed changes

Merge branch 'main' into upgrade-pydantic

a8d4ac3

improve error reporting in _validate_directory.

c059324

coderabbitai Bot reviewed Sep 26, 2025

View reviewed changes

junhaoliao added 2 commits September 26, 2025 17:42

avoid yoda

4ec7ce3

fix validate_presto_config

8b02254

junhaoliao commented Sep 26, 2025

View reviewed changes

junhaoliao requested a review from kirkrodrigues September 26, 2025 21:56

coderabbitai Bot mentioned this pull request Sep 26, 2025

Investigate removing pydantic.BaseModel from CompressionTaskResult class #1347

Open

5 tasks

junhaoliao added 4 commits September 27, 2025 00:36

Merge remote-tracking branch 'origin/main' into upgrade-pydantic

73f0496

refactor(clp_config.py): Update aws_config_directory validator to use…

2b4cf65

… field_validator

refactor(clp_config.py): Replace model_validator with field_validator…

2ca7cb9

… for s3_config key_prefix.

refactor(clp_config.py): Remove outdated comments in validate_presto_…

167f2a6

…config.

junhaoliao requested a review from Eden-D-Zhang September 29, 2025 02:17

Merge branch 'main' into upgrade-pydantic

99a8884

junhaoliao requested review from kirkrodrigues and removed request for Eden-D-Zhang and kirkrodrigues September 29, 2025 02:17

gibber9809 requested changes Sep 29, 2025

View reviewed changes

Eden-D-Zhang reviewed Sep 29, 2025

View reviewed changes

junhaoliao and others added 2 commits September 29, 2025 17:39

Add isinstance(data, dict) check in `@model_validator(mode="before"…

aa6a22d

…)` - Apply suggestion from @gibber9809 Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>

fix lint

925f878

junhaoliao requested a review from gibber9809 September 29, 2025 22:29

coderabbitai Bot reviewed Sep 29, 2025

View reviewed changes

gibber9809 approved these changes Sep 29, 2025

View reviewed changes

junhaoliao merged commit 6f46f23 into y-scope:main Sep 29, 2025
23 checks passed

junhaoliao deleted the upgrade-pydantic branch September 29, 2025 23:24

junhaoliao commented Oct 1, 2025

View reviewed changes

coderabbitai Bot mentioned this pull request Oct 1, 2025

Rename field validator parameter from 'field' to 'value' for Pydantic v2 convention #1362

Closed

junhaoliao commented Oct 1, 2025

View reviewed changes

coderabbitai Bot mentioned this pull request Oct 1, 2025

Pydantic V2 strict type checking causes IR extraction job failure with integer task_id #1363

Closed

		msgpack.packb(clp_io_config.model_dump(exclude_none=True, exclude_unset=True)), quality=4
		)

		env_vars: Optional[Dict[str, str]]
		aws_credentials: Optional[S3Credentials]

Conversation

junhaoliao commented Sep 25, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Upgrade

Pydantic v2 Migration

Fix other typing errors

Checklist

Validation performed

1. task to build the package.

2. Tested below

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Pre-merge checks and finishing touches

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

junhaoliao Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

davemarco Sep 28, 2025

Choose a reason for hiding this comment

Uh oh!

junhaoliao Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

junhaoliao Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gibber9809 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gibber9809 Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

junhaoliao Sep 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Eden-D-Zhang left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 29, 2025

Choose a reason for hiding this comment

Uh oh!

gibber9809 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

junhaoliao Oct 1, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Oct 1, 2025

junhaoliao commented Sep 25, 2025 •

edited by coderabbitai Bot

Loading

1. `task` to build the package.

coderabbitai Bot commented Sep 25, 2025 •

edited

Loading

junhaoliao Sep 29, 2025 •

edited

Loading