Skip to content

fix(clp-package): Use generic archive directory as mount target in native/dataset_manager.py to match path in metadata database (fixes #1526).#1528

Merged
junhaoliao merged 5 commits into
y-scope:mainfrom
junhaoliao:fix-archive-mount
Nov 5, 2025

Conversation

@junhaoliao

@junhaoliao junhaoliao commented Oct 30, 2025

Copy link
Copy Markdown
Member

Description

  • Use explicit bind mount for FS archive output so the container sees the transformed path.
  • Replace mounts.archives_output_dir with DockerMount(DockerMountType.BIND, host_dir, container_dir) derived from a deep-copied, container-transformed archive_output config.
  • Simplify _try_deleting_archives_from_fs to take only dataset_archive_storage_dir and remove brittle top-level containment check.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

task
cd build/clp-package

# configure storage and query engine as `clp-s` in etc/clp-config.yml

./sbin/start-clp.sh
./sbin/compress.sh --timestamp-key timestamp ~/samples/postgresql.jsonl 

./sbin/admin-tools/dataset-manager.sh del default
# 2025-10-30T07:20:27.248 INFO [dataset_manager] Deleted archives of dataset `default`.
# 2025-10-30T07:20:27.274 INFO [dataset_manager] Deleted dataset `default` from the metadata database

Summary by CodeRabbit

  • Refactor
    • Improved archive storage handling and container mount configuration for filesystem-based storage.
    • Simplified archive deletion logic and removed unnecessary validation checks.

@junhaoliao junhaoliao requested a review from a team as a code owner October 30, 2025 07:20
@coderabbitai

coderabbitai Bot commented Oct 30, 2025

Copy link
Copy Markdown
Contributor

Walkthrough

Create a container-visible copy of FS archive-output config and add a bind DockerMount for the archives directory when storage type is FS. Simplify native deletion by removing the archive-output config parameter and relative-path containment validation; deletion now operates on a resolved filesystem path.

Changes

Cohort / File(s) Summary
Docker mount configuration
components/clp-package-utils/clp_package_utils/scripts/dataset_manager.py
Added imports for DockerMount and DockerMountType. When archive_output.storage.type == StorageType.FS, create a container-visible copy of the archive-output config, transform storage for container use, construct a DockerMount (bind) from host → container for the archives directory, and append it to mounts.
Archive deletion simplification
components/clp-package-utils/clp_package_utils/scripts/native/dataset_manager.py
Removed archive_output_config parameter from _try_deleting_archives_from_fs; function now accepts and resolves only dataset_archive_storage_dir and performs existence/type checks before deletion. Deleted prior relative-path/archives-dir containment validation; call sites updated accordingly.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant DM as dataset_manager
    participant Config as ArchiveConfig(FS)
    participant Mount as DockerMount

    Caller->>DM: Determine archive_output.storage.type
    alt storage.type == FS
        DM->>Config: copy config for container visibility
        DM->>Config: transform storage path for container
        DM->>Mount: create DockerMount (type: BIND) hostPath -> containerPath
        DM->>DM: append mount to mounts list
    else
        DM->>DM: leave mounts unchanged
    end

    rect rgb(240, 245, 255)
    Note over DM: Native deletion (simplified)
    Caller->>DM: _try_deleting_archives(dataset_archive_storage_dir)
    DM->>DM: resolve Path(dataset_archive_storage_dir)
    DM->>DM: check existence and is_dir
    DM->>DM: remove directory if checks pass
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas to review:
    • Correctness of container-visible config transformation and resulting container path mappings in dataset_manager.py
    • Proper construction and usage of DockerMount (bind) and ensuring no duplicate/conflicting mounts
    • All call sites updated for the _try_deleting_archives_from_fs signature change and implications of removing containment validation

Possibly related issues

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: fixing archive directory mounting in native/dataset_manager.py to match metadata database paths, which is reflected in both file modifications.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@quinntaylormitchell quinntaylormitchell left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only one comment; I very much defer to @kirkrodrigues review though.

and fwiw, the "try" in _try_deleting_archives_from_fs (and the other functions in native/dataset_manager.py for that matter) doesn't seem like a useful or meaningful word to use in the function name(s), but that's outside the scope of this PR.

if clp_config.archive_output.storage.type == StorageType.FS:
necessary_mounts.append(mounts.archives_output_dir)
container_archive_output_config = container_clp_config.archive_output.model_copy(deep=True)
container_archive_output_config.storage.transform_for_container()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this line really need to be here? if container_clp_config is the version of the config meant for use inside the container, then its archive_output paths should already be in container form (from when they get transformed here), so the paths in container_archive_output_config should already be transformed because it's a deep copy, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, we could have used generate_docker_compose_container_config() (NOTE: it's a different function than generate_container_config() which is the one currently used in dataset_manager.py) to generate a copy of the config and directly reference it, but to limit the memory impacts, though not significantly, i only make a copy of the container_archive_output_config and transform the copy.

generate_container_config will be deprecated when we move all docker orchestrions to Docker Compose so the changes here are temporary as well

let me know if my explanations should be clarified or if i misunderstood your comment

@quinntaylormitchell quinntaylormitchell left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

fix(clp-package): Use generic archive directory as mount target in `native/dataset_manager.py` to match path in metadata database (fixes #1526).

@junhaoliao junhaoliao changed the title fix(clp-package): Fix archive dir mounting for dataset delete (fixes #1526). fix(clp-package): Use generic archive directory as mount target in native/dataset_manager.py to match path in metadata database (fixes #1526). Nov 5, 2025
@junhaoliao junhaoliao merged commit 5841cf7 into y-scope:main Nov 5, 2025
20 checks passed
@junhaoliao junhaoliao deleted the fix-archive-mount branch November 5, 2025 03:08
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
…ative/dataset_manager.py` to match path in metadata database (fixes y-scope#1526). (y-scope#1528)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants