feat(clp-package)!: Remove the archive tags feature.#1842
Conversation
WalkthroughThis PR removes "tags" support across the codebase: CLI flags, config fields, metadata tables, job/task parameters, scheduler query filtering, persistence of archive-tag relations, and Web UI table suffixes. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1)
178-209: Fix the docstring to match actual table operations.The docstring at line 183 mentions
file_splitsbut the function only deletes from thefilesandarchivestables. There's nofile_splitstable referenced in the code.📝 Proposed fix for the docstring
""" Deletes archives from the metadata database specified by a list of IDs. It also deletes - the associated entries from `file_splits` and `files` tables that reference these archives. + the associated entries from the `files` table that reference these archives. The order of deletion follows the foreign key constraints, ensuring no violations occur during the process.
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (21)
components/clp-package-utils/clp_package_utils/scripts/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/compress_from_s3.pycomponents/clp-package-utils/clp_package_utils/scripts/native/compress.pycomponents/clp-package-utils/clp_package_utils/scripts/native/search.pycomponents/clp-package-utils/clp_package_utils/scripts/search.pycomponents/clp-py-utils/clp_py_utils/clp_metadata_db_utils.pycomponents/clp-rust-utils/src/job_config/clp_io_config.rscomponents/clp-rust-utils/src/job_config/ingestion.rscomponents/clp-rust-utils/src/job_config/search.rscomponents/clp-rust-utils/tests/clp_config_test.rscomponents/job-orchestration/job_orchestration/executor/compress/celery_compress.pycomponents/job-orchestration/job_orchestration/executor/compress/compression_task.pycomponents/job-orchestration/job_orchestration/executor/compress/spider_compress.pycomponents/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.pycomponents/job-orchestration/job_orchestration/scheduler/compress/partition.pycomponents/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.pycomponents/job-orchestration/job_orchestration/scheduler/job_config.pycomponents/job-orchestration/job_orchestration/scheduler/query/query_scheduler.pycomponents/log-ingestor/src/compression/compression_job_submitter.rscomponents/log-ingestor/tests/test_ingestion_job.rscomponents/webui/common/src/config.ts
💤 Files with no reviewable changes (17)
- components/clp-rust-utils/src/job_config/clp_io_config.rs
- components/webui/common/src/config.ts
- components/log-ingestor/tests/test_ingestion_job.rs
- components/log-ingestor/src/compression/compression_job_submitter.rs
- components/clp-rust-utils/src/job_config/search.rs
- components/clp-rust-utils/src/job_config/ingestion.rs
- components/job-orchestration/job_orchestration/scheduler/job_config.py
- components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
- components/clp-package-utils/clp_package_utils/scripts/native/compress.py
- components/clp-package-utils/clp_package_utils/scripts/compress.py
- components/job-orchestration/job_orchestration/scheduler/compress/partition.py
- components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py
- components/clp-package-utils/clp_package_utils/scripts/search.py
- components/job-orchestration/job_orchestration/executor/compress/celery_compress.py
- components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
- components/clp-package-utils/clp_package_utils/scripts/native/search.py
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.
Applied to files:
components/clp-rust-utils/tests/clp_config_test.rs
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.
Applied to files:
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: 2025-05-05T16:32:55.163Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.
Applied to files:
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2024-12-16T21:49:06.483Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
Applied to files:
components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: check-generated
- GitHub Check: conventional-commits
🔇 Additional comments (8)
components/clp-rust-utils/tests/clp_config_test.rs (1)
10-86: LGTM! Test expectations properly updated to reflect tags removal.The test has been correctly updated to reflect the removal of the
tagsfield fromOutputConfig:
- The
OutputConfiginitialization (lines 34-40) no longer includes thetagsfield.- The expected Brotli-compressed MessagePack hex output (lines 45-50) has been updated to match the smaller serialized payload.
- The expected JSON structure (lines 75-81) naturally excludes the
tagsfield from the output section.Both serialization format validations remain intact and are consistent with the structural change.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)
399-426: LGTM! Tag filtering removal is clean and correct.The simplification of
get_archives_for_searchcorrectly removes tag-based filtering while preserving timestamp and retention period filtering logic. The unconditional query execution is appropriate. No tag-related code remains in the codebase.components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (4)
15-20: LGTM! Correct update to TABLE_SUFFIX_MAX_LEN.The calculation now correctly reflects only the remaining table suffixes after removing tags-related constants.
161-175: LGTM! Clean removal of tags table creation.The function now correctly creates only the metadata, archives, and files tables without any references to tags-related tables.
212-237: LGTM! Correct update to dataset deletion logic.The removal of tags-related tables from
tables_in_removal_orderis correct, and the deletion order properly respects foreign key constraints.
240-253: Breaking change: Removed public API functions.The removal of
get_archive_tags_table_nameandget_tags_table_nameis a breaking change appropriately indicated in the PR title. No references to these removed functions exist elsewhere in the codebase.components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
504-508: LGTM!The call site has been correctly updated to match the new
update_job_metadatasignature. The function now receives only the necessary parameters for updating job compression statistics, while dataset handling remains properly implemented via the separateupdate_archive_metadatacall.
71-83: Function signature simplification is correct and all call sites have been updated.The removal of the
table_prefix,dataset, andtag_idsparameters is clean and properly maintained throughout the codebase. The function now focuses solely on updating job compression statistics, improving clarity and maintainability. The single call site at lines 504–508 correctly passes only the required parameters.
LinZhihao-723
left a comment
There was a problem hiding this comment.
Since the ingestion job's input schema has changed, can u rerun task codegen:openapi and commit the changes? This should fix the failing CI workflow.
junhaoliao
left a comment
There was a problem hiding this comment.
The Package changes lgtm. a few minor sugguestions:
- it does seem there are some more references in the webui that should be updated. let's also remove these references to "tags":
clp/components/webui/common/src/schemas/compression.ts
Lines 117 to 118 in ff6f76a
- since this change is BREAKING, are there migration strategies for the metadata DB?
- i believe #883 can be closed once this PR is merged.
- for the commit message, i think a
feat!or arefactor!is probably more appropriate?
|
junhaoliao
left a comment
There was a problem hiding this comment.
for the title, how about:
refactor(clp-package)!: Remove archive tags feature.
LinZhihao-723
left a comment
There was a problem hiding this comment.
In terms of the PR title: I think this should be a feat PR. According to the conveitonal commit cheatsheet (here):
- Changes relevant to the API or UI:
featCommits that add, adjust or remove a new feature to the API or UIfixCommits that fix an API or UI bug of a preceded feat commitrefactorCommits that rewrite or restructure code without altering API or UI behavior
perfCommits are special type of refactor commits that specifically improve performance
This PR changes the API behavior so it shouldn't be a refactor PR. And feat does include feature removal.
Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Description
The archive tags feature is being removed as part of the ongoing effort to refactor and stabilize the archive metadata representation in the package.
The general rationale behind removing this feature is that it has some overlap with the datasets feature, and the remainder of the use-cases for archive tags should be covered by our plans to allow users to associate structured metadata with each file they ingest into the package. Since we also know that our major open source users aren't relying on archive tags, the simplest solution is to drop support for this feature.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.