Skip to content

feat(clp-package)!: Remove the archive tags feature.#1842

Merged
gibber9809 merged 12 commits into
y-scope:mainfrom
gibber9809:deprecate-archive-tags
Jan 21, 2026
Merged

feat(clp-package)!: Remove the archive tags feature.#1842
gibber9809 merged 12 commits into
y-scope:mainfrom
gibber9809:deprecate-archive-tags

Conversation

@gibber9809

@gibber9809 gibber9809 commented Jan 6, 2026

Copy link
Copy Markdown
Contributor

Description

The archive tags feature is being removed as part of the ongoing effort to refactor and stabilize the archive metadata representation in the package.

The general rationale behind removing this feature is that it has some overlap with the datasets feature, and the remainder of the use-cases for archive tags should be covered by our plans to allow users to associate structured metadata with each file they ingest into the package. Since we also know that our major open source users aren't relying on archive tags, the simplest solution is to drop support for this feature.

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • Validated that package build succesfully
  • Validated that basic package flow works correctly by testing that:
    • JSONL file can be compressed with/without dataset from cli
    • Data can be searched as expected from cli and webui with/without dataset
    • Tag options are no longer presented in the cli

Summary by CodeRabbit

  • Refactor
    • Removed tags support across compression, ingestion and search flows.
    • Dropped tags CLI options and tag propagation through tasks and scheduler.
    • Removed tags from job/output/search configuration schemas and public APIs.
    • Eliminated tags metadata tables and tag persistence/update logic.
    • Updated tests, UI enums and default configs to reflect tags removal.

✏️ Tip: You can customize this high-level summary in your review settings.


@coderabbitai

coderabbitai Bot commented Jan 6, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

This PR removes "tags" support across the codebase: CLI flags, config fields, metadata tables, job/task parameters, scheduler query filtering, persistence of archive-tag relations, and Web UI table suffixes.

Changes

Cohort / File(s) Summary
Python CLI scripts
components/clp-package-utils/clp_package_utils/scripts/compress.py, components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py, components/clp-package-utils/clp_package_utils/scripts/native/compress.py, components/clp-package-utils/clp_package_utils/scripts/native/search.py, components/clp-package-utils/clp_package_utils/scripts/search.py
Removed -t/--tags CLI arguments and all propagation; command construction and downstream calls no longer include --tags or parsed tag values.
Python metadata DB utils
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
Deleted ARCHIVE_TAGS_TABLE_SUFFIX and TAGS_TABLE_SUFFIX; removed tags-related table helpers, creation/deletion/usage, and public helpers get_archive_tags_table_name / get_tags_table_name.
Job orchestration — executor (Python)
components/job-orchestration/job_orchestration/executor/compress/celery_compress.py, components/job-orchestration/job_orchestration/executor/compress/compression_task.py, components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
Removed tag_ids params from task signatures and calls; removed update_tags and archive-tag persistence; simplified update_job_metadata signature to accept only archive_stats; removed related imports and parameters.
Job orchestration — scheduler (Python)
components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py, components/job-orchestration/job_orchestration/scheduler/compress/partition.py, components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py, components/job-orchestration/job_orchestration/scheduler/job_config.py
Removed _get_tag_ids_for_job; dropped tag_ids from PathsToCompressBuffer constructor and removed set_tag_ids; eliminated insertion of tag_ids into per-task job arguments; removed tags fields from scheduler job config types.
Query scheduling
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
Removed tag-based archive filtering and associated subquery construction; query execution now runs without tag constraints and conditional branches for search_config.tags were removed.
Rust job configs & tests
components/clp-rust-utils/src/job_config/clp_io_config.rs, components/clp-rust-utils/src/job_config/ingestion.rs, components/clp-rust-utils/src/job_config/search.rs, components/clp-rust-utils/tests/clp_config_test.rs
Removed tags field from OutputConfig, BaseConfig (ingestion), and SearchJobConfig; updated tests and serialization expectations to omit tags.
Log ingestor (Rust) & tests
components/log-ingestor/src/compression/compression_job_submitter.rs, components/log-ingestor/tests/test_ingestion_job.rs
Omitted tags when constructing OutputConfig; removed tags: None initializations in tests.
Web UI config & schemas
components/webui/common/src/config.ts, components/webui/common/src/schemas/compression.ts, components/webui/server/src/routes/api/compress/index.ts
Removed ARCHIVE_TAGS and TAGS enum members from SqlTableSuffix; removed tags from ClpIoOutputConfigSchema; removed output.tags default in server DEFAULT_COMPRESSION_JOB_CONFIG.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: removal of the archive tags feature across the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gibber9809 gibber9809 marked this pull request as ready for review January 7, 2026 18:20
@gibber9809 gibber9809 requested a review from a team as a code owner January 7, 2026 18:20

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (1)

178-209: Fix the docstring to match actual table operations.

The docstring at line 183 mentions file_splits but the function only deletes from the files and archives tables. There's no file_splits table referenced in the code.

📝 Proposed fix for the docstring
     """
     Deletes archives from the metadata database specified by a list of IDs. It also deletes
-    the associated entries from `file_splits` and `files` tables that reference these archives.
+    the associated entries from the `files` table that reference these archives.
 
     The order of deletion follows the foreign key constraints, ensuring no violations occur during
     the process.
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c932131 and be0a6ad.

📒 Files selected for processing (21)
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
  • components/clp-package-utils/clp_package_utils/scripts/search.py
  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
  • components/clp-rust-utils/src/job_config/clp_io_config.rs
  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/clp-rust-utils/src/job_config/search.rs
  • components/clp-rust-utils/tests/clp_config_test.rs
  • components/job-orchestration/job_orchestration/executor/compress/celery_compress.py
  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
  • components/job-orchestration/job_orchestration/scheduler/compress/partition.py
  • components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py
  • components/job-orchestration/job_orchestration/scheduler/job_config.py
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
  • components/log-ingestor/src/compression/compression_job_submitter.rs
  • components/log-ingestor/tests/test_ingestion_job.rs
  • components/webui/common/src/config.ts
💤 Files with no reviewable changes (17)
  • components/clp-rust-utils/src/job_config/clp_io_config.rs
  • components/webui/common/src/config.ts
  • components/log-ingestor/tests/test_ingestion_job.rs
  • components/log-ingestor/src/compression/compression_job_submitter.rs
  • components/clp-rust-utils/src/job_config/search.rs
  • components/clp-rust-utils/src/job_config/ingestion.rs
  • components/job-orchestration/job_orchestration/scheduler/job_config.py
  • components/job-orchestration/job_orchestration/executor/compress/spider_compress.py
  • components/clp-package-utils/clp_package_utils/scripts/native/compress.py
  • components/clp-package-utils/clp_package_utils/scripts/compress.py
  • components/job-orchestration/job_orchestration/scheduler/compress/partition.py
  • components/job-orchestration/job_orchestration/scheduler/compress/task_manager/spider_task_manager.py
  • components/clp-package-utils/clp_package_utils/scripts/search.py
  • components/job-orchestration/job_orchestration/executor/compress/celery_compress.py
  • components/clp-package-utils/clp_package_utils/scripts/compress_from_s3.py
  • components/clp-package-utils/clp_package_utils/scripts/native/search.py
  • components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.
📚 Learning: 2024-11-19T17:30:04.970Z
Learnt from: AVMatthews
Repo: y-scope/clp PR: 595
File: components/core/tests/test-end_to_end.cpp:59-65
Timestamp: 2024-11-19T17:30:04.970Z
Learning: In 'components/core/tests/test-end_to_end.cpp', during the 'clp-s_compression_and_extraction_no_floats' test, files and directories are intentionally removed at the beginning of the test to ensure that any existing content doesn't influence the test results.

Applied to files:

  • components/clp-rust-utils/tests/clp_config_test.rs
📚 Learning: 2024-11-15T16:21:52.122Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 594
File: components/clp-package-utils/clp_package_utils/scripts/native/del_archives.py:104-110
Timestamp: 2024-11-15T16:21:52.122Z
Learning: In `clp_package_utils/scripts/native/del_archives.py`, when deleting archives, the `archive` variable retrieved from the database is controlled and is always a single string without path components. Therefore, it's acceptable to skip additional validation checks for directory traversal in this context.

Applied to files:

  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py
📚 Learning: 2025-05-05T16:32:55.163Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 868
File: components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py:141-144
Timestamp: 2025-05-05T16:32:55.163Z
Learning: The column metadata table (created by `_create_column_metadata_table`) is only needed for dataset-specific workflows in `CLP_S` and is obsolete for non-dataset workflows.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
📚 Learning: 2025-12-12T21:14:28.267Z
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/s3_utils.py:226-230
Timestamp: 2025-12-12T21:14:28.267Z
Learning: In the CLP project (y-scope/clp), ensure that regex patterns are compiled inline within functions for Python files under components/clp-py-utils/clp_py_utils (not at module scope). Move any re.compile calls from the module top level into the functions that use them to maintain consistency with the codebase style. This applies to all .py files in that directory; use module-level imports only for imports, not for pre-compiling regex patterns.

Applied to files:

  • components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py
📚 Learning: 2025-09-28T15:00:22.170Z
Learnt from: LinZhihao-723
Repo: y-scope/clp PR: 1340
File: components/job-orchestration/job_orchestration/executor/compress/compression_task.py:528-528
Timestamp: 2025-09-28T15:00:22.170Z
Learning: In components/job-orchestration/job_orchestration/executor/compress/compression_task.py, there is a suggestion to refactor from passing logger as a parameter through multiple functions to creating a ClpCompressor class that takes the logger as a class member, with current helper functions becoming private member functions.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2024-12-16T21:49:06.483Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 634
File: components/job-orchestration/job_orchestration/executor/compress/fs_compression_task.py:0-0
Timestamp: 2024-12-16T21:49:06.483Z
Learning: In `fs_compression_task.py`, when refactoring the archive processing loop in Python, the `src_archive_file.unlink()` operation should remain outside of the `process_archive` function.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • components/job-orchestration/job_orchestration/executor/compress/compression_task.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: check-generated
  • GitHub Check: conventional-commits
🔇 Additional comments (8)
components/clp-rust-utils/tests/clp_config_test.rs (1)

10-86: LGTM! Test expectations properly updated to reflect tags removal.

The test has been correctly updated to reflect the removal of the tags field from OutputConfig:

  • The OutputConfig initialization (lines 34-40) no longer includes the tags field.
  • The expected Brotli-compressed MessagePack hex output (lines 45-50) has been updated to match the smaller serialized payload.
  • The expected JSON structure (lines 75-81) naturally excludes the tags field from the output section.

Both serialization format validations remain intact and are consistent with the structural change.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

399-426: LGTM! Tag filtering removal is clean and correct.

The simplification of get_archives_for_search correctly removes tag-based filtering while preserving timestamp and retention period filtering logic. The unconditional query execution is appropriate. No tag-related code remains in the codebase.

components/clp-py-utils/clp_py_utils/clp_metadata_db_utils.py (4)

15-20: LGTM! Correct update to TABLE_SUFFIX_MAX_LEN.

The calculation now correctly reflects only the remaining table suffixes after removing tags-related constants.


161-175: LGTM! Clean removal of tags table creation.

The function now correctly creates only the metadata, archives, and files tables without any references to tags-related tables.


212-237: LGTM! Correct update to dataset deletion logic.

The removal of tags-related tables from tables_in_removal_order is correct, and the deletion order properly respects foreign key constraints.


240-253: Breaking change: Removed public API functions.

The removal of get_archive_tags_table_name and get_tags_table_name is a breaking change appropriately indicated in the PR title. No references to these removed functions exist elsewhere in the codebase.

components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)

504-508: LGTM!

The call site has been correctly updated to match the new update_job_metadata signature. The function now receives only the necessary parameters for updating job compression statistics, while dataset handling remains properly implemented via the separate update_archive_metadata call.


71-83: Function signature simplification is correct and all call sites have been updated.

The removal of the table_prefix, dataset, and tag_ids parameters is clean and properly maintained throughout the codebase. The function now focuses solely on updating job compression statistics, improving clarity and maintainability. The single call site at lines 504–508 correctly passes only the required parameters.

@junhaoliao junhaoliao self-requested a review January 7, 2026 18:43

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the ingestion job's input schema has changed, can u rerun task codegen:openapi and commit the changes? This should fix the failing CI workflow.

@junhaoliao junhaoliao self-assigned this Jan 19, 2026
@junhaoliao junhaoliao added this to the Backlog milestone Jan 19, 2026

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Package changes lgtm. a few minor sugguestions:

  1. it does seem there are some more references in the webui that should be updated. let's also remove these references to "tags":
  2. since this change is BREAKING, are there migration strategies for the metadata DB?
  3. i believe #883 can be closed once this PR is merged.
  4. for the commit message, i think a feat! or a refactor! is probably more appropriate?

@gibber9809

Copy link
Copy Markdown
Contributor Author

The Package changes lgtm. a few minor sugguestions:

  1. it does seem there are some more references in the webui that should be updated. let's also remove these references to "tags":

  2. since this change is BREAKING, are there migration strategies for the metadata DB?

  3. i believe Add docs for --tags option in compress.sh / search.sh for CLP-Text #883 can be closed once this PR is merged.

  4. for the commit message, i think a feat! or a refactor! is probably more appropriate?

  1. Sure, will do
  2. Migration strategy is to drop archive tags table followed by tags table for each dataset -- already confirmed with our largest open source user that they aren't using this feature, so best to drop support for it now before other users start relying on it. Mid to long term, the planned feature for per-ingested-file structured metadata will offer a superset of the functionality of this tags feature.
  3. Yep, looks like it
  4. Let's go with refactor!

@gibber9809 gibber9809 changed the title chore(clp-package)!: Remove support for the archive tags feature. refactor(clp-package)!: Remove support for the archive tags feature. Jan 20, 2026
@gibber9809 gibber9809 requested a review from junhaoliao January 20, 2026 17:19
@junhaoliao junhaoliao modified the milestones: Backlog, February 2026 Jan 21, 2026
junhaoliao

This comment was marked as duplicate.

@junhaoliao junhaoliao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the title, how about:

refactor(clp-package)!: Remove archive tags feature.

@gibber9809 gibber9809 changed the title refactor(clp-package)!: Remove support for the archive tags feature. refactor(clp-package)!: Remove the archive tags feature. Jan 21, 2026

@LinZhihao-723 LinZhihao-723 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In terms of the PR title: I think this should be a feat PR. According to the conveitonal commit cheatsheet (here):

  • Changes relevant to the API or UI:
    • feat Commits that add, adjust or remove a new feature to the API or UI
    • fix Commits that fix an API or UI bug of a preceded feat commit
  • refactor Commits that rewrite or restructure code without altering API or UI behavior
    • perf Commits are special type of refactor commits that specifically improve performance

This PR changes the API behavior so it shouldn't be a refactor PR. And feat does include feature removal.

@gibber9809 gibber9809 changed the title refactor(clp-package)!: Remove the archive tags feature. feat(clp-package)!: Remove the archive tags feature. Jan 21, 2026
@gibber9809 gibber9809 merged commit 50e7815 into y-scope:main Jan 21, 2026
23 checks passed
junhaoliao added a commit to junhaoliao/clp that referenced this pull request May 17, 2026
Co-authored-by: Junhao Liao <junhao.liao@yscope.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants