refactor(clp-package): Move archive metadata update from clp-s to compression task workers.#819
Conversation
WalkthroughThis pull request removes the dependency on a MySQL metadata database from several C++ components while enhancing archive statistics reporting and updating archive metadata through the Python compression task. The changes include removing the metadata update functions, constructors, and related member variables from the ArchiveWriter class and associated files, along with eliminating command-line options and build configurations related to metadata DB handling. Additionally, a new constant is added to the Python configuration, and a new function in the compression task now inserts archive metadata into the database. Changes
Sequence Diagram(s)sequenceDiagram
participant Runner as run_clp
participant ArchUpd as update_archive_metadata
participant DB as Database Cursor
Runner->>ArchUpd: Call update_archive_metadata(db_cursor, table_prefix, archive_stats)
ArchUpd->>DB: Execute SQL INSERT into clp_archives
DB-->>ArchUpd: Return execution result
ArchUpd-->>Runner: Return control
Suggested reviewers
Tip ⚡💬 Agentic Chat (Pro Plan, General Availability)
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (11)
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
13-13: Recommend using the new constant in the insert query.You introduced
ARCHIVES_TABLE_NAMEbut still use a literal"clp_archives"in the SQL statement. To maintain consistency, replace literal references with the constant.- query = f"INSERT INTO clp_archives ({keys}) VALUES ({values})" + query = f"INSERT INTO {ARCHIVES_TABLE_NAME} ({keys}) VALUES ({values})"
366-366: Consider handling exceptions in case of metadata update failures.While calling
update_archive_metadata(db_cursor, last_archive_stats)largely appears fine, an exception could disrupt the entire compression run. Including error handling and logging here would improve resilience.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
components/clp-package-utils/clp_package_utils/scripts/native/compress.py(3 hunks)components/clp-py-utils/clp_py_utils/clp_config.py(1 hunks)components/core/src/clp_s/ArchiveWriter.cpp(1 hunks)components/core/src/clp_s/ArchiveWriter.hpp(1 hunks)components/core/src/clp_s/JsonParser.cpp(1 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}`: - Prefer `false == ` rather than `!`.
**/*.{cpp,hpp,java,js,jsx,tpp,ts,tsx}: - Preferfalse == <expression>rather than!<expression>.
components/core/src/clp_s/JsonParser.cppcomponents/core/src/clp_s/ArchiveWriter.cppcomponents/core/src/clp_s/ArchiveWriter.hpp
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: build-macos (macos-15, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (7)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
39-39: New constant added for archives table name.The addition of
ARCHIVES_TABLE_NAMEfollows the project's naming convention for database table constants and is appropriately placed with other table name constants. This will be used by the compression worker task to update archive metadata as mentioned in the PR objectives.components/clp-package-utils/clp_package_utils/scripts/native/compress.py (3)
64-64: Added tracking of previous timestamp for compression reporting.The new variable
last_current_timewill help track the previous timestamp for accurate compression progress reporting.
83-86: Improved compression status reporting logic.Added conditional logic to use either the current time or the previous time based on whether the data size has increased, ensuring more accurate speed calculations when reporting compression status.
105-105: Added timestamp update at the end of the loop.Updates the tracked timestamp at the end of each loop iteration to maintain accurate timing information.
components/core/src/clp_s/JsonParser.cpp (1)
130-130: Removed metadata DB parameter from ArchiveWriter instantiation.This change removes the metadata database parameter when creating the ArchiveWriter, aligning with the PR objective to remove archive metadata update functionality from this component and move it to the compression worker task.
components/core/src/clp_s/ArchiveWriter.cpp (1)
432-433: Enhanced archive statistics with timestamp information.Added begin and end timestamps to the archive statistics output, providing valuable temporal context about the archive's data. This will help with operations and debugging by clearly showing the time range covered by the archive.
components/core/src/clp_s/ArchiveWriter.hpp (1)
68-68: Ensure no references rely on the removed metadata constructor.Removing the DB-specific constructor reflects the new design that no longer updates archive metadata within
ArchiveWriter. This is acceptable, provided that no remaining code depends on the old constructor.
clp-s to compression task workers.
clp-s to compression task workers.clp-s to compression task workers.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (2)
225-236: Consider removing unnecessary parameter from function signature.The
clp_metadata_db_connection_configparameter is no longer passed tomake_clp_s_command_and_envbut remains in the function signature ofrun_clp. Consider removing it if it's only used for the CLP storage engine.def run_clp( worker_config: WorkerConfig, clp_config: ClpIoConfig, clp_home: pathlib.Path, logs_dir: pathlib.Path, job_id: int, task_id: int, tag_ids, paths_to_compress: PathsToCompress, sql_adapter: SQL_Adapter, - clp_metadata_db_connection_config, ):And update the corresponding calls to this function as needed.
86-101: Consider adding error handling for SQL operations.The
update_archive_metadatafunction lacks error handling for SQL-related exceptions. Adding a try-except block would allow for more graceful error handling and clearer error messages.def update_archive_metadata(db_cursor, table_prefix, archive_stats): archive_stats_defaults = { "begin_timestamp": 0, "end_timestamp": 0, "creator_id": "", "creation_ix": 0, } for k, v in archive_stats_defaults.items(): archive_stats.setdefault(k, v) keys = ", ".join(archive_stats.keys()) value_placeholders = ", ".join(["%s"] * len(archive_stats)) query = ( f"INSERT INTO {table_prefix}{ARCHIVES_TABLE_SUFFIX} ({keys}) VALUES ({value_placeholders})" ) + try: db_cursor.execute(query, list(archive_stats.values())) + except Exception as e: + logger.error(f"Failed to update archive metadata: {e}") + raise
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/clp-py-utils/clp_py_utils/clp_config.py(1 hunks)components/core/src/clp_s/CMakeLists.txt(0 hunks)components/job-orchestration/job_orchestration/executor/compress/compression_task.py(3 hunks)
💤 Files with no reviewable changes (1)
- components/core/src/clp_s/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- components/clp-py-utils/clp_py_utils/clp_config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (1)
components/clp-py-utils/clp_py_utils/clp_config.py (1)
StorageEngine(48-50)
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-lint
- GitHub Check: build-macos (macos-15, true)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (4)
components/job-orchestration/job_orchestration/executor/compress/compression_task.py (4)
13-13: Good use of constant instead of hardcoded string.Using the
ARCHIVES_TABLE_SUFFIXconstant fromclp_py_utils.clp_configinstead of hardcoding the table suffix ensures consistency across the codebase and makes future changes easier to maintain.
86-101: Security improvement: Well-implemented parameter substitution for SQL queries.The implementation of
update_archive_metadataproperly addresses the SQL injection concern raised in previous reviews. The function:
- Sets sensible defaults for required fields
- Uses parameter substitution with placeholders instead of string interpolation
- Correctly uses the table prefix and constant suffix for table name construction
This is a secure way to handle SQL queries and aligns with database best practices.
178-183: Appropriate signature update to remove database dependency.Removing the
db_config_file_pathparameter frommake_clp_s_command_and_envand replacing it withuse_single_file_archivealigns with the PR objective of moving archive metadata update fromclp-sto compression task workers.
364-367: Clean implementation of metadata update logic.The code correctly extracts the table prefix once and uses it consistently. The conditional update of archive metadata only when using the CLP_S storage engine aligns with the PR objective.
clp-s to compression task workers.clp-s to compression task workers.
gibber9809
left a comment
There was a problem hiding this comment.
Mostly looks good to me -- I'm fine with deleting all of the metadata DB code until we start using it again. Just a small comment for print_archive_stats.
| void ArchiveWriter::print_archive_stats() { | ||
| nlohmann::json json_msg; | ||
| json_msg["id"] = m_id; | ||
| json_msg["begin_timestamp"] = m_timestamp_dict.get_begin_timestamp(); | ||
| json_msg["end_timestamp"] = m_timestamp_dict.get_end_timestamp(); | ||
| json_msg["uncompressed_size"] = m_uncompressed_size; | ||
| json_msg["size"] = m_compressed_size; | ||
| std::cout << json_msg.dump(-1, ' ', true, nlohmann::json::error_handler_t::ignore) << std::endl; |
There was a problem hiding this comment.
A few minor things while we're touching this method.
- Could we mark
print_archive_statsasconst? - Could we replace these string literals with some
constexpr std::string_viewthat live inArchiveWriter.hpp?
…ompression task workers. (y-scope#819)
…ompression task workers. (y-scope#819)
Description
Moves the archive metadata update logic out of clp-s and into the compression task worker. Specifically, in
clp-s, this PR:clp-s.ArchiveWriterandJsonParserclp-stargetThis is the first step outlined in the dataset feature implementation plan.
Checklist
breaking change.
Validation performed
Summary by CodeRabbit
Summary by CodeRabbit
New Features
Refactor