Compression scheduler crashes on job submission in CLP Text package due to missing datasets table #1214

@coderabbitai

Bug Report

Description

The compression scheduler crashes on every job submission in the CLP Text package because it queries the clp_datasets table, which does not exist in that package.

Environment

  • CLP Version: Current main branch
  • Storage Engine: CLP Text (not CLP-S)
  • Database: MariaDB

Steps to Reproduce

  1. cd clp-package/sbin
  2. ./start-clp.sh
  3. ./compress.sh ~/samples/hive-24h

Expected Behavior

The job should complete successfully and display compression statistics:

2025-08-18T01:28:52.264 INFO [compress] Compression job 1 submitted.
2025-08-18T01:28:57.353 INFO [compress] Compressed 79.16MB into 1.74MB (45.42x). Speed: 60.15MB/s.
2025-08-18T01:28:58.858 INFO [compress] Compressed 1.08GB into 28.37MB (38.91x). Speed: 391.22MB/s.
2025-08-18T01:28:59.363 INFO [compress] Compressed 1.58GB into 41.66MB (38.82x). Speed: 486.25MB/s.
2025-08-18T01:29:00.371 INFO [compress] Compression finished.
2025-08-18T01:29:00.371 INFO [compress] Compressed 1.99GB into 45.22MB (45.03x). Speed: 512.79MB/s.

Actual Behavior

Compression scheduler crashes with the following error:

2025-08-17 23:31:57,128 compression_scheduler [INFO] Starting compression scheduler
2025-08-17 23:31:57,130 compression_scheduler [ERROR] Error in scheduling.
Traceback (most recent call last):
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 430, in main
    search_and_schedule_new_tasks(
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 171, in search_and_schedule_new_tasks
    existing_datasets = fetch_existing_datasets(
  File "/opt/clp/lib/python3/site-packages/clp_py_utils/clp_metadata_db_utils.py", line 194, in fetch_existing_datasets
    db_cursor.execute(f"SELECT name FROM `{get_datasets_table_name(table_prefix)}`")
mariadb.ProgrammingError: Table 'clp-db.clp_datasets' doesn't exist

Root Cause

The scheduler calls fetch_existing_datasets unconditionally, and that function queries the clp_datasets table. However, the table exists only for the CLP-S storage engine, not for the CLP Text package, so the query fails with the ProgrammingError shown above.

Proposed Solution

Add a storage engine check before fetching existing datasets, as suggested by @haiqi96:

  • Call fetch_existing_datasets only when the storage engine is CLP-S
  • Initialize existing_datasets as an empty set for all other storage engines

Additional Context

Reporter

@junhaoliao
