Bug Report
Description
The compression scheduler crashes when processing any job submission in the CLP Text package because it attempts to query the clp_datasets table, which does not exist in that package.
Environment
- CLP Version: Current main branch
- Storage Engine: CLP Text (not CLP-S)
- Database: MariaDB
Steps to Reproduce
cd clp-package/sbin
./start-clp.sh
./compress.sh ~/samples/hive-24h
Expected Behavior
Job should complete successfully with compression statistics displayed:
2025-08-18T01:28:52.264 INFO [compress] Compression job 1 submitted.
2025-08-18T01:28:57.353 INFO [compress] Compressed 79.16MB into 1.74MB (45.42x). Speed: 60.15MB/s.
2025-08-18T01:28:58.858 INFO [compress] Compressed 1.08GB into 28.37MB (38.91x). Speed: 391.22MB/s.
2025-08-18T01:28:59.363 INFO [compress] Compressed 1.58GB into 41.66MB (38.82x). Speed: 486.25MB/s.
2025-08-18T01:29:00.371 INFO [compress] Compression finished.
2025-08-18T01:29:00.371 INFO [compress] Compressed 1.99GB into 45.22MB (45.03x). Speed: 512.79MB/s.
Actual Behavior
Compression scheduler crashes with the following error:
2025-08-17 23:31:57,128 compression_scheduler [INFO] Starting compression scheduler
2025-08-17 23:31:57,130 compression_scheduler [ERROR] Error in scheduling.
Traceback (most recent call last):
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 430, in main
    search_and_schedule_new_tasks(
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 171, in search_and_schedule_new_tasks
    existing_datasets = fetch_existing_datasets(
  File "/opt/clp/lib/python3/site-packages/clp_py_utils/clp_metadata_db_utils.py", line 194, in fetch_existing_datasets
    db_cursor.execute(f"SELECT name FROM `{get_datasets_table_name(table_prefix)}`")
mariadb.ProgrammingError: Table 'clp-db.clp_datasets' doesn't exist
Root Cause
search_and_schedule_new_tasks unconditionally calls fetch_existing_datasets, which queries the clp_datasets table. However, this table exists only for the CLP-S storage engine, not for the CLP Text package.
Proposed Solution
Add a storage engine check before fetching existing datasets, as suggested by @haiqi96:
- Only call fetch_existing_datasets when the storage engine is CLP-S
- Initialize existing_datasets as an empty set for other storage engines
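The two points above can be sketched roughly as follows. This is a minimal illustration, not the scheduler's actual code: the StorageEngine enum, the load_existing_datasets helper, and the "default" dataset name are hypothetical stand-ins, and the real fetch_existing_datasets call is replaced by a plain callable so the example runs without a database.

```python
from enum import Enum
from typing import Callable, Set


class StorageEngine(Enum):
    """Hypothetical stand-in for the package's storage engine setting."""

    CLP = "clp"  # CLP Text
    CLP_S = "clp-s"


def load_existing_datasets(
    storage_engine: StorageEngine,
    fetch_datasets: Callable[[], Set[str]],
) -> Set[str]:
    """Query the datasets table only when the storage engine is CLP-S.

    fetch_datasets stands in for the real fetch_existing_datasets call,
    which needs a database cursor and a table prefix.
    """
    if storage_engine is StorageEngine.CLP_S:
        return fetch_datasets()
    # Other engines have no clp_datasets table, so never touch the database.
    return set()


if __name__ == "__main__":
    def fetch_from_missing_table() -> Set[str]:
        # Simulates the crash reported above for the CLP Text package.
        raise RuntimeError("Table 'clp-db.clp_datasets' doesn't exist")

    # CLP Text: the fetch is skipped, so the simulated error is never raised.
    print(load_existing_datasets(StorageEngine.CLP, fetch_from_missing_table))

    # CLP-S: the fetch runs and its result is returned unchanged.
    print(load_existing_datasets(StorageEngine.CLP_S, lambda: {"default"}))
```

Passing the fetch as a callable keeps the guard testable in isolation; in the scheduler itself the check would presumably sit directly around the existing fetch_existing_datasets call.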
Additional Context
- components/job-orchestration/job_orchestration/scheduler/compress/compression_scheduler.py
- dataset-manager scripts to support listing datasets, and deleting them entirely. #1144
- dataset-manager scripts to support listing datasets, and deleting them entirely. #1144 (comment)
Reporter
@junhaoliao