Bug
Currently, when compression tasks in the CLP package fail, the compression scheduler concatenates the task error messages into a single status_msg and stores it in the database.
This causes a problem when many tasks fail with long error messages. We had a job that launched 260 tasks, all of which failed due to a write-permission issue. As a result, the concatenated error message could not be written to the database, and the compression scheduler crashed with the following error:
2025-02-07 21:39:55,291 compression_scheduler [ERROR] Error in scheduling.
Traceback (most recent call last):
File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 395, in main
poll_running_jobs(db_conn, db_cursor)
File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 339, in poll_running_jobs
update_compression_job_metadata(
File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 84, in update_compression_job_metadata
db_cursor.execute(query, values)
mariadb.DataError: Data too long for column 'status_msg' at row 1
Increasing the width of the status_msg column won't help, because no fixed width can scale with the number of tasks. We need to find a smarter way to report errors.
Potential fix:
We could consider writing the long error logs to a file on the local filesystem. We would just need some way to let the webui load/download the error log.
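A rough sketch of what this could look like, not a proposal for the actual implementation: the function name `summarize_task_errors`, the `MAX_STATUS_MSG_LEN` limit, the log file naming, and the `(task_id, message)` input shape are all assumptions made up for illustration. The idea is to write every task's error to a file and keep only a short, bounded summary for status_msg.

```python
from pathlib import Path

# Hypothetical limit; the real width of the status_msg column is not stated here.
MAX_STATUS_MSG_LEN = 512


def summarize_task_errors(job_id, task_errors, log_dir, max_len=MAX_STATUS_MSG_LEN):
    """Write all task errors to a file and return (summary, log_path).

    task_errors is a list of (task_id, error_message) tuples (assumed shape).
    The returned summary is never longer than max_len characters.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / f"compression-job-{job_id}-errors.log"

    # The full, unbounded detail goes to the file, one task per entry.
    with open(log_path, "w", encoding="utf-8") as f:
        for task_id, msg in task_errors:
            f.write(f"task {task_id}: {msg}\n")

    # The summary puts the count and log path first so they are the last
    # parts to be lost if truncation is needed.
    first_msg = task_errors[0][1] if task_errors else ""
    summary = (
        f"{len(task_errors)} task(s) failed; full log: {log_path}; "
        f"first error: {first_msg}"
    )
    if len(summary) > max_len:
        summary = summary[: max_len - 3] + "..."
    return summary, log_path
```

With something like this, the scheduler would store only `summary` in status_msg, so the size of the value written to the database no longer depends on the number of failed tasks. How the webui would locate and serve the file is still open.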
CLP version
0f7f433
Environment
Ubuntu 22.04, but it doesn't really matter
Reproduction steps
There should be multiple ways to reproduce it. Just make sure that:
- The job spawns enough tasks.
- The compression job fails somehow.
How we ended up discovering this:
- Compress the 65 GB MongoDB dataset, with each file ~256 MB; the input can be read from disk.
- Use S3 output for archives, with valid S3 credentials and a valid bucket, but with a prefix that the account is not allowed to write to.