Job status_msg overflows when multiple tasks fail. #716

@haiqi96

Description

Bug

Currently, when compression tasks in the CLP package fail, the compression scheduler concatenates the task error messages into a single status_msg and stores it in the database.

This causes a problem when multiple tasks fail with long error messages. We had a job that launched 260 tasks, all of which failed due to a write-permission issue. As a result, the concatenated error message could not be written to the database, and the compression scheduler crashed with the following error message:

2025-02-07 21:39:55,291 compression_scheduler [ERROR] Error in scheduling.
Traceback (most recent call last):
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 395, in main
    poll_running_jobs(db_conn, db_cursor)
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 339, in poll_running_jobs
    update_compression_job_metadata(
  File "/opt/clp/lib/python3/site-packages/job_orchestration/scheduler/compress/compression_scheduler.py", line 84, in update_compression_job_metadata
    db_cursor.execute(query, values)
mariadb.DataError: Data too long for column 'status_msg' at row 1

Increasing the width of the status_msg column won't help, because no fixed width can scale with the number of tasks. We need a smarter way to report errors.
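One possible direction, sketched here only as an illustration and not as the scheduler's actual code: collapse identical task errors into a count and cap the result at a fixed length before it is written to status_msg. The function name `summarize_task_errors` and the limit `MAX_STATUS_MSG_LEN` are hypothetical; the real width of the status_msg column is not stated in this issue, and the cap below counts characters, not bytes.

```python
from collections import Counter

# Hypothetical limit: the actual width of the status_msg column is not given here.
MAX_STATUS_MSG_LEN = 512


def summarize_task_errors(task_errors: list[str], max_len: int = MAX_STATUS_MSG_LEN) -> str:
    """Collapse identical task errors and cap the summary at max_len characters."""
    # Group identical messages so 260 copies of one error become a single line.
    counts = Counter(task_errors)
    parts = [f"{n} task(s) failed: {msg}" for msg, n in counts.most_common()]
    summary = "\n".join(parts)
    if len(summary) <= max_len:
        return summary
    # Still too long (many distinct errors): truncate and mark the truncation.
    suffix = "... [truncated]"
    keep = max(0, max_len - len(suffix))
    return (summary[:keep] + suffix)[:max_len]
```

With 260 tasks that all fail with the same permission error, the summary is one short line regardless of the task count. It does not preserve every distinct message, so it would complement a full error log, not replace one.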

Potential fix:

We could write the long error logs to a file on the local filesystem. We would then need some way for the web UI to load or download the error log.
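A minimal sketch of that idea, assuming the scheduler has the failed tasks' errors available as a mapping from task ID to message. The function name `write_job_error_log`, the file naming scheme, and the JSON Lines layout are all assumptions made for illustration; none of them come from the CLP codebase.

```python
import json
from pathlib import Path


def write_job_error_log(log_dir: Path, job_id: int, task_errors: dict[int, str]) -> str:
    """Write every task's error to one file and return a short status message."""
    log_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one JSON Lines file per failed job.
    log_path = log_dir / f"compression-job-{job_id}-errors.jsonl"
    with open(log_path, "w", encoding="utf-8") as f:
        for task_id, error in sorted(task_errors.items()):
            f.write(json.dumps({"task_id": task_id, "error": error}) + "\n")
    # Only this short message would be stored in status_msg.
    return f"{len(task_errors)} task(s) failed; see {log_path}"
```

The returned message grows with the path length, not with the number of tasks, so it stays small for any job size. How the web UI would locate and serve the file is left open, as in the proposal above.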

CLP version

0f7f433

Environment

Ubuntu 22.04, but it shouldn't matter.

Reproduction steps

There should be multiple ways to reproduce this. Just make sure that:

  1. The job spawns enough tasks.
  2. The compression job fails somehow.

How we discovered this:

  1. Compress the 65 GB MongoDB dataset, with each file ~256 MB; the input can come from local disk.
  2. Use S3 output for archives, with valid S3 credentials and a valid bucket, but with a prefix that the account is not allowed to write to.

Labels

bug