Skip to content

Set celery task limits and fix worker timeout#1486

Merged
jleaniz merged 14 commits into
google:masterfrom
aarontp:retry-fixes
Jun 7, 2024
Merged

Set celery task limits and fix worker timeout#1486
jleaniz merged 14 commits into
google:masterfrom
aarontp:retry-fixes

Conversation

@aarontp

@aarontp aarontp commented May 30, 2024

Copy link
Copy Markdown
Member

Description of the change

  • Sets various retry and timeout options for celery task setup
  • Updates the worker timeout to kill all child processes if execution time limit is reached
  • Adds counters for soft timeouts and uncaught exceptions
  • Also fixes issue with server timeout which was causing timed out tasks not to be cleaned up properly.
  • Removes unused SINGLE_RUN feature and config option.
  • Changes worker start args from '--pool=solo to --concurrency=1 as a solo pool executes tasks directly instead of forking and cannot process celery timeouts.

After this change there will be four different timeouts that can occur:

  • Turbinia worker timeout: This timeout happens when executing anything through TurbiniaTask.execute() and uses the timeout values set in the Job config in the turbinia config file.
  • Celery Soft timeout: This will send a catchable celery.exceptions.SoftTimeLimitExceeded exception to the worker and should be caught in the TurbiniaTask.execute() or TurbiniaTask.run_wrapper() methods depending on the state of the Task. This is set slightly longer than the Turbinia workout timeout to allow the worker to clean-up and give results when possible. The buffer is defined by task_manager.CELERY_SOFT_TIMEOUT_BUFFER.
  • Celery Hard timeout: After some buffer time after the other timeouts (task_manager.CELERY_HARD_TIMEOUT_BUFFER), celery is configured to kill the worker running the task. This is not catchable so we will not get any results back from the worker so the server will also need to time out the task.
  • Turbinia Server task timeout: If the server does not get any results back from the worker it will time out the task and consider it unrecoverable. This happens task_manager.SERVER_TASK_TIMEOUT_BUFFER seconds after the Task timeout. This buffer is large because the timer starts from when the task is enqueued and can't account for any scheduling and other wait time. This is to avoid timing out valid tasks due to worker congestion.

Applicable issues

Additional information

Checklist

  • All tests were successful.
  • Unit tests added.
  • Documentation updated.

@jleaniz

jleaniz commented May 31, 2024

Copy link
Copy Markdown
Collaborator

@aarontp removing '--pool=solo' from worker.py:253 should enable time limits and avoid the blocking issues we were experiencing.

@aarontp

aarontp commented May 31, 2024

Copy link
Copy Markdown
Member Author

@aarontp removing '--pool=solo' from worker.py:253 should enable time limits and avoid the blocking issues we were experiencing.

Indeed, that worked, thanks!

@aarontp aarontp marked this pull request as ready for review June 4, 2024 02:49
@aarontp aarontp requested a review from jleaniz June 4, 2024 02:49

@jleaniz jleaniz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, leaving this for you to make any changes based on our chat. Thanks for the PR!

Comment thread turbinia/worker.py
Comment thread turbinia/workers/__init__.py
@aarontp aarontp requested a review from jleaniz June 6, 2024 21:20
@aarontp

aarontp commented Jun 6, 2024

Copy link
Copy Markdown
Member Author

PTAL, I added the soft limit exception to the execute() method as well.

@jleaniz jleaniz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jleaniz jleaniz merged commit 7678a63 into google:master Jun 7, 2024
aleu04 pushed a commit to aleu04/turbinia that referenced this pull request Jun 7, 2024
* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf
@aarontp aarontp deleted the retry-fixes branch June 7, 2024 18:14
jleaniz pushed a commit that referenced this pull request Jun 13, 2024
* Added the TurbiniaRequest hashed object in Redis

* Converted TurbiniaTasks to hash objects in Redis

* Made get_request_data more efficient

* Made get_requests_summary more efficient

* Fixed some issues in request_status

* Remove GCP dependencies (#1440)

* Remove gcp dependencies

* Update dockerfiles

* Update dockerfiles

* Update gcp error reporting

* Updates to formatting

* Add unit test

* Update unit test

* Clean up

* Update unit test

* Update error reporting

* Update file

* Update config template

* Catch exception

* Updates

* fix lint

* Lint fixes

* Updates

* Updates

* Various updates and fixes

* Updates

* --- (#1483)

updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Updates

* Update typo

* Updates to unit tests

* Updates to unit tests and linter fixes

* Update table width UI

* Bump the pip group across 2 directories with 1 update (#1487)

Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests).
Bumps the pip group with 1 update in the /turbinia/api/cli directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix small UI bug

* Minor typos/errors

* Set celery task limits and fix worker timeout (#1486)

* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf

* Review fixes

* Updates and yapf fix

* Minor updates

* Update docstrings

* Updates to evidence_size

* Updates

* Change log level for message

* Lint

* bug fixes

* Address review comments

* Fix docstrings

* Minor UI update

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Igor Rodrigues <igormr@nyu.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Aaron Peterson <aaronp@gmail.com>
jleaniz pushed a commit that referenced this pull request Jun 13, 2024
* Added the TurbiniaRequest hashed object in Redis

* Converted TurbiniaTasks to hash objects in Redis

* Made get_request_data more efficient

* Made get_requests_summary more efficient

* Fixed some issues in request_status

* Remove GCP dependencies (#1440)

* Remove gcp dependencies

* Update dockerfiles

* Update dockerfiles

* Update gcp error reporting

* Updates to formatting

* Add unit test

* Update unit test

* Clean up

* Update unit test

* Update error reporting

* Update file

* Update config template

* Catch exception

* Updates

* fix lint

* Lint fixes

* Updates

* Updates

* Various updates and fixes

* Updates

* --- (#1483)

updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...




* Updates

* Update typo

* Updates to unit tests

* Updates to unit tests and linter fixes

* Update table width UI

* Bump the pip group across 2 directories with 1 update (#1487)

Bumps the pip group with 1 update in the / directory: [requests](https://github.com/psf/requests).
Bumps the pip group with 1 update in the /turbinia/api/cli directory: [requests](https://github.com/psf/requests).


Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

Updates `requests` from 2.32.0 to 2.32.3
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](psf/requests@v2.32.0...v2.32.3)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
- dependency-name: requests
  dependency-type: indirect
  dependency-group: pip
...




* Fix small UI bug

* Minor typos/errors

* Set celery task limits and fix worker timeout (#1486)

* Set celery task limits and fix worker timeout

* Remove solo pool and change concurrency=1

* Add soft/hard limit buffers

* Fix server timeout

* Late import for psutil

* small fixes, tests

* fix run tests

* test format string

* Fix process_result test

* fix execute test

* revert unnecessary config changes

* Handle soft timeout exception in execute()

* update timeout message

* Yaaaaaaaapf

* Review fixes

* Updates and yapf fix

* Minor updates

* Update docstrings

* Updates to evidence_size

* Updates

* Change log level for message

* Lint

* bug fixes

* Address review comments

* Fix docstrings

* Minor UI update

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Igor Rodrigues <igormr@nyu.edu>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Aaron Peterson <aaronp@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Redis timeout

2 participants