Add teuthology-dispatcher #1546

Merged
jdurgin merged 21 commits into ceph:master from shraddhaag:add-minimal-dispatcher
Jan 28, 2021

Conversation

@shraddhaag
Contributor

@shraddhaag shraddhaag commented Aug 10, 2020

Related PRs:

Motivation
Project Idea Link With the existing worker framework, jobs that require a large number of machines can wait a long time (often indefinitely) to run, and jobs end up running in unreliable priority order. This PR addresses both problems. It is the first step towards replacing the existing worker framework with a dispatcher. This PR adds a dispatcher that takes jobs from a queue, allocates nodes for them to run on, and executes them as its subprocesses.

Flow charts for:

[flow chart image]

Changes to the current workflow:

  • teuthology-dispatcher takes jobs from the queue, locks machines without reimaging them, populates the job config with target machines, and invokes teuthology-dispatcher in --supervisor mode. Without waiting for its child process (teuthology-dispatcher in --supervisor mode) to finish, it moves on to take another job from the queue and repeats the above process.
  • teuthology-dispatcher --supervisor mode is a job supervisor process and is thus job specific. It sets up a supervisor log file in the suite's archive dir named supervisor.{job_id}. It then reimages the target machines, constructs the teuthology cmd, and invokes it. At the end of the job run, the supervisor checks the job status and, depending on whether the job completed successfully, unlocks the target machines or nukes them.
  • graceful job timeout: When a job times out, before nuking the nodes in the cluster, the supervisor first compresses and archives the logs from each node and then nukes them.
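The dispatcher/supervisor hand-off described above can be sketched as follows (a rough illustration only; the job-dict key, queue interface, and helper names are hypothetical, not the actual teuthology API):

```python
# Rough sketch of the dispatcher loop: pull a job from the queue, lock
# its machines WITHOUT reimaging, spawn a supervisor subprocess for it,
# and immediately move on to the next job (no waiting on the child).
def dispatch(queue, lock_machines, spawn):
    """Drain ``queue``; ``None`` acts as a stop sentinel."""
    supervisors = []
    for job in iter(queue.get, None):
        # Lock targets now; reimaging is deferred to the supervisor.
        lock_machines(job, reimage=False)
        # Hand the job off to a supervisor process and keep dispatching.
        supervisors.append(spawn([
            'teuthology-dispatcher', '--supervisor',
            '--job-config', job['config_path'],
        ]))
    return supervisors
```

In the real command, spawn would be something like subprocess.Popen, so the dispatcher never blocks on a running job.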

Development Lab Setup:

I have set up a development lab for teuthology, using some mira nodes from the current cluster, to test out the above-mentioned changes. I have done so using:

  • Master node (teuthworker/teuthology node): mira045
  • Paddles/Pulpito node: mira119
  • Test nodes: mira062, mira072, mira085

Pulpito can be accessed here.
Archive logs are on the master node (mira045) at: /home/teuthworker/archive

Test performed

  • Job meant to pass: link
  • Job meant to fail: link
  • Job successfully transferring logs from the remote machine to the archive dir in case of job timeout:
    1. (teuthology init + ceph task): link
    2. (teuthology init + cephadm task): link

Checklist

  • Add a new dispatcher command, teuthology-dispatcher, that runs each job in the queue as its subprocess by invoking teuthology-dispatcher in --supervisor mode.
  • Add supervisor mode to teuthology-dispatcher such that it reimages target machines, runs the job whose config is provided to it and unlocks/nukes targets at the end of the job run.
  • Move reimaging and unlocking of targets into the supervisor subprocess.
  • Add graceful job timeout by archiving logs before nuking machines commit.
  • Use same teuthology workspace for entire test suite.
  • Update documentation

This commit does the following:

1. a new cmd teuthology-dispatcher: it watches a queue, takes
jobs from it, locks the required nodes without reimaging them, and
runs each job as its subprocess by invoking teuthology-dispatcher in
supervisor mode. Supervisor mode reimages the target machines in
the config and invokes the teuthology cmd to run the job.

2. refactors task/internal/lock_machines.py: doing so enables
locking machines in dispatcher while following DRY.

3. refactors reimaging logic in lock/ops.py: doing so enables
reimaging machines in dispatcher's supervisor mode while following
DRY.

4. adds an argument, reimage, to lock_many in lock/ops.py: enables
optional reimaging of machines depending on the value. Defaults
to True. Used in dispatcher to lock machines without reimaging them.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit unlocks target machines in teuthology-dispatcher's
supervisor mode after the job completes its run and the teuthology
subprocess exits. It either unlocks the nodes or nukes them depending
on the job's status.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@ceph-jenkins

Can one of the admins verify this patch?

@shraddhaag shraddhaag mentioned this pull request Aug 10, 2020
@jdurgin jdurgin requested a review from kshtsk August 10, 2020 20:43
'--archive-dir', archive_dir,
]

with tempfile.NamedTemporaryFile(prefix='teuthology-dispatcher.',
Contributor

Instead of temporary file can we create this file in the job archive_dir for debugging, reproducibility and simplifying issue analysis.

Contributor Author

We create the same file (orig.config.yaml) in the job archive directory when the teuthology cmd is invoked by the supervisor while initiating the run in run.py here. I'll refactor so that the file is created here.

Contributor

Right, just store it under a different name; no need to refactor teuthology.run right now. It would be better to drop saving orig.config.yaml in a follow-up PR after teuthology-dispatcher. Another reason to minimize changes to teuthology.run is that some systems may still use the older py2 branch for running teuthology tests, so we want teuthology-worker to stay backwards compatible for some time.

Contributor Author

Created orig.config.yaml in dispatcher itself (along with creating archive dirs for run and job) without refactoring run.py to maintain backward compatibility.

Comment on lines +31 to +35
config_fd = int(args["--config-fd"])

with open(config_fd, 'r') as config_file:
config_file.seek(0)
job_config = yaml.safe_load(config_file)
Contributor

Please, use file path instead of file descriptor.

Contributor Author

Done.
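The fix can be sketched like this (a minimal illustration; the --job-config flag name is assumed here, not necessarily the final CLI):

```python
import yaml  # PyYAML

def load_job_config(args):
    # Read the job config from a file path passed on the command line,
    # rather than from an inherited file descriptor.
    with open(args['--job-config'], 'r') as config_file:
        return yaml.safe_load(config_file)
```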

Comment on lines +26 to +33
lock_machines_helper(ctx, config)
try:
yield
finally:
unlock_machines(ctx)


def lock_machines_helper(ctx, config, reimage=True):
Contributor

I think we decided to not touch internal tasks for now. The dispatcher should not use lock_machines module in any ways.

Contributor Author

@shraddhaag shraddhaag Aug 13, 2020

It doesn't change the working of the internal task. It just refactors it so that the same logic can be used in the dispatcher to wait for the required number of machines to be available and then lock them (using lock_many of the lock module). Since the waiting logic is not in the lock module, I thought it best to reuse the logic from here to avoid repetitive code while still maintaining backward compatibility. Please let me know if this needs to be done differently.

Contributor

I can see it does not change the behavior of the code; it is just the wrong place for this kind of refactoring right now. The overall locking mechanism should be rewritten in the future, but for the goals of this PR those changes are not required. I would prefer to leave the internal tasks untouched right now. Later we should refactor the lock_machines task to use the teuthology.lock.ops code. The dispatcher should not use anything from the directory teuthology/task, and should use the lock api directly. The tasks should only be invoked from the run itself.

Contributor Author

Added the func block_and_lock_machines to the lock module, which is used by both the lock_machines task and the dispatcher. Is this approach okay?
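The wait-then-lock behaviour under discussion can be sketched roughly as follows (illustrative only; the real block_and_lock_machines in teuthology.lock.ops has a different signature, and lock_many/machines_free here are stand-ins):

```python
import time

def block_and_lock_machines(count, lock_many, machines_free,
                            interval=10, tries=None):
    """Poll until ``count`` machines are free, then lock them.

    ``lock_many`` and ``machines_free`` stand in for the real
    lock-server calls; only the blocking behaviour is illustrated.
    """
    attempt = 0
    while tries is None or attempt < tries:
        if machines_free() >= count:
            locked = lock_many(count)
            if locked:  # lock_many may race with others and lock nothing
                return locked
        attempt += 1
        time.sleep(interval)
    raise RuntimeError('gave up waiting for %d machines' % count)
```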

Comment on lines +159 to +160
symlink_worker_log(job_config['worker_log'],
job_config['archive_path'])
Contributor

There is no sense in this code since we can redirect output of corresponding dispatcher log directly to archive_path.

Contributor Author

Removed this symlink since the supervisor log is already inside the job archive now.

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

Good job overall, see inline comments.


def lock_machines(job_config):
fake_ctx = supervisor.create_fake_context(job_config, block=True)
lock_machines_helper(fake_ctx, [len(job_config['roles']),
Contributor

Instead of using internal tasks we should use lock module directly.

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

ok to test

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

@susebot run deploy

@susebot

susebot commented Aug 11, 2020

Commit 3b4d864 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/240/

suite_dir = os.path.join(archive_dir, job_config['name'])
if not os.path.exists(suite_dir):
os.mkdir(suite_dir)
log_file_path = os.path.join(suite_dir, 'supervisor.{job_id}'.format(
Contributor

Can we put the supervisor log into the job directory, i.e. f'{run_dir}/{job_id}/supervisor.log'

Contributor Author

The only reason I didn't put it there is that we will have unnecessary folders for the first and last jobs in a run. If that is not an issue, this makes more sense.

Contributor

Yes... right, the first and last in suite are specific jobs, and they should be handled differently, since they do not start any teuthology job.

Contributor Author

@shraddhaag shraddhaag Aug 13, 2020

Done. Removed the unnecessary dirs created at the end of execution in case the job is the first or last in the suite here.
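Placing the supervisor log under the job's own directory, as settled above, amounts to something like this (a hypothetical helper; the path layout is assumed from the discussion):

```python
import os

def supervisor_log_path(archive_dir, run_name, job_id):
    # Put the supervisor log inside the job's own directory:
    # <archive>/<run>/<job_id>/supervisor.log
    job_dir = os.path.join(archive_dir, run_name, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    return os.path.join(job_dir, 'supervisor.log')
```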

This commit saves the job config in its archive dir and passes its
path, instead of the file descriptor of a temporary file, to the
dispatcher and the supervisor.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@shraddhaag shraddhaag force-pushed the add-minimal-dispatcher branch from 605d4bb to cc937a1 Compare August 13, 2020 12:12
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit adds the method block_and_lock_machines to ops.py
to enable locking machines in the dispatcher. This ensures that the
lock_machines task is not used in the dispatcher.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@shraddhaag
Contributor Author

We need to update the manual within the docs directory on how to use and deploy the dispatcher.

I have added the teuthology-dispatcher cmd page and updated COMPONENTS.rst with the new architecture and working details. Since the dispatcher will not be directly interacted with by teuthology users, I'm not sure whether more details are required on how to use it, and if so, where I should add them.

Regarding how to deploy the dispatcher: other than changing the cmd teuthology-worker to teuthology-dispatcher, not much has changed syntactically. I would, however, like to update:

Please let me know if the above sounds right.

@kshtsk
Contributor

kshtsk commented Aug 31, 2020

I was going to try these changes on my desktop, but I need to merge #1548 first.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@jdurgin
Member

jdurgin commented Sep 1, 2020

thanks, that looks good. only other thing I noticed was due to problems with checking out the teuthology branch again, which would be resolved by the separate teuthology workspaces in the future.

started), information about the job is recorded in ``paddles``'s internal
database. Depending on the requirements (in terms of requested machines), the
scheduler eventually determines when a job can get executed. At this point the
master communicates with ``teuthology-worker``, which in turn invokes
Contributor

@jdurgin so with this change you guys are dropping any reference to teuthology-worker?

Member

yes, I think it'll be obsolete once the dispatcher is stabilized

verbose = args["--verbose"]
tube = args["--tube"]
log_dir = args["--log-dir"]
archive_dir = args["--archive-dir"]
Contributor

Can we add a default value for the archive dir, taken from the teuthology config? That would make --archive-dir optional.
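The suggested fallback could look like this (a sketch only; the config attribute name archive_base is assumed for illustration):

```python
def resolve_archive_dir(args, teuth_config):
    # Use --archive-dir when given on the command line; otherwise fall
    # back to the default from the teuthology config, which makes the
    # flag optional.
    return args.get('--archive-dir') or teuth_config.archive_base
```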

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@kshtsk
Contributor

kshtsk commented Oct 21, 2020

@jdurgin were you able to test it appropriately?

@jdurgin
Member

jdurgin commented Oct 21, 2020

@kshtsk hey, sorry I forgot to update the PR. Discussed with @shraddhaag on IRC a bit ago. I saw errors from using the same teuthology checkout as it was being changed/re-checked out, so it seems we need independent checkouts for this to work well.

@shraddhaag do you have time to look at that? If not someone else can finish that part, and then we can re-test.

@shraddhaag
Contributor Author

@jdurgin yes, I'll add teuthology workspace bit.

@jdurgin
Member

jdurgin commented Jan 20, 2021

I've been testing this with the mira queue in sepia, and it's been working well with #1599 . It requires jobs to be scheduled with teuthology_branch: wip-dispatcher to work right now.

I'd like to merge and start trying it with smithi next. @kshtsk what do you think?

@kshtsk
Contributor

kshtsk commented Jan 27, 2021

@susebot run deploy

@susebot

susebot commented Jan 27, 2021

Commit 7c77115 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/291/

@kshtsk
Contributor

kshtsk commented Jan 27, 2021

I guess we need to remove [WIP] tag before merge.

@jdurgin jdurgin changed the title [WIP] Add cmd teuthology-dispatcher Add teuthology-dispatcher Jan 27, 2021
@jdurgin jdurgin merged commit f185165 into ceph:master Jan 28, 2021
zmc added a commit that referenced this pull request Jun 25, 2024
The dispatcher and supervisor were added in #1546, but code was copied and
pasted into the new modules, leaving the worker untouched. Also untouched were
the unit tests, meaning that the dispatcher and supervisor were never unit
tested. As the copied code changed, the dispatcher and supervisor were not being
tested for regressions, while the worker - which wasn't being anymore - had
passing unit tests, giving some false sense of security.

This commit removes the old worker code, and adapts the old worker tests to
apply to the dispatcher and supervisor. It also splits out teuthology-supervisor
into its own command.

Signed-off-by: Zack Cerza <zack@redhat.com>