Add teuthology-dispatcher #1546

Merged
jdurgin merged 21 commits into ceph:master from shraddhaag:add-minimal-dispatcher
Jan 28, 2021

Conversation

@shraddhaag
Contributor

@shraddhaag shraddhaag commented Aug 10, 2020

Related PRs:

Motivation
Project Idea Link With the existing worker framework, jobs that require a large number of machines can wait a long time (often indefinitely) to run, and jobs end up running in unreliable priority order. This PR addresses both problems. It is the first step towards replacing the existing worker framework with a dispatcher. This PR adds a dispatcher that takes jobs from a queue, allocates nodes for them to run on, and executes them as its subprocesses.

Flow charts for:

[flow chart image]

Changes to the current workflow:

  • teuthology-dispatcher takes jobs from the queue, locks machines without reimaging them, populates the job config with target machines, and invokes teuthology-dispatcher in --supervisor mode. Without waiting for its child process (teuthology-dispatcher in --supervisor mode) to finish, it moves on to take another job from the queue and repeats the above process.
  • teuthology-dispatcher --supervisor mode is a job supervisor process and is thus job specific. It sets up a supervisor log file in the suite's archive dir named supervisor.{job_id}. It then reimages the target machines, constructs the teuthology cmd, and invokes it. At the end of the job run, the supervisor checks the job status and, depending on whether the job completed successfully, unlocks the target machines or nukes them.
  • graceful job timeout: When a job times out, before nuking the nodes in the cluster, the supervisor first compresses and archives the logs from each node and then nukes them.
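The dispatcher/supervisor hand-off described above can be sketched as follows (a rough illustration only; the job-dict key, queue interface, and helper names are hypothetical, not the actual teuthology API):

```python
# Rough sketch of the dispatcher loop: pull a job from the queue, lock
# its machines WITHOUT reimaging, spawn a supervisor subprocess for it,
# and immediately move on to the next job (no waiting on the child).
def dispatch(queue, lock_machines, spawn):
    """Drain ``queue``; ``None`` acts as a stop sentinel."""
    supervisors = []
    for job in iter(queue.get, None):
        # Lock targets now; reimaging is deferred to the supervisor.
        lock_machines(job, reimage=False)
        # Hand the job off to a supervisor process and keep dispatching.
        supervisors.append(spawn([
            'teuthology-dispatcher', '--supervisor',
            '--job-config', job['config_path'],
        ]))
    return supervisors
```

In the real command, spawn would be something like subprocess.Popen, so the dispatcher never blocks on a running job.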

Development Lab Setup:

I have set up a development lab for teuthology, using some mira nodes from the current cluster, to test out the above-mentioned changes. I have done so using:

  • Master node (teuthworker/teuthology node): mira045
  • Paddles/Pulpito node: mira119
  • Test nodes: mira062, mira072, mira085

Pulpito can be accessed here.
Archive logs are on the master node (mira045) at: /home/teuthworker/archive

Test performed

  • Job meant to pass: link
  • Job meant to fail: link
  • Job successfully transferring logs from the remote machine to the archive dir in case of job timeout:
    1. (teuthology init + ceph task): link
    2. (teuthology init + cephadm task): link

Checklist

  • Add a new dispatcher command, teuthology-dispatcher, that runs each job in the queue as its subprocess by invoking teuthology-dispatcher in --supervisor mode.
  • Add supervisor mode to teuthology-dispatcher such that it reimages target machines, runs the job whose config is provided to it and unlocks/nukes targets at the end of the job run.
  • Move reimaging and unlocking of targets into the supervisor subprocess.
  • Add graceful job timeout by archiving logs before nuking machines commit.
  • Use same teuthology workspace for entire test suite.
  • Update documentation

This commit does the following:

1. a new cmd teuthology-dispatcher: it watches a queue, takes
jobs from it, locks the required nodes without reimaging them, and
runs each job as its subprocess by invoking teuthology-dispatcher in
supervisor mode. Supervisor mode reimages the target machines in
the config and invokes the teuthology cmd to run the job.

2. refactors task/internal/lock_machines.py: doing so enables
locking machines in dispatcher while following DRY.

3. refactors reimaging logic in lock/ops.py: doing so enables
reimaging machines in dispatcher's supervisor mode while following
DRY.

4. adds an argument, reimage, to lock_many in lock/ops.py: enables
optional reimaging of machines depending on the value. Defaults
to True. Used in dispatcher to lock machines without reimaging them.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit unlocks target machines in teuthology-dispatcher's
supervisor mode after the job completes its run and the teuthology
subprocess exits. It either unlocks the nodes or nukes them depending
on the job's status.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@ceph-jenkins

Can one of the admins verify this patch?

@shraddhaag shraddhaag mentioned this pull request Aug 10, 2020
@jdurgin jdurgin requested a review from kshtsk August 10, 2020 20:43
'--archive-dir', archive_dir,
]

with tempfile.NamedTemporaryFile(prefix='teuthology-dispatcher.',
Contributor

Instead of temporary file can we create this file in the job archive_dir for debugging, reproducibility and simplifying issue analysis.

Contributor Author

We create the same file (orig.config.yaml) in the job archive directory when the teuthology cmd is invoked by the supervisor while initiating the run in run.py here. I'll refactor so that the file is created here.

Contributor

Right, just store it under a different name; no need to refactor teuthology.run right now. It would be better to drop saving orig.config.yaml in a follow-up PR after teuthology-dispatcher. Another reason to minimize changes to teuthology.run is that some systems may still use the older py2 branch for running teuthology tests, so we want teuthology-worker to stay backwards compatible for some time.

Contributor Author

Created orig.config.yaml in dispatcher itself (along with creating archive dirs for run and job) without refactoring run.py to maintain backward compatibility.

Comment on lines +31 to +35
config_fd = int(args["--config-fd"])

with open(config_fd, 'r') as config_file:
config_file.seek(0)
job_config = yaml.safe_load(config_file)
Contributor

Please, use file path instead of file descriptor.

Contributor Author

Done.
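The fix can be sketched like this (a minimal illustration; the --job-config flag name is assumed here, not necessarily the final CLI):

```python
import yaml  # PyYAML

def load_job_config(args):
    # Read the job config from a file path passed on the command line,
    # rather than from an inherited file descriptor.
    with open(args['--job-config'], 'r') as config_file:
        return yaml.safe_load(config_file)
```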

Comment on lines +26 to +33
lock_machines_helper(ctx, config)
try:
yield
finally:
unlock_machines(ctx)


def lock_machines_helper(ctx, config, reimage=True):
Contributor

I think we decided to not touch internal tasks for now. The dispatcher should not use lock_machines module in any ways.

Contributor Author

@shraddhaag shraddhaag Aug 13, 2020

It doesn't change the working of the internal task. It just refactors it so that the same logic can be used in the dispatcher to wait for the required number of machines to be available and then lock them (using lock_many of the lock module). Since the waiting logic is not in the lock module, I thought it best to reuse the logic from here to avoid repetitive code while still maintaining backward compatibility. Please let me know if this needs to be done differently.

Contributor

I can see it does not change the behavior of the code; it is just the wrong place for this kind of refactoring right now. The overall locking mechanism should be rewritten in the future, but for the goals of this PR those changes are not required. I would prefer to leave the internal tasks untouched right now. Later we should refactor the lock_machines task to use the teuthology.lock.ops code. The dispatcher should not use anything from the directory teuthology/task, and should use the lock api directly. The tasks should only be invoked from the run itself.

Contributor Author

Added the func block_and_lock_machines to the lock module, which is used by both the lock_machines task and the dispatcher. Is this approach okay?
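The wait-then-lock behaviour under discussion can be sketched roughly as follows (illustrative only; the real block_and_lock_machines in teuthology.lock.ops has a different signature, and lock_many/machines_free here are stand-ins):

```python
import time

def block_and_lock_machines(count, lock_many, machines_free,
                            interval=10, tries=None):
    """Poll until ``count`` machines are free, then lock them.

    ``lock_many`` and ``machines_free`` stand in for the real
    lock-server calls; only the blocking behaviour is illustrated.
    """
    attempt = 0
    while tries is None or attempt < tries:
        if machines_free() >= count:
            locked = lock_many(count)
            if locked:  # lock_many may race with others and lock nothing
                return locked
        attempt += 1
        time.sleep(interval)
    raise RuntimeError('gave up waiting for %d machines' % count)
```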

Comment on lines +159 to +160
symlink_worker_log(job_config['worker_log'],
job_config['archive_path'])
Contributor

There is no sense in this code since we can redirect output of corresponding dispatcher log directly to archive_path.

Contributor Author

Removed this symlink since the supervisor log is already inside the job archive now.

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

Good job overall, see inline comments.


def lock_machines(job_config):
fake_ctx = supervisor.create_fake_context(job_config, block=True)
lock_machines_helper(fake_ctx, [len(job_config['roles']),
Contributor

Instead of using internal tasks we should use lock module directly.

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

ok to test

@kshtsk
Contributor

kshtsk commented Aug 11, 2020

@susebot run deploy

@susebot

susebot commented Aug 11, 2020

Commit 3b4d864 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/240/

suite_dir = os.path.join(archive_dir, job_config['name'])
if not os.path.exists(suite_dir):
os.mkdir(suite_dir)
log_file_path = os.path.join(suite_dir, 'supervisor.{job_id}'.format(
Contributor

Can we put the supervisor log into the job directory, i.e. f'{run_dir}/{job_id}/supervisor.log'

Contributor Author

The only reason I didn't put it there is that we will have unnecessary folders for the first and last jobs in a run. If that is not an issue, this makes more sense.

Contributor

Yes... right, the first and last in suite are specific jobs, and they should be handled differently, since they do not start any teuthology job.

Contributor Author

@shraddhaag shraddhaag Aug 13, 2020

Done. Removed the unnecessary dirs created at the end of execution in case the job is the first or last in the suite here.
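Placing the supervisor log under the job's own directory, as settled above, amounts to something like this (a hypothetical helper; the path layout is assumed from the discussion):

```python
import os

def supervisor_log_path(archive_dir, run_name, job_id):
    # Put the supervisor log inside the job's own directory:
    # <archive>/<run>/<job_id>/supervisor.log
    job_dir = os.path.join(archive_dir, run_name, str(job_id))
    os.makedirs(job_dir, exist_ok=True)
    return os.path.join(job_dir, 'supervisor.log')
```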

This commit saves the job config in its archive dir and passes its
path, instead of the file descriptor of a temporary file, to the
dispatcher and the supervisor.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@shraddhaag shraddhaag force-pushed the add-minimal-dispatcher branch from 605d4bb to cc937a1 Compare August 13, 2020 12:12
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit adds the method block_and_lock_machines to ops.py
to enable locking machines in the dispatcher. This ensures that the
lock_machines task is not used in the dispatcher.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@shraddhaag
Contributor Author

We need to update the manual within the docs directory on how to use and deploy the dispatcher.

I have added the teuthology-dispatcher cmd page and updated COMPONENTS.rst with the new architecture and working details. Since the dispatcher will not be directly interacted with by teuthology users, I'm not sure whether more details are required on how to use it, and if so, where I should add them.

Regarding how to deploy the dispatcher: other than changing the cmd teuthology-worker to teuthology-dispatcher, not much has changed syntactically. I would, however, like to update:

Please let me know if the above sounds right.

@kshtsk
Contributor

kshtsk commented Aug 31, 2020

I was going to try these changes on my desktop, but I need to merge #1548 first.

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@jdurgin
Member

jdurgin commented Sep 1, 2020

thanks, that looks good. only other thing I noticed was due to problems with checking out the teuthology branch again, which would be resolved by the separate teuthology workspaces in the future.

started), information about the job is recorded in ``paddles``'s internal
database. Depending on the requirements (in terms of requested machines), the
scheduler eventually determines when a job can get executed. At this point the
master communicates with ``teuthology-worker``, which in turn invokes
Contributor

@jdurgin so with this change you guys are dropping any reference to teuthology-worker?

Member

yes, I think it'll be obsolete once the dispatcher is stabilized

verbose = args["--verbose"]
tube = args["--tube"]
log_dir = args["--log-dir"]
archive_dir = args["--archive-dir"]
Contributor

Can we add a default value for the archive dir, taken from the teuthology config? That would make --archive-dir optional.
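The suggested fallback could look like this (a sketch only; the config attribute name archive_base is assumed for illustration):

```python
def resolve_archive_dir(args, teuth_config):
    # Use --archive-dir when given on the command line; otherwise fall
    # back to the default from the teuthology config, which makes the
    # flag optional.
    return args.get('--archive-dir') or teuth_config.archive_base
```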

Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
@kshtsk
Contributor

kshtsk commented Oct 21, 2020

@jdurgin were you able to test it appropriately?

@jdurgin
Member

jdurgin commented Oct 21, 2020

@kshtsk hey, sorry I forgot to update the PR. Discussed with @shraddhaag on IRC a bit ago. I saw errors from using the same teuthology checkout as it was being changed/re-checked out, so it seems we need independent checkouts for this to work well.

@shraddhaag do you have time to look at that? If not someone else can finish that part, and then we can re-test.

@shraddhaag
Contributor Author

@jdurgin yes, I'll add teuthology workspace bit.

@jdurgin
Member

jdurgin commented Jan 20, 2021

I've been testing this with the mira queue in sepia, and it's been working well with #1599 . It requires jobs to be scheduled with teuthology_branch: wip-dispatcher to work right now.

I'd like to merge and start trying it with smithi next. @kshtsk what do you think?

@kshtsk
Contributor

kshtsk commented Jan 27, 2021

@susebot run deploy

@susebot

susebot commented Jan 27, 2021

Commit 7c77115 is OK.
Check tests results in the Jenkins job: https://ceph-ci.suse.de/job/pr-teuthology-deploy/291/

@kshtsk
Contributor

kshtsk commented Jan 27, 2021

I guess we need to remove [WIP] tag before merge.

@jdurgin jdurgin changed the title [WIP] Add cmd teuthology-dispatcher Add teuthology-dispatcher Jan 27, 2021
@jdurgin jdurgin merged commit f185165 into ceph:master Jan 28, 2021
zmc added a commit that referenced this pull request Jun 25, 2024
The dispatcher and supervisor were added in #1546, but code was copied and
pasted into the new modules, leaving the worker untouched. Also untouched were
the unit tests, meaning that the dispatcher and supervisor were never unit
tested. As the copied code changed, the dispatcher and supervisor were not being
tested for regressions, while the worker - which wasn't being anymore - had
passing unit tests, giving some false sense of security.

This commit removes the old worker code, and adapts the old worker tests to
apply to the dispatcher and supervisor. It also splits out teuthology-supervisor
into its own command.

Signed-off-by: Zack Cerza <zack@redhat.com>