Conversation
This commit does the following: 1. a new cmd teuthology-dispatcher: It watches a queue and takes job from it, locks required nodes without reimaging them and runs the job as its suprocess by invoking teuthology-dispacther in supervisor mode. Supervisor mode reimages the target machines in the config, and invokes teuthology cmd to run the job. 2. refactors task/internal/lock_machines.py: doing so enables locking machines in dispatcher while following DRY. 3. refactors reimaging logic in lock/ops.py: doing so enables reimaging machines in dispatcher's supervisor mode while following DRY. 4. adds an argument, reimage, to lock_many in lock/ops.py: enables optional reimagining of machines depending on the value. Defaults to True. Used in dispatcher to lock machines without reimaging them. Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit unlocks target machines in teuthology-dispacther's supervisor mode after job completes run and teuthology subprocess exits. It either unlock nodes or nukes them depending on job's status. Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
|
Can one of the admins verify this patch? |
teuthology/dispatcher/__init__.py
Outdated
| '--archive-dir', archive_dir, | ||
| ] | ||
|
|
||
| with tempfile.NamedTemporaryFile(prefix='teuthology-dispatcher.', |
There was a problem hiding this comment.
Instead of temporary file can we create this file in the job archive_dir for debugging, reproducibility and simplifying issue analysis.
There was a problem hiding this comment.
We create the same file (orig.config.yaml) in the job archive directory when teuthology cmd is invoked by the supervisor while initiating run in run.py here. I'll refactor so that the file is created here.
There was a problem hiding this comment.
Right, just store under different name, no need to refactor teuthology.run right now, it would be better to drop saving the orig.config.yaml in follow up PR after teuthology-dispatcher. Another reason to minimize invasion to teuthology.run is that some system can still use older py2 branch for running teuthology test, so we want to have teuthology-worker to be backwards compatible for some time.
There was a problem hiding this comment.
Created orig.config.yaml in dispatcher itself (along with creating archive dirs for run and job) without refactoring run.py to maintain backward compatibility.
teuthology/dispatcher/supervisor.py
Outdated
| config_fd = int(args["--config-fd"]) | ||
|
|
||
| with open(config_fd, 'r') as config_file: | ||
| config_file.seek(0) | ||
| job_config = yaml.safe_load(config_file) |
There was a problem hiding this comment.
Please, use file path instead of file descriptor.
| lock_machines_helper(ctx, config) | ||
| try: | ||
| yield | ||
| finally: | ||
| unlock_machines(ctx) | ||
|
|
||
|
|
||
| def lock_machines_helper(ctx, config, reimage=True): |
There was a problem hiding this comment.
I think we decided to not touch internal tasks for now. The dispatcher should not use lock_machines module in any ways.
There was a problem hiding this comment.
It doesn't change the working of the internal task. Just refactors it such that the same logic can be used in dispatcher to wait to for required number of machines to be available and then lock them (using lock_many of lock module). Since the waiting logic is not in the lock module, I thought it's best to reuse the logic from here to avoid repetitive code while still maintaining backward compatibility. Please let me know if this needs to be done differently.
There was a problem hiding this comment.
I can see it does not change the behavior of the code, it just wrong place for this can of refactoring right now, overall locking mechanism should be rewritten in future, but for the goals of this PR those changes are not required. I would prefer to leave internal tasks untouched right now. Later we should refactor the lock_machines tasks to use teuthology.lock.ops code. The dispatcher should not user anything from the directory teuthology/task, and use lock api directly. The tasks should only be invoked from the run itself.
There was a problem hiding this comment.
Added func block_and_lock_machines in lock module which is used by both the lock_machines task and dispatcher. Is this approach okay?
teuthology/dispatcher/supervisor.py
Outdated
| symlink_worker_log(job_config['worker_log'], | ||
| job_config['archive_path']) |
There was a problem hiding this comment.
There is no sense in this code since we can redirect output of corresponding dispatcher log directly to archive_path.
There was a problem hiding this comment.
Removed this symlink since the supervisor log is already inside the job archive now.
|
Good job overall, see inline comments. |
teuthology/dispatcher/__init__.py
Outdated
|
|
||
| def lock_machines(job_config): | ||
| fake_ctx = supervisor.create_fake_context(job_config, block=True) | ||
| lock_machines_helper(fake_ctx, [len(job_config['roles']), |
There was a problem hiding this comment.
Instead of using internal tasks we should use lock module directly.
|
ok to test |
|
@susebot run deploy |
|
Commit 3b4d864 is OK. |
teuthology/dispatcher/supervisor.py
Outdated
| suite_dir = os.path.join(archive_dir, job_config['name']) | ||
| if (not os.path.exists(suite_dir)): | ||
| os.mkdir(suite_dir) | ||
| log_file_path = os.path.join(suite_dir, 'supervisor.{job_id}'.format( |
There was a problem hiding this comment.
Can we put the supervisor log into the job directory, i.e. f'{run_dir}/{job_id}/supervisor.log'
There was a problem hiding this comment.
The only reason I didn't put it there is that we will have unnecessary folders for first and last jobs in a run. If that is not an issue, this makes more sense.
There was a problem hiding this comment.
Yes... right, the first and last in suite are specific jobs, and they should be handled differently, since they do not start any teuthology job.
There was a problem hiding this comment.
Done. Removes the unnecessary dirs created at the end of execution in case the job is first or last in suite here.
This commit saves job config in its archive dir and sends its path instead of file descriptor of a temporary file in the dispatcher and the supervisor. Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
605d4bb to
cc937a1
Compare
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
This commit adds the method block_and_lock machines to ops.py to enable locking machines in dispatcher. This is to ensure that lock_machines task is not used in dispatcher. Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
I have added the Regarding how to deploy the dispatcher, other than changing the cmd
Please let me know if the above sounds right. |
|
I was going to try this changes on my desktop, but I need to merge the #1548 first. |
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
|
thanks, that looks good. only other thing I noticed was due to problems with checking out the teuthology branch again, which would be resolved by the separate teuthology workspaces in the future. |
| started), information about the job is recorded in ``paddles``'s internal | ||
| database. Depending on the requirements (in terms of requested machines), the | ||
| scheduler eventually determines when a job can get executed. At this point the | ||
| master communicates with ``teuthology-worker``, which in turn invokes |
There was a problem hiding this comment.
@jdurgin so with this change you guys are dropping any reference to teuthology-worker?
There was a problem hiding this comment.
yes, I think it'll be obsolete once the dispatcher si stabilized
| verbose = args["--verbose"] | ||
| tube = args["--tube"] | ||
| log_dir = args["--log-dir"] | ||
| archive_dir = args["--archive-dir"] |
There was a problem hiding this comment.
Can we add the default value for archive taken from teuthology config, that will make --archive-dir optional.
Signed-off-by: Shraddha Agrawal <shraddha.agrawal000@gmail.com>
|
@jdurgin were you able to test it appropriately? |
|
@kshtsk hey, sorry I forgot to update the PR. Discussed with @shraddhaag on IRC a bit ago, I saw errors from using the same teuthology checkout as it was being changed/rechecked out, so it seems we need the independent checkouts for this to work well. @shraddhaag do you have time to look at that? If not someone else can finish that part, and then we can re-test. |
|
@jdurgin yes, I'll add teuthology workspace bit. |
|
@susebot run deploy |
|
Commit 7c77115 is OK. |
|
I guess we need to remove [WIP] tag before merge. |
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
The dispatcher and supervisor were added in #1546, but code was copied and pasted into the new modules, leaving the worker untouched. Also untouched were the unit tests, meaning that the dispatcher and supervisor were never unit tested. As the copied code changed, the dispatcher and supervisor were not being tested for regressions, while the worker - which wasn't being anymore - had passing unit tests, giving some false sense of security. This commit removes the old worker code, and adapts the old worker tests to apply to the dispatcher and supervisor. It also splits out teuthology-supervisor into its own command. Signed-off-by: Zack Cerza <zack@redhat.com>
Related PRs:
Motivation
Project Idea Link The existing worker framework leads to jobs, that require large number of machines to run on, waiting for a long time (often indefinitely). This also leads to running jobs in unreliable priority order. This PR tries to solve the above. It the first step towards replacing the existing worker framework with a dispatcher. This PR adds a dispatcher that takes jobs from a queue, allocates nodes for them to run on, and executes them as its subprocess.
Flow charts for:
Changes in the current working:
teuthology-dispatchertakes jobs from the queue, locks machines without reimaging them, populates job config with target machines and invokesteuthology-dispactherin--supervisormode. Without waiting for its child process (teuthology-dispactherin--supervisormode) to finish, it moves on to take another job from the queue and repeat the above process.teuthology-dispacther --supervisor modeis a job supervisor process and thus is job specific. It sets up a supervisor log file in suites' archive dir named supervisor.{job_id}. It then reimages the target machines, constructs theteuthologycmd and invokes it. At the end of the job run, the supervisor checks the job status and depending on whether it was successfully completed or not it unlocks target machines or nukes them.Development Lab Setup:
I have setup a development lab for teuthology, using some mira nodes from the current cluster, to test out the above mentioned changes. I have done so using:
Pulpito can be accessed here.
archive logs are on the master node (mira045) in location: /home/teuthworker/archive
Test performed
Checklist
teuthology-dispatcher, that runs each job in the queue as its subprocess by invokingteuthology-dispatcherin--supervisormode.supervisormode toteuthology-dispatchersuch that it reimages target machines, runs the job whose config is provided to it and unlocks/nukes targets at the end of the job run.supervisorsubprocess.