Improved DRMAAJobRunner #4275
Closed
bernt-matthias wants to merge 140 commits into galaxyproject:dev from
…into drmaa-fix
- the content of the function is the part of the original check_watched_items function that determined the job state with drmaa.job_state + exception handling
- check_watched_items now just calls the check_watched_item function

The rationale for this change is that the drmaa.job_state function does not work if jobs are submitted as real user. With the new function, the method used to determine the job state can be changed in subclasses.
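The split described above can be pictured with a minimal sketch. This is not the code from drmaa.py: the class names, the `session` object, `JobStateError`, and the state strings are hypothetical stand-ins, used only to show how a per-job hook lets a subclass swap the state lookup without touching the loop.

```python
class JobStateError(Exception):
    """Raised by the (hypothetical) session when a job state is unavailable."""


class BaseRunner:
    """Polls watched jobs; the per-job state lookup is an overridable hook."""

    def __init__(self, session):
        self.session = session  # hypothetical object with job_state(job_id)
        self.watched = []

    def check_watched_item(self, job_id):
        """Return the state of one job, or None if it cannot be determined."""
        try:
            return self.session.job_state(job_id)
        except JobStateError:
            return None

    def check_watched_items(self):
        """Query every watched job and keep the ones that are not finished."""
        states = {}
        still_watched = []
        for job_id in self.watched:
            state = self.check_watched_item(job_id)
            states[job_id] = state
            if state not in ("done", "failed"):
                still_watched.append(job_id)
        self.watched = still_watched
        return states


class CommandLineRunner(BaseRunner):
    """Subclass that replaces only the state lookup, not the polling loop."""

    def __init__(self, session, lookup):
        super().__init__(session)
        self.lookup = lookup  # hypothetical callable: job_id -> state string

    def check_watched_item(self, job_id):
        return self.lookup(job_id)
```

The loop in `check_watched_items` stays identical in both classes; only the hook differs, which is the point of the refactoring.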
This solves the problem that the wait called from _complete_terminal_job was unsuccessful (which led to the use of qacct): wait can only be called once for a completed job, and that call had already been made in the last call from check_watched_item.
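The commit message does not say how the result of the first wait is carried over, so the following is only one possible approach, sketched under that assumption: remember the result of the first successful wait and hand it out again later. `OneShotSession` and `CachedWait` are hypothetical names, not part of the drmaa library or of Galaxy.

```python
class OneShotSession:
    """Stand-in for a session whose wait() works only once per finished job."""

    def __init__(self, results):
        self._results = dict(results)

    def wait(self, job_id):
        # The second call for the same job raises KeyError, mimicking the
        # "can only be called once" behaviour described above.
        return self._results.pop(job_id)


class CachedWait:
    """Wraps a session and remembers the result of the first wait per job."""

    def __init__(self, session):
        self.session = session
        self._cache = {}

    def wait(self, job_id):
        if job_id not in self._cache:
            self._cache[job_id] = self.session.wait(job_id)
        return self._cache[job_id]
```

With such a wrapper, the state check and the later completion step can both ask for the wait result without the second request failing.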
Commit solves #4308. But I do not know if this is the best place.
This was referenced Jul 20, 2017
…vailable immediately)
DRMAAUnivaJobRunner -> UnivaJobRunner
…into drmaa-fix
Superseded by #4857.
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 12, 2018
Reimplementation of the DRMAA runner inspired by the SLURM runner. Currently tested only for the Univa grid engine (but I'm optimistic that it should work as well for other DRMAA systems).

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as real user (because jobs that are started in a different DRMAA session cannot be accessed from the session that is open in Galaxy):
- this is done by resorting to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information.
- this has the additional advantage that if the drmaa library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem) the runner can still use the command line tools.

Furthermore (in contrast to the original DRMAA runner) the new one tests for run time and memory violations:
- memory violations are determined by comparing the used and the requested memory
- run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time

The used memory and time are determined with drmaa.wait() or qacct.

Open (or better, perspective):
- adaptations to other grid engines. The current implementation (the command line calls and result parsing) might be specific to the Univa grid engine. To include other GEs one could determine the GE (+ version) and make the calls and result parsing depend on this.

Implementation note: The changes in drmaa.py do not change the functionality at all, but only reorganize the code. In particular, part of the function `check_watched_items` was put into a new function `check_watched_item` in order to make subclassing more convenient.

Replaces galaxyproject#6931 (which replaced galaxyproject#4275), since I messed up with git again (there were some duplicated commits).
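To make the "result parsing" part more concrete, here is a small sketch of turning qacct-style accounting text into values that can be compared against the requested resources. Everything in it is an assumption for illustration: the sample text is made up, the field names `ru_wallclock` and `maxvmem` are as I recall them from Grid Engine's `qacct -j` output, and the unit suffixes may differ between grid engine versions, so check against a real installation before relying on it.

```python
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_accounting(text):
    """Parse 'key value' lines into a dict; lines without a value are skipped."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


def to_bytes(value):
    """Convert a non-empty value such as '1.5G' or '2048' to a number of bytes."""
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return float(value[:-1]) * _UNITS[suffix]
    return float(value)


# Made-up sample in the general shape of accounting output (an assumption).
SAMPLE = """\
==============================================================
jobnumber    4711
exit_status  137
ru_wallclock 3605
maxvmem      1.5G
"""

record = parse_accounting(SAMPLE)
used_seconds = float(record["ru_wallclock"])
used_bytes = to_bytes(record["maxvmem"])
```

Keeping the parser separate from the command invocation makes it easier to adapt to another grid engine: only the field names and unit handling would need to change.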
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 12, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 13, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 14, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 14, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 14, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Dec 4, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Jan 29, 2019
A subclass of DRMAAJobRunner that detects run time and memory violations. I took some inspiration from the SLURM runner. I would be happy to get some comments, in particular on whether the logic of the main function integrates nicely with the way jobs are pipelined through the runner.
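The main improvements over the current DRMAAJobRunner are:
- memory violations are determined by comparing the used and the requested memory
- run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time

The used memory and time are determined with drmaa.wait() or qacct.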
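The comparison of used against requested resources can be sketched as a single function. This is only one reading of the description, not the runner's code: the function name, the return values, and in particular the default set of signals treated as "killed for exceeding the run time" are assumptions, and which signal a scheduler sends depends on its configuration.

```python
def classify_violation(used_mem, requested_mem, used_time, requested_time,
                       kill_signal=None,
                       limit_signals=("SIGKILL", "SIGXCPU")):
    """Return "memory", "time", or None for a finished job.

    Memory values are in bytes and times in seconds. A requested value of
    None means that no limit was requested. limit_signals is an assumed
    default, not something taken from a scheduler's documentation.
    """
    if requested_mem is not None and used_mem > requested_mem:
        return "memory"
    if (requested_time is not None
            and kill_signal in limit_signals
            and used_time >= requested_time):
        return "time"
    return None
```

For example, a job that used 61 seconds of a requested 60 and was ended by `SIGKILL` is classified as a run time violation, while the same signal after 30 seconds is not, since the job was then killed for some other reason.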
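In addition, it solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as real user (because jobs that are started in a different DRMAA session cannot be accessed from the session that is open in Galaxy):
- this is done by resorting to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information
- this has the additional advantage that the runner can still use the command line tools if the drmaa library functions are not working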
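A fallback of this kind, where the library is tried first and command line tools are used when it cannot see the job, might look roughly like the sketch below. The helper name and the idea of passing the lookups as a list of callables are my own assumptions for illustration; the actual runner may structure this differently.

```python
def job_state_with_fallback(job_id, lookups):
    """Try each (name, lookup) pair in order and return the first answer.

    lookups is a list of (name, callable) pairs; each callable takes a job
    id and either returns a state or raises an exception. The names are
    only used for reporting which source answered.
    """
    errors = []
    for name, lookup in lookups:
        try:
            return name, lookup(job_id)
        except Exception as exc:  # any failure means: try the next source
            errors.append((name, repr(exc)))
    raise RuntimeError(
        "could not determine state of job %s: %s" % (job_id, errors)
    )
```

In a real setup the first callable would wrap the library's job state query and the later ones would wrap the qstat and qacct invocations, in that order, since a finished job is no longer known to qstat.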
I tested it successfully for our Univa grid engine. Also job resubmission seems to work nicely.
Open:
In addition, commit cca871e changes the external runner so that the job template is loaded before the user is changed. Otherwise the cluster_files_directory needs to be readable by all users.
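The ordering matters because the file is read with the permissions of whoever the process is at that moment. A minimal sketch of the idea, assuming the job template is stored as a JSON file and with `switch_user` and `submit` as hypothetical callables standing in for the real privilege change and the submission:

```python
import json


def submit_as_user(template_path, switch_user, submit):
    """Load the job template first, then change the user, then submit."""
    # Read the file while still running as the original user, so the
    # directory does not have to be readable by the target user.
    with open(template_path) as handle:
        template = json.load(handle)
    switch_user()  # hypothetical: drops privileges to the real user
    return submit(template)  # hypothetical: submits the job with the template
```

Swapping the first two steps would make the read happen as the target user, which is exactly the situation that requires a world-readable directory.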