A new runner for DRMAA (currently UNIVA) #4857
bernt-matthias wants to merge 28 commits into galaxyproject:dev from
Conversation
@natefoo @bgruening still struggling with git. Since I was unable to rebase my branch onto an updated dev, I created a new PR with my changes. The old one (#4275) can be closed.
Reimplementation of the DRMAA runner, inspired by the SLURM runner. Currently tested only for the Univa Grid Engine.

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user (because jobs that are started in a different DRMAA session cannot be accessed from the session that is open in Galaxy):

- This is done by falling back to the command line tools qstat and qacct when the DRMAA library cannot be used to check the job status and to get run time information.
- This has the additional advantage that if the DRMAA library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem), the runner can still use the command line tools.

Furthermore (in contrast to the original DRMAA runner), the new one tests for run time and memory violations:

- Memory violations are determined by checking the stderr output and by comparing the used and the requested memory.
- Run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time.

The used memory and time are determined with drmaa.wait() or qacct.

Open (or better: perspective):

- Adaptations to other grid engines. The current implementation (the command line calls and result parsing) might be specific to the Univa Grid Engine. To include other GEs, one could determine the GE (and version) and make the calls and result parsing depend on this.

In addition, the job template is loaded in the external runner before the user is changed; otherwise the cluster_files_directory would need to be readable by all users.
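The fallback logic described above can be sketched roughly as follows. This is an illustration, not the actual Galaxy runner code: the function and state names are assumptions, and the probes are injected as callables so the idea can be shown without a live grid engine.

```python
def check_job_state(job_id, drmaa_status, run_cmd):
    """Resolve a job's state, preferring the DRMAA library.

    drmaa_status(job_id) is expected to return the library's state or
    raise when the session cannot see the job (e.g. because it was
    submitted as the real user in another DRMAA session);
    run_cmd(argv) returns the command's exit status.  Both are
    injected here purely for illustration.
    """
    try:
        return drmaa_status(job_id)          # normal path: drmaa library
    except Exception:
        # Fall back to the command line tools:
        if run_cmd(["qstat", "-j", job_id]) == 0:
            return "active"                  # scheduler still knows the job
        if run_cmd(["qacct", "-j", job_id]) == 0:
            return "finished"                # accounting record exists
        return "unknown"
```

The same injected `run_cmd` would in practice wrap a subprocess call whose output is parsed, which is exactly the part that may be Univa-specific.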
Force-pushed from e15cf66 to dcba7a4
Opened a PR against your branch to touch up the formatting and merge the latest dev to hopefully fix some of these tests. Can you merge it, or give me permission to push these sorts of things to your branch directly? Would you be willing to pull the checks of standard error for tool memory errors out of the runner? Like these lines: I think the TODO there is correct and the tools should be annotating it - e.g. #3107. We could even set up some common set of checks and provide an easy option for tool authors to quickly check for them, but it isn't something I think we should be doing in one particular job runner. I also think that there could be some tools that emit those things even for perfectly fine runs - e.g. a tool that tries one approach until it runs out of memory and, if it does, switches to another.
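The kind of stderr check under discussion might look like the sketch below. The patterns are examples only, not the runner's actual list; curating such language-specific messages is precisely why the thread favours moving this out of a single job runner.

```python
import re

# A few language/runtime specific out-of-memory messages (illustrative
# examples; any real list would need curation and maintenance).
MEMORY_ERROR_PATTERNS = [
    re.compile(r"MemoryError"),                   # Python
    re.compile(r"java\.lang\.OutOfMemoryError"),  # Java
    re.compile(r"std::bad_alloc"),                # C++
]

def stderr_indicates_memory_violation(stderr):
    """Return True if stderr contains a known out-of-memory message.

    Note the caveat raised in the thread: a tool may emit such a
    message during a perfectly fine run (e.g. when it retries with a
    different strategy), so a runner-level check can misfire.
    """
    return any(p.search(stderr) for p in MEMORY_ERROR_PATTERNS)
```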
@jmchilton I invited you as a collaborator on my fork. We can move the stderr checks implemented in I haven't thought about jobs that are successful despite such errors; I guess the simplest solution would be that within
Just had the idea that instead of the tool definition For instance, data_manager_hisat2 could just echo We could just define a flag that won't appear in any tool, e.g. something like
@bernt-matthias I've opened a PR that allows tools to describe out-of-memory detection criteria here: #5196 - want to give me your thoughts?
Just added a memory statement for the UNIVA runner. Does anybody know whether this is per thread or overall?
@bernt-matthias Thanks for sticking with this, and sorry for the repeated delays. Any chance you can merge with dev, now that the memory stuff is merged, to fix the conflict? There are also some linting issues with whitespace that you can see if you click on the Travis tests above. There is also a unit test failure stemming from reclaim_ownership interface changes between jobs and job wrappers - probably sticking a no-op That test file can be run locally with:
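The suggested fix presumably amounts to something like the stub below on the test's mock job wrapper. Only the method name reclaim_ownership comes from the comment; the class name and everything else here are assumptions.

```python
class MockJobWrapper:
    """Minimal stand-in for a Galaxy job wrapper in a unit test."""

    def reclaim_ownership(self):
        # No-op: real-user job runners chown job files back to the
        # Galaxy user at this point; the mock has nothing to reclaim,
        # it only needs to satisfy the interface.
        pass
```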
Force-pushed from 4aaf096 to b86b28e
OK. Did a rebase. Is this what you requested? If there are any linting issues, I will fix them as soon as Travis finishes. I will see if I understand the failed test.
The test seems to run now. Maybe related to the rebase?
All linting errors should be fixed now. Still unsure about the failing unit test. I guess it works locally because I did a rebase and not a merge. Any comments?
I'm not sure what is up - but that unit test definitely fails for me the same way locally. The fix I pushed to your branch fixes it for me. For the rest of the PR, it looks really good to me. I was still hoping that #5196 would mean we could make: just and the variables and computation making up
I'm open to moving or removing this, preferring the former. Can you imagine a place where a check for such language-specific memory errors could go so that other/all schedulers benefit as well? I still think that it would be a lot of redundancy if checks for language-specific errors needed to be implemented by every tool. On our cluster the checks seem to perform quite well, giving me the opportunity to have a quite simple job_conf: we have 6 destinations that have resource requirements as combinations of
Force-pushed from bf23d03 to 6fad4b1
Since unsuccessful job submission raises an error, this is needed; before, this crashed the job queue.
Merge latest dev and touch up formatting.
Hi, just thought about the open issue for jobs that are deleted by the user before the scheduler had a chance to process them. Is there a flag somewhere in the job data structure that marks the job as deleted? Then I could accept the failing qstat/qacct.
It'll go through states
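The idea from the question above could be sketched like this. All state names and the function signature are hypothetical; the actual Galaxy job states and attributes may differ.

```python
def handle_missing_job(galaxy_job_state, qstat_failed, qacct_failed):
    """Decide what a failing qstat/qacct lookup means.

    If the Galaxy-side job record was already marked as deleted, the
    scheduler legitimately has no trace of the job, so the failing
    lookups can be accepted silently; otherwise the job vanished
    unexpectedly and should be treated as an error.
    """
    if qstat_failed and qacct_failed:
        if galaxy_job_state in ("deleted", "deleting"):
            return "ignore"   # user deleted the job before scheduling
        return "error"        # job disappeared for no known reason
    return "ok"               # at least one lookup succeeded
```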
@jmchilton I finally decided to remove all memory violation tests (in favour of per-tool tests / aggressive mode).
Replaced by #7004.