A new runner for DRMAA (currently UNIVA) by bernt-matthias · Pull Request #7004 · galaxyproject/galaxy

bernt-matthias · 2018-11-12T12:15:33Z

Reimplementation of the DRMAA runner inspired by the SLURM runner. Currently tested only for the UNIVA grid engine (but I'm optimistic that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user (because jobs that are started in a different DRMAA session cannot be accessed from the session that is open in Galaxy):

- This is done by resorting to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information.
- This has the additional advantage that if the drmaa library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem), the runner can still use the command line tools.
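The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `query_job_state`, `StateUnavailable`, `drmaa_lookup` and `cli_lookup` are hypothetical names, and the two lookup functions are stand-ins for a real DRMAA call and a real qstat/qacct invocation.

```python
from typing import Callable, Optional


class StateUnavailable(Exception):
    """Raised by a lookup that cannot see the job (hypothetical helper)."""


def query_job_state(
    job_id: str,
    via_drmaa: Callable[[str], str],
    via_cli: Callable[[str], str],
) -> Optional[str]:
    """Ask the DRMAA session first, then fall back to the command line tools."""
    for query in (via_drmaa, via_cli):
        try:
            return query(job_id)
        except StateUnavailable:
            continue  # this source cannot see the job, try the next one
    return None  # neither source knows the job


def drmaa_lookup(job_id: str) -> str:
    # Stand-in for a DRMAA status call on a job from another session.
    raise StateUnavailable(job_id)


def cli_lookup(job_id: str) -> str:
    # Stand-in for running qstat/qacct and parsing the result.
    return "running"


state = query_job_state("42", drmaa_lookup, cli_lookup)
```

Passing the two lookups in as callables keeps the fallback order in one place and makes it easy to exercise without a scheduler.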

Furthermore (in contrast to the original drmaa runner) the new one tests for run time and memory violations:

- Memory violations are determined by comparing the used and the requested memory.
- Run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time.

The used memory and time are determined with drmaa.wait() or qacct.
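A rough sketch of these two checks is given here under stated assumptions. The function name, the parameter names, the units (bytes and seconds) and the default signal name are all hypothetical, and the description does not say how the signal check and the time comparison are combined; this sketch treats either one as sufficient.

```python
from typing import Iterable, Optional


def classify_violation(
    used_mem: float,
    requested_mem: Optional[float],
    used_time: float,
    requested_time: Optional[float],
    kill_signal: Optional[str] = None,
    # Assumption: which signals indicate an enforced time limit is site specific.
    time_limit_signals: Iterable[str] = ("SIGXCPU",),
) -> Optional[str]:
    """Return "memory", "runtime", or None for a finished job."""
    # Memory check: used versus requested, skipped if nothing was requested.
    if requested_mem is not None and used_mem > requested_mem:
        return "memory"
    # Run time check 1: the job was killed by a signal taken to mean "time limit".
    if kill_signal is not None and kill_signal in time_limit_signals:
        return "runtime"
    # Run time check 2: used versus requested run time.
    if requested_time is not None and used_time > requested_time:
        return "runtime"
    return None
```

The used values would come from drmaa.wait() or qacct, as described above; where the requested values come from is not covered by this sketch.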

TODO:

- There is still a problem if the Galaxy user kills the job before it has entered the scheduler's job database (so it cannot be accessed by qstat or qacct). The bug is that nonexistent members of extdata are accessed, so I need to add some tests for members of extdata or set useful defaults.
- The external_kill script needs a bit of testing. I guess it also does not work if jobs are submitted as the real user (since only jobs that are started in the same session can be accessed).
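For the first TODO item, one way to "set useful defaults" could look like the sketch below. It assumes extdata behaves like a dictionary; the key names and default values are hypothetical and not taken from the runner.

```python
from typing import Any, Dict, Mapping

# Hypothetical field names and defaults, for illustration only.
EXTDATA_DEFAULTS: Dict[str, Any] = {
    "exit_status": None,
    "signal": None,
    "memory": 0.0,
    "runtime": 0.0,
}


def with_defaults(extdata: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of extdata in which every expected key is present."""
    merged = dict(EXTDATA_DEFAULTS)
    merged.update(extdata)  # values that were actually reported win
    return merged


# A job killed before the scheduler recorded anything but the exit status.
partial = with_defaults({"exit_status": 137})
```

With this, later code can read `partial["signal"]` without a KeyError even when the scheduler never reported one.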

Open (or better: perspective):

- Adaptations to other grid engines. The current implementation (the command line calls and result parsing) might be specific to the Univa grid engine. To include other GEs, one could determine the GE (+ version) and make the calls and result parsing depend on it.

Implementation note:

The changes in drmaa.py do not change the functionality at all, but only reorganize the code. In particular, part of the function `check_watched_items` was moved into a new function `check_watched_item` in order to make subclassing more convenient.
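The idea behind this split can be illustrated with a toy example. The classes below are not Galaxy's runners: `BaseRunner` and `CliRunner` are hypothetical, and only the two method names are taken from the note above.

```python
from typing import List, Set


class BaseRunner:
    """Toy stand-in for a runner; not Galaxy's DRMAAJobRunner."""

    def __init__(self, watched: List[str]) -> None:
        self.watched = watched

    def check_watched_items(self) -> None:
        # The loop and bookkeeping stay in the base class.
        still_watched = []
        for job_id in self.watched:
            if self.check_watched_item(job_id):
                still_watched.append(job_id)
        self.watched = still_watched

    def check_watched_item(self, job_id: str) -> bool:
        """Return True to keep watching the job. Subclasses override this."""
        return True


class CliRunner(BaseRunner):
    """Overrides only the per-job check, e.g. to consult command line tools."""

    def __init__(self, watched: List[str], finished: Set[str]) -> None:
        super().__init__(watched)
        self.finished = finished  # stand-in for what the scheduler reports

    def check_watched_item(self, job_id: str) -> bool:
        return job_id not in self.finished


runner = CliRunner(["1", "2", "3"], finished={"2"})
runner.check_watched_items()
```

A subclass then only has to replace the per-job check and inherits the loop unchanged.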

Replaces #4857 (which replaced #4275), since I messed up with git again (there were some duplicated commits).

lachlansimpson · 2018-11-12T21:38:37Z

Because I'm new around here and not 100% sure what this implies, I'm going to ask some questions.

Does this replace or offer an alternative to natefoo's slurm-drmaa?
Alternatively, is this just the interface that natefoo's plugin uses?

bernt-matthias · 2018-11-13T13:08:07Z

Hi @datakid

> Does this replace or offer an alternative to natefoo's slurm-drmaa?

No. slurm-drmaa is for SLURM clusters (which use sacct, ... for querying jobs). univa-drmaa is for clusters running the UNIVA grid engine (which use qacct, ... for querying), but I guess it also works for Sun Grid Engine (though I cannot test this).

Both SlurmJobRunner and UnivaJobRunner derive from DRMAAJobRunner, which cannot be used in a setup that submits jobs as the real user. This is because the (Python) drmaa library can only query jobs that were created in the same DRMAA session, but in the real user setup jobs are started by an external script (drmaa_external_run|kill), which uses its own DRMAA session that cannot be accessed by the Galaxy process. Hence Galaxy cannot query the job state.

The solution of SlurmJobRunner and UnivaJobRunner is to use the corresponding command line tools to query the job state.
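As a rough illustration of the command line route, the sketch below parses a key/value accounting report of the kind `qacct -j <job_id>` prints. The sample text is made up for illustration and is not real scheduler output, and `parse_key_value_report` is a hypothetical helper, not the parser used by either runner.

```python
from typing import Dict

# Illustrative sample only; real reports contain many more fields.
SAMPLE = """\
==============================================================
jobnumber    4711
exit_status  0
ru_wallclock 12.5
maxvmem      1.2G
"""


def parse_key_value_report(text: str) -> Dict[str, str]:
    """Split each 'key value' line; skip separators and lines without a value."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue
        fields[parts[0]] = parts[1].strip()
    return fields


fields = parse_key_value_report(SAMPLE)
```

In a real setup the text would come from something like `subprocess.run(["qacct", "-j", job_id], capture_output=True, text=True).stdout`, with error handling for jobs the scheduler does not know yet.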

> Alternatively, is this just the interface that natefoo's plugin uses?

I do not understand this question.

jmchilton · 2018-11-19T10:40:11Z

This looks like a good, isolated first start so I'm merging. Thanks so much for the work, and sorry for making you jump through hoops about the memory handling.

bernt-matthias mentioned this pull request Nov 12, 2018

A new runner for DRMAA (currently UNIVA) #4857

Closed

galaxybot added the triage label Nov 12, 2018

galaxybot added this to the 19.01 milestone Nov 12, 2018

bernt-matthias force-pushed the topic/univa3 branch from 1b921d6 to 17e2b7a Compare November 12, 2018 15:03

bernt-matthias force-pushed the topic/univa3 branch 2 times, most recently from 763dc12 to 9d71988 Compare November 14, 2018 10:09

bernt-matthias force-pushed the topic/univa3 branch from 9d71988 to 55f5235 Compare November 14, 2018 11:42

jmchilton merged commit 72b2781 into galaxyproject:dev Nov 19, 2018

jmchilton added kind/enhancement area/jobs status/review and removed triage labels Nov 19, 2018

bernt-matthias deleted the topic/univa3 branch January 2, 2019 07:30
