A new runner for DRMAA (currently UNIVA)#7004

Merged
jmchilton merged 1 commit into galaxyproject:dev from bernt-matthias:topic/univa3 on Nov 19, 2018

@bernt-matthias
Contributor

@bernt-matthias bernt-matthias commented Nov 12, 2018

Reimplementation of the DRMAA runner inspired by the SLURM runner. Currently tested only for the UNIVA grid engine (but I'm optimistic that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user, because jobs started in a different DRMAA session cannot be accessed from the session that is open in Galaxy:

  • If the drmaa library cannot be used to check the job status and to get run time information, the runner falls back to the command line tools qstat and qacct.
  • As an additional advantage, the runner can still use the command line tools when the drmaa library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem).
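
The fallback idea can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `query_state`, `cli_status`, and the `drmaa_status` callable are hypothetical names, and the sketch assumes that `qstat -j <id>` and `qacct -j <id>` exit with status 0 only when they know the job.

```python
import subprocess


def cli_status(job_id, run=subprocess.run):
    """Guess a coarse job state from the grid engine command line tools.

    Assumption: `qstat -j <id>` exits with 0 while the scheduler still knows
    the job, and `qacct -j <id>` exits with 0 once accounting data exists.
    """
    if run(["qstat", "-j", str(job_id)], capture_output=True).returncode == 0:
        return "active"
    if run(["qacct", "-j", str(job_id)], capture_output=True).returncode == 0:
        return "finished"
    return "unknown"


def query_state(job_id, drmaa_status, fallback=cli_status):
    """Ask the DRMAA library first; use the command line tools if that fails."""
    try:
        return drmaa_status(job_id)
    except Exception:
        # For example, the job was submitted from another DRMAA session
        # (real user setup) or the library call itself is failing.
        return fallback(job_id)
```

Passing the DRMAA query in as a callable keeps the sketch independent of a particular drmaa binding and makes the fallback path easy to exercise without a cluster.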

Furthermore, in contrast to the original DRMAA runner, the new one tests for run time and memory violations:

  • Memory violations are determined by comparing the used and the requested memory.
  • Run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time.

The used memory and time are determined with drmaa.wait() or qacct.
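
One possible reading of these two checks, as a hedged sketch rather than the runner's actual logic: `classify_violation` and `LIMIT_SIGNALS` are hypothetical names, the signal set is an assumption that depends on the scheduler configuration, and the signal constants are only available on Unix.

```python
import signal

# Assumption: signals a scheduler might send when a limit is exceeded.
# The real set depends on the grid engine configuration.
LIMIT_SIGNALS = frozenset(
    {signal.SIGXCPU, signal.SIGKILL, signal.SIGUSR1, signal.SIGUSR2}
)


def classify_violation(used_mem, requested_mem, used_time, requested_time,
                       term_signal=None):
    """Return "memory", "runtime", or None for a finished job.

    Memory values are in bytes and times in seconds. A requested value of
    None means that no limit was requested.
    """
    if requested_mem is not None and used_mem > requested_mem:
        return "memory"
    killed_by_limit_signal = term_signal in LIMIT_SIGNALS
    if (requested_time is not None and killed_by_limit_signal
            and used_time >= requested_time):
        return "runtime"
    return None
```

Here a run time violation is only reported when both conditions hold (a limit signal and an exhausted time budget), so that a job killed for an unrelated reason is not misclassified.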

TODO:

  • There is still a problem if the Galaxy user kills the job before it has entered the scheduler's job database (so it cannot be accessed by qstat or qacct). The bug is that nonexistent members of extdata are accessed, so I need to add some tests for members of extdata or set useful defaults.
  • The external_kill script needs a bit of testing. I guess it also does not work if jobs are submitted as the real user (since only jobs that are started in the same session can be accessed).
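
For the first TODO item, setting defaults could look roughly like this. The key names are hypothetical placeholders and not the fields the runner actually stores in extdata.

```python
# Hypothetical keys; the real ones depend on what the runner collects.
EXTDATA_DEFAULTS = {
    "state": "unknown",
    "exit_status": None,
    "signal": None,
    "memory": None,
    "runtime": None,
}


def with_defaults(extdata):
    """Return a copy of extdata in which every expected key is present."""
    merged = dict(EXTDATA_DEFAULTS)
    merged.update(extdata or {})
    return merged
```

With this, a job that never reached the scheduler's database yields a dictionary of defaults instead of raising a KeyError on access.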

Open (or better perspective):

  • Adaptations to other grid engines. The current implementation (the command line calls and result parsing) might be specific to the Univa grid engine. To include other GEs, one could determine the GE (+ version) and make the calls and result parsing depend on it.
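
Such engine-dependent parsing could be organized as a small dispatch table. The sketch below is purely illustrative: the marker words, the banner strings, and all function names are assumptions and not verified tool output or code from this PR.

```python
import re

# Assumption: words that might identify an engine in a version banner.
ENGINE_MARKERS = {"univa": {"uge", "univa"}, "sge": {"sge"}}


def detect_engine(banner):
    """Map a version banner string to an engine key, or "unknown"."""
    words = set(re.findall(r"[a-z]+", banner.lower()))
    for engine, markers in ENGINE_MARKERS.items():
        if words & markers:
            return engine
    return "unknown"


def parse_univa(output):
    """Placeholder parser: one whitespace-separated 'key value' pair per line."""
    record = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


PARSERS = {"univa": parse_univa}


def parse_for_engine(banner, output):
    """Pick the parser that matches the detected engine."""
    engine = detect_engine(banner)
    try:
        parser = PARSERS[engine]
    except KeyError:
        raise NotImplementedError(f"no parser for grid engine {engine!r}") from None
    return parser(output)
```

Supporting another engine would then mean adding one entry to `ENGINE_MARKERS` and one parser to `PARSERS`.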

Implementation note:

The changes in drmaa.py do not change the functionality at all, but only reorganize the code. In particular, part of the function check_watched_items was moved into a new function check_watched_item in order to make subclassing more convenient.
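
The benefit of this split can be illustrated with a simplified sketch. The classes, attributes, and method bodies below are made up for illustration and do not mirror Galaxy's actual signatures; only the two method names come from the description above.

```python
class BaseRunner:
    def __init__(self):
        self.watched = []

    def check_watched_items(self):
        """Loop over all watched jobs and keep those that are still active."""
        self.watched = [job for job in self.watched
                        if self.check_watched_item(job)]

    def check_watched_item(self, job):
        """Per-job hook; return True to keep watching the job."""
        return job.get("state") != "finished"


class CliFallbackRunner(BaseRunner):
    def check_watched_item(self, job):
        # Only the per-job logic is overridden; the loop is inherited.
        if job.get("state") is None:
            job["state"] = self.state_from_cli(job["id"])
        return super().check_watched_item(job)

    def state_from_cli(self, job_id):
        # Placeholder for a qstat/qacct lookup.
        return "finished"
```

A subclass only needs to override the per-job hook, so the bookkeeping in the loop does not have to be duplicated.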

Replaces #4857 (which replaced #4275), since I messed up with git again (there were some duplicated commits).

@lachlansimpson

Because I'm new around here and not 100% sure what this implies, I'm going to ask some questions.

  1. Does this replace or offer an alternative to natefoo's slurm-drmaa?
  2. Alternatively, is this just the interface that natefoo's plugin uses?

@bernt-matthias
Contributor Author

Hi @datakid

Does this replace or offer an alternative to natefoo's slurm-drmaa?

No. slurm-drmaa is for SLURM clusters (which use sacct, ... for querying jobs). univa-drmaa is for clusters running the UNIVA grid engine (which use qacct, ... for querying), but I guess it also works for the Sun Grid Engine (I cannot test this).

Both SlurmJobRunner and UnivaJobRunner derive from DRMAAJobRunner, which cannot be used in a setup that submits jobs as the real user. This is because the (Python) drmaa library can only query jobs that were created in the same DRMAA session, but in the real user setup jobs are started by an external script (drmaa_external_run|kill) that uses its own DRMAA session, which cannot be accessed by the Galaxy process. Hence Galaxy cannot query the job state.

The solution of SlurmJobRunner and UnivaJobRunner is to use the corresponding command line tools to query the job state.
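
As a rough sketch of what reading such command line output can involve, the following parses text in the style of `qacct -j <id>` into a dictionary. The function name `parse_qacct` is hypothetical, and the layout it expects (one record, separator lines made of `=` characters, whitespace between key and value) is an assumption that may not hold for every grid engine version.

```python
def parse_qacct(output):
    """Parse 'key value' lines in the style of `qacct -j <id>` into a dict.

    Assumption: a single record, separator lines made of '=' characters,
    and whitespace between each key and its value.
    """
    record = {}
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue
        parts = line.split(None, 1)
        record[parts[0]] = parts[1].strip() if len(parts) == 2 else ""
    return record
```

All values stay strings here; converting fields such as memory or wall clock time to numbers would be a separate, engine-specific step.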

Alternatively, is this just the interface that natefoo's plugin uses?

I do not understand this question.

@bernt-matthias bernt-matthias force-pushed the topic/univa3 branch 2 times, most recently from 763dc12 to 9d71988 on November 14, 2018 10:09
Reimplementation of the DRMAA runner inspired by the SLURM runner.
Currently tested only for the UNIVA grid engine (but I'm optimistic
that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does
not work when jobs are submitted as real user (because jobs that are
started in a different drmaa session can not be accessed from the
session that is open in galaxy):
- this is done by resorting to command line tools qstat and qacct if
  the drmaa library can not be used to check the job status and to get run
  time information.
- this has the additional advantage that if the drmaa library
  functions are not working (DRMAAJobRunner had implemented a repeated
  checking to handle this problem) the runner can still use the command
  line tools.

Furthermore (in contrast to the original drmaa runner) the new one
tests for run time and memory violations:
- memory violations are determined by comparing the used and the
  requested memory
- run time violations are determined by checking the signal that
  killed the job and by comparing the used and the requested run time,
  where the used memory and time are determined with drmaa.wait() or
  qacct

Open (or better perspective):
- adaptations to other grid engines. The current implementation (the
  command line calls and result parsing) might be specific to the
  Univa grid engine. To include other GEs one could determine the
  GE (+ version) and make the calls and result parsing depend
  on this.

Implementation note:

The changes in drmaa.py do not change the functionality at all,
but only reorganize the code. In particular part of
the function `check_watched_items` was put into a new function
`check_watched_item` in order to make subclassing more convenient.

Replaces galaxyproject#6931 (which replaced galaxyproject#4275), since I messed up with git
again (there were some duplicated commits).

@jmchilton
Member

This looks like a good, isolated first start so I'm merging. Thanks so much for the work, and sorry for making you jump through hoops about the memory handling.

@bernt-matthias bernt-matthias deleted the topic/univa3 branch January 2, 2019 07:30