Improved DRMAAJobRunner by bernt-matthias · Pull Request #4275 · galaxyproject/galaxy

bernt-matthias · 2017-07-04T15:04:48Z

A subclass of the DRMAAJobRunner that can detect time and memory violations. I took some inspiration from the SLURM runner. I would appreciate comments, in particular on whether the logic of the main function integrates well with the way jobs are pipelined through the runner.

The main improvements over the current DRMAAJobRunner are:

- Memory violations are determined by checking the stderr output and by comparing the used and the requested memory.
- Run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time.

The used memory and time are determined with drmaa.wait() or qacct.
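To make the two checks concrete, here is a minimal sketch of how such a classification could look. It is not the code from this PR: `JobUsage`, `classify_violation`, `MEMORY_HINTS` and `KILL_SIGNALS` are hypothetical names, the stderr phrases are only examples, and requiring both a kill signal and an exceeded run time is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class JobUsage:
    """Resource figures of a finished job (hypothetical container)."""
    used_mem_bytes: int
    requested_mem_bytes: Optional[int]
    used_seconds: float
    requested_seconds: Optional[float]
    term_signal: Optional[str] = None  # name of the signal that ended the job
    stderr_tail: str = ""              # last part of the job's stderr


# Example phrases only; real messages depend on the tool and the scheduler.
MEMORY_HINTS: Tuple[str, ...] = ("MemoryError", "Cannot allocate memory")
# Assumption: the scheduler enforces limits with one of these signals.
KILL_SIGNALS: Tuple[str, ...] = ("SIGKILL",)


def classify_violation(usage: JobUsage) -> Optional[str]:
    """Return "memory", "runtime", or None if no limit seems violated."""
    over_mem = (
        usage.requested_mem_bytes is not None
        and usage.used_mem_bytes > usage.requested_mem_bytes
    )
    if over_mem or any(hint in usage.stderr_tail for hint in MEMORY_HINTS):
        return "memory"
    over_time = (
        usage.requested_seconds is not None
        and usage.used_seconds >= usage.requested_seconds
    )
    if over_time and usage.term_signal in KILL_SIGNALS:
        return "runtime"
    return None
```

A runner could map the returned string to the corresponding job failure state and use it to decide on resubmission.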

In addition, it solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user (because jobs started in a different DRMAA session cannot be accessed from the session that is open in Galaxy):

- This is done by falling back to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information.
- This has the additional advantage that if the drmaa library functions are not working (DRMAAJobRunner implemented repeated checking to handle this problem), the runner can still use the command line tools.
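The fallback described above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `parse_qacct`, `job_accounting` and `with_fallback` are hypothetical helpers, and the parser assumes that `qacct -j` prints whitespace-separated `key value` lines with `=` separator lines, which may differ between grid engine versions.

```python
import subprocess
from typing import Callable, Dict


def parse_qacct(text: str) -> Dict[str, str]:
    """Turn 'key   value' lines into a dict, skipping separator lines."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue
        parts = line.split(None, 1)
        info[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return info


def job_accounting(job_id: str) -> Dict[str, str]:
    """Ask the command line tool for accounting data of one job."""
    result = subprocess.run(
        ["qacct", "-j", job_id],
        capture_output=True, text=True, check=True,
    )
    return parse_qacct(result.stdout)


def with_fallback(primary: Callable[[str], object],
                  fallback: Callable[[str], object],
                  job_id: str) -> object:
    """Try the library call first; use the command line tool if it fails."""
    try:
        return primary(job_id)
    except Exception:
        return fallback(job_id)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing against captured output from a specific grid engine.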

I tested it successfully on our Univa grid engine. Job resubmission also seems to work nicely.

Open:

- Adaptations to other grid engines: the current implementation is specific to the Univa grid engine. For other grid engines the output of the command line tools might be different.
- Is it a problem that jobs are started with the drmaa library (external runner) while the checks are done on the command line? Probably not.
- There is probably more :)

Commit cca871e additionally changes the external runner so that the job template is loaded before the user is changed. Otherwise the cluster_files_directory needs to be readable by all users.

- The content of the new check_watched_item function is the part of the original check_watched_items function that determined the job state with drmaa.job_state plus exception handling.
- check_watched_items now just calls the check_watched_item function.

The rationale for this change is that the drmaa.job_state function does not work if jobs are submitted as the real user. With the new function, the method used to determine the job state can be changed in subclasses.
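The shape of this refactoring can be sketched as below. The classes are simplified stand-ins, not Galaxy's runner classes: `BaseRunner`, `CommandLineRunner`, the `watched` list and the state strings are all hypothetical.

```python
from typing import Dict, List


class BaseRunner:
    """Simplified stand-in for a runner that watches submitted jobs."""

    def __init__(self) -> None:
        self.watched: List[str] = []

    def check_watched_item(self, job_id: str) -> str:
        """Return the state of a single job; subclasses override this."""
        raise NotImplementedError

    def check_watched_items(self) -> None:
        """Loop over all jobs and delegate the state lookup per job."""
        still_watched = []
        for job_id in self.watched:
            state = self.check_watched_item(job_id)
            if state not in ("done", "failed"):
                still_watched.append(job_id)
        self.watched = still_watched


class CommandLineRunner(BaseRunner):
    """Subclass that gets states from another source (here: a dict)."""

    def __init__(self, states: Dict[str, str]) -> None:
        super().__init__()
        self._states = states

    def check_watched_item(self, job_id: str) -> str:
        # A real subclass might call qstat here instead of the drmaa library.
        return self._states.get(job_id, "done")
```

The loop stays in the base class, so a subclass only replaces the per-job lookup.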

This solves the problem that the wait called from _complete_terminal_job was unsuccessful (which led to the use of qacct): wait can only be called once for a completed job, and this had already been done in the last call from check_watched_item.
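One way to cope with a wait that succeeds only once per job is to remember its first result. The sketch below shows that idea only; whether the commit does exactly this is not stated here, and `WaitOnce` and `wait_fn` are hypothetical names.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class WaitOnce(Generic[T]):
    """Cache the result of a wait call that may only succeed once per job."""

    def __init__(self, wait_fn: Callable[[str], T]) -> None:
        self._wait_fn = wait_fn            # e.g. a wrapper around the drmaa wait
        self._results: Dict[str, T] = {}

    def wait(self, job_id: str) -> T:
        # Only the first request reaches the underlying function.
        if job_id not in self._results:
            self._results[job_id] = self._wait_fn(job_id)
        return self._results[job_id]

    def forget(self, job_id: str) -> None:
        """Drop the cached result once the job is fully processed."""
        self._results.pop(job_id, None)
```

With this, the state check and the final job completion step can both ask for the job information without triggering a second wait.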

bernt-matthias · 2017-07-12T14:19:39Z

This commit solves #4308, but I do not know if this is the best place for the fix.

nsoranzo · 2017-10-25T10:15:31Z

Superseded by #4857.

Reimplementation of the DRMAA runner inspired by the SLURM runner. Currently tested only for the UNIVA grid engine (but I'm optimistic that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as real user (because jobs that are started in a different drmaa session can not be accessed from the session that is open in galaxy):

- this is done by resorting to command line tools qstat and qacct if the drmaa library can not be used to check the job status and to get run time information.
- this has the additional advantage that if the drmaa library functions are not working (DRMAAJobRunner had implemented a repeated checking to handle this problem) the runner can still use the command line tools.

Furthermore (in contrast to the original drmaa runner) the new one tests for run time and memory violations:

- memory violations are determined by comparing the used and the requested memory
- run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time

Where the used memory and time are determined with drmaa.wait() or qacct

Open (or better perspective):

- adaptions to other grid engines. the current implementation (the command line calls and result parsing) might be specific for the Univa grid engine. to include other GEs one could determine the GE (+ version) and make the calls and result parsing depending on this.

Implementation note: The changes in drmaa.py do not change the functionality at all, but only reorganize the code. In particular part of the function `check_watched_items` was put into a new function `check_watched_item` in order to make subclassing more convenient.

Replaces galaxyproject#6931 (which replaced galaxyproject#4275), since I did mess up with git again (there were some duplicated commits).

bernt-matthias added 8 commits June 15, 2017 12:43

UFZ specific message for logging in to the galaxy

1735320

added drmaa.wait

a3e1c2a

added drmaa.wait

b9cf5fd

added a comment on a potential flaw in the slurm runner implementation

c292e60

added initial implementation of a runner for Univa grid engine

e16a447

added a comment to _complete_terminal_job

f71c743

added small comment

b579988

undone changes applied to the wrong branch

21dd8e3

galaxybot added the triage label Jul 4, 2017

galaxybot added this to the 17.09 milestone Jul 4, 2017

Matthias Bernt and others added 11 commits July 4, 2017 23:19

making travis a bit more happy

d689804

more for travis' happiness

86a7158

even more happiness

b7520c9

made state a string (instead of set)

2c78a98

output of subprocess is string not file -> change iteration

ca2fa9f

Merge branch 'drmaa-fix' of https://github.com/bernt-matthias/galaxy into drmaa-fix

56af183

version that survived the first tests

f855892

minor bug fix

00c8c76

bug fixes in time/memory parsing

68a399d

nsoranzo requested a review from natefoo July 11, 2017 08:51

bernt-matthias and others added 4 commits July 12, 2017 10:36

making travis happy again

d4c258f

bugfixes in time parsing

45cbd3a

Merge branch 'drmaa-fix' of https://github.com/bernt-matthias/galaxy into drmaa-fix

acd073f

reclaim ownership also for job restart, solves galaxyproject#4308

f842bc5

This was referenced Jul 20, 2017

DRMAAJobRunner as real user fails when jobs are resubmitted #4308

Closed

DRMAA Job runner as real user fails when importing history #4325

Closed

Merge remote-tracking branch 'origin/release_17.05'

4a7045c

bernt-matthias and others added 18 commits October 23, 2017 14:23

version that survived the first tests

bf762e7

minor bug fix

613a5cb

bug fixes in time/memory parsing

762fefc

making travis happy again

320fd25

bugfixes in time parsing

86846b9

reclaim ownership also for job restart, solves galaxyproject#4308

f72c7b8

improved run time and memory parsing

edd8353

added repeated polling of qacct (since accounting data might not be available immediately)

d1713ee

logging -> log

0bdd6fc

DRMAAUnivaJobRunner -> UnivaJobRunner

try to use memory parsing from util

d55bc1e

use utils memory parsing at all places

c6ba609

created a common function in util that checks the tail of a file

edc98a5

for substrings

util.size_to_bytes: accept also memory strings without multiplier K/M/G/T

75a23d5

missed something while merging

640f20d

removed/commented log output + added comments

b2cb31e

Merge branch 'drmaa-fix' of https://github.com/bernt-matthias/galaxy into drmaa-fix

5855fff
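The commit on repeated polling of qacct in the list above addresses accounting data that is not available right after a job ends. A minimal sketch of such a retry loop, assuming a hypothetical `fetch` callable that raises `LookupError` while the data is missing (the exception type, `poll_until_available`, and the default attempt count and delay are all assumptions):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def poll_until_available(fetch: Callable[[], T],
                         attempts: int = 5,
                         delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch() until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except LookupError:
            # Accounting data not written yet; re-raise on the last attempt.
            if attempt == attempts - 1:
                raise
            sleep(delay)
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.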

bernt-matthias mentioned this pull request Oct 25, 2017

A new runner for DRMAA (currently UNIVA) #4857

Closed

nsoranzo closed this Oct 25, 2017

bernt-matthias mentioned this pull request Nov 12, 2018

A new runner for DRMAA (currently UNIVA) #7004

Merged

2 tasks

bernt-matthias wants to merge 140 commits into galaxyproject:dev from bernt-matthias:drmaa-fix

9 participants
