
A new runner for DRMAA (currently UNIVA) #4857

Closed
bernt-matthias wants to merge 28 commits into galaxyproject:dev from bernt-matthias:topic/univa-drmaa

Conversation

@bernt-matthias
Contributor

Reimplementation of the DRMAA runner inspired by the SLURM runner.
Currently tested only for the UNIVA grid engine.

This solves the problem that the current DRMAAJobRunner does
not work when jobs are submitted as the real user (jobs that are
started in a different DRMAA session cannot be accessed from the
session that is open in Galaxy):

  • this is done by falling back to the command-line tools qstat and qacct
    when the drmaa library cannot be used to check the job status and to get
    run-time information.
  • this has the additional advantage that even if the drmaa library
    functions are not working (DRMAAJobRunner implemented repeated
    checks to handle this problem), the runner can still use the command-line
    tools.

Furthermore (in contrast to the original drmaa runner), the new one
tests for run-time and memory violations:

  • memory violations are determined by checking the stderr output and
    by comparing the used and the requested memory
  • run-time violations are determined by checking the signal that
    killed the job and by comparing the used and the requested run time

The used memory and time are determined with drmaa.wait() or qacct.
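In essence, these checks amount to comparisons like the following (a sketch only; the names `mem_wasted`, `mem_granted`, and `slots` follow the snippet quoted later in this thread, and treating the granted memory as a per-slot limit is an assumption):

```python
def check_violations(mem_wasted, mem_granted, time_wasted, time_granted, slots=1):
    """Return (memory_violation, runtime_violation) flags.

    mem_* in bytes, time_* in seconds, as reported by drmaa.wait()
    resource usage or qacct.  Scaling the granted memory by the slot
    count assumes per-slot memory limits (as in `mem_granted * slots`).
    """
    memory_violation = mem_wasted > mem_granted * slots
    runtime_violation = time_wasted > time_granted
    return memory_violation, runtime_violation
```

In the runner, a positive flag would trigger resubmission to a destination with higher limits rather than a plain failure.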

Open points (or rather, perspective):

  • adaptations to other grid engines. The current implementation (the
    command-line calls and result parsing) might be specific to the
    Univa grid engine. To include other GEs, one could determine the
    GE (+ version) and make the calls and result parsing depend
    on this.

In addition, the job template is loaded in the external runner before
the user is changed; otherwise the cluster_files_directory would need to be
readable by all users.

@bernt-matthias
Contributor Author

@natefoo @bgruening still struggling with git. Since I was unable to rebase my branch onto an updated dev, I created a new PR with my changes. The old one (#4275) can be closed.

@galaxybot added this to the 18.01 milestone Oct 25, 2017
@jmchilton
Member

jmchilton commented Dec 7, 2017

Opened a PR against your branch to touch up the formatting and merge the latest dev, which will hopefully fix some of these tests. Can you merge it, or give me permission to push these sorts of things to your branch directly?

Would you be willing to pull the checks of standard error for memory tool errors out of the runner? Like these lines:

+            # TODO 1 should be moved to runners/util and 2 should be implemented via the stdio tag of the tools
+            memerrors = set(["xrealloc: cannot allocate",
+                             "MemoryError",
+                             "std::bad_alloc",
+                             "java.lang.OutOfMemoryError: Java heap space",
+                             "Out of memory!"])
+            toolmemerrors = set(["Out of memory"  # data_manager_hisat2
+                                 ])
+            memviolation = util.grep_tail(ajs.error_file, memerrors | toolmemerrors, MEMORY_LIMIT_SCAN_SIZE)

I think the TODO there is correct and the tools should be annotating it - e.g. #3107. We could even set up some common set of checks and provide an easy option for tool authors to quickly check for them, but it isn't something I think we should be doing in one particular job runner. I also think that there could be some tools that emit those strings even for perfectly fine runs - e.g. a tool that tries one approach until it runs out of memory and, if it does, switches to another.

@bernt-matthias
Contributor Author

bernt-matthias commented Dec 8, 2017

@jmchilton I invited you as a collaborator on my fork.

We can move the stderr checks implemented in memerrors somewhere else. I still think it is a good idea to run them for every job. This could be done in the runners or at a higher level (which would be nice). The toolmemerrors could be moved to the tools' stdio checks, but currently the stdio checks of the tools do not seem to allow resubmission of jobs (which is a really useful feature that I use to avoid specifying memory and time requirements for every tool).

I haven't thought about jobs that are successful despite such errors. I guess the simplest solution would be that some post-processing of stderr is done within <command>, i.e. tool authors need to take care of it.

@bernt-matthias
Contributor Author

Just had the idea that, instead of the tool definition's <stdio> mechanism, one could use some code in the <command> to set a flag indicating job resubmission (because of out of memory or out of time).

For instance, data_manager_hisat2 could just echo MemoryError to stderr if it finds "Out of memory" (or use sed on stderr).

We could just define a flag that won't appear in any tool's output, e.g. something like GALAXY_TOOL_OUT_OF_MEMORY.
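The runner-side check for such a flag could be as simple as scanning the tail of the stderr file (a sketch; GALAXY_TOOL_OUT_OF_MEMORY is the hypothetical marker proposed above, and the scan size is an arbitrary choice):

```python
OUT_OF_MEMORY_FLAG = "GALAXY_TOOL_OUT_OF_MEMORY"  # hypothetical marker
SCAN_SIZE = 16384  # only inspect the last 16 KiB of stderr

def flagged_out_of_memory(error_file):
    """Return True if the tool wrote the out-of-memory marker near the
    end of its stderr file."""
    with open(error_file, "rb") as fh:
        fh.seek(0, 2)                              # seek to end of file
        fh.seek(max(0, fh.tell() - SCAN_SIZE))     # back up at most SCAN_SIZE bytes
        tail = fh.read().decode("utf-8", errors="replace")
    return OUT_OF_MEMORY_FLAG in tail
```

Because the marker is a fixed agreed-upon string, tools of any language could request resubmission without the runner needing per-language error patterns.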

@jmchilton
Member

@bernt-matthias I've opened a PR that allows tools to describe out of memory detection criteria here #5196 - want to give me your thoughts?

@bernt-matthias
Contributor Author

Just added a memory statement for the UNIVA runner. Does anybody know whether this limit is per thread or overall?

@jmchilton
Member

@bernt-matthias Thanks for sticking with this and sorry for the repeated delays. Any chance you can merge with dev, now that the memory stuff is merged, to fix the conflict? There are also some linting issues with whitespace that you can see if you click on the Travis tests above, and a unit test failure stemming from reclaim_ownership interface changes between jobs and job wrappers - probably sticking a no-op reclaim_ownership method on this stub https://github.com/galaxyproject/galaxy/blob/dev/test/unit/jobs/test_runner_local.py#L121 should fix it.

That test file can be run locally with:

. .venv/bin/activate
nosetests test/unit/jobs/test_runner_local.py

@bernt-matthias
Contributor Author

OK. I did a rebase. Is this what you requested? If there are any linting issues, I will fix them as soon as Travis has finished.

I will see if I understand the failed test.

@bernt-matthias
Contributor Author

The test seems to run now. Maybe that's related to the rebase?

nosetests test/unit/jobs/test_runner_local.py
..........
----------------------------------------------------------------------
Ran 10 tests in 5.372s

OK

@bernt-matthias
Contributor Author

All linting errors should be fixed now. Still unsure about the failing unit test. I guess it works locally because I did a rebase and not a merge. Any comments?

@jmchilton
Member

I'm not sure what is up - but that unit test definitely fails for me the same way locally. The fix I pushed to your branch fixes it for me.

For the rest of the PR, it looks really good to me. I was still hoping that #5196 would mean we could make:

elif memviolation or mem_wasted > mem_granted * slots:

just

elif mem_wasted > mem_granted * slots:

and drop the variables and computation making up memviolation. Are you willing to do that, or do you still think we should keep that logic here for now?

@bernt-matthias
Contributor Author

I'm open to moving or removing this, preferring the former.

Can you imagine a place where a check for such language-specific memory errors could go so that other/all schedulers benefit as well? I still think it would be a lot of redundancy if checks for language-specific errors had to be implemented by every tool.

On our cluster the checks seem to perform quite well, giving me the opportunity to have a quite simple job_conf: we have 6 destinations whose resource requirements are combinations of

  • 1 hour, 1 day, and 1 week
  • normal memory, high memory

By detecting memory and runtime violations, jobs are just escalated accordingly. I find this quite handy because only a few tools need special rules (usually that they do not start at the lowest requirements).

bernt-matthias and others added 13 commits June 5, 2018 13:37
@dannon modified the milestones: 18.09, 19.01 Sep 6, 2018
@bernt-matthias
Contributor Author

Hi, I just thought about the open issue of jobs that are deleted by the user before the scheduler had a chance to process them.

Is there a flag somewhere in the job data structure that marks the job as deleted? Then I could accept the failing qstat/qacct.

@natefoo
Member

natefoo commented Oct 29, 2018

It'll go through the states Job.states.DELETED_NEW and Job.states.DELETED, but there's no guarantee that it will end on Job.states.DELETED. You could look through the job's state history (<job>.state_history), but that might be a little tedious. If you wanted to add a deleted boolean column to Job, that seems reasonable to me.

@bernt-matthias
Contributor Author

Thanks @natefoo: I guess ajs.job_wrapper.get_state() in [Job.states.DELETED, Job.states.DELETED_NEW] might work.
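That guard could look like this (a sketch; DELETED and DELETED_NEW stand in for the Job.states members mentioned above, stubbed here as plain strings rather than imported from Galaxy's model):

```python
# Stand-ins for the relevant Job.states members (strings in Galaxy's model);
# the exact values here are illustrative stubs.
DELETED = "deleted"
DELETED_NEW = "deleted_new"

def accept_missing_accounting(job_state):
    """If the user deleted the job before the scheduler processed it,
    a failing qstat/qacct lookup is expected and can be ignored."""
    return job_state in (DELETED, DELETED_NEW)
```

In the runner, this would wrap the qstat/qacct failure path so that deleted jobs do not get marked as scheduler errors.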

In order to finish this I need to update my Galaxy instances to be able to test it (currently I can't stop jobs: #6912). I hope to find some time next week.

@bernt-matthias
Contributor Author

@jmchilton I finally decided to remove all memviolation tests (in favour of per-tool tests / aggressive mode).

@bernt-matthias
Contributor Author

Replaced by #7004.

