
A new runner for DRMAA (currently UNIVA) #4857

Closed
bernt-matthias wants to merge 28 commits into galaxyproject:dev from bernt-matthias:topic/univa-drmaa

Conversation

@bernt-matthias
Contributor

Reimplementation of the DRMAA runner inspired by the SLURM runner.
Currently tested only for the UNIVA grid engine.

This solves the problem that the current DRMAAJobRunner does
not work when jobs are submitted as the real user (jobs that are
started in a different DRMAA session cannot be accessed from the
session that is open in Galaxy):

  • this is done by falling back to the command-line tools qstat and qacct
    when the drmaa library cannot be used to check the job status and to get
    run-time information.
  • this has the additional advantage that even if the drmaa library
    functions are not working (DRMAAJobRunner implemented repeated
    checks to handle this problem), the runner can still use the command-line
    tools.

Furthermore (in contrast to the original drmaa runner), the new one
tests for run-time and memory violations:

  • memory violations are determined by checking the stderr output and
    by comparing the used and the requested memory
  • run-time violations are determined by checking the signal that
    killed the job and by comparing the used and the requested run time

The used memory and time are determined with drmaa.wait() or qacct.
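In essence, these checks amount to comparisons like the following (a sketch only; the names `mem_wasted`, `mem_granted`, and `slots` follow the snippet quoted later in this thread, and treating the granted memory as a per-slot limit is an assumption):

```python
def check_violations(mem_wasted, mem_granted, time_wasted, time_granted, slots=1):
    """Return (memory_violation, runtime_violation) flags.

    mem_* in bytes, time_* in seconds, as reported by drmaa.wait()
    resource usage or qacct.  Scaling the granted memory by the slot
    count assumes per-slot memory limits (as in `mem_granted * slots`).
    """
    memory_violation = mem_wasted > mem_granted * slots
    runtime_violation = time_wasted > time_granted
    return memory_violation, runtime_violation
```

In the runner, a positive flag would trigger resubmission to a destination with higher limits rather than a plain failure.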

Open points (or rather, perspective):

  • adaptations to other grid engines. The current implementation (the
    command-line calls and result parsing) might be specific to the
    Univa grid engine. To include other GEs, one could determine the
    GE (+ version) and make the calls and result parsing depend
    on this.

In addition, the job template is loaded in the external runner before
the user is changed; otherwise the cluster_files_directory would need to be
readable by all users.

@bernt-matthias
Contributor Author

@natefoo @bgruening still struggling with git. Since I was unable to rebase my branch onto an updated dev, I created a new PR with my changes. The old one (#4275) can be closed.

@galaxybot added this to the 18.01 milestone Oct 25, 2017
@jmchilton
Member

jmchilton commented Dec 7, 2017

Opened a PR against your branch to touch up the formatting and merge the latest dev, which will hopefully fix some of these tests. Can you merge it, or give me permission to push these sorts of things to your branch directly?

Would you be willing to pull the checks of standard error for memory tool errors out of the runner? Like these lines:

+            # TODO 1 should be moved to runners/util and 2 should be implemented via the stdio tag of the tools
+            memerrors = set(["xrealloc: cannot allocate",
+                             "MemoryError",
+                             "std::bad_alloc",
+                             "java.lang.OutOfMemoryError: Java heap space",
+                             "Out of memory!"])
+            toolmemerrors = set(["Out of memory"  # data_manager_hisat2
+                                 ])
+            memviolation = util.grep_tail(ajs.error_file, memerrors | toolmemerrors, MEMORY_LIMIT_SCAN_SIZE)

I think the TODO there is correct and the tools should be annotating it - e.g. #3107. We could even set up some common set of checks and provide an easy option for tool authors to quickly check for them, but it isn't something I think we should be doing in one particular job runner. I also think that there could be some tools that emit those strings even for perfectly fine runs - e.g. a tool that tries one approach until it runs out of memory and, if it does, switches to another.

@bernt-matthias
Contributor Author

bernt-matthias commented Dec 8, 2017

@jmchilton I invited you as a collaborator on my fork.

We can move the stderr checks implemented in memerrors somewhere else. I still think it is a good idea to run them for every job. This could be done in the runners or at a higher level (which would be nice). The toolmemerrors could be moved to the tools' stdio checks, but currently the stdio checks of the tools do not seem to allow resubmission of jobs (which is a really useful feature that I use to avoid specifying memory and time requirements for every tool).

I haven't thought about jobs that are successful despite such errors. I guess the simplest solution would be that some post-processing of stderr is done within <command>, i.e. tool authors need to take care of it.

@bernt-matthias
Contributor Author

Just had the idea that, instead of the tool definition's <stdio> mechanism, one could use some code in the <command> to set a flag indicating job resubmission (because of out of memory or out of time).

For instance, data_manager_hisat2 could just echo MemoryError to stderr if it finds "Out of memory" (or use sed on stderr).

We could just define a flag that won't appear in any tool's output, e.g. something like GALAXY_TOOL_OUT_OF_MEMORY.
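The runner-side check for such a flag could be as simple as scanning the tail of the stderr file (a sketch; GALAXY_TOOL_OUT_OF_MEMORY is the hypothetical marker proposed above, and the scan size is an arbitrary choice):

```python
OUT_OF_MEMORY_FLAG = "GALAXY_TOOL_OUT_OF_MEMORY"  # hypothetical marker
SCAN_SIZE = 16384  # only inspect the last 16 KiB of stderr

def flagged_out_of_memory(error_file):
    """Return True if the tool wrote the out-of-memory marker near the
    end of its stderr file."""
    with open(error_file, "rb") as fh:
        fh.seek(0, 2)                              # seek to end of file
        fh.seek(max(0, fh.tell() - SCAN_SIZE))     # back up at most SCAN_SIZE bytes
        tail = fh.read().decode("utf-8", errors="replace")
    return OUT_OF_MEMORY_FLAG in tail
```

Because the marker is a fixed agreed-upon string, tools of any language could request resubmission without the runner needing per-language error patterns.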

@jmchilton
Member

@bernt-matthias I've opened a PR that allows tools to describe out of memory detection criteria here #5196 - want to give me your thoughts?

@bernt-matthias
Contributor Author

Just added a memory statement for the UNIVA runner. Does anybody know whether this limit is per thread or overall?

@jmchilton
Member

@bernt-matthias Thanks for sticking with this and sorry for the repeated delays. Any chance you can merge with dev, now that the memory stuff is merged, to fix the conflict? There are also some linting issues with whitespace that you can see if you click on the Travis tests above, and a unit test failure stemming from reclaim_ownership interface changes between jobs and job wrappers - probably sticking a no-op reclaim_ownership method on this stub https://github.com/galaxyproject/galaxy/blob/dev/test/unit/jobs/test_runner_local.py#L121 should fix it.

That test file can be run locally with:

. .venv/bin/activate
nosetests test/unit/jobs/test_runner_local.py

@bernt-matthias
Contributor Author

OK. I did a rebase. Is this what you requested? If there are any linting issues, I will fix them as soon as Travis has finished.

I will see if I understand the failed test.

@bernt-matthias
Contributor Author

The test seems to run now. Maybe that's related to the rebase?

nosetests test/unit/jobs/test_runner_local.py
..........
----------------------------------------------------------------------
Ran 10 tests in 5.372s

OK

@bernt-matthias
Contributor Author

All linting errors should be fixed now. Still unsure about the failing unit test. I guess it works locally because I did a rebase and not a merge. Any comments?

@jmchilton
Member

I'm not sure what is up - but that unit test definitely fails for me the same way locally. The fix I pushed to your branch fixes it for me.

For the rest of the PR, it looks really good to me. I was still hoping that #5196 would mean we could make:

elif memviolation or mem_wasted > mem_granted * slots:

just

elif mem_wasted > mem_granted * slots:

and drop the variables and computation making up memviolation. Are you willing to do that, or do you still think we should keep that logic here for now?

@bernt-matthias
Contributor Author

I'm open to moving or removing this, preferring the former.

Can you imagine a place where a check for such language-specific memory errors could go so that other/all schedulers benefit as well? I still think it would be a lot of redundancy if checks for language-specific errors had to be implemented by every tool.

On our cluster the checks seem to perform quite well, giving me the opportunity to have a quite simple job_conf: we have 6 destinations whose resource requirements are combinations of

  • 1 hour, 1 day, and 1 week
  • normal memory, high memory

By detecting memory and runtime violations, jobs are just escalated accordingly. I find this quite handy because only a few tools need special rules (usually that they do not start at the lowest requirements).

bernt-matthias and others added 13 commits June 5, 2018 13:37
@dannon modified the milestones: 18.09, 19.01 Sep 6, 2018
@bernt-matthias
Contributor Author

Hi, I just thought about the open issue of jobs that are deleted by the user before the scheduler had a chance to process them.

Is there a flag somewhere in the job data structure that marks the job as deleted? Then I could accept the failing qstat/qacct.

@natefoo
Member

natefoo commented Oct 29, 2018

It'll go through the states Job.states.DELETED_NEW and Job.states.DELETED, but there's no guarantee that it will end on Job.states.DELETED. You could look through the job's state history (<job>.state_history), but that might be a little tedious. If you wanted to add a deleted boolean column to Job, that seems reasonable to me.

@bernt-matthias
Contributor Author

Thanks @natefoo: I guess ajs.job_wrapper.get_state() in [Job.states.DELETED, Job.states.DELETED_NEW] might work.
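That guard could look like this (a sketch; DELETED and DELETED_NEW stand in for the Job.states members mentioned above, stubbed here as plain strings rather than imported from Galaxy's model):

```python
# Stand-ins for the relevant Job.states members (strings in Galaxy's model);
# the exact values here are illustrative stubs.
DELETED = "deleted"
DELETED_NEW = "deleted_new"

def accept_missing_accounting(job_state):
    """If the user deleted the job before the scheduler processed it,
    a failing qstat/qacct lookup is expected and can be ignored."""
    return job_state in (DELETED, DELETED_NEW)
```

In the runner, this would wrap the qstat/qacct failure path so that deleted jobs do not get marked as scheduler errors.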

In order to finish this I need to update my Galaxy instances to be able to test it (currently I can't stop jobs: #6912). I hope to find some time next week.

@bernt-matthias
Contributor Author

@jmchilton I finally decided to remove all memviolation tests (in favour of per-tool tests / aggressive mode).

@bernt-matthias
Contributor Author

Replaced by #7004.

