Improved DRMAAJobRunner #4275

Closed
bernt-matthias wants to merge 140 commits into galaxyproject:dev from bernt-matthias:drmaa-fix


@bernt-matthias (Contributor) commented Jul 4, 2017

A subclass of the DRMAAJobRunner that detects run time and memory violations. I took some inspiration from the Slurm runner. I would appreciate comments, in particular on whether the logic of the main function integrates well with the way jobs are pipelined through the runner.

The main improvements over the current DRMAAJobRunner are:

  • memory violations are determined by checking the stderr output and by comparing the used and the requested memory
  • run time violations are determined by checking the signal that killed the job and by comparing the used and the requested run time
    The used memory and run time are obtained from drmaa.wait() or qacct.
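
The two checks above could be sketched roughly as follows. This is only an illustration, not the code from this PR: the state labels, the stderr patterns, and the set of signals are assumptions, and the sketch treats either check (signal or usage comparison) as sufficient for a run time violation.

```python
from typing import Optional

# Hypothetical state labels for this sketch; not taken from the PR.
OK = "ok"
MEMORY_LIMIT = "memory_limit_reached"
TIME_LIMIT = "walltime_reached"

# Assumed placeholders; adjust to what the tools and the grid engine at a site emit.
ASSUMED_MEMORY_STDERR_PATTERNS = ("MemoryError", "bad_alloc")
ASSUMED_TIME_SIGNALS = {"SIGXCPU", "SIGUSR1"}


def classify_violation(
    used_mem: float,
    requested_mem: Optional[float],
    used_time: float,
    requested_time: Optional[float],
    term_signal: Optional[str] = None,
    stderr_text: str = "",
) -> str:
    """Classify a finished job from its usage (bytes and seconds)."""
    # Memory: an allocation failure reported on stderr ...
    if any(p in stderr_text for p in ASSUMED_MEMORY_STDERR_PATTERNS):
        return MEMORY_LIMIT
    # ... or more memory used than was requested.
    if requested_mem is not None and used_mem > requested_mem:
        return MEMORY_LIMIT
    # Run time: the job was killed by a limit-related signal ...
    if term_signal in ASSUMED_TIME_SIGNALS:
        return TIME_LIMIT
    # ... or it ran longer than was requested.
    if requested_time is not None and used_time > requested_time:
        return TIME_LIMIT
    return OK
```

Limits that were not requested are passed as `None`, so the corresponding comparison is skipped.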

In addition, it solves the problem that the current DRMAAJobRunner does not work when jobs are submitted as the real user (jobs that are started in a different drmaa session cannot be accessed from the session that is open in Galaxy):

  • the runner falls back to the command line tools qstat and qacct if the drmaa library cannot be used to check the job status and to get run time information.
  • this has the additional advantage that the runner can still use the command line tools if the drmaa library functions are not working (DRMAAJobRunner had implemented repeated checking to handle this problem).
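
A minimal sketch of this fallback, assuming that `qacct -j <job_id>` prints one `key value` pair per line. The function names and the dictionary-based return value are hypothetical and chosen for illustration; the PR's actual parsing may differ.

```python
import subprocess
from typing import Callable, Dict, Optional


def parse_accounting(text: str) -> Dict[str, str]:
    """Parse 'key   value' lines into a dict, skipping separator lines."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("="):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            info[parts[0]] = parts[1].strip()
    return info


def run_qacct(job_id: str) -> str:
    """Call `qacct -j <job_id>` and return its stdout (raises on failure)."""
    result = subprocess.run(
        ["qacct", "-j", job_id], capture_output=True, text=True, check=True
    )
    return result.stdout


def job_usage(
    job_id: str,
    drmaa_wait: Optional[Callable[[str], Dict[str, str]]] = None,
    qacct: Callable[[str], str] = run_qacct,
) -> Dict[str, str]:
    """Try the DRMAA-based lookup first, then fall back to the command line."""
    if drmaa_wait is not None:
        try:
            return drmaa_wait(job_id)
        except Exception:
            # For example: the job belongs to another DRMAA session.
            pass
    return parse_accounting(qacct(job_id))
```

Passing the lookup functions as arguments keeps the sketch testable without a cluster: a stub can stand in for either the DRMAA call or the `qacct` call.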

I tested it successfully on our Univa Grid Engine. Job resubmission also seems to work nicely.

Open:

  • adaptations to other grid engines: the current implementation is specific to the Univa Grid Engine, and for other grid engines the output of the command line tools might differ
  • is it a problem that jobs are started with the drmaa library (external runner) while the checks are done with the command line? Probably not.
  • there is probably more :)

Commit cca871e additionally changes the external runner so that the job template is loaded before the user is changed. Otherwise the cluster_files_directory needs to be readable by all users.

@galaxybot galaxybot added this to the 17.09 milestone Jul 4, 2017
Matthias Bernt and others added 11 commits July 4, 2017 23:19
- the content of the function is the part of the original check_watched_items function
  that determined the job state with drmaa.job_state + exception handling
- the check_watched_items now just calls the check_watched_item function
The rationale for this change is that the drmaa.job_state function does not
work if jobs are submitted as the real user. With the new function, the method
used to determine the job state can be changed in subclasses.
This solves the problem that the wait called from _complete_terminal_job was unsuccessful (which led to the use of qacct): wait can only be called once for a completed job, and this had already happened in the last call from check_watched_item.
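
The refactoring described in the first commit message above (moving the per-job state lookup out of `check_watched_items` into an overridable `check_watched_item`) might look roughly like this. The classes are simplified stand-ins, not Galaxy's actual runner classes, and the state strings are assumptions.

```python
class BaseRunner:
    """Simplified stand-in for the generic runner (not Galaxy's class)."""

    def __init__(self):
        self.watched = []

    def _drmaa_state(self, job_id):
        # Placeholder for the drmaa.job_state call plus exception handling.
        raise NotImplementedError

    def check_watched_item(self, job_id):
        """Return the state of one job; subclasses may override this."""
        return self._drmaa_state(job_id)

    def check_watched_items(self):
        """Loop over watched jobs and keep those that are still active."""
        still_watched = []
        for job_id in self.watched:
            state = self.check_watched_item(job_id)
            if state not in ("done", "failed"):
                still_watched.append(job_id)
        self.watched = still_watched


class CommandLineRunner(BaseRunner):
    """Subclass that replaces only the per-job state lookup."""

    def __init__(self, states):
        super().__init__()
        # Hypothetical lookup table standing in for qstat/qacct calls.
        self.states = states

    def check_watched_item(self, job_id):
        # Jobs unknown to the lookup are treated as finished in this sketch.
        return self.states.get(job_id, "done")
```

The loop in `check_watched_items` stays untouched in the subclass, which is the point of the split: only the way a single job's state is determined changes.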
@nsoranzo nsoranzo requested a review from natefoo July 11, 2017 08:51
@bernt-matthias (Contributor, Author)

The commit solves #4308, but I do not know if this is the best place for the fix.

bernt-matthias and others added 18 commits October 23, 2017 14:23
DRMAAUnivaJobRunner -> UnivaJobRunner
@nsoranzo (Member)

Superseded by #4857.

@nsoranzo nsoranzo closed this Oct 25, 2017
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 12, 2018
Reimplementation of the DRMAA runner inspired by the SLURM runner.
Currently tested only for the UNIVA grid engine (but I'm optimistic
that it should work as well for other drmaa systems).

This solves the problem that the current DRMAAJobRunner does
not work when jobs are submitted as real user (because jobs that are
started in a different drmaa session can not be accessed from the
session that is open in galaxy):
- this is done by resorting to command line tools qstat and qacct if
  the drmaa library can not be used to check the job status and to get run
  time information.
- this has the additional advantage that if the drmaa library
  functions are not working (DRMAAJobRunner had implemented a repeated
  checking to handle this problem) the runner can still use the command
  line tools.

Furthermore (in contrast to the original drmaa runner) the new one
tests for run time and memory violations:
- memory violations are determined by comparing the used and the
  requested memory
- run time violations are determined by checking the signal that
  killed the job and by comparing the used and the requested run time
  Where the used memory and time are determined with drmaa.wait() or
  qacct

Open (or rather, outlook):
- adaptations to other grid engines. The current implementation (the
  command line calls and result parsing) might be specific to the
  Univa grid engine. To include other GEs, one could determine the
  GE (+ version) and make the calls and result parsing depend
  on this.

Implementation note:

The changes in drmaa.py do not change the functionality at all,
but only reorganize the code. In particular part of
the function `check_watched_items` was put into a new function
`check_watched_item` in order to make subclassing more convenient.

Replaces galaxyproject#6931 (which replaced galaxyproject#4275), since I messed up with git
again (there were some duplicated commits).
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 13, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Nov 14, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Dec 4, 2018
bernt-matthias added a commit to bernt-matthias/galaxy that referenced this pull request Jan 29, 2019
9 participants