Skip to content

samtools view update to v1.9#1963

Merged
mvdbeek merged 16 commits intogalaxyproject:masterfrom
bernt-matthias:topic/samtools-view
Sep 19, 2018
Merged

samtools view update to v1.9#1963
mvdbeek merged 16 commits intogalaxyproject:masterfrom
bernt-matthias:topic/samtools-view

Conversation

@bernt-matthias
Copy link
Contributor

@bernt-matthias bernt-matthias commented Jul 4, 2018

This updates samtools view to v1.9

  • upgrade to 1.9

Discussion:

  • Do we need samtools view based tools: slice_bam, bam_to_cram, bam_to_sam, cram_to_bam, and sam_to_bam. The logic with reference data and indices could potentially integrated into samtools_view.
    • keep converters for compatibily, remove/deprecate slice
      • mark slice deprecated / remove it? How? In any case move the tests here.
    • Note that I would have also the updated tools which I could push here, but lets decide first
      • No need to push. I checked: no changes are necessary.
  • tool version is now 2.xx can we change it to 1.7+galaxy2.xx?
    • if no problem for galaxy. is it?
      • not for this tool, because new
  • macro PREPIDX
    • for cram there was a test for the presence of the index, do we need this also for bam
      • implement this
    • do we want indexing for SAM files? then tabix is needed which seems out-of-date in bioconda (0.2.5 available, latest 1.9 in htslib
      • no indexing for sam files
  • do we need to allow explicitely for the different BAM data types, I guess we need them all.
    • yep we need them: Q: do we need to allow them explicitely?
    • in PREPIDX: how can we check for the other types? is there a simple way.
  • check logic of setting outtype. also if statements for other types necessary? Can we simplify this?
  • readtags as repeat or text field
    • repeat is fine
  • input of reference fasta: allow fai? planemo complains about fasta.gz
    • fix this!
  • output sam: dbkey action needed? change_format needs to be replaced or improved
    change_format deprecated?

Copy link
Contributor

@Slugger70 Slugger70 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff!

#end if

## if data is converted from an unsorted file (SAM, CRAM, or unsorted BAM) to BAM
## then sort the output by coordinate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not so sure about the auto sorting... maybe make it more explicit by asking the user to choose from SAM, CRAM (maybe sort versions of cram - I don't use them?), unsorted.bam, qname_sorted.bam, indexed bam (coord sorted). With the default being indexed bam?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you mean that we need to remove: or $input.metadata.sort_order == 'queryname'

Otherwise I hope that its correct. The main point is that we need to coordinate sort if the output is bam (by assumptions of Galaxys bam data type). I just check if the input is sorted .. because then we can skip this. Just not sure if the metadata of the bam data type is set correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think user should specify what they want for an output format (or choose the default -which i think shoudl be bam). Since the output format implies the sort order, think that alone shoud determine whether or not to sort (e.g. sort for at least bam, maybe cram too? )

@bwlang
Copy link
Contributor

bwlang commented Jul 9, 2018

Commendable work on these important tools!

I'll respond to your discussion points in order:

  1. I think we should keep the format conversion tools (sam <-> bam <-> cram) for usability/dicoverability reasons. Slice could be eliminated.

  2. New version scheme seems good to me, assuming that we detect 1.7+g is > 2.x

  3. I don't know about cram indexes, but I think we can assume in galaxy that all files of type "bam" are indexed. I say no indexing for sam files

  4. Would be nice to add the other galaxy bam input types(that are not indexed) for tools like view that don't need an index:

    • qname_input_sorted.bam
    • qname_sorted.bam
    • unsorted.bam
  5. I think the tool should ask about output type (.e.g. bam, sam, cram, indexed bam(implies coord sorting), indexed cram) and remove the "same" type - which, IMHO, adds complexity without much benefit. It would be tough to render a workflow from a tool using a same type. (left an inline comment to this effect)

  6. Read tags support looks fine to me as a repeat

  7. I think it's reasonable to require an fa even if the tool only really needs the index. fasta.gz should be supported

  8. Sorry - I don't know about the change_format bits - maybe someone else will

@bernt-matthias
Copy link
Contributor Author

bernt-matthias commented Jul 9, 2018

I think we should keep the format conversion tools (sam <-> bam <-> cram) for usability/dicoverability reasons. Slice could be eliminated.

Agreed. What would be the IUC way to remove slice? Delete it? Or mark it deprecated..? How?

New version scheme seems good to me, assuming that we detect 1.7+g is > 2.x

Jep. Can someone answer if this works? Otherwise I would suggest to increase the version slowly, such that samtools version will catch up eventually.

I don't know about cram indexes, but I think we can assume in galaxy that all files of type "bam" are indexed. I say no indexing for sam files

OK.

Would be nice to add the other galaxy bam input types(that are not indexed) for tools like view that don't need an index:
    qname_input_sorted.bam
    qname_sorted.bam
    unsorted.bam

OK. Do we need to add them explicitely? Or are they implicitely allowed anyway, because of the data types class structure?

I think the tool should ask about output type (.e.g. bam, sam, cram, indexed bam(implies coord sorting), indexed cram) and remove the "same" type - which, IMHO, adds complexity without much benefit. It would be tough to render a workflow from a tool using a same type. (left an inline comment to this effect)

OK

Read tags support looks fine to me as a repeat

OK

I think it's reasonable to require an fa even if the tool only really needs the index. fasta.gz should be supported

I will check

Sorry - I don't know about the change_format bits - maybe someone else will

@@ -0,0 +1,438 @@
<tool id="samtools_view" name="Samtools view" version="2.0.1">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version="@TOOL_VERSION@"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@nsoranzo
Copy link
Member

New version scheme seems good to me, assuming that we detect 1.7+g is > 2.x

Jep. Can someone answer if this works? Otherwise I would suggest to increase the version slowly, such that samtools version will catch up eventually.

This doesn't apply here, since this is a new tool. Any way 2.0 > 1.7+g, which can be verified with:

>>> from packaging.version import parse
>>> parse('2.0') > parse('1.7+g')
True

@bernt-matthias
Copy link
Contributor Author

Somehow I have a problem with planemo on my local install (only this branch):

All tests fail immediately, I just get:

Traceback (most recent call last):
  File "/tmp/tmpH13feS/galaxy-dev/test/functional/test_toolbox.py", line 88, in test_tool
    self.do_it(tool_version=tool_version, test_index=test_index)
  File "/tmp/tmpH13feS/galaxy-dev/test/functional/test_toolbox.py", line 41, in do_it
    verify_tool(tool_id, self.galaxy_interactor, resource_parameters=resource_parameters, test_index=test_index, tool_version=tool_version, register_job_data=register_job_data)
  File "/tmp/tmpH13feS/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 653, in verify_tool
    stage_data_in_history(galaxy_interactor, tool_id, testdef.test_data(), test_history)
  File "/tmp/tmpH13feS/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 59, in stage_data_in_history
    upload_wait = galaxy_interactor.stage_data_async(test_data, history, tool_id)
  File "/tmp/tmpH13feS/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 267, in stage_data_async
    file_name = self.test_data_path(tool_id, fname)
  File "/tmp/tmpH13feS/galaxy-dev/lib/galaxy/tools/verify/interactor.py", line 232, in test_data_path
    assert response.status_code == 200
AssertionError

I also tried to reset my venv and planemo configuration directories. I'm completely stuck on this one. Any ideas?

- v1.8
- `-h` per default
- macros use CDATA
- checking of CRAM index
@bernt-matthias bernt-matthias changed the title samtools view update to v1.7 samtools view update to v1.8 Jul 13, 2018
@bernt-matthias
Copy link
Contributor Author

got planemo running again (problem was that I linked to the test data of the converters). my todos are done. the remaining points are open questions.

@bgruening
Copy link
Member

There is a merge conflict here @bernt-matthias.

- added PREPIDX_MULTIPLE (needed by depth)
- removed no-chrom-options (only needed by slice which is to be
  deprecated)
- removed same option for outtype
- tests w bam output now work wo sim_size
@bgruening
Copy link
Member

I think I would not include the different BAM datatypes here. My intuition is we should keep the tools simple in this regard and offer one sort tool that can create the other datatypes if needed.

Also please note, that Galaxy ships by default with a few converters, so BAM to CRAM and so on should be already in Galaxy.

@bwlang
Copy link
Contributor

bwlang commented Jul 16, 2018

@bgruening : I value simplicity too, but in this case I wonder whether the cost of duplicating the potentially very large data files for a format conversion is worth a bit more code complexity.

I don't think having a list of 4 output types (sam, bam, cram, count) is much more complex for users than "bam (coordinate sorted), cram (coordinate sorted), sam, unsorted.bam, qname_input_sorted.bam, qname_sorted.bam, unsorted.bam,count"

Or maybe you were referring to the input formats? I think there is even less user complexity if that's your concern.

@bernt-matthias
Copy link
Contributor Author

@bwlang has a point with the data duplication, but users can decide to delete intermediate data (in workflows you can do this automatically).

- only bam output triggers sort (if not sorted already)
@bwlang
Copy link
Contributor

bwlang commented Jul 17, 2018

agreed - also, these other bam formats are proably not typically used much...

@mvdbeek
Copy link
Member

mvdbeek commented Sep 6, 2018

@bernt-matthias do you mind if I rebase this against the new macros.xml file and address my last comment ?

@bernt-matthias
Copy link
Contributor Author

I merged the macros.xml file using the github frontend and commited an update using nicer names for the new macros.

@mvdbeek mvdbeek changed the title samtools view update to v1.8 samtools view update to v1.9 Sep 6, 2018
@mvdbeek
Copy link
Member

mvdbeek commented Sep 6, 2018

thanks, looks good. what do you think about #1963 (comment) ? I'm a little worried about users generating CRAM files and then being disappointed that they don't work when downloading them from Galaxy.

@bernt-matthias
Copy link
Contributor Author

I'm a little worried about users generating CRAM files and then being disappointed that they don't work when downloading them from Galaxy.

You have a point with this. But this might be treated by $reffa!=None which adds --output-fmt-option no_ref (note this is the default). One could state in the help (which needs some love anyway) that in case the user wants to use CRAM outside of Galaxy without the reference available then no refernce must be specified in the tool. But I think that programs using CRAM as input should allow to specify the path to the reference (and the md5 check verifies that its the correct reference).

@bernt-matthias
Copy link
Contributor Author

It seems that only the md5 is stored in CRAM.

https://www.htslib.org/workflow/#mapping_to_cram :

In CRAM format the reference sequence is linked to by the md5sum (M5 auxiliary tag) in the CRAM header (@SQ tags). This is mandatory and part of the CRAM specification.

@bernt-matthias
Copy link
Contributor Author

@mvdbeek took some time to provide a somewhat useful help.

@mvdbeek
Copy link
Member

mvdbeek commented Sep 7, 2018

We still need a .shed.yml for samtools view.

@mvdbeek
Copy link
Member

mvdbeek commented Sep 11, 2018

@bernt-matthias could you add the .shed.yml file ?

@bernt-matthias
Copy link
Contributor Author

Done

Copy link
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, great work!

@mvdbeek mvdbeek merged commit aa97817 into galaxyproject:master Sep 19, 2018
@bernt-matthias bernt-matthias deleted the topic/samtools-view branch March 8, 2020 10:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants