Commit a811e65

Revise planemo tools docs to be more explicit about collection identifiers.
1 parent d0d21e5 commit a811e65

3 files changed: +51 -19 lines changed

docs/_writing_collections.rst

Lines changed: 43 additions & 19 deletions
@@ -30,18 +30,26 @@ Consuming Collections
3030
-------------------------------
3131

3232
Many Galaxy tools can be used without modification in conjunction with collections.
33-
Galaxy users can take a collection and ``map over`` any tool that
33+
Galaxy users can take a collection and `map over` any tool that
3434
consumes individual datasets. For instance, early in typical bioinformatics
3535
workflows you may have steps that filter raw data, convert to standard
3636
formats, perform QC on individual files - users can take lists, pairs, or
3737
lists of paired datasets and map over such tools that consume individual
38-
files. Galaxy will then run the tool once for each dataset in the collection
39-
and for each output of that tool Galaxy will rebuild a new collection with the
40-
same ``identifier`` structure (so sample name or forward/reverse structure is
41-
perserved).
38+
datasets (files). Galaxy will then run the tool once for each dataset in the
39+
collection and for each output of that tool Galaxy will rebuild a new collection.
4240

43-
Tools can also consume collections if they must or should process multiple
44-
files at once. We will discuss three cases:
41+
Collection elements have the concept of an `identifier` and an `index` when
42+
the collection is created. Both of these are preserved during these mapping
43+
steps. As Galaxy builds output collections from these mapping steps, the
44+
identifier and index for the output entries match those of the supplied input.
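
The preservation of identifiers and indexes during mapping can be sketched in plain Python. This is only an illustration under stated assumptions: ``Element`` and ``map_over`` are hypothetical names invented here, not part of Galaxy's API, and the "tool" is a simple function standing in for a real job.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Element:
    # Hypothetical stand-in for a collection element; not Galaxy's model class.
    identifier: str
    index: int
    dataset: str


def map_over(collection: List[Element], tool: Callable[[str], str]) -> List[Element]:
    """Run `tool` once per element, copying each identifier and index to the output."""
    return [Element(e.identifier, e.index, tool(e.dataset)) for e in collection]


samples = [
    Element("sample_A", 0, "a.fastq"),
    Element("sample_B", 1, "b.fastq"),
]
# The lambda plays the role of a tool that consumes a single dataset.
filtered = map_over(samples, lambda path: path.replace(".fastq", ".filtered.fastq"))
```

In this sketch the output list has the same length and order as the input, and each output element carries the identifier and index of the input element it was derived from.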
45+
46+
.. image:: images/identifiers.svg
47+
48+
If a tool's functionality can be applied to individual files in isolation, the
49+
implicit mapping described above should be sufficient and no knowledge of collections
50+
by tools should be needed. However, tools may need to process multiple
51+
files at once - in this case explicit collection consumption is required. This
52+
document outlines three cases:
4553

4654
* consuming pairs of datasets
4755
* consuming lists
@@ -94,6 +102,18 @@ In Galaxy's ``command`` block, the individual datasets can be accessed using
94102
arbitrary collection types an array syntax can also be used (e.g.
95103
``$fastq_input['forward']``).
96104

105+
.. note::
106+
107+
Mirroring the ability of Galaxy users to map tools that consume individual
108+
datasets over lists (and other collection types), users may also map lists
109+
of pairs over tools which explicitly consume dataset pairs.
110+
111+
If the output of the tool is datasets, the output of this mapping operation
112+
(sometimes referred to as subcollection mapping) will be lists. The element
113+
identifier and index of the top level of the list will be preserved.
114+
115+
.. image:: images/subcollection_mapping_identifiers.svg
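
As a rough illustration of subcollection mapping, the following Python sketch maps a pair-consuming function over a list of pairs. The tuple and dict layout, ``map_pairs``, and the file names are assumptions made for this example only and do not reflect how Galaxy stores collections internally.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: each pair is a dict with 'forward' and 'reverse' keys.
Pair = Dict[str, str]


def map_pairs(
    list_of_pairs: List[Tuple[str, Pair]],
    tool: Callable[[Pair], str],
) -> List[Tuple[str, int, str]]:
    """Run a pair-consuming `tool` once per pair and return a flat list that
    keeps the top-level identifier and index of each pair."""
    return [
        (identifier, index, tool(pair))
        for index, (identifier, pair) in enumerate(list_of_pairs)
    ]


reads = [
    ("sample_A", {"forward": "a_1.fastq", "reverse": "a_2.fastq"}),
    ("sample_B", {"forward": "b_1.fastq", "reverse": "b_2.fastq"}),
]
# The lambda stands in for a tool that takes one pair and emits one dataset.
merged = map_pairs(reads, lambda pair: pair["forward"] + "+" + pair["reverse"])
```

Each pair yields one output dataset, so the result is a flat list whose entries keep the outer identifiers (``sample_A``, ``sample_B``) and their positions.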
116+
97117
Some example tools which consume paired datasets include:
98118

99119
- `collection_paired_test <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/collection_paired_test.xml>`__ (minimal test tool in Galaxy test suite)
@@ -154,8 +174,13 @@ Also see the tools-devteam repository `Pull Request #20 <https://github.com/gala
154174
Processing Identifiers
155175
-------------------------------
156176

157-
As mentioned previously, sample identifiers are preserved through mapping
158-
steps, during reduction steps one may likely want to use these - for
177+
Collection elements have identifiers that can be used for various kinds of sample
178+
tracking. These identifiers are set when the collection is first created - either
179+
explicitly in the UI (or API), through mapping over collections that preserves input
180+
identifiers, or as the ``identifier`` when dynamically discovering collection outputs
181+
described below.
182+
183+
During reduction steps one may likely want to use these - for
159184
reporting, comparisons, etc. When using these multiple ``data`` parameters
160185
the dataset objects expose a field called ``element_identifier``. When these
161186
parameters are used with individual datasets - this will just default to being
@@ -173,18 +198,17 @@ derived from using a little fictitious program called ``merge_rows``.
173198
merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}" --file "$input" --to $output;
174199
#end for
175200

201+
.. note:: Here we are rewriting the element identifiers to ensure everything is safe to
202+
put on the command-line. In the future, collections will not be able to contain
203+
keys that are potentially harmful and this won't be necessary.
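
The effect of the ``re.sub`` call in the ``merge_rows`` snippet can be checked in plain Python, since Cheetah evaluates it as a Python expression. The helper name ``sanitize_identifier`` and the sample identifier are made up for this sketch.

```python
import re


def sanitize_identifier(identifier: str) -> str:
    # Same pattern as the Cheetah snippet: keep word characters and '-',
    # replace every other character with '_'.
    return re.sub(r"[^\w\-_]", "_", identifier)


# Spaces and '#' are not word characters, so both become underscores.
safe = sanitize_identifier("WT rep#1")
```

Note that distinct identifiers can collapse to the same sanitized string (for example ``a b`` and ``a#b``), so a tool relying on unique names may need to handle that case separately.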
204+
176205
Some example tools which utilize ``element_identifier`` include:
177206

178-
- `identifier_multiple <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_multiple.xml>`_
179-
- `identifier_single <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_single.xml>`_
180-
- `vcftools_merge <https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/vcftools/vcftools_merge/vcftools_merge.xml>`_
207+
- `identifier_multiple <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_multiple.xml>`__
208+
- `identifier_single <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/identifier_single.xml>`__
209+
- `vcftools_merge <https://github.com/galaxyproject/tools-devteam/blob/master/tool_collections/vcftools/vcftools_merge/vcftools_merge.xml>`__
181210
- `jbrowse <https://github.com/galaxyproject/tools-iuc/blob/master/tools/jbrowse/jbrowse.xml>`_
182-
183-
.. TODO: https://github.com/galaxyproject/tools-devteam/pull/363/files
184-
185-
.. note:: Here we are rewriting the element identifiers to assure everything is safe to
186-
put on the command-line. In the future collections will not be able to contain
187-
keys that are potentially harmful and this won't be nessecary.
211+
- `kraken-mpa-report <https://github.com/blankenberg/tools-devteam/blob/master/tool_collections/kraken/kraken_report/kraken-mpa-report.xml>`__
188212

189213
More on ``data_collection`` parameters
190214
----------------------------------------------
@@ -229,7 +253,7 @@ collection or just a dataset.
229253
--nested ${input.is_collection}
230254
#end for
231255

232-
Some example tools which consume collections include:
256+
Some example tools which consume nested collections include:
233257

234258
- `collection_nested_test <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/collection_nested_test.xml>`_ (small test tool demonstrating consumption of nested collections)
235259

docs/images/identifiers.svg

Lines changed: 4 additions & 0 deletions

docs/images/subcollection_mapping_identifiers.svg

Lines changed: 4 additions & 0 deletions
