Commit 13a5ae7

port docs about collection as an output from bb
1 parent 4a96f6d commit 13a5ae7

1 file changed

+60
-3
lines changed

docs/_writing_collections.rst

Lines changed: 60 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -62,7 +62,7 @@ pairs of datasets can also process single datasets. The following
6262
<when value="paired">
6363
<param name="fastq_input1" type="data" format="fastqsanger" label="Select first set of reads" help="Specify dataset with forward reads"/>
6464
<param name="fastq_input2" type="data" format="fastqsanger" label="Select second set of reads" help="Specify dataset with reverse reads"/>
65-
</when>
65+
</when>
6666
<when value="single">
6767
<param name="fastq_input1" type="data" format="fastqsanger" label="Select fastq dataset" help="Specify dataset with single reads"/>
6868
</when>
@@ -113,7 +113,7 @@ For instance:
113113

114114
::
115115

116-
#for $input in $inputs
116+
#for $input in $inputs
117117
--input "$input"
118118
#end for
119119

@@ -151,7 +151,7 @@ derived from using a little ficitious program called ``merge_rows``.
151151
::
152152

153153
#import re
154-
#for $input in $inputs
154+
#for $input in $inputs
155155
merge_rows --name "${re.sub('[^\w\-_]', '_', $input.element_identifier)}" --file "$input" --to $output;
156156
#end for
157157
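The ``re.sub`` call in the template above replaces every character that is not a word character or a hyphen with an underscore, so that element identifiers are safe to pass as a command-line name. The following is a minimal Python sketch of that substitution on its own. The helper name ``safe_name`` and the sample identifiers are hypothetical and chosen only for illustration.

```python
import re


def safe_name(element_identifier):
    """Replace anything other than word characters and '-' with '_'."""
    # Same character class as the Cheetah template: \w covers letters,
    # digits and '_', and '\-' keeps hyphens untouched.
    return re.sub(r"[^\w\-_]", "_", element_identifier)


# Hypothetical element identifiers, not taken from any real history.
names = [safe_name(s) for s in ["sample 1", "treated/rep-2", "ctrl.fastq"]]
```

Spaces, slashes and dots all become underscores, while hyphens and existing underscores pass through unchanged.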

@@ -216,6 +216,63 @@ Some example tools which consume collections include:
216216

217217
- `collection_nested_test <https://github.com/galaxyproject/galaxy/blob/dev/test/functional/tools/collection_nested_test.xml>`_ (small test tool demonstrating consumption of nested collections)
218218

219+
220+
-------------------------------
221+
Collection as an Output
222+
-------------------------------
223+
224+
Whenever possible, simpler operations that produce datasets should be implicitly "mapped over" to produce collections, but there are a variety of situations for which this idiom is insufficient.
225+
226+
Progressively more complex syntax elements exist for the increasingly complex scenarios. Broadly speaking, three scenarios are covered, in which the tool produces:
227+
228+
- a collection with a static number of elements (mostly for paired collections, but if a tool does, say, fixed binning, it might make sense to create a list this way as well).
229+
- a list with the same number of elements as an input (a common pattern for normalization applications, for instance).
230+
- a list where the number of elements is not knowable until the job is complete.
231+
232+
For the first case, the tool can simply declare standard ``data`` elements below an output ``collection`` element in the ``outputs`` tag of the tool definition.
233+
234+
::
235+
236+
<collection name="paired_output" type="paired" label="Split Pair">
237+
<data name="forward" format="txt" />
238+
<data name="reverse" format_source="input1" from_work_dir="reverse.txt" />
239+
</collection>
240+
241+
242+
Templates (e.g. the ``command`` tag) can then reference ``$forward`` and ``$reverse`` or whatever ``name`` the corresponding ``data`` elements are given - as demonstrated in ``test/functional/tools/collection_creates_pair.xml``.
243+
244+
The tool should describe the collection type via the ``type`` attribute on the ``collection`` element. Data elements can define ``format``, ``format_source``, ``metadata_source``, ``from_work_dir``, and ``name``.
245+
246+
The above syntax would also work for the corner case of static lists. For paired collections specifically, however, the type plugin system now knows how to prototype a pair, so the following simpler (though less configurable) syntax also works.
247+
248+
::
249+
250+
<collection name="paired_output" type="paired" label="Split Pair" format_source="input1">
251+
</collection>
252+
253+
In this case the command template could then just reference ``${paired_output.forward}`` and ``${paired_output.reverse}`` as demonstrated in ``test/functional/tools/collection_creates_pair_from_type.xml``.
254+
255+
For the second case, where the structure of the output is based on the structure of an input, a ``structured_like`` attribute can be defined on the ``collection`` tag.
256+
257+
::
258+
259+
<collection name="list_output" type="list" label="Duplicate List" structured_like="input1" inherit_format="true" />
260+
261+
Templates can then loop over ``input1`` or ``list_output`` when building up command-line expressions. See ``test/functional/tools/collection_creates_list.xml`` for an example.
262+
263+
``format``, ``format_source``, and ``metadata_source`` can be defined for such collections if the format and metadata are fixed or based on a single input dataset. If instead the format or metadata depends on the elements of the collection it is structured like, ``inherit_format="true"`` and/or ``inherit_metadata="true"`` should be used. These attributes handle corner cases where, for instance, there are subtle format or metadata differences between the elements of the incoming list.
264+
265+
The third and most general case is when the number of elements in a list cannot be determined until runtime, for instance when splitting up files by various dynamic criteria.
266+
267+
In this case a collection may define one or more ``discover_datasets`` elements. For an example of such a tool, one that splits a tabular file into multiple tabular files based on the first column, see ``test/functional/tools/collection_split_on_column.xml``, which includes the following output definition:
268+
269+
::
270+
271+
<collection name="split_output" type="list" label="Table split on first column">
272+
<discover_datasets pattern="__name_and_ext__" directory="outputs" />
273+
</collection>
274+
275+
219276
----------------------
220277
Further Reading
221278
----------------------
