Description
I have a scatter/gather workflow and need to gather the files from a particular sample using the groupTuple() operator. The format of my input channel is as follows:
[
{
"id": "SAMP_0002",
"single_end": "false"
},
"SAMP_0002_0.ubam"
],
[
{
"id": "SAMP_0002",
"single_end": "false"
},
"SAMP_0002_1.ubam"
]
What I would like to do is wait until all the emissions from the upstream channel have finished and then to merge them with:
reads_sized
.groupTuple(by: 0)
.set {reads_grouped}
To get a merged channel like this:
[
{
"id": "SAMP_0002",
"single_end": "false"
},
["SAMP_0002_0.ubam", "SAMP_0002_1.ubam"]
]
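For reference, the situation above can be reproduced with literal values (a self-contained sketch; the map keys and filenames just mirror my example):

```nextflow
// With no size hint, groupTuple appears to wait for the source channel
// to complete before emitting any groups -- this is the behaviour I am
// asking about below.
Channel
    .of(
        [ [id: 'SAMP_0002', single_end: 'false'], 'SAMP_0002_0.ubam' ],
        [ [id: 'SAMP_0002', single_end: 'false'], 'SAMP_0002_1.ubam' ]
    )
    .groupTuple(by: 0)
    .view()
```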
It's really unclear to me what happens if you don't specify a size parameter to groupTuple(). Does it block until the input channel has finished all its emissions? Is it possible that it will emit multiple tuples with the same key if they come in at different times?
The "tip" in the docs suggests that you need to either specify a constant size or calculate the size of each group in advance using the built-in groupKey function, but it's not clear to me exactly how groupKey works, and I don't see any documentation on groupKey or its inputs and outputs.
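From what I can piece together, groupKey(key, size) wraps the grouping key together with an expected group size, so groupTuple can emit each group as soon as it is complete instead of waiting for the channel to close. A minimal sketch of what I think is intended, assuming a hypothetical upstream channel reads_sized that emits [meta, n_chunks, bam] where n_chunks is the number of pieces the sample was scattered into:

```nextflow
// Sketch only: wrap the meta map in groupKey with the expected count,
// so groupTuple knows when each sample's group is complete.
reads_sized
    .map { meta, n_chunks, bam -> tuple( groupKey(meta, n_chunks), bam ) }
    .groupTuple()
    .set { reads_grouped }
```

If that reading is correct, it would be great to have it spelled out in the docs.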
I went to the "patterns" example on groupTuple (https://nextflow-io.github.io/patterns/process-into-groups/), but it does not use groupKey at all even though the groups would have variable size in that example.
Overall, scatter/gather seems like an important pattern, and I think people would benefit from a review of this documentation. Or is there a better operator to use for my example above?
(Note: the history of groupKey seems to be in issue #796.)