
Better documentation for groupTuple and groupKey #3935

@benbfly


I have a scatter/gather workflow and need to gather the files belonging to each sample using the groupTuple() operator. The format of my input channel is as follows:

[
    {
        "id": "SAMP_0002",
        "single_end": "false"
    },
    "SAMP_0002_0.ubam"
],
[
    {
        "id": "SAMP_0002",
        "single_end": "false"
    },
    "SAMP_0002_1.ubam"
]

What I would like to do is wait until all the emissions from the upstream channel have finished and then merge them with:

    reads_sized
        .groupTuple(by: 0)
        .set {reads_grouped}

To get a merged channel like this:

[
    {
        "id": "SAMP_0002",
        "single_end": "false"
    },
    ["SAMP_0002_0.ubam", "SAMP_0002_1.ubam"]
]

It's really unclear to me what happens if you don't specify a size parameter to groupTuple(). Does it block until the input channel has finished all its emissions? Is it possible that it will emit multiple tuples with the same key if they come in at different times?

The "tip" suggests that you need to either specify a constant size or calculate the size of each group in advance and pass it through the built-in function groupKey, but it's not clear to me exactly how groupKey works, and I don't see any documentation on groupKey or its inputs and outputs.
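
For reference, here is a minimal sketch of how I understand the tip, not a confirmed recipe. It assumes the number of chunks per sample is known before gathering. The `chunk_counts` map and the `SAMP_0003` sample are hypothetical and exist only for illustration. `groupKey(key, size)` wraps the grouping key together with the expected group size, so groupTuple() should be able to emit each group as soon as it is complete rather than waiting for the upstream channel to close.

```nextflow
workflow {
    // Hypothetical input: one [meta, chunk] tuple per scattered uBAM chunk.
    reads_sized = Channel.of(
        [[id: 'SAMP_0002', single_end: false], 'SAMP_0002_0.ubam'],
        [[id: 'SAMP_0002', single_end: false], 'SAMP_0002_1.ubam'],
        [[id: 'SAMP_0003', single_end: false], 'SAMP_0003_0.ubam']
    )

    // Assumption: the chunk count per sample id is known in advance.
    def chunk_counts = [SAMP_0002: 2, SAMP_0003: 1]

    reads_sized
        // Wrap the meta map in a groupKey that carries the expected group size.
        .map { meta, chunk -> tuple(groupKey(meta, chunk_counts[meta.id]), chunk) }
        // No explicit size here: each key carries its own expected size.
        .groupTuple()
        .set { reads_grouped }

    reads_grouped.view()
}
```

If every sample had the same number of chunks, a constant `size` option on groupTuple(), for example `.groupTuple(by: 0, size: 2)`, would presumably be enough and groupKey would not be needed.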

I went to the "patterns" example on groupTuple (https://nextflow-io.github.io/patterns/process-into-groups/), but it does not use groupKey at all even though the groups would have variable size in that example.

Overall, scatter/gather seems like an important pattern, and I think people would benefit from a review of this documentation. Or is there a better operator to use for my example above?

(Note: the history of groupKey seems to be in issue #796.)
