groupTuple not correctly grouping based on shared key #6399

@skelly02

Bug report

I have a sample Nextflow script that collects sample IDs from processes throughout the workflow and groups them at the end to show which samples passed which steps in the pipeline. At the end I use a simple groupTuple operation to gather the list of successful workflow steps for each sample. However, groupTuple is not grouping items by their shared keys. I have tried many different combinations of channels and operators, and nothing makes this very simple grouping operation work correctly. It seems that either groupTuple has some fundamentally broken behavior or there is some hidden discrepancy that is not apparent.

The Nextflow script is here:

process DO_THING {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample1" ]; then echo "bad sample!"; exit 1; fi
    """
}

process DO_THING2 {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample2" ]; then echo "bad sample!"; exit 1; fi
    """
}

process DO_THING3 {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample3" ]; then echo "bad sample!"; exit 1; fi
    """
}

process FAKE_MULTIQC {
    // put your multiqc here to do thing with the table you made
    // debug true

    input:
    path(input_file)

    script:
    """
    echo "put your multiqc with the table here"
    echo "here is your table:"
    cat "${input_file}"
    """
}

workflow {
    samples = Channel.from("Sample1", "Sample2", "Sample3", "Sample4")

    // list all the input samples
    inputSamples = samples.map { sampleID ->
        // coerce all sample ID's to new string objects for consistency
        return ["${sampleID}", "INPUT"]
    }
    // sanity check that these INPUT items group correctly
    inputSamples2 = samples.map { sampleID ->
        return ["${sampleID}", "INPUT2"]
    }

    DO_THING(samples)
    DO_THING2(samples)
    DO_THING3(samples)

    // view the samples that passed
    // concat the channels sequentially to force blocking behavior for sanity check
    allSamples = inputSamples.concat(channel.topic("passed")).concat(inputSamples2).groupTuple()
    // NOTE: this gives the same output, so it's not due to topic channels; allSamples = inputSamples.concat(DO_THING.out.passed, DO_THING2.out.passed).concat(inputSamples2).groupTuple(by: 0)
    allSamples.view()

    // gives the following output:
    // [Sample1, [INPUT, INPUT2]]
    // [Sample2, [INPUT, INPUT2]]
    // [Sample3, [INPUT, INPUT2]]
    // [Sample4, [INPUT, INPUT2]]
    // [Sample1, [DO_THING3, DO_THING2]]
    // [Sample2, [DO_THING3, DO_THING]]
    // [Sample4, [DO_THING2, DO_THING3, DO_THING]]
    // [Sample3, [DO_THING2, DO_THING]]

    // but it's supposed to look like this:
    // [Sample1, [INPUT, INPUT2, DO_THING3, DO_THING2]]
    // [Sample3, [INPUT, INPUT2, DO_THING2, DO_THING]]
    // [Sample4, [INPUT, INPUT2, DO_THING2, DO_THING3, DO_THING]]
    // [Sample2, [INPUT, INPUT2, DO_THING3, DO_THING]]

    // make a table out of the passing samples
    samplesTable = channel.topic("passed")
        // the topic emits tuples of (sample ID, process name), in that order
        .map { sampleId, processName ->
            return "${sampleId}\t${processName}"
        }
        .collectFile(name: "passed.txt", storeDir: "output", newLine: true)
    FAKE_MULTIQC(samplesTable)
}

Expected behavior and actual behavior

These lines in the script:

// view the samples that passed
allSamples = inputSamples.concat(channel.topic("passed")).concat(inputSamples2).groupTuple()
allSamples.view()

are supposed to give output like this:

[Sample1, [INPUT, INPUT2, DO_THING3, DO_THING2]]
[Sample3, [INPUT, INPUT2, DO_THING2, DO_THING]]
[Sample4, [INPUT, INPUT2, DO_THING2, DO_THING3, DO_THING]]
[Sample2, [INPUT, INPUT2, DO_THING3, DO_THING]]

But instead they give this output:

[Sample1, [INPUT, INPUT2]]
[Sample2, [INPUT, INPUT2]]
[Sample3, [INPUT, INPUT2]]
[Sample4, [INPUT, INPUT2]]
[Sample4, [DO_THING, DO_THING2]]
[Sample2, [DO_THING]]
[Sample3, [DO_THING, DO_THING2]]
[Sample1, [DO_THING2]]

The items are not grouped by sample ID across the whole channel. Instead, the items from the "input samples" channels are grouped with each other, and the items from the downstream Nextflow process channels are grouped with each other, so each sample ID appears twice.

This behavior seems wrong because the expectation is that all items would get grouped.
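
One way to look for a hidden discrepancy of this kind is to inspect the runtime type and hash code of each key, since two keys can print identically without being the same kind of object. The plain Groovy sketch below is only an illustration under stated assumptions: `describeKey` is a hypothetical helper, not a Nextflow function, and whether a String versus GString mismatch is what actually happens in the script above is not confirmed by this report.

```groovy
// Hypothetical helper (not part of Nextflow): report the text, runtime class
// and hash code of a grouping key so that look-alike keys can be told apart.
def describeKey(Object key) {
    return [text: key.toString(), type: key.getClass().simpleName, hash: key.hashCode()]
}

def sampleName = "Sample1"

// A plain java.lang.String key.
def plain = describeKey(sampleName)

// An interpolated string; in Groovy this is a GString, not a String,
// even though it prints the same text.
def interpolated = describeKey("${sampleName}")

println plain
println interpolated
```

Inside the workflow, a similar check could be made by adding something like `.map { id, step -> [id, id.getClass().name, step] }.view()` just before `groupTuple()`. If the key classes turn out to differ between the two sets of items, converting every key with `toString()` before grouping would be one thing to try.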

For context, I re-read the docs for groupTuple (https://www.nextflow.io/docs/latest/reference/operator.html#grouptuple) multiple times, but I cannot figure out what is wrong with the example shown here. The docs mention groupKey and size, but neither I nor my colleagues could follow the explanation of those topics, so I am not sure whether they could be affecting this.

I also saw discussion #4716, but I am not sure the problem is the same here.

Environment

  • Nextflow version: 25.04.6; I tested on a variety of older versions of Nextflow and saw the same behavior
  • Java version:
openjdk version "17.0.10" 2024-01-16 LTS
OpenJDK Runtime Environment Corretto-17.0.10.7.1 (build 17.0.10+7-LTS)
OpenJDK 64-Bit Server VM Corretto-17.0.10.7.1 (build 17.0.10+7-LTS, mixed mode, sharing)
  • Operating system: tested on macOS and Linux
  • Bash version: (use the command $SHELL --version)
GNU bash, version 3.2.57(1)-release (arm64-apple-darwin24)

GNU bash, version 5.2.21(1)-release (x86_64-pc-linux-gnu)
