Bug report
I have a sample Nextflow script which collects sample IDs from processes throughout the workflow and then groups them at the end to show which samples passed which steps in the pipeline. At the end I use a simple groupTuple operation to gather the list of successful workflow steps for each sample. However, groupTuple is not correctly grouping items by their shared keys. I have tried many combinations of swapping the channels and operators around, and nothing makes this very basic grouping operation work correctly. It seems like there is either some fundamentally broken behavior in groupTuple, or some hidden discrepancy that is not apparent.

The Nextflow script is here:
```nextflow
process DO_THING {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample1" ]; then echo "bad sample!"; exit 1; fi
    """
}

process DO_THING2 {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample2" ]; then echo "bad sample!"; exit 1; fi
    """
}

process DO_THING3 {
    // debug true
    errorStrategy "ignore"

    input:
    val(x)

    output:
    tuple val("${sampleID}"), val("${task.process}"), topic: passed
    tuple val("${sampleID}"), val("${task.process}"), emit: passed

    script:
    sampleID = "${x}"
    """
    echo "got $x"
    if [ "$x" == "Sample3" ]; then echo "bad sample!"; exit 1; fi
    """
}

process FAKE_MULTIQC {
    // put your multiqc here to do things with the table you made
    // debug true

    input:
    path(input_file)

    script:
    """
    echo "put your multiqc with the table here"
    echo "here is your table:"
    cat "${input_file}"
    """
}

workflow {
    samples = Channel.from("Sample1", "Sample2", "Sample3", "Sample4")

    // list all the input samples
    inputSamples = samples.map { sampleID ->
        // coerce all sample IDs to new string objects for consistency
        return ["${sampleID}", "INPUT"]
    }

    // sanity check that these INPUT items group correctly
    inputSamples2 = samples.map { sampleID ->
        return ["${sampleID}", "INPUT2"]
    }

    DO_THING(samples)
    DO_THING2(samples)
    DO_THING3(samples)

    // view the samples that passed;
    // concat the channels sequentially to force blocking behavior as a sanity check
    allSamples = inputSamples.concat(channel.topic("passed")).concat(inputSamples2).groupTuple()
    // NOTE: this gives the same output, so it is not due to topic channels:
    // allSamples = inputSamples.concat(DO_THING.out.passed, DO_THING2.out.passed).concat(inputSamples2).groupTuple(by: 0)
    allSamples.view()
    // gives the following output:
    // [Sample1, [INPUT, INPUT2]]
    // [Sample2, [INPUT, INPUT2]]
    // [Sample3, [INPUT, INPUT2]]
    // [Sample4, [INPUT, INPUT2]]
    // [Sample1, [DO_THING3, DO_THING2]]
    // [Sample2, [DO_THING3, DO_THING]]
    // [Sample4, [DO_THING2, DO_THING3, DO_THING]]
    // [Sample3, [DO_THING2, DO_THING]]
    // but it is supposed to look like this:
    // [Sample1, [INPUT, INPUT2, DO_THING3, DO_THING2]]
    // [Sample3, [INPUT, INPUT2, DO_THING2, DO_THING]]
    // [Sample4, [INPUT, INPUT2, DO_THING2, DO_THING3, DO_THING]]
    // [Sample2, [INPUT, INPUT2, DO_THING3, DO_THING]]

    // make a table out of the passing samples
    samplesTable = channel.topic("passed")
        .map { processName, sampleId ->
            return "${processName}\t${sampleId}"
        }
        .collectFile(name: "passed.txt", storeDir: "output", newLine: true)
    FAKE_MULTIQC(samplesTable)
}
```

Expected behavior and actual behavior
This line in the script here:

```nextflow
// view the samples that passed
allSamples = inputSamples.concat(channel.topic("passed")).concat(inputSamples2).groupTuple()
allSamples.view()
```

is supposed to give output like this:

```
[Sample1, [INPUT, INPUT2, DO_THING3, DO_THING2]]
[Sample3, [INPUT, INPUT2, DO_THING2, DO_THING]]
[Sample4, [INPUT, INPUT2, DO_THING2, DO_THING3, DO_THING]]
[Sample2, [INPUT, INPUT2, DO_THING3, DO_THING]]
```
But instead it gives this output:

```
[Sample1, [INPUT, INPUT2]]
[Sample2, [INPUT, INPUT2]]
[Sample3, [INPUT, INPUT2]]
[Sample4, [INPUT, INPUT2]]
[Sample4, [DO_THING, DO_THING2]]
[Sample2, [DO_THING]]
[Sample3, [DO_THING, DO_THING2]]
[Sample1, [DO_THING2]]
```
The items are not correctly grouped by their sample IDs. Instead, the items from the "input samples" channels get grouped together, and the items from the downstream Nextflow process channels get grouped together, in two separate partial groups per sample. This behavior seems wrong, because the expectation is that all items sharing a sample ID would end up in a single group.
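One diagnostic I considered but have not verified (this is purely a guess on my part, not something the docs suggest): force every grouping key to a plain `java.lang.String` before grouping, in case the keys coming out of `map` and the keys coming out of the process outputs hash differently despite printing identically. A sketch against the script above:

```nextflow
// HYPOTHETICAL sanity check (untested assumption): normalize every key
// with toString() before grouping. If this makes the partial groups merge,
// the original keys differed in runtime type even though they look the same.
allSamplesCoerced = inputSamples
    .concat(channel.topic("passed"))
    .concat(inputSamples2)
    .map { sampleID, step -> tuple(sampleID.toString(), step) }
    .groupTuple()
allSamplesCoerced.view()
```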
For context, I re-read the groupTuple docs here https://www.nextflow.io/docs/latest/reference/operator.html#grouptuple multiple times, but I cannot figure out what is wrong with the example shown above. The docs there mention something about groupKey and size, but neither I nor my colleagues could make sense of that explanation, so I am not sure whether it could be affecting this or not.
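For reference, my best understanding of what the docs mean by groupKey and size is sketched below (the sample data here is made up for illustration): wrap each key with the number of items expected for that key, so groupTuple can emit a group as soon as it is complete instead of waiting for all upstream channels to finish.

```nextflow
// Minimal sketch of groupKey(key, size) as I understand the docs.
// Each key declares it expects 2 items, so each group is emitted
// as soon as both items for that key have arrived.
workflow {
    Channel.of(
        ["Sample1", "INPUT"],
        ["Sample2", "INPUT"],
        ["Sample1", "INPUT2"],
        ["Sample2", "INPUT2"]
    )
    .map { sampleID, step -> tuple(groupKey(sampleID, 2), step) }
    .groupTuple()
    .view()
}
```

If I understand correctly, this only changes *when* groups are emitted, not *how* items are matched, so I do not see how it would explain the split groups above.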
I also saw discussion #4716, but I am not sure it describes the same problem.
Environment
- Nextflow version: 25.04.6; I tested a variety of older Nextflow versions and saw the same behavior
- Java version:
  openjdk version "17.0.10" 2024-01-16 LTS
  OpenJDK Runtime Environment Corretto-17.0.10.7.1 (build 17.0.10+7-LTS)
  OpenJDK 64-Bit Server VM Corretto-17.0.10.7.1 (build 17.0.10+7-LTS, mixed mode, sharing)
- Operating system: tested on macOS and Linux
- Bash version ($SHELL --version):
  GNU bash, version 3.2.57(1)-release (arm64-apple-darwin24)
  GNU bash, version 5.2.21(1)-release (x86_64-pc-linux-gnu)