splitFastq produces wrong number of outputs #3216

@qscacheri

Bug report

Expected behavior and actual behavior

Given a pair of FASTQ files on S3 with 171390646 reads, calling groupedFastqs.splitFastq(by: params.chunkSize, pe: true, file: true) with chunkSize set to 10000000 does not create the expected number of output files: only a single chunk is emitted.
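
For reference, the expected chunk count follows from a ceiling division. This is a minimal sketch that assumes 171390646 is the read count of each file in the pair (the report does not say whether the figure is per file or combined); the variable names are illustrative only.

```shell
# Assumption: 171390646 is the per-file read count from the report.
reads=171390646
chunk_size=10000000

# Ceiling division with integer arithmetic: (a + b - 1) / b
expected_chunks=$(( (reads + chunk_size - 1) / chunk_size ))
echo "expected chunks per file: ${expected_chunks}"
```

Under that assumption the result is 18 chunks per file, so with pe: true one would expect on the order of 18 emitted items rather than 1.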

Steps to reproduce the problem

#!/usr/bin/env nextflow 

nextflow.enable.dsl=2

process validate {
    input:
        path(fastqFiles)
    output:
        path("*.f*q*"), includeInputs: true, emit: fastqFiles
        stdout emit: logs
    shell:
    '''
    for f in !{fastqFiles}; do
        echo "${f}:"
        du -h $(realpath $f)
    done
    '''
}

workflow {
    fastqsChannel = Channel.fromPath(params.fastqFiles)
    validate(fastqsChannel)
    
    groupedFastqs = validate.out.fastqFiles
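    .map { file ->
        // def keeps the matcher local to each closure call instead of a shared script-level variable
        // [12] matches the mate number; the original [1,2] would also match a literal comma
        def m = file =~ /.*\/([\w\d\-_]+)?[\-_]R?[12]/
        return tuple(m[0][1], file)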
    }
    .groupTuple()
    .map { tuple(it[0], it[1][0], it[1][1]) }

    chunksChannel = groupedFastqs.splitFastq(by: params.chunkSize, pe: true, file: true)
    chunksChannel.subscribe { println "Created chunk ${it}"}
    chunksChannel.count().view { "Created ${it} chunks" }

}

Program output

The run prints "Created 1 chunks", i.e. a single chunk instead of the expected split.
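
One way to cross-check this count independently of Nextflow is to count the records in the input files and in whatever chunk files were written. The helper below is a sketch, not part of the original report: count_reads is a hypothetical name, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```shell
# Hypothetical helper: print the number of reads in a FASTQ file.
# Assumes exactly four lines per record.
count_reads() {
    # Decompress gzip'd inputs, otherwise read the file as plain text.
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Report the read count for every file passed on the command line.
for f in "$@"; do
    printf '%s: %s\n' "$f" "$(count_reads "$f")"
done
```

Dividing the input count by chunkSize and rounding up gives the number of chunk files to compare against.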

Environment

  • Nextflow version: 22.09.3.edge build 5767
  • Java version:
  • Operating system: Linux
  • Bash version: (use the command $SHELL --version)

Additional context

I'm using AWS batch to test since the files are too big for me to test locally.
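
Since the real files are too large to test locally, a small synthetic pair may help narrow things down. The script below is only a sketch: the sample name, read count, and sequences are made up, and the file names are chosen so that the grouping regex in the script above would extract "sample" as the key.

```shell
set -eu

# Hypothetical values, chosen only for illustration.
n_reads=25
sample="sample"
outdir="$(mktemp -d)"

for mate in 1 2; do
    out="${outdir}/${sample}_R${mate}.fastq"
    : > "$out"
    for i in $(seq 1 "$n_reads"); do
        # One FASTQ record is four lines: header, sequence, separator, qualities.
        printf '@read%d/%d\nACGTACGTAC\n+\nIIIIIIIIII\n' "$i" "$mate" >> "$out"
    done
done

# Each file should hold n_reads records, i.e. 4 * n_reads lines.
wc -l "${outdir}"/*.fastq
echo "files written to ${outdir}"
```

Running the workflow against these files (for example `nextflow run main.nf --fastqFiles '<outdir>/*.fastq' --chunkSize 10`, where main.nf is a placeholder for the script name) should produce 3 chunks per file if the operator splits as expected.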
