Allow groupTuple operator to handle multi-size tuples #796

@micans

Suggestion:

A way to split samples into sub-samples, process them in parallel, and combine them again, with NF aware of the number of sub-samples per sample, so that the combine stage (which will be groupTuple-like) does not have to wait.

Use case:

Our sample data often comes in multiple cram files. Merging these cram files and sorting them with samtools (in case we need the paired end reads in FASTQ format) takes a large amount of memory and time, and results in large files (up to 50G seen so far).

Both samtools and the downstream STAR alignment of these files require huge resources (RAM and CPUs) and often take too long for our regular queue. Their outlier status makes our pipeline less predictable, and the resource/retry regime is very difficult to parameterise.

Ideally we want to process these multiple cram files (call them sub-samples) independently. This is currently possible using flatMap(), then later groupTuple(). However, NF will wait at the groupTuple channel until all upstream processes have finished. We cannot use the groupTuple size: parameter, as the number of sub-samples varies from sample to sample.

The current step from sample to subsamples may look like this:

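ch_sample
    .flatMap { item ->
        // 'def' keeps these local; undeclared variables in a closure are script-global
        def samplename = item[0]
        def files = item[1]
        // one [ samplename, onefile ] tuple per file
        files.collect { onefile -> [ samplename, onefile ] }
    }
    .set { ch_subsample }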

I propose something like

ch_sample
    .map { item ->
        def samplename = item[0]
        def files = item[1]
        files.collect { onefile -> [ samplename, onefile ] }
    }
    .emitFlatTuples()    // proposed operator, does not exist yet
    .set { ch_subsample }

where the idea is that emitFlatTuples keeps track of the tuples it emits, by default using the first element as key. In this case, samplename would be tracked so that NF knows how many tuples [ samplename, onefile ] exist for each samplename. Perhaps groupTuple could have an argument such as by_emit_key. All this requires global state.
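
To make the bookkeeping concrete, here is a minimal plain-Groovy sketch of the idea. It is not Nextflow code and not a proposed implementation: the class SizeAwareGrouper and its offer() method are hypothetical names made up for illustration, and emitFlatTuples here is only a stand-in for the proposed operator. The split side records how many tuples exist per key, and the combine side releases a group as soon as that many have arrived, without waiting for other keys.

```groovy
// Hypothetical illustration only; none of these names are Nextflow API.
class SizeAwareGrouper {
    // Number of sub-sample tuples emitted per key, recorded at split time.
    private final Map<String, Integer> expected = [:]
    // Values collected so far per key at the combine stage.
    private final Map<String, List> pending = [:]

    // Split stage: remember the group size for this key, then flatten.
    List<List> emitFlatTuples(String key, List files) {
        expected[key] = files.size()
        return files.collect { onefile -> [key, onefile] }
    }

    // Combine stage: returns [key, values] once the last tuple for the key
    // has arrived, otherwise null (the group is still incomplete).
    List offer(String key, Object value) {
        if (!expected.containsKey(key)) {
            throw new IllegalStateException('unknown key: ' + key)
        }
        def items = pending.get(key, [])   // creates the list on first use
        items << value
        if (items.size() < expected[key]) {
            return null
        }
        pending.remove(key)
        expected.remove(key)
        return [key, items]
    }
}

def grouper = new SizeAwareGrouper()
grouper.emitFlatTuples('sampleA', ['a1.cram', 'a2.cram'])
grouper.emitFlatTuples('sampleB', ['b1.cram', 'b2.cram', 'b3.cram'])

// Results arrive in arbitrary order; sampleA completes while sampleB is still running.
assert grouper.offer('sampleB', 'b1.bam') == null
assert grouper.offer('sampleA', 'a2.bam') == null
assert grouper.offer('sampleA', 'a1.bam') == ['sampleA', ['a2.bam', 'a1.bam']]
```

The two maps are the global state mentioned above: whatever emits the flat tuples and whatever groups them need to share the per-key counts. A sample with an empty file list would never be emitted by this sketch, so that case would need a decision in any real design.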
