Allow groupTuple operator to handle multi-size tuples


# Suggestion:
  
  A way to split samples into sub-samples, process them in parallel, and combine them back, with NF aware of each sample's sub-sample count so that the combine stage (which will be groupTuple-like) does not have to wait.

# Use case:
  
  Our sample data often comes as multiple CRAM files per sample. Merging these CRAM files and sorting them with samtools (for when we need the paired-end reads in FASTQ format) takes a large amount of memory and time, and results in large files (up to 50G seen so far).
  
  Both samtools and the downstream STAR alignment of these files require huge resources (RAM and CPUs) and often take too long for our regular queue. Their outlier status makes our pipeline less predictable, and the resource/retry regime is very difficult to parameterise.
  
  Ideally we want to process these multiple CRAM files (call them sub-samples) independently. This is currently possible using flatMap(), then later groupTuple(). However, NF will wait at the groupTuple channel until all incoming processes have finished. We cannot use the groupTuple `size:` parameter because the number of sub-samples varies from sample to sample.
  
  The current step from sample to subsamples may look like this:
```
ch_sample
    .flatMap { item ->
        // item is [ samplename, [ file1, file2, ... ] ]
        def samplename = item[0]
        def files = item[1]
        // emit one [ samplename, onefile ] tuple per file
        files.collect { onefile -> [ samplename, onefile ] }
    }
    .set { ch_subsample }
```
  
  I propose something like:
```
ch_sample
    .map { item ->
        def samplename = item[0]
        def files = item[1]
        files.collect { onefile -> [ samplename, onefile ] }
    }
    // emitFlatTuples is the proposed operator; it does not exist in NF today
    .emitFlatTuples()
    .set { ch_subsample }
```
  
  The idea is that `emitFlatTuples` keeps track of the tuples it emits, by default using the first element as the key. In this case, samplename would be tracked so that NF knows how many `[ samplename, onefile ]` tuples exist for each samplename. Perhaps groupTuple could then take an argument such as `by_emit_key`, so that it can emit a group as soon as that many tuples have arrived. All this requires global state.
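
  To make the bookkeeping concrete, here is a minimal sketch in plain Python rather than NF, since the point is only the counting logic. The names `EmitRegistry`, `emit_flat_tuples`, `GroupByEmitKey` and `push` are all hypothetical and stand in for the proposed `emitFlatTuples` and a `by_emit_key` groupTuple. It assumes that all tuples for a given key come from a single incoming item, so the count is final before any tuple reaches the grouping step.

```python
from collections import defaultdict


class EmitRegistry:
    """Hypothetical global state: number of tuples emitted per key."""

    def __init__(self):
        self.expected = {}

    def emit_flat_tuples(self, nested):
        """Flatten one list of (key, value) pairs and record the count per key.

        The count is recorded for the whole list before anything is returned,
        so downstream consumers always see the final size.
        """
        pairs = [(key, value) for key, value in nested]
        for key, _ in pairs:
            self.expected[key] = self.expected.get(key, 0) + 1
        return pairs


class GroupByEmitKey:
    """Hypothetical groupTuple variant that consults the registry."""

    def __init__(self, registry):
        self.registry = registry
        self.pending = defaultdict(list)

    def push(self, key, value):
        """Add one tuple; return (key, values) once the group is complete, else None."""
        self.pending[key].append(value)
        if len(self.pending[key]) == self.registry.expected[key]:
            return key, self.pending.pop(key)
        return None


registry = EmitRegistry()
subsamples = registry.emit_flat_tuples([("s1", "a.cram"), ("s1", "b.cram")])
subsamples += registry.emit_flat_tuples([("s2", "c.cram")])

grouper = GroupByEmitKey(registry)
# Tuples may arrive in any order; s2 is released without waiting for s1.
for key, value in [subsamples[2], subsamples[0], subsamples[1]]:
    group = grouper.push(key, value)
    if group is not None:
        print(group)
```

  In this model a sample with a single sub-sample is released as soon as its one tuple arrives, while larger samples wait only for their own tuples, not for the whole upstream process to finish.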

