Declaration of BAM->FASTQ translation scheme

# Background
New library construction technologies, especially highly multiplexed assays like single-cell RNA-seq, employ a large variety of library molecule configurations. These configurations insert 'technical' information such as cellular barcodes and UMIs, as well as 'nuisance' bases such as read-through of a fixed sequence, at various positions within the reads. In some assays, critical information is attached to the second index read (sometimes known as I2 or I5). Pipelines for handling such data are generally constructed to ingest FASTQs and parse out these read components based on prior knowledge of the assay. The technical read components are not typically stored in the main SEQ field, but may be stored in various optional tags. The wide variety of assay schemes makes it impossible to declare a bam-to-fastq conversion that is a fixed function of a static set of tags.
 
Using BAM/CRAM as an archival format for such data is problematic because it is not always clear how to reconstruct the 'original' FASTQ sequences from the BAM file. However, most pipelines require the raw sequence in the original FASTQ format, so people with access to only the BAM file may have difficulty reprocessing the 'raw' data it represents.

The GA4GH file formats group expressed interest in attempting to formally specify the BAM->FASTQ translation method as metadata inside the BAM file.

10x Genomics has adopted a solution for encoding the BAM->FASTQ translation process as special `@CO` tags which can be interpreted by a general purpose conversion tool called [bamtofastq](https://github.com/10XGenomics/bamtofastq), which we've recently open-sourced.

@nunofonseca has also created a bam-to-fastq tool that relies on a pre-determined set of assay configurations: https://github.com/nunofonseca/fastq_utils#fastq2bam---lossless-fastq-to-bam-convertor

This may serve as a basis for discussion about how to proceed.

# 10x Bam-to-fastq Scheme

The BAM to FASTQ configuration is specified in special `@CO` headers in the BAM file. Each 'raw' sequencer read is described by one line, with the following format (in [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form)):
```
line = "10x_bam_to_fastq:", read_name, "(", read_component, { ",", read_component }, ")"
read_name = "R1" | "R2" | "I1" | "I2"
read_component = (seq_tag, ":", qual_tag) | ("N", digits) | "SEQ:QUAL"
seq_tag = letter, letter
qual_tag = letter, letter
```
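
To make the grammar concrete, here is a minimal Python sketch of a parser for one descriptor line. It is only an illustration, not the logic used by the bamtofastq tool: the function name `parse_descriptor` and the tuple representation of components are hypothetical, and it assumes tags are exactly two letters, as the grammar states.

```python
import re

# One descriptor line, e.g. "10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)".
LINE_RE = re.compile(r"^10x_bam_to_fastq:(R1|R2|I1|I2)\((.*)\)$")
# seq_tag ":" qual_tag, each tag being two letters as in the grammar.
TAG_PAIR_RE = re.compile(r"^([A-Za-z]{2}):([A-Za-z]{2})$")
# "N" followed by a count of bases.
N_RE = re.compile(r"^N(\d+)$")


def parse_descriptor(line):
    """Return (read_name, components) for one descriptor line.

    Each component is one of (hypothetical representation):
      ("seq_qual",), ("n", count), ("tags", seq_tag, qual_tag)
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a 10x_bam_to_fastq descriptor: {line!r}")
    read_name, body = match.groups()

    components = []
    for part in body.split(","):
        part = part.strip()
        n_match = N_RE.match(part)
        tag_match = TAG_PAIR_RE.match(part)
        if part == "SEQ:QUAL":
            components.append(("seq_qual",))
        elif n_match:
            components.append(("n", int(n_match.group(1))))
        elif tag_match:
            components.append(("tags", tag_match.group(1), tag_match.group(2)))
        else:
            raise ValueError(f"bad read component: {part!r}")
    return read_name, components
```

Rejecting unknown components outright, rather than skipping them, seems the safer default for an archival format, but that is a design choice open to discussion.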

For example, the scheme for the Chromium Genome product is as follows:
```
10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
10x_bam_to_fastq:R2(SEQ:QUAL)
10x_bam_to_fastq:I1(BC:QT)
```
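
A reader first has to find these lines in the header. The sketch below assumes the header is available as SAM text and that each descriptor is the whole body of an `@CO` line (that is, `@CO`, a tab, then the descriptor); the function name `descriptors_from_header` is hypothetical.

```python
PREFIX = "10x_bam_to_fastq:"


def descriptors_from_header(header_text):
    """Map each read name (e.g. "R1") to its list of raw component strings."""
    found = {}
    for line in header_text.splitlines():
        fields = line.split("\t", 1)
        # Only @CO comment lines can carry a descriptor.
        if len(fields) != 2 or fields[0] != "@CO":
            continue
        comment = fields[1].strip()
        if not comment.startswith(PREFIX):
            continue
        # Split "R1(RX:QX,TR:TQ,SEQ:QUAL)" into the name and the component list.
        name, _, tail = comment[len(PREFIX):].partition("(")
        found[name] = tail.rstrip(")").split(",")
    return found


header = (
    "@HD\tVN:1.6\n"
    "@CO\t10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)\n"
    "@CO\t10x_bam_to_fastq:R2(SEQ:QUAL)\n"
    "@CO\t10x_bam_to_fastq:I1(BC:QT)\n"
    "@CO\tsome unrelated comment\n"
)
schemes = descriptors_from_header(header)
```

Unrelated `@CO` comments are ignored, so the descriptors can coexist with free-text comments in the same header.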

In particular: 

```
10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
```
declares how to construct the original R1 FASTQ sequence and quality values. The sequence is the concatenation of the tags `RX` and `TR`, followed by the record sequence, denoted by `SEQ`. The quality values are the concatenation of `QX`, `TQ`, and the record qualities, denoted by `QUAL`. This scheme applies to reads marked as first in pair (`(FLAG & 0x40) != 0`).
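
As a rough illustration of this concatenation, the sketch below rebuilds the R1 sequence and qualities from a record held in a plain dictionary. The record layout, the function name `rebuild_read`, and the example values are all made up for illustration; the sketch also ignores the reverse-strand case covered in the notes, so it is only valid for forward-strand records.

```python
FIRST_IN_PAIR = 0x40  # SAM flag bit marking the first read of a pair


def rebuild_read(record, components):
    """Concatenate the sequence and quality pieces named by `components`.

    `record` is a hypothetical dict with "seq", "qual" and "tags" keys;
    `components` is a list such as ["RX:QX", "TR:TQ", "SEQ:QUAL"].
    """
    seq_parts, qual_parts = [], []
    for comp in components:
        if comp == "SEQ:QUAL":
            # The main record sequence and qualities.
            seq_parts.append(record["seq"])
            qual_parts.append(record["qual"])
        else:
            # A pair of optional tags holding sequence and qualities.
            seq_tag, qual_tag = comp.split(":")
            seq_parts.append(record["tags"][seq_tag])
            qual_parts.append(record["tags"][qual_tag])
    return "".join(seq_parts), "".join(qual_parts)


# Made-up forward-strand R1 record (paired, first in pair).
r1_record = {
    "flag": 0x1 | FIRST_IN_PAIR,
    "seq": "TTTAAA",
    "qual": "FFFFFF",
    "tags": {"RX": "ACGT", "QX": "IIII", "TR": "GG", "TQ": "##"},
}

if r1_record["flag"] & FIRST_IN_PAIR:
    r1_seq, r1_qual = rebuild_read(r1_record, ["RX:QX", "TR:TQ", "SEQ:QUAL"])
    # r1_seq  is "ACGTGGTTTAAA"
    # r1_qual is "IIII##FFFFFF"
```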

```
10x_bam_to_fastq:R2(SEQ:QUAL)
```
declares how to construct the R2 FASTQ sequence and quality values. In this case the R2 sequence is entirely contained in the `SEQ` and `QUAL` values of the R2 BAM record.


## Notes:
- The `SEQ:QUAL` read component indicates the full sequence / qual of the read, with the sequence reverse complemented and the qualities reversed if the 'reverse' bit (`0x10`) is set on the read.
- If the `R2` descriptor does not contain a `SEQ:QUAL` entry, then it is assumed that no R2 records exist in the BAM file, and that the raw R2 read can be derived from the R1 record.
- The bam-to-fastq implementation must match corresponding R1 and R2 records.
- Implementors are encouraged to use lower-case tags for technology-specific read components that are not yet a part of the SAM spec.
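
The strand handling in the first note can be sketched as follows. This is a minimal illustration, assuming the sequence contains only `ACGTN` (upper or lower case); the function name `original_orientation` is hypothetical.

```python
REVERSE = 0x10  # SAM flag bit: the read is stored reverse complemented

# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return (seq, qual) as the sequencer emitted them."""
    if flag & REVERSE:
        # Undo the aligner's reverse complement; qualities are only reversed.
        return seq.translate(COMPLEMENT)[::-1], qual[::-1]
    return seq, qual
```

For example, `original_orientation("AACG", "ABCD", 0x10)` gives `("CGTT", "DCBA")`, while a record without the reverse bit is returned unchanged.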

# Open Questions

- Should it be possible to declare `N` bases in the raw read? This is useful if some trimmed bases are not important to retain.
- Is it possible to handle BAM files with a mix of library configurations? Should the bam-to-fastq declaration be part of the `@RG` header in order to support this?
- Backward compatibility: will existing BAM readers break if new fields are introduced in the `@RG` header?
- What naming conventions should be used for the output files? Should there be templates that automatically generate Illumina-compatible file names?


Thoughts? I'm happy to write up a real PR once I've had some feedback on this.

@jkbonfield @daviesrob @dkj @pryvkin10x @raskoleinonen @nunofonseca

