Background
New library construction technologies, especially highly multiplexed assays like single-cell RNA-seq, employ a large variety of library molecule configurations. These configurations insert 'technical' information such as cellular barcodes and UMIs, as well as 'nuisance' bases such as read-through of a fixed sequence, at various positions within the reads. In some assays, critical information is attached to the second index read (sometimes known as I2 or I5). Pipelines for handling such data are generally constructed to ingest FASTQs and parse out these read components based on prior knowledge of the assay. The technical read components are not typically stored in the main SEQ field, but may be stored in various optional tags. The wide variety of assay schemes makes it impossible to declare a bam-to-fastq that is a fixed function of a static set of tags.
Using BAM/CRAM as an archival format for such data is problematic, because it's not always clear how to reconstruct the 'original' FASTQ sequences from the BAM file. However, most pipelines require the raw sequence, in the original FASTQ format. Therefore people with access to only the BAM file may have difficulty reprocessing the 'raw' data represented by that BAM file.
The GA4GH file formats group expressed interest in attempting to formally specify the BAM->FASTQ translation method as metadata inside the BAM file.
10x Genomics has adopted a solution for encoding the BAM->FASTQ translation process as special @CO tags which can be interpreted by a general purpose conversion tool called bamtofastq, which we've recently open-sourced.
@nunofonseca has also created a bam-to-fastq tool that relies on a pre-determined set of assay configurations: (https://github.com/nunofonseca/fastq_utils#fastq2bam---lossless-fastq-to-bam-convertor)
This may serve as a basis for discussion about how to proceed.
10x Bam-to-fastq Scheme
The BAM to FASTQ configuration is specified in special @CO headers in the BAM file. Each 'raw' sequencer read is described by one line, with the following format (in EBNF):
line = "10x_bam_to_fastq:", read_name, "(", read_component, { ",", read_component }, ")"
read_name = "R1" | "R2" | "I1" | "I2"
read_component = (seq_tag, ":", qual_tag) | ("N", digits) | "SEQ:QUAL"
seq_tag = letter, letter
qual_tag = letter, letter
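To make the grammar concrete, here is a minimal sketch of a parser for these descriptor lines. It is not the bamtofastq implementation; the names Component, parse_descriptor and descriptors_from_header are hypothetical, and it assumes the descriptors appear as the text of @CO header lines (that is, after the "@CO" and a tab) with no extra whitespace.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple

PREFIX = "10x_bam_to_fastq:"


class Component(NamedTuple):
    """One read_component from the grammar (hypothetical helper type)."""
    kind: str                       # "tags", "record" (SEQ:QUAL) or "pad" (N<digits>)
    seq_tag: Optional[str] = None
    qual_tag: Optional[str] = None
    n_bases: int = 0


_LINE_RE = re.compile(r"10x_bam_to_fastq:(R1|R2|I1|I2)\((.+)\)")
_TAG_PAIR_RE = re.compile(r"[A-Za-z]{2}:[A-Za-z]{2}")
_PAD_RE = re.compile(r"N[0-9]+")


def parse_descriptor(comment: str) -> Tuple[str, List[Component]]:
    """Parse one descriptor, e.g. '10x_bam_to_fastq:R1(RX:QX,SEQ:QUAL)'."""
    match = _LINE_RE.fullmatch(comment.strip())
    if match is None:
        raise ValueError(f"not a bam-to-fastq descriptor: {comment!r}")
    read_name, body = match.groups()
    components = []
    for item in body.split(","):
        if item == "SEQ:QUAL":
            components.append(Component("record"))
        elif _PAD_RE.fullmatch(item):
            components.append(Component("pad", n_bases=int(item[1:])))
        elif _TAG_PAIR_RE.fullmatch(item):
            seq_tag, qual_tag = item.split(":")
            components.append(Component("tags", seq_tag, qual_tag))
        else:
            raise ValueError(f"unrecognised read component: {item!r}")
    return read_name, components


def descriptors_from_header(header_text: str) -> Dict[str, List[Component]]:
    """Collect descriptors from the @CO lines of a SAM-format header."""
    found = {}
    for line in header_text.splitlines():
        # A comment line is '@CO', a tab, then free text.
        if line.startswith("@CO\t") and line[4:].startswith(PREFIX):
            read_name, components = parse_descriptor(line[4:])
            found[read_name] = components
    return found
```

Other @CO comments are ignored, so the descriptors can coexist with whatever comment lines a pipeline already writes.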
For example, the schema for the Chromium Genome product is as follows:
10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
10x_bam_to_fastq:R2(SEQ:QUAL)
10x_bam_to_fastq:I1(BC:QT)
In particular:
10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
declares how to construct the original R1 FASTQ sequence and quality values. The sequence is the concatenation of the tags RX and TR, followed by the record sequence, denoted by SEQ. The quality values are the concatenation of QX, TQ, and the record quals, denoted QUAL. This scheme applies to reads marked as r1 (the 0x40 bit of FLAG is set).
10x_bam_to_fastq:R2(SEQ:QUAL)
declares how to construct the R2 FASTQ sequence and quality values. In this case the R2 sequence is entirely contained in the SEQ and QUAL values of the r2 BAM record.
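The concatenation described above can be sketched as follows. This is an illustration under stated assumptions rather than the bamtofastq code: a record is modelled as a plain dict with hypothetical keys "name", "flag", "seq", "qual" and "tags", the component list is given as strings such as "RX:QX", and the record is assumed to be on the forward strand (strand handling is a separate concern).

```python
from typing import Dict, List, Tuple


def is_r1(flag: int) -> bool:
    """True when the 'first in pair' bit (0x40) is set."""
    return (flag & 0x40) != 0


def rebuild_read(components: List[str], record: Dict) -> Tuple[str, str]:
    """Concatenate tag values and SEQ/QUAL in the declared order."""
    seq_parts, qual_parts = [], []
    for component in components:
        if component == "SEQ:QUAL":
            seq_parts.append(record["seq"])
            qual_parts.append(record["qual"])
        else:
            seq_tag, qual_tag = component.split(":")
            seq_parts.append(record["tags"][seq_tag])
            qual_parts.append(record["tags"][qual_tag])
    return "".join(seq_parts), "".join(qual_parts)


def to_fastq(name: str, seq: str, qual: str) -> str:
    """Format one four-line FASTQ entry."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


# Toy R1 record (values are made up for illustration).
r1_record = {
    "name": "read1",
    "flag": 0x41,  # paired, first in pair
    "seq": "GGCA",
    "qual": "##II",
    "tags": {"RX": "ACGT", "QX": "IIII", "TR": "TT", "TQ": "FF"},
}

if is_r1(r1_record["flag"]):
    seq, qual = rebuild_read(["RX:QX", "TR:TQ", "SEQ:QUAL"], r1_record)
    print(to_fastq(r1_record["name"], seq, qual), end="")
```

The same function covers the R2 case, where the component list is just ["SEQ:QUAL"] and the output is the record's own sequence and qualities.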
Notes:
- The SEQ:QUAL read component indicates the full sequence / qual of the read, reverse complemented if the 'reverse' bit is set on the read.
- if the R2 descriptor does not contain an SEQ:QUAL entry, then it is assumed that no R2 records exist in the BAM file, and that the raw R2 read can be derived from the R1 record.
- the bam-to-fastq implementation must match corresponding R1 and R2 records
- implementors are encouraged to use lower-case tags for technology-specific read components that are not yet a part of the SAM spec
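Two of the notes above lend themselves to a short sketch: restoring the sequencer orientation of SEQ and QUAL, and matching R1 with R2 records. For a record with the reverse bit (0x10) set, SAM stores SEQ reverse complemented and QUAL reversed, so both need undoing. The function names and the dict-based record (keys "name" and "flag") are hypothetical, and skipping secondary and supplementary records is an assumption of this sketch, not something the scheme states.

```python
from typing import Dict, Iterable, Iterator, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq: str, qual: str, flag: int) -> Tuple[str, str]:
    """Return SEQ/QUAL as the sequencer produced them."""
    if flag & 0x10:  # read is stored on the reverse strand
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


def pair_records(records: Iterable[Dict]) -> Iterator[Tuple[Dict, Dict]]:
    """Yield (r1, r2) pairs matched by read name, in any input order."""
    pending: Dict[str, Dict] = {}
    for record in records:
        # Assumption: only primary records carry the raw read.
        if record["flag"] & (0x100 | 0x800):
            continue
        mate = pending.pop(record["name"], None)
        if mate is None:
            pending[record["name"]] = record
        elif record["flag"] & 0x40:
            yield record, mate
        else:
            yield mate, record
```

Holding unmatched records in memory is the simplest approach; for a coordinate-sorted BAM this buffer can grow large, so a real tool would likely need to spill to disk or require name-sorted input.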
Open Questions
- should it be possible to declare N bases in the raw read? This is useful if some trimmed bases are not important to retain.
- is it possible to handle BAM files with a mix of library configurations? Should the bam-to-fastq declaration be part of the @RG header in order to support this?
- backward compatibility: will existing BAM readers break if new fields are introduced in the @RG header?
- what naming conventions should be used for the output files? Should there be templates that automatically generate Illumina-compatible file names?
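As a starting point for the naming question, here is a minimal sketch of one possible template. It follows the file name pattern that bcl2fastq writes (sample name, sample number, lane, read, and a fixed 001 suffix). The function name and its parameters are hypothetical, and where the sample name, sample number and lane would come from in a BAM file is left open.

```python
def illumina_fastq_name(sample: str, sample_number: int, lane: int, read_name: str) -> str:
    """Build a bcl2fastq-style file name for one raw read (R1, R2, I1 or I2)."""
    if read_name not in ("R1", "R2", "I1", "I2"):
        raise ValueError(f"unknown read name: {read_name!r}")
    return f"{sample}_S{sample_number}_L{lane:03d}_{read_name}_001.fastq.gz"


print(illumina_fastq_name("mysample", 1, 2, "R1"))
```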
Thoughts? I'm happy to write up a real PR once I've had some feedback on this.
@jkbonfield @daviesrob @dkj @pryvkin10x @raskoleinonen @nunofonseca