
Declaration of BAM->FASTQ translation scheme #270

@pmarks

Background

New library construction technologies, especially highly multiplexed assays like single-cell RNA-seq, employ a large variety of library molecule configurations. These configurations insert 'technical' information, such as cellular barcodes and UMIs, and 'nuisance' bases, such as read-through of a fixed sequence, at various positions within the reads. In some assays, critical information is attached to the second index read (sometimes known as I2 or I5). Pipelines for handling such data are generally constructed to ingest FASTQs and parse out these read components based on prior knowledge of the assay. The technical read components are not typically stored in the main SEQ field, but may be stored in various optional tags. The wide variety of assay schemes makes it impossible to declare a bam-to-fastq that is a fixed function of a static set of tags.

Using BAM/CRAM as an archival format for such data is problematic, because it's not always clear how to reconstruct the 'original' FASTQ sequences from the BAM file. However, most pipelines require the raw sequence, in the original FASTQ format. Therefore people with access to only the BAM file may have difficulty reprocessing the 'raw' data represented by that BAM file.

The GA4GH file formats group expressed interest in attempting to formally specify the BAM->FASTQ translation method as metadata inside the BAM file.

10x Genomics has adopted a solution for encoding the BAM->FASTQ translation process as special @CO tags which can be interpreted by a general purpose conversion tool called bamtofastq, which we've recently open-sourced.

@nunofonseca has also created a bam-to-fastq tool that relies on a pre-determined set of assay configurations: (https://github.com/nunofonseca/fastq_utils#fastq2bam---lossless-fastq-to-bam-convertor)

This may serve as a basis for discussion about how to proceed.

10x Bam-to-fastq Scheme

The BAM to FASTQ configuration is specified in special @CO headers in the BAM file. Each 'raw' sequencer read is described by one line, with the following format (in EBNF):

line = "10x_bam_to_fastq:", read_name, "(", read_component, {",", read_component}, ")"
read_name = "R1" | "R2" | "I1" | "I2"
read_component = (seq_tag, ":", qual_tag) | ("N", digits) | "SEQ:QUAL"
seq_tag = letter, letter
qual_tag = letter, letter
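
A minimal Python sketch of a parser for one descriptor line, following the grammar above. The function name parse_descriptor and the ("tag" | "pad" | "record") tuple layout are hypothetical choices for illustration and are not taken from the bamtofastq source.

```python
import re

# One descriptor line: prefix, read name, parenthesised component list.
LINE_RE = re.compile(r"^10x_bam_to_fastq:(R1|R2|I1|I2)\((.*)\)$")
# One component: SEQ:QUAL, N<digits>, or a two-letter seq tag and qual tag.
COMPONENT_RE = re.compile(r"^(?:SEQ:QUAL|N(\d+)|([A-Za-z]{2}):([A-Za-z]{2}))$")


def parse_descriptor(comment):
    """Parse a descriptor into (read_name, components).

    Each component is a tuple (hypothetical layout):
      ("tag", seq_tag, qual_tag)  bases and quals stored in optional tags
      ("pad", n, None)            n placeholder N bases
      ("record", None, None)      the record's own SEQ and QUAL
    """
    m = LINE_RE.match(comment.strip())
    if m is None:
        raise ValueError(f"not a bam-to-fastq descriptor: {comment!r}")
    read_name, body = m.groups()
    components = []
    for part in body.split(","):
        part = part.strip()
        c = COMPONENT_RE.match(part)
        if c is None:
            raise ValueError(f"bad read component: {part!r}")
        if part == "SEQ:QUAL":
            components.append(("record", None, None))
        elif c.group(1) is not None:
            components.append(("pad", int(c.group(1)), None))
        else:
            components.append(("tag", c.group(2), c.group(3)))
    return read_name, components
```

Rejecting unknown components outright, rather than skipping them, seems the safer default here, since a silently dropped component would shift every downstream base.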

For example, the schema for the Chromium Genome product is as follows:

10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)
10x_bam_to_fastq:R2(SEQ:QUAL)
10x_bam_to_fastq:I1(BC:QT)
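
As a rough sketch of how a tool might locate these declarations, the Python below scans SAM header text for @CO lines carrying the 10x_bam_to_fastq: prefix. It assumes the header is already available as plain text with tab-separated fields, as in SAM; the function name descriptor_lines is hypothetical.

```python
PREFIX = "10x_bam_to_fastq:"


def descriptor_lines(header_text):
    """Return the bam-to-fastq descriptors found in @CO header lines."""
    found = []
    for line in header_text.splitlines():
        # SAM comment lines are "@CO", a tab, then free text.
        if not line.startswith("@CO\t"):
            continue
        comment = line[len("@CO\t"):]
        if comment.startswith(PREFIX):
            found.append(comment)
    return found


header = (
    "@HD\tVN:1.3\tSO:coordinate\n"
    "@CO\t10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)\n"
    "@CO\t10x_bam_to_fastq:R2(SEQ:QUAL)\n"
    "@CO\t10x_bam_to_fastq:I1(BC:QT)\n"
    "@CO\tsome unrelated comment\n"
)
for descriptor in descriptor_lines(header):
    print(descriptor)
```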

In particular:

10x_bam_to_fastq:R1(RX:QX,TR:TQ,SEQ:QUAL)

declares how to construct the original R1 FASTQ sequence and quality values. The sequence is the concatenation of the tags RX and TR, followed by the record sequence, denoted by SEQ. The quality values are the concatenation of QX, TQ, and the record quals, denoted QUAL. This scheme applies to reads marked as R1, i.e. first in pair ((FLAG & 0x40) != 0).

10x_bam_to_fastq:R2(SEQ:QUAL)

declares how to construct the R2 FASTQ sequence and quality values. In this case the R2 sequence is entirely contained in the SEQ and QUAL values of the R2 BAM record.
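
The concatenation described for R1 and R2 can be sketched in Python as follows. This is an illustration under simplifying assumptions: the record is represented by plain strings and a dict of tags rather than a real BAM library object, components are given as (sequence source, quality source) pairs, "N" padding components are not handled, and the names rebuild_read and is_r1 are hypothetical.

```python
def is_r1(flag):
    """True if the record is marked first in pair (bit 0x40 of FLAG)."""
    return (flag & 0x40) != 0


def rebuild_read(components, seq, qual, tags):
    """Concatenate the declared components into (sequence, qualities).

    components: list of (seq_source, qual_source) pairs, e.g.
                [("RX", "QX"), ("TR", "TQ"), ("SEQ", "QUAL")]
    seq, qual:  the record's own SEQ and QUAL strings
    tags:       dict mapping optional tag names to string values
    """
    bases, quals = [], []
    for seq_key, qual_key in components:
        if (seq_key, qual_key) == ("SEQ", "QUAL"):
            bases.append(seq)
            quals.append(qual)
        else:
            bases.append(tags[seq_key])
            quals.append(tags[qual_key])
    return "".join(bases), "".join(quals)


r1_scheme = [("RX", "QX"), ("TR", "TQ"), ("SEQ", "QUAL")]
tags = {"RX": "GGGG", "QX": "FFFF", "TR": "TT", "TQ": "##"}
print(rebuild_read(r1_scheme, "ACGT", "IIII", tags))
# prints ('GGGGTTACGT', 'FFFF##IIII')
```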

Notes:

  • The SEQ:QUAL read component indicates the full sequence / qual of the read, reverse complemented if the 'reverse' bit (FLAG & 0x10) is set on the read.
  • If the R2 descriptor does not contain a SEQ:QUAL entry, then it is assumed that no R2 records exist in the BAM file, and that the raw R2 read can be derived from the R1 record.
  • The bam-to-fastq implementation must match corresponding R1 and R2 records.
  • Implementors are encouraged to use lower-case tags for technology-specific read components that are not yet a part of the SAM spec.
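
The first note, restoring the as-sequenced orientation of SEQ and QUAL, might look like this in Python. It is a sketch only: the function name original_orientation is hypothetical, and only A, C, G, T and N (in either case) are complemented.

```python
# Translation table for complementing bases; other characters pass through.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def original_orientation(seq, qual, flag):
    """Return (seq, qual) as sequenced.

    BAM stores reverse-strand alignments reverse complemented, so when the
    reverse bit (0x10) is set the bases are reverse complemented again and
    the qualities are reversed.
    """
    if flag & 0x10:
        return seq.translate(_COMPLEMENT)[::-1], qual[::-1]
    return seq, qual


print(original_orientation("AACG", "ABCD", 0x10))  # prints ('CGTT', 'DCBA')
print(original_orientation("AACG", "ABCD", 0))     # prints ('AACG', 'ABCD')
```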

Open Questions

  • Should it be possible to declare N bases in the raw read? This is useful if some trimmed bases are not important to retain.
  • Is it possible to handle BAM files with a mix of library configurations? Should the bam-to-fastq declaration be part of the @RG header in order to support this?
  • Backward compatibility: will existing BAM readers break if new fields are introduced in the @RG header?
  • What naming conventions should be used for the output files? Should there be templates that automatically generate Illumina-compatible file names?

Thoughts? I'm happy to write up a real PR once I've had some feedback on this.

@jkbonfield @daviesrob @dkj @pryvkin10x @raskoleinonen @nunofonseca
