Unlabeled synthetic HBV sequences in Genbank can impact VIRUSBreakend output

I followed the instructions in virusbreakend-build to produce the virusbreakenddb and have been examining VBE (GRIDSS 2.12.0) output for 11 ICGC PCAWG HCC samples. Four of them have known HBV integrations, supported by previous short-read results and a recent long-read sequencing analysis. For one sample in particular (HCC RK147), about half of the integration sites were not reported by VBE. Some of those could be due to cutoffs for frequency/fragment support, but when I examined the gridss.assembly.bam and viral.bam files in IGV, I noticed that the regions at the ends of the viral reference consisted mostly of reads with MAPQ 0. Since the read extraction and breakend assembly process removes reads with low mapping quality (GridssConfiguration minMapq=20 in the log), I suspect this prevents integration sites that overlap those regions from being assembled and called by VBE.
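To put a number on what IGV shows, here is a minimal sketch that computes, per window along the viral reference, the fraction of mapped reads whose MAPQ falls below the threshold. It assumes plain SAM text as input (for example, the text form of viral.bam). The function name, the 500 bp window size, and the default threshold of 20 (taken from the minMapq value in the log) are my own choices for illustration. This is not part of GRIDSS or VBE.

```python
from collections import defaultdict

MIN_MAPQ = 20  # assumption: mirrors the minMapq=20 value seen in the log


def low_mapq_fraction_by_window(sam_lines, window=500, min_mapq=MIN_MAPQ):
    """Return {(reference, window_start): fraction of mapped reads with MAPQ < min_mapq}.

    sam_lines is any iterable of SAM-format text lines; header lines are skipped.
    window_start is 0-based and a multiple of `window`.
    """
    totals = defaultdict(int)
    low = defaultdict(int)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped, so MAPQ is not meaningful
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        key = (rname, (pos - 1) // window * window)  # SAM POS is 1-based
        totals[key] += 1
        if mapq < min_mapq:
            low[key] += 1
    return {key: low[key] / totals[key] for key in totals}


if __name__ == "__main__":
    import sys

    # Reads SAM text from standard input and prints one line per window.
    for (ref, start), frac in sorted(low_mapq_fraction_by_window(sys.stdin).items()):
        print(f"{ref}\t{start}\t{frac:.2f}")
```

Windows at either end of the reference with a fraction close to 1.0 would be consistent with the pattern in the screenshot below, where most reads in those regions would be discarded before assembly.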

Here is the IGV figure:

![RK147_adjusted_AB819617 1_igv_snapshot](https://user-images.githubusercontent.com/17885404/123768219-1a0b2780-d903-11eb-8532-e25109b6913c.png)

Both of the HBV entries in Genbank that VBE chose as the best viral sequence for the Japanese HCC samples (AB819617.1 and AB206816.2) were uploaded by Japanese researchers. However, they are neither patient isolates nor curated viral genome sequences; they were genetically engineered. AB819617.1 has two versions of the HBV X gene (one on each end), and AB206816.2 is described in its Genbank entry as a "1.3 x complete genome" (designed to better infect mouse models).

Unless this is already covered by the recent changes made for issue #502, it seems that the database build process may need additional filters to exclude such engineered sequences. I am currently downloading the pre-built virusbreakenddb to check the results.
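As a rough idea of what such a filter could look like, here is a minimal sketch that flags a record as possibly engineered from its Genbank definition line and its sequence length. Everything in it is an assumption on my part: the function name, the keyword list, the ~3200 bp expected HBV genome length, and the 10% tolerance. It does not reflect how virusbreakend-build currently selects sequences. The over-length wording pattern is based on the "1.3 x complete genome" description of AB206816.2.

```python
import re

# Matches wording such as "1.3 x complete genome" in a definition line.
OVERLENGTH_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*x\s+complete genome", re.IGNORECASE)

# Hypothetical keyword list; real entries may use different wording or none at all.
ENGINEERED_KEYWORDS = ("synthetic", "plasmid", "vector", "construct")


def engineered_reasons(definition, length, expected_length=3200, tolerance=0.10):
    """Return a list of reasons why a record looks engineered (empty if none).

    definition      -- the record's definition/description text
    length          -- sequence length in bp
    expected_length -- assumed typical genome length for the virus (HBV here)
    tolerance       -- allowed fractional excess over expected_length
    """
    reasons = []
    text = definition.lower()
    if OVERLENGTH_PATTERN.search(definition):
        reasons.append("over-length genome wording")
    for keyword in ENGINEERED_KEYWORDS:
        if keyword in text:
            reasons.append(f"keyword: {keyword}")
    if length > expected_length * (1 + tolerance):
        reasons.append("longer than expected genome")
    return reasons
```

Since the problematic entries are not labeled as synthetic, the keyword check alone would likely miss them. The length check is the part that might catch an entry carrying a duplicated X gene or an over-length genome, at the cost of needing a per-virus expected length.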
