Description
I followed the instructions in virusbreakend-build to produce the virusbreakenddb and have been examining VIRUSBreakend (VBE, GRIDSS 2.12.0) output for 11 ICGC PCAWG HCC samples. Four of these have known HBV integrations, for which previous short-read results and a recent long-read sequencing analysis are available. For one sample in particular (HCC RK147), about half of the integration sites were not reported by VBE. Some of those could be due to the cutoffs for frequency or fragment support, but when I examined the gridss.assembly.bam and viral.bam files in IGV, I noticed that the regions at the ends of the viral reference consisted mostly of reads with MAPQ 0. Since the read extraction and breakend assembly process removes reads with low mapping quality (GridssConfiguration minMapq=20 in the log), I suspect this prevents integration sites that overlap those regions from being assembled and called by VBE.
Here is the IGV figure:
Both of the HBV entries in GenBank that VBE chose as the best viral sequence for the Japanese HCC samples (AB819617.1 and AB206816.2) were uploaded by Japanese researchers. However, they are neither patient isolates nor curated viral genome sequences; they were genetically engineered. AB819617.1 has two versions of the HBV X gene (one on each end), and AB206816.2 is described in its GenBank entry as a "1.3 x complete genome" (designed to better infect mouse models). Because the same sequence appears at both ends of these references, reads from those regions can align equally well to either copy, which would explain the MAPQ 0 alignments.
Unless this is already covered by the recent changes made for issue #502, it seems that the database build process may need additional filters. I am currently downloading the pre-built virusbreakenddb to check the results.
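As an illustration of the kind of filter I have in mind, one simple heuristic is to flag any candidate reference whose leading or trailing bases also occur elsewhere in the same sequence, as they would in a greater-than-unit-length construct. This is only a sketch under my own assumptions, not how virusbreakend-build works: the function name and the `k=30` exact-match window are hypothetical, and an exact-match test would miss duplicated copies that differ by even one base or that sit behind flanking vector sequence.

```python
import random


def has_terminal_duplication(seq, k=30):
    """Return True if the first or last k bases of seq occur elsewhere in seq.

    Exact string matching only; a hypothetical heuristic for spotting
    over-length or end-duplicated reference sequences.
    """
    seq = seq.upper()
    if len(seq) < 2 * k:
        return False  # too short to hold two separate copies of a k-mer
    head, tail = seq[:k], seq[-k:]
    # Look for a second occurrence of the head starting after position 0.
    head_repeats = seq.find(head, 1) != -1
    # The tail always matches at len(seq) - k; an earlier hit means a repeat.
    tail_repeats = seq.find(tail) < len(seq) - k
    return head_repeats or tail_repeats


if __name__ == "__main__":
    # Synthetic sequences for illustration only, not real HBV genomes.
    rng = random.Random(0)
    unit_genome = "".join(rng.choices("ACGT", k=300))
    overlong = unit_genome + unit_genome[:90]  # mimics a 1.3x construct
    print(has_terminal_duplication(unit_genome))
    print(has_terminal_duplication(overlong))
```

A check like this could run over the candidate FASTA records before the best-matching reference is chosen, so that flagged entries are either excluded or trimmed to unit length.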
