Skip to content

Samtools markdup with high depth regions #1325

@denriquez

Description

@denriquez

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)

I am using samtools 1.11 (htslib 1.11)

Please describe your environment.

  • OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)
  • machine architecture (run uname -m on Linux/Mac OS or wmic os get OSArchitecture on Windows)
  • compiler (run gcc --version or clang --version)

CentOS Linux 7 x86_64 with gcc 4.8.5

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

I have encountered an issue with samtools markdup running past 8 days without actually completing on a few BAMs. I don’t get any relevant output since it just doesn’t finish. Digging deeper, I found out that a ~400 bp region in chromosome 2, with more than 8 million reads, is causing this issue. It looks like markdup just isn't optimized to deal with abnormally high depth regions, unless, I missed some parameter that would help. I have created micro BAMs of that region (with an additional 500bp on both ends) and tried running markdup on them but I still get the same results. For both types of BAMs, I used the command (BAM names were replaced with "micro"):

samtools markdup \
    -d “2500” \
    -S \
    -f “micro.samtools.markdup.txt” \
    --threads 4 \
    --write-index \
    -T “micro” \
    “micro.bam” \
    “micro.md.bam”

Also, I ran GATK MarkDuplicates (v4.1.8.0) on these micro BAMs and they finished under 35 minutes. The GATK command I used was the following:

gatk MarkDuplicates \
    --java-options “-Xmx28G” \
    --TMP_DIR “temp” \
    --ASSUME_SORT_ORDER coordinate \
    --CLEAR_DT false \
    --ADD_PG_TAG_TO_READS false \
    --OPTICAL_DUPLICATE_PIXEL_DISTANCE “2500” \
    --INPUT "micro.bam" \
    --METRICS_FILE “micro.gatk.mdmetrics.txt” \
    --OUTPUT “micro.md.bam”

If needed, I could provide a micro BAM that is causing this issue. Thank you.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions