Skip to content

Performance notes / ideas #2

@fedarko

Description

@fedarko

Ideas for speeding this up

  • Use sparse matrices
  • Faster k-mer counting / comparison algorithms (update: I used suffix arrays, although the way I used them could probs be sped up)
  • Use input sequence size to determine which k-mer counting method to use (the older naive, memory-inefficient method is faster for small inputs than the suffix-array-based method)
  • Do the conversions of s1, s2, and rc(s2) to bytes at the start of matrix construction, and then do everything thereafter in bytes? (At the very least, don't convert both s2 and rc(s2) to bytes separately; that's silly.)
  • Use Cython / etc.
  • Support FASTA files as input and then process them in chunks or something -- removes need to store massively long sequences in memory (not sure how this would work with pydivsufsort, tho)
  • Use rolling hashes? See https://bioinformatics.stackexchange.com/questions/19/are-there-any-rolling-hash-functions-that-can-hash-a-dna-sequence-and-its-revers (and also LJA paper)
  • Replace rc() function with str.maketrans: https://bioinformatics.stackexchange.com/questions/3583/what-is-the-fastest-way-to-get-the-reverse-complement-of-a-dna-sequence-in-pytho
  • If both sequences are equal (i.e. we're creating a self dot plot), use this to speed up dot plot construction. Some ideas:
    • Don't bother creating an extra suffix array
    • Only fill in one half of the matrix triangle, since the upper and lower triangle in a self dot plot should be symmetric? (this might be hard to do using the suffix array approach, tho)

Metadata

Metadata

Assignees

No one assigned

    Labels

    documentationImprovements or additions to documentation

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions