
Further optimisations to probaln_glocal #1188

Merged
whitwham merged 1 commit into samtools:develop from jkbonfield:probaln_ideas
Jan 11, 2021

Conversation

@jkbonfield
Contributor

The changes are:

  • Replace a lot of the repeated set_u macro usages that go from k to
    {u, v01, v10, v11} with pointer increments instead. This rather
    obfuscates things as the difference between v01/u and v11/v10 (+/-3)
    is now hidden (but see PROBALN_ORIG comment below).

  • An attempt at manual ordering of code to avoid as many instruction
    latency issues as possible. Mostly compilers do this already, but
    some combinations benefit here.

  • Cache and reuse of variables computed during the previous loop
    iterations. This can have a big impact, but again whether or not it
    was already done for us is compiler specific.

  • Removal of some loop invariants (ideally already done for us, but
    safer to do it ourselves).

  • Perform the bi (backwards) rescaling during the backwards
    calculation loop instead of in a second pass.

  • Reduce conditionals in the map stage.

  • Plus some code formatting tidyups, in particular removal of
    many;statements;on;same;long;line.

The original code wasn't exactly easy to understand, but the new code
is possibly even worse if that's possible. Hence I've kept the old
code in place guarded by a #ifdef PROBALN_ORIG (technically not
original as bdf85e4 already did some optimisations) to act as a bit of
a translation guide. If we prefer to simply keep this in git history
then feel free to nuke the ifdefed bits.

Benchmarks vs PROBALN_ORIG (time in this function only)

    35% quicker with gcc 7 -O2     (NEW: 1168742 usec, ORIG: 1575843)
    14% quicker with gcc 7 -O3     (NEW: 1165683       ORIG: 1333712)
    33% quicker with gcc 9 -O2     (NEW: 1169097       ORIG: 1556956)
    29% quicker with gcc 9 -O3     (NEW: 1157547       ORIG: 1491839)
    24% quicker with clang 7.0 -O2 (NEW: 1207968       ORIG: 1501549)
    24% quicker with clang 7.0 -O3 (NEW: 1211660       ORIG: 1504905)

On samtools mpileup of 10 million reads this reduced the time from
6m58 to 5m27 (28% faster throughput). This is because 86% of all CPU
time was spent in this one function! It's still 81.5% even after
optimisation.

Note I initially attempted a SIMD implementation, which I think would
still be possible, but it cannot be faster without majorly changing
the data layout and order of evaluation, plus realistically it'd need
AVX or above as SSE4 can only do SIMD on 2 doubles so the overhead is
significant.

@jkbonfield
Contributor Author

Oops I forgot I'd set some bcftools benchmarks off too. I meant to put those in the commit message, but no matter.

time ./bcftools.gl mpileup --fasta-ref $HREF38 /var/tmp/novaseq-10m.bam | egrep -v '^#' | md5sum
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
ce10e64e4de9623e8640d82730d2d2dd  -

real	8m12.121s
user	8m29.351s
sys	0m19.585s

time ./bcftools.dev mpileup --fasta-ref $HREF38 /var/tmp/novaseq-10m.bam | egrep -v '^#' | md5sum
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
ce10e64e4de9623e8640d82730d2d2dd  -

real	9m52.420s
user	10m9.672s
sys	0m19.515s

That makes it 20% faster. It's a bit less of a speedup than samtools as bcftools is doing other work too. I'll take a look at that next, in fact. :-)

@jkbonfield
Contributor Author

Note if we're willing to accept non-exact matching results for a further ~5% mpileup speed up, it's possible to ditch that log and replace it with an approximation that uses the IEEE fp logic to pull out the exponent. Eg:

// Taylor (deg 3) implementation of log2(x):
// https://www.flipcode.com/archives/Fast_log_Function.shtml
static inline double fast_log2(double val)
{
    int64_t *const exp_ptr = (int64_t *)&val;
    int64_t x = *exp_ptr;
    const int log_2 = (int)((x >> 52) & 2047) - 1024;
    x &= ~(2047LL << 52);              // clear the exponent bits...
    x += 1023LL << 52;                 // ...and rescale mantissa into [1,2)
    *exp_ptr = x;

    val = ((-1.0f/3) * val + 2) * val - 2.0f/3;

    return val + log_2;
}

We could maybe use this elsewhere in htslib, in fact.

The differences are slight: 4353 of 36994133 qualities differed in rounding, implying 0.011% changed. It does, however, make the results differ. (Marginally more than fixing that .499 vs .5 double to int rounding.)

@lh3
Member

lh3 commented Dec 12, 2020

According to this SO question, a more accurate log2 is this (single-precision):

static inline float mg_log2(float x)
{
	union { float f; uint32_t i; } z = { x };
	float log_2 = ((z.i >> 23) & 255) - 128;
	z.i &= ~(255 << 23);
	z.i += 127 << 23;
	log_2 += (-0.34484843f * z.f + 2.02466578f) * z.f - 0.67487759f;
	return log_2;
}

For integers between 0 and 100000, it is indeed more accurate. The SO page also points to this implementation, which is even more accurate for integers in that range.

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2020

As I've got your attention Heng, I'd like your thoughts on this alternative approach.

A considerably faster approximation is to simply avoid calling BAQ except where we think it matters. I've been experimenting with moving the BAQ call into the mpileup loop, so that as we add new reads to mpileup it scans the CIGAR for indels. Then on each column iteration, if any read in the current pileup contains an indel it computes BAQ (if not already done) on all reads in the pileup (potentially this can be reduced further).

The net effect is a considerable reduction in the number of BAQ calls, making bcftools mpileup run at approximately double the speed. So far my assessment with SynDip at 45x and a 15x subsample is that this is actually more accurate than calling BAQ everywhere, as BAQ sometimes harms SNP calls.

It's still a work in progress, but if this works then the small gain from using a faster log will likely become somewhat redundant.

@lh3
Member

lh3 commented Dec 12, 2020

BAQ played its role in early days but I think it is outdated and its use should be minimized.

@jkbonfield
Contributor Author

jkbonfield commented Dec 13, 2020

The ideal would clearly be local MSA to produce allele consensus sequences, at which point BAQ becomes redundant. However I think it still has a use until such a thing arrives. Minimising it to run only on cases where we think the pair-wise alignment may not represent the full sample(s) MSA is clearly beneficial both for speed and for accuracy, as we're not reducing qualities in places where the alignment was already correct.

@whitwham whitwham merged commit 85240ba into samtools:develop Jan 11, 2021