Further optimisations to probaln_glocal #1188
Conversation
The changes are:
- Replace a lot of the repeated set_u macro usages that go from k to
{u, v01, v10, v11} with pointer increments instead. This rather
obfuscates things as the difference between v01/u and v11/v10 (+/-3)
is now hidden (but see PROBALN_ORIG comment below).
- An attempt at manual ordering of code to avoid as many instruction
latency issues as possible. Mostly compilers do this already, but
some combinations benefit here.
- Cache and reuse of variables computed during the previous loop
iterations. This can have a big impact, but again whether or not it
was already done for us is compiler specific.
- Removal of some loop invariants (ideally already done for us, but
safer to do it ourselves).
- Perform the bi (backwards) rescaling during the backwards
calculation loop instead of in a second pass.
- Reduce conditionals in the map stage.
- Plus some code formatting tidyups, in particular removal of
many;statements;on;same;long;line.
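To illustrate the first change in spirit (a schematic of my own, not the PR's actual code — htslib's real set_u macro also clamps against the band start), repeated per-k index recomputation can be traded for a pointer that steps by the cell stride:

```c
#include <assert.h>

// Schematic: each cell k of a banded DP row holds 3 states (M/I/D),
// so an index macro in the style of set_u recomputes k*3 every time ...
#define SET_U(u, k) ((u) = (k) * 3)

double row[30];

static double sum_indexed(int n) {
    double s = 0;
    for (int k = 0; k < n; k++) {
        int u;
        SET_U(u, k);
        s += row[u] + row[u+1] + row[u+2];  // M, I, D states of cell k
    }
    return s;
}

// ... whereas a pointer can simply advance by 3 per iteration.  The
// +/-offsets between neighbouring states are now hidden in the pointer
// arithmetic, which is the obfuscation the description mentions.
static double sum_pointer(int n) {
    double s = 0;
    const double *p = row;
    for (int k = 0; k < n; k++, p += 3)
        s += p[0] + p[1] + p[2];
    return s;
}
```

Both loops visit the same memory; the pointer form just drops the per-iteration address arithmetic.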
The original code wasn't exactly easy to understand, but the new code
is possibly even worse if that's possible. Hence I've kept the old
code in place guarded by a #ifdef PROBALN_ORIG (technically not
original as bdf85e4 already did some optimisations) to act as a bit of
a translation guide. If we prefer to simply keep this in git history
then feel free to nuke the ifdefed bits.
Benchmarks vs PROBALN_ORIG (time in this function only)
35% quicker with gcc 7 -O2 (NEW: 1168742 usec, ORIG: 1575843)
14% quicker with gcc 7 -O3 (NEW: 1165683 ORIG: 1333712)
33% quicker with gcc 9 -O2 (NEW: 1169097 ORIG: 1556956)
29% quicker with gcc 9 -O3 (NEW: 1157547 ORIG: 1491839)
24% quicker with clang 7.0 -O2 (NEW: 1207968 ORIG: 1501549)
24% quicker with clang 7.0 -O3 (NEW: 1211660 ORIG: 1504905)
On samtools mpileup of 10 million reads this reduced the time from
6m58 to 5m27 (28% faster throughput). This is because 86% of all CPU
time was spent in this one function! It's still 81.5% even after
optimisation.
Note I initially attempted a SIMD implementation, which I think would
still be possible, but it cannot be faster without majorly changing
the data layout and order of evaluation, plus realistically it'd need
AVX or above as SSE4 can only do SIMD on 2 doubles so the overhead is
significant.
Oops, I forgot I'd set some bcftools benchmarks off too. I meant to put those in the commit message, but no matter. That makes it 20% faster. It's a bit less of a speedup than samtools gets, as bcftools is doing other work too. I'll take a look at that next, in fact. :-)
Note if we're willing to accept non-exact matching results for a further ~5% mpileup speed-up, it's possible to ditch that log and replace it with an approximation that uses the IEEE floating-point representation to pull out the exponent. We could maybe use this elsewhere in htslib, in fact. It does however make the results differ (marginally more than fixing that .499 vs .5 double-to-int rounding).
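The inline example didn't survive this capture, but the basic trick being referred to (a sketch of my own, not the author's code) is that the biased exponent field of an IEEE-754 double directly gives floor(log2(x)) for normal positive x:

```c
#include <stdint.h>
#include <string.h>

/* Crude log2 approximation: read the exponent straight out of the
 * IEEE-754 double representation.  For normal positive x this returns
 * floor(log2(x)); the mantissa bits are simply ignored, so the result
 * is within 1 of the true log2. */
static inline double crude_log2(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* type-pun without aliasing UB */
    int exp = (int)((bits >> 52) & 0x7ff);   /* biased exponent field */
    return exp - 1023;                       /* unbias: floor(log2(x)) */
}
```

Refining the result with the mantissa bits gives approximations like the mg_log2 shown in the next comment.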
According to this SO question, the more accurate log2 is this (single-precision):

```c
static inline float mg_log2(float x)
{
    union { float f; uint32_t i; } z = { x };
    float log_2 = ((z.i >> 23) & 255) - 128;
    z.i &= ~(255 << 23);
    z.i += 127 << 23;
    log_2 += (-0.34484843f * z.f + 2.02466578f) * z.f - 0.67487759f;
    return log_2;
}
```

For integers between 0 and 100000 it is indeed more accurate. The SO page also points to this implementation, which is even more accurate for integers in that range.
As I've got your attention Heng, I'd like your thoughts on this alternative approach. A considerably faster approximation is to simply avoid calling BAQ except where we think it matters. I've been experimenting with moving the BAQ call into the mpileup loop, so that as we add new reads to mpileup it scans the CIGAR for indels. Then on each column iteration, if any read in the current pileup contains an indel, it computes BAQ (if not already done) on all reads in the pileup (potentially this can be reduced further). The net effect is a considerable reduction in the number of BAQ calls, making bcftools mpileup run at approximately double the speed. So far my assessment with SynDip at 45x and a 15x subsample is that this is actually more accurate than calling BAQ everywhere, as running BAQ everywhere sometimes harms SNP calls. It's still a work in progress, but if this works then the small gain from a faster log will likely become somewhat redundant.
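A minimal sketch of that scheduling idea (toy types and helpers of my own, not the actual mpileup code): flag indel-containing reads once when they enter the pileup, and only trigger the expensive BAQ pass on columns where such a read is present, caching the result per read.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of lazy BAQ scheduling: BAQ is computed for a column's
 * reads only when at least one read overlapping that column carries
 * an indel in its CIGAR, and at most once per read overall. */
typedef struct {
    bool has_indel;  /* set once, when the read is added to the pileup */
    bool baq_done;   /* BAQ cached: computed at most once per read */
} read_t;

static int baq_calls;                 /* counts probaln_glocal-style calls */

static void compute_baq(read_t *r)    /* stand-in for the real BAQ pass */
{
    if (!r->baq_done) {
        r->baq_done = true;
        baq_calls++;
    }
}

/* Process one pileup column over reads[0..n): run BAQ on every read in
 * the column only if some read in it contains an indel. */
static void process_column(read_t *reads, size_t n)
{
    bool any_indel = false;
    for (size_t i = 0; i < n; i++) {
        if (reads[i].has_indel) {
            any_indel = true;
            break;
        }
    }
    if (any_indel)
        for (size_t i = 0; i < n; i++)
            compute_baq(&reads[i]);
}
```

In this model a pileup with no indel reads never calls BAQ at all, and repeated indel columns reuse the cached per-read result, which is where the reduction in call count comes from.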
BAQ played its role in early days but I think it is outdated and its use should be minimized.
The ideal would clearly be local MSA to produce allele consensus sequences, at which point BAQ becomes redundant. However I think it still has a use until such a thing arrives. Minimising it to run only on cases where we think the pair-wise alignment may not represent the full sample(s) MSA is clearly beneficial both for speed and for accuracy, as we're not reducing qualities in places where the alignment was already correct.