
Further optimisations to probaln_glocal #1188

Merged
whitwham merged 1 commit into samtools:develop from jkbonfield:probaln_ideas
Jan 11, 2021

Conversation

@jkbonfield
Contributor

The changes are:

  • Replace a lot of the repeated set_u macro usages that go from k to
    {u, v01, v10, v11} with pointer increments instead. This rather
    obfuscates things as the difference between v01/u and v11/v10 (+/-3)
    is now hidden (but see PROBALN_ORIG comment below).

  • An attempt at manual ordering of code to avoid as many instruction
    latency issues as possible. Mostly compilers do this already, but
    some combinations benefit here.

  • Cache and reuse of variables computed during the previous loop
    iterations. This can have a big impact, but again whether or not it
    was already done for us is compiler specific.

  • Removal of some loop invariants (ideally already done for us, but
    safer to do it ourselves).

  • Perform the bi (backwards) rescaling during the backwards
    calculation loop instead of in a second pass.

  • Reduce conditionals in the map stage.

  • Plus some code formatting tidyups, in particular removal of
    many;statements;on;same;long;line.

The original code wasn't exactly easy to understand, but the new code
is possibly even worse if that's possible. Hence I've kept the old
code in place guarded by a #ifdef PROBALN_ORIG (technically not
original as bdf85e4 already did some optimisations) to act as a bit of
a translation guide. If we prefer to simply keep this in git history
then feel free to nuke the ifdefed bits.

Benchmarks vs PROBALN_ORIG (time in this function only)

    35% quicker with gcc 7 -O2     (NEW: 1168742 usec, ORIG: 1575843)
    14% quicker with gcc 7 -O3     (NEW: 1165683       ORIG: 1333712)
    33% quicker with gcc 9 -O2     (NEW: 1169097       ORIG: 1556956)
    29% quicker with gcc 9 -O3     (NEW: 1157547       ORIG: 1491839)
    24% quicker with clang 7.0 -O2 (NEW: 1207968       ORIG: 1501549)
    24% quicker with clang 7.0 -O3 (NEW: 1211660       ORIG: 1504905)

On samtools mpileup of 10 million reads this reduced the time from
6m58 to 5m27 (28% faster throughput). This is because 86% of all CPU
time was spent in this one function! It's still 81.5% even after
optimisation.

Note I initially attempted a SIMD implementation, which I think would
still be possible, but it cannot be faster without majorly changing
the data layout and order of evaluation, plus realistically it'd need
AVX or above as SSE4 can only do SIMD on 2 doubles so the overhead is
significant.

@jkbonfield
Contributor Author

Oops I forgot I'd set some bcftools benchmarks off too. I meant to put those in the commit message, but no matter.

time ./bcftools.gl mpileup --fasta-ref $HREF38 /var/tmp/novaseq-10m.bam | egrep -v '^#' | md5sum
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
ce10e64e4de9623e8640d82730d2d2dd  -

real	8m12.121s
user	8m29.351s
sys	0m19.585s

time ./bcftools.dev mpileup --fasta-ref $HREF38 /var/tmp/novaseq-10m.bam | egrep -v '^#' | md5sum
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
ce10e64e4de9623e8640d82730d2d2dd  -

real	9m52.420s
user	10m9.672s
sys	0m19.515s

That makes it 20% faster. It's a bit less of a speedup than samtools as bcftools is doing other work too. I'll take a look at that next, in fact. :-)

@jkbonfield
Contributor Author

Note if we're willing to accept non-exact matching results for a further ~5% mpileup speed up, it's possible to ditch that log and replace it with an approximation that uses the IEEE fp logic to pull out the exponent. Eg:

// Taylor (deg 3) implementation of log2(x):
// https://www.flipcode.com/archives/Fast_log_Function.shtml
static inline double fast_log2(double val)
{
    int64_t *const exp_ptr = (int64_t *)&val;
    int64_t x = *exp_ptr;
    const int log_2 = (int)((x >> 52) & 2047) - 1024;
    x &= ~(2047LL << 52);              // clear the exponent bits...
    x += 1023LL << 52;                 // ...and rescale mantissa into [1,2)
    *exp_ptr = x;

    val = ((-1.0f/3) * val + 2) * val - 2.0f/3;

    return val + log_2;
}

We could maybe use this elsewhere in htslib, in fact.

The differences are slight: 4353 of 36994133 qualities differed in rounding, implying 0.011% changed. It does, however, make the results differ. (Marginally more than fixing that .499 vs .5 double to int rounding.)

@lh3
Member

lh3 commented Dec 12, 2020

According to this SO question, a more accurate log2 is this (single-precision):

static inline float mg_log2(float x)
{
	union { float f; uint32_t i; } z = { x };
	float log_2 = ((z.i >> 23) & 255) - 128;
	z.i &= ~(255 << 23);
	z.i += 127 << 23;
	log_2 += (-0.34484843f * z.f + 2.02466578f) * z.f - 0.67487759f;
	return log_2;
}

For integers between 0 and 100000, it is indeed more accurate. The SO page also points to this implementation, which is even more accurate for integers in that range.

@jkbonfield
Contributor Author

jkbonfield commented Dec 12, 2020

As I've got your attention Heng, I'd like your thoughts on this alternative approach.

A considerably faster approximation is to simply avoid calling BAQ except where we think it matters. I've been experimenting with moving the BAQ call into the mpileup loop, so that as we add new reads to mpileup it scans the CIGAR for indels. Then on each column iteration, if any read in the current pileup contains an indel it computes BAQ (if not already done) on all reads in the pileup (potentially this can be reduced further).

The net effect is a considerable reduction in the number of BAQ calls, making bcftools mpileup run at approximately double the speed. So far my assessment with SynDip at 45x and a 15x subsample is that this is actually more accurate than calling BAQ everywhere, as BAQ sometimes harms SNP calls.

It's still a work in progress, but if this works then the small gain from using a faster log will likely become somewhat redundant.

@lh3
Member

lh3 commented Dec 12, 2020

BAQ played its role in early days but I think it is outdated and its use should be minimized.

@jkbonfield
Contributor Author

jkbonfield commented Dec 13, 2020

The ideal would clearly be local MSA to produce allele consensus sequences, at which point BAQ becomes redundant. However I think it still has a use until such a thing arrives. Minimising it to run only on cases where we think the pair-wise alignment may not represent the full sample(s) MSA is clearly beneficial both for speed and for accuracy, as we're not reducing qualities in places where the alignment was already correct.

@whitwham whitwham merged commit 85240ba into samtools:develop Jan 11, 2021