(force-pushed 63f3bb2 to 9057c67)
So, at least on newer GCCs, the compiler recognizes the constant shift and uses the immediate shift instruction, so this makes basically zero difference in the compiled binary (though it could on an older or worse compiler). I do have some micro-optimizations that save a trip to the stack and back for s1 and s2 as well. There's an obvious horizontal-add optimization that could be done if not for the goofy base Mark Adler picked, which we have to do an integer modulus against. Because NMAX is the maximum number of rounds we can do before possibly overflowing relative to the base, we have to do the modulo operation on every vector element. That requires integer division (a no-go for SIMD, obviously), or full 64-bit products (absent from AVX2), shifts, and subtractions, which is what the compiler actually emits for the ALU. With AVX512 I should be able to do exactly this, and it should be faster for a larger stream of bytes.
@KungFuJesus AVX512 indeed has better support for single-vector horizontal add, which has been missing since AMD dropped the XOP instructions... I haven't added support for it yet as most PCs currently in use still don't support it... Also, I'm not sure using 512-bit vectors will give much gain, as adler32 hardly ever needs to checksum byte sequences that long.
I just purchased a Cascade Lake X CPU, so I should be able to have something that does this shortly. Adler32 checksums tend to cover things like the entire zlib-compressed row of bytes in a PNG. Realistically, for this to be a danger you'd need something larger than could possibly be rendered; still, as a generic checksum, there are cases where it could happen. A thought did occur to me, though: instead of rebasing each vector element to the prime base of the adler checksum, you really only need to rebase in such a way that the partial sums, when horizontally summed together, don't overflow 32-bit precision. To do that, you could do a modulo with 2^32/8, which is a power of two. Simply AND'ing with 2^29-1, and only doing the prime modulus once the value is a scalar in a GPR, is probably feasible, do you agree?
Ahh, scratch that idea, it was dumb. For that to work, the power-of-two modulus would need to be a multiple of the prime, or we lose the byproduct of the modular arithmetic. Still, full 64-bit multiplication with AVX512 should let me do the same operations the compiler generates on the ALU in GPRs for each vector element before the horizontal sum. Or even just summing into a higher precision and doing the modulus at the end could get us there.
@mtl1979 I just pushed another commit to this PR that has a small but real impact on efficiency.
(force-pushed 4b2d136 to dbccab4)
@KungFuJesus Masked loads require an AVX512-capable processor, otherwise the set1 intrinsics can't be mapped to a single instruction. Bitwise operations on vector registers can be really slow on older processors.
I'm not using masked loads (though AVX does support them, mostly in the float domain). The set1 intrinsic is typically shorthand for the vpbroadcastb/w/d/q family of instructions. I'm applying a mask after creating full-sized ymm registers so that, following the relatively quick broadcast, I can zero out all but the carry-over from the last round of sums. It's shorthand for the s1[7] = adler behavior the code was already doing. In the integer domain, on every AVX2-capable CPU I've seen, bitwise masks are very low latency and can often execute on multiple ports, at least two per cycle.

Basically, the reason I broadcast to all 8 elements and then zero out the first 7 is that there's no idiom for setting the last element of a vector to a value (though there is a "scalar" move to set the first). If the vectors were only 128 bits wide we could bit-shift, but with AVX2 you can't shift across the 128-bit lanes. We could do a permute instead, but that is also a slower sequence. The vpinsr* instructions could conceivably let us insert into a vector at a fixed index position, but for 256-bit vectors the "sequence" they compile to also involves building two 128-bit vectors and combining them. This code compiles cleanly for Haswell but should work on any AVX2-capable CPU.
@mtl1979 All the intrinsics used start with _mm256_, how is that AVX512? Edit: oh, maybe you were replying to a different comment. Never mind.
I assure you it isn't, I don't even have an AVX512-capable CPU in hand yet, hah. I tested this code thoroughly; -mavx2 should be all you need for GCC flags.
@nmoinvaz Some move/load instructions can map to a combination of SSE and AVX instructions on processors that don't support AVX512. This will cause a slowdown on processors where there is a high penalty for moving data between the SSE and AVX units. AVX512-capable processors can move data from general-purpose registers directly to the AVX unit.

@KungFuJesus I have a Haswell-based processor and it's really slow executing bitwise operations on vector registers... It's basically faster to move the data to general-purpose registers and do the same there. To sum things up, it may be that gcc can fix the slowdowns by mapping the slow intrinsics to faster equivalents, but that's what the original code already tried to do. As such, we need to benchmark the pull request with and without the set1+and combination. I didn't notice anything else wrong in the changes, but it was the middle of the night and I was reading the patch on my phone...
Every compiler worth anything will translate these intrinsics directly. Here's what GCC compiles the set1 + AND to: it takes the function call arguments, passed via registers, uses the VEX-encoded move to put them in XMM registers, then broadcasts them. The ymm6 mask register is loaded from memory once, at the beginning of the function.
@KungFuJesus That assembler output is still wrong...
I'm not sure I follow... the AND operation that happens here is with the mask; the same happens for xmm8 (I don't recall which is s1 and which is s2). This is effectively the same thing as your memset() + s[7] = adler sequence, as the compiler has to generate code that populates the ymm register at that vector position; this just does it in fewer cycles. I assure you that broadcasting an xmm register to a ymm register is idiomatic for setting all lanes of the ymm register, and it is not dead code (the compiler, at this optimization level, would certainly have eliminated it in an early stage).

That is, the old code uses a zeroing idiom to fill the xmm register, then moves the zeros to a location on the stack, along with the values in the registers to an offset at an aligned location. Finally, it moves the zeros generated in xmm0 to a contiguous stack location just before where the actual values are stored, giving a run of leading zeros followed by the value. The code I'm proposing skips all of that stack churn; while that amounts to a micro-optimization, it's fewer instructions and faster.

The real optimization, though, lies in not having to do a modulus on every vector element that gets extracted. That saves 8 separate 64-bit integer multiplications, shifts, and subtractions.
@KungFuJesus Your explanation just asserts that it's basically doing vector element reversal... As far as I remember vector units are big-endian even on little-endian CPUs... XMM and YMM registers with equal numbers do overlay on x86_64.
Yes... the generated code is exploiting that fact. I'm laying out positions in memory order here, which of course fights with the little-endian convention sometimes used to describe the vector registers (e.g. the setr vs. set intrinsics). Ignoring all that, you were doing the memset() + s[7] = adler sequence. Edit: Are you maybe confused by the vpbroadcastd ymm7, xmm7? Yes, the registers overlay, but this is valid and is idiomatic for setting all values in a vector register to the first value (just as vpbroadcastd xmm7, xmm7 would be). It does not blow away what's already in the register any more than an instruction that accumulates back into the same register does.
@KungFuJesus Like I said earlier, I'm not rejecting the code, I just want to make sure it's the fastest possible way to do things and that future developers know why it is done this way... This is where proper inline comments are useful...
In general, if you want all values in a vector register to be the same, a broadcast is the idiomatic way to get there. Is there anything else you think merits explanatory comments?
@KungFuJesus Part of what I do here requires me to think as a dumber person than I actually am... If that makes me unsure about something, then most likely some other contributors are unsure also... A lot of existing code could warrant better explanation, but my job here is not to comment on existing code, only on code that is modified or added...

I have been around since the early days of zlib-ng, but I haven't reviewed every commit or pull request, because I also have other projects to work on, and sometimes I've had to reinstall my operating system to fix issues, which cut into the time I have for third-party projects. I will do a final review after one of the other reviewers has done benchmarking, so we can see how much faster this code actually is compared to baseline. I usually don't do benchmarking myself, as I run Linux virtualized, so it doesn't have high-precision timers.
Would that be @Dead2 we're waiting on, @nmoinvaz? I could run the benchmarks on a variety of hardware as well, just let me know.
@KungFuJesus we usually run benchmarks using https://github.com/zlib-ng/deflatebench. Compile two binaries, one with and one without the changes, and then run it for each. Here is an example: #934 (comment)
Hmm, weird, GitHub says it's fine. Where are the conflicts?
Here are my benchmark tests. Here is my code: https://gist.github.com/nmoinvaz/b31b11c35724a549476b37da4ef8cd17. All you have to do is clone https://github.com/google/benchmark into the same directory and then run CMake. Using median values, there is possibly a 12% improvement?
It would be better if the style-nits commit were squashed (and, for that particular change, if the same number of array elements were on each line), but it is not a condition of approval for me, so I have approved this PR.
Awesome, that proper benchmark should come in handy.
@Dead2 I can rebase onto develop if needed, but it doesn't look like GitHub presently has any issues merging this.
Here is also an old adler32 benchmark I made once that is meant to be compiled with zlib-ng: https://gist.github.com/nmoinvaz/8bdc503804e17164b2d33312764feecc
It does not tell me what the conflict is for some reason. If you could rebase, lose the merge commit, and squash the last 3 fix-commits, leaving a clean and concise 3 commits, that would be great. Hopefully that will resolve the conflict problem at the same time. 👍
Ugh, I've already branched from this branch to work on an AVX512 variant. I should be able to, but it's going to require an annoying amount of git history revision to get there. The one thing I don't love about rebase is how it royally screws anything that managed to fork from an earlier state. I'll see about squashing this somehow.
It is a pain, but it does make for a cleaner git history in the repository. There are times when I have made PRs in zlib-ng and had to do lots of rebasing and reworking, and waited many months for acceptance. I think you're getting close. I'm looking forward to the AVX512 variant even though I don't have hardware to test it on.
Since this is constant, anyway, we may as well use the variant that doesn't add vector register pressure, has better ILP opportunities, and has shorter instruction latency.
This now leverages the broadcasting intrinsics with an AND mask to load up the registers. Additionally, there's a minor efficiency boost here from casting up to 64-bit precision (by means of register aliasing) so that the modulo can be safely deferred until the write back to the full sums. The "write" back to the stack here is actually optimized out by GCC and turned into a write directly to a 32-bit GPR for each of the 8 elements. That much is not new, but now, since we don't have to do a modulus with the BASE value, we can bypass 8 64-bit multiplications, shifts, and subtractions while in those registers.

I tried to do a horizontal reduction sum on the 8 64-bit elements, since the vpextract* instructions aren't exactly low latency. However, doing this safely (no overflow) requires 2 128-bit register extractions, 8 vpmovsxdq to bring things up to 64-bit precision, some shuffles, more 128-bit extractions to get around the 128-bit lane restriction of the shuffles, and finally a trip to a GPR and back to do the modulus on the scalar value. This method could have been more efficient if there were an inexpensive 64-bit horizontal add instruction for AVX, but there isn't.

To test this, I wrote a pretty basic benchmark using Python's zlib bindings on a huge set of random data, carefully timing only the checksum bits. Invoking perf stat from within the Python process after the RNG shows a lower average number of cycles to complete and a shorter runtime.
(force-pushed 02f4630 to aa21923)
Like every time I attempt a rebase, git and I both got royally confused and I ended up having to reconstruct half the state of the head of the current branch on the remote via a squashing commit :-/. Anyway, it should be broken down into the desirable commits now; let me know if I screwed something up (I did make a backup of the pre-rebase repo locally).
LGTM |
(force-pushed aa21923 to a5ac452)
@KungFuJesus You might want to try out git fixup, have a look at https://github.com/zlib-ng/zlib-ng/wiki/Git-workflow-tips if you are not familiar with it. 😄
I mean, that's essentially what I did for the squash. That wasn't the issue I had; the issue was the rebase causing headaches with a merge commit that had been established on this branch from another branch early in development (getting working horizontal sums was done on a different branch). The conflict resolution caused headaches because it was trying to resolve deltas against older commits that had since been reworked by early amended commits and force pushes from all of those fixup changes. It's working now at least, with maybe some less satisfactory commit states in between, but I'm pretty sure it'll at least compile and bisect fine. I just really dislike force-push workflows because they break anything that established history from any point before the push.
I should mention, if it isn't already obvious, that the failure in the CI checks seems to be a Debian packaging issue with the version of wine being installed. An apt issue rather than a real one.
@KungFuJesus I understand the problems 😄 And yes, the failing CI is an everlasting problem it seems. The GitHub Actions build platforms are constantly changing, causing new problems every 1-2 months or so 🙄
For some reason the movq instruction from a 128 bit register to a 64 bit GPR is not supported in 32 bit code. A simple workaround seems to be to invoke movl if compiling with -m32. Also addressing some style nits.
(force-pushed a5ac452 to eddfcf7)
Hah, I came to the (now obvious) realization today that NMAX is derived to be the maximum number of scalar sums that can be performed before overflowing. This means the trip to 64-bit-wide words wasn't even necessary; we can do a plain old 32-bit horizontal sum safely and probably save a handful of cycles. If my math is right, it should be 12 cycles saved for the conversion, plus a cycle or two for the additional sums. Of course, this also means the modulus operation probably could have been done on the FPU, but I'd wager it'd be more costly than doing it with GPRs, thanks to register aliasing being so inexpensive and many of the instructions simply moving between GPRs and AVX registers instead of needing to go through the stack.
