Performance: Use faster hashing algo for cache key generation#3508
Performance: Use faster hashing algo for cache key generation#3508samsonasik merged 2 commits intorectorphp:mainfrom
Conversation
| { | ||
| $oldToNewClasses = $this->resolveOldToNewClassCallbacks($node, $oldToNewClasses); | ||
| $cacheKey = md5(serialize($oldToNewClasses)); | ||
| $cacheKey = \hash('crc32c', \serialize($oldToNewClasses)); |
There was a problem hiding this comment.
the length seems got shorter, ref https://3v4l.org/e6hLl , I think current md5 is just fine.
In other area, we use sha1() and sha1_file() for config key and value and I think that's just fine.
There was a problem hiding this comment.
Yep the key length got shorter, from 128 bit to 32 bit, but honestly i don't think that 128 bit keys are needed here to prevent collision.
From what i've seen there are not so many different $oldToNewClasses permutations, the chance of collisions in that case are pretty low, refer to https://preshing.com/20110504/hash-collision-probabilities/
And crc32c hashing is a lot faster then md5, the reason i choose that algo was from the performance comparison done here: https://php.watch/articles/php-hash-benchmark
Why do i think we should change this usage of md5? Because it showed up in the traces that while md5 was called overall ~300.000, but the ~55.000 calls from this place accounted for 96% of the cpu time in this method. The reason is that probably for the other calls to md5 the data input was a lot shorter and thus it had basically no effect on the performance, but here we feed it the serialized data of an array that could be quite long
This was in the blackfire trace before this change:

After this change neither the md5 calls, nor the hash calls appear in the trace because they were not significant enough in terms of performance.
There was a problem hiding this comment.
As far as I look into it, crc seems not cryptographycally, so I think it is better to use common just works, less performance is fine if we got more confident with data integrity.
There was a problem hiding this comment.
Md5 isn't cryptographic secure either (and sha1 too).
I don't think its relevant for the use-case here though.
There was a problem hiding this comment.
md5 is cryptographically, while not secure for password, it is fine for cache use case
There was a problem hiding this comment.
I switched back to a 128bit hash, as i think the point about the collisions is fair enough.
Now it uses the xxh128 algorithm if it is available (as it was only introduced in php 8.1), but i think it matches the needs perfectly:
xxHash is an extremely fast hashing algorithm that is not designed for cryptographic purposes, but provides excellent randomness and dispersion of output, and uniqueness of to minimize collisions.
see: https://php.watch/versions/8.1/xxHash
And i totally agree with @staabm as in this case cryptographically secure hashing is not needed, what we need is fast hashing with a low chance of collision
So for systems running on php 8.1 they see the same performance gains as with the original change, system running on older php versions still see a slight improvement by switching from md5 to the slightly (nearly 2x) faster md4
samsonasik
left a comment
There was a problem hiding this comment.
Thank you @keulinho , let's give it a try
* Use faster hashing algo for cache key generation * Use 128 bit hashing algo again
The serialized array can become quite big and md5 slows down the bigger the input data.
On our test run this saved ~2s: https://blackfire.io/profiles/compare/0abc026f-4855-4bb0-ac84-c4a2eef825e0/graph

I guess now we are at the limit of the "quick fixes" in terms of performance improvements without actually touching the architecture of the whole system