Add a url_hash column for faster SQL lookups with large tables#4006
Add a url_hash column for faster SQL lookups with large tables#4006matt-h wants to merge 2 commits intoYOURLS:masterfrom
Conversation
SHA1 was chosen as the hash algorithm since it is supported by mysql for the upgrade function. Otherwise the upgrade would need to be done in PHP on each row which would be much slower on large tables. Only the first 8 characters of the hash are stored to minimize storage and index requirements on the database. Even though SHA1 is considered cryptographially inscure at this point and we are using only the first 8 characters making the possiblitly of collisions even higher. That is okay since we are only using it to index the urls in the database for faster lookups. Even with many collisions it is still substantially faster to lookup since it will still match on the indexed hashes and then only need to filter down to the matching url (which we still do in the queries). Fixes YOURLS#3793
|
I also made an option to use However, it shouldn't be needed since the function is called only on additions so won't get heavy usage to matter with the speed boost. It also has the downside on upgrade needing to be done in PHP since it can't be done directly in MySQL. |
|
Interesting as always :) I'll need to run a couple tests on large datasets, I'm always afraid with DB changes on large scales where the upgrade operation might take longer than Another question : how faster is the query, since you're searching at the same time for |
…ster (4x in my tests).
|
I was testing the upgrade with 100,000 URLs in the database and it upgraded basically instantly. I just did some more tests with 1 million URLs and 4.5 million rows. I found that the order I had the update queries mattered (obvious in hindsight having to calculate indexes on updates vs just after). I pushed a new commit to fix that. 1 million URLs: Upgrade time 9 seconds. (before fix was 28 seconds). 4.5 million URLs. Upgrade time 58 seconds. So for longer than Also as to how faster is the query: After hashing almost always 0.0001 seconds The reason that is faster when doing |
|
I was thinking more about this and while this does work really well to speed up database inserts with a large URL table, this change is basically only speeding up the From my testing, it is really only useful for setups with over 2 million URLs. Below that amount the performance hit is basically negligible. So, while this does make this fast at any URL list size, I'm not sure if it is worth it for the added complexity of the query and ensuring that hashes are always set on insert/update of URLs just to support those with really large databases. I think some of it depends on how many people are out there actually running into performance issues like in #3793 The other thought is for those that this would actually help the most, the upgrade does take awhile now. So, we might want to add something in the upgrade that counts the number of URLs that you have and displays a message that since you have so many URLs you should run these SQL commands manually on your server instead of running them automatically. Maybe, depending on how many people actually need this performance boost, it would be better implemented as a plugin. |
Yeah, my thought there was someone manually inserting data to the the database from a bulk insert or something. Or a plugin overriding the add url function and not hashing it. |
|
Without surprise, the very vast majority of (known) YOURLS instances have less than 100K urls -- only 1128 have more, breakdown as follows: So I'm not too worried about a couple folks hitting a max time exec and filling a support request here. Plus, the PHP page about setting time limit states something that, if I understand it well, says we're clear :
So, to sum it up : I'm in favor of this patch that introduces a DB change to allow for faster lookups & inserts Letting @dgw and @LeoColomb chime in if they'd like to, as I have the feeling I'm the worst when it comes to DB schema |
|
I'm closing this one since I think #4013 is much better. |



Add a url_hash column for faster SQL lookups with large tables
SHA1 was chosen as the hash algorithm since it is supported by mysql for the upgrade function. Otherwise the upgrade would need to be done in PHP on each row which would be much slower on large tables.
Only the first 8 characters of the hash are stored to minimize storage and index requirements on the database. Even though SHA1 is considered cryptographicically insecure at this point and we are using only the first 8 characters making the possibility of collisions even higher. That is okay since we are only using it to index the urls in the database for faster lookups. Even with many collisions it is still substantially faster to lookup since it will still match on the indexed hashes and then only need to filter down to the matching url (which we still do in the queries).
Fixes #3793