
Add a url_hash column for faster SQL lookups with large tables#4006

Closed
matt-h wants to merge 2 commits into YOURLS:master from matt-h:fix/3793-hash-url-for-lookup

Conversation

@matt-h
Contributor

@matt-h matt-h commented Oct 24, 2025

Add a url_hash column for faster SQL lookups with large tables

SHA1 was chosen as the hash algorithm because MySQL supports it natively, which lets the upgrade run entirely in SQL. Otherwise the upgrade would have to be done in PHP on each row, which would be much slower on large tables.

Only the first 8 characters of the hash are stored, to minimize storage and index requirements on the database. SHA1 is considered cryptographically insecure at this point, and keeping only the first 8 characters makes collisions even more likely, but that is acceptable since the hash is used only to index the URLs for faster lookups. Even with many collisions, lookups are still substantially faster: the query matches on the indexed hashes first and then only needs to filter the matching rows down to the exact URL (which the queries still do).

Fixes #3793

Fixes YOURLS#3793
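The hashing scheme described above can be sketched in Python (the actual implementation is PHP/MySQL; the function name and SQL below are illustrative, not the PR's exact code):

```python
import hashlib

def url_hash(url: str) -> str:
    """Return the first 8 hex characters of the SHA1 digest of a URL.

    Collisions are acceptable: the hash only narrows the candidate rows,
    and the query still compares the full URL afterwards.
    """
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]

# MySQL can compute the same value during the upgrade, so no per-row PHP
# loop is needed, e.g. (illustrative):
#   UPDATE url SET url_hash = LEFT(SHA1(url), 8);
```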
@matt-h
Contributor Author

matt-h commented Oct 24, 2025

I also made an option to use xxh32 as an alternative, since it is a faster algorithm. 631cf56

However, it shouldn't be needed: the function is only called when URLs are added, so it doesn't get heavy enough usage for the speed boost to matter. It also has the downside that the upgrade would have to be done row by row in PHP, since xxh32 can't be computed directly in MySQL.

@ozh
Member

ozh commented Oct 24, 2025

Interesting as always :)

I'll need to run a couple of tests on large datasets; I'm always wary of DB changes at large scale, where the upgrade operation might take longer than max_execution_time. Have you already tested that?

Another question: how much faster is the query, since you're searching at the same time for `url_hash = :hash AND url = :url`?

@matt-h
Contributor Author

matt-h commented Oct 24, 2025

I was testing the upgrade with 100,000 URLs in the database and it upgraded basically instantly.

I just did some more tests with 1 million URLs and 4.5 million rows. I found that the order of the upgrade queries mattered (obvious in hindsight: recalculating the index on every update vs. building it once afterwards). I pushed a new commit to fix that.

1 million URLs: upgrade time 9 seconds (28 seconds before the fix).

4.5 million URLs: upgrade time 58 seconds.

So whether it can exceed max_execution_time, I'm not sure; it depends on how many really large databases are out there.

As for how much faster the query is:
Before hashing:
1 million URLs: About 1 second.
4.5 million URLs: About 8 seconds.

After hashing: almost always 0.0001 seconds.

The reason `url_hash` = :hash AND `url` = :url is faster is that MySQL uses the url_hash index to find just those records first, and then applies a WHERE filter on url over only those rows.
With the plain `url` = :url query, by contrast, it has to check every record.
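The lookup pattern can be demonstrated with a minimal sketch (SQLite stands in for MySQL here; the table and column names are illustrative, not YOURLS's exact schema):

```python
import hashlib
import sqlite3

def url_hash(url: str) -> str:
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url (keyword TEXT, url TEXT, url_hash TEXT)")
# Index only the short hash; the full URL column stays unindexed.
conn.execute("CREATE INDEX idx_url_hash ON url (url_hash)")

long_url = "https://example.com/some/long/path"
conn.execute("INSERT INTO url VALUES (?, ?, ?)",
             ("abc", long_url, url_hash(long_url)))

# The index narrows candidates by hash; the url comparison resolves collisions.
row = conn.execute(
    "SELECT keyword FROM url WHERE url_hash = ? AND url = ?",
    (url_hash(long_url), long_url),
).fetchone()
print(row[0])  # → abc
```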

Using just `url`:
[image: EXPLAIN output]

Using both `url_hash` and `url`:
[image: EXPLAIN output]

@matt-h
Contributor Author

matt-h commented Oct 26, 2025

I was thinking more about this, and while this works really well to speed up database inserts with a large URL table, the change basically only speeds up yourls_long_url_exists calls. If you have YOURLS_UNIQUE_URLS set to false, it does nothing at all for you.

From my testing, it is really only useful for setups with over 2 million URLs; below that, the performance hit is basically negligible. So, while this does keep lookups fast at any table size, I'm not sure it's worth the added query complexity and the burden of ensuring hashes are always set on URL insert/update, just to support those with really large databases.

I think some of it depends on how many people are out there actually running into performance issues like in #3793

The other thought is that for those this would help the most, the upgrade now takes a while. So we might want the upgrade to count how many URLs you have and, above some threshold, display a message telling you to run the SQL commands manually on your server instead of running them automatically.

Maybe, depending on how many people actually need this performance boost, it would be better implemented as a plugin.

@ozh
Member

ozh commented Oct 26, 2025

As of writing, nearly 86% of people use YOURLS_UNIQUE_URLS, and I suspect the other 14% turned it off just because the option was there :)
[image: YOURLS_UNIQUE_URLS usage breakdown]

What do you mean by "ensuring that hashes are always set"? Do you see a scenario where they wouldn't be, besides manually writing SQL queries to add entries?

@matt-h
Contributor Author

matt-h commented Oct 26, 2025

What do you mean by "ensuring that hashes are always set"? Do you see a scenario where they wouldn't be, besides manually writing SQL queries to add entries?

Yeah, my thought there was someone manually inserting data into the database from a bulk insert or something. Or a plugin overriding the add-url function and not hashing the URL.

@ozh
Member

ozh commented Oct 26, 2025

Unsurprisingly, the vast majority of (known) YOURLS instances have fewer than 100K URLs -- only 1128 have more, breakdown as follows:
[image: instance size breakdown]

So I'm not too worried about a couple of folks hitting the max execution time and filing a support request here.

Plus, the PHP documentation on the time limit states something that, if I understand it correctly, means we're in the clear:

Note:
The set_time_limit() function and the configuration directive max_execution_time only affect the execution time of the script itself. Any time spent on activity that happens outside the execution of the script such as system calls using system(), stream operations, database queries, etc. is not included when determining the maximum time that the script has been running.

So, to sum it up: I'm in favor of this patch, which introduces a DB change to allow for faster lookups & inserts.

Letting @dgw and @LeoColomb chime in if they'd like to, as I have the feeling I'm the worst when it comes to DB schema

@matt-h
Contributor Author

matt-h commented Nov 2, 2025

I'm closing this one since I think #4013 is much better.

@matt-h matt-h closed this Nov 2, 2025


Development

Successfully merging this pull request may close these issues.

YOURLS_UNIQUE_URLS slows down large instances

2 participants