RFC 4122 defines three different classes of UUID versions with the following high-level properties:
- v4 UUIDs are completely random: a very simple algorithm with 122 random bits.
- v1 UUIDs are time- and MAC-address-based and use a rather complex algorithm. If not used carefully, collisions are much more likely than with v4. v1 UUIDs have time-ordering properties.
- v3 and v5 UUIDs are name-based and serve a special use case. Constant input leads to constant output.
Following the principle of least surprise, we assume that developers should always use the simplest UUID version that fulfills a given use case in order to reduce the risk of unexpected problems.
- You should use v4 if all you need is a universally unique identifier.
- You should use v1 if and only if you need time-encoding.
- You should use v3/v5 if and only if you need namespacing.
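
The difference between the random and the name-based versions can be illustrated with Node.js built-ins alone. The following is a minimal sketch and not the implementation used by the uuid npm module: it assumes a Node.js release that ships `crypto.randomUUID` (14.17 or later), and `uuidVersion` and `uuidV5` are hypothetical helper names introduced only for this illustration.

```javascript
const { createHash, randomUUID } = require('crypto');

// The version is stored in the first hex digit of the third group.
function uuidVersion(uuid) {
  return parseInt(uuid.charAt(14), 16);
}

// Sketch of a name-based v5 UUID: SHA-1 over the namespace bytes followed by
// the name, truncated to 16 bytes, with the version and variant bits set.
function uuidV5(name, namespace) {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ''), 'hex');
  const hash = createHash('sha1').update(namespaceBytes).update(name, 'utf8').digest();
  const bytes = hash.subarray(0, 16);
  bytes[6] = (bytes[6] & 0x0f) | 0x50; // version 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant
  const hex = bytes.toString('hex');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

// DNS namespace UUID as listed in RFC 4122, Appendix C.
const DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';

// v4: a fresh random value on every call.
console.log(randomUUID(), uuidVersion(randomUUID()));
// v5: the same name and namespace always produce the same UUID.
console.log(uuidV5('example.org', DNS_NAMESPACE) === uuidV5('example.org', DNS_NAMESPACE));
```

If all a caller needs is uniqueness, only the `randomUUID()` call is relevant; the v5 helper is included solely to show the "constant input, constant output" property.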
In particular, accidentally using v1 instead of v4 UUIDs, in cases where the developer simply expects a random value and is not aware that the generated IDs are time-ordered to a certain extent, can have very negative consequences:
- If these IDs are used as database keys and a database/cache does ID-based sharding (without further hashing the shard keys), it can lead to "hot shards".
- Developers who unintentionally use v1 UUIDs in public datasets may not be aware of the fact that the creation timestamp of the UUID and the MAC address of the computer that generates it may be leaked (even though most modern implementations no longer leak the MAC address). See: “This privacy hole was used when locating the creator of the Melissa virus.”
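
To make the leak concrete, the sketch below recovers the creation time and the node field from a v1 UUID, following the field layout in RFC 4122. It is an illustration under stated assumptions: `decodeV1` and `encodeV1` are hypothetical helpers, and the clock sequence `8000` and node `0123456789ab` are made-up sample values, not data from any real machine. `encodeV1` exists only to build a sample UUID for a known instant.

```javascript
// Offset between the Gregorian epoch (1582-10-15) used by v1 UUIDs and the
// Unix epoch, expressed in 100-nanosecond ticks.
const GREGORIAN_OFFSET = 122192928000000000n;

// Extract the embedded timestamp and node field from a v1 UUID string.
function decodeV1(uuid) {
  const [timeLow, timeMid, timeHiAndVersion, , node] = uuid.toLowerCase().split('-');
  const version = parseInt(timeHiAndVersion.charAt(0), 16);
  if (version !== 1) {
    throw new Error(`not a v1 UUID (version ${version})`);
  }
  // The 60-bit timestamp is stored as time_hi (12 bits), time_mid, time_low.
  const ticks = BigInt('0x' + timeHiAndVersion.slice(1) + timeMid + timeLow);
  const unixMillis = Number((ticks - GREGORIAN_OFFSET) / 10000n);
  return { createdAt: new Date(unixMillis), node };
}

// Build a v1-shaped UUID for a given instant (sample data for the demo only).
function encodeV1(date, clockSeqHex, nodeHex) {
  const ticks = BigInt(date.getTime()) * 10000n + GREGORIAN_OFFSET;
  const hex = ticks.toString(16).padStart(15, '0');
  const timeHi = hex.slice(0, 3);
  const timeMid = hex.slice(3, 7);
  const timeLow = hex.slice(7, 15);
  return `${timeLow}-${timeMid}-1${timeHi}-${clockSeqHex}-${nodeHex}`;
}

const sample = encodeV1(new Date('2019-06-26T12:00:00.000Z'), '8000', '0123456789ab');
const decoded = decodeV1(sample);
console.log(decoded.createdAt.toISOString(), decoded.node);
```

Anyone who sees such an ID can run the same decoding, which is why v1 UUIDs in public datasets reveal when, and possibly where, each record was created.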
We want to analyze current usage of the uuid npm module in Open Source projects in
order to design an appropriate API for the UUID standard library.
Hypothesis 1: v4 is by far the most commonly used UUID version, followed by v1, with only
marginal amounts of v3/v5 usage.
We expect that at least 80% to 90% of all uuid npm module usage exclusively makes use of v4
UUIDs and that v3/v5 usage is very uncommon (much less than 10%).
Consequence 1: If the above hypothesis can be validated, we will only consider supporting v4
UUIDs in the initial proposal.
Hypothesis 2: A non-negligible amount of v1 UUID usage is "accidental" in the sense that for
the given use case the special semantics of v1 UUIDs are not needed and therefore v4 would be
the more appropriate choice.
This is based on the observation that
v1 UUIDs are documented "above the fold"
in the uuid npm module and that v1 sounds much more like the "default" UUID version
rather than v4.
Consequence 2: If the above hypothesis can be validated, we will propose an API that favors v4
UUIDs over the other UUID versions to reduce accidental use of v1 UUIDs. Otherwise we will
propose an API that is symmetric in the different UUID versions.
Google provides a public BigQuery Dataset that contains all Open Source code from GitHub and that is updated on a weekly basis.
This directory contains some queries and helper scripts which make use of the GitHub data in order
to analyze usage patterns of the uuid npm module. The analysis roughly does the
following:
- Find all package.json files on GitHub to see if they contain the uuid module as a dependency.
- Get the repo_name for all matching package.json files.
- Get all JavaScript source files from these repos.
- Analyze the usage of the uuid module from these source files.
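
The first and last steps above can be approximated in plain JavaScript. This is a rough sketch and not the BigQuery queries shipped in this directory: `dependsOnUuid` and `detectUuidVersions` are hypothetical helpers, the choice of `dependencies` and `devDependencies` as the fields to inspect is an assumption, and the regular expressions only cover two common call styles (deep imports such as `uuid/v4` and method calls such as `uuid.v4()`).

```javascript
// Step 1 (sketch): does a package.json declare the uuid module as a dependency?
function dependsOnUuid(packageJsonText) {
  let pkg;
  try {
    pkg = JSON.parse(packageJsonText);
  } catch {
    return false; // malformed package.json files are simply skipped
  }
  if (pkg === null || typeof pkg !== 'object') {
    return false;
  }
  return ['dependencies', 'devDependencies'].some(
    (field) => pkg[field] != null && Object.prototype.hasOwnProperty.call(pkg[field], 'uuid')
  );
}

// Step 4 (sketch): which UUID versions does a source file appear to use?
function detectUuidVersions(source) {
  const versions = new Set();
  // Deep imports, e.g. require('uuid/v4') or import v1 from "uuid/v1".
  const deepImport = /['"]uuid\/(v[1345])['"]/g;
  // Method calls on a module object named uuid, e.g. uuid.v4().
  const methodCall = /\buuid\.(v[1345])\s*\(/g;
  for (const pattern of [deepImport, methodCall]) {
    for (const match of source.matchAll(pattern)) {
      versions.add(match[1]);
    }
  }
  return [...versions].sort();
}

console.log(dependsOnUuid('{"dependencies": {"uuid": "^3.3.2"}}'));
console.log(detectUuidVersions("const uuidv4 = require('uuid/v4'); uuidv4();"));
```

A text-based heuristic like this will miss renamed imports and other indirect usage, so its output should be read as an estimate rather than an exact count.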
We consider Hypothesis 1 confirmed.
It seems evident that v4 UUIDs are by far the most popular UUID version. They are used in 77.0%
of repositories that depend on the uuid npm module. Weighted by GitHub watch count, v4
UUIDs are even more popular, adding up to 89.5% of popularity.
Usage of v1 UUIDs is also significant while v3/v5 UUIDs don't seem to be widely used.
| version | repo_count | repo_count_ratio | watch_count | watch_count_ratio |
|---|---|---|---|---|
| v4 | 4315 | 77.0% | 149802 | 89.5% |
| v1 | 1228 | 21.9% | 16219 | 9.7% |
| v5 | 51 | 0.9% | 1290 | 0.8% |
| v3 | 11 | 0.2% | 116 | 0.1% |
The top 100 repositories (by GitHub watch count) for each UUID version are listed in this Google Sheet.
Results as of June 26, 2019.
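
The ratio columns can be recomputed from the raw counts in the table. The sketch below assumes that each ratio is the row's count divided by the column total, which reproduces the published percentages; `withRatios` is a hypothetical helper and not part of the analysis scripts.

```javascript
// Raw counts copied from the results table above.
const rows = [
  { version: 'v4', repoCount: 4315, watchCount: 149802 },
  { version: 'v1', repoCount: 1228, watchCount: 16219 },
  { version: 'v5', repoCount: 51, watchCount: 1290 },
  { version: 'v3', repoCount: 11, watchCount: 116 },
];

// Add percentage columns, each relative to the sum of its column.
function withRatios(inputRows) {
  const totalRepos = inputRows.reduce((sum, row) => sum + row.repoCount, 0);
  const totalWatches = inputRows.reduce((sum, row) => sum + row.watchCount, 0);
  const percent = (count, total) => ((100 * count) / total).toFixed(1) + '%';
  return inputRows.map((row) => ({
    ...row,
    repoCountRatio: percent(row.repoCount, totalRepos),
    watchCountRatio: percent(row.watchCount, totalWatches),
  }));
}

for (const row of withRatios(rows)) {
  console.log(row.version, row.repoCountRatio, row.watchCountRatio);
}
```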
We consider Hypothesis 2 confirmed.
So far the top 10 projects that make use of v1 UUIDs have been investigated, and only one
use case that genuinely requires v1 UUIDs has been identified (https://github.com/sequelize/sequelize). We
consider this amount of "accidental" v1 UUID usage non-negligible.
Pull requests to replace v1 UUIDs with v4 UUIDs have been sent to some of the actively maintained
repos that made use of v1 UUIDs, and so far all of them have been accepted:
- storybookjs/storybook#7397
- TryGhost/Ghost#10871
- influxdata/chronograf#5235
- gatsbyjs/gatsby#15407
- sqlpad/sqlpad#451
- microsoft/azure-pipelines-tasks#11021
It is still work-in-progress to discuss with the authors of more Open Source projects whether v1
usage was "accidental" and could be replaced with v4 UUIDs. Results of these efforts are tracked
in this Google Sheet.
Feedback as of August 13, 2019.
In order to reproduce the results:
- Use the analyze.js helper to run the queries.
- In order to reproduce the results you need a Google Cloud account with billing enabled.
- You must be authenticated with Google Cloud using gcloud auth and the corresponding user must have BigQuery IAM permissions.
- All query results are written to result tables.
- Running all queries will cost you around $20.
Examples:

    # Print all queries:
    node analyze.js -p PROJECT -d DATASET -q all -m print
    # Print the first query:
    node analyze.js -p PROJECT -d DATASET -q 01 -m print
    # Execute the first query:
    node analyze.js -p PROJECT -d DATASET -q 01 -m execute