Analysis of UUID usage in GitHub BigQuery Dataset

Background

RFC 4122 defines three different classes of UUID versions with the following high level properties:

v4 are completely random, very simple algorithm, 122 random bits.
v1 are time- and MAC-address based, rather complex algorithm. If not used carefully, collisions are much more likely than with v4. v1 UUIDs have time-ordering properties.
v3 and v5 are name-based, special use case. Constant input leads to constant output.
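
To make the difference between the random and the name-based classes concrete, here is a minimal sketch that builds both kinds of UUID with nothing but Node's built-in `crypto` module. The helper names (`formatUuid`, `stampVersion`, `randomV4`, `nameBasedV5`) are hypothetical and are not the API of the uuid npm module. The sketch only illustrates that v4 fixes 6 of the 128 bits (version and variant) and leaves 122 random, while v5 hashes a namespace plus a name, so constant input yields constant output.

```javascript
const crypto = require('crypto');

// DNS namespace UUID from RFC 4122, Appendix C.
const DNS_NAMESPACE = '6ba7b810-9dad-11d1-80b4-00c04fd430c8';

// Render 16 bytes in the canonical 8-4-4-4-12 hex layout.
function formatUuid(bytes) {
  const hex = Buffer.from(bytes).toString('hex');
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join('-');
}

// Overwrite the 4 version bits and the 2 variant bits; everything else is kept.
function stampVersion(bytes, version) {
  bytes[6] = (bytes[6] & 0x0f) | (version << 4);
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  return bytes;
}

// v4: 16 random bytes, of which 122 bits survive the stamping.
function randomV4() {
  return formatUuid(stampVersion(crypto.randomBytes(16), 4));
}

// v5: SHA-1 over namespace bytes followed by the name, truncated to 16 bytes.
function nameBasedV5(namespaceUuid, name) {
  const namespaceBytes = Buffer.from(namespaceUuid.replace(/-/g, ''), 'hex');
  const digest = crypto
    .createHash('sha1')
    .update(namespaceBytes)
    .update(name, 'utf8')
    .digest();
  return formatUuid(stampVersion(digest.subarray(0, 16), 5));
}

console.log(randomV4()); // different on every call
console.log(nameBasedV5(DNS_NAMESPACE, 'example.org')); // identical on every call
```

In real projects the uuid npm module (or a platform built-in) should be preferred over hand-rolled helpers like these.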

Following the principle of least surprise, we assume that developers should always use the simplest UUID version that fulfills a given use case in order to reduce the risk of unexpected problems.

You should use v4 if all you need is a universally unique identifier.
You should use v1 if and only if you need time-encoding.
You should use v3/v5 if and only if you need namespacing.

In particular, accidentally using v1 instead of v4 UUIDs can have very negative consequences when the developer simply expects a random value and is not aware that the generated IDs are, to a certain extent, time-ordered:

If these IDs are used as database keys and a database/cache does ID-based sharding (without further hashing the shard keys), it can lead to "hot shards".
Developers who unintentionally use v1 UUIDs in public datasets may not be aware of the fact that the creation timestamp of the UUID and the MAC address of the computer that generates it may be leaked (even though most modern implementations no longer leak the MAC address). See: “This privacy hole was used when locating the creator of the Melissa virus.”
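
The information leak described above can be illustrated with a short sketch that decodes the fields RFC 4122 stores in a v1 UUID. The function name `decodeV1` is hypothetical. The 60-bit timestamp counts 100 ns intervals since 1582-10-15, and the last 12 hex digits are the node field, which is a MAC address only if the generating implementation chose to use one.

```javascript
// Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns ticks.
const GREGORIAN_OFFSET = 0x01b21dd213814000n;

const V1_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-1[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Extract the creation time and node field from a v1 UUID string.
function decodeV1(uuid) {
  const normalized = uuid.toLowerCase();
  if (!V1_PATTERN.test(normalized)) {
    throw new Error('not a v1 UUID: ' + uuid);
  }
  const [timeLow, timeMid, timeHiAndVersion, , node] = normalized.split('-');

  // Reassemble the timestamp: time_hi (without the version nibble),
  // then time_mid, then time_low.
  const ticks = BigInt('0x' + timeHiAndVersion.slice(1) + timeMid + timeLow);
  const unixMillis = Number((ticks - GREGORIAN_OFFSET) / 10000n);

  return {
    createdAt: new Date(unixMillis),
    node: node.match(/../g).join(':'),
  };
}

// This UUID was constructed by hand so that its timestamp equals the Unix epoch.
const decoded = decodeV1('13814000-1dd2-11b2-8000-001122334455');
console.log(decoded.createdAt.toISOString(), decoded.node);
```

Anyone who can see a v1 UUID can run the same decoding, which is why publishing v1 UUIDs unintentionally may reveal more than intended.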

We want to analyze current usage of the uuid npm module in Open Source projects in order to design an appropriate API for the UUID standard library.

Hypothesis

Version Distribution

Hypothesis 1: v4 is by far the most commonly used UUID version, followed by v1, with only marginal amounts of v3/v5 usage.

We expect that at least 80% - 90% of all uuid npm module usage exclusively makes use of v4 UUIDs and that v3/v5 usage is very uncommon (much less than 10%).

Consequence 1: If the above hypothesis can be validated, we will only consider supporting v4 UUIDs in the initial proposal.

Accidental `v1` Usage

Hypothesis 2: A non-negligible amount of v1 UUID usage is "accidental" in the sense that for the given use case the special semantics of v1 UUIDs are not needed and therefore v4 would be the more appropriate choice.

This is based on the observation that v1 UUIDs are documented "above the fold" in the uuid npm module and that "v1" sounds much more like the "default" UUID version than "v4" does.

Consequence 2: If the above hypothesis can be validated, we will propose an API that favors v4 UUIDs over the other UUID versions to reduce accidental use of v1 UUIDs. Otherwise we will propose an API that is symmetric in the different UUID versions.

Methodology

Google provides a public BigQuery Dataset that contains all Open Source code from GitHub and that is updated on a weekly basis.

This directory contains some queries and helper scripts which make use of the GitHub data in order to analyze usage patterns of the uuid npm module. The analysis roughly does the following:

Find all package.json files on GitHub to see if they contain the uuid module as a dependency.
Get the repo_name for all matching package.json files.
Get all JavaScript source files from these repos.
Analyze the usage of the uuid module from these source files.
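
The steps above can be sketched locally in plain JavaScript. This is only an illustration under stated assumptions, not the queries shipped in the queries directory: the function names `dependsOnUuid` and `detectUuidVersions` are hypothetical, and the regular expressions are a rough heuristic that only recognizes deep imports such as `uuid/v4` and calls of the form `uuid.v4(...)`.

```javascript
// Step 1: does a package.json file list the uuid module as a dependency?
function dependsOnUuid(packageJsonText) {
  let pkg;
  try {
    pkg = JSON.parse(packageJsonText);
  } catch {
    return false; // malformed package.json files are simply skipped
  }
  if (pkg === null || typeof pkg !== 'object') {
    return false;
  }
  return ['dependencies', 'devDependencies'].some((field) => {
    const deps = pkg[field];
    return (
      deps !== null &&
      typeof deps === 'object' &&
      Object.prototype.hasOwnProperty.call(deps, 'uuid')
    );
  });
}

// Step 4: which UUID versions does a JavaScript source file appear to use?
function detectUuidVersions(source) {
  const found = new Set();
  // Deep imports: require('uuid/v4') or import v4 from 'uuid/v4'
  for (const match of source.matchAll(/['"]uuid\/(v[1345])['"]/g)) {
    found.add(match[1]);
  }
  // Method calls on the module object: uuid.v1(...)
  for (const match of source.matchAll(/\buuid\.(v[1345])\s*\(/gi)) {
    found.add(match[1].toLowerCase());
  }
  return [...found].sort();
}

const exampleSource = `
  const uuidv4 = require('uuid/v4');
  const uuid = require('uuid');
  const id = uuid.v1();
`;
console.log(detectUuidVersions(exampleSource)); // [ 'v1', 'v4' ]
```

A heuristic like this will miss aliased imports and can produce false positives, for example in comments or for unrelated objects that happen to be named `uuid`, so any counts derived from it should be treated as approximate.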

Results

Version Distribution

We consider Hypothesis 1 confirmed.

It seems evident that v4 UUIDs are by far the most popular UUID version. They are used in 77.0% of the repositories that depend on the uuid npm module. Weighted by GitHub watch count, v4 UUIDs are even more popular, accounting for 89.5% of all watches.

Usage of v1 UUIDs is also significant while v3/v5 UUIDs don't seem to be widely used.

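| version | repo_count | repo_count_ratio | watch_count | watch_count_ratio |
| ------- | ---------- | ---------------- | ----------- | ----------------- |
| v4      | 4315       | 77.0%            | 149802      | 89.5%             |
| v1      | 1228       | 21.9%            | 16219       | 9.7%              |
| v5      | 51         | 0.9%             | 1290        | 0.8%              |
| v3      | 11         | 0.2%             | 116         | 0.1%              |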
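
The ratio columns can be recomputed from the raw counts. The sketch below assumes that each ratio is the row's count divided by the sum of that column over all four versions; this is an assumption about how the ratios were derived, but it reproduces the published percentages. The function name `withRatios` is hypothetical.

```javascript
const versionCounts = [
  { version: 'v4', repoCount: 4315, watchCount: 149802 },
  { version: 'v1', repoCount: 1228, watchCount: 16219 },
  { version: 'v5', repoCount: 51, watchCount: 1290 },
  { version: 'v3', repoCount: 11, watchCount: 116 },
];

// Format a share of a total as a percentage with one decimal place.
function percent(part, total) {
  return ((100 * part) / total).toFixed(1) + '%';
}

// Add repo and watch ratios, each relative to its column total.
function withRatios(rows) {
  const totalRepos = rows.reduce((sum, row) => sum + row.repoCount, 0);
  const totalWatches = rows.reduce((sum, row) => sum + row.watchCount, 0);
  return rows.map((row) => ({
    ...row,
    repoCountRatio: percent(row.repoCount, totalRepos),
    watchCountRatio: percent(row.watchCount, totalWatches),
  }));
}

console.table(withRatios(versionCounts));
```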

The top 100 repositories (by GitHub watch count) for each UUID version are listed in this Google Sheet.

Results as of June 26, 2019.

Accidental `v1` Usage

We consider Hypothesis 2 confirmed.

So far the top 10 projects which make use of v1 UUIDs have been investigated, and only one of them (https://github.com/sequelize/sequelize) has a use case that genuinely requires v1 UUIDs. We consider this amount of "accidental" v1 UUID usage non-negligible.

Pull requests replacing v1 UUIDs with v4 UUIDs have been sent to some of the actively maintained repos which made use of v1 UUIDs, and so far all of them have been accepted:

It is still work-in-progress to discuss with the authors of more Open Source projects whether v1 usage was "accidental" and could be replaced with v4 UUIDs. Results of these efforts are tracked in this Google Sheet.

Feedback as of August 13, 2019.

How to Reproduce

In order to reproduce the results:

Use the analyze.js helper to run the queries.
You need a Google Cloud account with billing enabled.
You must be authenticated with Google Cloud using gcloud auth and the corresponding user must have BigQuery IAM permissions.
All query results are written to result tables.
Running all queries will cost you around $20.

Examples:

# Print all queries:
node analyze.js -p PROJECT -d DATASET -q all -m print

# Print the first query:
node analyze.js -p PROJECT -d DATASET -q 01 -m print

# Execute the first query:
node analyze.js -p PROJECT -d DATASET -q 01 -m execute
