Analysis of UUID usage in GitHub BigQuery Dataset

Background

RFC 4122 defines three different classes of UUID versions: time-based (v1), name-based (v3/v5), and random (v4).

Following the principle of least surprise, we assume that developers should always use the simplest UUID version that fulfills a given use case in order to reduce the risk of unexpected problems.

  • You should use v4 if all you need is a universally unique identifier.
  • You should use v1 if and only if you need time-encoding.
  • You should use v3/v5 if and only if you need namespacing.
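As a minimal sketch of the simplest case, the snippet below generates a v4 UUID and reads the version back from the string. It uses Node's built-in `crypto.randomUUID()` (available in recent Node.js releases) rather than the uuid npm module so that it runs without installing anything; `uuidVersion` is a hypothetical helper name, not part of any library.

```javascript
const { randomUUID } = require('crypto');

// Return the version number encoded in a canonical UUID string.
// The version is the first hex digit of the third group.
function uuidVersion(uuid) {
  const match =
    /^[0-9a-f]{8}-[0-9a-f]{4}-([0-9a-f])[0-9a-f]{3}-[0-9a-f]{4}-[0-9a-f]{12}$/i.exec(uuid);
  if (!match) {
    throw new TypeError('not a canonical UUID string: ' + uuid);
  }
  return parseInt(match[1], 16);
}

// crypto.randomUUID() always produces a random (v4) UUID.
const id = randomUUID();
console.log(id, 'is version', uuidVersion(id));
```

Checking the version digit like this is also a quick way to find out which UUID version an existing dataset actually contains.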

In particular, accidentally using v1 instead of v4 UUIDs can have very negative consequences when the developer simply expects a random value and is not aware that the generated IDs are time-ordered to a certain extent:

  • If these IDs are used as database keys and a database/cache does ID-based sharding (without further hashing the shard keys), it can lead to "hot shards".
  • Developers who unintentionally use v1 UUIDs in public datasets may not be aware of the fact that the creation timestamp of the UUID and the MAC address of the computer that generates it may be leaked (even though most modern implementations no longer leak the MAC address). See: “This privacy hole was used when locating the creator of the Melissa virus.”
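To illustrate what a v1 UUID can reveal, here is a small sketch that decodes the embedded timestamp and node field following the RFC 4122 field layout. The function name `decodeV1` and the sample UUID are made up for illustration. Note that the node field is only a real MAC address if the generating implementation chose to use one.

```javascript
// Number of 100 ns ticks between the UUID epoch (1582-10-15) and the
// Unix epoch (1970-01-01).
const GREGORIAN_OFFSET = 122192928000000000n;

// Decode the creation time and node field of a v1 UUID string.
function decodeV1(uuid) {
  const [timeLow, timeMid, timeHiAndVersion, , node] =
    uuid.toLowerCase().split('-');
  if (!node || timeHiAndVersion[0] !== '1') {
    throw new TypeError('not a v1 UUID: ' + uuid);
  }
  // The 60-bit timestamp is stored as time_hi (12 bits), time_mid, time_low.
  const ticks = BigInt('0x' + timeHiAndVersion.slice(1) + timeMid + timeLow);
  const unixMillis = Number((ticks - GREGORIAN_OFFSET) / 10000n);
  return {
    timestamp: new Date(unixMillis),
    node: node.match(/../g).join(':'),
  };
}

// Hypothetical sample: timestamp field equal to the Unix epoch.
const decoded = decodeV1('13814000-1dd2-11b2-8000-001122334455');
console.log(decoded.timestamp.toISOString(), decoded.node);
```

Anyone who sees such an ID can run the same decoding, which is why publishing v1 UUIDs unintentionally can be a privacy problem.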

We want to analyze current usage of the uuid npm module in Open Source projects in order to design an appropriate API for the UUID standard library.

Hypothesis

Version Distribution

Hypothesis 1: v4 is by far the most commonly used UUID version, followed by v1 and only marginal amounts of v3/v5 usage.

We expect that at least 80% - 90% of all uuid npm module usage exclusively makes use of v4 UUIDs and that v3/v5 usage is very uncommon (much less than 10%).

Consequence 1: If the above hypothesis can be validated, we will only consider supporting v4 UUIDs in the initial proposal.

Accidental v1 Usage

Hypothesis 2: A non-negligible amount of v1 UUID usage is "accidental" in the sense that for the given use case the special semantics of v1 UUIDs are not needed and therefore v4 would be the more appropriate choice.

This is based on the observation that v1 UUIDs are documented "above the fold" in the uuid npm module and that v1 sounds much more like the "default" UUID version than v4 does.

Consequence 2: If the above hypothesis can be validated, we will propose an API that favors v4 UUIDs over the other UUID versions to reduce accidental use of v1 UUIDs. Otherwise we will propose an API that is symmetric in the different UUID versions.

Methodology

Google provides a public BigQuery Dataset that contains all Open Source code from GitHub and that is updated on a weekly basis.

This directory contains some queries and helper scripts which make use of the GitHub data in order to analyze usage patterns of the uuid npm module. The analysis roughly does the following:

  • Find all package.json files on GitHub to see if they contain the uuid module as a dependency.
  • Get the repo_name for all matching package.json files.
  • Get all JavaScript source files from these repos.
  • Analyze the usage of the uuid module from these source files.
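The last step can be sketched roughly as follows. This is not the actual query logic used in this directory, only an illustrative approximation: `detectUuidVersions` and its regular expressions are assumptions that cover a few common ways of referencing a UUID version in JavaScript source.

```javascript
// Return the sorted list of UUID versions that a source file appears to use.
// The patterns below are hypothetical and intentionally incomplete.
function detectUuidVersions(source) {
  const patterns = [
    /require\(\s*['"]uuid\/(v[1345])['"]\s*\)/g, // require('uuid/v4')
    /from\s+['"]uuid\/(v[1345])['"]/g,           // import uuidv4 from 'uuid/v4'
    /\buuid\.(v[1345])\s*\(/g,                   // uuid.v4()
  ];
  const versions = new Set();
  for (const pattern of patterns) {
    for (const match of source.matchAll(pattern)) {
      versions.add(match[1]);
    }
  }
  return [...versions].sort();
}

const sample = "const uuidv4 = require('uuid/v4');\nconst id = uuid.v1();";
console.log(detectUuidVersions(sample));
```

A regex-based approach like this will miss aliased imports and can produce false positives, so results should be read as estimates.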

Results

Version Distribution

We consider Hypothesis 1 confirmed.

It seems evident that v4 UUIDs are by far the most popular UUID version. They are used in 77.0% of repositories that depend on the uuid npm module. Weighted by GitHub watch count, v4 UUIDs are even more popular, accounting for 89.5% of the total.

Usage of v1 UUIDs is also significant while v3/v5 UUIDs don't seem to be widely used.

version  repo_count  repo_count_ratio  watch_count  watch_count_ratio
v4       4315        77.0%             149802       89.5%
v1       1228        21.9%             16219        9.7%
v5       51          0.9%              1290         0.8%
v3       11          0.2%              116          0.1%
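The ratio columns can be recomputed from the raw counts. The sketch below assumes that each ratio is the version's share of the respective column total (summed over the four versions), which is consistent with the published figures but is an assumption about how they were derived. `withRatios` is a hypothetical helper name.

```javascript
// Raw counts copied from the table above.
const rows = [
  { version: 'v4', repoCount: 4315, watchCount: 149802 },
  { version: 'v1', repoCount: 1228, watchCount: 16219 },
  { version: 'v5', repoCount: 51, watchCount: 1290 },
  { version: 'v3', repoCount: 11, watchCount: 116 },
];

// Add percentage columns, assuming each ratio is a share of the column total.
function withRatios(rows) {
  const totalRepos = rows.reduce((sum, r) => sum + r.repoCount, 0);
  const totalWatches = rows.reduce((sum, r) => sum + r.watchCount, 0);
  const pct = (part, total) => ((100 * part) / total).toFixed(1) + '%';
  return rows.map((r) => ({
    ...r,
    repoCountRatio: pct(r.repoCount, totalRepos),
    watchCountRatio: pct(r.watchCount, totalWatches),
  }));
}

console.table(withRatios(rows));
```

Because the denominator is a sum over versions, a repository that uses more than one UUID version would be counted once per version under this assumption.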

The top 100 repositories (by GitHub watch count) for each UUID version are listed in this Google Sheet.

Results as of June 26, 2019.

Accidental v1 Usage

We consider Hypothesis 2 confirmed.

So far the top 10 projects which make use of v1 UUIDs have been investigated, and only one use case that genuinely requires v1 UUIDs has been identified (https://github.com/sequelize/sequelize). We consider this amount of "accidental" v1 UUID usage non-negligible.

Pull requests replacing v1 UUIDs with v4 UUIDs have been sent to some of the actively maintained repos that used v1 UUIDs, and so far all of them have been accepted:

It is still work-in-progress to discuss with the authors of more Open Source projects whether v1 usage was "accidental" and could be replaced with v4 UUIDs. Results of these efforts are tracked in this Google Sheet.

Feedback as of August 13, 2019.

How to Reproduce

In order to reproduce the results:

  • Use the analyze.js helper to run the queries.
  • You need a Google Cloud account with billing enabled.
  • You must be authenticated with Google Cloud using gcloud auth and the corresponding user must have BigQuery IAM permissions.
  • All query results are written to result tables.
  • Running all queries will cost you around $20.

Examples:

# Print all queries:
node analyze.js -p PROJECT -d DATASET -q all -m print

# Print the first query:
node analyze.js -p PROJECT -d DATASET -q 01 -m print

# Execute the first query:
node analyze.js -p PROJECT -d DATASET -q 01 -m execute

References