<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.8.5">Jekyll</generator><link href="https://badhomb.re//feed.xml" rel="self" type="application/atom+xml" /><link href="https://badhomb.re//" rel="alternate" type="text/html" /><updated>2020-10-06T19:43:44-04:00</updated><id>https://badhomb.re//feed.xml</id><title type="html">Santiago Torres-Arias</title><subtitle>Santiago Torres-Arias's personal website</subtitle><entry><title type="html">How to easily try out TUF + in-toto</title><link href="https://badhomb.re//ci/security/2020/05/01/tuf-in-toto.html" rel="alternate" type="text/html" title="How to easily try out TUF + in-toto" /><published>2020-05-01T17:00:00-04:00</published><updated>2020-05-01T17:00:00-04:00</updated><id>https://badhomb.re//ci/security/2020/05/01/tuf-in-toto</id><content type="html" xml:base="https://badhomb.re//ci/security/2020/05/01/tuf-in-toto.html">&lt;p&gt;I’ve been speaking quite a lot with quite a lot of people about the benefits of
in-toto and TUF together. Indeed, my follow-up after saying “hey, you don’t
&lt;em&gt;need&lt;/em&gt; TUF to use in-toto” is “but they do go really well together”.
I’ve done it so much that by now I have a well-rehearsed canned answer as to
why they go well. It was only a matter of will and free time (look ma! I’m a
Doctor now!) before I decided to dust off this blog and share why it matters
and — more importantly — how you can see it for yourself in four easy
steps.&lt;/p&gt;

&lt;h2 id=&quot;what-is-tuf-and-in-toto-and-how-they-are-different&quot;&gt;What are TUF and in-toto, and how are they different?&lt;/h2&gt;

&lt;p&gt;So, as I said, I generally start my engagements by asserting that TUF !=
in-toto. The two are often conflated because they come from similar teams and
follow similar design principles. They do serve the same overarching goal,
the &lt;em&gt;secure delivery of content&lt;/em&gt;, but they provide different
&lt;em&gt;security properties&lt;/em&gt;. So, to make things super clear: they are not the
same — they complement each other.&lt;/p&gt;

&lt;h3 id=&quot;what-is-tuf-tuf-stores-stuff-and-does-other-stuff-as-well&quot;&gt;What is TUF? TUF stores sTUFf (and does other stuff as well)&lt;/h3&gt;

&lt;p&gt;TUF started as The Update Framework but, as noted by a lot of people, it is
actually a very neat way to provide trust information about arbitrary
collections of software elements (anything you can hash, really). We oftentimes
refer to these as &lt;em&gt;software artifacts&lt;/em&gt;. However, TUF can &lt;em&gt;also&lt;/em&gt; provide trust
information about other metainformation about these artifacts. Think of TUF as
a very powerful mechanism to store sTUFf securely.&lt;/p&gt;

&lt;p&gt;Not only does TUF store sTUFf, but it also lets you make sure these artifacts
actually came from whoever should’ve put them there. Say that you trust your
friend Eliza, who owns a pharmacy, to give you a bottle of pills. With TUF (and
if we were able to hash bottles of pills), you could make sure that this bottle
of pills was put on the counter (i.e., a repository) by Eliza and Eliza only.
In other words, TUF ensures the &lt;em&gt;authenticity of the provider of the data it’s
storing&lt;/em&gt;.&lt;/p&gt;
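To make the idea concrete, here is a toy sketch of the two checks involved: did Eliza vouch for the metadata, and does the artifact match what the metadata promises? This is not real TUF code — TUF signs JSON metadata roles with asymmetric keys, while this stand-in uses an HMAC purely to keep the sketch self-contained.

```python
import hashlib
import hmac
import json

# Hypothetical stand-in for Eliza's signing key (real TUF: asymmetric keys)
ELIZA_KEY = b"eliza-shared-secret"

bottle = b"pills-v1"  # the artifact: anything you can hash
metadata = {"targets": {"pills.tar.gz": {"sha256": hashlib.sha256(bottle).hexdigest()}}}
payload = json.dumps(metadata, sort_keys=True).encode()
signature = hmac.new(ELIZA_KEY, payload, hashlib.sha256).hexdigest()

def verify_download(name, blob, payload, signature):
    # 1) authenticity: did Eliza actually vouch for this metadata?
    expected = hmac.new(ELIZA_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    # 2) integrity: does the downloaded artifact match the recorded hash?
    targets = json.loads(payload)["targets"]
    return hashlib.sha256(blob).hexdigest() == targets[name]["sha256"]

print(verify_download("pills.tar.gz", bottle, payload, signature))       # True
print(verify_download("pills.tar.gz", b"swapped!", payload, signature))  # False
```

A bottle that doesn’t hash to what Eliza signed for, or metadata Eliza never signed, both get rejected.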

&lt;p&gt;If we were talking in person I would’ve alluded to my fantastic car salesman
voice, but given that you’re reading this you’ll have to do the heavy lifting,
because there’s more!&lt;/p&gt;

&lt;p&gt;TUF also ensures other very important things, like the software artifacts being
fresh. That is, that your bottle of pills from the previous example is not
actually an old one.&lt;/p&gt;
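TUF gets this freshness by stamping an expiration date into the signed metadata itself, so clients can reject anything stale. A minimal sketch of that check (the `expires` field name matches real TUF metadata; the parsing here is simplified):

```python
from datetime import datetime

# Signed TUF metadata carries an "expires" timestamp; stale metadata is rejected
metadata = {"expires": "2020-06-01T00:00:00Z"}

def is_fresh(meta, now):
    expires = datetime.strptime(meta["expires"], "%Y-%m-%dT%H:%M:%SZ")
    return now < expires

print(is_fresh(metadata, datetime(2020, 5, 1)))  # True: bottle still good
print(is_fresh(metadata, datetime(2021, 5, 1)))  # False: old pills, reject
```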

&lt;p&gt;Lastly, and very importantly, it also ensures that the repository where all
these software artifacts live presents a consistent state. This is important
because, as has been noted before, attackers can &lt;em&gt;mix and match&lt;/em&gt;
software artifacts in such a way that the sum of their parts is actually
malicious. This would be akin to somebody mixing different versions of
Eliza’s pharmaceutical offerings and tricking you into taking two very
incompatible chemicals&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;  — technically, you got it all from Eliza, didn’t
you?&lt;/p&gt;

&lt;p&gt;So, to wrap this up, imagine TUF as a system that ensures that you:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Got what you wanted&lt;/li&gt;
  &lt;li&gt;Got it from whoever was supposed to put it there&lt;/li&gt;
  &lt;li&gt;That it’s not expired or old&lt;/li&gt;
  &lt;li&gt;And that the repository you got it from is in a consistent state, which
rules out mix-and-match attacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;(Now, TUF does other neat things, but these are the big selling points in my
opinion)&lt;/p&gt;

&lt;p&gt;Great, so I’ve told you about TUF, let me go ahead and introduce its sister:
in-toto.&lt;/p&gt;

&lt;h3 id=&quot;in-toto-answers-how-eliza-got-her-stuff&quot;&gt;in-toto answers how Eliza got her sTUFf&lt;/h3&gt;

&lt;p&gt;Now, a crucial question that you may ask about Eliza’s pharmacy is: “well, I
trust Eliza, but is she actually selling me things that were FDA approved and
whatnot?” And the truth is, unfortunately, you don’t know. In the world of
software, as the stakes are, you either walk the chain back yourself or hope
that people aren’t lying about what they put in their software repositories.
This is where in-toto comes in: it lets you do cool things like put FDA
approval stamps on Eliza’s pills.&lt;/p&gt;

&lt;p&gt;To do this, in-toto creates a cryptographic paper trail that’s very akin to a
Bill of Materials (in fact, in-toto is closely related to cryptographically
enforceable bills of materials), so that you can walk a
&lt;strong&gt;cryptographically-authenticated paper trail&lt;/strong&gt; from your bottle of pills (err,
software artifact) all the way back to the raw materials that created it
(e.g., source code, configuration files, etc.).&lt;/p&gt;

&lt;p&gt;This way, you have a very strong combination of products: one that ensures
everything is very tightly sealed and packaged (TUF), and one that gives you
cryptographic visibility into the process that produced what you just got
(in-toto).&lt;/p&gt;
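The chaining idea can be sketched in a few lines: each step of the pipeline records the hashes of what went in (materials) and what came out (products), and verification checks that the links line up. This is only the core intuition — real in-toto links are signed and checked against a layout’s artifact rules.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

# Toy "links", one per pipeline step (real ones are signed JSON files)
links = [
    {"name": "write-code", "materials": {}, "products": {"foo.py": h(b"print('hi')")}},
    {"name": "package", "materials": {"foo.py": h(b"print('hi')")},
     "products": {"demo-package.tar.gz": h(b"tarball-bytes")}},
]

def chain_is_consistent(links):
    # every material a step consumed must match a product of an earlier step
    produced = {}
    for link in links:
        for path, digest in link["materials"].items():
            if produced.get(path) != digest:
                return False  # somebody swapped an artifact mid-pipeline
        produced.update(link["products"])
    return True

print(chain_is_consistent(links))  # True
```

If an attacker slips a different foo.py into the packaging step, its hash no longer matches the product of write-code and verification fails.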

&lt;h2 id=&quot;tuf-and-in-toto-together&quot;&gt;TUF and in-toto together&lt;/h2&gt;

&lt;p&gt;The basic idea is simple: we will use in-toto in the pipeline to create the
paper trail, and then we will use TUF to store all the sTUFf. We will do this
in basically four steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Initialize a TUF repository&lt;/li&gt;
  &lt;li&gt;Create an in-toto layout and register it in a special place in the TUF
repository&lt;/li&gt;
  &lt;li&gt;Carry out your pipeline as you normally would, but create in-toto
attestations that are submitted to a TUF repository&lt;/li&gt;
  &lt;li&gt;Profit&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I took the liberty of gathering a bunch of demos that the in-toto and TUF
communities put together over countless hours and stitched them together into
this four-step process to profit.  You can get it from
&lt;a href=&quot;https://github.com/SantiagoTorres/tuf_in_toto_demo&quot;&gt;here&lt;/a&gt;; it has a
bunch of submodules, so make sure you clone recursively (with the --recursive
flag).&lt;/p&gt;

&lt;p&gt;So let’s go and do it!&lt;/p&gt;

&lt;h3 id=&quot;0-set-up-your-environment&quot;&gt;0. Set up your environment&lt;/h3&gt;

&lt;p&gt;Ok, I lied. It’s five steps. The first one is to install TUF and in-toto. You
can probably use a virtualenv and install them from pip:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ pip install tuf in-toto pynacl cryptography
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Done. I also threw in the pynacl and cryptography dependencies so you can use
any keys you like — this deal won’t last forever!&lt;/p&gt;
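If you haven’t used virtualenvs before, the standard dance looks like this (the directory name `demo-env` is arbitrary):

```shell
# create and activate a throwaway virtualenv, then install inside it
python3 -m venv demo-env
. demo-env/bin/activate
# now run: pip install tuf in-toto pynacl cryptography
python -c "import sys; print(sys.prefix)"  # should point into demo-env
```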

&lt;h3 id=&quot;1-initialize-the-tuf-repository&quot;&gt;1. Initialize the TUF repository&lt;/h3&gt;

&lt;p&gt;So, you can grok the script under scripts/init.py, or you can blindly execute
my code (tough choice, I know). Either way, once you run it or read it you’ll
find out that it basically does the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Sets up a place to store your repository metadata (think of it as git
init-ing things)&lt;/li&gt;
  &lt;li&gt;Goes ahead and sets up the trust relationships (i.e., it says, I trust this
key for packages, this one for layouts and links).&lt;/li&gt;
  &lt;li&gt;Sets up your client environment (i.e., copies the root of trust into the
client directory)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Yes, with about 50 lines of code you have your own shiny TUF/in-toto repo.
Order today!&lt;/p&gt;

&lt;h3 id=&quot;2-creating-and-publishing-a-layout&quot;&gt;2. Creating and publishing a layout&lt;/h3&gt;

&lt;p&gt;To reduce code duplication, I copied the layout from the basic
&lt;a href=&quot;https://github.com/in-toto/demo&quot;&gt;in-toto demo&lt;/a&gt;, which starts
from a git repository and finally creates a tarball called
“demo-package.tar.gz”. This is our bottle of pills.&lt;/p&gt;

&lt;p&gt;However, before we build all this, we want to create an in-toto layout: a
policy file that describes what our supply chain should look like. This time,
run the script that publishes it:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ python scripts/publish.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will do the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create this layout and sign it with the root of trust&lt;/li&gt;
  &lt;li&gt;Add it to a special location in the TUF repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Done: a couple more lines and now you have a TUF repository publishing
in-toto layouts. We will use these layouts later when we want to make sure our
pipeline was followed to the letter.&lt;/p&gt;
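To get a feel for what such a layout expresses, here is a heavily trimmed, illustrative sketch. The field names follow the in-toto spec, but key material and the signature wrapper are omitted, so don’t treat this as the complete schema:

```python
# Illustrative, abridged in-toto layout: who may run each step, and how
# artifacts must flow between steps (signatures and real keys omitted)
layout = {
    "_type": "layout",
    "expires": "2021-01-01T00:00:00Z",
    "keys": {"<functionary-keyid>": "..."},  # public keys of trusted parties
    "steps": [
        {
            "name": "package",
            # inputs here must match the products of an earlier write-code step
            "expected_materials": [["MATCH", "*", "WITH", "PRODUCTS", "FROM", "write-code"]],
            "expected_products": [["CREATE", "demo-package.tar.gz"]],
            "pubkeys": ["<functionary-keyid>"],
            "threshold": 1,
        }
    ],
    "inspect": [],
}
```

The artifact rules (MATCH, CREATE) are the interesting bit: they are what lets verification tie each step’s outputs to the next step’s inputs.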

&lt;p&gt;Now that you have something published, it may be a good time to serve the
content using the accompanying script:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ bash create_server.sh 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Run it in another terminal so you can see your TUF repo in action.&lt;/p&gt;

&lt;h3 id=&quot;3-carry-our-the-pipeline&quot;&gt;3. Carry out the pipeline&lt;/h3&gt;

&lt;p&gt;Now, we do our usual stuff: code some, pre-commit some stuff, package it and
put it somewhere so people can download it. So let’s do just that, but use
in-toto tooling to create cryptographic attestations of what happened so we can
build an audit trail. This was shamelessly copied from the in-toto demo as
well, modulo a small wrapper to &lt;em&gt;also&lt;/em&gt; put these attestations in our TUF
repository. You can run:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ python scripts/run.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And see it all happen with your own eyes.&lt;/p&gt;
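Each step run this way leaves behind a “link” attestation. Roughly, and hypothetically simplified (real links are signed and carry actual hashes where the placeholders are), a link for the packaging step records:

```python
# Simplified, unsigned sketch of the link metadata a step leaves behind
link = {
    "_type": "link",
    "name": "package",
    "command": ["tar", "czf", "demo-package.tar.gz", "demo-package"],
    "materials": {"demo-package/foo.py": {"sha256": "<hash of the input>"}},
    "products": {"demo-package.tar.gz": {"sha256": "<hash of the output>"}},
    "byproducts": {"stdout": "", "stderr": "", "return-value": 0},
}
```

These links are what in-toto verification later matches against the layout’s expected_materials and expected_products rules.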

&lt;h3 id=&quot;4-now-verify-its-all-together&quot;&gt;4. Now verify it’s all together&lt;/h3&gt;

&lt;p&gt;At last, it’s time to consume our package. To do this, we will download things,
then make sure they are kosher and finally open it to unwrap what’s inside.
This is very similar to how things in the meatspace work. Let’s think about our
bottle of pills again:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;You grab your new bottle of pills from your medicine cabinet&lt;/li&gt;
  &lt;li&gt;You ensure that the tamperproof seal around it is intact, and that the
expiration dates are correct (TUF)&lt;/li&gt;
  &lt;li&gt;If you want to be extra sure, you also check that there’s an FDA
approval seal and a lot number next to the expiration date (in-toto)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you notice all these things, then you go ahead and open the bottle.&lt;/p&gt;

&lt;p&gt;Software shouldn’t be too different. In fact, this is what our downloader will
do. In something shy of 30 lines, it will first use TUF to connect to our
repository and download our package, then it will notice there is other
information attached to it and download that too. Once it’s all downloaded,
it will run in-toto verification on it. If everything is ok, then
you can happily consume your new package:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ tar xvf demo-package.tar.gz &amp;amp;&amp;amp; python demo-package/foo.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s it!&lt;/p&gt;

&lt;h2 id=&quot;where-do-we-go-from-here&quot;&gt;Where do we go from here?&lt;/h2&gt;

&lt;p&gt;There are a lot of resources to work with in-toto and TUF. I usually put things into two buckets. You can do things &lt;em&gt;with&lt;/em&gt; TUF and in-toto or you could do things &lt;em&gt;for&lt;/em&gt; TUF and in-toto.&lt;/p&gt;

&lt;p&gt;For the former, you can set up a repository and play with things more by
tweaking these scripts. You can also take a look at the documentation and
explore ways to add these tools to your environments and ecosystems. Shoot an
email to the &lt;a href=&quot;mailto:theupdateframework@googlegroups.com&quot;&gt;TUF&lt;/a&gt; or
&lt;a href=&quot;mailto:in-toto-public@googlegroups.com&quot;&gt;in-toto&lt;/a&gt; lists if you ever run into issues, as
we’d be more than happy to help.&lt;/p&gt;

&lt;p&gt;If you are interested in also developing &lt;em&gt;for&lt;/em&gt; in-toto or TUF, these
communities are super, super welcoming, and I’d encourage you to reach out and
play with things.  There are already some labels for newcomers in the issues.
Here are some places where we could use more hands:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We have a &lt;a href=&quot;https://github.com/in-toto/in-toto-golang&quot;&gt;golang library&lt;/a&gt; that
needs more love: it’s a somewhat feature-complete implementation, but it
could use more work on the non-core in-toto features. If you like Go, I
think this is a great place to leave a mark.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/jenkinsci/in-toto-plugin&quot;&gt;Jenkins Plugin&lt;/a&gt; and the
&lt;a href=&quot;https://github.com/in-toto/in-toto-webhook&quot;&gt;Kubernetes Admission controller&lt;/a&gt;
are also great places to work with things. I’d suggest you take a
look at the repositories and play with them a little bit.&lt;/li&gt;
  &lt;li&gt;Anything you’d like really. If you have ideas on how to make things better
I’m sure that we’d be more than happy to hear them.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I hope this all helps, and see you around!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;So, this is nothing new; you can read how Datadog does this very well
&lt;a href=&quot;https://www.datadoghq.com/blog/engineering/secure-publication-of-datadog-agent-integrations-with-tuf-and-in-toto/&quot;&gt;here&lt;/a&gt;.
This blogpost and demo are of course inspired by the work of Trishank Kuppusamy,
who made it all happen on the Datadog side of things. You may also want to read how this is getting encoded into an in-toto ITE &lt;a href=&quot;https://github.com/in-toto/ITE/pulls/4&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;I’m not a chemist and I’m not going to give you any bad ideas so the example chemicals are left as an exercise to the reader. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name></name></author><summary type="html">I’ve been speaking quite a lot with quite a lot of people about the benefits of in-toto and TUF together. Indeed, my reaction after saying “hey, you don’t need TUF to use in-toto”, is “but they do go really well together”. I’ve done it so much that by now I have a very well rehearsed canned answer as to why they go well. It was only a matter of will and free time (look ma! I’m a Doctor now!), before I decided to dust off this blog and probably share why it matters and — more importantly — how you can see it for yourself in four easy steps.</summary></entry><entry><title type="html">Creating a web-enabled USB drive with WebUSB</title><link href="https://badhomb.re//webusb/u2f/2fa/2017/11/29/webusb.html" rel="alternate" type="text/html" title="Creating a web-enabled USB drive with WebUSB" /><published>2017-11-29T20:00:00-05:00</published><updated>2017-11-29T20:00:00-05:00</updated><id>https://badhomb.re//webusb/u2f/2fa/2017/11/29/webusb</id><content type="html" xml:base="https://badhomb.re//webusb/u2f/2fa/2017/11/29/webusb.html">&lt;p&gt;I got caught in the crossfire of adapting one of my projects
(PolyPasswordHasher, if you’re curious) to support two factor authentication
recently. One of the goals that I had prepared for the summer was to have an
actual demo website in which someone could register a yubikey and log in to a
website using PPH + HOTP (I’ll leave the reason as to why HOTP out of this
post) without too much hassle.&lt;/p&gt;

&lt;p&gt;Sadly, the ecosystem for browser USB extensions feels like a wasteland:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;You could write a plugin, but that’s incredibly insecure and close to being
deprecated in one or two years.&lt;/li&gt;
  &lt;li&gt;You could use chrome’s USB extension library but, guess what, it’s also going
to be deprecated.&lt;/li&gt;
  &lt;li&gt;You can also try to ship a binary with a browser extension, but that would
cause more cross-platform compatibility problems than I’d like to list here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leaves us with a somewhat experimental technology: WebUSB.&lt;/p&gt;

&lt;h2 id=&quot;enter-webusb&quot;&gt;Enter WebUSB&lt;/h2&gt;

&lt;p&gt;WebUSB is a (finally!) standardized technology to provide a USB bridge so
websites can connect to users’ USB devices using JavaScript. You can look at it
as if the website were providing you with a USB driver along with the two
tonnes of JQuery it uses to make rounded boxes on your site.&lt;/p&gt;

&lt;p&gt;This may sound like a security nightmare, at least on first impression.
Shipping code that has access to the user’s hardware sounds somewhat
problematic. However, webUSB is an improvement security-wise if you consider
the previous alternatives:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It’s not code running outside of a sandbox, like a flashplugin would be.&lt;/li&gt;
  &lt;li&gt;Permissions must be granted by the user to allow a website to access a usb
device &lt;em&gt;explicitly&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Some devices, like USB keyboards, are not accessible to webusb (e.g., to
avoid keylogging)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That being said, I wouldn’t be surprised if someone finds a way to abuse it
during these early stages.&lt;/p&gt;

&lt;p&gt;Besides the security aspects of webusb, the only drawback that I found is that,
well, there is not much documentation on how to write a webusb device handler.
Here, I’ll document how I ‘reversed’ the protocol (I could’ve just read the
code for their open-source tools, but that’s not fun) and wrote a webusb
driver for a yubikey with HOTP enabled.&lt;/p&gt;

&lt;h2 id=&quot;setting-up-your-dev-environment&quot;&gt;Setting up your dev environment&lt;/h2&gt;

&lt;p&gt;In order to develop for webusb, you need to move a couple of things around.
First, you need the latest(-ish?) version of chromium. Second, you need to run
it with a couple of flags and a local webserver to serve your webusb JavaScript
files. Third, you may need to turn on a couple of flags within chromium to
enable experimental features.&lt;/p&gt;

&lt;p&gt;Start chromium like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ chromium --disable-web-security --allow-insecure-localhost
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reason is that we’ll be serving the files using a plain http
server from python (you can use whatever makes you happy here, though). By
default, webusb is not enabled unless the content is served through HTTPS and
the certificate is trusted (++ for security here). A complete list of flags can
be found at
&lt;a href=&quot;https://peter.sh/experiments/chromium-command-line-switches/&quot;&gt;this site&lt;/a&gt;
in case you’re curious, although you don’t need more than these two.&lt;/p&gt;

&lt;p&gt;Finally, depending on how old your version of chromium is, you may need to
enable the experimental features by navigating to chrome://flags and enabling a
flag called “Experimental web platform features.” If you have done this, then
you will need to restart your browser.&lt;/p&gt;

&lt;p&gt;After setting up chromium, you can start serving your local WebUSB files like
so:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ python3 -m http.server
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Cool! Now you should be able to navigate to localhost and play around with
webusb and your device.&lt;/p&gt;

&lt;h2 id=&quot;sniffing-the-usb-device&quot;&gt;Sniffing the USB device&lt;/h2&gt;

&lt;p&gt;Another necessary task is to understand what the original usb driver sends
to the USB device in order to replicate it. The device you target may already
have an implementation using other libraries (e.g., libusb), or a
specification describing these tasks, but you may also run into devices that
are not documented (again, this wasn’t the case for the yubikey, but I still
opted not to check the docs). If you’re in that last case and there are no
documents on how to interact with your device, a simple pcap using wireshark
can work wonders.&lt;/p&gt;

&lt;h3 id=&quot;setting-up-wireshark-for-usb-sniffing&quot;&gt;Setting up wireshark for USB sniffing&lt;/h3&gt;

&lt;p&gt;Under linux, wireshark needs to have a couple of modules loaded and permissions
changed so you can sniff usb traffic. The instructions are taken from
&lt;a href=&quot;https://wiki.wireshark.org/CaptureSetup/USB&quot;&gt;this article&lt;/a&gt;,
but I’ll inline the linux instructions anyway.&lt;/p&gt;

&lt;p&gt;First, load the usbmon kernel module:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo modprobe usbmon
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will create a series of /dev/usbmonN devices. You need to make them
readable by regular users:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ sudo setfacl -m u:USER:r /dev/usbmon*
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Having done this, you can launch wireshark and pick an interface to sniff.
Which one to pick can be easily seen using dmesg. Run dmesg in follow mode
(dmesg -w) and then plug in your device. You should see something like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[ 6350.949823] usb 1-4: new full-speed USB device number 9 using xhci_hcd 
[ 6351.093360] input: Yubico Yubikey NEO OTP+CCID as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.0/0003:1050:0111.0006/input/input22 
[ 6351.150902] hid-generic 0003:1050:0111.0006: input,hidraw0: USB HID v1.10 Keyboard [Yubico Yubikey NEO OTP+CCID] on usb-0000:00:14.0-4/input0 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The important bit of this part of the log is the “usb 1-4” part. This means
the device was connected to the usbmon1 interface. You can also tell what
“address” Wireshark will use from the information on the rest of the line
(1.9.x). A sample wireshark capture of a packet going to our USB device would
look like this:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;398 19.946717 host 1.9.0 USB 72 URB_CONTROL out
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This was a packet sent from your computer to 1.9.0, the device that was just
connected. Using Wireshark, we can capture the “conversation” between the
laptop and the Yubikey (or any other device) and sort of tell what exactly is
being sent and received.&lt;/p&gt;

&lt;p&gt;In this case the host sends a series of URB_CONTROL out message(s) with certain
flags and the challenge to hash, then waits for a status flag to be set on the
replies and starts reading the resulting HOTP hash. You can see the relevant
bits of the conversation on packets 8 to 47 in
&lt;a href=&quot;https://ptpb.pw/0Pnn.pcapng&quot;&gt;this pcap&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;translating-sniffed-packets-into-webusb-calls&quot;&gt;Translating Sniffed packets into webusb calls&lt;/h2&gt;

&lt;p&gt;Now that we know what we need to do, we can try to replicate the behavior using
webusb to interact with our devices. For example, the details of the packet I
listed above are as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/pcap.png&quot; alt=&quot;pcap&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This can be translated into the following webusb call:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Device.controlTransferOut({
    &quot;recipient&quot;: &quot;interface&quot;,
    &quot;requestType&quot;: &quot;class&quot;,
    &quot;request&quot;: 9,
    &quot;value&quot;: 0x0300,
    &quot;index&quot;: 0 }, Data);
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may suspect that some of the values on the wireshark capture map directly
to the arguments sent to the control transfer out. Well, it is that simple. If
you don’t want to understand what these values mean (I certainly won’t cover
them here), you can just blindly build the same request and see how the device
behaves.&lt;/p&gt;

&lt;p&gt;These calls return a promise object, which resolves with the data that the
device contains after our call. We would have to chain these promises to
effectively have a conversation with our yubikey. However, this may not be as
straightforward as with other approaches.&lt;/p&gt;

&lt;h3 id=&quot;fun-with-promises&quot;&gt;Fun with promises&lt;/h3&gt;

&lt;p&gt;The webusb API’s reliance on promises makes writing driver-like code a
little weird. This is because webusb is merging two worlds: the weirdly
“asynchronous” JavaScript web-space and the structured-protocol,
raw-byte-handling world of low-level device interaction. This combination will
often lead to a design pattern: nested promises. At least in my very humble
opinion.&lt;/p&gt;

&lt;p&gt;A nested webusb promise, in simple terms, is something that does the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Start a promise by sending a request. The result will be handled by another promise&lt;/li&gt;
  &lt;li&gt;The second promise will check whether the request is ready (i.e., the read
frame says “good to go”):
    &lt;ul&gt;
      &lt;li&gt;If it’s not ready, start another promise exactly like the one in step 2.&lt;/li&gt;
      &lt;li&gt;If it is ready, then move on and resolve the “outer” promise so we can
continue onwards to the next step.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
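The pattern above can be sketched with a mock device — no real WebUSB calls here, `fakeDevice` just pretends to need a few polls before its status flag flips:

```javascript
// Mock device: reports "not ready" for a couple of reads, then hands over data
const fakeDevice = {
  pollsLeft: 3,
  controlTransferIn() {
    this.pollsLeft -= 1;
    return Promise.resolve({ ready: this.pollsLeft <= 0, data: "hotp-result" });
  },
};

// Inner chain: re-issue the read until the frame says "good to go"
function waitUntilReady(device) {
  return device.controlTransferIn().then((frame) =>
    frame.ready ? frame.data : waitUntilReady(device)
  );
}

// Outer chain: once the poll resolves, move on to the next protocol step
waitUntilReady(fakeDevice).then((data) => console.log(data)); // prints "hotp-result"
```

In real code, `controlTransferIn` would be the WebUSB method of the same name and the ready check would inspect the status bytes in the returned frame.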

&lt;p&gt;This may be easier to picture in the diagram below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/promise-chain-1.png&quot; alt=&quot;promise chain&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This construction makes it so that the outer promise can build a promise
chain that follows a structured protocol, such as the one used by the Yubikey
HOTP interface. Meanwhile, the inner promise chain makes the outer promise
hold on until the device is ready for the next step.&lt;/p&gt;

&lt;p&gt;This way, we can write a usb device handler that looks pretty much like the
drivers you would write with libusb, but using the async and pretty
JavaScript-y API of webusb.&lt;/p&gt;

&lt;p&gt;Writing webusb handlers/drivers is rather fun and easy once you get started.
Reversing usb devices is a fun side-project that may keep you interested/busy
for a good week while learning a little about the things we plug into our
computers every day.&lt;/p&gt;</content><author><name></name></author><summary type="html">I got caught in the crossfire of adapting one of my projects (PolyPasswordHasher, if you’re curious) to support two factor authentication recently. One of the goals that I had prepared for the summer was to have an actual demo website in which someone could register a yubikey and log in to a website using PPH + HOTP (I’ll leave the reason as to why HOTP out of this post) without too much hassle.</summary></entry><entry><title type="html">Looking at the Git landscape through SHATTERED glass</title><link href="https://badhomb.re//git/sha1/rant/2017/03/04/shattered.html" rel="alternate" type="text/html" title="Looking at the Git landscape through SHATTERED glass" /><published>2017-03-04T20:00:00-05:00</published><updated>2017-03-04T20:00:00-05:00</updated><id>https://badhomb.re//git/sha1/rant/2017/03/04/shattered</id><content type="html" xml:base="https://badhomb.re//git/sha1/rant/2017/03/04/shattered.html">&lt;p&gt;A recent &lt;a href=&quot;https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html&quot;&gt;blogpost&lt;/a&gt; 
from Google and CWI showed us what many had suspected would happen soon: a
practical attack on SHA-1 could be successfully carried out.  Although this is
an important milestone for the history of cryptographic hash algorithms (if
that’s even a thing), the practical implications are more nuanced. As it is
with the emerging trend of branded vulnerabilities — (this one is called
&lt;a href=&quot;https://shattered.io/&quot;&gt;shattered&lt;/a&gt;) — the details are lost in a sea of PR-littered vacuity and witty
names for vulnerabilities.&lt;/p&gt;

&lt;p&gt;Among the long list of “broken” applications, there is Git, probably the most
widely used version control system. Given this popularity, it is not surprising
that many people have been running around flailing their arms, writing &lt;a href=&quot;https://arstechnica.com/security/2017/02/watershed-sha1-collision-just-broke-the-webkit-repository-others-may-follow/&quot;&gt;risible
headlines&lt;/a&gt;
and tweeting the
&lt;a href=&quot;https://twitter.com/bcrypt/status/834762918692483073&quot;&gt;not&lt;/a&gt;–&lt;a href=&quot;https://twitter.com/bascule/status/836298123408388096&quot;&gt;so&lt;/a&gt;–&lt;a href=&quot;https://twitter.com/andywingo/status/835132154749272064&quot;&gt;amusing&lt;/a&gt;–&lt;a href=&quot;https://twitter.com/realhashbreaker/status/835199945804238848&quot;&gt;anymore&lt;/a&gt;
“securely holier than thou”
&lt;a href=&quot;https://twitter.com/NathOnSecurity/status/834796736308793344&quot;&gt;tweets&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At first sight, the concern is well founded. A collision in SHA-1 allows
attackers to replace files in such a way that git cannot identify the
change — not even with signing enabled. Of course, this would let resourceful attackers
sneak in backdoored versions of files after a repository compromise, or create
a signed commit object that wasn’t made by the author of the signature.
However, after further inspection, the picture is not that grim.&lt;/p&gt;

&lt;p&gt;The truth is, as I will show below, SHA-1 is broken in such a way that
performing any of these attacks is infeasible, at least from an economic
standpoint. Throughout this post, I’ll show what a doomsday scenario would
look like for git if someone really, really wanted to attack it by colliding hashes.&lt;/p&gt;

&lt;h2 id=&quot;background-or-wait-wasnt-this-already-very-really-broken&quot;&gt;Background (or “wait, wasn’t this already very, really broken?!”)&lt;/h2&gt;

&lt;p&gt;Before digging into the details of git and its use of SHA-1, I wanted to make a
brief rundown of what actually happened this last Thursday, and try to place it
in the context of other hash functions.&lt;/p&gt;

&lt;p&gt;The truth is, SHA-1 was already broken, as announced by &lt;a href=&quot;https://mail.python.org/pipermail/python-dev/2005-December/058850.html&quot;&gt;Rivest&lt;/a&gt; back in 2005.
The reason for this conclusion is that, although the names are not similar,
SHA-1’s construction is practically the same as MD5’s. This construction,
Merkle-Damgård, is the one used by both hash algorithms and the reason an
identical-prefix collision attack is possible. It is no surprise that some rockstars
in the industry came out, unsurprised themselves, to point out that NIST
had already deprecated SHA-1 six years ago.&lt;/p&gt;

&lt;p&gt;This comparison with MD5 is a really good starting point for understanding the
nature of this “news” and how it applies to the use of SHA-1. Here is the
timeline of the practical attacks against MD5:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;In 1996, collisions were found in the compression function of MD5. Since
then, experts have recommended staying away from MD5.&lt;/li&gt;
  &lt;li&gt;In 2005, researchers were able to create two Postscript files that collided.
When this happened, Rivest came out to &lt;a href=&quot;https://mail.python.org/pipermail/python-dev/2005-December/058850.html&quot;&gt;declare both MD5 and SHA-1 broken in
terms of collision resistance&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;In 2007, Marc Stevens (the name should sound familiar) wrote hashclash, which &lt;a href=&quot;https://natmchugh.blogspot.com/2014/10/how-i-created-two-images-with-same-md5.html&quot;&gt;Nat McHugh&lt;/a&gt;
later used to collide the hashes of two PNG files.&lt;/li&gt;
  &lt;li&gt;In 2008, the &lt;a href=&quot;https://arstechnica.com/security/2008/12/theoretical-attacks-yield-practical-attacks-on-ssl-pki/&quot;&gt;CCC was able to impersonate a CA by colliding its MD5 hash&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;In 2012, &lt;a href=&quot;https://arstechnica.com/security/2012/06/flame-wields-rare-collision-crypto-attack/&quot;&gt;the Flame malware used an MD5 collision&lt;/a&gt; to fake a certificate owned by Microsoft.&lt;/li&gt;
  &lt;li&gt;In 2017, &lt;a href=&quot;https://blogs.oracle.com/stevenChan/entry/jar_md5&quot;&gt;people still use MD5, god knows why&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As I hinted before, the timeline for SHA-1 is (and will be) incredibly similar
to the one for MD5, with the current events matching those of the
year 2005. A collision in a PS/PDF file is pretty much the same story, as these file
formats are not as brittle as others: they allow random data to be located
somewhere in the file without showing it or damaging the file format in any
way. Other file formats, such as X.509 certificates, source code files and
lossy-compression images, are not so forgiving, which greatly reduces the
feasibility of a collision. I’ll elaborate more on this fact later.&lt;/p&gt;

&lt;p&gt;Other similarities also arise: approximately 10 years after people declared MD5
broken, it was broken in practice. This time, it took us 12 years to go from
warning to being able to produce two files that hash to the exact same value. The final
similarity is rather obvious: thanks to Moore’s law and further research,
attacks are only going to become more effective. There is no reason to continue
using SHA-1 for newer applications.&lt;/p&gt;

&lt;p&gt;The lesson to take from this comparison is that this marks a milestone in the
usual life of a hash algorithm: it may put a nail in the coffin, but the
algorithm is not completely dead for applications that rely on it today (as
is the case with git).&lt;/p&gt;

&lt;p&gt;Regardless of this fact, I still wanted to sketch a doomsday scenario for git, so you
will have it. To build it, I have to give you a little bit of background on git.&lt;/p&gt;

&lt;h2 id=&quot;how-does-git-work&quot;&gt;How does git work?&lt;/h2&gt;

&lt;p&gt;The information we need from git to carry out our attack is minimal. I only need to
describe the file formats, and from those we will pick the best point to
wreak havoc in a git repository (not really, just on paper).&lt;/p&gt;

&lt;p&gt;A git repository is mostly made of two types of files: references and objects.
To keep things brief, I’ll skip the details of the former. It suffices to say
that git references, like in a programming language, are pointers to other
entities; in this case, they are pointers to git objects. An example of a
reference is a branch, which points to a git commit object.&lt;/p&gt;

&lt;p&gt;Git objects hold information about a repository. As of today, there are four of
them:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Git commit objects: these hold information about a revision in the
repository. Inside a commit object you will find the author, the commit
message, a date, and other metadata. However, the most important bit of
information is the id of a root tree object.&lt;/li&gt;
  &lt;li&gt;A tree object is akin to a folder in a filesystem: it contains a listing
of other tree objects and blob objects (along with information about them, as
shown below).&lt;/li&gt;
  &lt;li&gt;A blob object which, as you may have guessed, is a file. This contains the
size of the file and the contents of the file itself.&lt;/li&gt;
  &lt;li&gt;Finally, there are tag objects, which are pretty much like commit objects,
but they are meant to point to a static position in the repository (e.g.,
release v1.0). Git tags are usually signed using GPG to ensure the
authenticity of the tag and all the files to which it indirectly points.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is in the IDs of all of these objects where the SHA-1 function is used. To
obtain the id of an object, you simply hash its header and contents. By
creating a new object that shares a SHA-1 with a known object, we can
perform our attack.&lt;/p&gt;
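&lt;p&gt;This computation is easy to reproduce. Here’s a minimal Python sketch (the helper name is mine, not git’s) of how an object id is derived:&lt;/p&gt;

```python
import hashlib

def git_object_id(obj_type, content):
    # git prepends a header (the object type, a space, the content
    # length in decimal ASCII, and a null byte) and SHA-1s the result
    header = obj_type.encode() + b" " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

# The id of a blob holding "hello\n"; this is the same value that
# `echo hello | git hash-object --stdin` prints.
print(git_object_id("blob", b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

&lt;p&gt;Any replacement object an attacker crafts must come out of this function with the same digest as its victim.&lt;/p&gt;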

&lt;p&gt;For example:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;We could create a new commit object with the same id and replace it. This
would let us replace the whole root tree and pretty much serve another
repository.&lt;/li&gt;
  &lt;li&gt;We could also create a new root tree object with the same id and replace
it. This would let us replace all or any of the files in the repository.&lt;/li&gt;
  &lt;li&gt;We could also create a file that hashes to the same value (with a minor
caveat that I’ll cover later) and replace the blob object in the repository.&lt;/li&gt;
  &lt;li&gt;We could do to a tag object the same thing we did to a commit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of the attacks that I outlined above, only the third is really feasible. The
reason lies in the fact that the git commit/tree object
format is rather brittle. Unlike PDF and PS files, you cannot put many bytes
of random junk somewhere and expect git not to complain about a corrupt object.&lt;/p&gt;

&lt;p&gt;This pretty much leaves us with the ability to replace blobs, which should be
enough to do something evil. Let me elaborate on this a little bit more.&lt;/p&gt;

&lt;h2 id=&quot;random-junk-and-the-probability-of-getting-the-right-junk&quot;&gt;Random Junk, and the probability of getting the right Junk&lt;/h2&gt;

&lt;p&gt;When carrying out the attack described in the paper, the researchers exploited
the fact that, under the Merkle-Damgård construction, two files will keep producing
the same hash from the point their internal states collide onward. For this, you
need to set a common prefix shared between both files.&lt;/p&gt;

&lt;p&gt;Immediately after the prefix come two blocks of random-looking bytes. The first
block is used to drive the hash function into a certain state called a “near-collision.”
The second block of random bytes is used to turn this into a complete collision that
also allows for any kind of (identical) suffix. These blocks only need to be found
once, so you can create your own colliding PDFs using the published blocks.&lt;/p&gt;

&lt;p&gt;However, these collisions require the same prefix length, a chosen prefix, and a
place to locate 84 bytes of junk after the prefix in such a way that it doesn’t
result in a corrupted file. It is the placement of this junk that gets in the
way with brittle file formats. For example, in a git tree, any junk byte that falls
outside the format for the file list would corrupt the object:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ cat .git/objects/da/3d3fc569dc3ded6c67e5209840ff4205202613 | zlib-flate -uncompress | hexdump -C
00000000 74 72 65 65 20 31 33 37 00 31 30 30 37 35 35 20 |tree 137.100755 |
00000010 74 68 65 5f 73 6f 6f 74 68 69 6e 67 5f 73 6f 75 |the_soothing_sou|
00000020 6e 64 5f 6f 66 5f 68 61 73 68 5f 63 6f 6c 6c 69 |nd_of_hash_colli|
00000030 73 69 6f 6e 73 2e 70 79 00 45 ca db ed b6 09 fd |sions.py.E......|
00000040 cc 29 c5 5b 79 54 83 ba 6a c6 b1 f9 b0 31 30 30 |.).[yT..j....100|
00000050 36 34 34 20 74 68 65 5f 73 6f 6f 74 68 69 6e 67 |644 the_soothing|
00000060 5f 73 6f 75 6e 64 5f 6f 66 5f 68 61 73 68 5f 63 |_sound_of_hash_c|
00000070 6f 6c 6c 69 73 69 6f 6e 73 2e 77 61 76 00 0e 0a |ollisions.wav...|
00000080 e5 a7 bf 61 78 d4 90 12 7b 2c 74 0e 78 34 b0 85 |...ax...{,t.x4..|
00000090 cf 59 |.Y|
00000092
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a tree object to be valid, you need:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The ASCII word “tree” (this could easily be part of a prefix), followed by a space, the size in decimal ASCII, and a null byte&lt;/li&gt;
  &lt;li&gt;Then a list of entries, each consisting of:
    &lt;ul&gt;
      &lt;li&gt;the octal ASCII digits of the mode (permissions and file type), followed by a space&lt;/li&gt;
      &lt;li&gt;the filename, followed by a null byte&lt;/li&gt;
      &lt;li&gt;20 raw bytes that point to another existing git object (submodules are represented by a commit)&lt;/li&gt;
      &lt;li&gt;and then, immediately, the next entry&lt;/li&gt;
    &lt;/ul&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
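&lt;p&gt;To make the brittleness concrete, here is a minimal Python sketch (the helper names are mine) of how such a tree object is serialized and hashed; any junk bytes would have to survive this exact layout:&lt;/p&gt;

```python
import hashlib

def tree_entry(mode, name, object_id_hex):
    # one entry: ASCII mode digits, a space, the filename, a null
    # byte, and then the 20 raw bytes of the referenced object id
    return mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(object_id_hex)

def git_tree_id(entries):
    # the tree header is the word "tree", a space, the byte length
    # of the concatenated entries, and a null byte
    body = b"".join(entries)
    header = b"tree " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

# a one-file tree pointing at the well-known id of a blob holding "hello\n"
entry = tree_entry("100644", "hello.txt", "ce013625030ba8dba906f756967f9e9ca394464a")
print(git_tree_id([entry]))
```

&lt;p&gt;Everything in a tree is either fixed syntax or must point at a real object, which is exactly why there is nowhere to hide 84 bytes of junk.&lt;/p&gt;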

&lt;p&gt;That’s it; that’s all that appears in a tree object. If we were to place 84 bytes
of random junk somewhere there, we would have to be really lucky to find
colliding blocks that match the format for a tree (we can’t reuse the ones
computed for PDFs). The probability is pretty much 0.&lt;/p&gt;

&lt;p&gt;A more realistic approach would be to repeat this with a blob object. However,
this is still sensitive to the file format that we use. The header of a blob
object is only the word “blob”, the size of the blob (in bytes), and a null byte,
followed by the rest of the content. Because this header differs, we can’t reuse the same
&lt;a href=&quot;https://github.com/joeyh/supercollider&quot;&gt;colliding blocks from the PDFs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Luckily for us (remember we are attackers), it is feasible to account for this
header and find a new collision. Indeed, the easy way to do this would be to
just prepend the blob header to the PDF prefix and recompute
the collision. Causing collisions for other file formats would be harder: the
probability that the two blocks of 84 + 64 random bytes (assuming an even
distribution) come out all printable, so as to create a “meaningful” source code
file, floats around (97/256)^100 ≈ 7.3e-43 even if there were only 100 bytes in total.&lt;/p&gt;
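&lt;p&gt;As a sanity check on that order of magnitude, here’s a quick sketch under my simplified model that every byte of a 100-byte collision block must independently land on one of the 97 printable ASCII values:&lt;/p&gt;

```python
# each byte is uniform over 256 values; 97 of them are printable,
# and all 100 bytes must land in that set independently
p = (97 / 256) ** 100
print(p)  # roughly 7e-43
```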

&lt;p&gt;A colliding C file may look like this.&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/* prefix stuff */
#include &amp;lt;...&amp;gt;
int main (...) {
  /* BLOCK1 of ascii-printable crap
   *
   */

  /* BLOCK2 of ascii-printable crap
   *
   */

   evil_code(); // not part of the collision.

}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Notice we had to be super lucky not to have any nulls inside the comment, or
control characters, or the sequence */ that would terminate the comment early.&lt;/p&gt;

&lt;p&gt;Now, let’s assume we are that lucky, and that we are still willing to collide a
file, any file. We would still need to consider other factors. The first one
that came to mind is blob-lifetime.&lt;/p&gt;

&lt;h2 id=&quot;blob-lifetime&quot;&gt;Blob lifetime&lt;/h2&gt;

&lt;p&gt;Blob lifetime is what I dub the time between a blob’s inception and the
time it is replaced by another, newer blob. This is particularly relevant
because, if you are going to spend a year like these researchers did finding
a collision, you want to collide a blob that will still be in use by the
time you find it; otherwise the prefix may have changed underneath you. To collide an
arbitrary blob, an attacker would:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Take the blob header (that is, the word “blob”, the size, and a null byte) as a prefix, plus other things
that are usually static (e.g., some of the imports, maybe the license in the
comments, etc.)&lt;/li&gt;
  &lt;li&gt;Compute the colliding blocks immediately after, in such a way that they don’t
corrupt the file (here’s where they would have to get really lucky if this
were code).&lt;/li&gt;
  &lt;li&gt;Once a collision is found, pad it with whatever backdoor code they’d like and
create the evil blob. Another, non-evil blob should be sent to the repository
to create a valid entry, maybe through a non-malicious pull request,
although you would need a good explanation for your two block comments full of
random ASCII characters.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the graph below you can explore the lifetime of blob objects for some
popular git repositories (if you’d like me to compute a dataset for your repo,
please ping me!). You can explore the files, where they appear, and the time
it took for them to be replaced. I also fit a linear equation to “estimate” the cost
of cracking a certain blob (assuming we could know how long a blob would
survive before being replaced), which you can see on hover.&lt;/p&gt;

&lt;iframe src=&quot;https://santiagotorres.github.io/blob_lifecycles/?dataset=git.json&quot; width=&quot;100%&quot; height=&quot;720&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;You can see a bigger version of this graph &lt;a href=&quot;https://santiagotorres.github.io/blob_lifecycles/?dataset=git.json&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This graph is pretty interesting for many reasons. It sheds light on the low-hanging
fruit that an attacker might exploit. The files that are easy to
crack include, for example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;LICENSE files: boring&lt;/li&gt;
  &lt;li&gt;Test code: also boring&lt;/li&gt;
  &lt;li&gt;Documentation: also boring&lt;/li&gt;
  &lt;li&gt;Vendored code (check the rightmost delta on the docker dataset): somewhat interesting&lt;/li&gt;
  &lt;li&gt;Images: I guess we could prank someone by replacing their logo with their
childhood pictures by spending 110 GPU-years to cause a collision or
something.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The second exercise that I’d like you to do is the following: pick any file
that you would like to crack on the rightmost side, and then write the name of
that file in the filter. You will most likely see that the file also appears on
the left. The reason this happens is that, as projects grow, code
churn increases with them. The newer versions of the files (and the ones that are
used in the latest revision) usually fall on the left margin.&lt;/p&gt;

&lt;p&gt;Result: you may be able to change the LICENSE file of a single-developer
project by spending 31k and two and a half years.&lt;/p&gt;

&lt;h2 id=&quot;sneaking-the-blob&quot;&gt;Sneaking the blob&lt;/h2&gt;

&lt;p&gt;Now, let’s say that we got lucky with the blob we spent our tuition money and
two years of our life on, because we really wanted to screw with that guy’s
.gitignore file. We did it: we have a blob that hashes to the same value. Now
what? The story is not really over.&lt;/p&gt;

&lt;p&gt;The way the git transport protocol works doesn’t let us replace the remote
blob. I won’t go into the details, but a cartoonish depiction of what a push
would look like is as follows:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;gt; CLIENT: I'm pushing this commit, with this tree, and here's this new blob with id 0xCAFEC0FFEE
&amp;gt; SERVER: Oh, I already got 0xCAFEC0FFEE, but thanks
&amp;gt; CLIENT: =(
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;What alternatives do we have? Keeping the blob locally is useless, because we
want other people to use our blob. To get it out there, I imagine the following
alternatives:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We could destroy all history on the git repository that contains all mentions
of the blob, and wait for it to get garbage-collected, and then push again —
noisy&lt;/li&gt;
  &lt;li&gt;We could, instead, trick people into getting their stuff from our repository —
we would need to be well known, and this would be pretty obvious&lt;/li&gt;
  &lt;li&gt;We could man in the middle the connection — Oh, why don’t we break GitHub’s
certificate fingerprints? Sadly, they don’t use SHA-1 anymore so we can’t do
that…&lt;/li&gt;
  &lt;li&gt;We could break into the repository, and change the file manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The story would maybe be different if the repository owner were the one acting
maliciously: since he is not a third party, he could just rewrite the history.
This may be the most dangerous case, for he could keep replacing blobs for as long as
the prefix doesn’t change.&lt;/p&gt;

&lt;p&gt;To be fair, of all of the highly unlikely things that I’ve listed here, at
least the third and fourth seem plausible. Let’s assume you did that. At this
point, you successfully hacked a git repository by causing a SHA-1 collision.&lt;/p&gt;

&lt;p&gt;Note that I presented this as a thought exercise, for the git folks are already
working on integrating the hardened SHA-1 code, and your attack would be
unsuccessful once it is merged. Shucks!&lt;/p&gt;

&lt;p&gt;To summarize:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;This attack is an important milestone in the evolution of SHA-1’s
deprecation.&lt;/li&gt;
  &lt;li&gt;This attack is not feasible against git. An arbitrary-prefix attack
would be more interesting, but we aren’t there yet.&lt;/li&gt;
  &lt;li&gt;Even if it were, collisions of certain files are harder than others, given the
code churn and the entropy of the repository’s files.&lt;/li&gt;
  &lt;li&gt;You would have to be really lucky to find two colliding blocks for files that
have brittle file formats (such as code or certain git objects).&lt;/li&gt;
  &lt;li&gt;Say you were lucky: there are better ways to mess with people’s LICENSE files.&lt;/li&gt;
  &lt;li&gt;To do some damage, you would target vendored code, but vendored code is
already a mess that should be handled better. There are other ways to abuse
this fact; just look at libtiff, zziplib and libwmf.&lt;/li&gt;
  &lt;li&gt;The story is not over once you compute the collision. You have to put the blob
in the right place. This often means either stealing a certificate or hacking
into a server. The story is of course different if you are a malicious
server, but things can go wrong in many other ways if that happens.&lt;/li&gt;
  &lt;li&gt;There is already work underway to both harden git’s use of SHA-1 and
replace the hashing algorithm, so don’t get your hopes up.&lt;/li&gt;
&lt;/ul&gt;</content><author><name></name></author><summary type="html">A recent blogpost from Google and CWI showed us what many had suspected would happen soon: a practical attack on SHA-1 could be successfully carried out. Although this is an important milestone for the history of cryptographic hash algorithms (if that’s even a thing), the practical implications are more nuanced. As it is with the emerging trend of branded vulnerabilities — (this one is called shattered) — the details are lost in a sea of PR-littered vacuity and witty names for vulnerabilities.</summary></entry></feed>