lwt hiker

HTTP/2 fingerprinting: A relatively-unknown method for web fingerprinting

2022-06-17T12:30:00+00:00

HTTP/2 fingerprinting is a method by which web servers can identify which client is sending them a request¹. It can identify the browser type and version, for instance, or whether a script is used. The method relies on the internals of the HTTP/2 protocol, which are less widely known than those of its simpler predecessor, HTTP/1.1. In this post I will first give a short description of the HTTP/2 protocol, then provide details on how a web server can use the protocol’s various parameters to identify the client. Finally, I will list methods of checking and controlling a client’s HTTP/2 signature.

This is the second part of a two-part series about web fingerprinting. Read the previous post about TLS fingerprinting here.

Back to HTTP/1.1
A short introduction to HTTP/2
- Frames and streams
Client fingerprinting with HTTP/2
Where is HTTP/2 fingerprinting being used?
Controlling your HTTP/2 signature
Checking a client’s HTTP/2 signature
- The TS1 method and library
Concluding

Back to HTTP/1.1

With HTTP/1.1 - the older, more familiar protocol - a client sends a textual request to the server (usually encrypted with TLS). Here’s what Chrome’s request looks like by default:

GET / HTTP/1.1
Host: www.wikipedia.org
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9

The User-Agent header contains the client’s exact version and thus can be used to identify the client. However, it is easy to fake with any HTTP library or command line tool, and is no longer considered a reliable method of fingerprinting. A lesser-known fact is that the Accept header also takes different values depending on the client. It is just as easy to fake, however.
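To illustrate how little effort this takes, here is a minimal sketch using only Python's standard library. The header values are copied from the Chrome request above; the function name is made up for this example, and the request is only constructed, never sent.

```python
import urllib.request

# Header values copied from the Chrome request shown above.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
)
CHROME_ACCEPT = (
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,"
    "image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
)


def build_spoofed_request(url):
    """Build (but do not send) a request carrying Chrome's header values."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": CHROME_UA, "Accept": CHROME_ACCEPT},
    )


req = build_spoofed_request("https://www.wikipedia.org/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

A server looking only at these two headers would have no way of telling this script apart from the real browser.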

A short introduction to HTTP/2

HTTP/2 is a major revision of the HTTP protocol and has been around since 2015. About half of all websites now use HTTP/2², and basically all the popular sites use it by default. A great in-depth overview of the HTTP/2 protocol can be found in this article. I will detail the parts most important to this post.

You can check if a website is running HTTP/2 with the Chrome/Firefox developer tools. For example, in Firefox it would look like the following:

The primary goal of HTTP/2 is to improve the performance of websites and web applications. It achieves that goal by implementing a few core features:

Multiplexing - Multiple requests and responses can share the same TCP connection simultaneously, thus reducing the time to fetch sites with a large number of resources (images, scripts, etc.).
Prioritization - HTTP/2 supports prioritizing certain requests and responses.
Server push - In HTTP/2, the server can send resources to the client before the client requests them.

The application semantics of the HTTP protocol are unchanged, however: it is still composed of the familiar request/response model with URIs, HTTP methods, HTTP headers and status codes.

Frames and streams

HTTP/2 is a binary protocol, as opposed to the textual HTTP/1.1. The messages in HTTP/2 are composed of frames, with ten types of frames serving different purposes. Frames are always part of a stream. A single stream is usually used to fetch a single resource from the server (html, script, image, etc.). Frames from multiple streams can be sent and received simultaneously, and thus multiplexing is achieved. A typical HTTP/2 connection would usually look like the following:

In this illustration the following frames are exchanged:

SETTINGS - This frame is the first frame sent by the client and contains HTTP/2-specific settings. It is part of stream 0, which is the default root stream. No resource is retrieved on stream 0.
WINDOW_UPDATE - Increases the window size of the receiver. More on this later.
HEADERS - Contains the actual request from the client to the server. It contains the URI, the HTTP method and the client’s HTTP headers.
DATA - Contains the response from the server with the requested resource’s data.
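To make the binary framing more concrete, the following sketch decodes the fixed 9-byte header that precedes every HTTP/2 frame (a 24-bit payload length, an 8-bit type, 8 bits of flags, and a reserved bit followed by a 31-bit stream id, as laid out in RFC 7540). The function name and the hand-built example bytes are my own and not taken from any library.

```python
# The ten frame types defined by RFC 7540, keyed by their type code.
FRAME_TYPES = {
    0x0: "DATA", 0x1: "HEADERS", 0x2: "PRIORITY", 0x3: "RST_STREAM",
    0x4: "SETTINGS", 0x5: "PUSH_PROMISE", 0x6: "PING", 0x7: "GOAWAY",
    0x8: "WINDOW_UPDATE", 0x9: "CONTINUATION",
}


def parse_frame_header(header):
    """Decode the 9-byte header found at the start of every HTTP/2 frame."""
    if len(header) != 9:
        raise ValueError("an HTTP/2 frame header is exactly 9 bytes long")
    length = int.from_bytes(header[0:3], "big")       # 24-bit payload length
    frame_type = header[3]                            # 8-bit type code
    flags = header[4]                                 # 8-bit flags
    # The top bit of the last four bytes is reserved; mask it off.
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF
    return {
        "length": length,
        "type": FRAME_TYPES.get(frame_type, hex(frame_type)),
        "flags": flags,
        "stream_id": stream_id,
    }


# Hand-built example: a WINDOW_UPDATE frame (4-byte payload) on stream 0.
example_header = bytes([0, 0, 4, 0x8, 0, 0, 0, 0, 0])
print(parse_frame_header(example_header))
```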

Client fingerprinting with HTTP/2

Let’s take a deeper look at some of the frames. Each of the frames contains information that allows clients to be easily fingerprinted by the server.

The `SETTINGS` frame

With the SETTINGS frame, the client informs the server about its HTTP/2 preferences. There are six different settings³ with which the client can control parameters such as the maximum number of concurrent streams, the maximum size of the header list, the initial window size and whether it supports the server push feature. Different HTTP/2 clients tend to use different sets of settings. The same client will usually use the same set of settings regardless of what the actual HTTP request is.

To see what SETTINGS are sent by a client, I usually use nghttpd, a small HTTP/2 server that can log these parameters. Here are Chrome’s settings taken from the log:

recv SETTINGS frame 
    [SETTINGS_HEADER_TABLE_SIZE(0x01):65536]
    [SETTINGS_MAX_CONCURRENT_STREAMS(0x03):1000]
    [SETTINGS_INITIAL_WINDOW_SIZE(0x04):6291456]
    [SETTINGS_MAX_HEADER_LIST_SIZE(0x06):262144]

Seen here are 4 different settings set by Chrome to some fixed values. Here are Firefox’s settings in comparison:

recv SETTINGS frame 
    [SETTINGS_HEADER_TABLE_SIZE(0x01):65536]
    [SETTINGS_INITIAL_WINDOW_SIZE(0x04):131072]
    [SETTINGS_MAX_FRAME_SIZE(0x05):16384]

Both the kind of settings and their values are different, making the browsers easily distinguishable. As another example, curl sets the SETTINGS_ENABLE_PUSH setting to 0 to disable the server push feature, which makes it distinguishable from a browser. Because the settings aren’t easily controllable by the user, they become a reliable method for client fingerprinting.
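On the wire, the SETTINGS payload is simply a sequence of 6-byte entries: a 16-bit identifier followed by a 32-bit value. The sketch below encodes and decodes such a payload and reduces it to a comparable string. The `id:value;...` string format is an assumption made for this example, not an established signature format; the values are the ones from the Chrome and Firefox logs above.

```python
import struct

# Values taken from the nghttpd logs above, as (identifier, value) pairs.
CHROME_SETTINGS = [(1, 65536), (3, 1000), (4, 6291456), (6, 262144)]
FIREFOX_SETTINGS = [(1, 65536), (4, 131072), (5, 16384)]


def encode_settings(settings):
    """Pack (id, value) pairs into a SETTINGS payload (16-bit id, 32-bit value)."""
    return b"".join(struct.pack("!HI", ident, value) for ident, value in settings)


def decode_settings(payload):
    """Unpack a SETTINGS payload back into (id, value) pairs, preserving order."""
    if len(payload) % 6 != 0:
        raise ValueError("SETTINGS payload length must be a multiple of 6")
    return [struct.unpack_from("!HI", payload, off) for off in range(0, len(payload), 6)]


def settings_signature(settings):
    """Hypothetical comparable string: which settings are sent, in which order."""
    return ";".join(f"{ident}:{value}" for ident, value in settings)


payload = encode_settings(CHROME_SETTINGS)
print(settings_signature(decode_settings(payload)))
print(settings_signature(FIREFOX_SETTINGS))
```

Keeping the order of the entries matters here: two clients sending the same settings in a different order would still be distinguishable.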

The `WINDOW_UPDATE` frame

HTTP/2 implements a mechanism for flow-control. Flow-control gives the receiving side means to regulate the flow of traffic on a per-stream basis. This is implemented using a window size, which is a number specifying how many bytes the receiver can process. There is a window size for each stream and a window size for the connection as a whole. This mechanism is pretty similar to TCP flow-control, but since multiple streams are multiplexed on top of a single TCP connection, HTTP/2 implements its own stream-level flow-control. For a full explanation you may refer to the RFC or to this article.

The stream-level default window size is controlled by the SETTINGS_INITIAL_WINDOW_SIZE in the SETTINGS frame, visible in the settings tables above. You can observe above that Chrome uses 6MB (6291456) and Firefox uses 128KB (131072).

As the client receives data, it can adjust the window size using a WINDOW_UPDATE frame, which increases its window size.

The connection-level window size is 65535 bytes by default and can only be increased by sending a WINDOW_UPDATE frame on the special stream id 0. Most clients will send a WINDOW_UPDATE frame for stream 0 right at the beginning of the connection, immediately after sending the SETTINGS frame. This is what it looks like for Chrome:

recv WINDOW_UPDATE frame 
          (window_size_increment=15663105)

Chrome is in effect increasing the connection-level window size to 15MB (15663105 + 65535 = 15728640 bytes). Firefox, on the other hand, will increase it to 12MB, and curl uses 32MB⁴. Hence this parameter can be used for fingerprinting as well.
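The arithmetic can be spelled out in a few lines of Python. Chrome's increment is the one from the log above; Firefox's increment (12517377) is the value that appears in the TS1 example later in this post. The function name is made up for this example.

```python
DEFAULT_CONNECTION_WINDOW = 65535  # bytes, the protocol default
MIB = 1024 * 1024


def effective_connection_window(increment):
    """Connection-level window after the initial WINDOW_UPDATE on stream 0."""
    return DEFAULT_CONNECTION_WINDOW + increment


chrome_window = effective_connection_window(15663105)
firefox_window = effective_connection_window(12517377)

print(chrome_window, chrome_window // MIB)    # 15728640 15
print(firefox_window, firefox_window // MIB)  # 12582912 12
```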

The `HEADERS` frame

The HEADERS frame contains, broadly speaking, all the functionality of HTTP/1.1 in a single frame. It contains the server’s host, the resource URI, the method (GET/POST/etc.) and the client’s headers. An important difference, however, is that everything is now considered a “header”. Here’s what it looks like for Chrome:

recv (stream_id=3) :method: GET
recv (stream_id=3) :authority: localhost:8443
recv (stream_id=3) :scheme: https
recv (stream_id=3) :path: /favicon.ico
recv (stream_id=3) sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="101", "Google Chrome";v="101"
recv (stream_id=3) sec-ch-ua-mobile: ?0
recv (stream_id=3) user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36
recv (stream_id=3) sec-ch-ua-platform: "Linux"
recv (stream_id=3) accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
recv (stream_id=3) sec-fetch-site: same-origin
recv (stream_id=3) sec-fetch-mode: no-cors
recv (stream_id=3) sec-fetch-dest: image
recv (stream_id=3) accept-encoding: gzip, deflate, br
recv (stream_id=3) accept-language: en-GB,en;q=0.9
recv HEADERS frame 

The method is encoded in the special :method header, the host in :authority, the scheme in :scheme and the URI in :path. The interesting thing here is that the order of these pseudo-headers is fixed for any given client but differs between clients. From the protocol’s standpoint all orders are valid, but each client has chosen to order them differently. The pseudo-header order for some common clients (using the first letter of each pseudo-header to denote it):

Client	Order
Chrome	`masp`
Firefox	`mpas`
Safari	`mspa`
curl	`mpsa`

This seemingly small difference again makes it easy to fingerprint the clients.
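A server-side check based on the table above could be sketched as follows. The lookup table is copied from the post; the function names are hypothetical, and a real fingerprinting product would of course combine this with the other parameters.

```python
# Pseudo-header orders from the table above.
KNOWN_ORDERS = {
    "masp": "Chrome",
    "mpas": "Firefox",
    "mspa": "Safari",
    "mpsa": "curl",
}


def pseudo_header_order(header_names):
    """Reduce the header list to the first letters of its pseudo-headers."""
    return "".join(name[1] for name in header_names if name.startswith(":"))


def guess_client(header_names):
    """Look the order up in the table; anything else is reported as unknown."""
    return KNOWN_ORDERS.get(pseudo_header_order(header_names), "unknown")


# Header names in the order Chrome sent them in the log above.
chrome_headers = [":method", ":authority", ":scheme", ":path", "sec-ch-ua", "user-agent"]
print(pseudo_header_order(chrome_headers), guess_client(chrome_headers))  # masp Chrome
```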

The `PRIORITY` frame

In HTTP/2 the client can define stream priorities. For example, the client may want to prioritize receiving JS scripts over images. This article being long enough, I will not describe this mechanism in full detail. However, it is important to know two things:

The client can define a tree of streams, by specifying for each stream a parent stream. This tree defines dependencies for prioritization purposes.
The client can define for each stream a weight, which sets its priority relative to its siblings in the tree.

Both the parent of each stream and its weight are communicated via the PRIORITY frame. Firefox, for example, builds a rather complex tree of streams that looks like the following:

To create this tree Firefox by default will send a PRIORITY frame for each of streams 3, 5, 7, 9, 11 and 13, defining their parents and weights. Inspecting the nghttpd logs we observe this as follows:

recv PRIORITY frame 
	(dep_stream_id=0, weight=201, exclusive=0)
recv PRIORITY frame 
	(dep_stream_id=0, weight=101, exclusive=0)
recv PRIORITY frame 
	(dep_stream_id=0, weight=1, exclusive=0)
recv PRIORITY frame 
	(dep_stream_id=7, weight=1, exclusive=0)
recv PRIORITY frame 
	(dep_stream_id=3, weight=1, exclusive=0)
recv PRIORITY frame 
	(dep_stream_id=0, weight=241, exclusive=0)

The use of this specific tree structure and these specific weights is thus very indicative of Firefox.
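For reference, the PRIORITY frame payload is only 5 bytes long: one "exclusive" bit, a 31-bit parent stream id, and one byte holding the weight minus one (RFC 7540 defines weights in the range 1 to 256). The sketch below decodes such a payload into the same three fields nghttpd prints; the function name and the hand-built payload are my own.

```python
def parse_priority_payload(payload):
    """Decode the 5-byte payload of a PRIORITY frame."""
    if len(payload) != 5:
        raise ValueError("a PRIORITY payload is exactly 5 bytes long")
    dependency = int.from_bytes(payload[0:4], "big")
    return {
        "dep_stream_id": dependency & 0x7FFFFFFF,  # lower 31 bits: parent stream
        "weight": payload[4] + 1,                  # wire value is weight - 1
        "exclusive": bool(dependency >> 31),       # top bit: exclusive flag
    }


# Hand-built payload matching the first log entry above:
# parent stream 0, weight 201 (sent as 200), not exclusive.
first_frame = bytes([0, 0, 0, 0, 200])
print(parse_priority_payload(first_frame))
```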

Where is HTTP/2 fingerprinting being used?

HTTP/2 fingerprinting lets the server identify the client reliably before responding with data. It is therefore used for purposes similar to those of TLS fingerprinting, usually by commercial anti-DDOS and anti-bot solutions that attempt to block automated tools while allowing real browsers.

I’ve personally witnessed this method being used in the wild, such that real browsers were handed the real site’s content, but curl-impersonate, for example, got blocked. This was before HTTP/2 impersonation was fully implemented in curl-impersonate.

Controlling your HTTP/2 signature

As seen above, the HTTP/2 protocol contains a lot of details, and the parameters involved are not always configurable by the user. Tools and libraries usually try to abstract the HTTP/2 details away, and as a result each of them ends up with its own unique HTTP/2 signature which cannot be easily altered.

To control your HTTP/2 signatures there are three methods that I’m aware of:

Use a headless browser through a framework such as Puppeteer or Playwright. By using a real browser, you get that browser’s HTTP/2 signature.
curl-impersonate, my own fork of the popular curl tool, which supports impersonating real browsers. Its latest version has much better HTTP/2 impersonation support. It can impersonate the HTTP/2 signatures of Firefox and Chrome pretty well, including all the parameters mentioned in this article. Its main advantage is that it produces the correct TLS signature as well.
Write your own HTTP/2 client code through a low-level library such as nghttp2, which gives you full control over all parameters.

Checking a client’s HTTP/2 signature

You may wonder how to check a client’s HTTP/2 signature. Unlike TLS fingerprinting, which relies on an unencrypted TLS Client Hello packet, HTTP/2 frames will almost always be encrypted. This makes them a bit harder to inspect. There are two options which I like to use.

Capture the encrypted session in Wireshark while defining the SSLKEYLOGFILE environment variable. Most clients will then write a keylog file which Wireshark can use to decrypt the session. Full instructions are available here. The decrypted frames will look like the following (note the presence of the frames discussed above):
Use nghttpd, a small HTTP/2 server. It is already packaged for most Linux distributions and macOS. To use it, first create a self-signed SSL key and certificate, then run it as follows:
```
nghttpd -v 8443 server.key server.crt
```
Connect a client to https://localhost:8443 and nghttpd will log all the frames it receives with all the parameters.
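If you collect logs from several clients, pulling the settings out by hand gets tedious. Below is a small sketch that extracts the SETTINGS entries from log text, assuming the lines look like the excerpts shown earlier in this post; actual nghttpd output contains additional lines and fields, which the regular expression simply skips. The function name is made up for this example.

```python
import re

# Matches entries such as "[SETTINGS_HEADER_TABLE_SIZE(0x01):65536]".
SETTING_ENTRY = re.compile(r"\[(SETTINGS_[A-Z_]+)\(0x([0-9a-fA-F]+)\):(\d+)\]")


def parse_settings_log(log_text):
    """Return the (id, value) pairs found in the log, in order of appearance."""
    return [
        (int(hex_id, 16), int(value))
        for _name, hex_id, value in SETTING_ENTRY.findall(log_text)
    ]


# Excerpt copied from the Chrome log earlier in the post.
CHROME_LOG = """recv SETTINGS frame
    [SETTINGS_HEADER_TABLE_SIZE(0x01):65536]
    [SETTINGS_MAX_CONCURRENT_STREAMS(0x03):1000]
    [SETTINGS_INITIAL_WINDOW_SIZE(0x04):6291456]
    [SETTINGS_MAX_HEADER_LIST_SIZE(0x06):262144]
"""

print(parse_settings_log(CHROME_LOG))
```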

The TS1 method and library

TS1 is a method and a Python package I developed for the purpose of checking and comparing clients’ signatures. It is available at https://github.com/lwthiker/ts1 or on PyPI.

TS1 takes all the HTTP/2 frames the client sends until, and including, the HEADERS frame, and encodes them into a JSON format that looks like the following (shown is a truncated version):

{
    "frames": [
        {
            "frame_type": "SETTINGS",
            "stream_id": 0,
            "settings": [
                {
                    "id": 1,
                    "value": 65536
                },
                {
                    "id": 4,
                    "value": 131072
                },
                {
                    "id": 5,
                    "value": 16384
                }
            ]
        },
        {
            "frame_type": "WINDOW_UPDATE",
            "stream_id": 0,
            "window_size_increment": 12517377
        },
        {
            "frame_type": "PRIORITY",
            "stream_id": 3,
            "priority": {
                "dep_stream_id": 0,
                "weight": 201,
                "exclusive": false
            }
        },
        {
            "frame_type": "HEADERS",
            "stream_id": 15,
            "pseudo_headers": [
                ":method",
                ":path",
                ":authority",
                ":scheme"
            ]
        }
    ]
}

The JSON is then turned into a canonical, compact form according to certain rules:

{"frames": [{"frame_type": "SETTINGS", "settings": [{"id": 1, "value": 65536}, {"id": 4, "value": 131072}, {"id": 5, "value": 16384}], "stream_id": 0}, {"frame_type": "WINDOW_UPDATE", "stream_id": 0, "window_size_increment": 12517377}, {"frame_type": "PRIORITY", "priority": {"dep_stream_id": 0, "exclusive": false, "weight": 201}, "stream_id": 3}, {"frame_type": "HEADERS", "pseudo_headers": [":method", ":path", ":authority", ":scheme"], "stream_id": 15}]}

then a SHA1 hash of the string is calculated to produce the TS1 signature hash:

c9bb208868a10863867841a2e5bcb3b903719784

Different clients will have different hashes, and the hashes can be easily saved in a database for easy comparison of clients’ signatures.
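The general idea of "canonicalize, then hash" can be sketched with the standard library alone. This is not the ts1 package's API, and its canonicalization rules may differ in the details; here I simply assume that sorting the keys and using Python's default JSON separators is enough, so the resulting hashes should not be expected to match real TS1 hashes.

```python
import hashlib
import json


def canonicalize(signature):
    """Serialize with sorted keys so equal signatures yield equal strings."""
    return json.dumps(signature, sort_keys=True)


def signature_hash(signature):
    """SHA1 over the canonical string, as a hex digest."""
    return hashlib.sha1(canonicalize(signature).encode("utf-8")).hexdigest()


# The WINDOW_UPDATE entry from the example above, with keys deliberately shuffled.
shuffled = {"frames": [
    {"window_size_increment": 12517377, "stream_id": 0, "frame_type": "WINDOW_UPDATE"},
]}
ordered = {"frames": [
    {"frame_type": "WINDOW_UPDATE", "stream_id": 0, "window_size_increment": 12517377},
]}

print(canonicalize(shuffled))
print(signature_hash(shuffled) == signature_hash(ordered))  # True
```

The point of the canonical form is exactly this: the hash depends only on the content of the signature, not on how the JSON happened to be written out.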

More details about using the TS1 library can be found on the GitHub page.

Concluding

I will conclude with the same words as in the previous post: fingerprinting has become extremely common throughout the web, and while it is used for legitimate purposes such as blocking DDOS attacks, it is also making the web less open, less private and much more restrictive towards specific web clients. I have witnessed before how websites mark certain browsers as suspicious while letting in others (probably unintentionally), with TLS and HTTP fingerprinting being the main methods to achieve that.

With the added awareness about the prevalence of such techniques, I hope that browsers, web clients and future protocol designers will be more attentive towards these kinds of issues.

¹ This method, though relatively unknown, is not new. After doing my own research on the subject for curl-impersonate, I found this BlackHat presentation detailing research with similar conclusions.
² https://w3techs.com/technologies/details/ce-http2
³ https://httpwg.org/specs/rfc7540.html#SettingValues
⁴ Source code reference for curl’s window size

TLS fingerprinting: How it works, where it is used and how to control your signature

2022-06-17T12:00:00+00:00

In this two-part series of posts I would like to expand on server-side browser fingerprinting. Server-side fingerprinting is a collection of techniques used by web servers to identify which web client is making a request based on network parameters sent by the client. By web client I mean the type of client, as in which browser or CLI tool, and not a specific user of the kind a cookie identifies.

A different technique from server-side fingerprinting is client-side fingerprinting, which is when Javascript is injected to test the client. This may be the subject of a future post, and I’ll focus on server-side fingerprinting for now.

TLS fingerprinting is a widely-deployed server-side technique. It allows web servers to identify the client to a high degree of accuracy based on the first packet of the connection alone. I will give examples below to demonstrate just how easy it is to tell the client from its TLS parameters.

This is the first part of a two-part series about web fingerprinting. Read the second post about HTTP/2 fingerprinting here.

How does TLS fingerprinting work
Methods for signature calculation
- JA3
- TS1
Where is TLS fingerprinting being used?
Controlling your TLS signature
What’s next for TLS fingerprinting?

How does TLS fingerprinting work

TLS is the evolution of SSL, the protocol previously responsible for handling encrypted connections between web clients and servers. SSL is no longer in common use, but its name is still mistakenly used to refer to TLS as well.

Whenever a web client - a browser, script or a command line tool - accesses a TLS-encrypted site (https://...), it first performs a TLS handshake with the server. Here is a schematic diagram, courtesy of Wikipedia:

The first message is the TLS client hello, sent by the client to the server. In this message the client declares to the server what parts of the TLS protocol it supports. The following are examples of parameters sent by the client:

The versions of the TLS protocol the client supports (from TLS 1.0 up to TLS 1.3).
The cryptographic algorithms the client supports for data encryption, known as cipher suites.
The cryptographic algorithms the client supports for digital signatures.

As it happens, each client uses a different TLS library: Firefox uses NSS, Chrome uses BoringSSL, Safari uses Secure Transport, and Python uses OpenSSL. The result is that the above parameters differ significantly between clients. Here is an example of the cipher suites list declared by Chrome in the TLS client hello, as captured by Wireshark:

This list - its contents and the order of ciphers - is different depending on the TLS client in use. In addition, TLS is such a complex protocol that it has many extensions, each with its own set of additional parameters¹. To give some examples:

Some clients support compressing the exchanged certificates through a dedicated TLS extension.
Some clients support negotiating parameters for the underlying protocol (e.g. HTTP/2) through a dedicated TLS extension called ALPS.
Some clients add a fake TLS extension called GREASE.

Here is what Chrome’s list of TLS extensions looks like in Wireshark:

For each browser the above list of extensions is different, and the order of extensions may differ as well.
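GREASE entries are easy to recognize programmatically. RFC 8701 reserves sixteen values of the form 0x0A0A, 0x1A1A, ... 0xFAFA for this purpose, and a client that uses GREASE picks among them at random. A minimal check, with a function name of my own choosing, could look like this:

```python
def is_grease(value):
    """True if a 16-bit value is one of the reserved GREASE values (0x?A?A)."""
    high, low = value >> 8, value & 0xFF
    # Both bytes must be identical and end in the hex digit A.
    return 0 <= value <= 0xFFFF and high == low and (low & 0x0F) == 0x0A


# 0x9A9A and 0x5A5A are GREASE values; 4865 (0x1301) is a regular cipher suite id.
print(is_grease(0x9A9A), is_grease(0x5A5A), is_grease(4865))  # True True False
```

Because the concrete GREASE value changes between connections, a fingerprinting method has to either ignore these entries or normalize them, otherwise the same browser would produce many different signatures.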

The following is a comparison table demonstrating notable differences in TLS signatures of common clients²:

	Chrome	Safari	Firefox	Python
No. of cipher suites	16	27	17	43
No. of signature algorithms	8	11	11	20
ALPS extension	Yes	No	No	No
Certificate compression method	Brotli	Zlib	None	None
GREASE extension	Yes	Yes	No	No

With this in mind it is obvious that web clients can be easily distinguished based on their TLS signature. The remarkable thing is that this information is all available upon the very first packet of the session to the server. The server can thus infer which client is connected even before responding back with any kind of data. Moreover, until encrypted client hello becomes the standard, any third-party listener on the network can infer this as well.

Methods for signature calculation

JA3

JA3 is a popular method used to formalize the notion of a TLS fingerprint. It takes a Client Hello packet and produces a hash identifying the client.

JA3 works by concatenating multiple fields of the Client Hello and then hashing them. The fields are:

SSLVersion,Cipher,SSLExtension,EllipticCurve,EllipticCurvePointFormat

For example, for a Chrome browser this would be:

771,39578-4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,23130-0-23-65281-10-11-35-16-5-13-18-51-45-43-27-17513-39578-21,39578-29-23-24,0

This is then hashed with MD5 to produce the JA3 signature:

e3501e1725c83830dd40f12930cc6eaa

JA3 is the de facto standard in this regard and has been integrated, for example, into Wireshark.

It is important to note that JA3 does not take into account all the different parameters in the Client Hello. This means that it is possible to have two different Client Hellos with the same JA3 signature³.
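The string construction described above is simple enough to sketch in a few lines. The field values below are the ones from the Chrome example; the function names are my own. Note that the reference JA3 implementation is documented as skipping GREASE values, whereas this sketch performs no such filtering and just reproduces the string format as shown.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Join the five fields with commas, and the values within a field with dashes."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)


def ja3_hash(ja3):
    """MD5 of the JA3 string, as a hex digest."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()


# Field values copied from the Chrome example above.
chrome_ja3 = ja3_string(
    version=771,
    ciphers=[39578, 4865, 4866, 4867, 49195, 49199, 49196, 49200,
             52393, 52392, 49171, 49172, 156, 157, 47, 53],
    extensions=[23130, 0, 23, 65281, 10, 11, 35, 16, 5, 13, 18, 51, 45, 43,
                27, 17513, 39578, 21],
    curves=[39578, 29, 23, 24],
    point_formats=[0],
)
print(chrome_ja3)
print(ja3_hash(chrome_ja3))
```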

TS1

TS1 is my take on creating a unique hash per TLS signature. It was inspired by JA3 but is more comprehensive in that it encodes all the parameters of the TLS Client Hello message. I’ve created and used it myself while working on curl-impersonate.

TS1 encodes the parameters of the Client Hello message in JSON format according to certain rules:

{"client_hello": {"ciphersuites": [4865, 4867, 4866, 49195, 49199, 52393, 52392, 49196, 49200, 49162, 49161, 49171, 49172, 156, 157, 47, 53], "comp_methods": [0], "extensions": [{"type": "server_name"}, {"length": 0, "type": "extended_master_secret"}, {"length": 1, "type": "renegotiation_info"}, {"length": 14, "supported_groups": [29, 23, 24, 25, 256, 257], "type": "supported_groups"}, {"ec_point_formats": [0], "length": 2, "type": "ec_point_formats"}, {"length": 0, "type": "session_ticket"}, {"alpn_list": ["h2", "http/1.1"], "length": 14, "type": "application_layer_protocol_negotiation"}, {"length": 5, "status_request_type": 1, "type": "status_request"}, {"length": 10, "sig_hash_algs": [1027, 1283, 1539, 515], "type": "delegated_credentials"}, {"key_shares": [{"group": 29, "length": 32}, {"group": 23, "length": 65}], "length": 107, "type": "keyshare"}, {"length": 5, "supported_versions": ["TLS_VERSION_1_3", "TLS_VERSION_1_2"], "type": "supported_versions"}, {"length": 24, "sig_hash_algs": [1027, 1283, 1539, 2052, 2053, 2054, 1025, 1281, 1537, 515, 513], "type": "signature_algorithms"}, {"length": 2, "psk_ke_mode": 1, "type": "psk_key_exchange_modes"}, {"length": 2, "record_size_limit": 16385, "type": "record_size_limit"}, {"type": "padding"}], "handshake_version": "TLS_VERSION_1_2", "record_version": "TLS_VERSION_1_0", "session_id_length": 32}}

and then calculates its SHA1 hash to produce the TS1 signature:

889b4383dcfee0d3dc4c472d3d40568028842b3e

Different clients will have different hashes, and the hashes can be easily saved in a database for easy comparison of clients’ signatures.

TS1 signatures encode more parameters than JA3 and therefore represent a more accurate picture of the client. Another advantage is that, thanks to the use of JSON, the format can accommodate additional TLS extensions that are not yet defined, and which may hold crucial client-identifying information in the future. The disadvantage of TS1 is that its JSON format is much more verbose than JA3’s simple format.

Where is TLS fingerprinting being used?

TLS fingerprinting is naturally used by anti-bot and anti-DDOS solutions to protect web pages against massive crawling or DDOS attacks. By checking if the client is a browser or a script (i.e. a bot), they can decide whether to allow the request, block it, or introduce an additional Javascript-based challenge to further test the client.

Another interesting use case that caught my attention, though I haven’t seen it myself, is that of phishing campaigns. A phishing website will use TLS fingerprinting to detect whether the client is a browser or not. It will serve the phishing content to unsuspecting victims using a browser, but will block automatic crawling by security products attempting to identify phishing websites.

Controlling your TLS signature

Most of the parameters in the TLS client hello message are not controllable by scripts or command line tools. In Python, for example, you can control the cipher suites list, but it pretty much ends there. Even with that in place, the underlying TLS library may not send the exact list you specified, as is the case with Python and OpenSSL.
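Here is a minimal sketch of what that control looks like with Python's standard `ssl` module. The cipher string `ECDHE+AESGCM` is just an example in OpenSSL's cipher-list syntax, and the function name is made up. The exact output depends on the OpenSSL build Python is linked against; in particular, the Python documentation notes that TLS 1.3 cipher suites cannot be disabled through `set_ciphers()`, so they typically remain in the list no matter what was requested.

```python
import ssl


def offered_cipher_names(cipher_string):
    """Apply an OpenSSL cipher string and return the suites the context enables."""
    context = ssl.create_default_context()
    context.set_ciphers(cipher_string)
    return [cipher["name"] for cipher in context.get_ciphers()]


names = offered_cipher_names("ECDHE+AESGCM")
for name in names:
    print(name)
```

Comparing the printed list with what was requested is a quick way to see how much of the cipher list is actually under your control.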

The best currently available methods that I’m aware of for controlling the full TLS signature are:

Puppeteer, which allows you to run a headless Chrome browser and control it with a script. By using a real browser, you get the TLS signature of that browser.
curl-impersonate, my own fork of the popular curl tool with support for faking TLS signatures to impersonate a few popular browsers. It also comes with a fork of libcurl, called libcurl-impersonate, so you can use it programmatically in your code. Another option is to inject libcurl-impersonate into an already running application that uses the regular libcurl. You can read about the technical aspects of curl-impersonate in my previous posts (part 1, part 2), and find more documentation in the GitHub repository. An advantage of curl-impersonate is that the correct HTTP/2 fingerprint will be used as well. More on this in the next post.
JA3Transport is a Go library that intends to fake JA3 signatures. I didn’t test it myself.

What’s next for TLS fingerprinting?

TLS fingerprinting has become extremely common throughout the web, and while it is used for legitimate purposes such as blocking DDOS attacks, it is also making the web less open, less private and much more restrictive towards specific web clients.

It is my impression that current tools for faking a client’s TLS signature are still immature. Using curl-impersonate for example requires you to write your own C code or inject it into existing applications using libcurl.

The best solution would be for one of the TLS libraries to provide more fine-grained control for users. The kind of functionality that might be needed:

Allowing users to control the order of TLS extensions.
Allowing users to control the exact list of ciphers.
Supporting the latest TLS extensions that some browsers use.

When this happens, packages for popular programming languages can emerge to take advantage of the functionality and to control their TLS signatures.

¹ The large number of available TLS extensions can be seen at https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml.
² Chrome 101, Firefox 100, Safari 15.4, Python 3.8.10 with OpenSSL 1.1.1f and the requests library.
³ For example, the parameters inside the TLS compressed-certificate extension are not taken into account.

Firefox appears to be flagged as suspicious by Cloudflare

2022-05-21T15:30:00+00:00

Update: Cloudflare’s response indicates that this is a customer-specific rule and not a global policy. They did not mention what kind of rule is triggering this behavior though.

It appears that Firefox is now flagged as “suspicious” by Cloudflare’s anti-bot protection. When you browse to certain websites hosted on Cloudflare’s CDN and using this service, Firefox is served back a Javascript challenge. This is what it looks like:

You can test it yourself: Browse to https://www.g2.com, which is a software reviews website. If you use Chrome or Edge, you will get the site’s content. However, use Firefox and you’ll most likely be served the challenge instead (make sure to clear cookies before). This basically means you must have JS enabled to access the site and you will incur a 2-3 seconds delay before the content is served.

This is not a good prospect for the open-source browser. If this behavior gets adopted on more sites, we can expect even more users to leave Firefox, as every web access will take a few more seconds.

From a technical standpoint it doesn’t make sense either. I don’t see any reason to “suspect” Firefox is a bot. If anything, Chrome is probably being used for web scraping at a much higher rate through projects like Puppeteer.

To be clear, I don’t believe this behavior is intentional on Cloudflare’s side. The way they identify which browser you are using is through a combination of TLS fingerprinting and HTTP fingerprinting (on which I might write an extended explanation later on). What I believe to be happening is that Cloudflare whitelists the signatures of browsers with large-enough market share, and Firefox happens to fall below that threshold. Even if that is the case, I do expect Cloudflare to actively whitelist Firefox. Open-source browsers are an important part of the web and should not be treated differently than their closed-source counterparts.

Impersonating Chrome, too

2022-02-20T10:00:00+00:00

This is a continuation of the previous post. If you didn’t read it, please go ahead and read at least until the TL;DR section. In summary, various web services perform TLS fingerprinting to identify whether you run a real browser like Chrome or Firefox or whether it is a tool like curl or a Python script. I created curl-impersonate, a modified version of curl that performs TLS handshakes which are identical to Firefox’s, thereby tricking said services into believing it is a real browser.

After uploading the repository I posted it to Hacker News. On the thread someone suggested that

They should really be impersonating Chrome. If this takes off, Firefox has such a small user share that I could see sites just banning Firefox altogether, like they do with Tor

Challenge accepted!

TL;DR

I re-compiled curl with BoringSSL, Chrome’s TLS library.
I tweaked curl’s TLS code to perform a similar TLS handshake to Chrome, enabling some Google-specific TLS extensions on the way.
As this was still being detected by TLS fingerprinters, I had to dive deeper into the encrypted session.
Two small but crucial differences in the HTTP/2 frames revealed further how those fingerprinters work.
I then patched the HTTP/2 code as well to impersonate Chrome.
You can find the updated curl-impersonate, with full Chrome 98 impersonation, in the GitHub repository.

Let’s look at the details.

Using BoringSSL

The first part of impersonating a browser is using the same TLS library. Otherwise you are going to hit a wall of missing features and varying implementations, as we shall see below. For Firefox I used NSS as mentioned in the previous post. Chrome uses BoringSSL, described as “a fork of OpenSSL that is designed to meet Google’s needs.” At first, looking at curl’s list of SSL libraries, I didn’t find BoringSSL and concluded that it was not supported. But it really is supported. You just replace OpenSSL with BoringSSL at build time and it works:

./configure --with-openssl=/path/to/boringssl

The full build procedure is in the Dockerfile.

The Client Hello message

The first message sent by TLS clients is called Client Hello. It contains a list of parameters and extensions, all of which can be used to fingerprint the client. For example, the ja3 method calculates a hash of some of them to create a unique fingerprint for each client. Our goal here is to match curl’s Client Hello and make it completely identical to Chrome’s. Here’s the important part of Chrome’s Client Hello message (Chrome 98, Windows 10, non-incognito):

Handshake Protocol: Client Hello
    Handshake Type: Client Hello (1)
    Length: 508
    Version: TLS 1.2 (0x0303)
    Random: b46aad...
    Session ID Length: 32
    Session ID: 74c03b...
    Cipher Suites Length: 32
    Cipher Suites (16 suites)
    Compression Methods Length: 1
    Compression Methods (1 method)
    Extensions Length: 403
    Extension: Reserved (GREASE) (len=0)
    Extension: server_name (len=17)
    Extension: extended_master_secret (len=0)
    Extension: renegotiation_info (len=1)
    Extension: supported_groups (len=10)
    Extension: ec_point_formats (len=2)
    Extension: session_ticket (len=0)
    Extension: application_layer_protocol_negotiation (len=14)
    Extension: status_request (len=5)
    Extension: signature_algorithms (len=18)
    Extension: signed_certificate_timestamp (len=0)
    Extension: key_share (len=43)
    Extension: psk_key_exchange_modes (len=2)
    Extension: supported_versions (len=7)
    Extension: compress_certificate (len=3)
    Extension: application_settings (len=5)
    Extension: Reserved (GREASE) (len=1)
    Extension: padding (len=203)

The process of matching curl’s Client Hello consists of:

Matching the Cipher Suites list, by using curl’s built-in --ciphers option.
Enabling, disabling and modifying various extensions by modifying curl’s TLS code.

I detailed some of the process in the previous post, the main difference now being the use of BoringSSL instead of NSS. There were, however, some interesting Google-specific extensions to be dealt with.

GREASE

As can be seen above, Chrome adds two extensions called GREASE before and after the main extension list. Firefox doesn’t do that, and in fact I don’t think NSS even supports it. The purpose of GREASE is to ensure TLS servers are future-proof by mixing in non-existent extensions, expecting the servers to ignore them until they become supported. There is a good explanation in this Cloudflare blog post. To enable GREASE in curl, all that was needed was to call a single function:

SSL_CTX_set_grease_enabled(backend->ctx, 1);

Because we are using the same BoringSSL implementation as Chrome, this adds the GREASE extensions at exactly the same place.
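For illustration, here is a minimal Python sketch of how the GREASE values can be recognized in a captured Client Hello. It assumes the reserved value pattern from RFC 8701 (0x0A0A, 0x1A1A, ... 0xFAFA); the helper names are hypothetical and are not part of curl or BoringSSL.

```python
# Sketch: recognise TLS GREASE values as reserved by RFC 8701.
# The reserved 16-bit values are 0x0A0A, 0x1A1A, ..., 0xFAFA.

def is_grease(value: int) -> bool:
    """Return True if a 16-bit extension/cipher/group ID is a GREASE value."""
    high, low = value >> 8, value & 0xFF
    return high == low and (low & 0x0F) == 0x0A

def strip_grease(ids):
    """Drop GREASE entries, e.g. before comparing two extension lists."""
    return [i for i in ids if not is_grease(i)]

if __name__ == "__main__":
    # Hypothetical extension list: GREASE, server_name (0), GREASE, padding (21)
    print(strip_grease([0x3A3A, 0, 0x8A8A, 21]))
```

Filtering them out before comparing two captures is useful because the concrete GREASE values are not fixed and may differ from one connection to the next.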

Compressed Certificates

Chrome adds the compress_certificate extension. This is what it looks like:

Extension: compress_certificate (len=3)
    Type: compress_certificate (27)
    Length: 3
    Algorithms Length: 2
    Algorithm: brotli (2)

Chrome is telling the server here that it supports receiving certificates compressed using the Brotli compression algorithm. Brotli was developed at Google and is the br in the Accept-Encoding: gzip, deflate, br HTTP header that most browsers send out today. Going through the Chromium source code we find that this TLS extension is enabled in cert_compression.cc. Again, it is a matter of a single line:

SSL_CTX_add_cert_compression_alg(ctx, TLSEXT_cert_compression_brotli,
                                 nullptr /* compression not supported */,
                                 DecompressBrotliCert);

Here DecompressBrotliCert is a simple proxy function between BoringSSL and the Brotli library. Copying the one-liner and the function over to curl enables the compress_certificate extension.

ALPS

In the previous post I mentioned the ALPN extension which allows the client and server to decide whether to use HTTP/1.1 or HTTP/2 during the TLS handshake. It’s being used by both Firefox and Chrome. Google took this one step further and proposed the ALPS extension, which allows the client to send its HTTP/2 SETTINGS during the TLS handshake (more about SETTINGS later). This is the application_settings extension in the Client Hello. As of this writing, it is a non-standard TLS extension, but Google being Google, they love experimenting with our browsers and Chrome already adds it to its extension list. Here is the commit enabling ALPS in Chrome about a year ago.

In the end, it was again a matter of adding a one-liner to curl, and now curl supports ALPS as well¹:

SSL_add_application_settings(backend->handle, "h2", 2, NULL, 0);

Comparing the TLS fingerprint

By the end of this process, the Client Hello is identical. Here is Chrome’s TLS fingerprint from ja3er.com:

And here is ours:

$ curl-impersonate \
    --ciphers TLS_AES_128_GCM_SHA256,TLS_AES_256_GCM_SHA384,TLS_CHACHA20_POLY1305_SHA256,ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256,ECDHE-ECDSA-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-ECDSA-CHACHA20-POLY1305,ECDHE-RSA-CHACHA20-POLY1305,ECDHE-RSA-AES128-SHA,ECDHE-RSA-AES256-SHA,AES128-GCM-SHA256,AES256-GCM-SHA384,AES128-SHA,AES256-SHA \
    -X GET 'https://ja3er.com/json' | jq .
{
  "ja3_hash": "b32309a26951912be7dba376398abc3b",
  "ja3": "771,4865-4866-4867-49195-49199-49196-49200-52393-52392-49171-49172-156-157-47-53,0-23-65281-10-11-35-16-5-13-18-51-45-43-27-21,29-23-24,0",
}

It’s identical.
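To make the comparison concrete, here is a rough Python sketch of how a ja3 string such as the one above is assembled and hashed. It follows the publicly documented ja3 format (five comma-separated fields, values joined with dashes, then MD5). The function names are hypothetical, and the field values are copied from the output above.

```python
import hashlib

def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Join the five Client Hello fields into a ja3 string."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(v) for v in field) for field in fields)

def ja3_hash(ja3):
    """The ja3 fingerprint is the MD5 hex digest of the ja3 string."""
    return hashlib.md5(ja3.encode("ascii")).hexdigest()

# Values taken from the curl-impersonate output above.
CHROME_98 = ja3_string(
    771,
    [4865, 4866, 4867, 49195, 49199, 49196, 49200, 52393, 52392,
     49171, 49172, 156, 157, 47, 53],
    [0, 23, 65281, 10, 11, 35, 16, 5, 13, 18, 51, 45, 43, 27, 21],
    [29, 23, 24],
    [0],
)

if __name__ == "__main__":
    print(CHROME_98)
    print(ja3_hash(CHROME_98))
```

Since the hash covers the order of the values as well as their presence, a client that offers the same ciphers in a different order ends up with a different fingerprint.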

Diving deeper

Remarkably, even with an identical TLS fingerprint, Protectify was still able to identify and block our dear curl-impersonate (Protectify is the fake name of the company from the previous post). To understand how, we must dive deeper into the encrypted TLS session.

Decrypting the TLS session

To inspect what’s inside the TLS session we first need to capture it in Wireshark and decrypt it. This is easily done by defining the SSLKEYLOGFILE environment variable. Both Chrome and Firefox will then write a keylog file to the specified location. You can then feed this file to Wireshark and it will decrypt the session for you. Handy!
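The same trick works for scripts. As a small sketch, Python’s standard ssl module (3.8 and later, built against OpenSSL 1.1.1 or newer) can write the same keylog format through SSLContext.keylog_filename. The helper name and the file path below are placeholders.

```python
import os
import ssl
import tempfile

def make_logging_context(keylog_path):
    """Create a client TLS context that appends session secrets to keylog_path."""
    ctx = ssl.create_default_context()
    # Writes the key log format that Wireshark accepts for decryption.
    ctx.keylog_filename = keylog_path
    return ctx

if __name__ == "__main__":
    # Placeholder location; feed this file to Wireshark afterwards.
    path = os.path.join(tempfile.gettempdir(), "tls-keys.log")
    ctx = make_logging_context(path)
    print("TLS secrets will be appended to", ctx.keylog_filename)
```

Any connection made with this context (for example by passing it to http.client.HTTPSConnection) then becomes decryptable in a capture.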

Here’s what a decrypted Chrome session to wikipedia.org looks like:

The session begins as follows:

Chrome sends the Client Hello message.
The server responds with the Server Hello message.
The server sends its certificate and the TLS handshake is done.
The client and server immediately begin an HTTP/2 session (Remember ALPN?).
Chrome sends a SETTINGS frame.
Chrome sends a HEADERS frame with the GET / request.

The SETTINGS frame

The SETTINGS frame is used to notify the server about a few HTTP/2 specific settings. Here’s what it looks like in Chrome:

Stream: SETTINGS, Stream ID: 0, Length 30
    ...
    Settings - Header table size : 65536
    Settings - Max concurrent streams : 1000
    Settings - Initial Windows size : 6291456
    Settings - Max header list size : 262144
    Settings - Unknown (10858) : 1359919199

Therein lies our first problem. Curl’s SETTINGS look completely different:

Stream: SETTINGS, Stream ID: 0, Length 18
    ...
    Settings - Max concurrent streams : 100
    Settings - Initial Windows size : 33554432
    Settings - Enable PUSH : 0

There are four notable differences:

Curl is sending different values for its settings.
Curl is missing Header table size and Max header list size.
Curl disables HTTP/2 server push because the command line curl doesn’t support it. This sticks out like a sore thumb in the SETTINGS frame.
Chrome throws in a random setting at the end (shown as Unknown). My guess is that this is another Google invention with a similar purpose to the TLS GREASE explained above.

Patching curl’s relevant function solves these issues and makes the SETTINGS frame look identical. Here’s the full patch.
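To see why the two frames differ in length, here is a minimal Python sketch that serializes both SETTINGS payloads. It assumes the wire format from the HTTP/2 specification (RFC 7540: each setting is a 16-bit identifier followed by a 32-bit value) and the standard identifier numbers. The constant and function names are illustrative only and are not taken from curl’s patch.

```python
import struct

# Standard HTTP/2 setting identifiers (RFC 7540, section 6.5.2).
HEADER_TABLE_SIZE = 0x1
ENABLE_PUSH = 0x2
MAX_CONCURRENT_STREAMS = 0x3
INITIAL_WINDOW_SIZE = 0x4
MAX_HEADER_LIST_SIZE = 0x6

def settings_payload(settings):
    """Serialize (identifier, value) pairs in order: 2 + 4 bytes each."""
    return b"".join(struct.pack("!HI", ident, value) for ident, value in settings)

# Values as shown in the Wireshark captures above.
CHROME = [
    (HEADER_TABLE_SIZE, 65536),
    (MAX_CONCURRENT_STREAMS, 1000),
    (INITIAL_WINDOW_SIZE, 6291456),
    (MAX_HEADER_LIST_SIZE, 262144),
    (10858, 1359919199),  # the "Unknown" setting; 10858 == 0x2A6A
]
CURL = [
    (MAX_CONCURRENT_STREAMS, 100),
    (INITIAL_WINDOW_SIZE, 33554432),
    (ENABLE_PUSH, 0),
]

if __name__ == "__main__":
    print(len(settings_payload(CHROME)))  # 30, matching "Length 30"
    print(len(settings_payload(CURL)))    # 18, matching "Length 18"
```

The entries are written in the order given, so both the values and their ordering end up on the wire where a server can inspect them.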

The HEADERS frame

In HTTP/2, the HEADERS frame combines the method (e.g. GET), the URI and the HTTP headers all into a unified format. Here’s Chrome’s HEADERS frame:

Stream: HEADERS, Stream ID: 1, Length 438, GET /
    ...
    Header: :method: GET
    Header: :authority: wikipedia.org
    Header: :scheme: https
    Header: :path: /
    ...
    (Regular HTTP headers follow)

It always begins with the pseudo-headers :method, :authority, :scheme and :path whose meaning is clear. But here’s the funny thing. curl sends them out in a different order! Look:

Stream: HEADERS, Stream ID: 1, Length 434, GET /
    ...
    Header: :method: GET
    Header: :path: /
    Header: :scheme: https
    Header: :authority: wikipedia.org
    ...

This is completely fine from an HTTP standpoint, but is being leveraged to fingerprint our client. curl, Firefox, Chrome - each sends them out in a different order.

You can’t control the order of the pseudo-headers from the curl command line. It’s hard-coded into curl’s code, and it’s always the same. Luckily, the fix is simple and involves re-ordering them into the desired order.
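As a rough illustration of how little is needed to tell the two clients apart, the following Python sketch reduces a header list to the order of its pseudo-headers. The compact "m,a,s,p" notation and the function name are assumptions made for this example, and the trailing user-agent entry is a stand-in for the regular headers.

```python
def pseudo_header_order(header_names):
    """Return the order of HTTP/2 pseudo-headers as initials, e.g. 'm,a,s,p'."""
    return ",".join(name[1] for name in header_names if name.startswith(":"))

# Orders as seen in the two HEADERS frames above.
CHROME = [":method", ":authority", ":scheme", ":path", "user-agent"]
CURL = [":method", ":path", ":scheme", ":authority", "user-agent"]

if __name__ == "__main__":
    print(pseudo_header_order(CHROME))  # m,a,s,p
    print(pseudo_header_order(CURL))    # m,p,s,a
```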

Concluding

After matching the TLS signature and the HTTP/2 signature, curl-impersonate now behaves similarly enough to Chrome to trick TLS fingerprinters. In the repository you may find curl_chrome98, a wrapper script that launches curl-impersonate with all the correct headers and flags to make it impersonate Chrome 98 on a Windows 10 machine.

Impersonating browsers is an endless cat-and-mouse game. The rapid release of new browser versions means TLS signatures change by the month. Tomorrow Chrome may come up with another Google-specific extension, or start using Encrypted Client Hello, or even turn on HTTP3 by default. Each such change will require a different set of modifications for curl-impersonate to work.

¹ curl adds the ALPS extension to the Client Hello. For ALPS to fully work the server needs to respond with an encrypted ALPS extension, and the client to send its application settings back (e.g. the HTTP/2 SETTINGS frame). I couldn’t test how curl behaves in this situation as no server seems to support it right now, not even google.com.

Making curl impersonate Firefox

2022-02-17T16:00:00+00:00

Update: The second part about impersonating Chrome is up.

In the last post I analyzed an API used by a website to fetch data and display it to the user. I did that in order to automate fetching that same data once a day. The API required customized HTTP headers which I guess were some sort of bot protection. This time I faced a much more sophisticated mechanism: a commercial bot protection solution.

Bot protections are designed to protect websites against web scraping. There are a lot of commercial solutions available by known companies. Here I was getting blocked by one of them, let’s call the company by the fake name Protectify.

My motivation was similar to the last post. I wanted to perform a single GET request to a webpage automatically once a day. When using the browser, the website immediately returns the correct content. However, when using curl or a Python script to perform the exact same GET request, we get back:

HTTP/1.1 503 Service Temporarily Unavailable
...
Server: protectify
...

Checking your browser before accessing www.secured-by-protectify.com
This process is automatic. Your browser will redirect to your requested content shortly.

The returned HTML also contains some obfuscated Javascript code. Basically what’s happening is that the website is served by Protectify’s servers. They somehow detected the use of an automated tool to perform the HTTP request, and served us a Javascript-based challenge that only a real browser would be able to solve.

The data I was trying to fetch was publicly available information which could be taken from other sources. However, this piqued my interest. A real browser does not get the JS challenge, but is immediately served the real content. How could Protectify know that I was using curl to access the website?

TL;DR

Protectify’s servers fingerprint the HTTP client used (e.g. browser, curl) before serving back content.
They use a variety of parameters, most notably the TLS handshake and the HTTP headers.
In case your fingerprint does not match that of a known browser, the Javascript challenge is served instead of the real content.

To bypass it,

I compiled a special version of curl that behaves, network-wise, identically to Firefox. I called it curl-impersonate.
curl-impersonate is able to trick Protectify and gets served the real content.
You can find a Docker image that compiles it in this repository.

This was done in a very hacky way, but I hope the findings below could be turned into a real project. Imagine that you could run:

curl --impersonate ff95

and it would behave exactly like Firefox 95. It can then be wrapped with a nice Python library.

Anyway, here are the technical details.

The technical details

Let’s try to understand how Protectify identifies that we are a bot. At first I tried to send the exact same HTTP headers that Firefox sends. I used Firefox 95 on a Windows virtual machine to see what headers are sent. I then ran curl with the exact same headers:

$ curl 'https://secured-by-protectify.com' \
    -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8' \
    -H 'Accept-Language: en-US,en;q=0.5' \
    -H 'Accept-Encoding: gzip, deflate, br' \
    -H 'Connection: keep-alive' \
    -H 'Upgrade-Insecure-Requests: 1' \
    -H 'Sec-Fetch-Dest: document' \
    -H 'Sec-Fetch-Mode: navigate' \
    -H 'Sec-Fetch-Site: none' \
    -H 'Sec-Fetch-User: ?1'

This doesn’t work. We get back HTTP/1.1 503 Service Temporarily Unavailable.

There is also an open-source Python package which claims to “bypass Protectify’s anti-bot page”. It didn’t work with this site either.

The TLS handshake

When an HTTP client opens a connection to a website with SSL/TLS enabled (i.e. https://…) it first performs a TLS handshake. The handshake’s purpose is to verify the other side’s authenticity and establish the encrypted connection. The first message sent by the client is called “Client Hello” and it contains quite a lot of TLS parameters. Here is a Wireshark capture from a regular curl invocation:

I’m far from a TLS expert, but it is clear that in this message alone there is a myriad of parameters, extensions and configurations which are sent by our client. Each TLS client will send a different “Client Hello” message, and it has been known for a long time that it can be used to identify which browser or tool initiated the connection. See, for example, the ja3 project.

The “Cipher Suites” list

Part of the “Client Hello” message is the Cipher Suites list, visible above. It indicates to the server what encryption methods the client supports. This is what curl’s cipher suite list looks like by default:

Cipher Suites (31 suites)
    Cipher Suite: TLS_AES_256_GCM_SHA384 (0x1302)
    Cipher Suite: TLS_CHACHA20_POLY1305_SHA256 (0x1303)
    Cipher Suite: TLS_AES_128_GCM_SHA256 (0x1301)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 (0xc02c)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (0xc030)
    Cipher Suite: TLS_DHE_RSA_WITH_AES_256_GCM_SHA384 (0x009f)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca9)
    ...

Notably, curl sends 31 different possible ciphers. Compare it to Firefox’s 17, which are also ordered differently:

Cipher Suites (17 suites)
    Cipher Suite: TLS_AES_128_GCM_SHA256 (0x1301)
    Cipher Suite: TLS_CHACHA20_POLY1305_SHA256 (0x1303)
    Cipher Suite: TLS_AES_256_GCM_SHA384 (0x1302)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 (0xc02b)
    Cipher Suite: TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (0xc02f)
    Cipher Suite: TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca9)
    Cipher Suite: TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (0xcca8)
    ...

It is highly likely that Protectify uses this list to detect known browsers. Hence my first attempt was to cause curl to use the same cipher suite as Firefox. I converted the list to OpenSSL’s format using this reference and tried my luck with the --ciphers option:

$ curl 'https://secured-by-protectify.com' \
    --ciphers TLS_AES_128_GCM_SHA256,TLS_CHACHA20_POLY1305_SHA256,TLS_AES_256_GCM_SHA384,ECDHE-ECDSA-AES128-GCM-SHA256,ECDHE-RSA-AES128-GCM-SHA256,ECDHE-ECDSA-CHACHA20-POLY1305,ECDHE-RSA-CHACHA20-POLY1305,ECDHE-ECDSA-AES256-GCM-SHA384,ECDHE-RSA-AES256-GCM-SHA384,ECDHE-ECDSA-AES256-SHA,ECDHE-ECDSA-AES128-SHA,ECDHE-RSA-AES128-SHA,ECDHE-RSA-AES256-SHA,AES128-GCM-SHA256,AES256-GCM-SHA384,AES128-SHA,AES256-SHA \
    -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0' \
    ...

Well, it fails. 503 Service Temporarily Unavailable again. Looking at Wireshark, the cipher suite list contains 18 ciphers, even though we requested only 17. OpenSSL, the library curl uses by default for TLS, had automatically added the following cipher:

Cipher Suite: TLS_EMPTY_RENEGOTIATION_INFO_SCSV (0x00ff)

This behavior is documented by OpenSSL but I could not find a way to disable it. This makes it extremely easy to detect OpenSSL clients. curl and Python use OpenSSL, but no major browser does. We’ll have to choose a different route.
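To illustrate how cheap such a check is, here is a hypothetical Python sketch of a server-side heuristic. It is a guess at the general idea, not Protectify’s actual logic, and the position of the extra value in the list is assumed for the example.

```python
TLS_EMPTY_RENEGOTIATION_INFO_SCSV = 0x00FF

def carries_scsv(cipher_ids):
    """Heuristic: does the Client Hello cipher list contain the SCSV value?"""
    return TLS_EMPTY_RENEGOTIATION_INFO_SCSV in cipher_ids

# First Firefox ciphers from the capture above.
FIREFOX = [0x1301, 0x1303, 0x1302, 0xC02B, 0xC02F, 0xCCA9, 0xCCA8]
# The same list with OpenSSL's extra value (appended here for illustration).
CURL_OPENSSL = FIREFOX + [TLS_EMPTY_RENEGOTIATION_INFO_SCSV]

if __name__ == "__main__":
    print(carries_scsv(FIREFOX))       # False
    print(carries_scsv(CURL_OPENSSL))  # True
```

A single membership test is enough to separate the two lists, which is why matching the ciphers alone did not help while OpenSSL was still in use.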

Using NSS

Firefox does not use OpenSSL. It uses NSS, another library for TLS communications. Luckily, curl can be compiled against a large range of TLS libraries, NSS included. So I compiled curl against NSS instead of OpenSSL. This was pretty technical and took a while to figure out. You can find the full build procedure in the repository. I named the resulting binary curl-impersonate.

With this in hand, I converted once more the cipher list into the right format, which can be found in this curl source file. Running our new curl-impersonate:

$ curl-impersonate 'https://secured-by-protectify.com' \
    --ciphers aes_128_gcm_sha_256,chacha20_poly1305_sha_256,aes_256_gcm_sha_384,ecdhe_ecdsa_aes_128_gcm_sha_256,ecdhe_rsa_aes_128_gcm_sha_256,ecdhe_ecdsa_chacha20_poly1305_sha_256,ecdhe_rsa_chacha20_poly1305_sha_256,ecdhe_ecdsa_aes_256_gcm_sha_384,ecdhe_rsa_aes_256_gcm_sha_384,ecdhe_ecdsa_aes_256_sha,ecdhe_ecdsa_aes_128_sha,ecdhe_rsa_aes_128_sha,ecdhe_rsa_aes_256_sha,rsa_aes_128_gcm_sha_256,rsa_aes_256_gcm_sha_384,rsa_aes_128_sha,rsa_aes_256_sha \
    -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0' \
    -H ...

and… it fails again. However, looking at Wireshark, the Cipher Suite option now matches exactly the one Firefox sends. Left is curl-impersonate, right is Firefox:

So we are heading in the right direction.

The rest of the Client Hello message

The Cipher Suites list is just one part of the Client Hello message. Most importantly, the Client Hello contains a list of TLS extensions. Each client produces a different set of extensions by default. Anti-bot mechanisms use this to identify which HTTP client was used. The goal here was to make curl-impersonate produce the exact same extension list as Firefox. I will detail some of the process. The bottom line is that by playing with curl’s source code, and putting in the right modifications, I managed to make its Client Hello message look exactly like Firefox’s.

Here is the Client Hello message that Firefox sends by default (Firefox 95, Windows, non-incognito):

Handshake Protocol: Client Hello
    Handshake Type: Client Hello (1)
    Length: 508
    Version: TLS 1.2 (0x0303)
    ...
    Session ID Length: 32
    Session ID: 22de422dd343bb2bccead1e060098037ae5793bae952b20c…
    ...
    Extensions Length: 401
    Extension: server_name (len=17)
    Extension: extended_master_secret (len=0)
    Extension: renegotiation_info (len=1)
    Extension: supported_groups (len=14)
    Extension: ec_point_formats (len=2)
    Extension: session_ticket (len=0)
    Extension: application_layer_protocol_negotiation (len=14)
    Extension: status_request (len=5)
    Extension: delegated_credentials (len=10)
    Extension: key_share (len=107)
    Extension: supported_versions (len=5)
    Extension: signature_algorithms (len=24)
    Extension: psk_key_exchange_modes (len=2)
    Extension: record_size_limit (len=2)
    Extension: padding (len=138)

Here are some of the notable changes I made to curl so that it sends the exact same message.

ALPN and HTTP2

The presence of the application_layer_protocol_negotiation extension can be seen above. This is known as ALPN. This extension is used by browsers to negotiate whether to use HTTP/1.1 or HTTP/2. By doing it as part of the TLS handshake, the browser saves a few round-trips which would otherwise happen only after the TLS session has been established. The extension’s contents look like the following:

Extension: application_layer_protocol_negotiation (len=14)
    Type: application_layer_protocol_negotiation (16)
    Length: 14
    ALPN Extension Length: 12
    ALPN Protocol
        ALPN string length: 2
        ALPN Next Protocol: h2
        ALPN string length: 8
        ALPN Next Protocol: http/1.1

Here Firefox tells the server that it supports both HTTP/2 (h2) and HTTP/1.1 (http/1.1).
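The lengths in the dump above can be reproduced with a few lines of Python. This is only a sketch of the encoding as shown in the capture (each protocol name prefixed by a one-byte length, the whole list prefixed by a two-byte length); the function names are made up for the example.

```python
import struct

def encode_alpn(protocols):
    """Encode an ALPN protocol list: each name is prefixed by a 1-byte length."""
    return b"".join(struct.pack("!B", len(p)) + p for p in protocols)

def alpn_extension_data(protocols):
    """Extension body: 2-byte list length followed by the encoded list."""
    body = encode_alpn(protocols)
    return struct.pack("!H", len(body)) + body

# Order as sent by Firefox in the capture above.
FIREFOX_ALPN = [b"h2", b"http/1.1"]

if __name__ == "__main__":
    print(len(encode_alpn(FIREFOX_ALPN)))          # 12, the "ALPN Extension Length"
    print(len(alpn_extension_data(FIREFOX_ALPN)))  # 14, the extension "len"
```

Swapping the two entries yields the same lengths but different bytes, so the order of the protocols is visible to the server.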

To reproduce this behavior, I:

Compiled curl with nghttp2, the low-level library that provides the HTTP/2 implementation.
Made a small modification to curl’s code, since it was sending h2 and http/1.1 in reverse order.
Launched curl with the --http2 flag.

A few other extensions

Firefox adds the status_request and delegated_credentials extensions as can be seen above. I don’t know what they do, but curl wasn’t sending them. Here the solution was to look at the Firefox source code. Mozilla provides searchfox, a whole site dedicated to searching the Firefox source code. It’s great! The two important files are nsNSSIOLayer.cpp and nsNSSComponent.cpp. Searching around I found the following two snippets:

// CommonInit() @ nsNSSComponent.cpp
  SSL_OptionSetDefault(
      SSL_ENABLE_DELEGATED_CREDENTIALS,
      Preferences::GetBool("security.tls.enable_delegated_credentials",
                           DELEGATED_CREDENTIALS_ENABLED_DEFAULT));

// nsSSLIOLayerSetOptions() @ nsNSSIOLayer.cpp
  if (SECSuccess != SSL_OptionSet(fd, SSL_ENABLE_OCSP_STAPLING, enabled)) {
    return NS_ERROR_FAILURE;
  }

So Firefox turns on some specific SSL options called SSL_ENABLE_DELEGATED_CREDENTIALS and SSL_ENABLE_OCSP_STAPLING. Without really understanding what their purpose is, I added similar snippets to curl, and now it sends the desired extensions in the Client Hello. I continued this process for 7 or 8 extensions in total. Some were missing, some were configured differently, and it took some tinkering to figure everything out. The full patch can be found in the repo.

Session ID

TLS Session IDs are another optimization mechanism that saves the browser from re-doing a full TLS handshake. Quoting from this book:

… the client can include the session ID in the ClientHello message to indicate to the server that it still remembers the negotiated cipher suite and keys from previous handshake and is able to reuse them. In turn, if the server is able to find the session parameters associated with the advertised ID in its cache, then an abbreviated handshake (Figure 4-3) can take place.

But here is the curious thing: Firefox always includes a session ID, even when connecting to a never-visited-before site. This is how it looks:

    Session ID Length: 32
    Session ID: 22de422dd343bb2bccead1e060098037ae5793bae952b20c…

while curl’s is just empty:

    Session ID Length: 0

This took quite a deep look in the NSS/Firefox source code to figure out. The relevant function is ssl3_CreateClientHelloPreamble which builds the Client Hello message. Under certain circumstances, it adds a fake session ID:

...
else if (ss->opt.enableTls13CompatMode && !IS_DTLS(ss)) {
    /* We're faking session resumption, so rather than create new
     * randomness, just mix up the client random a little. */
    PRUint8 buf[SSL3_SESSIONID_BYTES];
    ssl_MakeFakeSid(ss, buf);
    rv = sslBuffer_AppendVariable(&constructed, buf, SSL3_SESSIONID_BYTES, 1);
}

I don’t really understand why. If anyone does, please let me know¹. To enable similar behavior in curl-impersonate I had to turn on “TLS1.3 compat mode” (which can be seen in the if condition above). Firefox does this as well. This is from the Firefox code:

// nsSSLIOLayerSetOptions() @ nsNSSIOLayer.cpp

  // Set TLS 1.3 compat mode.
  if (SECSuccess != SSL_OptionSet(fd, SSL_ENABLE_TLS13_COMPAT_MODE, PR_TRUE)) {
      ...

Putting a similar call in curl-impersonate makes it send fake session IDs as well.

The result

The resulting curl binary, after all source-code modifications and using the right flags, sends a TLS Client Hello message that looks exactly like the one Firefox sends. Here is a side-by-side comparison:

I can’t tell the difference, and Protectify can’t either. It bypasses the bot protection entirely.

This repository contains a Dockerfile that will build it for you. The resulting image includes:

curl-impersonate, a modified curl binary with all the required TLS tweaks.
curl_ff95, a wrapper bash script that will launch curl-impersonate with the correct parameters to make it look like Firefox 95 on Windows.

Concluding

The modified curl behaves like a real browser, at least from the TLS viewpoint. It bypasses this specific company’s bot protection mechanism.

Honestly, that company did a pretty great job there. If your TLS handshake and HTTP headers don’t exactly match those of a real browser, you get blocked. If you use a real browser, you don’t notice anything. I would use their solution if I needed one.

Remember that this was just one bot protection mechanism. There are others which are more aggressive. I don’t expect the above to work for you if you do massive web scraping. For fetching a single page once a day it works well, at least until they figure it out and update their bot protection to use other tricks.

¹ Update: I now understand that this was implemented as a bridge for adoption of TLS 1.3. More information in this Cloudflare blog post.

Analyzing a stock exchange’s API

2022-02-12T10:30:00+00:00

This was a fun afternoon reverse engineering project so I figured I’d write a bit about it.

I’m developing a web app, Pumbaa Backtester, which is a small tool to simulate the historical performance of index-based investments. As part of the development I wanted to fetch long-term historical data for an ETF traded at a medium-size stock exchange. I won’t write exactly which one, but if you are curious you’ll figure it out.

Each day a closing price for the ETF is determined, which is pretty much like the price of a stock at the end of the trading day. What I needed are closing prices since the ETF was created 22 years ago. Browsing a bit at the stock exchange’s site I got to the following form:

Great! This gives the data I want. The goal is to automate fetching these prices - I want it to be done automatically once a day. So let’s fire up Firefox network monitor (Ctrl+Shift+E) and see what happens when we press “Search”:

Looks simple enough - an API with the parameters isin (unique id of the ETF), minDate and maxDate.

First attempts

If we attempt to access the API with curl:

$ curl -X GET -G   \
    'https://api.stock-exchange.com/v1/data/price_history'  \
        -d "limit=50"                                       \
        -d "offset=0"                                       \
        -d "isin=$ISIN"                                     \
        -d "minDate=2021-02-10"                             \
        -d "maxDate=2022-02-10"                             \
        -d "cleanSplit=false"                               \
        -d "cleanPayout=false"                              \
        -d "cleanSubscriptionRights=false"
{}

we get back an empty JSON response. At this point the most likely possibility is that we are missing one of the HTTP headers; it could be a Cookie header or something else. Looking at the original request’s headers, everything is quite standard except for the trio Client-Date, X-Client-TraceId and X-Security:

Client-Date: 2022-02-12T08:58:52.208Z
X-Client-TraceId: bbbfec1ad15ca1e16cd72fba9e8a7241
X-Security: 185111eb1d17ea0bf0928f2655d05254

These are not documented on MDN so they must be something unique to this API. You may wonder whether we could just send the exact same headers again. That does work for a few minutes, but then it stops working. We’ll have to find out the logic behind them.

Client-Date is simple enough: it’s just the current time. The other two are 16-byte hex-encoded strings, so maybe they are just random UUIDs? Let’s try:

$ curl -X GET -G   \
    'https://api.stock-exchange.com/v1/data/price_history'  \
        -H "Client-Date: 2022-02-12T08:58:52.208Z"          \
        -H "X-Client-TraceId: $(uuidgen -r | tr -d '-')"    \
        -H "X-Security: $(uuidgen -r | tr -d '-')"          \
        -d 'limit=50'                                       \
         ...
{}

Nope, another empty JSON. There must be some logic then that generates these headers in Javascript.

Finding the origin

Searching for the string X-Client-TraceId through the JS scripts that the page uses, we find the culprit:

The script main-es2015.3f13e42ead3dc41c6dc3.js is a one-line, minified script, probably generated by webpack. Why a page with a single form would need 3MB of Javascript is really beyond me. Anyway, after beautifying it we can look at the snippet that generates the three headers:

class o {
    static generateHeaders(t) {
        const e = i().toISOString();
        let n = e + t + r.N.tracing.salt;
        return n = s.V.hashStr(n).toString(), {
            "Client-Date": e,
            "X-Client-TraceId": n,
            "X-Security": s.V.hashStr(i().format("YYYYMMDDHHmm")).toString()
        }
    }
}

At first I tried to approach this like a programmer, understanding where each variable comes from. But in a 90k-line script where everything is called t, i, and r it’s quite impossible. It doesn’t help that the surrounding code looks like some form of alien code:

63205: function(t, e, n) {
    "use strict";
    n.d(e, {
        N: function() {
            return o
        }
    });
    var i = n(16738),
        r = n(92340),
        s = n(9346);
    class o {
        static generateHeaders(t) {
            ...
        }
    }
}

So let’s just use some common sense and go header-by-header:

Client-Date

This is the current time, converted to a string with Javascript’s toISOString() function.

X-Security

Here is the snippet again for convenience:

"X-Security": s.V.hashStr(i().format("YYYYMMDDHHmm")).toString()

We can guess that it’s a hash of the current time, after being converted to the format YYYYMMDDHHmm. Which hash? The result is 16 bytes long so the most probable candidate is md5. Let’s check:

$ echo -n '202202120858' | md5sum
f627c44850a16146d60590eb9584bac3

Doesn’t match… maybe we need to use the local time instead?

$ echo -n '202202121058' | md5sum
185111eb1d17ea0bf0928f2655d05254

It matches! So we got this header as well.
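The trial and error above (UTC first, then local time) can be automated. The following is a small Python sketch, not code from the site: it tries every whole-hour UTC offset and reports which one reproduces an observed X-Security value. The function name and the 14-hour search range are assumptions for illustration.

```python
import hashlib
from datetime import datetime, timedelta

def find_utc_offset(client_date, observed_hash, max_hours=14):
    """Return the whole-hour UTC offset whose YYYYMMDDHHmm md5 matches, or None."""
    utc = datetime.strptime(client_date, "%Y-%m-%dT%H:%M:%S.%fZ")
    for hours in range(-max_hours, max_hours + 1):
        candidate = (utc + timedelta(hours=hours)).strftime("%Y%m%d%H%M")
        if hashlib.md5(candidate.encode("ascii")).hexdigest() == observed_hash:
            return hours
    return None

if __name__ == "__main__":
    # Client-Date and X-Security values from the captured request above.
    observed = "185111eb1d17ea0bf0928f2655d05254"
    print(find_utc_offset("2022-02-12T08:58:52.208Z", observed))
```

A result of None would suggest that the header is not a plain md5 of the minute-resolution timestamp after all.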

X-Client-TraceId

Here’s the relevant part again:

const e = i().toISOString();
let n = e + t + r.N.tracing.salt;
return n = s.V.hashStr(n).toString(), {
    "X-Client-TraceId": n,
    ...
}

Leveraging what we found out already, this header is generated as follows:

The current time, e, is concatenated to two unknown strings, t and salt.
X-Client-TraceId is the md5 hash of the result.

Now the fastest thing to do is to use a Javascript debugger to find out what t and salt are. The Firefox debugger (Ctrl+Shift+Z) lets us beautify the script and put a breakpoint on this line. Hitting “Search” again the breakpoint is triggered, and we can see the variables’ values:

So apparently:

t is the requested URL, including the query string.
salt is a fixed string, in this case w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN. Apparently it appears in the source code as-is so it must be constant.
X-Client-TraceId is the md5 of time + url + salt.

Now we have all the information needed to generate valid requests to the API:

Format the current local time as YYYYMMDDHHmm and take its md5 to generate X-Security.
Construct the URL with the parameters, concatenate the ISO-formatted time, the URL and the salt, and take the md5 of the result to generate X-Client-TraceId.

And it works! Here is a Python snippet to generate the headers for a given URL:

import datetime
import hashlib

def generate_headers(url):
    salt = "w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN"
    current_time = datetime.datetime.now(tz=datetime.timezone.utc)
    client_date = (current_time
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    client_traceid = hashlib.md5(
        (client_date + url + salt).encode("utf-8")
    )
    # X-Security matched the local time, not UTC, so convert before formatting
    security = hashlib.md5(
        current_time.astimezone().strftime("%Y%m%d%H%M").encode("utf-8")
    )

    return {
        "Client-Date": client_date,
        "X-Client-TraceId": client_traceid.hexdigest(),
        "X-Security": security.hexdigest()
    }
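
To compare the generator against a request captured in the browser, it helps to pass the time in explicitly instead of reading the clock. The following is a sketch under assumptions: headers_at and its utc_offset_hours parameter are hypothetical names, the example URL is a placeholder, and the offset stands in for the browser's local timezone, which X-Security appeared to depend on in the check above.

```python
import datetime
import hashlib

SALT = "w4icATTGtnjAZMbkL3kJwxMfEAKDa3MN"  # constant found in the site's script


def headers_at(url, utc_time, utc_offset_hours=0):
    """Build the three headers for a fixed, timezone-aware UTC time.

    utc_offset_hours is a hypothetical parameter that shifts the time
    used for X-Security to emulate the browser's local clock.
    """
    # Same shape as JavaScript's toISOString(): millisecond precision, "Z" suffix
    client_date = (utc_time
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    local_time = utc_time + datetime.timedelta(hours=utc_offset_hours)
    trace_input = client_date + url + SALT
    return {
        "Client-Date": client_date,
        "X-Client-TraceId": hashlib.md5(trace_input.encode("utf-8")).hexdigest(),
        "X-Security": hashlib.md5(
            local_time.strftime("%Y%m%d%H%M").encode("utf-8")
        ).hexdigest(),
    }


# Placeholder URL; substitute the one from the captured request.
when = datetime.datetime(2022, 2, 12, 8, 58, tzinfo=datetime.timezone.utc)
headers = headers_at("https://example.com/api?q=1", when, utc_offset_hours=2)
print(headers["Client-Date"])  # 2022-02-12T08:58:00.000Z
```

With the time fixed, the output is reproducible, so each header can be compared one by one with the values seen in the network tab.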

Concluding

What was the purpose of these headers? I’m really not sure. It could be protection against bots or maybe a user-tracking mechanism. Anyway, it didn’t take much work to understand it. I guess if you are exposing your API on the internet, expect someone to figure it out and use it.
