browser-privacy

Overview

The HAR Analysis Tool automates the collection, processing, and analysis of HTTP Archive (HAR) files. The HAR files are generated by running browser automation tests with the Browsertime tool inside Docker, and are then processed to extract domains. The primary goal is to analyze this domain data and identify patterns, such as repetitive and exclusive domains, across multiple iterations. The tool is configured through a JSON file and can also prompt the user for manual input.

Prerequisites

Python 3.x
Python3-venv
Docker

Installation

Clone or download this repository.
Ensure Docker is installed and running on your system.
Set up a Python venv: python3 -m venv venv
Activate the venv: source venv/bin/activate
Install the Python dependencies: pip install -r collection_scripts/requirements.txt
Pull the Docker container: docker pull rutvora/browsertime (this is a modified Browsertime container that adds the Brave browser; the browsers already in the container are unchanged)

Configuration Management

The tool requires four files for configuration:

orchestration_config.py (name of the file can't be changed without code changes)
config.json (name can be changed to anything, as long as the format is JSON)
orchestration_scripts/ec2_params.json (name of the file can't be changed without code changes)
orchestration_scripts/.env (name of the file can't be changed without code changes)

orchestration_config.py

This file consists of the following configurations, as python variables:

config_file: The name/path of the config.json file
res_dir: The path to the folder where the results will be stored (the results are stored in a dated folder inside this directory)
override_urls: Whether to download the latest set of tranco and cloudflare URLs or use the one provided in the config.json file
locations: An array of locations where the tool has access to servers (or should spawn AWS instances). Each location must exist either as a key in orchestration_scripts/ec2_params.json or as a key in the non_aws_instances variable.
non_aws_instances: A dictionary keyed by location name, with one entry per non-AWS server. Each entry has the following structure ("local" is the name of the location):

non_aws_instances = {
    "local": {
        "instance_id": None,
        "ip_addr": "localhost",
        "user": "USERNAME",
        "remote_path": "~"
    }
}

Note: We expect all access to be via SSH keys, not passwords. It is up to you to configure the AWS regions or custom instances/servers with the requisite public keys beforehand.
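
Putting the variables above together, a minimal orchestration_config.py might look like the sketch below. All values are placeholders, the meaning of True/False for override_urls is an assumption, and the undefined_locations helper is hypothetical (it is not part of the tool). The helper only illustrates the rule that every entry in locations must be defined either in ec2_params.json or in non_aws_instances.

```python
# Sketch of orchestration_config.py; all values are illustrative placeholders.
config_file = "config.json"   # name/path of the JSON test configuration
res_dir = "results"           # dated result folders are created inside this directory
override_urls = False         # assumption: True downloads fresh URL lists, False uses config.json
locations = ["local", "USA-California"]

non_aws_instances = {
    "local": {
        "instance_id": None,
        "ip_addr": "localhost",
        "user": "USERNAME",
        "remote_path": "~",
    }
}


def undefined_locations(locations, ec2_params, non_aws_instances):
    """Hypothetical helper: return the locations that are defined in neither source."""
    known = set(ec2_params) | set(non_aws_instances)
    return [loc for loc in locations if loc not in known]


# ec2_params would normally be loaded from orchestration_scripts/ec2_params.json.
ec2_params = {"USA-California": {"region": "us-west-1"}}
print(undefined_locations(locations, ec2_params, non_aws_instances))
```

An empty list means every configured location can be resolved; any name that is printed needs an entry in one of the two places.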

config.json:

{
  "urls": [
      "https://wikipedia.org",
      "https://youtube.com",
      "https://github.com/",
      "https://sitespeed.io"
  ],
  "browsers": [
      "chrome",
      "firefox",
      "edge",
      "brave"
  ],
  "iterations": 5,
  "pretty_print": true,
  "video": false,
  "maxLoadTime": 60000,
  "cpus_per_browser": 4
}

urls
Description: An array of URLs that will be tested.
Usage: "urls": ["https://wikipedia.org", "https://youtube.com"]
Note: each URL should be in the format "https://example.com".
browsers
Description: An array of browsers to use for testing.
Usage: "browsers": ["chrome", "brave"]
Note: supported values are chrome, firefox, edge, and brave.
iterations
Description: Number of iterations to run each test. If not specified, a default value (like 5) is assumed.
Usage: "iterations": 5
Note: should be an integer value.
pretty_print
Description: Boolean value (true or false) to enable pretty printing of results.
Usage: "pretty_print": true
video
Description: Boolean value (true or false) to enable or disable video recording.
Usage: "video": false
maxLoadTime
Description: Maximum page load time (in milliseconds) before a test times out.
Usage: "maxLoadTime": 60000
cpus_per_browser
Description: Number of CPUs allocated per browser instance.
Usage: "cpus_per_browser": 4
Note: should be an integer value.
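
The field rules above can be checked before starting a long collection run. The following is a minimal sketch of such a check, not the tool's own validation: the validate_config function and its messages are hypothetical, and the URL test is simplified to a prefix check on "https://".

```python
import json

SUPPORTED_BROWSERS = {"chrome", "firefox", "edge", "brave"}


def validate_config(cfg):
    """Return a list of problems found in cfg; an empty list means none were found."""
    problems = []
    urls = cfg.get("urls", [])
    if not urls:
        problems.append("urls: expected a non-empty array")
    for url in urls:
        if not str(url).startswith("https://"):
            problems.append(f"urls: {url!r} should look like https://example.com")
    for browser in cfg.get("browsers", []):
        if browser not in SUPPORTED_BROWSERS:
            problems.append(f"browsers: unsupported value {browser!r}")
    for key in ("iterations", "cpus_per_browser"):
        value = cfg.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
            problems.append(f"{key}: expected an integer")
    for key in ("pretty_print", "video"):
        if key in cfg and not isinstance(cfg[key], bool):
            problems.append(f"{key}: expected true or false")
    return problems


example = json.loads("""
{
  "urls": ["https://wikipedia.org", "youtube.com"],
  "browsers": ["chrome", "safari"],
  "iterations": 5,
  "video": "no"
}
""")
for problem in validate_config(example):
    print(problem)
```

The example deliberately contains three mistakes (a URL without a scheme, an unsupported browser, and a non-boolean video value), so three lines are printed.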

orchestration_scripts/ec2_params.json

Here's an example JSON:

{
    "USA-California": {
        "region": "us-west-1",
        "image_id": "AWS_IMAGE_ID",
        "security_group_ids": ["AWS_SECURITY_GROUP_ID"],
        "subnet": "AWS_SUBNET"
    }
}

The key is the location name (can be anything, this is used only locally). The values are as follows:

region: The AWS Region name.
image_id: The ID of the OS image that you want to run on the AWS instance. You can find it using these steps
security_group_ids: An array representing the security groups to be applied to this AWS instance. Create a group that allows at least SSH (port 22) inbound from the IP address of the server you will be executing this tool from. If you don't have a fixed public IP, you may open port 22 to 0.0.0.0/0 (everyone) at your own risk. Read more on Security Groups
subnet: The subnet this instance should belong to (the subnet can/should be tied to the security group(s)). Read more

orchestration_scripts/.env

This environment file contains variables used to spawn the AWS instances in the various locations.
Copy the orchestration_scripts/env file to orchestration_scripts/.env and fill in the variables.

AWS IAM Access ID and Key

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

AWS Instance Setup

INSTANCE_TYPE="m6a.32xlarge" # Can change the type (but data collection speed will change)
KEY_PAIR_NAME="YOUR_KEY_PAIR_NAME"
DISK_SIZE=320 # 320 GB (required for 1000 websites, and 20 total iterations per site)
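
To show how the ec2_params.json entries and the .env values relate to each other, the sketch below combines them into keyword arguments shaped like those accepted by the boto3 EC2 client's run_instances call. This is an illustration only and is not taken from the tool's source: build_run_instances_kwargs is a hypothetical helper, the root device name "/dev/sda1" is an assumption that depends on the chosen image, and launching a single instance per location is also assumed.

```python
def build_run_instances_kwargs(location_params, env):
    """Combine one ec2_params.json entry with .env values (hypothetical helper)."""
    return {
        "ImageId": location_params["image_id"],
        "InstanceType": env["INSTANCE_TYPE"],
        "KeyName": env["KEY_PAIR_NAME"],
        "SecurityGroupIds": list(location_params["security_group_ids"]),
        "SubnetId": location_params["subnet"],
        "MinCount": 1,  # assumption: one instance per location
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                # Assumption: the root device name varies with the image.
                "DeviceName": "/dev/sda1",
                "Ebs": {"VolumeSize": int(env["DISK_SIZE"])},  # size in GiB
            }
        ],
    }


location_params = {
    "region": "us-west-1",
    "image_id": "AWS_IMAGE_ID",
    "security_group_ids": ["AWS_SECURITY_GROUP_ID"],
    "subnet": "AWS_SUBNET",
}
env = {
    "INSTANCE_TYPE": "m6a.32xlarge",
    "KEY_PAIR_NAME": "YOUR_KEY_PAIR_NAME",
    "DISK_SIZE": "320",
}
kwargs = build_run_instances_kwargs(location_params, env)
print(kwargs["InstanceType"], kwargs["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
```

The region value is not part of these arguments; with boto3 it would typically be supplied when the client itself is created.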

Usage

Create a config.json file and edit orchestration_config.py to match your setup.
Ensure Docker is running before executing the script.
Run the script:

python3 main.py

Output

After running the tests with the HAR Analysis Tool, results are organized in a timestamped folder structure.

Results Folder Structure

The results folder (e.g., results/2025-10-03T183312) contains data collected from multiple vantage points (regions), with separate collections for no-script and with-script modes.

Directory Structure:

results/2025-10-03T183312/
├── Canada/                          # Vantage point: Canada
├── France/                          # Vantage point: France
├── Germany/                         # Vantage point: Germany
├── India/                           # Vantage point: India
├── Ireland/                         # Vantage point: Ireland
├── Singapore/                       # Vantage point: Singapore
├── USA-California/                  # Vantage point: California, USA
├── USA-Ohio/                        # Vantage point: Ohio, USA
│   ├── no-script/
│   └── with-script/
└── logs/                            # Collection logs

Per-Region Structure (e.g., Canada/):

Canada/
├── data_collection.log              # Collection log for this region
├── no-script/                       # Data collected without script execution
│   ├── HARs/                        # Raw HAR files organized by iteration
│   │   ├── 0/                       # Iteration 0
│   │   │   ├── brave/               # Browser: Brave
│   │   │   │   ├── 163.com/        # Website domain
│   │   │   │   │   ├── browsertime_brave.har     # HAR file
│   │   │   │   │   ├── loadedPage.jpg            # Screenshot
│   │   │   │   │   └── true.json                 # Metadata
│   │   │   │   ├── 166.com/
│   │   │   │   └── ...
│   │   │   ├── chrome/              # Browser: Chrome
│   │   │   ├── edge/                # Browser: Edge
│   │   │   └── firefox/             # Browser: Firefox
│   │   ├── 1/                       # Iteration 1
│   │   ├── 2/                       # Iteration 2
│   │   └── ...                      # Up to iteration 9 (10 total)
│   └── info/                        # Processed browser data
│       ├── brave_data_0.0.json      # Brave data (threshold 0.0)
│       ├── brave_data_0.5.json      # Brave data (threshold 0.5)
│       ├── brave_data_0.8.json      # Brave data (threshold 0.8)
│       ├── brave_data_1.0.json      # Brave data (threshold 1.0)
│       ├── chrome_data_0.0.json     # Chrome data (threshold 0.0)
│       ├── chrome_data_0.5.json
│       ├── chrome_data_0.8.json
│       ├── chrome_data_1.0.json
│       ├── edge_data_0.0.json       # Edge data
│       ├── edge_data_0.5.json
│       ├── edge_data_0.8.json
│       ├── edge_data_1.0.json
│       ├── firefox_data_0.0.json    # Firefox data
│       ├── firefox_data_0.5.json
│       ├── firefox_data_0.8.json
│       ├── firefox_data_1.0.json
│       ├── hars_dict.pkl            # Pickled HAR dictionary
│       ├── repetitive_netlocs_0.0.pkl  # Repetitive domains (threshold 0.0)
│       ├── repetitive_netlocs_0.5.pkl  # Repetitive domains (threshold 0.5)
│       ├── repetitive_netlocs_0.8.pkl  # Repetitive domains (threshold 0.8)
│       └── repetitive_netlocs_1.0.pkl  # Repetitive domains (threshold 1.0)
└── with-script/                     # Data collected with script execution
    ├── HARs/                        # Same structure as no-script
    └── info/                        # Same structure as no-script
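
The layout above can be walked with a few lines of Python. The sketch below is not part of the tool; iter_har_files and extract_netlocs are hypothetical helpers. It relies only on the directory layout shown above and on the standard HAR structure, in which each request URL is stored under log.entries[].request.url.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def iter_har_files(region_dir, script_mode="no-script"):
    """Yield (iteration, browser, website, path) for every HAR file under one region."""
    hars_root = Path(region_dir) / script_mode / "HARs"
    # Layout: HARs/<iteration>/<browser>/<website>/browsertime_<browser>.har
    for har_path in sorted(hars_root.glob("*/*/*/browsertime_*.har")):
        website_dir = har_path.parent
        browser_dir = website_dir.parent
        iteration_dir = browser_dir.parent
        yield iteration_dir.name, browser_dir.name, website_dir.name, har_path


def extract_netlocs(har):
    """Return the set of network locations (host[:port]) requested in a parsed HAR."""
    entries = har.get("log", {}).get("entries", [])
    netlocs = {urlparse(entry["request"]["url"]).netloc for entry in entries}
    netlocs.discard("")  # e.g. data: URLs have no network location
    return netlocs


if __name__ == "__main__":
    region = "results/2025-10-03T183312/Canada"
    for iteration, browser, website, path in iter_har_files(region):
        with open(path, encoding="utf-8") as handle:
            domains = extract_netlocs(json.load(handle))
        print(iteration, browser, website, len(domains))
```

Passing script_mode="with-script" walks the second collection instead, since both modes share the same structure.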

Key Components:

HARs Directory: Contains raw HTTP Archive (HAR) files organized by:
- Iteration (0-9): 10 iterations per website for statistical significance
- Browser (brave, chrome, edge, firefox): Separate data for each browser
- Website Domain: One directory per website tested
- Files per website:
  - browsertime_{browser}.har: HTTP Archive with network traffic
  - loadedPage.jpg: Screenshot of loaded page
  - true.json: Metadata about the page load
Info Directory: Contains processed browser data files:
- {browser}_data_{threshold}.json (for example, chrome_data_0.5.json): Processed data for each browser at different privacy filter thresholds
- Thresholds (0.0, 0.5, 0.8, 1.0) represent the minimum fraction of iterations a domain must appear in to be included
- repetitive_netlocs_{threshold}.pkl: Pickled data of domains that appear repeatedly across iterations
- hars_dict.pkl: Consolidated HAR data in pickled format
Script Modes:
- no-script: Data collected without injecting the cookie consent acceptance script (baseline tracking)
- with-script: Data collected with the cookie consent acceptance script injected (accepts all cookies via accept_cookies.js)

Browser Data JSON Structure (browser_data_0.0.json):

Each browser data file contains:

Website URLs as keys
Per-website metrics:
- ad_and_tracking: List of ad/tracking domains
- first_party_domains: First-party domains
- third_party_domains: Third-party domains
- triggered_domain_server_locations: Server IP locations for domains (by category)
- all_request_locations: Server IP locations for all HTTP requests (Chrome only, includes volume)
- referer_graph: Referer-referee relationships across iterations
- cookie_banner_triggered_count: Number of times cookie banner was detected
- Various statistics and metadata
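
As a quick way to inspect one of these files, the sketch below counts the ad/tracking domains recorded for each website. It assumes only what is listed above: website URLs as top-level keys and an ad_and_tracking list per website. The function names and the sample data are hypothetical.

```python
import json


def load_browser_data(path):
    """Read one browser data JSON file into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_ad_tracking_domains(browser_data):
    """Map each website key to the number of entries in its ad_and_tracking list."""
    return {
        website: len(metrics.get("ad_and_tracking", []))
        for website, metrics in browser_data.items()
        if isinstance(metrics, dict)
    }


# Inline sample standing in for load_browser_data(".../info/chrome_data_0.0.json").
sample = {
    "https://news.example": {"ad_and_tracking": ["ads.example", "tracker.example"]},
    "https://docs.example": {"ad_and_tracking": []},
}
counts = count_ad_tracking_domains(sample)
for website in sorted(counts, key=counts.get, reverse=True):
    print(website, counts[website])
```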

Usage in Analysis:

The DuckDB analysis suite (analysis_scripts/duckdb_analysis) consumes this results folder structure to generate comprehensive privacy analysis reports. From that directory, point src/main.py to the timestamped results folder:

python3 src/main.py /path/to/results/2025-10-03T183312 --script-mode no-script

Analysis

Note: Analysis will try to auto-execute after data is collected from all AWS instances

All the scripts required to analyze the HAR file data are in analysis_scripts/ and can be triggered using analyse.py.

 python3 analyse.py <PATH-TO-RESULTS> <PARAMS>

The PATH-TO-RESULTS should be a dir which contains dirs with location names (e.g. results/2025-10-03T200100)

The following optional params are supported:

-c --check-failed: Use Ollama (or another model you set up) to check for failed page loads using the images
-p --preprocess: Preprocess the HARs to extract the relevant information out into the browser_data_<NUM>.json
-b --compare-browsers: Run comparison across browsers, for each location
-l --compare-locations: Run comparison across locations
-a --all: Run all of the above (default, if no optional options are specified)
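
The "run everything unless a step is selected" behaviour of these flags can be expressed with argparse roughly as shown below. This is a sketch of the documented semantics, not the actual contents of analyse.py; parse_args and the STEPS tuple are hypothetical.

```python
import argparse

STEPS = ("check_failed", "preprocess", "compare_browsers", "compare_locations")


def parse_args(argv=None):
    """Parse the documented flags and apply the 'all by default' rule."""
    parser = argparse.ArgumentParser(description="Sketch of the analyse.py options")
    parser.add_argument("results", help="directory containing one folder per location")
    parser.add_argument("-c", "--check-failed", action="store_true")
    parser.add_argument("-p", "--preprocess", action="store_true")
    parser.add_argument("-b", "--compare-browsers", action="store_true")
    parser.add_argument("-l", "--compare-locations", action="store_true")
    parser.add_argument("-a", "--all", action="store_true")
    args = parser.parse_args(argv)
    # With --all, or with no step selected, enable every step.
    if args.all or not any(getattr(args, step) for step in STEPS):
        for step in STEPS:
            setattr(args, step, True)
    return args


args = parse_args(["results/2025-10-03T200100", "-p"])
print({step: getattr(args, step) for step in STEPS})
```

In this example only the preprocess step is enabled; dropping -p would enable all four.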

Server IP analysis

Advanced DuckDB Analysis Suite

The DuckDB analysis suite (/analysis_scripts/duckdb_analysis) provides comprehensive privacy analysis tools using DuckDB and DuckPGQ for graph analysis.

Installation

First, navigate to the duckdb_analysis directory:

cd analysis_scripts/duckdb_analysis

Install dependencies:

pip install -r requirements.txt

Note: All commands in this section assume you are in the analysis_scripts/duckdb_analysis directory.

Quick Start - Integrated Analysis (main.py)

The main.py script runs the complete analysis pipeline with comprehensive command-line options.

Basic Usage:

# From analysis_scripts/duckdb_analysis directory
python3 src/main.py /path/to/results/2025-10-03T183312

This generates:

Regional privacy visualizations
Server location analysis and IP destination tables
Cookie banner impact analysis
Ad traffic concentration analysis
Referer-referee tracking chain analysis

Complete Command-line Options:

python3 src/main.py RESULTS_FOLDER [OPTIONS]

Required Arguments:

RESULTS_FOLDER - Path to results folder containing regional browser data (e.g., results/2025-10-03T183312)

Optional Arguments:

| Flag | Description | Default |
| --- | --- | --- |
| `-o, --output DIR` | Output directory for visualizations and reports | `./visualization_demo_output` |
| `-t, --threshold FLOAT` | Privacy filter threshold for referer analysis | `0.0` |
| `-r, --regions REGION [REGION ...]` | Specific regions to analyze | All available regions |
| `-s, --script-mode {no-script,with-script}` | Script mode to analyze (only for new format results) | `no-script` |

Analysis Control Flags (Skip Components):

| Flag | Description |
| --- | --- |
| `--skip-regional-comparison` | Skip regional comparison visualizations |
| `--skip-server-ip` | Skip server IP destination analysis |
| `--skip-cookie-banners` | Skip cookie banner impact analysis |
| `--skip-concentration` | Skip ad traffic concentration analysis |
| `--skip-apex-charts` | Skip apex domain charts for top websites |
| `--skip-dashboards` | Skip comprehensive dashboard showcase |
| `--skip-tracking-chains` | Skip referer-referee tracking chain analysis |

Usage Examples:

# Basic analysis with default settings
python3 src/main.py /path/to/results/2025-10-03T183312

# Analyze with custom output directory
python3 src/main.py /path/to/results --output ./custom_output

# Analyze specific regions only
python3 src/main.py /path/to/results --regions USA-California Canada France

# Analyze with-script mode data
python3 src/main.py /path/to/results --script-mode with-script

# Set privacy filter threshold for referer analysis
python3 src/main.py /path/to/results --threshold 0.8

# Skip cookie banner and concentration analysis for faster execution
python3 src/main.py /path/to/results --skip-cookie-banners --skip-concentration

# Complete example with multiple options
python3 src/main.py /path/to/results/2025-10-03T183312 \
  --output ./privacy_analysis \
  --threshold 0.0 \
  --script-mode no-script \
  --regions USA-Ohio France Germany \
  --skip-dashboards

Get Help:

python3 src/main.py --help

Individual Analysis Scripts

1. Combine Website Location Data

Combine server location data from all regions into a single JSON file:

python3 src/combine_website_locations.py /path/to/results/2025-10-03T183312 \
  -o combined_website_locations.json \
  --script-mode no-script

2. Sankey Flow Diagrams

Generate Sankey diagrams showing first-party to ad/tracking data flows:

# Using browser JSON for first-party locations (default)
python3 src/create_sankey_from_combined.py combined_website_locations.json -o sankey_plots

# Using website_locs.json for first-party locations
python3 src/create_sankey_from_combined.py combined_website_locations.json \
  -o sankey_plots \
  --use-website-locs

# Volume-weighted flows (weight by ad/tracking domain count)
python3 src/create_sankey_from_combined.py combined_website_locations.json \
  -o sankey_plots_volume \
  --use-website-locs \
  --weight-by-volume

3. Server IP Destination Analysis

Analyze where ad/tracking requests are sent from each vantage point:

python3 src/analysis/server_ip_analysis.py

Generates tables showing:

Number of unique destination countries
Percentage of requests to same region (based on HTTP request volume)
EU adequacy compliance analysis

4. Category Prevalence Analysis

Analyze which website categories contribute most to ad/tracking activity:

python3 src/analyze_category_prevalence.py combined_website_locations.json USA-Ohio

Output includes:

Top categories by ad/tracking contribution
Percentage of total ad/tracking by category
Number of websites per category
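
The three output columns listed above amount to a grouped sum. The sketch below shows that arithmetic on made-up data; it is not the logic of analyze_category_prevalence.py, and the input shapes (a website-to-category mapping and a website-to-count mapping), the "uncategorised" fallback, and the function name are all assumptions.

```python
from collections import defaultdict


def category_prevalence(site_categories, site_tracker_counts):
    """Return (category, ad/tracking total, percent of total, website count) rows,
    sorted by ad/tracking total in descending order."""
    totals = defaultdict(int)
    sites = defaultdict(int)
    for website, count in site_tracker_counts.items():
        category = site_categories.get(website, "uncategorised")
        totals[category] += count
        sites[category] += 1
    grand_total = sum(totals.values())
    rows = []
    for category in sorted(totals, key=totals.get, reverse=True):
        share = 100.0 * totals[category] / grand_total if grand_total else 0.0
        rows.append((category, totals[category], round(share, 1), sites[category]))
    return rows


site_categories = {
    "news.example": "News",
    "blog.example": "News",
    "shop.example": "Shopping",
}
site_tracker_counts = {"news.example": 30, "blog.example": 10, "shop.example": 10}
for row in category_prevalence(site_categories, site_tracker_counts):
    print(row)
```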

5. Referer Graph Analysis

Analyze referer-referee relationships using DuckPGQ for graph queries:

# Per-website subgraph analysis (recommended)
python3 src/referer_analysis/batch_subgraph_analysis.py \
  /path/to/results/2025-10-03T183312 \
  --threshold 0.0 \
  --script-mode no-script \
  --k 300

# Combined graph analysis (all websites in one graph)
python3 src/referer_analysis/batch_referer_analysis.py \
  /path/to/results/2025-10-03T183312 \
  --threshold 0.0 \
  --script-mode no-script

Parameters:

--threshold: Minimum fraction of iterations (0.0, 0.5, 0.8, etc.) in which a domain must appear to be included. For example, 0.8 means the domain must appear in at least 80% of iterations. The results in the paper use 0.0.
--script-mode: Use no-script or with-script data
--k: Number of top nodes to analyze (for subgraph analysis)
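
To make the idea of "top nodes" in a referer graph concrete, the sketch below ranks nodes by out-degree, that is, by the number of distinct referees each referer leads to. The tool itself uses DuckPGQ for its graph queries; this plain-Python version is a stand-in for illustration only. Ranking by out-degree is an assumption about what "top" means, and top_hubs_by_outdegree and the edge list are hypothetical.

```python
from collections import defaultdict


def top_hubs_by_outdegree(edges, k):
    """Return the k nodes with the most distinct referees, as (node, out_degree) pairs."""
    referees = defaultdict(set)
    for referer, referee in edges:
        referees[referer].add(referee)
    # Sort by descending out-degree, then by name so ties are ordered deterministically.
    ranked = sorted(referees.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(node, len(targets)) for node, targets in ranked[:k]]


edges = [
    ("site.example", "cdn.example"),
    ("site.example", "ads.example"),
    ("site.example", "ads.example"),   # repeated edge, counted once
    ("ads.example", "sync.example"),
]
print(top_hubs_by_outdegree(edges, 2))
```

Storing referees in a set means repeated requests along the same edge do not inflate a node's out-degree.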

Output Structure

Example output structure from analysis_scripts/duckdb_analysis/with-script-nov2:

output_directory/
├── # Top-level Analysis Files
├── comprehensive_statistics.json              # Overall statistics across all regions
├── server_destination_analysis.csv            # Server IP destination summary
├── eu_adequacy_compliance_analysis.csv        # EU GDPR adequacy compliance
├── ad_traffic_concentration.csv               # Ad traffic concentration metrics
├── cookie_banner_impact_analysis.csv          # Cookie banner impact data
├── cookie_banner_summary_by_region_browser.csv
├── top_10pct_website_similarity.csv           # Similarity metrics
├── top_20pct_website_similarity.csv
├── top_50pct_website_similarity.csv
├── traffic_from_top_websites.csv
├── traffic_concentration_distribution_chrome.pdf
├── traffic_per_percentile_chrome.pdf
│
├── regional_plots/                            # Regional privacy visualizations
│   ├── brave/
│   ├── chrome/
│   ├── edge/
│   └── firefox/
│       ├── apex_domain_histogram_<region>_threshold_<t>.pdf
│       ├── category_distribution_stacked_<region>_threshold_<t>.pdf
│       ├── category_heatmap_threshold_<t>.pdf
│       ├── cdf_comparison_threshold_<t>.pdf
│       ├── regional_privacy_rankings_threshold_<t>.pdf
│       └── ...
│
├── server_plots/                              # Server location analysis by region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   └── USA-Ohio/
│       ├── data_flow_sankey_brave_threshold_<t>.pdf
│       ├── data_flow_sankey_chrome_threshold_<t>.pdf
│       ├── data_flow_sankey_edge_threshold_<t>.pdf
│       └── data_flow_sankey_firefox_threshold_<t>.pdf
│
├── referer_subgraph_analysis/                 # Referer graph analysis per region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   ├── USA-Ohio/
│   │   ├── overall_statistics.json
│   │   ├── subgraph_statistics.csv
│   │   ├── node_url_multiplicity.csv
│   │   ├── cross_subgraph_patterns.csv
│   │   ├── max_outdegree_nodes_per_subgraph.csv
│   │   ├── distance_from_origin_analysis.csv
│   │   ├── tracking_chain_length_summary.csv
│   │   ├── all_hubs_by_tracking_outdegree.csv
│   │   ├── top_50_hubs_by_tracking_outdegree.csv
│   │   ├── top_100_hubs_by_tracking_outdegree.csv
│   │   ├── top_subgraphs_by_edge_count.csv
│   │   ├── top_subgraphs_by_node_count.csv
│   │   ├── top_subgraphs_by_tracker_count.csv
│   │   ├── outdegree_distribution.pdf
│   │   ├── outdegree_by_node_type.pdf
│   │   ├── hub_node_type_distribution.pdf
│   │   ├── hub_tracking_distribution.pdf
│   │   ├── all_hubs_analysis/
│   │   ├── top_100_hubs_analysis/
│   │   └── exported_subgraphs/           # For Gephi/Cytoscape import
│   └── cross_region_hub_analysis/        # Cross-region hub patterns
│
├── referer_subgraph_analysis_chrome/         # Chrome-specific referer analysis
│   ├── [Same structure as above for Chrome only]
│   └── cross_region_hub_analysis/
│
├── referer_subgraph_analysis_brave/          # Brave-specific referer analysis
│   ├── [Same structure as above for Brave only]
│   └── cross_region_hub_analysis/
│
├── hub_analysis/                              # Hub node analysis by region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   └── USA-Ohio/
│       ├── hub_node_type_distribution.pdf
│       ├── hub_tracking_distribution.pdf
│       ├── outdegree_by_node_type.pdf
│       └── outdegree_distribution.pdf
│
├── comprehensive_analysis/                    # Comprehensive statistical reports
│   └── visualizations/
│       ├── browser_comparison_charts.pdf
│       ├── regional_heatmaps.pdf
│       └── ...
│
├── cookie_banner_plots/                       # Cookie banner analysis
├── cookie_banner_common_websites_plots/
├── apex_domains_top_websites/                 # Top website apex domain analysis
├── chain_analysis_all/                        # Tracking chain analysis (all browsers)
├── chain_analysis_all_chrome/                 # Chrome-specific chain analysis
└── chain_analysis_all_brave/                  # Brave-specific chain analysis

Module Organization

analysis_scripts/duckdb_analysis/src/
├── main.py                          # Main integrated analysis entry point
├── combine_website_locations.py     # Combine location data across regions
├── create_sankey_from_combined.py   # Generate Sankey flow diagrams
├── analyze_category_prevalence.py   # Category analysis
├── core/                            # Core data processing
│   ├── browser_data_processor.py
│   ├── regional_comparator.py
│   └── enhanced_compare_regions.py
├── analysis/                        # Specialized analysis modules
│   ├── server_ip_analysis.py        # Server IP destination analysis
│   ├── cdf_analyzer.py
│   ├── data_transfer_analyzer.py
│   └── statistical_analyzer.py
├── visualization/                   # Plotting and charts
│   ├── plotting_utilities.py
│   └── region_visualizations.py
└── referer_analysis/                # Referer-referee graph analysis
    ├── referer_graph_analysis.py
    ├── referer_subgraph_analysis.py
    ├── batch_referer_analysis.py
    └── batch_subgraph_analysis.py

Data Availability

To support reproducibility and future research, we publicly release the full dataset and code associated with this project. The dataset is available via Globus under the folder RegTrack-MADWeb26:

Globus Endpoint: RegTrack-MADWeb26

The dataset includes:

Raw HAR files from all crawls (~1.5 TB compressed)
Processed browser_data_*.json files per region, browser, and threshold
Screenshots and metadata from each visit
Aggregated statistics and measurement results

The dataset is licensed under CC BY 4.0. All code in this repository is licensed under the MIT License.

Citation

If you use this tool, dataset, or analysis in your research, please cite the following paper:

@inproceedings{prasad2026regtrack,
  title     = {{RegTrack}: Uncovering Global Disparities in Third-party Advertising and Tracking},
  author    = {Prasad, Tanya and Vora, Rut and Lim, Soo Yee and Hoang, Nguyen Phong and Pasquier, Thomas},
  booktitle = {Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb)},
  year      = {2026},
  address   = {San Diego, CA, USA},
  publisher = {Internet Society},
  doi       = {10.14722/madweb.2026.23010},
  isbn      = {978-1-970672-06-0}
}
