ubc-spg/RegTrack

browser-privacy

Overview

The HAR Analysis Tool automates the collection, processing, and analysis of HTTP Archive (HAR) files. The HAR files are generated by running browser automation tests with the Browsertime tool inside Docker, and are then processed to extract domains. The primary goal is to analyze this domain data and identify patterns such as repetitive and exclusive domains across multiple iterations. The tool is configured via a JSON file and can also prompt the user for manual input.


Prerequisites

  • Python 3.x
  • Python3-venv
  • Docker

Installation

  1. Clone or download this repository.
  2. Ensure Docker is installed and running on your system.
  3. Set up a Python venv: python3 -m venv venv
  4. Activate the venv: source venv/bin/activate
  5. Install the Python dependencies: pip install -r collection_scripts/requirements.txt
  6. Pull the Docker image: docker pull rutvora/browsertime (this is a modified Browsertime container that adds the Brave browser; the existing browsers in the container are unchanged)

Configuration Management

The tool requires four files for configuration:

  1. orchestration_config.py (name of the file can't be changed without code changes)
  2. config.json (name can be changed to anything, as long as the format is JSON)
  3. orchestration_script/ec2_params.json (name of the file can't be changed without code changes)
  4. orchestration_scripts/.env (name of the file can't be changed without code changes)

orchestration_config.py

This file consists of the following configurations, as python variables:

  • config_file: The name/path of the config.json file
  • res_dir: The path to the folder where the results will be stored (the results are stored in a dated folder inside this directory)
  • override_urls: Whether to download the latest set of tranco and cloudflare URLs or use the one provided in the config.json file
  • locations: An array of locations where the tool has access to servers (or should spawn AWS instances). Each location must exist either in orchestration_scripts/ec2_params.json or as a key in the non_aws_instances variable.
  • non_aws_instances: A dictionary keyed by location name, with the following structure ("local" is the name of the location):
non_aws_instances = {
    "local": {
        "instance_id": None,
        "ip_addr": "localhost",
        "user": "USERNAME",
        "remote_path": "~"
    }
}

Note: We expect all access to be via SSH keys, not passwords. It is up to you to configure AWS regions or custom instances/servers with the requisite public keys beforehand.
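
Putting the variables above together, a minimal orchestration_config.py might look like the sketch below. All values are placeholders: the results path, the location names, and the meaning assigned to True/False for override_urls are assumptions for illustration, and the actual file in the repository may contain additional settings.

```python
# orchestration_config.py (illustrative sketch; every value is a placeholder)

# Name/path of the JSON file with URLs, browsers, and iteration settings
config_file = "config.json"

# Results are written to a dated folder created inside this directory
res_dir = "results"

# Assumption: True downloads the latest Tranco and Cloudflare URL lists,
# False uses the "urls" array from config.json
override_urls = False

# Each name must be a key in ec2_params.json or in non_aws_instances
locations = ["local", "USA-California"]

# Servers that are reached directly over SSH instead of being spawned on AWS
non_aws_instances = {
    "local": {
        "instance_id": None,
        "ip_addr": "localhost",
        "user": "USERNAME",
        "remote_path": "~",
    }
}
```

In this sketch, "local" is resolved through non_aws_instances, while "USA-California" is assumed to have a matching entry in ec2_params.json.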

config.json:

{
  "urls": [
      "https://wikipedia.org",
      "https://youtube.com",
      "https://github.com/",
      "https://sitespeed.io"
  ],
  "browsers": [
      "chrome",
      "firefox",
      "edge",
      "brave"
  ],
  "iterations": 5,
  "pretty_print": true,
  "video": false,
  "maxLoadTime": 60000,
  "cpus_per_browser": 4
}
  • urls
    Description: An array of URLs that will be tested.
    Usage: "urls": ["https://wikipedia.org", "https://youtube.com"]
    Note: should be in format "https://example.com".
  • browsers
    Description: An array of browsers to use for testing.
    Usage: "browsers": ["chrome", "brave"]
    Note: supported values: chrome, firefox, edge, brave
  • iterations
    Description: Number of iterations to run each test. If not specified, a default value (like 5) is assumed.
    Usage: "iterations": 5
    Note: should be an integer value.
  • pretty_print
    Description: Boolean value (true or false) to enable pretty printing of results.
    Usage: "pretty_print": true
  • video
    Description: Boolean value (true or false) to enable or disable video recording.
    Usage: "video": false
  • maxLoadTime
    Description: Maximum page load time (in milliseconds) before a test times out.
    Usage: "maxLoadTime": 60000
  • cpus_per_browser
    Description: Number of CPUs allocated per browser instance.
    Usage: "cpus_per_browser": 4
    Note: should be an integer value.
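
The constraints listed above can be checked before a long collection run. The following is a minimal sketch, not part of the tool: validate_config is a hypothetical helper, and the rule that every URL must start with "https://" is an assumption based on the format note for urls.

```python
import json

SUPPORTED_BROWSERS = {"chrome", "firefox", "edge", "brave"}


def validate_config(cfg):
    """Return a list of problems found in a config dict (empty if none)."""
    problems = []

    urls = cfg.get("urls")
    if not isinstance(urls, list) or not urls:
        problems.append("urls must be a non-empty array")
    else:
        for url in urls:
            if not isinstance(url, str) or not url.startswith("https://"):
                problems.append(f"url not in https://example.com format: {url!r}")

    unknown = set(cfg.get("browsers", [])) - SUPPORTED_BROWSERS
    if unknown:
        problems.append(f"unsupported browsers: {sorted(unknown)}")

    for key in ("iterations", "cpus_per_browser"):
        value = cfg.get(key)
        # bool is a subclass of int in Python, so reject it explicitly
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, int) or value <= 0
        ):
            problems.append(f"{key} must be a positive integer")

    for key in ("pretty_print", "video"):
        if key in cfg and not isinstance(cfg[key], bool):
            problems.append(f"{key} must be true or false")

    return problems


example = json.loads(
    '{"urls": ["https://wikipedia.org"], "browsers": ["chrome", "brave"], "iterations": 5}'
)
print(validate_config(example))  # an empty list means no problems were found
```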

orchestration_script/ec2_params.json

Here is an example:

{
    "USA-California": {
        "region": "us-west-1",
        "image_id": "AWS_IMAGE_ID",
        "security_group_ids": ["AWS_SECURITY_GROUP_ID"],
        "subnet": "AWS_SUBNET"
    }
}

The key is the location name (can be anything, this is used only locally). The values are as follows:

  • region: The AWS Region name.
  • image_id: The ID of the OS image that you want to run on the AWS instance. You can find it using these steps
  • security_group_ids: An array representing the security groups to be applied to this AWS instance. Create a group that allows at least SSH (port 22) inbound from the IP address of the server you will be executing this tool from. If you don't have a fixed public IP, you may open port 22 to 0.0.0.0/0 (everyone) at your own risk. Read more on Security Groups
  • subnet: The subnet this instance should belong to (the subnet can/should be tied to the security group(s)). Read more
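
Because every configured location has to resolve to either an entry in this file or a key in non_aws_instances, a quick consistency check can catch typos in location names early. The sketch below is illustrative only: both helper functions are hypothetical, and treating all four fields as mandatory is an assumption.

```python
REQUIRED_EC2_KEYS = {"region", "image_id", "security_group_ids", "subnet"}


def unresolved_locations(locations, ec2_params, non_aws_instances):
    """Return location names defined in neither mapping."""
    known = set(ec2_params) | set(non_aws_instances)
    return [name for name in locations if name not in known]


def missing_ec2_keys(ec2_params):
    """Map each location to the required keys its entry lacks, if any."""
    result = {}
    for name, entry in ec2_params.items():
        missing = sorted(REQUIRED_EC2_KEYS - set(entry))
        if missing:
            result[name] = missing
    return result


# Example data mirroring the JSON structure shown above
ec2_params = {
    "USA-California": {
        "region": "us-west-1",
        "image_id": "AWS_IMAGE_ID",
        "security_group_ids": ["AWS_SECURITY_GROUP_ID"],
        "subnet": "AWS_SUBNET",
    }
}
non_aws = {"local": {"ip_addr": "localhost"}}

print(unresolved_locations(["local", "USA-California", "Canada"], ec2_params, non_aws))
print(missing_ec2_keys(ec2_params))
```

Here "Canada" would be reported as unresolved because it appears in neither mapping.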

orchestration_script/.env

This environment file contains some variables related to spawning the AWS instances in various locations.
You should copy the orchestration_script/env file to orchestration_script/.env and fill in the variables.
AWS IAM Access ID and Key

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY

AWS Instance Setup

  • INSTANCE_TYPE="m6a.32xlarge" # Can change the type (but data collection speed will change)
  • KEY_PAIR_NAME="YOUR_KEY_PAIR_NAME"
  • DISK_SIZE=320 # 320 GB (required for 1000 websites, and 20 total iterations per site)
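
To confirm that the .env file is filled in before instances are spawned, the variables can be parsed and checked with the standard library. This is a rough sketch rather than how the tool itself loads the file: parse_env and missing_vars are hypothetical helpers, and treating all five variables as required is an assumption. The parser relies on shlex, so a value containing an unbalanced quote would raise ValueError.

```python
import shlex

REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "INSTANCE_TYPE",
    "KEY_PAIR_NAME",
    "DISK_SIZE",
)


def parse_env(text):
    """Parse KEY=value lines, dropping quotes and trailing # comments."""
    values = {}
    for line in text.splitlines():
        # comments=True makes shlex discard everything after an unquoted '#'
        for token in shlex.split(line, comments=True):
            key, sep, value = token.partition("=")
            if sep:
                values[key] = value
    return values


def missing_vars(values):
    """Return required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


sample = (
    'INSTANCE_TYPE="m6a.32xlarge" # Can change the type\n'
    'KEY_PAIR_NAME="YOUR_KEY_PAIR_NAME"\n'
    "DISK_SIZE=320 # 320 GB\n"
)
print(missing_vars(parse_env(sample)))
```

With the sample above, the two AWS credential variables are reported as missing.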

Usage

  1. Create a config.json file and edit orchestration_config.py.
  2. Ensure Docker is running before executing the script.
  3. Run the script:
python3 main.py

Output

After running the tests with the HAR Analysis Tool, results are organized in a timestamped folder structure.

Results Folder Structure

The results folder (e.g., results/2025-10-03T183312) contains data collected from multiple vantage points (regions), with separate collections for no-script and with-script modes.

Directory Structure:

results/2025-10-03T183312/
├── Canada/                          # Vantage point: Canada
├── France/                          # Vantage point: France
├── Germany/                         # Vantage point: Germany
├── India/                           # Vantage point: India
├── Ireland/                         # Vantage point: Ireland
├── Singapore/                       # Vantage point: Singapore
├── USA-California/                  # Vantage point: California, USA
├── USA-Ohio/                        # Vantage point: Ohio, USA
│   ├── no-script/
│   └── with-script/
└── logs/                            # Collection logs

Per-Region Structure (e.g., Canada/):

Canada/
├── data_collection.log              # Collection log for this region
├── no-script/                       # Data collected without script execution
│   ├── HARs/                        # Raw HAR files organized by iteration
│   │   ├── 0/                       # Iteration 0
│   │   │   ├── brave/               # Browser: Brave
│   │   │   │   ├── 163.com/        # Website domain
│   │   │   │   │   ├── browsertime_brave.har     # HAR file
│   │   │   │   │   ├── loadedPage.jpg            # Screenshot
│   │   │   │   │   └── true.json                 # Metadata
│   │   │   │   ├── 166.com/
│   │   │   │   └── ...
│   │   │   ├── chrome/              # Browser: Chrome
│   │   │   ├── edge/                # Browser: Edge
│   │   │   └── firefox/             # Browser: Firefox
│   │   ├── 1/                       # Iteration 1
│   │   ├── 2/                       # Iteration 2
│   │   └── ...                      # Up to iteration 9 (10 total)
│   └── info/                        # Processed browser data
│       ├── brave_data_0.0.json      # Brave data (threshold 0.0)
│       ├── brave_data_0.5.json      # Brave data (threshold 0.5)
│       ├── brave_data_0.8.json      # Brave data (threshold 0.8)
│       ├── brave_data_1.0.json      # Brave data (threshold 1.0)
│       ├── chrome_data_0.0.json     # Chrome data (threshold 0.0)
│       ├── chrome_data_0.5.json
│       ├── chrome_data_0.8.json
│       ├── chrome_data_1.0.json
│       ├── edge_data_0.0.json       # Edge data
│       ├── edge_data_0.5.json
│       ├── edge_data_0.8.json
│       ├── edge_data_1.0.json
│       ├── firefox_data_0.0.json    # Firefox data
│       ├── firefox_data_0.5.json
│       ├── firefox_data_0.8.json
│       ├── firefox_data_1.0.json
│       ├── hars_dict.pkl            # Pickled HAR dictionary
│       ├── repetitive_netlocs_0.0.pkl  # Repetitive domains (threshold 0.0)
│       ├── repetitive_netlocs_0.5.pkl  # Repetitive domains (threshold 0.5)
│       ├── repetitive_netlocs_0.8.pkl  # Repetitive domains (threshold 0.8)
│       └── repetitive_netlocs_1.0.pkl  # Repetitive domains (threshold 1.0)
└── with-script/                     # Data collected with script execution
    ├── HARs/                        # Same structure as no-script
    └── info/                        # Same structure as no-script
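
The HARs/<iteration>/<browser>/<website>/ layout shown above can be traversed with a single glob pattern. The helper below is a hypothetical sketch that assumes the directory names match the tree exactly; it is not part of the repository.

```python
from pathlib import Path


def iter_har_files(region_dir, script_mode="no-script"):
    """Yield (iteration, browser, site, har_path) for each HAR in a region."""
    hars_root = Path(region_dir) / script_mode / "HARs"
    # Layout: HARs/<iteration>/<browser>/<site>/browsertime_<browser>.har
    for har_path in sorted(hars_root.glob("*/*/*/browsertime_*.har")):
        site_dir = har_path.parent
        browser_dir = site_dir.parent
        iteration_dir = browser_dir.parent
        yield iteration_dir.name, browser_dir.name, site_dir.name, har_path
```

For example, list(iter_har_files("results/2025-10-03T183312/Canada")) would enumerate every no-script HAR collected from the Canada vantage point, with the iteration, browser, and website names returned as strings.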

Key Components:

  1. HARs Directory: Contains raw HTTP Archive (HAR) files organized by:

    • Iteration (0-9): 10 iterations per website for statistical significance
    • Browser (brave, chrome, edge, firefox): Separate data for each browser
    • Website Domain: One directory per website tested
    • Files per website:
      • browsertime_{browser}.har: HTTP Archive with network traffic
      • loadedPage.jpg: Screenshot of loaded page
      • true.json: Metadata about the page load
  2. Info Directory: Contains processed browser data files:

    • {browser}_data_{threshold}.json (e.g., brave_data_0.5.json): Processed data for each browser at different privacy filter thresholds
    • Thresholds (0.0, 0.5, 0.8, 1.0) represent the minimum fraction of iterations a domain must appear in to be included
    • repetitive_netlocs_{threshold}.pkl: Pickled data of domains that appear repeatedly across iterations
    • hars_dict.pkl: Consolidated HAR data in pickled format
  3. Script Modes:

    • no-script: Data collected without injecting the cookie consent acceptance script (baseline tracking)
    • with-script: Data collected with the cookie consent acceptance script injected (accepts all cookies via accept_cookies.js)
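
To make the threshold definition concrete, the sketch below filters domains by the fraction of iterations in which they appear. It is only an illustration of the definition given above: the function name and input shape are hypothetical, and the inclusive comparison (>=) is an assumption about how the boundary is handled.

```python
from collections import Counter


def domains_meeting_threshold(domains_per_iteration, threshold):
    """Return domains seen in at least `threshold` fraction of iterations."""
    total = len(domains_per_iteration)
    if total == 0:
        return set()
    # Count each domain at most once per iteration
    counts = Counter(
        domain for domains in domains_per_iteration for domain in set(domains)
    )
    return {domain for domain, seen in counts.items() if seen / total >= threshold}


iterations = [
    {"a.com", "t.net"},
    {"a.com"},
    {"a.com", "t.net"},
    {"a.com"},
    {"a.com", "x.org"},
]
print(sorted(domains_meeting_threshold(iterations, 0.0)))  # ['a.com', 't.net', 'x.org']
print(sorted(domains_meeting_threshold(iterations, 0.5)))  # ['a.com']
```

With five iterations, t.net appears in two of them (fraction 0.4), so it passes threshold 0.0 but not 0.5.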

Browser Data JSON Structure (browser_data_0.0.json):

Each browser data file contains:

  • Website URLs as keys
  • Per-website metrics:
    • ad_and_tracking: List of ad/tracking domains
    • first_party_domains: First-party domains
    • third_party_domains: Third-party domains
    • triggered_domain_server_locations: Server IP locations for domains (by category)
    • all_request_locations: Server IP locations for all HTTP requests (Chrome only, includes volume)
    • referer_graph: Referer-referee relationships across iterations
    • cookie_banner_triggered_count: Number of times cookie banner was detected
    • Various statistics and metadata
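
As a rough illustration of consuming one of these files, the sketch below counts distinct ad/tracking domains per website. It assumes that ad_and_tracking holds a flat list of domain strings, which the field description suggests but does not guarantee; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_browser_data(path):
    """Load a browser data JSON file into a dict keyed by website URL."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def tracker_counts(browser_data):
    """Map each website to its number of distinct ad/tracking domains."""
    return {
        site: len(set(metrics.get("ad_and_tracking", [])))
        for site, metrics in browser_data.items()
    }


def top_sites_by_trackers(browser_data, limit=10):
    """Return (site, count) pairs, highest count first, ties sorted by name."""
    counts = tracker_counts(browser_data)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]
```

A typical call would be top_sites_by_trackers(load_browser_data(".../no-script/info/chrome_data_0.0.json")), where the path prefix depends on the region folder being inspected.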

Usage in Analysis:

The DuckDB analysis suite (analysis_scripts/duckdb_analysis) consumes this results folder structure to generate comprehensive privacy analysis reports. Point main.py to the timestamped results folder:

python3 src/main.py /path/to/results/2025-10-03T183312 --script-mode no-script

Analysis

Note: Analysis will try to auto-execute after data is collected from all AWS instances

All the scripts required for analyzing HAR file data are in /analysis_scripts and can be triggered using analyse.py.

 python3 analyse.py <PATH-TO-RESULTS> <PARAMS>

The PATH-TO-RESULTS should be a dir which contains dirs with location names (e.g. results/2025-10-03T200100)

The following optional params are supported:

  • -c --check-failed: Use Ollama (or another model you set up) to check for failed page loads using the images
  • -p --preprocess: Preprocess the HARs to extract the relevant information out into the browser_data_<NUM>.json
  • -b --compare-browsers: Run comparison across browsers, for each location
  • -l --compare-locations: Run comparison across locations
  • -a --all: Run all of the above (default, if no optional options are specified)
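
The "run everything unless a step is selected" behaviour of these flags can be modelled with argparse as sketched below. This is an assumption about the semantics described above, not a copy of analyse.py; the parse_args helper and the way the default is applied are illustrative.

```python
import argparse

STEPS = ("check_failed", "preprocess", "compare_browsers", "compare_locations")


def parse_args(argv):
    """Parse flags; enable every step if --all is given or no step is chosen."""
    parser = argparse.ArgumentParser(prog="analyse.py")
    parser.add_argument("results", help="directory containing location folders")
    parser.add_argument("-c", "--check-failed", action="store_true")
    parser.add_argument("-p", "--preprocess", action="store_true")
    parser.add_argument("-b", "--compare-browsers", action="store_true")
    parser.add_argument("-l", "--compare-locations", action="store_true")
    parser.add_argument("-a", "--all", action="store_true")
    args = parser.parse_args(argv)

    if args.all or not any(getattr(args, step) for step in STEPS):
        for step in STEPS:
            setattr(args, step, True)
    return args


args = parse_args(["results/2025-10-03T200100", "-p"])
print(args.preprocess, args.compare_browsers)  # True False
```

Under this model, passing only -p runs preprocessing alone, while passing no optional flag is equivalent to -a.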

Server IP analysis

Advanced DuckDB Analysis Suite

The DuckDB analysis suite (/analysis_scripts/duckdb_analysis) provides comprehensive privacy analysis tools using DuckDB and DuckPGQ for graph analysis.

Installation

First, navigate to the duckdb_analysis directory:

cd analysis_scripts/duckdb_analysis

Install dependencies:

pip install -r requirements.txt

Note: All commands in this section assume you are in the analysis_scripts/duckdb_analysis directory.

Quick Start - Integrated Analysis (main.py)

The main.py script runs the complete analysis pipeline with comprehensive command-line options.

Basic Usage:

# From analysis_scripts/duckdb_analysis directory
python3 src/main.py /path/to/results/2025-10-03T183312

This generates:

  • Regional privacy visualizations
  • Server location analysis and IP destination tables
  • Cookie banner impact analysis
  • Ad traffic concentration analysis
  • Referer-referee tracking chain analysis

Complete Command-line Options:

python3 src/main.py RESULTS_FOLDER [OPTIONS]

Required Arguments:

  • RESULTS_FOLDER - Path to results folder containing regional browser data (e.g., results/2025-10-03T183312)

Optional Arguments:

Flag                                        Description                                            Default
-o, --output DIR                            Output directory for visualizations and reports        ./visualization_demo_output
-t, --threshold FLOAT                       Privacy filter threshold for referer analysis          0.0
-r, --regions REGION [REGION ...]           Specific regions to analyze                            All available regions
-s, --script-mode {no-script,with-script}   Script mode to analyze (only for new format results)   no-script

Analysis Control Flags (Skip Components):

Flag                         Description
--skip-regional-comparison   Skip regional comparison visualizations
--skip-server-ip             Skip server IP destination analysis
--skip-cookie-banners        Skip cookie banner impact analysis
--skip-concentration         Skip ad traffic concentration analysis
--skip-apex-charts           Skip apex domain charts for top websites
--skip-dashboards            Skip comprehensive dashboard showcase
--skip-tracking-chains       Skip referer-referee tracking chain analysis

Usage Examples:

# Basic analysis with default settings
python3 src/main.py /path/to/results/2025-10-03T183312

# Analyze with custom output directory
python3 src/main.py /path/to/results --output ./custom_output

# Analyze specific regions only
python3 src/main.py /path/to/results --regions USA-California Canada France

# Analyze with-script mode data
python3 src/main.py /path/to/results --script-mode with-script

# Set privacy filter threshold for referer analysis
python3 src/main.py /path/to/results --threshold 0.8

# Skip cookie banner and concentration analysis for faster execution
python3 src/main.py /path/to/results --skip-cookie-banners --skip-concentration

# Complete example with multiple options
python3 src/main.py /path/to/results/2025-10-03T183312 \
  --output ./privacy_analysis \
  --threshold 0.0 \
  --script-mode no-script \
  --regions USA-Ohio France Germany \
  --skip-dashboards

Get Help:

python3 src/main.py --help

Individual Analysis Scripts

1. Combine Website Location Data

Combine server location data from all regions into a single JSON file:

python3 src/combine_website_locations.py /path/to/results/2025-10-03T183312 \
  -o combined_website_locations.json \
  --script-mode no-script

2. Sankey Flow Diagrams

Generate Sankey diagrams showing first-party to ad/tracking data flows:

# Using browser JSON for first-party locations (default)
python3 src/create_sankey_from_combined.py combined_website_locations.json -o sankey_plots

# Using website_locs.json for first-party locations
python3 src/create_sankey_from_combined.py combined_website_locations.json \
  -o sankey_plots \
  --use-website-locs

# Volume-weighted flows (weight by ad/tracking domain count)
python3 src/create_sankey_from_combined.py combined_website_locations.json \
  -o sankey_plots_volume \
  --use-website-locs \
  --weight-by-volume

3. Server IP Destination Analysis

Analyze where ad/tracking requests are sent from each vantage point:

python3 src/analysis/server_ip_analysis.py

Generates tables showing:

  • Number of unique destination countries
  • Percentage of requests to same region (based on HTTP request volume)
  • EU adequacy compliance analysis

4. Category Prevalence Analysis

Analyze which website categories contribute most to ad/tracking activity:

python3 src/analyze_category_prevalence.py combined_website_locations.json USA-Ohio

Output includes:

  • Top categories by ad/tracking contribution
  • Percentage of total ad/tracking by category
  • Number of websites per category

5. Referer Graph Analysis

Analyze referer-referee relationships using DuckPGQ for graph queries:

# Per-website subgraph analysis (recommended)
python3 src/referer_analysis/batch_subgraph_analysis.py \
  /path/to/results/2025-10-03T183312 \
  --threshold 0.0 \
  --script-mode no-script \
  --k 300

# Combined graph analysis (all websites in one graph)
python3 src/referer_analysis/batch_referer_analysis.py \
  /path/to/results/2025-10-03T183312 \
  --threshold 0.0 \
  --script-mode no-script

Parameters:

  • --threshold: Minimum fraction of iterations in which a domain must appear to be included (0.0, 0.5, 0.8, etc.). For example, 0.8 means the domain must appear in at least 80% of iterations. Results in the paper use threshold 0.0.
  • --script-mode: Use no-script or with-script data
  • --k: Number of top nodes to analyze (for subgraph analysis)

Output Structure

Example output structure from analysis_scripts/duckdb_analysis/with-script-nov2:

output_directory/
├── # Top-level Analysis Files
├── comprehensive_statistics.json              # Overall statistics across all regions
├── server_destination_analysis.csv            # Server IP destination summary
├── eu_adequacy_compliance_analysis.csv        # EU GDPR adequacy compliance
├── ad_traffic_concentration.csv               # Ad traffic concentration metrics
├── cookie_banner_impact_analysis.csv          # Cookie banner impact data
├── cookie_banner_summary_by_region_browser.csv
├── top_10pct_website_similarity.csv           # Similarity metrics
├── top_20pct_website_similarity.csv
├── top_50pct_website_similarity.csv
├── traffic_from_top_websites.csv
├── traffic_concentration_distribution_chrome.pdf
├── traffic_per_percentile_chrome.pdf
│
├── regional_plots/                            # Regional privacy visualizations
│   ├── brave/
│   ├── chrome/
│   ├── edge/
│   └── firefox/
│       ├── apex_domain_histogram_<region>_threshold_<t>.pdf
│       ├── category_distribution_stacked_<region>_threshold_<t>.pdf
│       ├── category_heatmap_threshold_<t>.pdf
│       ├── cdf_comparison_threshold_<t>.pdf
│       ├── regional_privacy_rankings_threshold_<t>.pdf
│       └── ...
│
├── server_plots/                              # Server location analysis by region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   └── USA-Ohio/
│       ├── data_flow_sankey_brave_threshold_<t>.pdf
│       ├── data_flow_sankey_chrome_threshold_<t>.pdf
│       ├── data_flow_sankey_edge_threshold_<t>.pdf
│       └── data_flow_sankey_firefox_threshold_<t>.pdf
│
├── referer_subgraph_analysis/                 # Referer graph analysis per region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   ├── USA-Ohio/
│   │   ├── overall_statistics.json
│   │   ├── subgraph_statistics.csv
│   │   ├── node_url_multiplicity.csv
│   │   ├── cross_subgraph_patterns.csv
│   │   ├── max_outdegree_nodes_per_subgraph.csv
│   │   ├── distance_from_origin_analysis.csv
│   │   ├── tracking_chain_length_summary.csv
│   │   ├── all_hubs_by_tracking_outdegree.csv
│   │   ├── top_50_hubs_by_tracking_outdegree.csv
│   │   ├── top_100_hubs_by_tracking_outdegree.csv
│   │   ├── top_subgraphs_by_edge_count.csv
│   │   ├── top_subgraphs_by_node_count.csv
│   │   ├── top_subgraphs_by_tracker_count.csv
│   │   ├── outdegree_distribution.pdf
│   │   ├── outdegree_by_node_type.pdf
│   │   ├── hub_node_type_distribution.pdf
│   │   ├── hub_tracking_distribution.pdf
│   │   ├── all_hubs_analysis/
│   │   ├── top_100_hubs_analysis/
│   │   └── exported_subgraphs/           # For Gephi/Cytoscape import
│   └── cross_region_hub_analysis/        # Cross-region hub patterns
│
├── referer_subgraph_analysis_chrome/         # Chrome-specific referer analysis
│   ├── [Same structure as above for Chrome only]
│   └── cross_region_hub_analysis/
│
├── referer_subgraph_analysis_brave/          # Brave-specific referer analysis
│   ├── [Same structure as above for Brave only]
│   └── cross_region_hub_analysis/
│
├── hub_analysis/                              # Hub node analysis by region
│   ├── Canada/
│   ├── France/
│   ├── Germany/
│   ├── India/
│   ├── Ireland/
│   ├── Singapore/
│   ├── USA-California/
│   └── USA-Ohio/
│       ├── hub_node_type_distribution.pdf
│       ├── hub_tracking_distribution.pdf
│       ├── outdegree_by_node_type.pdf
│       └── outdegree_distribution.pdf
│
├── comprehensive_analysis/                    # Comprehensive statistical reports
│   └── visualizations/
│       ├── browser_comparison_charts.pdf
│       ├── regional_heatmaps.pdf
│       └── ...
│
├── cookie_banner_plots/                       # Cookie banner analysis
├── cookie_banner_common_websites_plots/
├── apex_domains_top_websites/                 # Top website apex domain analysis
├── chain_analysis_all/                        # Tracking chain analysis (all browsers)
├── chain_analysis_all_chrome/                 # Chrome-specific chain analysis
└── chain_analysis_all_brave/                  # Brave-specific chain analysis

Module Organization

analysis_scripts/duckdb_analysis/src/
├── main.py                          # Main integrated analysis entry point
├── combine_website_locations.py     # Combine location data across regions
├── create_sankey_from_combined.py   # Generate Sankey flow diagrams
├── analyze_category_prevalence.py   # Category analysis
├── core/                            # Core data processing
│   ├── browser_data_processor.py
│   ├── regional_comparator.py
│   └── enhanced_compare_regions.py
├── analysis/                        # Specialized analysis modules
│   ├── server_ip_analysis.py        # Server IP destination analysis
│   ├── cdf_analyzer.py
│   ├── data_transfer_analyzer.py
│   └── statistical_analyzer.py
├── visualization/                   # Plotting and charts
│   ├── plotting_utilities.py
│   └── region_visualizations.py
└── referer_analysis/                # Referer-referee graph analysis
    ├── referer_graph_analysis.py
    ├── referer_subgraph_analysis.py
    ├── batch_referer_analysis.py
    └── batch_subgraph_analysis.py

Data Availability

To support reproducibility and future research, we publicly release the full dataset and code associated with this project. The dataset is available via Globus under the folder RegTrack-MADWeb26:

Globus Endpoint: RegTrack-MADWeb26

The dataset includes:

  • Raw HAR files from all crawls (~1.5 TB compressed)
  • Processed browser_data_*.json files per region, browser, and threshold
  • Screenshots and metadata from each visit
  • Aggregated statistics and measurement results

The dataset is licensed under CC BY 4.0. All code in this repository is licensed under the MIT License.


Citation

If you use this tool, dataset, or analysis in your research, please cite the following paper:

@inproceedings{prasad2026regtrack,
  title     = {{RegTrack}: Uncovering Global Disparities in Third-party Advertising and Tracking},
  author    = {Prasad, Tanya and Vora, Rut and Lim, Soo Yee and Hoang, Nguyen Phong and Pasquier, Thomas},
  booktitle = {Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb)},
  year      = {2026},
  address   = {San Diego, CA, USA},
  publisher = {Internet Society},
  doi       = {10.14722/madweb.2026.23010},
  isbn      = {978-1-970672-06-0}
}
