The HAR Analysis Tool is designed to automate the collection, processing, and analysis of HTTP Archive (HAR) files.
These HAR files are generated by executing browser automation tests using Docker. The primary goal is to analyze domain
data and identify patterns, such as repetitive and exclusive domains, across multiple iterations.
This tool is designed to automate the process of generating and analyzing HTTP Archive (HAR) files using Docker and the
Browsertime tool. The generated HAR files are then processed to extract domains. The script allows for configuration via
a JSON file and can also prompt the user for manual input.
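As a rough illustration of the domain-extraction step, the sketch below reads the request URLs from a parsed HAR document and collects their hostnames. It relies only on the standard HAR layout (log.entries[].request.url). The function names and the sample data are hypothetical and are not taken from this repository's scripts.

```python
import json
from urllib.parse import urlsplit


def extract_domains(har):
    """Return the set of hostnames requested in a parsed HAR document."""
    domains = set()
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        host = urlsplit(url).hostname  # None for URLs without a host (e.g. data:)
        if host:
            domains.add(host)
    return domains


def extract_domains_from_file(path):
    """Load a .har file (plain JSON) and extract its hostnames."""
    with open(path, encoding="utf-8") as fh:
        return extract_domains(json.load(fh))


# Hand-written sample in HAR shape, for illustration only.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://wikipedia.org/"}},
            {"request": {"url": "https://upload.wikimedia.org/a.png"}},
            {"request": {"url": "https://wikipedia.org/static/app.js"}},
        ]
    }
}
print(sorted(extract_domains(sample_har)))
# ['upload.wikimedia.org', 'wikipedia.org']
```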
- Python 3.x
- Python3-venv
- Docker
- Clone or download this repository.
- Ensure Docker is installed and running on your system.
- Set up a Python venv:
  python3 -m venv venv
- Activate the venv:
  source venv/bin/activate
- Install the Python dependencies:
  pip install -r collection_scripts/requirements.txt
- Pull the Docker container:
  docker pull rutvora/browsertime
  (This is a modified browsertime container that includes the brave browser. No changes are made to the existing browsers in the container.)
The tool requires four files for configuration:
- orchestration_config.py (name of the file can't be changed without code changes)
- config.json (name can be changed to anything, as long as the format is JSON)
- orchestration_script/ec2_params.json (name of the file can't be changed without code changes)
- orchestration_scripts/.env (name of the file can't be changed without code changes)
The orchestration_config.py file consists of the following configurations, as Python variables:
- config_file: The name/path of the config.json file
- res_dir: The path to the folder where the results will be stored (the results are stored in a dated folder inside this directory)
- override_urls: Whether to download the latest set of Tranco and Cloudflare URLs or use the ones provided in the config.json file
- locations: An array of locations where the tool has access to servers (or should spawn AWS instances). Each location should exist either in orchestration_scripts/ec2_params.json or as a key in the non_aws_instances variable.
- non_aws_instances: A dictionary keyed by location name, with the following structure ("local" is the name of the location):
non_aws_instances = {
"local": {
"instance_id": None,
"ip_addr": "localhost",
"user": "USERNAME",
"remote_path": "~"
}
}
Note: We expect all access to be via SSH keys, not passwords. It is up to you to configure AWS regions or custom instances/servers with the requisite public keys beforehand.
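The requirement that every entry in locations be defined somewhere can be checked before a run. The following is a minimal sketch, assuming ec2_params.json has already been loaded into a dictionary. The helper name unresolved_locations is hypothetical and not part of the tool.

```python
def unresolved_locations(locations, ec2_params, non_aws_instances):
    """Return the locations that are defined in neither mapping."""
    known = set(ec2_params) | set(non_aws_instances)
    return [name for name in locations if name not in known]


# Illustrative values modelled on the examples in this README.
non_aws_instances = {
    "local": {
        "instance_id": None,
        "ip_addr": "localhost",
        "user": "USERNAME",
        "remote_path": "~",
    }
}
ec2_params = {"USA-California": {"region": "us-west-1"}}
locations = ["local", "USA-California", "Canada"]

missing = unresolved_locations(locations, ec2_params, non_aws_instances)
print(missing)  # ['Canada']
```

An empty list means every location can be resolved to either an AWS entry or a non-AWS instance.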
{
"urls": [
"https://wikipedia.org",
"https://youtube.com",
"https://github.com/",
"https://sitespeed.io"
],
"browsers": [
"chrome",
"firefox",
"edge",
"brave"
],
"iterations": 5,
"pretty_print": true,
"video": false,
"maxLoadTime": 60000,
"cpus_per_browser": 4
}
urls
Description: An array of URLs that will be tested.
Usage: "urls": ["https://wikipedia.org", "https://youtube.com"]
Note: should be in the format "https://example.com".

browsers
Description: An array of browsers to use for testing.
Usage: "browsers": ["chrome", "brave"]
Note: supported values: chrome, firefox, edge, brave

iterations
Description: Number of iterations to run each test. If not specified, a default value (like 5) is assumed.
Usage: "iterations": 5
Note: should be an integer value.

pretty_print
Description: Boolean value (true or false) to enable pretty printing of results.
Usage: "pretty_print": true

video
Description: Boolean value (true or false) to enable or disable video recording.
Usage: "video": false

maxLoadTime
Description: Maximum page load time (in milliseconds) before a test times out.
Usage: "maxLoadTime": 60000

cpus_per_browser
Description: Number of CPUs allocated per browser instance.
Usage: "cpus_per_browser": 4
Note: should be an integer value.
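The constraints listed above can be checked before starting a long collection run. The sketch below is one possible validator and is not part of the tool. The function names are hypothetical, and treating anything other than an "https://" prefix as an error is an assumption based on the format note for urls.

```python
import json

SUPPORTED_BROWSERS = {"chrome", "firefox", "edge", "brave"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for url in config.get("urls", []):
        # Assumption: only the https://example.com form is accepted.
        if not isinstance(url, str) or not url.startswith("https://"):
            problems.append(f"urls: {url!r} is not of the form https://example.com")
    for browser in config.get("browsers", []):
        if not isinstance(browser, str) or browser not in SUPPORTED_BROWSERS:
            problems.append(f"browsers: {browser!r} is not supported")
    for key in ("iterations", "cpus_per_browser"):
        if key in config and not _is_int(config[key]):
            problems.append(f"{key}: expected an integer")
    for key in ("pretty_print", "video"):
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key}: expected true or false")
    return problems


def validate_config_file(path):
    """Load a JSON config file and validate it."""
    with open(path, encoding="utf-8") as fh:
        return validate_config(json.load(fh))


# Deliberately broken example: bad URL, unsupported browser, string iterations.
broken = {
    "urls": ["https://wikipedia.org", "youtube.com"],
    "browsers": ["chrome", "safari"],
    "iterations": "5",
    "video": False,
}
for problem in validate_config(broken):
    print(problem)
```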
Here's an example ec2_params.json:
{
"USA-California": {
"region": "us-west-1",
"image_id": "AWS_IMAGE_ID",
"security_group_ids": ["AWS_SECURITY_GROUP_ID"],
"subnet": "AWS_SUBNET"
}
}
The key is the location name (it can be anything; it is used only locally). The values are as follows:
- region: The AWS Region name.
- image_id: The ID of the OS image that you want to run on the AWS instance. You can find it using these steps
- security_group_ids: An array representing the security groups to be applied to this AWS instance. Create a group that allows at least SSH (port 22) inbound from the IP address of the server you will be executing this tool from. If you don't have a fixed public IP, you may open port 22 to 0.0.0.0/0 (everyone) at your own risk. Read more on Security Groups
- subnet: The subnet this instance should belong to (the subnet can/should be tied to the security group(s)). Read more
This environment file contains some variables related to spawning the AWS instances in various locations.
You should copy the orchestration_script/env file to orchestration_script/.env and fill in the variables.
AWS IAM Access ID and Key
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
AWS Instance Setup
INSTANCE_TYPE="m6a.32xlarge" # Can change the type (but data collection speed will change)
KEY_PAIR_NAME="YOUR_KEY_PAIR_NAME"
DISK_SIZE=320 # 320 GB (required for 1000 websites, and 20 total iterations per site)
- Create a config.json file and modify orchestration_config.py.
- Ensure Docker is running before executing the script.
- Run the script:
python3 main.py
After running the tests with the HAR Analysis Tool, results are organized in a timestamped folder structure.
The results folder (e.g., results/2025-10-03T183312) contains data collected from multiple vantage points (regions), with separate collections for no-script and with-script modes.
Directory Structure:
results/2025-10-03T183312/
├── Canada/ # Vantage point: Canada
├── France/ # Vantage point: France
├── Germany/ # Vantage point: Germany
├── India/ # Vantage point: India
├── Ireland/ # Vantage point: Ireland
├── Singapore/ # Vantage point: Singapore
├── USA-California/ # Vantage point: California, USA
├── USA-Ohio/ # Vantage point: Ohio, USA
│ ├── no-script/
│ └── with-script/
└── logs/ # Collection logs
Per-Region Structure (e.g., Canada/):
Canada/
├── data_collection.log # Collection log for this region
├── no-script/ # Data collected without script execution
│ ├── HARs/ # Raw HAR files organized by iteration
│ │ ├── 0/ # Iteration 0
│ │ │ ├── brave/ # Browser: Brave
│ │ │ │ ├── 163.com/ # Website domain
│ │ │ │ │ ├── browsertime_brave.har # HAR file
│ │ │ │ │ ├── loadedPage.jpg # Screenshot
│ │ │ │ │ └── true.json # Metadata
│ │ │ │ ├── 166.com/
│ │ │ │ └── ...
│ │ │ ├── chrome/ # Browser: Chrome
│ │ │ ├── edge/ # Browser: Edge
│ │ │ └── firefox/ # Browser: Firefox
│ │ ├── 1/ # Iteration 1
│ │ ├── 2/ # Iteration 2
│ │ └── ... # Up to iteration 9 (10 total)
│ └── info/ # Processed browser data
│ ├── brave_data_0.0.json # Brave data (threshold 0.0)
│ ├── brave_data_0.5.json # Brave data (threshold 0.5)
│ ├── brave_data_0.8.json # Brave data (threshold 0.8)
│ ├── brave_data_1.0.json # Brave data (threshold 1.0)
│ ├── chrome_data_0.0.json # Chrome data (threshold 0.0)
│ ├── chrome_data_0.5.json
│ ├── chrome_data_0.8.json
│ ├── chrome_data_1.0.json
│ ├── edge_data_0.0.json # Edge data
│ ├── edge_data_0.5.json
│ ├── edge_data_0.8.json
│ ├── edge_data_1.0.json
│ ├── firefox_data_0.0.json # Firefox data
│ ├── firefox_data_0.5.json
│ ├── firefox_data_0.8.json
│ ├── firefox_data_1.0.json
│ ├── hars_dict.pkl # Pickled HAR dictionary
│ ├── repetitive_netlocs_0.0.pkl # Repetitive domains (threshold 0.0)
│ ├── repetitive_netlocs_0.5.pkl # Repetitive domains (threshold 0.5)
│ ├── repetitive_netlocs_0.8.pkl # Repetitive domains (threshold 0.8)
│ └── repetitive_netlocs_1.0.pkl # Repetitive domains (threshold 1.0)
└── with-script/ # Data collected with script execution
├── HARs/ # Same structure as no-script
└── info/ # Same structure as no-script
Key Components:
- HARs Directory: Contains raw HTTP Archive (HAR) files organized by:
  - Iteration (0-9): 10 iterations per website for statistical significance
  - Browser (brave, chrome, edge, firefox): Separate data for each browser
  - Website Domain: One directory per website tested
  - Files per website:
    - browsertime_{browser}.har: HTTP Archive with network traffic
    - loadedPage.jpg: Screenshot of the loaded page
    - true.json: Metadata about the page load
- Info Directory: Contains processed browser data files:
  - {browser}_data_{threshold}.json (e.g., brave_data_0.5.json): Processed data for each browser at different privacy filter thresholds
  - Thresholds (0.0, 0.5, 0.8, 1.0) represent the minimum fraction of iterations a domain must appear in to be included
  - repetitive_netlocs_{threshold}.pkl: Pickled data of domains that appear repeatedly across iterations
  - hars_dict.pkl: Consolidated HAR data in pickled format
- Script Modes:
  - no-script: Data collected without injecting the cookie consent acceptance script (baseline tracking)
  - with-script: Data collected with the cookie consent acceptance script injected (accepts all cookies via accept_cookies.js)
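The directory layout described above lends itself to a simple glob. The sketch below walks a timestamped results folder and yields one record per HAR file. It assumes exactly the nesting shown in the tree (region, script mode, HARs, iteration, browser, website), and the function name iter_har_files is hypothetical.

```python
from pathlib import Path


def iter_har_files(results_dir, script_mode="no-script"):
    """Yield one record per HAR file found under a timestamped results folder."""
    root = Path(results_dir)
    # <region>/<script mode>/HARs/<iteration>/<browser>/<website>/browsertime_<browser>.har
    pattern = f"*/{script_mode}/HARs/*/*/*/browsertime_*.har"
    for har_path in sorted(root.glob(pattern)):
        region, _mode, _hars, iteration, browser, website, _name = (
            har_path.relative_to(root).parts
        )
        yield {
            "region": region,
            "iteration": iteration,  # kept as the directory name, e.g. "0"
            "browser": browser,
            "website": website,
            "path": har_path,
        }
```

Calling list(iter_har_files("results/2025-10-03T183312", "with-script")) would then return the with-script HAR files for every region. The top-level logs/ folder is skipped because it does not match the pattern.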
Browser Data JSON Structure (browser_data_0.0.json):
Each browser data file contains:
- Website URLs as keys
- Per-website metrics:
  - ad_and_tracking: List of ad/tracking domains
  - first_party_domains: First-party domains
  - third_party_domains: Third-party domains
  - triggered_domain_server_locations: Server IP locations for domains (by category)
  - all_request_locations: Server IP locations for all HTTP requests (Chrome only, includes volume)
  - referer_graph: Referer-referee relationships across iterations
  - cookie_banner_triggered_count: Number of times cookie banner was detected
  - Various statistics and metadata
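To show how such a file might be consumed, the sketch below counts the ad/tracking domains recorded for each website. It assumes only what is described above: website URLs as top-level keys and an ad_and_tracking list per website. The function names and the sample data are hypothetical.

```python
import json


def tracking_domain_counts(browser_data):
    """Map each website key to the number of entries in its ad_and_tracking list."""
    counts = {}
    for website, metrics in browser_data.items():
        # Skip any top-level value that is not a per-website dictionary.
        if isinstance(metrics, dict):
            counts[website] = len(metrics.get("ad_and_tracking", []))
    return counts


def load_browser_data(path):
    """Load one processed browser data file (plain JSON)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical, hand-written data in the documented shape.
sample = {
    "https://example.com": {
        "ad_and_tracking": ["ads.example.net", "tracker.example.org"],
        "first_party_domains": ["example.com"],
        "third_party_domains": ["cdn.example.net"],
        "cookie_banner_triggered_count": 3,
    },
    "https://example.org": {"ad_and_tracking": []},
}
print(tracking_domain_counts(sample))
# {'https://example.com': 2, 'https://example.org': 0}
```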
Usage in Analysis:
The DuckDB analysis suite (analysis_scripts/duckdb_analysis) consumes this results folder structure to generate comprehensive privacy analysis reports. Point main.py to the timestamped results folder:
python3 src/main.py /path/to/results/2025-10-03T183312 --script-mode no-script
Note: Analysis will try to auto-execute after data is collected from all AWS instances.
All the scripts required to analyze the HAR data are in /analysis-scripts and can be triggered using analyse.py.
python3 analyse.py <PATH-TO-RESULTS> <PARAMS>
The PATH-TO-RESULTS should be a dir which contains dirs with location names (e.g. results/2025-10-03T200100)
The following optional params are supported:
- -c --check-failed: Use Ollama (or another model you set up) to check for failed page loads using the images
- -p --preprocess: Preprocess the HARs to extract the relevant information into the browser_data_<NUM>.json files
- -b --compare-browsers: Run comparison across browsers, for each location
- -l --compare-locations: Run comparison across locations
- -a --all: Run all of the above (default if no optional params are specified)
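The "run everything unless a step is named" behaviour can be expressed with argparse as follows. This is a hypothetical re-creation of the documented flags for illustration and is not the actual contents of analyse.py. The names build_parser, selected_steps, and STEPS are assumptions.

```python
import argparse

STEPS = ("check_failed", "preprocess", "compare_browsers", "compare_locations")


def build_parser():
    """Build a parser exposing the flags documented above."""
    parser = argparse.ArgumentParser(prog="analyse.py")
    parser.add_argument("results", help="directory containing one sub-directory per location")
    parser.add_argument("-c", "--check-failed", action="store_true")
    parser.add_argument("-p", "--preprocess", action="store_true")
    parser.add_argument("-b", "--compare-browsers", action="store_true")
    parser.add_argument("-l", "--compare-locations", action="store_true")
    parser.add_argument("-a", "--all", action="store_true")
    return parser


def selected_steps(args):
    """Return the steps to run; fall back to all steps when none is chosen."""
    chosen = [step for step in STEPS if getattr(args, step)]
    if args.all or not chosen:
        return list(STEPS)
    return chosen


args = build_parser().parse_args(["results/2025-10-03T200100", "-p", "-b"])
print(selected_steps(args))  # ['preprocess', 'compare_browsers']
```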
The DuckDB analysis suite (/analysis_scripts/duckdb_analysis) provides comprehensive privacy analysis tools using DuckDB and DuckPGQ for graph analysis.
First, navigate to the duckdb_analysis directory:
cd analysis_scripts/duckdb_analysis
Install dependencies:
pip install -r requirements.txt
Note: All commands in this section assume you are in the analysis_scripts/duckdb_analysis directory.
The main.py script runs the complete analysis pipeline with comprehensive command-line options.
Basic Usage:
# From analysis_scripts/duckdb_analysis directory
python3 src/main.py /path/to/results/2025-10-03T183312
This generates:
- Regional privacy visualizations
- Server location analysis and IP destination tables
- Cookie banner impact analysis
- Ad traffic concentration analysis
- Referer-referee tracking chain analysis
Complete Command-line Options:
python3 src/main.py RESULTS_FOLDER [OPTIONS]
Required Arguments:
- RESULTS_FOLDER: Path to results folder containing regional browser data (e.g., results/2025-10-03T183312)
Optional Arguments:
| Flag | Description | Default |
|---|---|---|
| -o, --output DIR | Output directory for visualizations and reports | ./visualization_demo_output |
| -t, --threshold FLOAT | Privacy filter threshold for referer analysis | 0.0 |
| -r, --regions REGION [REGION ...] | Specific regions to analyze | All available regions |
| -s, --script-mode {no-script,with-script} | Script mode to analyze (only for new format results) | no-script |
Analysis Control Flags (Skip Components):
| Flag | Description |
|---|---|
| --skip-regional-comparison | Skip regional comparison visualizations |
| --skip-server-ip | Skip server IP destination analysis |
| --skip-cookie-banners | Skip cookie banner impact analysis |
| --skip-concentration | Skip ad traffic concentration analysis |
| --skip-apex-charts | Skip apex domain charts for top websites |
| --skip-dashboards | Skip comprehensive dashboard showcase |
| --skip-tracking-chains | Skip referer-referee tracking chain analysis |
Usage Examples:
# Basic analysis with default settings
python3 src/main.py /path/to/results/2025-10-03T183312
# Analyze with custom output directory
python3 src/main.py /path/to/results --output ./custom_output
# Analyze specific regions only
python3 src/main.py /path/to/results --regions USA-California Canada France
# Analyze with-script mode data
python3 src/main.py /path/to/results --script-mode with-script
# Set privacy filter threshold for referer analysis
python3 src/main.py /path/to/results --threshold 0.8
# Skip cookie banner and concentration analysis for faster execution
python3 src/main.py /path/to/results --skip-cookie-banners --skip-concentration
# Complete example with multiple options
python3 src/main.py /path/to/results/2025-10-03T183312 \
--output ./privacy_analysis \
--threshold 0.0 \
--script-mode no-script \
--regions USA-Ohio France Germany \
--skip-dashboards
Get Help:
python3 src/main.py --help
1. Combine Website Location Data
Combine server location data from all regions into a single JSON file:
python3 src/combine_website_locations.py /path/to/results/2025-10-03T183312 \
-o combined_website_locations.json \
--script-mode no-script
2. Sankey Flow Diagrams
Generate Sankey diagrams showing first-party to ad/tracking data flows:
# Using browser JSON for first-party locations (default)
python3 src/create_sankey_from_combined.py combined_website_locations.json -o sankey_plots
# Using website_locs.json for first-party locations
python3 src/create_sankey_from_combined.py combined_website_locations.json \
-o sankey_plots \
--use-website-locs
# Volume-weighted flows (weight by ad/tracking domain count)
python3 src/create_sankey_from_combined.py combined_website_locations.json \
-o sankey_plots_volume \
--use-website-locs \
--weight-by-volume
3. Server IP Destination Analysis
Analyze where ad/tracking requests are sent from each vantage point:
python3 src/analysis/server_ip_analysis.py
Generates tables showing:
- Number of unique destination countries
- Percentage of requests to same region (based on HTTP request volume)
- EU adequacy compliance analysis
4. Category Prevalence Analysis
Analyze which website categories contribute most to ad/tracking activity:
python3 src/analyze_category_prevalence.py combined_website_locations.json USA-Ohio
Output includes:
- Top categories by ad/tracking contribution
- Percentage of total ad/tracking by category
- Number of websites per category
5. Referer Graph Analysis
Analyze referer-referee relationships using DuckPGQ for graph queries:
# Per-website subgraph analysis (recommended)
python3 src/referer_analysis/batch_subgraph_analysis.py \
/path/to/results/2025-10-03T183312 \
--threshold 0.0 \
--script-mode no-script \
--k 300
# Combined graph analysis (all websites in one graph)
python3 src/referer_analysis/batch_referer_analysis.py \
/path/to/results/2025-10-03T183312 \
--threshold 0.0 \
--script-mode no-script
Parameters:
- --threshold: The minimum fraction of iterations a domain must appear in (0.0, 0.5, 0.8, etc.). For example, 0.8 means the domain must appear in at least 80% of iterations. Results in the paper use 0.0.
- --script-mode: Use no-script or with-script data
- --k: Number of top nodes to analyze (for subgraph analysis)
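The threshold semantics can be made concrete with a small example. The sketch below keeps a domain when the fraction of iterations it appears in is at least the threshold. Whether the tool compares with "at least" or "strictly more than" is an assumption here, and the function name and sample domains are hypothetical.

```python
from collections import Counter


def domains_meeting_threshold(domains_per_iteration, threshold):
    """Return domains seen in at least `threshold` (a fraction) of iterations."""
    total = len(domains_per_iteration)
    if total == 0:
        return set()
    counts = Counter()
    for domains in domains_per_iteration:
        counts.update(set(domains))  # count each domain once per iteration
    return {d for d, seen in counts.items() if seen / total >= threshold}


# Five iterations: one domain always present, one in 4 of 5, one in 1 of 5.
iterations = [
    {"cdn.example.net", "ads.example.net"},
    {"cdn.example.net", "ads.example.net"},
    {"cdn.example.net", "ads.example.net", "rare.example.org"},
    {"cdn.example.net", "ads.example.net"},
    {"cdn.example.net"},
]
for threshold in (0.0, 0.5, 0.8, 1.0):
    print(threshold, sorted(domains_meeting_threshold(iterations, threshold)))
```

With a threshold of 0.0 every observed domain is kept, which is why that setting gives the most inclusive view of the data.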
Example output structure from analysis_scripts/duckdb_analysis/with-script-nov2:
output_directory/
├── # Top-level Analysis Files
├── comprehensive_statistics.json # Overall statistics across all regions
├── server_destination_analysis.csv # Server IP destination summary
├── eu_adequacy_compliance_analysis.csv # EU GDPR adequacy compliance
├── ad_traffic_concentration.csv # Ad traffic concentration metrics
├── cookie_banner_impact_analysis.csv # Cookie banner impact data
├── cookie_banner_summary_by_region_browser.csv
├── top_10pct_website_similarity.csv # Similarity metrics
├── top_20pct_website_similarity.csv
├── top_50pct_website_similarity.csv
├── traffic_from_top_websites.csv
├── traffic_concentration_distribution_chrome.pdf
├── traffic_per_percentile_chrome.pdf
│
├── regional_plots/ # Regional privacy visualizations
│ ├── brave/
│ ├── chrome/
│ ├── edge/
│ └── firefox/
│ ├── apex_domain_histogram_<region>_threshold_<t>.pdf
│ ├── category_distribution_stacked_<region>_threshold_<t>.pdf
│ ├── category_heatmap_threshold_<t>.pdf
│ ├── cdf_comparison_threshold_<t>.pdf
│ ├── regional_privacy_rankings_threshold_<t>.pdf
│ └── ...
│
├── server_plots/ # Server location analysis by region
│ ├── Canada/
│ ├── France/
│ ├── Germany/
│ ├── India/
│ ├── Ireland/
│ ├── Singapore/
│ ├── USA-California/
│ └── USA-Ohio/
│ ├── data_flow_sankey_brave_threshold_<t>.pdf
│ ├── data_flow_sankey_chrome_threshold_<t>.pdf
│ ├── data_flow_sankey_edge_threshold_<t>.pdf
│ └── data_flow_sankey_firefox_threshold_<t>.pdf
│
├── referer_subgraph_analysis/ # Referer graph analysis per region
│ ├── Canada/
│ ├── France/
│ ├── Germany/
│ ├── India/
│ ├── Ireland/
│ ├── Singapore/
│ ├── USA-California/
│ ├── USA-Ohio/
│ │ ├── overall_statistics.json
│ │ ├── subgraph_statistics.csv
│ │ ├── node_url_multiplicity.csv
│ │ ├── cross_subgraph_patterns.csv
│ │ ├── max_outdegree_nodes_per_subgraph.csv
│ │ ├── distance_from_origin_analysis.csv
│ │ ├── tracking_chain_length_summary.csv
│ │ ├── all_hubs_by_tracking_outdegree.csv
│ │ ├── top_50_hubs_by_tracking_outdegree.csv
│ │ ├── top_100_hubs_by_tracking_outdegree.csv
│ │ ├── top_subgraphs_by_edge_count.csv
│ │ ├── top_subgraphs_by_node_count.csv
│ │ ├── top_subgraphs_by_tracker_count.csv
│ │ ├── outdegree_distribution.pdf
│ │ ├── outdegree_by_node_type.pdf
│ │ ├── hub_node_type_distribution.pdf
│ │ ├── hub_tracking_distribution.pdf
│ │ ├── all_hubs_analysis/
│ │ ├── top_100_hubs_analysis/
│ │ └── exported_subgraphs/ # For Gephi/Cytoscape import
│ └── cross_region_hub_analysis/ # Cross-region hub patterns
│
├── referer_subgraph_analysis_chrome/ # Chrome-specific referer analysis
│ ├── [Same structure as above for Chrome only]
│ └── cross_region_hub_analysis/
│
├── referer_subgraph_analysis_brave/ # Brave-specific referer analysis
│ ├── [Same structure as above for Brave only]
│ └── cross_region_hub_analysis/
│
├── hub_analysis/ # Hub node analysis by region
│ ├── Canada/
│ ├── France/
│ ├── Germany/
│ ├── India/
│ ├── Ireland/
│ ├── Singapore/
│ ├── USA-California/
│ └── USA-Ohio/
│ ├── hub_node_type_distribution.pdf
│ ├── hub_tracking_distribution.pdf
│ ├── outdegree_by_node_type.pdf
│ └── outdegree_distribution.pdf
│
├── comprehensive_analysis/ # Comprehensive statistical reports
│ └── visualizations/
│ ├── browser_comparison_charts.pdf
│ ├── regional_heatmaps.pdf
│ └── ...
│
├── cookie_banner_plots/ # Cookie banner analysis
├── cookie_banner_common_websites_plots/
├── apex_domains_top_websites/ # Top website apex domain analysis
├── chain_analysis_all/ # Tracking chain analysis (all browsers)
├── chain_analysis_all_chrome/ # Chrome-specific chain analysis
└── chain_analysis_all_brave/ # Brave-specific chain analysis
analysis_scripts/duckdb_analysis/src/
├── main.py # Main integrated analysis entry point
├── combine_website_locations.py # Combine location data across regions
├── create_sankey_from_combined.py # Generate Sankey flow diagrams
├── analyze_category_prevalence.py # Category analysis
├── core/ # Core data processing
│ ├── browser_data_processor.py
│ ├── regional_comparator.py
│ └── enhanced_compare_regions.py
├── analysis/ # Specialized analysis modules
│ ├── server_ip_analysis.py # Server IP destination analysis
│ ├── cdf_analyzer.py
│ ├── data_transfer_analyzer.py
│ └── statistical_analyzer.py
├── visualization/ # Plotting and charts
│ ├── plotting_utilities.py
│ └── region_visualizations.py
└── referer_analysis/ # Referer-referee graph analysis
├── referer_graph_analysis.py
├── referer_subgraph_analysis.py
├── batch_referer_analysis.py
└── batch_subgraph_analysis.py
To support reproducibility and future research, we publicly release the full dataset and code associated with this project. The dataset is available via Globus under the folder RegTrack-MADWeb26:
Globus Endpoint: RegTrack-MADWeb26
The dataset includes:
- Raw HAR files from all crawls (~1.5 TB compressed)
- Processed browser_data_*.json files per region, browser, and threshold
- Screenshots and metadata from each visit
- Aggregated statistics and measurement results
The dataset is licensed under CC BY 4.0. All code in this repository is licensed under the MIT License.
If you use this tool, dataset, or analysis in your research, please cite the following paper:
@inproceedings{prasad2026regtrack,
title = {{RegTrack}: Uncovering Global Disparities in Third-party Advertising and Tracking},
author = {Prasad, Tanya and Vora, Rut and Lim, Soo Yee and Hoang, Nguyen Phong and Pasquier, Thomas},
booktitle = {Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb)},
year = {2026},
address = {San Diego, CA, USA},
publisher = {Internet Society},
doi = {10.14722/madweb.2026.23010},
isbn = {978-1-970672-06-0}
}