Agora is a collaborative event powered by Protocol Learning, our decentralized approach to AI development.
Anyone with a 24 GB consumer GPU (e.g. an NVIDIA RTX 4090) can participate. Each participant hosts one pipeline stage of the model; participants can join and leave at any time, and adding more peers per stage increases data-parallel throughput within that stage.
By pooling compute from many sources, Agora makes it feasible to train larger models than any single participant's hardware could handle alone.
Full documentation: pluralis.ai/docs.
Each participant's progress is logged; check your contribution on the dashboard.
- Requirements
- Getting Started
- Startup & Training
- Important
- Troubleshooting
- Citations
- License
- Acknowledgements
PC or server with an NVIDIA GPU:
- Minimum 24GB GPU memory (RTX 4090 or equivalent)
- Minimum 80GB RAM per GPU (e.g. a 2-GPU machine needs 160GB total)
- Minimum 80GB disk space
- Linux
- Windows + WSL2 (enable CUDA support in WSL)
- Stable internet connection with a minimum of 200 Mbps bandwidth. Note that some cloud providers advertise shared bandwidth split across multiple tenants — the effective throughput may be significantly lower than advertised.
- The port you are exposing (default is 49200) must be accessible to external connections.
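The hardware figures above can be turned into a quick pre-flight check. The following is a minimal sketch and not part of the Agora codebase: the function names required_ram_gb and has_enough_disk are hypothetical, the thresholds are copied from the list above, and it assumes the 80GB disk figure refers to free space measured in decimal gigabytes.

```python
import shutil

# Thresholds copied from the requirements list above.
RAM_GB_PER_GPU = 80
MIN_DISK_GB = 80


def required_ram_gb(num_gpus: int) -> int:
    """Total system RAM needed when running one instance per GPU."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return RAM_GB_PER_GPU * num_gpus


def has_enough_disk(path: str = ".", min_gb: int = MIN_DISK_GB) -> bool:
    """Return True if the filesystem holding `path` has at least min_gb free.

    Assumption: the requirement is about free space, in decimal GB.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= min_gb


if __name__ == "__main__":
    print("RAM needed for 2 GPUs:", required_ram_gb(2), "GB")
    print("Enough free disk here:", has_enough_disk("."))
```

Running it from the directory where you plan to clone the repo checks the disk that will actually hold the model weights.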
See the cloud services guide for how to set up cloud instances that meet these requirements.
Create a HuggingFace access token (see the instructions). The token doesn't need any read/write permissions, as it is only used for authorization.
Before joining, check the live wait time on the Dashboard. If joining is paused, check with the team for updates.
You will need a HuggingFace token — create one here (no special permissions needed). Make sure port 49200 is open for inbound connections.
Then run:
git clone https://github.com/PluralisResearch/agora
cd agora
python3 agora_cli.py
The CLI will guide you through setup and start training. Your settings are saved automatically; next time, just run python3 agora_cli.py again.
Using an AI coding agent like Claude Code? Run it from inside the cloned repo and invoke /agora-join — the bundled skill (.claude/skills/agora-join/SKILL.md) walks you through the same setup conversationally, launches the server, and watches the startup logs. Other agents (Cursor, Codex CLI) can read the same SKILL.md directly.
We recommend running with Docker. If Docker is not available, Python 3.11 (exact version) is required, and we recommend running inside tmux or screen so the server keeps running when you disconnect.
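For native runs, the exact-version requirement can be checked before anything is installed. This is a small sketch under the assumption that "exact version" means any 3.11.x patch release; is_supported_python is a hypothetical helper, not something the CLI exposes.

```python
import sys

REQUIRED = (3, 11)  # major, minor; assumption: any 3.11.x patch release is fine


def is_supported_python(version_info=None) -> bool:
    """True only for Python 3.11.x; other minor versions are rejected."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == REQUIRED


if __name__ == "__main__":
    if not is_supported_python():
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        print(f"Python {found} found; 3.11 is required for native runs.")
```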
To stop:
- Native: press Ctrl + C
- Docker: docker stop <container> && docker rm <container>
You can rejoin anytime — your contribution is saved.
To use more than one GPU on the same machine, run one instance per GPU:
python3 agora_cli.py --gpu_id 0
python3 agora_cli.py --gpu_id 1
When running natively, use a separate terminal for each GPU. When using Docker, each instance runs in its own container automatically.
Each GPU automatically gets its own port (49200, 49201, ...). Make sure each port is open.
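To see which ports need to be opened on a multi-GPU machine, the convention above (49200 plus the GPU index) can be written out as follows. This is an illustrative sketch; ports_for_gpus is a hypothetical name.

```python
BASE_PORT = 49200  # default port for GPU 0, per the text above


def ports_for_gpus(gpu_ids):
    """Map each GPU id to its default listening port (BASE_PORT + gpu_id)."""
    return {gpu_id: BASE_PORT + gpu_id for gpu_id in gpu_ids}


if __name__ == "__main__":
    for gpu_id, port in ports_for_gpus(range(2)).items():
        print(f"GPU {gpu_id}: open inbound port {port}")
```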
The CLI prompts for everything interactively. You can also pass values directly:
| Flag | Description |
|---|---|
| --gpu_id <ID> | GPU to use (default: 0) |
| --token <TOKEN> | HuggingFace token |
| --email <EMAIL> | Email address (optional) |
| --host_port <PORT> | Listening port (default: 49200 + gpu_id) |
| --announce_port <PORT> | External port, if different from host_port (e.g. on RunPod) |
| --use_docker | Run inside Docker |
| --log_file <PATH> | Log file path (default: logs/server_gpu<ID>.log); leave as default unless you have a specific reason |
| --identity_path <PATH> | Identity key path (default: private_gpu<ID>.key); leave as default unless you have a specific reason |
| --batch_size_override <N> | Override the batch size defined in the config. Only set this if you hit CUDA out-of-memory errors with the default. |
| --skip_input | Non-interactive mode (all values must come from args or saved config) |
| --reconfigure | Re-prompt all settings from scratch |
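When launching from a script, the flags in the table can be assembled programmatically. The sketch below uses only flags listed above; build_cli_command is a hypothetical helper, and it assumes --use_docker and --skip_input are plain switches that take no value, as the table suggests. The token shown is a placeholder.

```python
import shlex


def build_cli_command(gpu_id=0, token=None, email=None,
                      use_docker=False, skip_input=False):
    """Assemble an agora_cli.py command line from flags in the table above."""
    cmd = ["python3", "agora_cli.py", "--gpu_id", str(gpu_id)]
    if token is not None:
        cmd += ["--token", token]
    if email is not None:
        cmd += ["--email", email]
    if use_docker:
        cmd.append("--use_docker")  # assumed to be a switch with no value
    if skip_input:
        cmd.append("--skip_input")  # assumed to be a switch with no value
    return cmd


if __name__ == "__main__":
    # "hf_xxx" is a placeholder, not a real token.
    cmd = build_cli_command(gpu_id=1, token="hf_xxx",
                            use_docker=True, skip_input=True)
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to subprocess.run without shell quoting issues.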
This section is optional.
agora_cli.py is the recommended way to install and run; it handles everything automatically. The instructions below are for advanced users.
Step 1: Install dependencies
Docker:
docker build . -t pluralis_agora --label image_version=1
Native (no Docker):
Requirements: Python 3.11, pip >= 25.3, NVIDIA GPU drivers with CUDA 12.8 support, conda.
# Create and activate a Python 3.11 environment
conda create -y -n agora python=3.11
conda activate agora
# Upgrade pip
pip install --upgrade "pip>=25.3"
# Install PyTorch with CUDA 12.8 support
pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
# Install the two required packages from source
pip install --constraint constraints.txt --build-constraint constraints.txt -e ./agora_server
pip install --constraint constraints.txt --build-constraint constraints.txt -e ./agora
Step 2: Read runtime parameters
run.json at the repo root contains values required by run_server.py:
| run.json field | run_server.py argument |
|---|---|
| run_config | --config |
| auth_server | --auth_server |
| prom_gateway | --prom_gateway |
| seeds (array) | --initial_peers |
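The mapping in the table can also be expressed in Python, which may be convenient if jq is not installed. This is a sketch under assumptions: run_json_to_args and load_run_args are hypothetical helpers, and it assumes --initial_peers accepts several space-separated values, which matches how the shell commands in this section pass the seeds.

```python
import json

# run.json field -> run_server.py argument, as listed in the table above.
FIELD_TO_FLAG = {
    "run_config": "--config",
    "auth_server": "--auth_server",
    "prom_gateway": "--prom_gateway",
    "seeds": "--initial_peers",
}


def run_json_to_args(run_cfg: dict) -> list:
    """Turn the parsed run.json dict into a flat argument list."""
    args = []
    for field, flag in FIELD_TO_FLAG.items():
        value = run_cfg[field]
        args.append(flag)
        if isinstance(value, list):
            args.extend(str(v) for v in value)  # seeds is an array
        else:
            args.append(str(value))
    return args


def load_run_args(path: str = "run.json") -> list:
    """Read run.json from disk and convert it with run_json_to_args."""
    with open(path, encoding="utf-8") as f:
        return run_json_to_args(json.load(f))
```

A missing field raises KeyError, which is deliberate here: all four values are required by run_server.py.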
RUN_CONFIG=$(jq -r '.run_config' run.json)
AUTH_SERVER=$(jq -r '.auth_server' run.json)
PROM_GATEWAY=$(jq -r '.prom_gateway' run.json)
# Join the seeds array into one space-separated line so it also works inside the quoted Docker command
SEEDS=$(jq -r '.seeds | join(" ")' run.json)
Step 3: Get your public IP
PUBLIC_IP=$(curl -s https://api.ipify.org)
Step 4: Launch
Default port convention: 49200 + gpu_id. Each GPU on the same machine needs its own port.
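The two multiaddress arguments used in the launch commands follow directly from the public IP and the port convention. The sketch below builds both strings; build_maddrs is a hypothetical helper, and using a separate announce port in the announced address is an assumption based on the description of --announce_port in the CLI flags.

```python
import ipaddress


def build_maddrs(public_ip: str, gpu_id: int, announce_port=None):
    """Return (host_maddr, announce_maddr) for one GPU."""
    # Raises ValueError if public_ip is not a well-formed IPv4 address.
    ipaddress.IPv4Address(public_ip)
    host_port = 49200 + gpu_id  # default port convention
    if announce_port is None:
        announce_port = host_port  # same port unless the provider remaps it
    host = f"/ip4/0.0.0.0/tcp/{host_port}"
    announce = f"/ip4/{public_ip}/tcp/{announce_port}"
    return host, announce


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only example address.
    print(build_maddrs("203.0.113.7", gpu_id=1))
```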
Native:
CUDA_VISIBLE_DEVICES=<gpu_id> python3.11 agora/src/agora/run_server.py \
--gpu_id <gpu_id> \
--config "$RUN_CONFIG" \
--token <hf_token> \
--auth_server "$AUTH_SERVER" \
--prom_gateway "$PROM_GATEWAY" \
--host_maddrs /ip4/0.0.0.0/tcp/<port> \
--announce_maddrs /ip4/$PUBLIC_IP/tcp/<port> \
--initial_peers $SEEDS \
--email <email> \
--log_file logs/server_gpu<gpu_id>.log \
--identity_path private_gpu<gpu_id>.key
Docker:
docker run -d --name agora_gpu<gpu_id> --ipc=host --network=host \
--gpus device=<gpu_id> \
-v $(pwd):/home -w /home \
pluralis_agora \
bash -c "CUDA_VISIBLE_DEVICES=0 python3.11 agora/src/agora/run_server.py \
--gpu_id <gpu_id> \
--config $RUN_CONFIG \
--token <hf_token> \
--auth_server $AUTH_SERVER \
--prom_gateway $PROM_GATEWAY \
--host_maddrs /ip4/0.0.0.0/tcp/<port> \
--announce_maddrs /ip4/$PUBLIC_IP/tcp/<port> \
--initial_peers $SEEDS \
--email <email> \
--log_file logs/server_gpu<gpu_id>.log \
--identity_path private_gpu<gpu_id>.key"
When your node starts, it first runs network checks, downloads the model weights, and waits for authorization before training begins:
[NETWORK] Running internet speed test...
[DOWNLOAD] Downloading model weights...
[DOWNLOAD] Model weights downloaded. Waiting for authorization...
[AUTH] Authorization queue: position 2, estimated wait: 1m
[AUTH] Access granted for your_user
Once authorized, training begins and you will see regular status updates:
[SERVER] Training started
[TRAINING] Training step 1
[PROGRESS] Processed 51 batches in the last 60s
[PROGRESS] Forward pass: 28 batches
[PROGRESS] Backward pass: 23 batches
If you see Processed [N] batches updates regularly, your node is working correctly.
When joining an active run, your node may first enter a warmup period to catch up with other nodes. During this time you will see:
[WARMUP] Node warmup started, catching up with other nodes
This is normal. Once the node has caught up, it will begin processing batches and you will see:
[WARMUP] Node warmup complete
Check logs/server_gpu<ID>.log for detailed logs.
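The "Processed N batches" rule of thumb can be automated by scanning the log file. This is a rough sketch: the regular expression is based only on the sample lines shown above, so real log lines with a different layout would need an adjusted pattern, and batch_counts and looks_healthy are hypothetical helper names.

```python
import re

# Pattern derived from the sample output above; an assumption about the real format.
PROGRESS_RE = re.compile(r"\[PROGRESS\] Processed (\d+) batches in the last (\d+)s")


def batch_counts(log_text: str) -> list:
    """Extract the batch count from every '[PROGRESS] Processed N batches' line."""
    return [int(m.group(1)) for m in PROGRESS_RE.finditer(log_text)]


def looks_healthy(log_text: str) -> bool:
    """Heuristic: the most recent progress line reports at least one batch."""
    counts = batch_counts(log_text)
    return bool(counts) and counts[-1] > 0
```

For example, looks_healthy(open("logs/server_gpu0.log").read()) gives a quick yes/no for GPU 0. A node still in warmup has no progress lines yet and is reported as not healthy, so allow for the warmup period before acting on the result.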
The code generates a private key file during initial setup (private_gpu<ID>.key by default when started through the CLI). This file:
- Contains your node's cryptographic identity
- Is required for secure communication within the network
- Should be kept confidential and never shared publicly
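One common way to keep a key file confidential on Linux is to make it readable by its owner only. The sketch below is a suggestion rather than something the project requires; restrict_key_permissions is a hypothetical helper, and the call fails with PermissionError if your user does not own the file.

```python
import os
import stat


def restrict_key_permissions(path: str) -> int:
    """Make the key file readable and writable by its owner only (mode 0600).

    Returns the resulting permission bits so the caller can verify them.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Calling restrict_key_permissions("private_gpu0.key") should return 0o600 on success.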
Files created inside the Docker container are owned by the container's user, not your host user. To modify or delete them from outside the container, you need to reclaim ownership.
Linux:
sudo chown -R <linux_user> <path/to/project>
- Wrong pytorch version. Please install torch 2.7
  A PyTorch version other than 2.7.x is installed. If running via python3 agora_cli.py, the CLI installs the correct version automatically. For manual installs, run: pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128
- CUDA is not available. Exiting run.
  No GPU was detected. Verify that NVIDIA drivers are installed (nvidia-smi), the CUDA toolkit is accessible, and (if using Docker) that the container was launched with --gpus.
- GPU is not eligible (GPU VRAM: X GB).
  The GPU does not meet the minimum VRAM requirement. A GPU with at least 24 GB VRAM is required.
- Verification failed.
  The verification process detected local modifications or a stale installation. Revert any local changes. Make sure you are using the latest version of the code.
- Port test failed. Make sure your port forwarding is correct
  The authorization server could not reach your node on the announced port. Verify that port 49200 (or your configured port) is open for inbound connections. Check firewall rules and cloud security groups.
- Authorization timeout after <N>s
  The node waited too long for authorization. This can happen during high demand. Restart the node to re-enter the authorization queue.
- Invalid HuggingFace token
  The token provided via --token is invalid or expired. Create a new token (no special permissions required) and retry.
- This peer_id is already used by another user
  You are joining with a different HuggingFace account than the one that originally registered this node's private.key. Either use the original account, or delete private.key and restart to generate a new identity.
- This node is already in the authorization queue. Please remove private key from the node and try again.
  A node with this private.key is already queued. Delete private.key and restart.
- Download is too slow, please ensure sufficient bandwidth and try again.
  Your node took too long to download model weights, so it was dropped from the queue. Verify that your connection meets the 200 Mbps minimum and restart.
- An unknown connection error occurred. Exiting run.
  A fatal connection failure was detected. Verify your internet connection, check that no firewall is blocking outbound traffic, and restart the node.
- Too many failed all-reduce attempts. Exiting run.
  Multiple consecutive parameter averaging rounds timed out, indicating the node's internet connection is too slow or unstable. Check your connection speed and restart.
- Failed to connect to CUDA. Exiting run.
  A CUDA error was detected during training. Verify GPU driver health with nvidia-smi. If the GPU is shared with other workloads, free resources and restart.
- CUDA out of memory
  The batch size selected for your GPU's VRAM is too high (this can happen if other processes are using the same GPU, or on edge-case GPU models). Restart with a smaller batch size using --batch_size_override <N> (e.g. python3 agora_cli.py --batch_size_override 1). Lower the value until the error stops.
- RAM usage exceeded threshold. Exiting run.
  System RAM usage exceeded the allowed threshold. Close other processes consuming memory and restart. If running inside Docker, increase the container's memory limit.
- Server exited with -9
  The server process was killed by the OS, usually because the instance ran out of system resources. Common causes include exceeding available RAM or hitting a per-process limit on open threads/file descriptors. Check system resource usage (free -h, ulimit -u, ulimit -n), ensure the machine meets the RAM requirements (80GB per GPU), and raise thread/file-descriptor limits if your instance enforces low defaults.
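The limits reported by ulimit -n and ulimit -u can also be read from Python with the standard resource module, which is available on Linux (including WSL2). This is a small diagnostic sketch; current_limits and describe are hypothetical helper names, and the sketch does not say what values Agora actually needs.

```python
import resource


def current_limits() -> dict:
    """Return the soft limits that `ulimit -n` and `ulimit -u` report."""
    nofile_soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    nproc_soft, _ = resource.getrlimit(resource.RLIMIT_NPROC)
    return {"open_files": nofile_soft, "processes": nproc_soft}


def describe(limit: int) -> str:
    """Render a limit the way ulimit does, using 'unlimited' for no cap."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    for name, value in current_limits().items():
        print(f"{name}: {describe(value)}")
```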
If you use this project in your research, please cite:
Gil Avraham, Violetta Shevchenko, Karol Pajak, James Snewin, Harry Xi, Hadi Mohaghegh Dolatabadi, Thalaiyasingam Ajanthan, Sameera Ramasinghe, Chamin Hewa Koneputugodage, Shamane Siriwardhana, Alexander Long. Pluralis 8b: Asynchronous Large-Scale Distributed Training over the Internet. 2026.
@misc{avraham2026agora,
title={Pluralis 8b: Asynchronous Large-Scale Distributed Training over the Internet},
author={Gil Avraham and Violetta Shevchenko and Karol Pajak and James Snewin and Harry Xi and Hadi Mohaghegh Dolatabadi and Thalaiyasingam Ajanthan and Sameera Ramasinghe and Chamin Hewa Koneputugodage and Shamane Siriwardhana and Alexander Long},
year={2026},
url={https://github.com/PluralisResearch/agora},
}
Sameera Ramasinghe, Thalaiyasingam Ajanthan, Gil Avraham, Yan Zuo, and Alexander Long. Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism. 2025.
@misc{ramasinghe2025protocolmodelsscalingdecentralized,
title={Protocol Models: Scaling Decentralized Training with Communication-Efficient Model Parallelism},
author={Sameera Ramasinghe and Thalaiyasingam Ajanthan and Gil Avraham and Yan Zuo and Alexander Long},
year={2025},
eprint={2506.01260},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.01260},
}
Pluralis Research. AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism. 2026. The SPARTA optimizer used in the live run.
@misc{pluralis2026asyncmesh,
title={AsyncMesh: Fully Asynchronous Optimization for Data and Pipeline Parallelism},
author={Pluralis Research},
year={2026},
eprint={2601.22442},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.22442},
}
Thalaiyasingam Ajanthan, Sameera Ramasinghe, Yan Zuo, Gil Avraham, and Alexander Long. Nesterov Method for Asynchronous Pipeline Parallel Optimization. ICML. 2025.
@article{ajanthan2025asyncpp,
title={Nesterov Method for Asynchronous Pipeline Parallel Optimization},
author={Ajanthan, Thalaiyasingam and Ramasinghe, Sameera and Zuo, Yan and Avraham, Gil and Long, Alexander},
journal={ICML},
year={2025}
}
Distributed under the Apache-2.0 License. See LICENSE for more information.
Third-party dependencies and their licenses are listed in THIRD_PARTY_LICENSES.md.
This project is built upon the Hivemind library for decentralized deep learning, distributed under the MIT License.
This project uses the FineWeb-Edu dataset by HuggingFace, made available under the Open Data Commons Attribution License (ODC-BY) v1.0.