A federated learning framework built on Ray and Hydra. OmniFed scales from local experiments to HPC clusters and cross-institutional deployments, with 10+ built-in algorithms and an extensible architecture.
- 🧩 Modular: Mix and match 10+ single-file algorithm implementations, topologies, and communication protocols
- 📊 Flexible: Local, HPC, and cross-network deployments with multiple communication backends
- ⚙️ Extensible: Custom algorithms, communicators, and topologies with minimal code requirements
- 🔬 Research-Friendly: Easy experimentation with lifecycle hooks and PyTorch compatibility
- 🚀 Scalable: Ray-powered distributed coordination from laptops to HPC clusters
# Clone and install
git clone <repository-url>
cd OmniFed
pip install -r requirements.txt
# Run basic federated learning experiment
./main.sh --config-name test_fedavg_centralized_torchdist

This configures a federated learning experiment with multiple nodes using the CIFAR-10 dataset. You'll see:
- Setup phase: Ray cluster initialization, node actor creation, and model broadcasting
- Training progress: Loss, accuracy, and other metrics logged per batch/epoch
- Communication logs: Model aggregation and synchronization between nodes
- Results: Final metrics saved to timestamped output directory
Different deployment types:
# Local/HPC clusters (PyTorch distributed communication)
./main.sh --config-name test_fedavg_centralized_torchdist
# Cross-network deployment (gRPC communication)
./main.sh --config-name test_fedavg_centralized_grpc
# Multi-tier hierarchical setup
./main.sh --config-name test_fedavg_hierarchical

Customize parameters:
# Override any parameter
./main.sh --config-name test_fedavg_centralized_torchdist topology.num_clients=10 global_rounds=10 algorithm.max_epochs_per_round=8

Explore available configurations:
# See all available experiment configs
python main.py --help

Inspect configurations:
# Preview full configuration before running
python main.py --config-name test_fedavg_centralized_torchdist --cfg job
# Show resolved configuration (with interpolations)
python main.py --config-name test_fedavg_centralized_torchdist --cfg job --resolve
# Focus on specific config sections
python main.py --config-name test_fedavg_centralized_torchdist --cfg job --package algorithm

Troubleshooting:
# Get detailed Hydra system information
python main.py --info

See Hydra's command-line flags for more configuration options.
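For programmatic inspection, Hydra's standard compose API can also build and print a resolved config from Python. This is a generic Hydra feature rather than anything OmniFed-specific; the config path and config name below assume the repository layout shown later in this README.

# Minimal sketch using Hydra's compose API (generic Hydra usage, not an OmniFed API).
# Assumes the `conf/` directory from the project layout sits next to this script.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(version_base=None, config_path="conf"):
    cfg = compose(
        config_name="test_fedavg_centralized_torchdist",
        overrides=["topology.num_clients=10", "global_rounds=10"],
    )

# Roughly equivalent to `--cfg job --resolve` on the command line.
print(OmegaConf.to_yaml(cfg, resolve=True))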
OmniFed orchestrates federated learning experiments through a modular architecture:
Core Components:
- Algorithm: Defines the federated learning strategy (e.g., FedAvg for simple averaging, SCAFFOLD for drift correction, MOON for contrastive learning); see the averaging sketch after this list
- Topology: Specifies the network structure and client-server relationships (centralized for star topology, hierarchical for multi-tier deployments)
- Communicator: Handles message passing between nodes (PyTorch distributed for HPC clusters, gRPC for cross-network scenarios)
- DataModule: Manages data loading, partitioning, and distribution across clients
- Model: Any PyTorch nn.Module can be plugged in directly
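As a concrete reference point for the Algorithm component, the sketch below shows what FedAvg's "simple averaging" amounts to when applied to plain PyTorch state dicts. It illustrates the math only; `fedavg_aggregate`, `client_states`, and `client_sizes` are hypothetical names, not OmniFed's Algorithm API.

# Illustrative only: weighted FedAvg aggregation over client state_dicts.
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Average client parameters, weighted by local dataset size."""
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        weighted = [
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        ]
        global_state[key] = torch.stack(weighted).sum(dim=0)
    return global_state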
Execution Flow:
- Initialization: Ray spawns distributed actors based on topology configuration
- Local Training: Each client trains on private data for specified epochs/batches
- Model Exchange: Clients send updates to aggregators via the communicator
- Aggregation: Server combines updates using the algorithm's aggregation strategy
- Model Distribution: Updated global model is broadcast back to clients
- Evaluation: Periodic validation on local and/or global test sets
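In pseudocode, one round of this flow looks roughly like the sketch below; names such as `train_local`, `aggregate`, and `set_model` are placeholders for illustration, not OmniFed's actual Node or Engine interfaces.

# Hypothetical outline of a single federated round (not the framework's real API).
def run_round(server, clients, round_idx, eval_every=1):
    updates = []
    for client in clients:                     # Local Training
        updates.append(client.train_local())   # update travels via the communicator
    new_global = server.aggregate(updates)     # Aggregation (algorithm-specific)
    for client in clients:                     # Model Distribution
        client.set_model(new_global)
    if round_idx % eval_every == 0:            # Periodic Evaluation
        server.evaluate(new_global)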
Additional Capabilities:
- Flexible Scheduling: Control when aggregation and evaluation occur
- Metric Tracking: Built-in logging system for training loss and custom metrics
- Stateful Algorithms: Support for momentum, control variates, and personalized models
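"Control variates" here refers to SCAFFOLD-style drift correction, where each client carries a persistent correction term across rounds. The snippet below is a minimal sketch of the corrected local update using the standard SCAFFOLD formulation; it is not OmniFed code, and the tensor dictionaries `c_local`/`c_global` are assumed inputs.

# Sketch of a SCAFFOLD-style corrected SGD step; c_local/c_global map parameter
# names to the client and server control variates kept as state across rounds.
import torch

@torch.no_grad()
def corrected_sgd_step(model, c_local, c_global, lr):
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # Drift-corrected gradient: g_i - c_i + c
        param -= lr * (param.grad - c_local[name] + c_global[name])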
OmniFed/
├── src/flora/ # Main framework code
│ ├── algorithm/ # Federated learning algorithms
│ │ ├── base.py # Base algorithm class
│ │ ├── fedavg.py # FedAvg
│ │ └── ... # 10 more algorithms (SCAFFOLD, MOON, FedProx, etc.)
│ ├── communicator/ # Communication protocols
│ │ ├── base.py # Base communicator class
│ │ ├── torchdist.py # PyTorch distributed backend
│ │ ├── grpc.py # gRPC backend
│ │ └── ... # gRPC server/client implementations
│ ├── topology/ # Network structures
│ │ ├── base.py # Base topology class
│ │ ├── centralized.py # Centralized topology
│ │ ├── hierarchical.py # Hierarchical topology
│ │ └── ...
│ ├── model/ # Built-in model examples and reusable components
│ │ └── ...
│ ├── data/ # Data loading and partitioning
│ │ ├── datamodule.py # DataModule class
│ │ └── ...
│ ├── utils/ # Utilities and helpers
│ │ ├── metric_logger.py # Metrics tracking and logging
│ │ ├── results_display.py # Results visualization
│ │ └── ...
│ ├── engine.py # Ray orchestration and coordination
│ └── node.py # Federated learning participant actors
├── conf/ # Hydra configuration files
│ ├── algorithm/ # Algorithm-specific configs
│ ├── datamodule/ # Dataset configurations
│ ├── model/ # Model architecture configs
│ ├── topology/ # Network topology configs
│ └── test_*.yaml # Example experiment configurations
├── main.py # Python entry point
├── main.sh # Development script with setup handling
└── requirements.txt # Dependencies
- Fork the repository
- Create feature branch:
git checkout -b feature-name

- Test your changes
- Submit pull request
@inproceedings{omnifed2025,
title={OmniFed: A Modular Federated Learning Framework},
author={Authors},
year={2025}
}