Code for the WWW'23 paper "To Store or Not? Online Data Selection for Federated Learning with Limited Storage". Note that the network traffic classification dataset is owned by Huawei, so we only provide the code for the three public datasets.
Code/Baselines: Implementation of 4 categories of baselines, including (1) Random Sampling (FIFO and RS), (2) Importance Sampling (HighLoss, Gradient Norm), (3) Data Selection for FL (FedBalancer, Li) and (4) FullData Setting (FullData)
- main.py: main function of federated learning process
- model.py: base models for clients and server
- client.py: definition of client class
- server.py: definition of server class
- baseline_constants.py: some arguments and constants
- utils: other functions
  - language_utils.py: functions for text processing
  - model_utils.py: functions of batch data creation, client setup, noisy data generation, data loading
  - torch_utils.py: transformation between numpy.array and torch.tensor
- fashionmnist: model of fashionmnist dataset
  - LeNet.py
  - CNN.py
- HARBOX: model of HARBOX dataset
  - log_reg.py
- synthetic: model of synthetic dataset
  - log_reg.py
Code/ODE: Implementation of our proposed ODE framework, including 4 versions: (1) individual client-side data selection with accurate gradient of global data (Dream), (2) individual client-side data selection with estimated gradient of global data (Estimation), (3) server-side coordinated data storage and client-side data selection with accurate global data gradient (Coor+Dream), (4) server-side coordinated data storage and client-side data selection with estimated global data gradient (Coor+Estimation)
- main.py: main function of federated learning
- model.py: base models for clients and server
- client.py: client class
- server.py: server class
- baseline_constants.py: some arguments and constants
- utils: other functions
  - model_utils.py
  - torch_utils.py
- fashionmnist: model of fashionmnist dataset
  - LeNet.py
  - CNN.py
- HARBOX: model of HARBOX dataset
  - log_reg.py
- synthetic: model of synthetic dataset
  - log_reg.py
Data: the directory for data storage. If you need the data, please refer to link or send email.
- synthetic
  - data
    - train: json file
    - test: json file
- fashionmist
  - data
    - train: json file
    - test: json file
- HARBOX
  - data
    - train: json file
    - test: json file
- torch >= 1.10.1
- tqdm
- PIL
- pandas 1.3.4
- numpy 1.20.3
- json
- matplotlib
- pickle
- args
- gc
- Baselines:
- Commands:

      cd Baselines
      python main.py -dataset <dataset_name> -model <model_name> --num-rounds 1000 --eval-every 20 --clients-per-round <clients_per_round> --seed 0 --num-epochs 5 -lr <learning_rate> --choosing-method <data_selection_method> --buffer-size <buffer_size>
- Default parameters:

| dataset_name | model_name | clients_per_round | learning_rate | data_selection_method | buffer_size |
|---|---|---|---|---|---|
| synthetic | log_reg | 10 (10/200=5%) | 0.0001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |
| fashionmnist | LeNet | 5 (5/50=10%) | 0.001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |
| HARBOX | log_reg | 12 (12/120=10%) | 0.001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |
- Examples:

      python main.py -dataset synthetic -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 10 --seed 0 --num-epochs 5 -lr 0.0001 --choosing-method FIFO --buffer-size 10
      python main.py -dataset fashionmnist -model LeNet --num-rounds 1000 --eval-every 20 --clients-per-round 5 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method FIFO --buffer-size 10
      python main.py -dataset HARBOX -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 12 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method FIFO --buffer-size 10
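When sweeping over datasets, selection methods, or seeds, it can help to generate these command lines instead of typing them. The following is a minimal sketch, not part of the repository: `DEFAULTS` and `build_command` are hypothetical names, the values are copied from the default-parameter list above, and only flags that appear in the commands above are used.

```python
import shlex

# Hypothetical helper table: (model, clients per round, learning rate),
# copied from the default parameters listed in this README.
DEFAULTS = {
    "synthetic": ("log_reg", 10, "0.0001"),
    "fashionmnist": ("LeNet", 5, "0.001"),
    "HARBOX": ("log_reg", 12, "0.001"),
}


def build_command(dataset, method, seed=0, buffer_size=10):
    """Return the main.py command line for one dataset/method pair."""
    model, clients, lr = DEFAULTS[dataset]
    args = [
        "python", "main.py",
        "-dataset", dataset,
        "-model", model,
        "--num-rounds", "1000",
        "--eval-every", "20",
        "--clients-per-round", str(clients),
        "--seed", str(seed),
        "--num-epochs", "5",
        "-lr", lr,
        "--choosing-method", method,
        "--buffer-size", str(buffer_size),
    ]
    # shlex.join (Python 3.8+) quotes arguments only where the shell needs it.
    return shlex.join(args)


if __name__ == "__main__":
    for name in DEFAULTS:
        print(build_command(name, "FIFO"))
```

Passing a different `seed` or `method` yields the corresponding variant of the command; the strings can be pasted into a shell or handed to `subprocess.run(..., shell=True)`.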
- ODE: The commands are the same as for the baselines, except for the data selection method.
data_selection_methods:
- Dream: individual client-side data selection with accurate global data gradient
- Estimation1: individual client-side data selection with global data gradient estimated by aggregating the local gradient estimators of only participating clients
- Estimation2: individual client-side data selection with global data gradient estimated by updating the global gradient estimator using the local gradient estimators of participating clients
- Coor+Dream: cross-client coordinated data storage + Dream
- Coor+Estimation1: cross-client coordinated data storage + Estimation1
- Coor+Estimation2: cross-client coordinated data storage + Estimation2
- Coor+Dream and Coor+Estimation2 correspond to ODE-exact and ODE-Est in the paper. Because Coor+Estimation1 and Coor+Estimation2 perform differently across datasets (e.g., Coor+Estimation2 performs better on FashionMNIST), we provide both so that users can test them on their own datasets.
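To make the difference between the two estimation variants concrete, here is a toy sketch of one plausible reading of the descriptions above. It is an illustration under assumptions, not the repository's implementation: the function and class names are hypothetical, gradients are plain lists of floats, and all clients are weighted equally, which the actual code may not do.

```python
def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def estimate_from_participants(local_estimators):
    """Estimation1-style (assumed): aggregate only this round's participants.

    local_estimators maps client id -> local gradient estimator.
    """
    return average(list(local_estimators.values()))


class RunningGlobalEstimator:
    """Estimation2-style (assumed): keep a global estimator across rounds.

    The server remembers the latest local estimator of every client seen so
    far and refreshes only the entries of the current participants.
    """

    def __init__(self):
        self.latest = {}

    def update(self, local_estimators):
        self.latest.update(local_estimators)
        return average(list(self.latest.values()))


if __name__ == "__main__":
    round1 = {"a": [1.0, 0.0], "b": [3.0, 2.0]}
    round2 = {"c": [5.0, 4.0]}
    running = RunningGlobalEstimator()
    running.update(round1)
    print(estimate_from_participants(round2))  # uses client c only
    print(running.update(round2))              # uses clients a, b and c
```

In this reading, the first variant forgets clients that are not sampled in the current round, while the second keeps their last reported estimator, which may be stale.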
- Commands:

      python main.py -dataset synthetic -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 10 --seed 0 --num-epochs 5 -lr 0.0001 --choosing-method Coor+Dream --buffer-size 10
      python main.py -dataset fashionmnist -model LeNet --num-rounds 1000 --eval-every 20 --clients-per-round 5 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method Coor+Dream --buffer-size 10
      python main.py -dataset HARBOX -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 12 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method Coor+Dream --buffer-size 10
- As mentioned in Section 4.1 of our WWW'23 paper, we measure the training speedup as the ratio between the number of training rounds that FIFO (or RS) needs and the number that another method needs to reach the same target accuracy (i.e., the final accuracy of FIFO or RS). We count rounds for the following reason: although ODE only adds a data evaluation process for non-participating clients and has little impact on the local training of participating clients, our simulation runs all clients' actions serially rather than in parallel because of limited server memory. In practice, the data evaluation time of non-participating clients is expected to overlap with the local training time of participating clients, so the number of training rounds is the more reasonable measure.
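The speedup definition above can be sketched as follows. This is a minimal illustration, assuming each run is available as a list of `(round, accuracy)` pairs recorded at every evaluation step; the function names and the numbers in the example are made up for illustration and are not results from the paper.

```python
def rounds_to_reach(curve, target):
    """First evaluated round whose accuracy is at least `target`, else None."""
    for rnd, acc in curve:
        if acc >= target:
            return rnd
    return None


def training_speedup(baseline_curve, method_curve):
    """Rounds needed by the baseline divided by rounds needed by the method.

    The target is the baseline's final accuracy, as described above.
    Returns None if the method never reaches the target.
    """
    target = baseline_curve[-1][1]
    baseline_rounds = rounds_to_reach(baseline_curve, target)
    method_rounds = rounds_to_reach(method_curve, target)
    if method_rounds is None:
        return None
    return baseline_rounds / method_rounds


if __name__ == "__main__":
    # Illustrative curves only, evaluated every 20 rounds.
    fifo = [(20, 0.50), (40, 0.60), (60, 0.65), (80, 0.70)]
    other = [(20, 0.62), (40, 0.71), (60, 0.74), (80, 0.75)]
    print(training_speedup(fifo, other))
```

Because accuracy is only recorded every `--eval-every` rounds, the resolution of this measure is limited to that interval.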
- The values reported in our paper are averaged over about 5 runs with different random seeds.