Code for (WWW'23) To Store or Not? Online Data Selection for Federated Learning with Limited Storage. Note that the network traffic classification dataset is owned by Huawei, so we only provide code for the three public datasets.

1. Structure

Code/Baselines: Implementation of 4 categories of baselines, including (1) Random Sampling (FIFO and RS), (2) Importance Sampling (HighLoss, Gradient Norm), (3) Data Selection for FL (FedBalancer, Li) and (4) FullData Setting (FullData)
- main.py: main function of federated learning process
- model.py: base models for clients and server
- client.py: definition of client class
- server.py: definition of server class
- baseline_constants.py: some arguments and constants
- utils: other functions
  - language_utils.py: functions for text processing
  - model_utils.py: functions of batch data creation, client setup, noisy data generation, data loading
  - torch_utils.py: transformation between numpy.array and torch.tensor
- fashionmnist: model of fashionmnist dataset
  - LeNet.py
  - CNN.py
- HARBOX: model of HARBOX dataset
  - log_reg.py
- synthetic: model of synthetic dataset
  - log_reg.py
Code/ODE: Implementation of our proposed ODE framework, including 4 versions: (1) individual client-side data selection with accurate gradient of global data (Dream), (2) individual client-side data selection with estimated gradient of global data (Estimation), (3) server-side coordinated data storage and client-side data selection with accurate global data gradient (Coor+Dream), (4) server-side coordinated data storage and client-side data selection with estimated global data gradient (Coor+Estimation)
- main.py: main function of federated learning
- model.py: base models for clients and server
- client.py: client class
- server.py: server class
- baseline_constants.py: some arguments and constants
- utils: other functions
  - model_utils.py
  - torch_utils.py
- fashionmnist: model of fashionmnist dataset
  - LeNet.py
  - CNN.py
- HARBOX: model of HARBOX dataset
  - log_reg.py
- synthetic: model of synthetic dataset
  - log_reg.py
Data: the directory for data storage. If you need the data, please refer to the link or send an email.
- synthetic
  - data
    - train: json file
    - test: json file
- fashionmnist
  - data
    - train: json file
    - test: json file
- HARBOX
  - data
    - train: json file
    - test: json file
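A minimal sketch for inspecting the files in a split directory such as `Data/synthetic/data/train`. It assumes only that the directory holds one or more `.json` files; the internal schema of those files is not described here, so the helper just reports the top-level keys. The function names `load_split` and `summarize` are illustrative and are not part of the repository.

```python
import json
from pathlib import Path


def load_split(split_dir):
    """Load every .json file in split_dir and return {file name: parsed content}."""
    split_dir = Path(split_dir)
    loaded = {}
    for path in sorted(split_dir.glob("*.json")):
        with path.open("r", encoding="utf-8") as f:
            loaded[path.name] = json.load(f)
    return loaded


def summarize(loaded):
    """Return {file name: sorted top-level keys} for dict-shaped files,
    or the Python type name for anything else."""
    summary = {}
    for name, content in loaded.items():
        if isinstance(content, dict):
            summary[name] = sorted(content.keys())
        else:
            summary[name] = type(content).__name__
    return summary
```

For example, `summarize(load_split("Data/synthetic/data/train"))` shows which keys each training file exposes before you pass the data to the training scripts.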

2. Libraries

- torch >= 1.10.1
- tqdm
- PIL
- pandas 1.3.4
- numpy 1.20.3
- json
- matplotlib
- pickle
- args
- gc
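As a quick sanity check before running the experiments, the sketch below reports which of the listed modules cannot be imported in the current environment. It checks import names only, not versions, and it leaves out `args` because it is unclear which package that entry refers to. The helper name `missing_modules` is illustrative.

```python
import importlib.util

# Import names taken from the library list; "args" is intentionally omitted
# because it is unclear which package it refers to.
REQUIRED_MODULES = ["torch", "tqdm", "PIL", "pandas", "numpy",
                    "json", "matplotlib", "pickle", "gc"]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```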

3. Command

Baselines:

Commands:
- cd Baselines
- python main.py -dataset <dataset_name> -model <model_name> --num-rounds 1000 --eval-every 20 --clients-per-round <clients_per_round> --seed 0 --num-epochs 5 -lr <learning_rate> --choosing-method <data_selection_method> --buffer-size <buffer_size>

Default parameters:

| dataset_name | model_name | clients_per_round | learning_rate | data_selection_method | buffer_size |
| --- | --- | --- | --- | --- | --- |
| synthetic | log_reg | 10 (10/200=5%) | 0.0001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |
| fashionmnist | LeNet | 5 (5/50=10%) | 0.001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |
| HARBOX | log_reg | 12 (12/120=10%) | 0.001 | FIFO, HighLoss, GradientNorm, FedBalancer, Li, FullData | 10 |

Examples:
- python main.py -dataset synthetic -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 10 --seed 0 --num-epochs 5 -lr 0.0001 --choosing-method FIFO --buffer-size 10
- python main.py -dataset fashionmnist -model LeNet --num-rounds 1000 --eval-every 20 --clients-per-round 5 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method FIFO --buffer-size 10
- python main.py -dataset HARBOX -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 12 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method FIFO --buffer-size 10
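When running several methods or seeds, it can help to generate these command lines from the default parameters instead of typing them by hand. The sketch below does that; the `DEFAULTS` dictionary and the `build_command` helper are illustrative and not part of the repository, and the flag names are copied from the commands above.

```python
# Default settings copied from the default-parameter table:
# dataset -> (model_name, clients_per_round, learning_rate as written in the README)
DEFAULTS = {
    "synthetic": ("log_reg", 10, "0.0001"),
    "fashionmnist": ("LeNet", 5, "0.001"),
    "HARBOX": ("log_reg", 12, "0.001"),
}


def build_command(dataset, method, buffer_size=10, seed=0):
    """Return the argument list for one run of main.py with the default settings."""
    model, clients_per_round, lr = DEFAULTS[dataset]
    return [
        "python", "main.py",
        "-dataset", dataset,
        "-model", model,
        "--num-rounds", "1000",
        "--eval-every", "20",
        "--clients-per-round", str(clients_per_round),
        "--seed", str(seed),
        "--num-epochs", "5",
        "-lr", lr,
        "--choosing-method", method,
        "--buffer-size", str(buffer_size),
    ]


if __name__ == "__main__":
    # Print one FIFO command per dataset.
    for dataset in DEFAULTS:
        print(" ".join(build_command(dataset, "FIFO")))
```

The returned list can be passed to `subprocess.run(cmd, cwd="Baselines", check=True)` to launch a run from the Baselines directory, or joined with spaces and pasted into a shell.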

ODE: Same commands as for the baselines, except for the data selection method
- data_selection_methods:
  - Dream: individual client-side data selection with accurate global data gradient
  - Estimation1: individual client-side data selection with global data gradient estimated by aggregating the local gradient estimators of only participating clients
  - Estimation2: individual client-side data selection with global data gradient estimated by updating the global gradient estimator using the local gradient estimators of participating clients
  - Coor+Dream: cross-client coordinated data storage + Dream
  - Coor+Estimation1: cross-client coordinated data storage + Estimation1
  - Coor+Estimation2: cross-client coordinated data storage + Estimation2
  - Coor+Dream and Coor+Estimation2 correspond to ODE-exact and ODE-Est in the paper. Because Coor+Estimation1 and Coor+Estimation2 perform differently across datasets (e.g. Coor+Estimation2 performs better on FashionMNIST), we provide both so that users can test them on their own datasets.
- Commands
  - python main.py -dataset synthetic -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 10 --seed 0 --num-epochs 5 -lr 0.0001 --choosing-method Coor+Dream --buffer-size 10
  - python main.py -dataset fashionmnist -model LeNet --num-rounds 1000 --eval-every 20 --clients-per-round 5 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method Coor+Dream --buffer-size 10
  - python main.py -dataset HARBOX -model log_reg --num-rounds 1000 --eval-every 20 --clients-per-round 12 --seed 0 --num-epochs 5 -lr 0.001 --choosing-method Coor+Dream --buffer-size 10

4. Notice

As mentioned in Section 4.1 of our WWW'23 paper, we measure training speedup as the ratio between the number of training rounds that FIFO (or RS) needs and the number that another method needs to reach the same target accuracy (i.e., the final accuracy of FIFO or RS). We count training rounds for the following reason. Although ODE only adds a data evaluation step for non-participating clients and has little impact on the local training of participating clients, our simulation experiments run all clients' actions serially rather than in parallel because of limited server memory. When run in parallel, the data evaluation time of non-participating clients is expected to overlap with the local training time of participating clients, so the number of training rounds is the more reasonable measure.
The values in our paper are averaged over about 5 experiments with different random seeds.
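The speedup definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation script used for the paper: the function names are hypothetical, and accuracy is assumed to be recorded at each evaluated round (every 20 rounds with `--eval-every 20`, so the result has that granularity).

```python
def rounds_to_target(eval_rounds, accuracies, target):
    """Return the first evaluated round whose accuracy reaches target,
    or None if the target is never reached."""
    for rnd, acc in zip(eval_rounds, accuracies):
        if acc >= target:
            return rnd
    return None


def training_speedup(eval_rounds, baseline_acc, method_acc):
    """Rounds the baseline (e.g. FIFO) needs divided by rounds the method needs
    to reach the baseline's final accuracy. Returns None if the method never
    reaches that accuracy."""
    target = baseline_acc[-1]
    base_rounds = rounds_to_target(eval_rounds, baseline_acc, target)
    method_rounds = rounds_to_target(eval_rounds, method_acc, target)
    if method_rounds is None:
        return None
    return base_rounds / method_rounds


if __name__ == "__main__":
    # Made-up accuracy curves, for illustration only.
    eval_rounds = [20, 40, 60, 80, 100]
    fifo = [0.50, 0.60, 0.65, 0.68, 0.70]
    other = [0.55, 0.70, 0.74, 0.75, 0.76]
    print(training_speedup(eval_rounds, fifo, other))
```

With these made-up curves, FIFO reaches its final accuracy at round 100 and the other method reaches it at round 40, which gives a ratio of 2.5. To report a number comparable to the paper's, the same ratio would be computed per seed and then averaged.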
