Artifact Evaluation (AE) for Research Paper -- "Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management" (NSDI'26 Spring AE #14)
This artifact reproduces (1) all the quantitative findings and tables in the research paper and (2) the Oat tool's evaluation results.
This artifact includes:
- The dataset of 412 failure cases of 13 popular Kubernetes operators with instructions to reproduce the findings (Findings 1--4) and tables (Tables 3--5, Figure 3),
- Reproduction instructions for the 86 bugs (Table 7) found by Oat.
The entire AE takes about 12 hours when run with 4 concurrent workers (on the CloudLab machine we suggest) and about 34 hours when run sequentially with a single worker.
If you have any questions, please contact us via email or HotCRP.
Setting up CloudLab machines
If you are new to CloudLab, we encourage you to read the CloudLab docs on how AE is generally conducted on CloudLab.
CloudLab For Artifact Evaluation
If you do not have a CloudLab account, please apply for one following this link, and ask the NSDI AEC chair to add you to the NSDI AEC project. Please let us know if you have trouble accessing CloudLab; we can help set up the experiment and give you access.
We recommend the c6420 machine type (CloudLab profile), which we used in our own evaluation. Note that this machine may not always be available; to guarantee availability, submit a resource reservation. Alternatively, you can use the c8220 machine type via our profile. The c8220 machine is available most of the time, but has less memory than c6420.
Note that our results in the evaluation are all produced using the c6420 profile.
Below, we provide three ways to set up the environment:
- Set up environment on CloudLab c6420 using the profile (recommended)
- Set up environment on CloudLab c8220 using the profile
- Set up environment on a local machine
To reserve machines, open the "Experiments" dropdown menu at the top left corner and click "Reserve Nodes". Select "CloudLab Clemson" for the cluster, "c6420" as the hardware, and "1" for the number of nodes. Specify the desired time frame for the reservation, and click "Check". The website checks whether your reservation can be satisfied, after which you can submit the request. The request is reviewed by CloudLab staff and is typically approved on the next business day.
Note: Reservation does not automatically start the experiment.
We provide a CloudLab profile that automatically selects c6420 as the machine type and sets up the environment.
To use the profile, follow the link and keep clicking "Next" to create the experiment.
CloudLab then starts provisioning the machine, and our profile runs a startup script to set up the environment.
The startup takes around 15 minutes.
Please wait for "Status" to become Ready and "Startup" to become Finished.
After that, Oat is installed at the workdir/acto directory under your $HOME directory.
Access the machine using ssh or through the shell provided by the CloudLab Web UI.
Please proceed to the Kick-the-tire Instructions to validate.
There could sometimes be transient network issues within the CloudLab cluster, which prevent e.g. pip install in the startup scripts from functioning as expected.
To circumvent the problem, either
- Recreate the experiment using the same profile, or
- SSH into the machine and manually rerun the startup:
  sudo su - geniuser
  bash /local/repository/scripts/cloudlab_startup_run_by_geniuser.sh
  exit
Seeing an Error Message from CloudLab No available physical nodes of type c6420 found (1 requested)?
This means that there are currently no c6420 machines available for experiments. Please check the Reserve nodes with preferred hardware section or check back later.
We provide a CloudLab profile that automatically selects c8220 as the machine type and sets up the environment, for cases where no c6420 machine is available when you start the experiment or you do not have enough time to make a resource reservation.
To use the profile, follow the link and keep clicking "Next" to create the experiment.
CloudLab then starts provisioning the machine, and our profile runs a startup script to set up the environment.
The startup takes around 20 minutes.
Please wait for "Status" to become Ready and "Startup" to become Finished.
After that, Oat is installed at the workdir/acto directory under your $HOME directory.
Access the machine using ssh or through the shell provided by the CloudLab Web UI.
Please proceed to the Kick-the-tire Instructions to validate.
- A Linux system with Docker support
- Python 3.10 or newer
- Install pip3 by running
  sudo apt install python3-pip
- Install Golang
- Clone the repo recursively by running
  git clone --recursive --branch nsdi26-ae https://github.com/xlab-uiuc/acto.git
- Install Python dependencies by running the following in the project
  pip3 install -r requirements-dev.txt
- Install Kind by running
  go install sigs.k8s.io/kind@v0.21.0
- Install Kubectl by running
  curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
  sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
- Configure inotify limits (need to rerun after reboot)
  sudo sysctl fs.inotify.max_user_instances=1024
  sudo sysctl fs.inotify.max_user_watches=1048576
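The prerequisites listed above can be sanity-checked before running anything long. The following is a minimal sketch, not part of the artifact: the function names are hypothetical, the binary names (docker, go, kind, kubectl, ...) are assumed to be the usual ones, and the inotify values are read from the standard Linux procfs location.

```python
import shutil
from pathlib import Path

# Binaries the setup list asks for. The exact binary names are an assumption.
REQUIRED_TOOLS = ["docker", "python3", "pip3", "go", "kind", "kubectl"]

# inotify limits taken from the sysctl commands in the setup list.
INOTIFY_LIMITS = {
    "max_user_instances": 1024,
    "max_user_watches": 1048576,
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def low_inotify_limits(limits=INOTIFY_LIMITS, root="/proc/sys/fs/inotify"):
    """Return {name: current_value} for every limit below the requested value."""
    too_low = {}
    for name, wanted in limits.items():
        path = Path(root) / name
        if not path.exists():
            continue  # not a Linux host, or procfs is unavailable
        current = int(path.read_text().strip())
        if current < wanted:
            too_low[name] = current
    return too_low


if __name__ == "__main__":
    print("missing tools:", missing_tools() or "none")
    print("inotify limits below target:", low_inotify_limits() or "none")
```

If kind is reported missing right after `go install`, one likely cause is that the Go binary directory (by default `$(go env GOPATH)/bin`) is not on PATH.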
We prepared a simple example – reproducing a bug found by Oat – to help check obvious setup problems.
First, build the dependent modules:
cd ~/workdir/acto/
make
Then, reproduce the MariaDBOp-863 bug by running:
python3 reproduce_bugs.py --bug-id mariadbop-863
Expected results:
Reproducing bug mariadbop-863 in MariaDBOp!
Preparing required images...
Deleting cluster "acto-0-cluster-0" ...
Deleted nodes: ["acto-0-cluster-0-worker2" "acto-0-cluster-0-worker4" "acto-0-cluster-0-worker5" "acto-0-cluster-0-worker3" "acto-0-cluster-0-worker" "acto-0-cluster-0-control-plane"]
Creating a Kind cluster...
Creating cluster "acto-0-cluster-0" ...
...
Deploying operator...
Operator deployed
pod/mariadb-writer created
Bug mariadbop-863 reproduced!
Bug category: Operation Semantics
You can view the tables and findings reproduced using Jupyter notebooks here: https://github.com/xlab-uiuc/acto/blob/nsdi26-ae/study.ipynb
Operator Failure Dataset: https://github.com/xlab-uiuc/acto/blob/nsdi26-ae/nsdi26ae.csv
To reproduce the 86 bugs in Table 7, please execute the tests by running:
cd ~/workdir/acto/
make
python3 reproduce_bugs.py -n <NUM_WORKERS>
Using the c6420 or the c8220 profile we recommend, run the tests with 4 workers (-n 4); it will take about 12 hours to finish.
We suggest starting this long-running experiment in a tmux or screen session.
Caution: Running too many workers at the same time may overload your machine, and Kind would fail to bootstrap Kubernetes clusters. If you are not running the experiment using our recommended CloudLab profile, please default the number of workers to 1. Running this step sequentially takes approximately 34 hours.
What does the reproduce script do?
For each bug, the reproduction code runs Oat with the tests needed to reproduce the bug. It checks if every bug is reproducible and outputs Table 7. The table7.txt should look like below:
Operator          Operation Semantics    State Observability    Version Compatibility    Error Handling    By Product    Total
--------------  ---------------------  ---------------------  -----------------------  ----------------  ------------  -------
CassOp                              7                      0                        1                 0             2       10
KafkaOp                             2                      1                        0                 0             0        3
MariaDBOp                           9                      1                        0                 1            16       27
MinIOOp                             1                      0                        0                 0             1        2
MongoOp                            18                      2                        1                 3             2       26
TiDBOp                              9                      2                        1                 0             6       18
Total                              46                      6                        3                 4            27       86
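To compare a generated table7.txt against the expected table, it can help to check that every row sums to its Total column and that the Total row matches the column sums. The following is a minimal sketch under assumptions: the helper names are hypothetical, the file is assumed to have the whitespace-separated layout shown above, and its location on disk is not specified here.

```python
def parse_table7(text):
    """Map each row label to its integer cells (the last cell is the row total)."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the header, and the dashed separator:
        # only data rows consist of a label followed by integers.
        if len(parts) < 2 or not all(p.isdigit() for p in parts[1:]):
            continue
        rows[parts[0]] = [int(p) for p in parts[1:]]
    return rows


def check_totals(rows):
    """Return a list of inconsistencies; an empty list means the table adds up."""
    problems = []
    for name, cells in rows.items():
        if sum(cells[:-1]) != cells[-1]:
            problems.append(f"row {name}: categories do not sum to {cells[-1]}")
    if "Total" in rows:
        operators = [cells for name, cells in rows.items() if name != "Total"]
        column_sums = [sum(column) for column in zip(*operators)]
        if column_sums != rows["Total"]:
            problems.append(
                f"Total row {rows['Total']} != column sums {column_sums}"
            )
    return problems


# Usage (the path to table7.txt is an assumption; adjust it to where the
# reproduce script wrote the file):
#   from pathlib import Path
#   print(check_totals(parse_table7(Path("table7.txt").read_text())))
```

An empty list only shows that the table is internally consistent; the per-operator counts still need to be compared against the expected table by eye or with a diff.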
Some bugs may fail to reproduce when the machine is overloaded. You can retry reproducing a specific bug by its bug ID:
python3 reproduce_bugs.py --bug-id <BUG_ID>
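When several bugs need a retry, the per-bug command above can be looped over a list of IDs. The following is a minimal sketch, not part of the artifact: the helper names are hypothetical, it must be run from the repository directory (~/workdir/acto), and it assumes success is signalled by the "Bug <BUG_ID> reproduced!" line shown in the kick-the-tire output, printed to stdout or stderr.

```python
import subprocess
import sys


def retry_command(bug_id, python=sys.executable):
    """Build the single-bug command documented above."""
    return [python, "reproduce_bugs.py", "--bug-id", bug_id]


def looks_reproduced(output, bug_id):
    """Assumption: success is reported as 'Bug <id> reproduced!' in the output."""
    return f"Bug {bug_id} reproduced!" in output


def retry_bugs(bug_ids, max_attempts=2, run=subprocess.run):
    """Retry each bug one at a time; return the IDs that still fail."""
    still_failing = []
    for bug_id in bug_ids:
        for _ in range(max_attempts):
            proc = run(retry_command(bug_id), capture_output=True, text=True)
            output = (proc.stdout or "") + (proc.stderr or "")
            if looks_reproduced(output, bug_id):
                break
        else:
            # No attempt printed the success line.
            still_failing.append(bug_id)
    return still_failing


# Usage, from the repository directory:
#   print(retry_bugs(["mariadbop-863"]))
```

Running the retries one at a time keeps the load low, which matters because overload is the likely cause of the original failure.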