Description
Bug report criteria
- This bug report is not security related, security issues should be disclosed privately via security@etcd.io.
- This is not a support request, support requests should be raised in the etcd discussion forums.
- You have read the etcd bug reporting guidelines.
- Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
After experiencing a power failure while an etcd server is bootstrapping, the server is no longer able to recover and restart again.
This issue occurs in both single-node and three-node clusters. The root cause is that some writes to the member/snap/db file are larger than a single page of the page cache. Because the pages of the page cache can be flushed to disk out of order, a crash can leave a "torn write", where only part of the write's payload is persisted. Several references describe this problem:
- https://www.usenix.org/conference/osdi14/technical-sessions/presentation/pillai
- https://dl.acm.org/doi/pdf/10.1145/2872362.2872406
- https://mariadb.com/kb/en/atomic-write-support/
- https://pages.cs.wisc.edu/~remzi/OSTEP/file-journaling.pdf (page 9)
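To make the torn-write scenario concrete, here is a minimal sketch (not etcd or LazyFS code) that enumerates the on-disk states a single multi-page write can leave behind if each page may or may not reach the disk before a crash. The 4096-byte page size and the function name `crash_states` are assumptions for illustration.

```python
from itertools import combinations

PAGE_SIZE = 4096  # assumed page-cache page size; typical on linux/amd64


def crash_states(write_size, page_size=PAGE_SIZE):
    """Return every subset of pages that may be on disk after a crash.

    Each subset is a tuple of 0-based page indexes that were flushed.
    The model assumes pages are flushed independently and in any order.
    """
    n_pages = -(-write_size // page_size)  # ceiling division
    pages = range(n_pages)
    return [subset
            for k in range(n_pages + 1)
            for subset in combinations(pages, k)]


states = crash_states(16384)
# A state is "torn" when some, but not all, pages were persisted.
torn = [s for s in states if 0 < len(s) < 4]
print(len(states), "possible states,", len(torn), "of them torn")
```

Under this model only two of the states (nothing persisted, everything persisted) correspond to an atomic write; every other state is a torn write that the application has to tolerate on recovery.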
What did you expect to happen?
The server that suffered the power failure should restart correctly.
How can we reproduce it (as minimally and precisely as possible)?
This issue can be reproduced using LazyFS, which can now simulate out-of-order persistence of file system pages to disk. The main problem is a 16384-byte write to the file member/snap/db. LazyFS persists portions of this write (8192 bytes each) out of order and then crashes, simulating a power failure.
To reproduce this problem, one can follow these steps:
- Mount LazyFS on a directory where etcd data will be saved, with a specified root directory. Assuming the data path for etcd is /home/data/data.etcd and the root directory is /home/data-r/data.etcd, add the following lines to the default configuration file (config/default.toml):
[[injection]]
type="split_write"
file="/home/data-r/data.etcd/member/snap/db"
persist=[2]
parts=2
occurrence=1
These lines define the fault to be injected. A power failure will be simulated after a write to the /home/data-r/data.etcd/member/snap/db file. Since this write is large (16384 bytes), it is split into 2 parts (8192 bytes each), and only the second part is persisted. The occurrence parameter specifies that the fault applies to the first write issued to this file.
- Start LazyFS with the following command:
./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/data.etcd -r /home/data-r/data.etcd -f
- Start etcd with the command
./etcd --data-dir '/home/data/data.etcd'
Immediately after this step, etcd will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued until the moment of the fault. Here is a simplified version of the log:
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'mode': 'O_RDWR'}
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
{'syscall': 'fault'}
- Remove the fault from the configuration file and unmount the filesystem with
fusermount -uz /home/data/data.etcd
- Mount LazyFS again with the previously provided command.
- Attempt to start etcd (it fails).
By following these steps, you can replicate the issue and analyze the effects of the power failure on etcd's restart process.
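The simplified log above uses Python-style dict literals, one per line, so it can be scanned with a few lines of Python. The sketch below assumes that format (real LazyFS logs may differ) and reports the write that immediately precedes the fault, together with how many pages it spans. The helper name `last_write_before_fault` and the 4096-byte page size are illustrative assumptions.

```python
import ast

PAGE_SIZE = 4096  # assumed page size


def last_write_before_fault(log_lines):
    """Return the last 'write' entry seen before the first 'fault' entry."""
    last_write = None
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        entry = ast.literal_eval(line)  # safe parsing of a dict literal
        if entry["syscall"] == "write":
            last_write = entry
        elif entry["syscall"] == "fault":
            return last_write
    return None  # no fault recorded in the log


log = """
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'mode': 'O_RDWR'}
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
{'syscall': 'fault'}
""".splitlines()

w = last_write_before_fault(log)
pages = -(-int(w["size"]) // PAGE_SIZE)  # ceiling division
print(w["path"], "size", w["size"], "spans", pages, "pages")
```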
The same problem (but with a different error) happens when only the first 8192 bytes of the write are persisted (to test this, change the parameter persist to [1]).
Note that no problem occurs when persist is changed to [1,2]: the whole write is persisted and etcd restarts successfully.
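One possible explanation for the two different errors is which bbolt pages survive in each case. The sketch below models the `split_write` fault on a zero-filled file and reports which 4096-byte pages reach the disk. It assumes bbolt's initial file layout (two meta pages, then a freelist page, then an empty leaf page) and a 4096-byte page size. `split_write` here is a hypothetical helper that mimics the LazyFS parameters `parts` and `persist`; it is not LazyFS code.

```python
PAGE_SIZE = 4096  # assumed bbolt page size on linux/amd64
# Assumed layout of the first write bbolt issues to a new db file.
LAYOUT = ["meta0", "meta1", "freelist", "leaf"]


def split_write(payload, parts, persist):
    """Model of the fault: split payload into equal parts and keep only the
    1-based part numbers in `persist`; everything else stays zeroed."""
    part_size = len(payload) // parts
    disk = bytearray(len(payload))  # model: unwritten regions read as zeros
    for p in persist:
        start = (p - 1) * part_size
        disk[start:start + part_size] = payload[start:start + part_size]
    return bytes(disk)


def surviving_pages(disk, payload):
    """Names of the pages whose on-disk bytes match the intended write."""
    return [name for i, name in enumerate(LAYOUT)
            if disk[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]
            == payload[i * PAGE_SIZE:(i + 1) * PAGE_SIZE]]


payload = bytes([0xAB]) * (4 * PAGE_SIZE)  # stand-in for the 16384-byte write
for persist in ([1], [2], [1, 2]):
    disk = split_write(payload, parts=2, persist=persist)
    print(persist, surviving_pages(disk, payload))
# [1] ['meta0', 'meta1']
# [2] ['freelist', 'leaf']
# [1, 2] ['meta0', 'meta1', 'freelist', 'leaf']
```

Under these assumptions, persist=[2] leaves the file without either meta page, while persist=[1] keeps the meta pages but loses the pages they refer to, which would be consistent with the two cases producing different errors.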
Anything else we need to know?
Here is the output produced by etcd on restart. The first file contains the error reported after persisting only the first 8192 bytes of the write to member/snap/db, and the second file contains the error reported after persisting only the second 8192 bytes.
persist_first_part.txt
persist_second_part.txt
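To check which of the two states a db file is in before restarting etcd, one can look for bbolt's meta-page magic number. The sketch below assumes bbolt's on-disk format (a 16-byte page header followed by the 32-bit magic `0xED0CDAED`, little-endian on amd64, with 4096-byte pages). `meta_pages_present` and `check_file` are hypothetical helpers, and the demo runs on synthetic data rather than a real db file.

```python
import struct

PAGE_SIZE = 4096          # assumed bbolt page size
PAGE_HEADER_SIZE = 16     # assumed size of bbolt's page header
BBOLT_MAGIC = 0xED0CDAED  # bbolt meta-page magic number


def meta_pages_present(data):
    """Return [bool, bool]: whether meta page 0 / 1 carries the magic."""
    result = []
    for page in (0, 1):
        off = page * PAGE_SIZE + PAGE_HEADER_SIZE
        chunk = data[off:off + 4]
        ok = len(chunk) == 4 and struct.unpack("<I", chunk)[0] == BBOLT_MAGIC
        result.append(ok)
    return result


def check_file(path):
    """Read the first two pages of a db file and test both meta pages."""
    with open(path, "rb") as f:
        return meta_pages_present(f.read(2 * PAGE_SIZE))


# Self-contained demo on synthetic data instead of a real db file.
good = bytearray(2 * PAGE_SIZE)
for page in (0, 1):
    struct.pack_into("<I", good, page * PAGE_SIZE + PAGE_HEADER_SIZE, BBOLT_MAGIC)
print(meta_pages_present(bytes(good)))           # both magics found
print(meta_pages_present(bytes(2 * PAGE_SIZE)))  # zeroed pages, as with persist=[2]
```

On the setup described in this report, the check would be run as `check_file('/home/data-r/data.etcd/member/snap/db')` after the fault and before restarting etcd.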
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.4.25
Git SHA: 94593e63d
Go Version: go1.19.8
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.4.25
API version: 3.4
Etcd configuration (command line flags or environment variables)
--data-dir 'data/data.etcd'
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response