etcd fails to start after power failure #16596

@mj-ramos

Description

What happened?

After a power failure occurs while an etcd server is bootstrapping, the server can no longer recover and restart.

This issue occurs in both single-node and three-node clusters. The root cause is that some writes to the member/snap/db file exceed the usual size of a page in the page cache. This can result in a "torn write", where only part of the write's payload is persisted, because pages in the page cache can be flushed out of order. There are several references about this problem:

What did you expect to happen?

The server that experienced the power failure should restart correctly.

How can we reproduce it (as minimally and precisely as possible)?

This issue can be replicated using LazyFS, which can now simulate out-of-order persistence of file system pages to disk. The main problem is a 16384-byte write to the file member/snap/db. LazyFS persists portions of this write (8192 bytes each) out of order and then crashes, simulating a power failure.
To reproduce this problem, one can follow these steps:

  1. Mount LazyFS on the directory where etcd will store its data, specifying a root directory. Assuming the etcd data path is /home/data/data.etcd and the root directory is /home/data-r/data.etcd, add the following lines to the default configuration file (config/default.toml):
[[injection]]
type="split_write"
file="/home/data-r/data.etcd/member/snap/db"
persist=[2]
parts=2
occurrence=1

These lines define the fault to be injected. A power failure is simulated after a write to the /home/data-r/data.etcd/member/snap/db file. Since this write is large (16384 bytes), it is split into 2 parts of 8192 bytes each, and only the second part is persisted. The occurrence parameter specifies that the fault applies to the first write issued to this file.

  1. Start LazyFS with the following command:
    ./scripts/mount-lazyfs.sh -c config/default.toml -m /home/data/data.etcd -r /home/data-r/data.etcd -f

  2. Start etcd with the command ./etcd --data-dir '/home/data/data.etcd'.

Immediately after this step, etcd will shut down because LazyFS was unmounted, simulating the power failure. At this point, you can analyze the logs produced by LazyFS to see the system calls issued until the moment of the fault. Here is a simplified version of the log:

{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch', 'mode': 'O_TRUNC'}
{'syscall': 'release', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/.touch'}
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'mode': 'O_RDWR'}
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
{'syscall': 'fault'}
  1. Remove the fault from the configuration file and unmount the filesystem with fusermount -uz /home/data/data.etcd.
  2. Mount LazyFS again with the previously provided command.
  3. Attempt to start etcd (it fails).

By following these steps, you can replicate the issue and analyze the effects of the power failure on etcd's restart process.
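To spot which writes are candidates for tearing, the LazyFS log can be scanned for writes larger than a page. The sketch below is only an illustration: it assumes the log is kept in the simplified form shown above (each line happens to be a valid Python dict literal), and the 4096-byte page size and the helper name `oversized_writes` are assumptions, not part of LazyFS or etcd.

```python
import ast

# Assumed page-cache page size in bytes; check the value on your system.
PAGE_SIZE = 4096

# Three lines copied from the simplified LazyFS log in this report.
LOG = """\
{'syscall': 'create', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'mode': 'O_RDWR'}
{'syscall': 'write', 'path': '/home/gsd/etcd-v3.4.25-linux-amd64/data-r/data.etcd/member/snap/db', 'size': '16384', 'off': '0'}
{'syscall': 'fault'}
"""


def oversized_writes(log_text, page_size=PAGE_SIZE):
    """Return (path, offset, size) for every write larger than one page."""
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = ast.literal_eval(line)  # parse the dict literal safely
        if entry.get('syscall') != 'write':
            continue
        size = int(entry['size'])
        if size > page_size:
            hits.append((entry['path'], int(entry['off']), size))
    return hits


for path, off, size in oversized_writes(LOG):
    print(f"{path}: {size} bytes at offset {off} spans more than one page")
```

With the log from this report, the only write flagged is the 16384-byte write to member/snap/db at offset 0.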

The same problem occurs, with a different error, when only the first 8192 bytes of the write are persisted (to reproduce this, change the persist parameter to [1]).

Note that no problem occurs when persist is changed to [1,2]: the whole write is persisted and etcd restarts successfully.
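The three persist settings discussed above can be modeled in memory to see what ends up on disk in each case. This is a toy sketch, not LazyFS code: the function name `persist_parts` is hypothetical, and the choice to model the unpersisted region as zero bytes is an assumption about how the lost data reads back after the crash.

```python
from typing import List

WRITE_SIZE = 16384  # size of the first write to member/snap/db, from this report
PARTS = 2           # the `parts` value in the injected fault


def persist_parts(payload: bytes, parts: int, persist: List[int]) -> bytes:
    """Return the on-disk image if only the 1-based `persist` parts survive.

    Parts that are not persisted are left as zero bytes (an assumption).
    """
    part_size = len(payload) // parts
    disk = bytearray(len(payload))  # starts zero-filled
    for index in persist:
        start = (index - 1) * part_size
        disk[start:start + part_size] = payload[start:start + part_size]
    return bytes(disk)


# Payload without any zero bytes, so a lost part is always distinguishable.
payload = bytes((i % 251) + 1 for i in range(WRITE_SIZE))

for persist in ([2], [1], [1, 2]):
    disk = persist_parts(payload, PARTS, persist)
    print(persist, "intact" if disk == payload else "torn")
```

Only the `[1, 2]` case reproduces the full payload, which matches the observation that etcd restarts only when both halves of the write are persisted.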

Anything else we need to know?

Here is the output produced by etcd on restarting. The first file corresponds to the error reported after only persisting the first 8192 bytes of the member/snap/db file and the second file to the error reported after only persisting the second 8192 bytes of the member/snap/db file.
persist_first_part.txt
persist_second_part.txt

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.4.25
Git SHA: 94593e63d
Go Version: go1.19.8
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.4.25
API version: 3.4

Etcd configuration (command line flags or environment variables)

--data-dir 'data/data.etcd'

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response
