Set replicas to panic on disk errors, and optionally panic on replication errors#10504
Conversation
|
i'm aware of at least one valid case for a command to fail on the replica (may no longer be valid), didn't look at the failed tests, maybe they show something similar to that. p.s. we already have a metric for these: |
it doesn't matter after 5.0, since we rewrite lua script as BTW, I don't think #10419 is a bug, in particular we have to do a lot of trivial but insignificant work to handle replication, the last arg win is just redis protocol style I think. |
Removed extra integration test Co-authored-by: Binbin <binloveplay1314@qq.com>
|
As per @soloestoy's comment, I didn't think there were any "valid" cases of sending incompatible commands along the replication stream since we now only do effect replication. If there are others, I would still like to get something like this in (maybe with a default off). I'm just slightly worried about silent corruption.
|
|
I think the tests are me forgetting to tag debug stuff. |
|
well, i still fear that feature, i have a feeling there are several other cases that we're missing. but in anyway, at the very least, we should add 2 conditions to prevent it.
|
|
Ok, I'm okay with a config. I'm not really sure we need the check for Redis 7, even if a user is on a lower version, they have to be sending traffic that could be unintentionally divergent. I would argue it's better to resync on a potential issue that silently ignore a real one. |
|
a re-sync is not necessarily a solution. did you try to run the tests with |
|
I agree, it's possible it's not better, but it's also possibly much worse. I think a config that defaults to disabled for now at least provides a good middle ground we can change later. I would argue that disconnects are surfaced better than log messages at least. I will do the grep test when I update the PR a little bit later today. |
|
Today it's possible for a replica to not know about the correct topology of its master (by design, only masters are authoritative about slots), so we can't really do any validation besides checking for "cross-slot" operations. Although I originally advocated for it, I now think we should just skip that extra validation and blindly accept whatever the master sends us and apply it. |
A weird dangly space.
oranagra
left a comment
There was a problem hiding this comment.
we don't have test coverage for repl_ignore_disk_write_error, the code seems pretty safe to merge without a test though..
oranagra
left a comment
There was a problem hiding this comment.
approved by core-team meeting.
…efore a real write arrives from the master.
Co-authored-by: Binbin <binloveplay1314@qq.com>
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in redis#10504
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in #10504
…tion errors (redis#10504) * Till now, replicas that were unable to persist, would still execute the commands they got from the master, now they'll panic by default, and we add a new `replica-ignore-disk-errors` config to change that. * Till now, when a command failed on a replica or AOF-loading, it only logged a warning and a stat, we add a new `propagation-error-behavior` config to allow panicking in that state (may become the default one day) Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they could be in some cases (specifically in previous versions), a result of a command (e.g. EVAL) that failed on the master, but still had to be propagated to fail on the replica as well.
Missing a typeof, we will get errors like this: - multiple definition of `replicationErrorBehavior' - ld: error: duplicate symbol: replicationErrorBehavior Introduced in redis#10504
replica-ignore-disk-write-errorsconfig to change that.propagation-error-behaviorconfig to allow panicking in that state (may become the default one day)Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they could be in some cases (specifically in previous versions), a result of a command (e.g. EVAL) that failed on the master, but still had to be propagated to fail on the replica as well.
Background
Based off our conversations about data corruption (#10419), it seemed like maybe we should add some defense here. If a replica tries to apply an invalid command, this flag will crash the replica. This has been enabled in tests.
to do:
replica-ignore-disk-write-errors)