Skip to content

Set replicas to panic on disk errors, and optionally panic on replication errors#10504

Merged
oranagra merged 25 commits into
redis:unstablefrom
madolson:resync-on-corruption
Apr 26, 2022
Merged

Set replicas to panic on disk errors, and optionally panic on replication errors#10504
oranagra merged 25 commits into
redis:unstablefrom
madolson:resync-on-corruption

Conversation

@madolson

@madolson madolson commented Apr 1, 2022

Copy link
Copy Markdown
Contributor
  • Till now, replicas that were unable to persist, would still execute the commands they got from the master, now they'll panic by default, and we add a new replica-ignore-disk-write-errors config to change that.
  • Till now, when a command failed on a replica or AOF-loading, it only logged a warning and a stat, we add a new propagation-error-behavior config to allow panicking in that state (may become the default one day)

Note that commands that fail on the replica can either indicate a bug that could cause data inconsistency between the replica and the master, or they could be in some cases (specifically in previous versions), a result of a command (e.g. EVAL) that failed on the master, but still had to be propagated to fail on the replica as well.

Background

Based off our conversations about data corruption (#10419), it seemed like maybe we should add some defense here. If a replica tries to apply an invalid command, this flag will crash the replica. This has been enabled in tests.

to do:

  • consider removing the check added here d3b4662 and panicking instead (either depending on the new config, or always). (Originally we thought this was an issue, but it doesn't seem like this makes sense since primaries/replicas can disagree on slot ownership)
  • consider panicking on disk errors in replica (possibly using another new config: replica-ignore-disk-write-errors)

Comment thread tests/test_helper.tcl Outdated
@oranagra

oranagra commented Apr 1, 2022

Copy link
Copy Markdown
Member

i'm aware of at least one valid case for a command to fail on the replica (may no longer be valid),
this was when a script was executed on the master and failed half way (after making some modifications), that script must have been propagated to the replica to fail there as well)
@guybe7 @soloestoy you may have other examples.

didn't look at the failed tests, maybe they show something similar to that.

p.s. we already have a metric for these: stat_unexpected_error_replies

@soloestoy

soloestoy commented Apr 1, 2022

Copy link
Copy Markdown
Contributor

i'm aware of at least one valid case for a command to fail on the replica (may no longer be valid),
this was when a script was executed on the master and failed half way (after making some modifications), that script must have been propagated to the replica to fail there as well)

it doesn't matter after 5.0, since we rewrite lua script as MULTI/EXEC by default, but indeed that's a problem when upgrading from old version redis.

BTW, I don't think #10419 is a bug, in particular we have to do a lot of trivial but insignificant work to handle replication, the last arg win is just redis protocol style I think.

Removed extra integration test

Co-authored-by: Binbin <binloveplay1314@qq.com>
@madolson

madolson commented Apr 1, 2022

Copy link
Copy Markdown
Contributor Author

As per @soloestoy's comment, I didn't think there were any "valid" cases of sending incompatible commands along the replication stream since we now only do effect replication. If there are others, I would still like to get something like this in (maybe with a default off). I'm just slightly worried about silent corruption.

stat_unexpected_error_replies should cover it, I'm pretty sure I saw that and then promptly forgot while publishing the PR.

@madolson

madolson commented Apr 1, 2022

Copy link
Copy Markdown
Contributor Author

I think the tests are me forgetting to tag debug stuff.

@oranagra

oranagra commented Apr 3, 2022

Copy link
Copy Markdown
Member

well, i still fear that feature, i have a feeling there are several other cases that we're missing.
we need to conduct some deeper research to find it.

but in anyway, at the very least, we should add 2 conditions to prevent it.

  1. detect the version of the master, and avoid this if the master is a lower version (or at least lower than 7)
  2. maybe a config (possibly disabled by default)

@madolson

madolson commented Apr 4, 2022

Copy link
Copy Markdown
Contributor Author

Ok, I'm okay with a config. I'm not really sure we need the check for Redis 7, even if a user is on a lower version, they have to be sending traffic that could be unintentionally divergent. I would argue it's better to resync on a potential issue that silently ignore a real one.

@oranagra

oranagra commented Apr 4, 2022

Copy link
Copy Markdown
Member

a re-sync is not necessarily a solution.
in a certain use case (for instance one with an EVAL that fails half way), could be facing repeated re-connection, one every second.
you can argue that once we have a config for that the user has a way out, and also that script propagation was disabled by default already anyway, but maybe there are other cases....

did you try to run the tests with --dont-clean and grep for these warnings to see if we have other issues like that?

@madolson

madolson commented Apr 4, 2022

Copy link
Copy Markdown
Contributor Author

I agree, it's possible it's not better, but it's also possibly much worse. I think a config that defaults to disabled for now at least provides a good middle ground we can change later. I would argue that disconnects are surfaced better than log messages at least. I will do the grep test when I update the PR a little bit later today.

Comment thread src/networking.c Outdated
Comment thread src/networking.c Outdated
Comment thread src/config.c Outdated
@madolson madolson changed the title Force resync from master when data corruption is detected on replica Add config to allow replicas to panic on replication errors Apr 8, 2022
Comment thread src/networking.c Outdated
Comment thread tests/integration/replication-4.tcl
Comment thread src/networking.c Outdated
Comment thread redis.conf Outdated
Comment thread redis.conf Outdated
Comment thread src/config.c
Comment thread src/server.h Outdated
Comment thread src/server.h
@madolson

Copy link
Copy Markdown
Contributor Author

Today it's possible for a replica to not know about the correct topology of its master (by design, only masters are authoritative about slots), so we can't really do any validation besides checking for "cross-slot" operations. Although I originally advocated for it, I now think we should just skip that extra validation and blindly accept whatever the master sends us and apply it.

Comment thread redis.conf
Comment thread src/server.c
Comment thread src/config.c Outdated
A weird dangly space.
Comment thread redis.conf Outdated
Comment thread redis.conf
Comment thread redis.conf Outdated
Comment thread src/server.c
@oranagra oranagra changed the title Add config to allow replicas to panic on replication errors Set replicas to panic on disk errors, and optionally panic on replication errors Apr 26, 2022

@oranagra oranagra left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have test coverage for repl_ignore_disk_write_error, the code seems pretty safe to merge without a test though..

Comment thread src/server.c Outdated
Co-authored-by: Binbin <binloveplay1314@qq.com>

@oranagra oranagra left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved by core-team meeting.

@oranagra oranagra added the breaking-change This change can potentially break existing application label Apr 26, 2022
Comment thread redis.conf
oranagra and others added 2 commits April 26, 2022 12:33
Co-authored-by: Binbin <binloveplay1314@qq.com>
@oranagra oranagra merged commit 6fa8e4f into redis:unstable Apr 26, 2022
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Apr 26, 2022
Missing a typeof, we will get errors like this:
- multiple definition of `replicationErrorBehavior'
- ld: error: duplicate symbol: replicationErrorBehavior

Introduced in redis#10504
oranagra pushed a commit that referenced this pull request Apr 26, 2022
Missing a typeof, we will get errors like this:
- multiple definition of `replicationErrorBehavior'
- ld: error: duplicate symbol: replicationErrorBehavior

Introduced in #10504
@oranagra oranagra mentioned this pull request Apr 27, 2022
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…tion errors (redis#10504)

* Till now, replicas that were unable to persist, would still execute the commands
  they got from the master, now they'll panic by default, and we add a new
  `replica-ignore-disk-errors` config to change that.
* Till now, when a command failed on a replica or AOF-loading, it only logged a
  warning and a stat, we add a new `propagation-error-behavior` config to allow
  panicking in that state (may become the default one day)

Note that commands that fail on the replica can either indicate a bug that could
cause data inconsistency between the replica and the master, or they could be
in some cases (specifically in previous versions), a result of a command (e.g. EVAL)
that failed on the master, but still had to be propagated to fail on the replica as well.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
Missing a typeof, we will get errors like this:
- multiple definition of `replicationErrorBehavior'
- ld: error: duplicate symbol: replicationErrorBehavior

Introduced in redis#10504
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approval-needed Waiting for core team approval to be merged breaking-change This change can potentially break existing application release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants