@MeirShpilraien MeirShpilraien commented Jul 13, 2022

NOTICE! The main part of this change was later reverted in #11178 and is not included in 7.2 RC1, despite what the release notes state.
See #10969 (comment) for the reason.


Fix replication inconsistency on modules that use key space notifications.

The Problem

In general, key space notifications are invoked after the command logic has executed (this is not always the case; we will later discuss specific commands that do not follow this rule). For example, set x 1 will trigger a set notification that is invoked after the set logic has been performed, so if the notification logic tries to fetch x, it will see the newly written data.
Consider a scenario in which the notification logic performs some write commands. For example, the notification logic increments a counter, incr x{counter}, indicating how many times x was changed. The logical order in which the logic executes is as follows:

```
set x 1
incr x{counter}
```

The issue is that the set x 1 command is added to the replication buffer at the end of the command invocation (specifically, after the key space notification logic was invoked and performed the incr command). The replication/AOF sees the commands in the wrong order:

```
incr x{counter}
set x 1
```

In this specific example the order is less important. But if, for example, the notification had deleted x, we would end up with primary-replica inconsistency.
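The buggy ordering can be seen in a minimal Python simulation (illustrative only, not Redis code; the names are made up): the triggering command is appended to the replication buffer only after the notification handler has already replicated its own write.

```python
# Simulate the pre-fix propagation order: command logic runs, the key space
# notification fires (and replicates its own write), and only then is the
# triggering command appended to the replication buffer -- too late.
repl_buffer = []

def notification_handler(key):
    # The handler reacts to the 'set' notification with its own write,
    # which is propagated immediately.
    repl_buffer.append(f"incr {key}{{counter}}")

def execute_set(key, value):
    # 1. Command logic runs (not modeled here).
    # 2. Key space notification fires and replicates its writes.
    notification_handler(key)
    # 3. The triggering command is appended at the end.
    repl_buffer.append(f"set {key} {value}")

execute_set("x", 1)
print(repl_buffer)  # ['incr x{counter}', 'set x 1'] -- reversed vs. logical order
```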

The Solution

Put the command that causes the notification in its rightful place. In the above example, the set x 1 command logic is executed before the notification logic, so it should be added to the replication buffer before the commands invoked by the notification logic. To achieve this without a major code refactoring, we save a placeholder in the replication buffer; when the command logic finishes, we check whether the command needs to be replicated, and if it does, we use the placeholder to add it to the replication buffer instead of appending it to the end.

To be efficient and avoid allocating memory for each command just to save the placeholder, the replication buffer array was modified to reuse memory (instead of allocating it each time we want to replicate commands). Also, to avoid saving a placeholder when it is not needed, we do this only for WRITE or MAY_REPLICATE commands.
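The placeholder mechanism described above can be sketched in Python (an illustrative simulation; the names are not Redis's): a slot is reserved before the command logic runs, and the triggering command is written into it afterwards, so it lands before the notification's writes.

```python
# Simulate the fix: reserve a placeholder slot in the replication buffer
# before the command runs, then fill it once the command logic (and any
# notification writes) have finished.
repl_buffer = []

def notification_handler(key):
    repl_buffer.append(f"incr {key}{{counter}}")

def execute_set(key, value):
    slot = len(repl_buffer)        # reserve the placeholder
    repl_buffer.append(None)
    notification_handler(key)      # notification writes land *after* the slot
    repl_buffer[slot] = f"set {key} {value}"  # fill the placeholder

execute_set("x", 1)
print(repl_buffer)  # ['set x 1', 'incr x{counter}'] -- matches the logical order
```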

Additional Fixes

  • Expire and Eviction notifications:
    • Expire/Eviction logical order was to first perform the Expire/Eviction and then the notification logic. The replication buffer got this the other way around (first the notification effect and then the del command). This PR fixes this issue.
    • The notification effect and the del command were not wrapped with multi-exec (when needed). This PR also fixes this issue.
  • SPOP command:
    • On spop, the spop notification was fired before the command logic was executed. The change in this PR would have caused the replication order to change (first the spop command and then the notification logic), although the logical order is first the notification logic and then the spop logic. The right fix would have been to move the notification to fire after the command was executed (like all the other commands), but this can be considered a breaking change. To overcome this, the PR keeps the current behaviour and changes the spop code to keep the right logical order when pushing commands to the replication buffer. Another PR will follow to fix SPOP properly and match it to the other commands (we split it into 2 separate PRs so it will be easy to cherry-pick this PR to 7.0 if we choose to).
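The expire/eviction fix from the bullets above can be illustrated with a small Python sketch (a simulation under assumed names, not the actual implementation): the del is emitted first, and the whole effect is wrapped in multi/exec only when the notification produced additional commands.

```python
# Simulate the fixed expiration propagation: the expired key's `del` comes
# first (matching the logical order on the primary), and the `del` plus any
# commands issued by the notification handler are wrapped in one MULTI/EXEC
# block when there is more than one effect.
def propagate_expiration(key, notification_cmds):
    cmds = [f"del {key}"] + notification_cmds
    if len(cmds) > 1:   # wrap only when needed
        return ["multi"] + cmds + ["exec"]
    return cmds

print(propagate_expiration("x", ["incr expired{counter}"]))
# ['multi', 'del x', 'incr expired{counter}', 'exec']
print(propagate_expiration("y", []))
# ['del y'] -- no wrapping needed for a lone del
```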

Unhandled Known Limitations

  • key miss event:
    • On a key miss event, if a module performs some write command in the event handler (using RM_Call), the dirty counter increases and the read command that caused the key miss event is replicated to the replicas and the AOF. This problem can also happen with a write command that opens some keys but eventually decides not to perform any action. We decided not to handle this problem in this PR because the solution is complex and would add risk if we later want to cherry-pick this PR. We should decide whether we want to handle it in future PRs (@oranagra @yossigo @madolson FYI). For now, module writers are advised not to perform any write commands on a key miss event.

Testing

  • We already have tests covering cases where a notification invokes write commands that are also added to the replication buffer; these tests were modified to verify that the replica gets the commands in the correct logical order.
  • A test was added to verify that spop behaviour was kept unchanged.
  • A test was added to verify that key miss events behave as expected.
  • A test was added to verify that the changes do not break lazy expiration.

Additional Changes

  • Following @guybe7's suggestion, the propagateNow function can accept a special dbid, -1, indicating not to replicate select. We use this to replicate multi/exec in the propagatePendingCommands function. The side effect of this change is that the select command now appears inside the multi/exec block in the replication stream (instead of outside it). Tests were modified to match this new behaviour.
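The propagateNow change can be sketched as follows (a Python simulation of the described semantics; the real function lives in Redis's C code): a dbid of -1 suppresses the SELECT, which is why the SELECT now shows up inside the multi/exec block rather than before it.

```python
# Simulate the described propagateNow semantics: dbid == -1 means
# "do not emit a SELECT" (used when replicating the multi/exec wrappers
# themselves); any other dbid emits SELECT only on a db change.
def propagate_now(dbid, cmd, stream, current_db):
    if dbid != -1 and dbid != current_db[0]:
        stream.append(f"select {dbid}")
        current_db[0] = dbid
    stream.append(cmd)

stream, current_db = [], [-1]
propagate_now(-1, "multi", stream, current_db)   # no select emitted
propagate_now(0, "set x 1", stream, current_db)  # select lands inside the block
propagate_now(-1, "exec", stream, current_db)
print(stream)  # ['multi', 'select 0', 'set x 1', 'exec']
```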

@oranagra
Member

@MeirShpilraien didn't we agree to fix spop in a breaking manner?
or did we say that we'll first fix it in a non-breaking manner, and then break it in a later version?

@oranagra oranagra added the state:major-decision Requires core team consensus label Jul 13, 2022
@MeirShpilraien
Author

@oranagra we said that we will first fix it without breaking, so we can cherry-pick this to 7.0 if we want to. We will fix it properly later, in another PR.

@oranagra
Member

Ohh, I forgot. Better outline that plan in the top comment.
@yossigo do you think we can put this in 7.0?

@MeirShpilraien MeirShpilraien requested a review from guybe7 July 21, 2022 13:39
Meir Shpilraien (Spielrein) and others added 3 commits July 28, 2022 09:23
Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra
Member

@yossigo please take another look.
we attempted to include only the bug fix in this PR, without fixing other things that could be considered breaking changes.
along the way a few other improvements were made and some new mechanisms were introduced.
we need to decide if it's safe to backport to 7.0, maybe after a few weeks in unstable.
maybe try to make a review sweep that ignores all the comments and tests, and just considers the actual changes that have a risk of breaking things...

Meir Shpilraien (Spielrein) and others added 2 commits July 28, 2022 16:45
Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra
Member

@redis/core-team not really a major-decision since there's no interface change, but it's a sensitive area so maybe one of you wants to review it too.
besides merging it to unstable, i'd also like to consider backporting it to 7.0, so that's probably more of a major-decision, so if you have any opinion, please comment.
either way, we better merge it to unstable to let it age a bit there, so i'd like to merge it soon.

Collaborator

@yossigo yossigo left a comment

@oranagra I agree this isn't a major decision, but it is a bug fix that includes a breaking change and has the potential for undesired side effects, so I vote for holding back on a backport at the moment.

@oranagra
Member

the PR doesn't include a breaking change (just a bug fix).
unless someone wrote code that actually relies on the bug (but that can be true for any bug fix).
it is however a sensitive area, so maybe it introduces a new bug.

@oranagra oranagra merged commit 508a138 into redis:unstable Aug 18, 2022
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Aug 18, 2022
@MeirShpilraien
Author

Notice that we have an occasional test failure because of the changes in this PR: https://github.com/redis/redis/runs/7927670765?check_suite_focus=true

The issue is that if there is a lazy expire during the command invocation, the del command is added to the replication stream after the command placeholder. So the logical order on the primary is:

  • Delete the key (lazy expiration)
  • Command invocation

But the replication stream gets it the other way around:

  • Command invocation (because the command is written into the placeholder)
  • Delete the key (lazy expiration)

So if the command writes to the key that was just lazy-expired, we will get an inconsistency between the primary and the replica.

The following test can consistently reproduce the issue:

```
start_server {tags {"repl external:skip"}} {
    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]
    $master debug SET-ACTIVE-EXPIRE 0
    start_server {} {
        set slave [srv 0 client]
        $slave debug SET-ACTIVE-EXPIRE 0
        $slave slaveof $master_host $master_port

        test "Test replication with lazy expire" {
            # wait for replication to be in sync
            wait_for_condition 50 100 {
                [lindex [$slave role] 0] eq {slave} &&
                [string match {*master_link_status:up*} [$slave info replication]]
            } else {
                fail "Can't turn the instance into a replica"
            }

            $master sadd s foo
            $master expire s 1
            after 1000
            $master sadd s foo
            assert_equal 1 [$master wait 1 1]

            assert_equal "set" [$master type s]
            assert_equal "set" [$slave type s]
        }
    }
}
```

One solution we considered is to add another lazy-expire replication stream and write all the lazy expires there. Then when replicating, we would replicate the lazy-expire replication stream first. This would solve this specific test failure, but we realised that the issues do not end here; the more we dig, the more problems we find. One example we thought of (that can actually crash Redis) is as follows:

  1. The user performs SINTERSTORE
  2. When Redis tries to fetch the second input key, it triggers a lazy expire
  3. The lazy expire triggers module logic that deletes the first input key
  4. Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach here and we would like to suggest a new approach:

  1. Modules should not perform any writes during key space notifications; we should document this and make it clear to every module writer.
  2. A module might still want to perform writes as a result of a key space notification, and the requirement is that they be atomic alongside the command that triggered the notification. For this we will introduce a new post-command hook that will allow the module to run whatever logic it wants (including writes) after the command has finished. The logical order and the replication order will be the same, because any write logic is invoked after the command has finished, and there is no risk of changing the key space while the command is running. A module will be able to register for key space notifications, collect them, and react to them only in the post-command hook.
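The proposed pattern in point 2 can be sketched in Python (hypothetical names; the hook API did not exist at this point): the notification callback only records events, and all writes happen in the post-command hook, after the triggering command has fully finished.

```python
# Simulate the proposed split: the notification callback records events,
# and the post-command hook performs the writes afterwards, so logical
# order and replication order agree.
pending_events = []
replicated = []

def on_keyspace_notification(event, key):
    pending_events.append((event, key))   # no writes here

def post_command_hook():
    # Safe point: the command is done, so writes cannot corrupt its state
    # and they replicate after it.
    for event, key in pending_events:
        replicated.append(f"incr {key}{{counter}}")
    pending_events.clear()

# Simulated flow of `set x 1`:
replicated.append("set x 1")          # command logic + its propagation
on_keyspace_notification("set", "x")  # handler only records the event
post_command_hook()                   # writes happen after the command
print(replicated)  # ['set x 1', 'incr x{counter}']
```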

We are planning to revert this PR and implement the post-command hook idea in another PR.

MeirShpilraien pushed a commit to MeirShpilraien/redis that referenced this pull request Aug 23, 2022
This PR reverts the changes made in redis#10969. The revert was triggered
by an occasional test failure that started after the PR was merged.

The issue is that if there is a lazy expire during the command invocation,
the del command is added to the replication stream after the command placeholder.
So the logical order on the primary is:

* Delete the key (lazy expiration)
* Command invocation

But the replication stream gets it the other way around:

* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)

So if the command writes to the key that was just lazy-expired, we will get an inconsistency between primary and replica.
One solution we considered is to add another lazy-expire replication stream and write all the lazy expires there.
Then when replicating, we would replicate the lazy-expire replication stream first.
This would solve this specific test failure, but we realised that the issues do not end here; the more we dig, the more problems we find.
One example we thought of (that can actually crash Redis) is as follows:

* The user performs SINTERSTORE
* When Redis tries to fetch the second input key, it triggers a lazy expire
* The lazy expire triggers module logic that deletes the first input key
* Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach and we will come up with another PR that
solves the problem differently; for now we revert the changes so we will not
have the test failures.

Notice that not the entire code was reverted; some parts of the PR are changes that we
would like to keep. The changes that were reverted are:

* Saving a placeholder for replication at the beginning of the command (`call` function)
* The order of the replication stream on active expire and eviction (we will decide how to handle it correctly in a follow-up PR)
* The `spop` changes, which are no longer needed (because we reverted the placeholder code)

Changes that were not reverted:

* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* PropagateNow function can still accept a special dbid, -1, indicating not to replicate select.
* Keep optimization for reusing the alsoPropagate array instead of allocating it each time.

Tests:

* All tests were kept and only a few were modified to work correctly with the changes
* A test was added to verify that the revert fixes the issues.
oranagra pushed a commit that referenced this pull request Aug 24, 2022
MeirShpilraien pushed a commit to MeirShpilraien/redis that referenced this pull request Aug 28, 2022
Related PRs: redis#10969, redis#11178

The following PR is a proposal for handling write operations inside key space notifications. The PR is not yet finished (see the todo list at the end), but I would like to start getting opinions on whether or not the approach makes sense and whether there are any issues that we might have missed.

After a lot of discussions we came to the conclusion that a module
should not perform any write operations on a key space notification.

Some examples of issues that such write operations can cause are described
in the following links:

* Bad replication order - redis#10969
* Use after free - redis#10969 (comment)
* Use after free - redis#9406 (comment)

There are probably more issues that are yet to be discovered. The underlying problem
with writing inside a key space notification is that the notification runs synchronously;
this means that the notification code is executed in the middle of Redis logic
(command logic, eviction, expire). Redis **does not assume** that the data might change
while running that logic, and such changes can crash Redis or cause unexpected behaviour.

One solution is to go over all those places (that assume no data changes) and fix them
so that they are tolerant to data changes. This approach would probably require a
lot of changes all around the code, and it would be very hard to know whether we got
all the issues. Moreover, it would be very hard to maintain such code and be sure that
we are not adding more places that assume no data changes.

Another approach (the one presented in this PR) is to state that modules **should not** perform
any write command inside a key space notification (we can choose whether or not we want to enforce it).
To still cover the use-case where a module wants to perform a write operation as a reaction to
key space notifications, we introduce a new API, `RedisModule_AddPostNotificationJob`, which allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the notification (the notification effect and the
  job effect will be replicated to the replicas and the AOF, wrapped with multi exec).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list of callbacks (called `modulePostKeyspaceJobs`) that need to be invoked after the current logic ends (whether it is a command, eviction, or active expire). In order to trigger those callbacks atomically with the notification effect, we call them in `propagatePendingCommands`, just before we actually propagate the effects.

If a callback performs more operations that trigger more key space notifications, those key space notifications might register more callbacks. Those callbacks will be added to the end of the `modulePostKeyspaceJobs` list and will be invoked atomically after the current callback ends. In order to avoid an infinite loop, we limit the number of jobs that can be registered in a single logic unit (command, eviction, active expire). This limit is configurable via the `maxpostnotificationsjobs` configuration value.
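A Python sketch of this job list (a simulation of the described mechanism, not the C implementation; the limit constant stands in for `maxpostnotificationsjobs`): jobs registered while the list is being drained still run in the same atomic batch, and the per-unit job count is bounded.

```python
# Simulate the proposed post-notification job list: callbacks registered
# during an execution unit run after it ends; a job may register further
# jobs, which are appended and run in the same batch, up to a cap.
MAX_POST_NOTIFICATION_JOBS = 1000   # stands in for maxpostnotificationsjobs

post_jobs = []
log = []

def add_post_notification_job(job):
    if len(post_jobs) >= MAX_POST_NOTIFICATION_JOBS:
        raise RuntimeError("post-notification job limit reached")
    post_jobs.append(job)

def run_post_jobs():
    # Called just before propagation: jobs appended while draining
    # still run in this same atomic batch.
    i = 0
    while i < len(post_jobs):
        post_jobs[i]()
        i += 1
    post_jobs.clear()

def outer_job():
    log.append("outer")
    add_post_notification_job(lambda: log.append("nested"))

add_post_notification_job(outer_job)
run_post_jobs()
print(log)  # ['outer', 'nested']
```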
oranagra added a commit that referenced this pull request Nov 24, 2022
### Summary of API additions

* `RedisModule_AddPostNotificationJob` - a new API that can be called inside a key space
  notification (and in more locations in the future) and allows adding a post job as described above.
* A new module option, `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`,
  allows disabling Redis's protection against nested key-space notifications.
* `RedisModule_GetModuleOptionsAll` - gets the mask of all supported module options, so a module
  can check whether a given option is supported by the currently running Redis instance.

### Background

This PR is a proposal for handling write operations inside module key space notifications.
After a lot of discussions we came to the conclusion that a module should not perform any write
operations on a key space notification.

Some examples of issues that such write operations can cause are described in the following links:

* Bad replication order - #10969
* Use after free - #10969 (comment)
* Use after free - #9406 (comment)

There are probably more issues that are yet to be discovered. The underlying problem with writing
inside a key space notification is that the notification runs synchronously; this means that the notification
code is executed in the middle of Redis logic (command logic, eviction, expire).
Redis **does not assume** that the data might change while running that logic, and such changes
can crash Redis or cause unexpected behaviour.

The solution is to state that modules **should not** perform any write command inside a key space
notification (we can choose whether or not we want to enforce it). To still cover the use-case where a
module wants to perform a write operation as a reaction to key space notifications, we introduce
a new API, `RedisModule_AddPostNotificationJob`, that allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the operation that triggered it (in our case, a key
  space notification).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

Although the API is currently supported only for key space notifications, it is written in a generic
way so that in the future we will be able to use it in other places (server events, for example).

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list
of callbacks (called `modulePostExecUnitJobs`) that need to be invoked after the current execution
unit ends (whether it is a command, eviction, or active expire). In order to trigger those callbacks
atomically with the notification effect, we call them in `postExecutionUnitOperations`
(which was `propagatePendingCommands` before this PR). The new function fires the post jobs
and then calls `propagatePendingCommands`.

If a callback performs more operations that trigger more key space notifications, those key
space notifications might register more callbacks. Those callbacks will be added to the end
of the `modulePostExecUnitJobs` list and will be invoked atomically after the current callback ends.
This raises a concern of entering an infinite loop. We consider infinite loops a logical bug
that needs to be fixed in the module; an attempt to protect against infinite loops by halting the
execution could violate the feature's correctness, so **Redis will make no attempt
to protect the module from infinite loops**.
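The drain loop can be sketched in Python (a simulation following the names in this commit, not the actual C code): post jobs fire first, a job may enqueue further jobs into the same atomic batch, and then the pending commands are propagated; there is deliberately no loop protection.

```python
# Simulate postExecutionUnitOperations: drain the post-exec-unit job list
# (jobs may append more jobs, which still run in this batch), then
# propagate the pending commands. No infinite-loop protection, by design.
post_exec_unit_jobs = []   # stands in for modulePostExecUnitJobs
propagated = []

def add_post_notification_job(job):
    post_exec_unit_jobs.append(job)

def post_execution_unit_operations():
    i = 0
    while i < len(post_exec_unit_jobs):
        post_exec_unit_jobs[i]()
        i += 1
    post_exec_unit_jobs.clear()
    propagated.append("propagatePendingCommands")  # then propagate effects

add_post_notification_job(lambda: propagated.append("job effect"))
post_execution_unit_operations()
print(propagated)  # ['job effect', 'propagatePendingCommands']
```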

In addition, key space notifications are currently not nested. Some modules might want to allow
nesting of key-space notifications. To allow that and keep backward compatibility, we introduce a
new module option called `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`.
Setting this option disables the Redis key-space notification nesting protection and passes
this responsibility to the module.

### Redis infrastructure

This PR promotes the existing `propagatePendingCommands` to an "Execution Unit" concept,
which is called after each atomic unit of execution.

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
madolson added a commit to madolson/redis that referenced this pull request Apr 19, 2023
…11199)

### Summary of API additions

* `RedisModule_AddPostNotificationJob` - new API to call inside a key space
  notification (and on more locations in the future) and allow to add a post job as describe above.
* New module option, `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`,
  allows to disable Redis protection of nested key-space notifications.
* `RedisModule_GetModuleOptionsAll` - gets the mask of all supported module options so a module
  will be able to check if a given option is supported by the current running Redis instance.

### Background

The following PR is a proposal of handling write operations inside module key space notifications.
After a lot of discussions we came to a conclusion that module should not perform any write
operations on key space notification.

Some examples of issues that such write operation can cause are describe on the following links:

* Bad replication oreder - redis#10969
* Used after free - redis#10969 (comment)
* Used after free - redis#9406 (comment)

There are probably more issues that are yet to be discovered. The underline problem with writing
inside key space notification is that the notification runs synchronously, this means that the notification
code will be executed in the middle on Redis logic (commands logic, eviction, expire).
Redis **do not assume** that the data might change while running the logic and such changes
can crash Redis or cause unexpected behaviour.

The solution is to state that modules **should not** perform any write command inside a key space
notification (we can choose whether or not we want to enforce it). To still cover the use case where a
module wants to perform a write operation as a reaction to key space notifications, we introduce
a new API, `RedisModule_AddPostNotificationJob`, that allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the operation that triggered it (in our case, a key
  space notification).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

Although the API is currently supported only for key space notifications, it is written in a generic
way so that in the future we will be able to use it in other places (server events, for example).

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list
of callbacks (called `modulePostExecUnitJobs`) that need to be invoked after the current execution
unit ends (whether it is a command, an eviction, or an active expire). In order to trigger those callbacks
atomically with the notification effect, we call them in `postExecutionUnitOperations`
(which was `propagatePendingCommands` before this PR). The new function fires the post jobs
and then calls `propagatePendingCommands`.
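The queue-and-drain behavior can be sketched with a toy model (plain Python, not Redis's C implementation; `post_jobs`, `run_execution_unit`, and the command body are invented for illustration):

```python
# Toy model (not Redis's actual C code) of post-execution-unit jobs:
# callbacks registered during an execution unit -- including by other
# callbacks -- run atomically at the end of that unit, in FIFO order.
post_jobs = []   # stands in for the modulePostExecUnitJobs list
log = []         # observable effects, in the order they happen

def add_post_notification_job(callback):
    """Register a job to run after the current execution unit ends."""
    post_jobs.append(callback)

def run_execution_unit(command_logic):
    command_logic()            # may fire notifications that register jobs
    while post_jobs:           # drain the queue; a job may enqueue more
        post_jobs.pop(0)()     # jobs, which run after it finishes

def command():
    log.append("set x 1")
    # the 'set' notification fires; the module reacts by scheduling a write
    add_post_notification_job(lambda: log.append("incr x{counter}"))

run_execution_unit(command)
print(log)
```

Because the drain loop keeps going until the list is empty, jobs scheduled by other jobs also run inside the same execution unit, which is also why an infinite loop here is possible if a module keeps scheduling jobs.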

If a callback performs more operations that trigger more key space notifications, those
notifications might register more callbacks. Those callbacks will be added to the end
of the `modulePostExecUnitJobs` list and invoked atomically after the current callback ends.
This raises the concern of entering an infinite loop. We consider an infinite loop a logical bug
that needs to be fixed in the module; an attempt to protect against infinite loops by halting the
execution could violate the feature's correctness, and so **Redis will make no attempt
to protect the module from infinite loops**.

In addition, key space notifications are currently not nested. Some modules might want to allow
nesting key-space notifications. To allow that while keeping backward compatibility, we introduce a
new module option called `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`.
Setting this option disables Redis' key-space notification nesting protection and
passes this responsibility to the module.

### Redis infrastructure

This PR promotes the existing `propagatePendingCommands` to an "Execution Unit" concept,
which is called after each atomic unit of execution.

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…tions (redis#10969)

Fix replication inconsistency on modules that uses key space notifications.

### The Problem

In general, key space notifications are invoked after the command logic has
executed (this is not always the case; specific commands that do not follow
this rule are discussed below). For example, `set x 1` will trigger
a `set` notification that is invoked after the `set` logic was performed, so
if the notification logic tries to fetch `x`, it will see the newly written data.
Consider a scenario in which the notification logic performs some write
commands. For example, the notification logic increments a counter,
`incr x{counter}`, indicating how many times `x` was changed.
The logical order in which the logic was executed is as follows:

```
set x 1
incr x{counter}
```

The issue is that the `set x 1` command is added to the replication buffer
at the end of the command invocation (specifically, after the key space
notification logic was invoked and performed the `incr` command).
The replication stream/AOF therefore sees the commands in the wrong order:

```
incr x{counter}
set x 1
```

In this specific example the order is less important.
But if, for example, the notification had deleted `x`, we would
end up with a primary-replica inconsistency.

### The Solution

Put the command that causes the notification in its rightful place. In the
above example, the `set x 1` command logic was executed before the
notification logic, so it should be added to the replication buffer before
the commands invoked by the notification logic. To achieve this
without a major code refactor, we save a placeholder in the replication
buffer; when the command logic finishes, we check whether the command
needs to be replicated, and if it does, we use the placeholder to add it to the
replication buffer instead of appending it to the end.

To be efficient and avoid allocating memory for the placeholder on each
command, the replication buffer array was modified to reuse memory
(instead of allocating it each time we want to replicate commands).
Also, to avoid saving a placeholder when it is not needed, we do so only for
WRITE or MAY_REPLICATE commands.
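The placeholder idea can be sketched with a toy model (plain Python, not the actual Redis code; `call`, `repl_buffer`, and the notification body are invented for illustration):

```python
# Toy sketch of the placeholder idea: reserve a slot in the replication
# buffer before running a command that may replicate, fill it in after
# the command logic completes, so writes performed by the notification
# logic land *after* the command that triggered them.
repl_buffer = []

def call(cmd, logic, may_replicate=True):
    slot = None
    if may_replicate:               # only for WRITE/MAY_REPLICATE commands
        slot = len(repl_buffer)
        repl_buffer.append(None)    # the placeholder
    logic()                         # notification logic may append writes here
    if slot is not None:
        repl_buffer[slot] = cmd     # fill the placeholder, not the tail

def notification_logic():
    repl_buffer.append("incr x{counter}")  # write done by the module

call("set x 1", notification_logic)
print(repl_buffer)
```

The stream now matches the logical execution order: the triggering command first, then the notification's writes.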

#### Additional Fixes

* Expire and eviction notifications:
  * The expire/eviction logical order was to first perform the expire/eviction
    and then the notification logic. The replication buffer got this the
    other way around (first the notification effect and then the `del` command).
    The PR fixes this issue.
  * The notification effect and the `del` command were not wrapped in
    `multi-exec` (when needed). The PR also fixes this issue.
* SPOP command:
  * On `spop`, the notification was fired before the command logic
    was executed. The change in this PR would have caused the replication
    order to change (first the `spop` command and then the notification logic),
    although the logical order is first the notification logic and then the
    `spop` logic. The right fix would have been to move the notification to
    fire after the command executes (like all the other commands),
    but this can be considered a breaking change. To overcome this, the PR
    keeps the current behavior and changes the `spop` code to keep the right
    logical order when pushing commands to the replication buffer. Another PR
    will follow to fix SPOP properly and match it to the other commands (we
    split it into two separate PRs so it will be easy to cherry-pick this PR to 7.0 if
    we choose to).

#### Unhandled Known Limitations

* Key miss event:
  * On a key miss event, if a module performs some write command in the
    event handler (using `RM_Call`), the `dirty` counter increases and the read
    command that caused the key miss event is replicated to the replication
    stream and AOF. This problem can also happen with a write command that opens
    some keys but eventually decides not to perform any action. We decided
    not to handle this problem in this PR because the solution is complex
    and would add risk in case we want to cherry-pick this PR.
    We should decide whether to handle it in future PRs. For now, module
    writers are advised not to perform any write commands on a key miss event.

#### Testing

* We already have tests covering cases where a notification invokes write
  commands that are also added to the replication buffer; the tests were modified
  to verify that the replica gets the commands in the correct logical order.
* A test was added to verify that `spop` behavior was kept unchanged.
* A test was added to verify that the key miss event behaves as expected.
* A test was added to verify that the changes do not break lazy expiration.

#### Additional Changes

* The `propagateNow` function can accept a special dbid, -1, indicating not
  to replicate `select`. We use this to replicate `multi/exec` in the `propagatePendingCommands`
  function. The side effect of this change is that the `select` command
  now appears inside the `multi/exec` block in the replication stream (instead of
  outside of the `multi/exec` block). Tests were modified to match this new behavior.
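The change to where `select` lands in the stream can be illustrated with a hypothetical rendering (not actual RESP output; `stream_before`/`stream_after` are invented names):

```python
# Hypothetical rendering of the replication stream before and after the
# change: the SELECT for the data commands moves inside the MULTI/EXEC
# block, because multi/exec themselves are propagated with dbid -1
# (meaning "do not emit a select for this command").
def stream_before(dbid, cmds):
    return [f"select {dbid}", "multi", *cmds, "exec"]

def stream_after(dbid, cmds):
    return ["multi", f"select {dbid}", *cmds, "exec"]

print(stream_before(0, ["set x 1", "incr x{counter}"]))
print(stream_after(0, ["set x 1", "incr x{counter}"]))
```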
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
This PR reverts the changes made in redis#10969.
The revert was triggered by occasional test failures
that started after the PR was merged.

The issue is that if there is a lazy expire during the command invocation,
the `del` command is added to the replication stream after the command
placeholder. So the logical order on the primary is:

* Delete the key (lazy expiration)
* Command invocation

But the replication stream gets it the other way around:

* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)

So if the command writes to the key that was just lazily expired, we get an
inconsistency between the primary and the replica.
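The failure mode can be reproduced with a toy model (plain Python, not the Redis code; names are invented for illustration):

```python
# Toy reproduction of the ordering bug that forced the revert: the
# placeholder is reserved *before* the command logic runs, so a lazy-expire
# DEL emitted during the command ends up after the command in the stream,
# reversing the primary's logical order.
repl_buffer = []

def call_with_placeholder(cmd, logic):
    slot = len(repl_buffer)
    repl_buffer.append(None)        # placeholder reserved up front
    logic()                         # lazy expiration may append here
    repl_buffer[slot] = cmd         # command written into the placeholder

def command_logic():
    # the command touches an expired key: lazy expiration deletes it first,
    # so on the primary the logical order is "del x", then the command
    repl_buffer.append("del x")

call_with_placeholder("set x 1", command_logic)
print(repl_buffer)   # the replica sees the command *before* the del
```

On the primary `x` holds the newly set value, while a replica applying this stream deletes `x` last, which is exactly the inconsistency described above.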

One solution we considered was to add another lazy-expire replication stream
and write all the lazy expires there. Then, when replicating, we would replicate the
lazy-expire replication stream first. This would solve this specific test failure, but
we realized that the issues do not end there, and the more we dug the more
problems we found. One example we thought about (one that can actually
crash Redis) is as follows:

* The user performs `SINTERSTORE`
* When Redis tries to fetch the second input key, it triggers a lazy expire
* The lazy expire triggers module logic that deletes the first input key
* Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach, and we will come up with another
PR that solves the problem differently; for now we revert the changes so that we
will not have the test failures.

Notice that not the entire change was reverted; some parts of the PR are changes
that we would like to keep. The changes that **were** reverted are:

* Saving a placeholder for replication at the beginning of the command (`call` function)
* The order of the replication stream on active expire and eviction (we will decide how
  to handle it correctly in a follow-up PR)
* The `spop` changes are no longer needed (because we reverted the placeholder code)

Changes that **were not** reverted:

* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* `propagateNow` function can still accept a special dbid, -1, indicating not to replicate `select`.
* Keep optimisation for reusing the `alsoPropagate` array instead of allocating it each time.

Tests:

* All tests were kept; only a few were modified to work correctly with the changes.
* A test was added to verify that the revert fixes the issue.
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…11199)

Labels

release-notes — indication that this issue needs to be mentioned in the release notes
state:major-decision — Requires core team consensus

Projects

Status: Done


5 participants