@MeirShpilraien MeirShpilraien commented Jul 13, 2022

NOTICE! The main part of this change was later reverted in #11178 and is not included in 7.2 RC1, despite what the release notes state.
See #10969 (comment) for the reason.


Fix replication inconsistency on modules that use key space notifications.

The Problem

In general, key space notifications are invoked after the command logic has executed (this is not always the case; we will later discuss specific commands that do not follow this rule). For example, set x 1 will trigger a set notification that is invoked after the set logic has been performed, so if the notification logic tries to fetch x, it will see the newly written data.
Consider a scenario in which the notification logic performs some write commands. For example, the notification logic increments a counter, incr x{counter}, indicating how many times x was changed. The logical order in which the logic executes is as follows:

```
set x 1
incr x{counter}
```

The issue is that the set x 1 command is added to the replication buffer at the end of the command invocation (specifically, after the key space notification logic was invoked and performed the incr command). The replication/AOF sees the commands in the wrong order:

```
incr x{counter}
set x 1
```

In this specific example the order is less important. But if, for example, the notification had deleted x, we would end up with primary-replica inconsistency.
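The buggy ordering can be seen in a minimal Python simulation (illustrative only, not Redis code; the names are made up): the triggering command is appended to the replication buffer only after the notification handler has already replicated its own write.

```python
# Simulate the pre-fix propagation order: command logic runs, the key space
# notification fires (and replicates its own write), and only then is the
# triggering command appended to the replication buffer -- too late.
repl_buffer = []

def notification_handler(key):
    # The handler reacts to the 'set' notification with its own write,
    # which is propagated immediately.
    repl_buffer.append(f"incr {key}{{counter}}")

def execute_set(key, value):
    # 1. Command logic runs (not modeled here).
    # 2. Key space notification fires and replicates its writes.
    notification_handler(key)
    # 3. The triggering command is appended at the end.
    repl_buffer.append(f"set {key} {value}")

execute_set("x", 1)
print(repl_buffer)  # ['incr x{counter}', 'set x 1'] -- reversed vs. logical order
```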

The Solution

Put the command that causes the notification in its rightful place. In the above example, the set x 1 command logic is executed before the notification logic, so it should be added to the replication buffer before the commands invoked by the notification logic. To achieve this without a major code refactoring, we save a placeholder in the replication buffer; when the command logic finishes, we check whether the command needs to be replicated, and if it does, we use the placeholder to add it to the replication buffer instead of appending it to the end.

To be efficient and avoid allocating memory for each command just to save the placeholder, the replication buffer array was modified to reuse memory (instead of allocating it each time we want to replicate commands). Also, to avoid saving a placeholder when it is not needed, we do this only for WRITE or MAY_REPLICATE commands.
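The placeholder mechanism described above can be sketched in Python (an illustrative simulation; the names are not Redis's): a slot is reserved before the command logic runs, and the triggering command is written into it afterwards, so it lands before the notification's writes.

```python
# Simulate the fix: reserve a placeholder slot in the replication buffer
# before the command runs, then fill it once the command logic (and any
# notification writes) have finished.
repl_buffer = []

def notification_handler(key):
    repl_buffer.append(f"incr {key}{{counter}}")

def execute_set(key, value):
    slot = len(repl_buffer)        # reserve the placeholder
    repl_buffer.append(None)
    notification_handler(key)      # notification writes land *after* the slot
    repl_buffer[slot] = f"set {key} {value}"  # fill the placeholder

execute_set("x", 1)
print(repl_buffer)  # ['set x 1', 'incr x{counter}'] -- matches the logical order
```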

Additional Fixes

  • Expire and Eviction notifications:
    • Expire/Eviction logical order was to first perform the Expire/Eviction and then the notification logic. The replication buffer got this the other way around (first the notification effect and then the del command). This PR fixes this issue.
    • The notification effect and the del command were not wrapped with multi-exec (when needed). This PR also fixes this issue.
  • SPOP command:
    • On spop, the spop notification was fired before the command logic was executed. The change in this PR would have caused the replication order to change (first the spop command and then the notification logic), although the logical order is first the notification logic and then the spop logic. The right fix would have been to move the notification to fire after the command was executed (like all the other commands), but this can be considered a breaking change. To overcome this, the PR keeps the current behaviour and changes the spop code to keep the right logical order when pushing commands to the replication buffer. Another PR will follow to fix SPOP properly and match it to the other commands (we split it into 2 separate PRs so it will be easy to cherry-pick this PR to 7.0 if we choose to).
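The expire/eviction fix from the bullets above can be illustrated with a small Python sketch (a simulation under assumed names, not the actual implementation): the del is emitted first, and the whole effect is wrapped in multi/exec only when the notification produced additional commands.

```python
# Simulate the fixed expiration propagation: the expired key's `del` comes
# first (matching the logical order on the primary), and the `del` plus any
# commands issued by the notification handler are wrapped in one MULTI/EXEC
# block when there is more than one effect.
def propagate_expiration(key, notification_cmds):
    cmds = [f"del {key}"] + notification_cmds
    if len(cmds) > 1:   # wrap only when needed
        return ["multi"] + cmds + ["exec"]
    return cmds

print(propagate_expiration("x", ["incr expired{counter}"]))
# ['multi', 'del x', 'incr expired{counter}', 'exec']
print(propagate_expiration("y", []))
# ['del y'] -- no wrapping needed for a lone del
```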

Unhandled Known Limitations

  • key miss event:
    • On a key miss event, if a module performs some write command in the event handler (using RM_Call), the dirty counter increases and the read command that caused the key miss event is replicated to the replicas and the AOF. This problem can also happen with a write command that opens some keys but eventually decides not to perform any action. We decided not to handle this problem in this PR because the solution is complex and would add risk if we later want to cherry-pick this PR. We should decide whether we want to handle it in future PRs (@oranagra @yossigo @madolson FYI). For now, module writers are advised not to perform any write commands on a key miss event.

Testing

  • We already have tests covering cases where a notification invokes write commands that are also added to the replication buffer; these tests were modified to verify that the replica gets the commands in the correct logical order.
  • A test was added to verify that spop behaviour was kept unchanged.
  • A test was added to verify that key miss events behave as expected.
  • A test was added to verify that the changes do not break lazy expiration.

Additional Changes

  • Following @guybe7's suggestion, the propagateNow function can accept a special dbid, -1, indicating not to replicate select. We use this to replicate multi/exec in the propagatePendingCommands function. The side effect of this change is that the select command now appears inside the multi/exec block in the replication stream (instead of outside it). Tests were modified to match this new behaviour.
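The propagateNow change can be sketched as follows (a Python simulation of the described semantics; the real function lives in Redis's C code): a dbid of -1 suppresses the SELECT, which is why the SELECT now shows up inside the multi/exec block rather than before it.

```python
# Simulate the described propagateNow semantics: dbid == -1 means
# "do not emit a SELECT" (used when replicating the multi/exec wrappers
# themselves); any other dbid emits SELECT only on a db change.
def propagate_now(dbid, cmd, stream, current_db):
    if dbid != -1 and dbid != current_db[0]:
        stream.append(f"select {dbid}")
        current_db[0] = dbid
    stream.append(cmd)

stream, current_db = [], [-1]
propagate_now(-1, "multi", stream, current_db)   # no select emitted
propagate_now(0, "set x 1", stream, current_db)  # select lands inside the block
propagate_now(-1, "exec", stream, current_db)
print(stream)  # ['multi', 'select 0', 'set x 1', 'exec']
```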

@oranagra
Member

@MeirShpilraien didn't we agree to fix spop in a breaking manner?
or did we say that we'll first fix it in a non-breaking manner, and then break it in a later version?

@oranagra oranagra added the state:major-decision Requires core team consensus label Jul 13, 2022
@MeirShpilraien
Author

@oranagra we said that we will first fix it without breaking, so we can cherry-pick this to 7.0 if we want to. We will fix it properly later, in another PR.

@oranagra
Member

Ohh, I forgot. Better outline that plan in the top comment.
@yossigo do you think we can put this in 7.0?

@MeirShpilraien MeirShpilraien requested a review from guybe7 July 21, 2022 13:39
Meir Shpilraien (Spielrein) and others added 3 commits July 28, 2022 09:23
Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra
Member

@yossigo please take another look.
we attempted to include only the bug fix in this PR, without fixing other things that could be considered breaking changes.
along the way a few other improvements were made and some new mechanisms were introduced.
we need to decide if it's safe to backport to 7.0, maybe after a few weeks in unstable.
maybe try to make a review sweep that ignores all the comments and tests, and just considers the actual changes that have a risk of breaking things...

Meir Shpilraien (Spielrein) and others added 2 commits July 28, 2022 16:45
Co-authored-by: Oran Agra <oran@redislabs.com>
@oranagra
Member

@redis/core-team not really a major-decision since there's no interface change, but it's a sensitive area so maybe one of you wants to review it too.
besides merging it to unstable, i'd also like to consider backporting it to 7.0, so that's probably more of a major-decision, so if you have any opinion, please comment.
either way, we better merge it to unstable to let it age a bit there, so i'd like to merge it soon.

Collaborator

@yossigo yossigo left a comment

@oranagra I agree this isn't a major decision, but it is a bug fix that includes a breaking change and has the potential for undesired side effects, so I vote for holding back on a backport at the moment.

@oranagra
Member

the PR doesn't include a breaking change (just a bug fix).
unless someone wrote code that actually relies on the bug (but that can be true for any bug fix).
it is however a sensitive area, so maybe it introduces a new bug.

@oranagra oranagra merged commit 508a138 into redis:unstable Aug 18, 2022
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Aug 18, 2022
@MeirShpilraien
Author

Notice that we have an occasional test failure because of the changes in this PR: https://github.com/redis/redis/runs/7927670765?check_suite_focus=true

The issue is that if there is a lazy expire during the command invocation, the del command is added to the replication stream after the command placeholder. So the logical order on the primary is:

  • Delete the key (lazy expiration)
  • Command invocation

But the replication stream gets it the other way around:

  • Command invocation (because the command is written into the placeholder)
  • Delete the key (lazy expiration)

So if the command writes to the key that was just lazy-expired, we will get an inconsistency between the primary and the replica.

The following test can consistently reproduce the issue:

```
start_server {tags {"repl external:skip"}} {
    set master [srv 0 client]
    set master_host [srv 0 host]
    set master_port [srv 0 port]
    $master debug SET-ACTIVE-EXPIRE 0
    start_server {} {
        set slave [srv 0 client]
        $slave debug SET-ACTIVE-EXPIRE 0
        $slave slaveof $master_host $master_port

        test "Test replication with lazy expire" {
            # wait for replication to be in sync
            wait_for_condition 50 100 {
                [lindex [$slave role] 0] eq {slave} &&
                [string match {*master_link_status:up*} [$slave info replication]]
            } else {
                fail "Can't turn the instance into a replica"
            }

            $master sadd s foo
            $master expire s 1
            after 1000
            $master sadd s foo
            assert_equal 1 [$master wait 1 1]

            assert_equal "set" [$master type s]
            assert_equal "set" [$slave type s]
        }
    }
}
```

One solution we considered is to add another lazy-expire replication stream and write all the lazy expires there. Then when replicating, we would replicate the lazy-expire replication stream first. This would solve this specific test failure, but we realised that the issues do not end here; the more we dig, the more problems we find. One example we thought of (that can actually crash Redis) is as follows:

  1. The user performs SINTERSTORE
  2. When Redis tries to fetch the second input key, it triggers a lazy expire
  3. The lazy expire triggers module logic that deletes the first input key
  4. Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach here and we would like to suggest a new approach:

  1. Modules should not perform any writes during key space notifications; we should document this and make it clear to every module writer.
  2. A module might still want to perform writes as a result of a key space notification, and the requirement is that they be atomic alongside the command that triggered the notification. For this we will introduce a new post-command hook that will allow the module to run whatever logic it wants (including writes) after the command has finished. The logical order and the replication order will be the same, because any write logic is invoked after the command has finished, and there is no risk of changing the key space while the command is running. A module will be able to register for key space notifications, collect them, and react to them only in the post-command hook.
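The proposed pattern in point 2 can be sketched in Python (hypothetical names; the hook API did not exist at this point): the notification callback only records events, and all writes happen in the post-command hook, after the triggering command has fully finished.

```python
# Simulate the proposed split: the notification callback records events,
# and the post-command hook performs the writes afterwards, so logical
# order and replication order agree.
pending_events = []
replicated = []

def on_keyspace_notification(event, key):
    pending_events.append((event, key))   # no writes here

def post_command_hook():
    # Safe point: the command is done, so writes cannot corrupt its state
    # and they replicate after it.
    for event, key in pending_events:
        replicated.append(f"incr {key}{{counter}}")
    pending_events.clear()

# Simulated flow of `set x 1`:
replicated.append("set x 1")          # command logic + its propagation
on_keyspace_notification("set", "x")  # handler only records the event
post_command_hook()                   # writes happen after the command
print(replicated)  # ['set x 1', 'incr x{counter}']
```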

We are planning to revert this PR and implement the post-command hook idea in another PR.

MeirShpilraien pushed a commit to MeirShpilraien/redis that referenced this pull request Aug 23, 2022
This PR reverts the changes made in redis#10969. The revert was triggered
by an occasional test failure that started after the PR was merged.

The issue is that if there is a lazy expire during the command invocation,
the del command is added to the replication stream after the command placeholder.
So the logical order on the primary is:

* Delete the key (lazy expiration)
* Command invocation

But the replication stream gets it the other way around:

* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)

So if the command writes to the key that was just lazy-expired, we will get an inconsistency between primary and replica.
One solution we considered is to add another lazy-expire replication stream and write all the lazy expires there.
Then when replicating, we would replicate the lazy-expire replication stream first.
This would solve this specific test failure, but we realised that the issues do not end here; the more we dig, the more problems we find.
One example we thought of (that can actually crash Redis) is as follows:

* The user performs SINTERSTORE
* When Redis tries to fetch the second input key, it triggers a lazy expire
* The lazy expire triggers module logic that deletes the first input key
* Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach and we will come up with another PR that
solves the problem differently; for now we revert the changes so we will not
have the test failures.

Notice that not the entire code was reverted; some parts of the PR are changes that we
would like to keep. The changes that were reverted are:

* Saving a placeholder for replication at the beginning of the command (`call` function)
* The order of the replication stream on active expire and eviction (we will decide how to handle it correctly in a follow-up PR)
* The `spop` changes, which are no longer needed (because we reverted the placeholder code)

Changes that were not reverted:

* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* PropagateNow function can still accept a special dbid, -1, indicating not to replicate select.
* Keep optimization for reusing the alsoPropagate array instead of allocating it each time.

Tests:

* All tests were kept and only a few were modified to work correctly with the changes
* A test was added to verify that the revert fixes the issues.
oranagra pushed a commit that referenced this pull request Aug 24, 2022
MeirShpilraien pushed a commit to MeirShpilraien/redis that referenced this pull request Aug 28, 2022
Related PRs: redis#10969, redis#11178

The following PR is a proposal for handling write operations inside key space notifications. The PR is not yet finished (see the todo list at the end), but I would like to start getting opinions on whether or not the approach makes sense and whether there are any issues that we might have missed.

After a lot of discussions we came to the conclusion that a module
should not perform any write operations on a key space notification.

Some examples of issues that such write operations can cause are described
in the following links:

* Bad replication order - redis#10969
* Use after free - redis#10969 (comment)
* Use after free - redis#9406 (comment)

There are probably more issues that are yet to be discovered. The underlying problem
with writing inside a key space notification is that the notification runs synchronously;
this means that the notification code is executed in the middle of Redis logic
(command logic, eviction, expire). Redis **does not assume** that the data might change
while running that logic, and such changes can crash Redis or cause unexpected behaviour.

One solution is to go over all those places (that assume no data changes) and fix them
so that they are tolerant to data changes. This approach would probably require a
lot of changes all around the code, and it would be very hard to know whether we got
all the issues. Moreover, it would be very hard to maintain such code and be sure that
we are not adding more places that assume no data changes.

Another approach (the one presented in this PR) is to state that modules **should not** perform
any write command inside a key space notification (we can choose whether or not we want to enforce it).
To still cover the use-case where a module wants to perform a write operation as a reaction to
key space notifications, we introduce a new API, `RedisModule_AddPostNotificationJob`, which allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the notification (the notification effect and the
  job effect will be replicated to the replicas and the AOF, wrapped with multi exec).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list of callbacks (called `modulePostKeyspaceJobs`) that need to be invoked after the current logic ends (whether it is a command, eviction, or active expire). In order to trigger those callbacks atomically with the notification effect, we call them in `propagatePendingCommands`, just before we actually propagate the effects.

If a callback performs more operations that trigger more key space notifications, those key space notifications might register more callbacks. Those callbacks will be added to the end of the `modulePostKeyspaceJobs` list and will be invoked atomically after the current callback ends. In order to avoid an infinite loop, we limit the number of jobs that can be registered in a single logic unit (command, eviction, active expire). This limit is configurable via the `maxpostnotificationsjobs` configuration value.
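A Python sketch of this job list (a simulation of the described mechanism, not the C implementation; the limit constant stands in for `maxpostnotificationsjobs`): jobs registered while the list is being drained still run in the same atomic batch, and the per-unit job count is bounded.

```python
# Simulate the proposed post-notification job list: callbacks registered
# during an execution unit run after it ends; a job may register further
# jobs, which are appended and run in the same batch, up to a cap.
MAX_POST_NOTIFICATION_JOBS = 1000   # stands in for maxpostnotificationsjobs

post_jobs = []
log = []

def add_post_notification_job(job):
    if len(post_jobs) >= MAX_POST_NOTIFICATION_JOBS:
        raise RuntimeError("post-notification job limit reached")
    post_jobs.append(job)

def run_post_jobs():
    # Called just before propagation: jobs appended while draining
    # still run in this same atomic batch.
    i = 0
    while i < len(post_jobs):
        post_jobs[i]()
        i += 1
    post_jobs.clear()

def outer_job():
    log.append("outer")
    add_post_notification_job(lambda: log.append("nested"))

add_post_notification_job(outer_job)
run_post_jobs()
print(log)  # ['outer', 'nested']
```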
oranagra added a commit that referenced this pull request Nov 24, 2022
### Summary of API additions

* `RedisModule_AddPostNotificationJob` - a new API that can be called inside a key space
  notification (and in more locations in the future) and allows adding a post job as described above.
* A new module option, `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`,
  allows disabling Redis's protection against nested key-space notifications.
* `RedisModule_GetModuleOptionsAll` - gets the mask of all supported module options, so a module
  can check whether a given option is supported by the currently running Redis instance.

### Background

This PR is a proposal for handling write operations inside module key space notifications.
After a lot of discussions we came to the conclusion that a module should not perform any write
operations on a key space notification.

Some examples of issues that such write operations can cause are described in the following links:

* Bad replication order - #10969
* Use after free - #10969 (comment)
* Use after free - #9406 (comment)

There are probably more issues that are yet to be discovered. The underlying problem with writing
inside a key space notification is that the notification runs synchronously; this means that the notification
code is executed in the middle of Redis logic (command logic, eviction, expire).
Redis **does not assume** that the data might change while running that logic, and such changes
can crash Redis or cause unexpected behaviour.

The solution is to state that modules **should not** perform any write command inside a key space
notification (we can choose whether or not we want to enforce it). To still cover the use-case where a
module wants to perform a write operation as a reaction to key space notifications, we introduce
a new API, `RedisModule_AddPostNotificationJob`, that allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the operation that triggered it (in our case, a key
  space notification).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

Although the API is currently supported only for key space notifications, it is written in a generic
way so that in the future we will be able to use it in other places (server events, for example).

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list
of callbacks (called `modulePostExecUnitJobs`) that need to be invoked after the current execution
unit ends (whether it is a command, eviction, or active expire). In order to trigger those callbacks
atomically with the notification effect, we call them in `postExecutionUnitOperations`
(which was `propagatePendingCommands` before this PR). The new function fires the post jobs
and then calls `propagatePendingCommands`.

If a callback performs more operations that trigger more key space notifications, those key
space notifications might register more callbacks. Those callbacks will be added to the end
of the `modulePostExecUnitJobs` list and will be invoked atomically after the current callback ends.
This raises a concern of entering an infinite loop. We consider infinite loops a logical bug
that needs to be fixed in the module; an attempt to protect against infinite loops by halting the
execution could violate the feature's correctness, so **Redis will make no attempt
to protect the module from infinite loops**.
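The drain loop can be sketched in Python (a simulation following the names in this commit, not the actual C code): post jobs fire first, a job may enqueue further jobs into the same atomic batch, and then the pending commands are propagated; there is deliberately no loop protection.

```python
# Simulate postExecutionUnitOperations: drain the post-exec-unit job list
# (jobs may append more jobs, which still run in this batch), then
# propagate the pending commands. No infinite-loop protection, by design.
post_exec_unit_jobs = []   # stands in for modulePostExecUnitJobs
propagated = []

def add_post_notification_job(job):
    post_exec_unit_jobs.append(job)

def post_execution_unit_operations():
    i = 0
    while i < len(post_exec_unit_jobs):
        post_exec_unit_jobs[i]()
        i += 1
    post_exec_unit_jobs.clear()
    propagated.append("propagatePendingCommands")  # then propagate effects

add_post_notification_job(lambda: propagated.append("job effect"))
post_execution_unit_operations()
print(propagated)  # ['job effect', 'propagatePendingCommands']
```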

In addition, key space notifications are currently not nested. Some modules might want to allow
nesting of key-space notifications. To allow that and keep backward compatibility, we introduce a
new module option called `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`.
Setting this option disables the Redis key-space notification nesting protection and passes
this responsibility to the module.

### Redis infrastructure

This PR promotes the existing `propagatePendingCommands` to an "Execution Unit" concept,
which is called after each atomic unit of execution.

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
madolson added a commit to madolson/redis that referenced this pull request Apr 19, 2023
…11199)

### Summary of API additions

* `RedisModule_AddPostNotificationJob` - new API to call inside a key space
  notification (and on more locations in the future) and allow to add a post job as describe above.
* New module option, `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`,
  allows to disable Redis protection of nested key-space notifications.
* `RedisModule_GetModuleOptionsAll` - gets the mask of all supported module options so a module
  will be able to check if a given option is supported by the current running Redis instance.

### Background

The following PR is a proposal of handling write operations inside module key space notifications.
After a lot of discussions we came to a conclusion that module should not perform any write
operations on key space notification.

Some examples of issues that such write operation can cause are describe on the following links:

* Bad replication oreder - redis#10969
* Used after free - redis#10969 (comment)
* Used after free - redis#9406 (comment)

There are probably more issues that are yet to be discovered. The underline problem with writing
inside key space notification is that the notification runs synchronously, this means that the notification
code will be executed in the middle on Redis logic (commands logic, eviction, expire).
Redis **do not assume** that the data might change while running the logic and such changes
can crash Redis or cause unexpected behaviour.

The solution is to state that modules **should not** perform any write command inside a key space
notification (we can choose whether or not we want to enforce it). To still cover the use case where a
module wants to perform a write operation as a reaction to key space notifications, we introduce
a new API, `RedisModule_AddPostNotificationJob`, that allows registering a callback that will be
called by Redis when the following conditions hold:

* It is safe to perform any write operation.
* The job will be called atomically alongside the operation that triggered it (in our case, a key
  space notification).

A module can use this new API to safely perform any write operation and still achieve atomicity
between the notification and the write.

Although the API is currently supported only for key space notifications, it is written in a generic
way so that in the future we will be able to use it in other places (server events, for example).

### Technical Details

Whenever a module uses `RedisModule_AddPostNotificationJob`, the callback is added to a list
of callbacks (called `modulePostExecUnitJobs`) that need to be invoked after the current execution
unit ends (whether it is a command, an eviction, or an active expire). In order to trigger those callbacks
atomically with the notification effect, we call them in `postExecutionUnitOperations`
(which was `propagatePendingCommands` before this PR). The new function fires the post jobs
and then calls `propagatePendingCommands`.
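The queue-and-drain behavior can be sketched with a toy model (plain Python, not Redis's C implementation; `post_jobs`, `run_execution_unit`, and the command body are invented for illustration):

```python
# Toy model (not Redis's actual C code) of post-execution-unit jobs:
# callbacks registered during an execution unit -- including by other
# callbacks -- run atomically at the end of that unit, in FIFO order.
post_jobs = []   # stands in for the modulePostExecUnitJobs list
log = []         # observable effects, in the order they happen

def add_post_notification_job(callback):
    """Register a job to run after the current execution unit ends."""
    post_jobs.append(callback)

def run_execution_unit(command_logic):
    command_logic()            # may fire notifications that register jobs
    while post_jobs:           # drain the queue; a job may enqueue more
        post_jobs.pop(0)()     # jobs, which run after it finishes

def command():
    log.append("set x 1")
    # the 'set' notification fires; the module reacts by scheduling a write
    add_post_notification_job(lambda: log.append("incr x{counter}"))

run_execution_unit(command)
print(log)
```

Because the drain loop keeps going until the list is empty, jobs scheduled by other jobs also run inside the same execution unit, which is also why an infinite loop here is possible if a module keeps scheduling jobs.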

If a callback performs more operations that trigger more key space notifications, those
notifications might register more callbacks. Those callbacks will be added to the end
of the `modulePostExecUnitJobs` list and invoked atomically after the current callback ends.
This raises the concern of entering an infinite loop. We consider an infinite loop a logical bug
that needs to be fixed in the module; an attempt to protect against infinite loops by halting the
execution could violate the feature's correctness, and so **Redis will make no attempt
to protect the module from infinite loops**.

In addition, key space notifications are currently not nested. Some modules might want to allow
nesting key-space notifications. To allow that while keeping backward compatibility, we introduce a
new module option called `REDISMODULE_OPTIONS_ALLOW_NESTED_KEYSPACE_NOTIFICATIONS`.
Setting this option disables Redis' key-space notification nesting protection and
passes this responsibility to the module.

### Redis infrastructure

This PR promotes the existing `propagatePendingCommands` to an "Execution Unit" concept,
which is called after each atomic unit of execution.

Co-authored-by: Oran Agra <oran@redislabs.com>
Co-authored-by: Yossi Gottlieb <yossigo@gmail.com>
Co-authored-by: Madelyn Olson <34459052+madolson@users.noreply.github.com>
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…tions (redis#10969)

Fix replication inconsistency on modules that uses key space notifications.

### The Problem

In general, key space notifications are invoked after the command logic has
executed (this is not always the case; specific commands that do not follow
this rule are discussed below). For example, `set x 1` will trigger
a `set` notification that is invoked after the `set` logic was performed, so
if the notification logic tries to fetch `x`, it will see the newly written data.
Consider a scenario in which the notification logic performs some write
commands. For example, the notification logic increments a counter,
`incr x{counter}`, indicating how many times `x` was changed.
The logical order in which the logic was executed is as follows:

```
set x 1
incr x{counter}
```

The issue is that the `set x 1` command is added to the replication buffer
at the end of the command invocation (specifically, after the key space
notification logic was invoked and performed the `incr` command).
The replication stream/AOF therefore sees the commands in the wrong order:

```
incr x{counter}
set x 1
```

In this specific example the order is less important.
But if, for example, the notification had deleted `x`, we would
end up with a primary-replica inconsistency.

### The Solution

Put the command that causes the notification in its rightful place. In the
above example, the `set x 1` command logic was executed before the
notification logic, so it should be added to the replication buffer before
the commands invoked by the notification logic. To achieve this
without a major code refactor, we save a placeholder in the replication
buffer; when the command logic finishes, we check whether the command
needs to be replicated, and if it does, we use the placeholder to add it to the
replication buffer instead of appending it to the end.

To be efficient and avoid allocating memory for the placeholder on each
command, the replication buffer array was modified to reuse memory
(instead of allocating it each time we want to replicate commands).
Also, to avoid saving a placeholder when it is not needed, we do so only for
WRITE or MAY_REPLICATE commands.
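The placeholder idea can be sketched with a toy model (plain Python, not the actual Redis code; `call`, `repl_buffer`, and the notification body are invented for illustration):

```python
# Toy sketch of the placeholder idea: reserve a slot in the replication
# buffer before running a command that may replicate, fill it in after
# the command logic completes, so writes performed by the notification
# logic land *after* the command that triggered them.
repl_buffer = []

def call(cmd, logic, may_replicate=True):
    slot = None
    if may_replicate:               # only for WRITE/MAY_REPLICATE commands
        slot = len(repl_buffer)
        repl_buffer.append(None)    # the placeholder
    logic()                         # notification logic may append writes here
    if slot is not None:
        repl_buffer[slot] = cmd     # fill the placeholder, not the tail

def notification_logic():
    repl_buffer.append("incr x{counter}")  # write done by the module

call("set x 1", notification_logic)
print(repl_buffer)
```

The stream now matches the logical execution order: the triggering command first, then the notification's writes.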

#### Additional Fixes

* Expire and eviction notifications:
  * The expire/eviction logical order was to first perform the expire/eviction
    and then the notification logic. The replication buffer got this the
    other way around (first the notification effect and then the `del` command).
    The PR fixes this issue.
  * The notification effect and the `del` command were not wrapped in
    `multi-exec` (when needed). The PR also fixes this issue.
* SPOP command:
  * On `spop`, the notification was fired before the command logic
    was executed. The change in this PR would have caused the replication
    order to change (first the `spop` command and then the notification logic),
    although the logical order is first the notification logic and then the
    `spop` logic. The right fix would have been to move the notification to
    fire after the command executes (like all the other commands),
    but this can be considered a breaking change. To overcome this, the PR
    keeps the current behavior and changes the `spop` code to keep the right
    logical order when pushing commands to the replication buffer. Another PR
    will follow to fix SPOP properly and match it to the other commands (we
    split it into two separate PRs so it will be easy to cherry-pick this PR to 7.0 if
    we choose to).

#### Unhandled Known Limitations

* Key miss event:
  * On a key miss event, if a module performs some write command in the
    event handler (using `RM_Call`), the `dirty` counter increases and the read
    command that caused the key miss event is replicated to the replication
    stream and AOF. This problem can also happen with a write command that opens
    some keys but eventually decides not to perform any action. We decided
    not to handle this problem in this PR because the solution is complex
    and would add risk in case we want to cherry-pick this PR.
    We should decide whether to handle it in future PRs. For now, module
    writers are advised not to perform any write commands on a key miss event.

#### Testing

* We already have tests covering cases where a notification invokes write
  commands that are also added to the replication buffer; the tests were modified
  to verify that the replica gets the commands in the correct logical order.
* A test was added to verify that `spop` behavior was kept unchanged.
* A test was added to verify that the key miss event behaves as expected.
* A test was added to verify that the changes do not break lazy expiration.

#### Additional Changes

* The `propagateNow` function can accept a special dbid, -1, indicating not
  to replicate `select`. We use this to replicate `multi/exec` in the `propagatePendingCommands`
  function. The side effect of this change is that the `select` command
  now appears inside the `multi/exec` block in the replication stream (instead of
  outside of the `multi/exec` block). Tests were modified to match this new behavior.
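The change to where `select` lands in the stream can be illustrated with a hypothetical rendering (not actual RESP output; `stream_before`/`stream_after` are invented names):

```python
# Hypothetical rendering of the replication stream before and after the
# change: the SELECT for the data commands moves inside the MULTI/EXEC
# block, because multi/exec themselves are propagated with dbid -1
# (meaning "do not emit a select for this command").
def stream_before(dbid, cmds):
    return [f"select {dbid}", "multi", *cmds, "exec"]

def stream_after(dbid, cmds):
    return ["multi", f"select {dbid}", *cmds, "exec"]

print(stream_before(0, ["set x 1", "incr x{counter}"]))
print(stream_after(0, ["set x 1", "incr x{counter}"]))
```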
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
This PR reverts the changes made in redis#10969.
The revert was triggered by occasional test failures
that started after the PR was merged.

The issue is that if there is a lazy expire during the command invocation,
the `del` command is added to the replication stream after the command
placeholder. So the logical order on the primary is:

* Delete the key (lazy expiration)
* Command invocation

But the replication stream gets it the other way around:

* Command invocation (because the command is written into the placeholder)
* Delete the key (lazy expiration)

So if the command writes to the key that was just lazily expired, we get an
inconsistency between the primary and the replica.
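The failure mode can be reproduced with a toy model (plain Python, not the Redis code; names are invented for illustration):

```python
# Toy reproduction of the ordering bug that forced the revert: the
# placeholder is reserved *before* the command logic runs, so a lazy-expire
# DEL emitted during the command ends up after the command in the stream,
# reversing the primary's logical order.
repl_buffer = []

def call_with_placeholder(cmd, logic):
    slot = len(repl_buffer)
    repl_buffer.append(None)        # placeholder reserved up front
    logic()                         # lazy expiration may append here
    repl_buffer[slot] = cmd         # command written into the placeholder

def command_logic():
    # the command touches an expired key: lazy expiration deletes it first,
    # so on the primary the logical order is "del x", then the command
    repl_buffer.append("del x")

call_with_placeholder("set x 1", command_logic)
print(repl_buffer)   # the replica sees the command *before* the del
```

On the primary `x` holds the newly set value, while a replica applying this stream deletes `x` last, which is exactly the inconsistency described above.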

One solution we considered was to add another lazy-expire replication stream
and write all the lazy expires there. Then, when replicating, we would replicate the
lazy-expire replication stream first. This would solve this specific test failure, but
we realized that the issues do not end there, and the more we dug the more
problems we found. One example we thought about (one that can actually
crash Redis) is as follows:

* The user performs `SINTERSTORE`
* When Redis tries to fetch the second input key, it triggers a lazy expire
* The lazy expire triggers module logic that deletes the first input key
* Now Redis holds the robj of the first input key, which was actually freed

We believe we took the wrong approach, and we will come up with another
PR that solves the problem differently; for now we revert the changes so that we
will not have the test failures.

Notice that not the entire change was reverted; some parts of the PR are changes
that we would like to keep. The changes that **were** reverted are:

* Saving a placeholder for replication at the beginning of the command (`call` function)
* The order of the replication stream on active expire and eviction (we will decide how
  to handle it correctly in a follow-up PR)
* The `spop` changes are no longer needed (because we reverted the placeholder code)

Changes that **were not** reverted:

* On expire/eviction, wrap the `del` and the notification effect in a multi exec.
* `propagateNow` function can still accept a special dbid, -1, indicating not to replicate `select`.
* Keep optimisation for reusing the `alsoPropagate` array instead of allocating it each time.

Tests:

* All tests were kept; only a few were modified to work correctly with the changes.
* A test was added to verify that the revert fixes the issue.
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…11199)

Labels

release-notes — indication that this issue needs to be mentioned in the release notes
state:major-decision — Requires core team consensus

Projects

Status: Done


5 participants