-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Sanitize dump payload: ziplist, listpack, zipmap, intset, stream #7807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Notes:
@redis/core-team please have a look and comment. |
madolson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started looking through it, I got through some of the data structures.
src/debug.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've noticed sometimes we add testing stuff here, and other times you add them as configs. Is there a reason we don't always add them here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK the configs that we want some people to actually use in production are put in CONFIG, and the ones we only add for the purpose of the test suite are put in DEBUG.
The two important differences IMHO are:
- config options should later be supported forever, we can't remove them. (e.g. no one can argue that we need to keep supporting
SET-ACTIVE-EXPIREforever) - config options also work right on startup (even before redis starts loading an AOF). this is the reason we sometime put test-suite tunables as config rather than DEBUG (e.g. key-load-delay)
other considerations may be:
- this specific config is dangerous, we don't want it to be enabled with config (we assume DEBUG is blocked in production clusters anyway).
src/listpack.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ever called with deep = 1? Are you just keeping the code around in case someone else uses listpack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, in the future, when listpacks are used by other data types, they can use this method.
initially i used it to validate the listpacks for streams in rdb.c, but i later realized that i need to validate that the actual records in the listpack make sense in the context of a stream to avoid runtime crashes (added streamValidateListpackIntegrity), and i didn't want to run over all the records in the listpack twice, so i made streamValidateListpackIntegrity do both the validation of the listpack structure and the stream in one pass.
|
rebased, and added 8 new commits (by topic) |
|
@redis/core-team i don't expect you all to review all the code changes, i'll do that with Yossi. but please post a message that approves the feature and it's interfaces. i.e. have a look at the top comment here and at redis.conf. |
|
Some measurements i did that show degradation (positive is improvement). Note that initially LPOS and HGET showed severe (-25%) degradation, and after some optimizations effort (last commit) i was able to re-gain the performance loss and even improve. Obviously the biggest impact is on the (now optional) RESTORE that does full (deep) sanitization
Details |
|
pushed one final commit with code review fixes. |
|
@yossigo i'm done re-ordering and squashing my commits. |
src/acl.c
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be a trivial comment, but the only importance of the order of these flags is in ACLDescribeUser, which would make this come first. I'd rather have it towards the end :)
When loading an encoded payload we will at least do a shallow validation to check that the size that's encoded in the payload matches the size of the allocation. This let's us later use this encoded size to make sure the various offsets inside encoded payload don't reach outside the allocation, if they do, we'll assert/panic, but at least we won't segfault or smear memory. We can also do 'deep' validation which runs on all the records of the encoded payload and validates that they don't contain invalid offsets. This lets us detect corruptions early and reject a RESTORE command rather than accepting it and asserting (crashing) later when accessing that payload via some command. configuration: - adding ACL flag skip-sanitize-payload - adding config sanitize-dump-payload [yes/no/clients] For now, we don't have a good way to ensure MIGRATE in cluster resharding isn't being slowed down by these sanitation, so i'm setting the default value to `no`, but later on it should be set to `clients` by default. changes: - changing rdbReportError not to `exit` in RESTORE command - adding a new stat to be able to later check if cluster MIGRATE isn't being slowed down by sanitation.
- improve stream rdb encoding test to include more types of stream metadata - add test to cover various ziplist encoding entries (although it does look like the stress test above it is able to find some too - add another test for ziplist encoding for hash with full sanitization - add similar ziplist encoding tests for list
…it exposed The test creates keys with various encodings, DUMP them, corrupt the payload and RESTORES it. It utilizes the recently added use-exit-on-panic config to distinguish between asserts and segfaults. If the restore succeeds, it runs random commands on the key to attempt to trigger a crash. It runs in two modes, one with deep sanitation enabled and one without. In the first one we don't expect any assertions or segfaults, in the second one we expect assertions, but no segfaults. We also check for leaks and invalid reads using valgrind, and if we find them we print the commands that lead to that issue. Changes in the code (other than the test): - Replace a few NPD (null pointer deference) flows and division by zero with an assertion, so that it doesn't fail the test. (since we set the server to use `exit` rather than `abort` on assertion). - Fix quite a lot of flows in rdb.c that could have lead to memory leaks in RESTORE command (since it now responds with an error rather than panic) - Add a DEBUG flag for SET-SKIP-CHECKSUM-VALIDATION so that the test don't need to bother with faking a valid checksum - Remove a pile of code in serverLogObjectDebugInfo which is actually unsafe to run in the crash report (see comments in the code) - fix a missing boundary check in lzf_decompress test suite infra improvements: - be able to run valgrind checks before the process terminates - rotate log files when restarting servers
when using --baseport to run two tests suite in parallel (different folders), we need to also make sure the port used by the testsuite to communicate with it's workers is unique. otherwise the attept to find a free port connects to the other test suite and messes it. maybe one day we need to attempt to bind, instead of connect when tring to find a free port.
If RESTORE passes successfully with full sanitization, we can't affort to crash later on assertion due to duplicate records in a hash when converting it form ziplist to dict. This means that when doing full sanitization, we must make sure there are no duplicate records in any of the collections.
When RDB input attempts to make a huge memory allocation that fails, RESTORE should fail gracefully rather than die with panic
First, if the ziplist header is surely inside the ziplist, do fast path decoding rather than the careful one. In that case, streamline the encoding if-else chain to be executed only once, and the encoding validity tested at the end. encourage inlining likely / unlikely hints for speculative execution Assertion used _exit(1) to tell the compiler that the code after them is not reachable and get rid of warnings. But in some cases assertions are placed inside tight loops, and any piece of code in them can slow down execution (code cache and other reasons), instead using either abort() or better yet, unreachable builtin.
Starting from 6.2, after ACL SETUSER user reset, the user will carry the sanitize-payload flag. I guess it was added by mistake in redis#7807, reset means reverting to default. Added tests to make sure it won't be broken in the future, and updated the comment in ACLSetUser. Fixes redis#11278
…11279) Starting from 6.2, after ACL SETUSER user reset, the user will carry the sanitize-payload flag. It was added in #7807, and then ACL SETUSER reset is inconsistent with default newly created user which missing sanitize-payload flag. Same as `off` and `on` these two bits are mutually exclusive, the default created user needs to have sanitize-payload flag. Adds USER_FLAG_SANITIZE_PAYLOAD flag to ACLCreateUser. Note that the bug don't have any real implications, since the code in rdb.c (rdbLoadObject) checks for `USER_FLAG_SANITIZE_PAYLOAD_SKIP`, so the fact that `USER_FLAG_SANITIZE_PAYLOAD` is missing doesn't really matters. Added tests to make sure it won't be broken in the future, and updated the comment in ACLSetUser and redis.conf
…edis#11279) Starting from 6.2, after ACL SETUSER user reset, the user will carry the sanitize-payload flag. It was added in redis#7807, and then ACL SETUSER reset is inconsistent with default newly created user which missing sanitize-payload flag. Same as `off` and `on` these two bits are mutually exclusive, the default created user needs to have sanitize-payload flag. Adds USER_FLAG_SANITIZE_PAYLOAD flag to ACLCreateUser. Note that the bug don't have any real implications, since the code in rdb.c (rdbLoadObject) checks for `USER_FLAG_SANITIZE_PAYLOAD_SKIP`, so the fact that `USER_FLAG_SANITIZE_PAYLOAD` is missing doesn't really matters. Added tests to make sure it won't be broken in the future, and updated the comment in ACLSetUser and redis.conf
…edis#11279) Starting from 6.2, after ACL SETUSER user reset, the user will carry the sanitize-payload flag. It was added in redis#7807, and then ACL SETUSER reset is inconsistent with default newly created user which missing sanitize-payload flag. Same as `off` and `on` these two bits are mutually exclusive, the default created user needs to have sanitize-payload flag. Adds USER_FLAG_SANITIZE_PAYLOAD flag to ACLCreateUser. Note that the bug don't have any real implications, since the code in rdb.c (rdbLoadObject) checks for `USER_FLAG_SANITIZE_PAYLOAD_SKIP`, so the fact that `USER_FLAG_SANITIZE_PAYLOAD` is missing doesn't really matters. Added tests to make sure it won't be broken in the future, and updated the comment in ACLSetUser and redis.conf
Sanitize dump payload: ziplist, listpack, zipmap, intset, stream
When loading an encoded payload we will at least do a shallow validation to
check that the size that's encoded in the payload matches the size of the
allocation.
This let's us later use this encoded size to make sure the various offsets
inside encoded payload don't reach outside the allocation, if they do, we'll
assert/panic, but at least we won't segfault or smear memory.
We can also do 'deep' validation which runs on all the records of the encoded
payload and validates that they don't contain invalid offsets. This lets us
detect corruptions early and reject a RESTORE command rather than accepting
it and asserting (crashing) later when accessing that payload via some command.
configuration:
For now, we don't have a good way to ensure MIGRATE in cluster resharding isn't
being slowed down by these sanitation, so i'm setting the default value to
no,but later on it should be set to
clientsby default.changes:
exitin RESTORE commandslowed down by sanitation.
Fuzz tester and fixes for segfaults and leaks it exposed
The test creates keys with various encodings, DUMP them, corrupt the payload and RESTORES it.
It utilizes the recently added use-exit-on-panic config to distinguish between asserts and segfaults.
If the restore succeeds, it runs random commands on the key to attempt to trigger a crash.
It runs in two modes, one with deep sanitation enabled and one without.
In the first one we don't expect any assertions or segfaults, in the second one we expect assertions, but no segfaults.
We also check for leaks and invalid reads using valgrind, and if we find them we print the commands that lead to that issue.
Changes in the code (other than the test):
exitrather thanaborton assertion).test suite infra improvements: