Skip to content

Additional assertions for listpack#2027

Merged
madolson merged 1 commit into
valkey-io:unstablefrom
muelstefamzn:validation-1
May 12, 2025
Merged

Additional assertions for listpack#2027
madolson merged 1 commit into
valkey-io:unstablefrom
muelstefamzn:validation-1

Conversation

@muelstefamzn

Copy link
Copy Markdown
Contributor

Adds a few asserts to increase the chance of detecting corrupted listpack data.

It replaces a questionable code path that purposefully returns invalid data with a panic() and adds assert() statements to verify that LP_EOF is only detected at the end of the listpack.

Listpack data is expected to be correctly encoded. These asserts should only trigger if the listpack data was corrupted in some way, in which case it is better to assert than to return invalid data to the client.

@codecov

codecov Bot commented Apr 29, 2025

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 93.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.08%. Comparing base (70f2057) to head (eca224f).
Report is 27 commits behind head on unstable.

Files with missing lines Patch % Lines
src/listpack.c 93.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2027      +/-   ##
============================================
+ Coverage     70.97%   71.08%   +0.10%     
============================================
  Files           123      122       -1     
  Lines         66135    66175      +40     
============================================
+ Hits          46941    47039      +98     
+ Misses        19194    19136      -58     
Files with missing lines Coverage Δ
src/listpack.c 91.60% <93.33%> (+1.81%) ⬆️

... and 29 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/listpack.c Outdated
Comment thread src/listpack.c Outdated
@xbasel

xbasel commented Apr 30, 2025

Copy link
Copy Markdown
Member

Few questions:

  1. Is this addressing an issue that has occurred in practice (e.g. corruption caught in real-world usage)?
  2. The code adds assertions in a hot path. Do we have any benchmarks or profiling to understand the performance penalty of adding these assertions?
  3. How has this change been tested?

@muelstefamzn

Copy link
Copy Markdown
Contributor Author

Thank you for reviewing it!

I see a few reasons why listpack data could be invalid, all have low probability:

  • Memory corruption due to a hardware defect.
  • Snapshot corruption during storage/transfer that is not caught by the checksum.
  • A bug that creates invalid data. I'm not aware of such a bug in Valkey.

If such a situation occurs, I prefer to panic/assert over returning invalid data or undefined behaviour. Of course, this change does not guarantee that it will be caught, just makes it more likely.

I've done a simple valkey-benchmark test:

  • Pre-filled the node with 100k hash keys with four fields each, containing integers.
  • Ran valkey-benchmark -r 100000 -n 1000000 HINCRBY __rand_int__ field 1.
    This exercises lpFirst(), lpFind(), lpNext(), lpGetWithSize(), and others.
  • Repeated the test 5x with and without the change.

I did not observe a significant change in performance.

  • Without the change: 83157.334 RPS (standard deviation +-381.4)
  • With the change: 83417.814 RPS (standard deviation +-332.5)

This is hardly a comprehensive performance test, of course. At least for this setup, the difference
is less than the standard deviation (+-0.4%). By chance, averaged a little higher with the change.

Other than running the existing tests, I've manually spot checked some of the asserts by attaching
a debugger and modifying the listpack data in-situ to contain the LP_EOF byte at invalid locations.
The asserts trigger as expected, creating a crash report.

Comment thread src/listpack.c Outdated
Comment thread src/listpack.c Outdated
Comment thread src/listpack.c Outdated
Comment thread src/listpack.c
Adds a few asserts to increase the chance of detecting corrupted
listpack data.

Signed-off-by: Stefan Mueller <muelstef@amazon.com>

@madolson madolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@madolson madolson merged commit 9f7aa04 into valkey-io:unstable May 12, 2025
Comment thread src/listpack.c
uval = 12345678900000000ULL + p[0];
negstart = UINT64_MAX;
negmax = 0;
panic("Invalid listpack encoding. Byte %02hhx is not a valid encoding.", p[0]);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I suppose we can't do this. While validating listpack entries, we need to fetch some of the data out to validate the integrity. However, if the listpack itself is itself malformed, we might try to extract the data out. The alternative would be to check every single point in streamValidateListpackIntegrity to verify we still have more data to read.

muelstefamzn added a commit to muelstefamzn/valkey that referenced this pull request May 13, 2025
There is existing code that depends on the previous behaviour.
See valkey-io#2027.

Signed-off-by: Stefan Mueller <muelstef@amazon.com>
madolson pushed a commit that referenced this pull request May 13, 2025
Removing the `panic()` statement introduced previously.

There is existing code that depends on the previous behaviour. 

See #2027.

Signed-off-by: Stefan Mueller <muelstef@amazon.com>
@zuiderkwast

zuiderkwast commented May 14, 2025

Copy link
Copy Markdown
Contributor

The fact that we allow loading corrupted listpacks from RDB is the source of this problem and complexity. We have already decided that we want to validate the dump before we load it, so we will not need to allow correpted listpacks to exist in the database anymore. The decision is here: #399 (comment) but the PR was never finished.

shanwan1 pushed a commit to shanwan1/valkey that referenced this pull request Jun 13, 2025
Removing the `panic()` statement introduced previously.

There is existing code that depends on the previous behaviour.

See valkey-io#2027.

Signed-off-by: Stefan Mueller <muelstef@amazon.com>
Signed-off-by: shanwan1 <shanwan1@intel.com>
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 14, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 26, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 27, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants