Skip to content

Remove lpAssertValidEntry from listpack functions to achieve up to 10% improvement on HSET command.#399

Closed
ashtul wants to merge 6 commits into
valkey-io:unstablefrom
ashtul:remove-lpAssertValidEntry-from-listpack-functions
Closed

Remove lpAssertValidEntry from listpack functions to achieve up to 10% improvement on HSET command.#399
ashtul wants to merge 6 commits into
valkey-io:unstablefrom
ashtul:remove-lpAssertValidEntry-from-listpack-functions

Conversation

@ashtul

@ashtul ashtul commented Apr 28, 2024

Copy link
Copy Markdown
Contributor

This PR come after a similar PR was declined on Redis repo redis/redis#11273 with additional info at redis/redis#11293.

The issue it comes to solve is excess checks for corrupted data whenever a listpack is traversing. There are flamecharts in the original PR show it can reach 10% improvement of common commands such a HSET

The reasoning against the change is that the data can be corrupted if the RESTORE command was used and with flag SANITIZE_DUMP_NO.

IMO this isn't justified. To make all users pay a significant patently because a user can potentially load data without sanitizing it first, seems excessive. Anyone who does it must consider the server may crush.

This simple change can give ValKey a nice speed boost.

image

NOTE: the current tests fail b/c there are tests which load corrupted listpacks without sanitizing them first.

@madolson

Copy link
Copy Markdown
Member

@zuiderkwast I see you commented on the previous thread, are you aligned with this? This seems like a good idea.

@zuiderkwast

Copy link
Copy Markdown
Contributor

@madolson Yes, the original PR first just removed the validation and Oran complained that there may be corrupt data from unvalidated RDB or RESTORE. I think it's better to always validate on insert rather than on lookup, even though it may affect RDB loading time.

@ashtul do you want to make a flamegraph of loading a dump with and without sanitizing? It would be good to have a view on this difference as well, just for reference.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I looked at the diff. It still just removes validation.

If we remove validation on lookup, we must do validation on insert instead.

@zuiderkwast

Copy link
Copy Markdown
Contributor

Here are some stats from when deep sanitization were introduced: redis/redis#7807 (comment). RESTORE deep is 80% slower than RESTORE shallow.

@zuiderkwast

Copy link
Copy Markdown
Contributor

If we always sanitize listpacks on load and no longer on lookup, then we shall also do the same for intset and stream.

For zipmap and ziplist (no longer used) I think we always convert them when we load them from an old dump. Let's check that santitization is done in this case too.

Then we should remove the config and ACL rules added in redis/redis#7807. (We can't remove them immediately but we can make them have no effect.)

@madolson WDYT?

@ashtul

ashtul commented Apr 29, 2024

Copy link
Copy Markdown
Contributor Author

@madolson @zuiderkwast

The stats Oran showed might be skewed due to other improvements he has done as he writes Note that initially LPOS and HGET showed severe (-25%) degradation, and after some optimizations effort (last commit) i was able to re-gain the performance loss and even improve..

What do you suggest validating an insert? Maybe validating prev, next and the following next should be EOF.

As for stats, I will try to create them after the holiday.

BTW, how hard would it be to add mechanism which will load data without sanitation but won't add the key until it is sanitized, possibly on a thread?

@zuiderkwast

Copy link
Copy Markdown
Contributor

What do you suggest validating an insert? Maybe validating prev, next and the following next should be EOF.

We only need to validate when we're loading a dump that can contain corrupt data (RESTORE or RDB). A normal insert can't add invalid data.

As for stats, I will try to create them after the holiday.

Sounds good. Anyway, I can accept a little slower RDB loading. It's more important to avoid validate on lookup. Also, the system is simpler if we never load corrupt data.

BTW, how hard would it be to add mechanism which will load data without sanitation but won't add the key until it is sanitized, possibly on a thread?

I think that would be too complex. Are you suggesting using a background thread for that? Like #356? We can do that later maybe, if you have a good idea about it, but not in the same PR.

@daniel-house

Copy link
Copy Markdown
Contributor

It seems like all of this is the classic trade-off between security and speed. Since we can't have both, and there are situations that justify either choice, it appears to me that we should be giving the solution architect the ability to choose via a new configuration property.

@zuiderkwast zuiderkwast added the major-decision-pending Major decision pending by TSC team label Apr 29, 2024
@zuiderkwast

Copy link
Copy Markdown
Contributor

@daniel-house There is already a config for that. Just the possibility of having invalid data is what prevents us from removing these asserts.

More important than load speed vs lookup speed IMO is the complexity aspect. The possibility of having invalid data in memory is a tech dept with a high maintenance cost IMO.

@zuiderkwast

Copy link
Copy Markdown
Contributor

@valkey-io/core-team Shall we remove the possibility to load potentially corrupt data? Yes 👍 or no 👎 (This is a core team vote, so non-core-team members, please don't vote on this comment. Feel free to comment in the thread though.)

@daniel-house

Copy link
Copy Markdown
Contributor

There is already a config for that. Just the possibility of having invalid data is what prevents us from removing these asserts.

It seems I was unclear. I meant to put a config around removing these asserts. Allow the asserts to be disabled via a test performed inside the in-lined functions.

@madolson

madolson commented May 1, 2024

Copy link
Copy Markdown
Member

Before I vote, to make sure I understand, the plan is to change the code so we always sanitize the load on RDB Restore and RDB Load so that we don't need to do these checks during runtime. I'm OK with that decision as long as we have performance numbers that we aren't dramatically (>25%) increased execution time on the load.

@madolson

Copy link
Copy Markdown
Member

@ashtul Sorry for the delay. Do you have any interest in updating this PR to perform the validation on the restore path. This would improve the normal runtime performance, and only slightly degrade the restore case which should happen less often.

@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Aug 26, 2024
@madolson

Copy link
Copy Markdown
Member

The decision was approved if we add the validation on the restore path.

…functions

Signed-off-by: Ariel Shtul <ashtul@gmail.com>
@codecov

codecov Bot commented Aug 28, 2024

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.07%. Comparing base (04d76d8) to head (1d43af8).
Report is 101 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #399      +/-   ##
============================================
- Coverage     70.43%   70.07%   -0.37%     
============================================
  Files           113      114       +1     
  Lines         61728    61705      -23     
============================================
- Hits          43479    43237     -242     
- Misses        18249    18468     +219     
Files with missing lines Coverage Δ
src/listpack.c 83.88% <100.00%> (-6.35%) ⬇️

... and 54 files with indirect coverage changes

@ashtul ashtul force-pushed the remove-lpAssertValidEntry-from-listpack-functions branch from ff448df to a7341b3 Compare August 28, 2024 09:41
@ashtul

ashtul commented Aug 31, 2024

Copy link
Copy Markdown
Contributor Author

@madolson @zuiderkwast
This is great. Removing this unnecessary assert will speed up valkey for many use cases.

I need help with the tests. I am unfamiliar with tcl and my solution of removing whole tests is not likely to be the right solution. Is there is anyone I can get help from? I plan to come to Vienna for the conference, would it be better to discuss it then?

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's missing in this PR is to always sanitize when loading dumps and on RESTORE.

Comment thread tests/integration/corrupt-dump.tcl
Comment thread tests/integration/corrupt-dump.tcl
@ashtul ashtul force-pushed the remove-lpAssertValidEntry-from-listpack-functions branch from a7341b3 to fb8844a Compare September 1, 2024 08:50
Signed-off-by: Ariel Shtul <ashtul@gmail.com>
@ashtul ashtul force-pushed the remove-lpAssertValidEntry-from-listpack-functions branch from c5206b5 to 4299eca Compare September 1, 2024 13:43
@ashtul

ashtul commented Sep 3, 2024

Copy link
Copy Markdown
Contributor Author

Hi @zuiderkwast,
I have rolled back some changes I have made since I don't have the knowledge in tcl and the attention span to fix the tests correctly.
The change in this PR is important and I do not want to drag committing it. I'm pretty sure some users will notice an increase in ops per second and lower latency as soon as we push it into the unstable branch. It actually gives valkey an advantage over redis.
Can someone takes the PR from here?

@zuiderkwast

Copy link
Copy Markdown
Contributor

The change in this PR is important and I do not want to drag committing it.

We are releasing 8.0 very soon. We are busy fixing the stability of this release, so this PR will have to wait until after 8.0 is released.

What is missing in this PR is to make the config sanitize-dump-payload no behave just like sanitize-dump-payload yes, so we always sanitize the payload when we load it.

* remove sanitize-dump-payload w/o mayfail

* remove mayfail tests

* remove 767
@madolson

madolson commented Oct 7, 2024

Copy link
Copy Markdown
Member

CI tests do not run and are awaiting your approval. Is that a bug?

If you haven't merged anything yet, it won't run automatically. I did run it and there is a format error, can you fix it?

Comment thread src/listpack.c Outdated
Signed-off-by: Madelyn Olson <madelyneolson@gmail.com>
@ashtul

ashtul commented Oct 8, 2024

Copy link
Copy Markdown
Contributor Author

Now I looked at the diff. It still just removes validation.

If we remove validation on lookup, we must do validation on insert instead.

Valkey's listpack is well tested and we do not fear a mistake during insertion.
Loading external data was the root issue for adding the assert.
Can you please remove the 'required changes' flag?

@zuiderkwast

Copy link
Copy Markdown
Contributor

Now I looked at the diff. It still just removes validation.
If we remove validation on lookup, we must do validation on insert instead.

Valkey's listpack is well tested and we do not fear a mistake during insertion. Loading external data was the root issue for adding the assert. Can you please remove the 'required changes' flag?

Yes, I meant when loading external data. "On insert" was unclear, sorry. You still didn't ensure we always do validation when loading external data. I'm waiting for that change.

Currently, it is still possible to load corrupt data.

@ashtul

ashtul commented Oct 13, 2024

Copy link
Copy Markdown
Contributor Author

@zuiderkwast I see your point.

@madolson suggested to ignore the SANITIZE_DUMP_NO and always perform a check on insert. As a second stage, to sanitize incoming listpacks on a second thread as to not block the main thread. We how to agree how to treat listpacks that are found to be corrupted.

@zuiderkwast

Copy link
Copy Markdown
Contributor

@madolson suggested to ignore the SANITIZE_DUMP_NO and always perform a check on insert.

Yes, that sounds right.

As a second stage, to sanitize incoming listpacks on a second thread as to not block the main thread. We how to agree how to treat listpacks that are found to be corrupted.

We'll do this while/before loading it? We offload to a thread and then wait for all such jobs to complete before we actually load them? If we do it this way, then we can also return an error just like we do for SANITIZE_DUMP_YES, right?

I don't see how we can do it after loading it, because then we will already have corrupt data in the system and maybe already accessing it.

@madolson madolson moved this to Todo in Valkey 9.0 May 19, 2025
@madolson madolson removed this from Valkey 9.0 Jul 21, 2025
@madolson madolson moved this to Optional for next release candidate in Valkey 9.1 Jul 21, 2025
@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.1 Aug 6, 2025
@zuiderkwast

Copy link
Copy Markdown
Contributor

Reminder:

The decision was approved if we add the validation on the restore path.

This is the decision we made long ago, as noted when the approved label was added: #399 (comment)

This PR is not relevant without the missing validation.

@madolson madolson moved this from Todo to Optional for next release candidate in Valkey 9.1 Feb 2, 2026
@madolson madolson moved this to Needs Review in Valkey 10 May 10, 2026
@madolson madolson removed this from Valkey 9.1 May 10, 2026
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 14, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 26, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
jjuleslasarte added a commit to jjuleslasarte/valkey that referenced this pull request May 27, 2026
Remove lpAssertValidEntry calls from lpNext, lpPrev, lpFirst, lpFind,
and lpDeleteRangeWithEntry to eliminate redundant per-operation listpack
validation. Safety is maintained by making deep_integrity_validation
unconditional in rdbLoadObject, ensuring all data structures (listpack,
intset, stream) are fully validated on every load path (RDB and RESTORE).
This effectively makes sanitize-dump-payload a no-op.

Benchmark results (Graviton3, 200 clients, P=32, 8 threads):

  256-field hash HGET: +4.7% throughput (907K -> 950K rps)
  256-field hash HSET: +1.2% throughput (688K -> 696K rps)
  100-field hash HSET: +2.4% throughput (688K -> 704K rps)
  p99 latency (256-field HSET): -16% (11.15ms -> 9.38ms)
  RDB load (1M keys): +1% overhead (acceptable)

Lightweight EOF position assertions from PR valkey-io#2027 remain intact.

Takes over: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
zuiderkwast pushed a commit that referenced this pull request May 27, 2026
Until now, deep santitization of listpack on load (RDB load, RESTORE)
was configurable. That meant that we had to do validation on access and
traversal of the listpack. This change:

- Force sanitization of listpacks on load and import.
- Deprecates the config `sanitize-dump-payload` and makes it a no-op.
- Deprecates the ACL flags `skip-sanitize-payload` and
`sanitize-payload`, ignoring them when loading old ACL files.
- Removes validation asserts on access.

Benefits:

- Prevents deferred assertions, which would happen if unsanitized
  listpacks were loaded and later accessed, leading to server
  shutdown.
- Improves read performance.

Replaces: #399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

Replaced by #3721.

@zuiderkwast zuiderkwast removed this from Valkey 10 May 28, 2026
2030XiaoGe added a commit to 2030XiaoGe/valkey that referenced this pull request May 28, 2026
…key-io#3721)

Until now, deep santitization of listpack on load (RDB load, RESTORE)
was configurable. That meant that we had to do validation on access and
traversal of the listpack. This change:

- Force sanitization of listpacks on load and import.
- Deprecates the config `sanitize-dump-payload` and makes it a no-op.
- Deprecates the ACL flags `skip-sanitize-payload` and
`sanitize-payload`, ignoring them when loading old ACL files.
- Removes validation asserts on access.

Benefits:

- Prevents deferred assertions, which would happen if unsanitized
  listpacks were loaded and later accessed, leading to server
  shutdown.
- Improves read performance.

Replaces: valkey-io#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
hanzo-dev pushed a commit to hanzoai/kv that referenced this pull request Jun 10, 2026
Until now, deep santitization of listpack on load (RDB load, RESTORE)
was configurable. That meant that we had to do validation on access and
traversal of the listpack. This change:

- Force sanitization of listpacks on load and import.
- Deprecates the config `sanitize-dump-payload` and makes it a no-op.
- Deprecates the ACL flags `skip-sanitize-payload` and
`sanitize-payload`, ignoring them when loading old ACL files.
- Removes validation asserts on access.

Benefits:

- Prevents deferred assertions, which would happen if unsanitized
  listpacks were loaded and later accessed, leading to server
  shutdown.
- Improves read performance.

Replaces: valkey-io/valkey#399

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-approved Major decision approved by TSC team performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants