fix: continue pruning if version is not found (backport #1063) #1065

Merged
aljo242 merged 1 commit into release/v1.2.x from mergify/bp/release/v1.2.x/pr-1063 on Mar 26, 2025

@mergify mergify Bot commented Mar 26, 2025

Description

We found a case on an Osmosis node where a root key points to a node that doesn't exist. This hangs the pruning process, because fetching the root key fails and returns ErrVersionDoesNotExist.

There is already code to clean up the dangling reference node, but it is never reached because the function returns ErrVersionDoesNotExist early.

As a result, pruning gets stuck and cannot prune that version of the store. This PR moves on to the next version in the store if pruning returns a not-found error.
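
A minimal sketch of the skip-and-continue behaviour described above, assuming a simplified in-memory store. The `store` type, `deleteVersion`, and `pruneTo` are hypothetical names for illustration and are not the IAVL API; the local `ErrVersionDoesNotExist` only stands in for the error named in this description.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVersionDoesNotExist is a local stand-in for the error named in the
// PR description; it is not the IAVL definition.
var ErrVersionDoesNotExist = errors.New("version does not exist")

// store is a hypothetical in-memory stand-in for a versioned store.
type store struct {
	roots map[int64]bool // versions whose root can still be loaded
}

// deleteVersion is a hypothetical helper: it fails with
// ErrVersionDoesNotExist when the root for v cannot be found.
func (s *store) deleteVersion(v int64) error {
	if !s.roots[v] {
		return fmt.Errorf("delete version %d: %w", v, ErrVersionDoesNotExist)
	}
	delete(s.roots, v)
	return nil
}

// pruneTo deletes every version in [first, to]. A missing version is
// recorded and skipped instead of aborting the loop; any other error
// still stops pruning.
func pruneTo(s *store, first, to int64) (skipped []int64, err error) {
	for v := first; v <= to; v++ {
		if err := s.deleteVersion(v); err != nil {
			if errors.Is(err, ErrVersionDoesNotExist) {
				skipped = append(skipped, v)
				continue // move on to the next version in the store
			}
			return skipped, err
		}
	}
	return skipped, nil
}

func main() {
	// Version 3 is missing, which previously would have stalled pruning.
	s := &store{roots: map[int64]bool{1: true, 2: true, 4: true, 5: true}}
	skipped, err := pruneTo(s, 1, 4)
	fmt.Println("skipped:", skipped, "err:", err, "remaining:", len(s.roots))
}
```

In this sketch the missing version is only recorded; the actual change also logs it, as the log excerpts further down show.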

Notes about legacy nodes

  • There is a nasty edge case that I think will be hit with legacy nodes, and that we need to consider before merging this in:
  1. If legacy pruning is broken,
  2. legacy pruning will error and set the first version to legacyLatestVersion+1.
  3. This has the side effect of iterating to the latest non-legacy version available.
  4. This iteration could potentially be large for some validators:
    1. those that aren't aware of, or aren't maintaining, their state
    2. it will also produce lots of log lines, as this PR adds an error log to IAVL for each pruning that's skipped
    3. those on chains that haven't fully upgraded to IAVLv1 or that depend heavily on legacy nodes

see:

Downloading state

https://snapshots.testnet.osmosis.zone/

wget -q -O - https://osmosis.fra1.cdn.digitaloceanspaces.com/osmo-test-5/snapshots/v29/osmosis-snapshot-202503251415-27294691.tar.lz4 | lz4 -d | tar -C $HOME/.osmosisd -xvf -

Alternatively, the Polkachu snapshots currently have an issue with the bank and concentratedliquidity stores:

https://polkachu.com/tendermint_snapshots/osmosis

wget -O osmosis_32290276.tar.lz4 https://snapshots.polkachu.com/snapshots/osmosis/osmosis_32290276.tar.lz4 --inet4-only
Analyzing store versions for outliers...

Stores with potentially excessive versions (may need pruning):
Store 'concentratedliquidity' has 161172 versions (average: 6746.03) - This store may need pruning
Store 'bank' has 80752 versions (average: 6746.03) - This store may need pruning

Stores with large version gaps (may indicate inconsistent pruning):
Store 'concentratedliquidity' has a large version gap: 161171 (from 32129108 to 32290279) - This may indicate inconsistent pruning
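
The report above flags stores whose version count is far above the average and stores with a large span between their first and last versions. A rough sketch of how such a check could work follows. The `storeStats` type, the `outliers` and `versionGap` helpers, the threshold factor, and the two filler stores are all assumptions for illustration; they do not show how cosmprund actually implements it.

```go
package main

import "fmt"

// storeStats is a hypothetical summary of the versions retained by one store.
type storeStats struct {
	name     string
	versions int   // number of retained versions
	first    int64 // oldest retained version
	last     int64 // newest retained version
}

// outliers returns the names of stores whose version count exceeds
// factor times the mean count across all stores. The factor is an
// assumed tuning knob, not a value taken from the tool.
func outliers(stats []storeStats, factor float64) []string {
	if len(stats) == 0 {
		return nil
	}
	total := 0
	for _, s := range stats {
		total += s.versions
	}
	mean := float64(total) / float64(len(stats))

	var out []string
	for _, s := range stats {
		if float64(s.versions) > factor*mean {
			out = append(out, s.name)
		}
	}
	return out
}

// versionGap returns the distance between the oldest and newest
// retained versions of a store.
func versionGap(s storeStats) int64 {
	return s.last - s.first
}

func main() {
	stats := []storeStats{
		// Counts and range taken from the report above.
		{name: "concentratedliquidity", versions: 161172, first: 32129108, last: 32290279},
		// Count from the report; the first/last values are made up.
		{name: "bank", versions: 80752, first: 32209528, last: 32290279},
		// Entirely made-up filler stores so that a mean can be computed.
		{name: "example-a", versions: 400, first: 32289880, last: 32290279},
		{name: "example-b", versions: 400, first: 32289880, last: 32290279},
	}
	fmt.Println("outliers:", outliers(stats, 1.2))
	fmt.Println("gap:", versionGap(stats[0]))
}
```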

I ran this PR against this state on Osmosis mainnet and it fixed the issue; see osmosis-labs/osmosis#9333.

Checking broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go check-store-versions /home/ghost/osmosis-states/osmosis-testnet-state/data

Pruning broken stores

Use this PR and run:
osmosis-labs/cosmprund#2

go run main.go prune /home/ghost/osmosis-states/osmosis-testnet-state/data

The state will then be fixed.

Things we don't know

Why are there states deleted outside of pruning? Why does this become more apparent with async pruning?

Another version of the fix

#1048

This fix works in the same way and simply continues after a version-not-found error; it moves past both checks, version and version+1.

Why this is needed

Currently, if pruning breaks with this error, the chain state starts to grow quickly.

What the fix will look like

Osmosis mainnet with broken state:

osmosis → λ git v28.0.5* → osmosisd start --home ~/osmosis-states/test 2>&1 | grep "version does not exist"
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883938 version missing=31883937
1:34PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=31883939 version missing=31883938

Before this fix, pruning would have hung here and the state would bloat.

Osmosis testnet with broken state:

1:55PM INF service stop impl="Peer{MConn{176.9.82.221:12556} ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0 out}" module=p2p msg="Stopping Peer service" peer=ade4d8bc8cbe014af6ebdf3cb7b1e9ad36f412c0@176.9.82.221:12556
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202688 version missing=7202687
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202689 version missing=7202688
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7202690 version missing=7202689
1:55PM INF commit is for a block we do not know about; set ProposalBlock=nil commit=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 commit_round=0 height=27208281 module=consensus proposal=
1:55PM INF received complete proposal block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus
1:55PM INF finalizing commit of block hash=D8FFDC467BB88FE33CD49519B063E564BFC219668D4734F8A3DC05EC441B06D8 height=27208281 module=consensus num_txs=0 root=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F
1:55PM INF finalized block block_app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state num_txs_res=0 num_val_updates=0
1:55PM INF executed block app_hash=BA816B7934B12755217F0F4BDA6264743CDC6E18B3E8858D86CE37618EBA57B7 height=27208281 module=state
1:55PM INF committed state block_app_hash=08CB5DDE28307231EF8D5D5B9BEF99BB803058B08AF1FCE0095E5620E381E14F height=27208281 module=state
1:55PM INF Timed out dur=443.168164 height=27208282 module=consensus round=0 step=RoundStepNewHeight
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=6667061 version missing=6667060
1:55PM INF Timed out dur=1600 height=27208282 module=consensus round=0 step=RoundStepPropose
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203887 version missing=7203886
1:55PM ERR Error while pruning, moving on the the next version in the store err="version does not exist" module=server next version=7203888 version missing=7203887

This represents a large backlog: pruning is at version 7203887 while the chain is at height 27208281, a gap of about 20 million versions.

Summary by CodeRabbit

  • Bug Fixes
    • Improved error handling during version removal, now logging missing versions while allowing the process to continue smoothly.
    • Refined the handling of orphan node traversal, ensuring operations proceed only when valid data is present, thereby enhancing overall system stability.

This is an automatic backport of pull request #1063 done by [Mergify](https://mergify.com).

@mergify mergify Bot requested a review from a team March 26, 2025 17:42

coderabbitai Bot commented Mar 26, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@aljo242 aljo242 enabled auto-merge (squash) March 26, 2025 17:51
@aljo242 aljo242 disabled auto-merge March 26, 2025 17:52
@aljo242 aljo242 enabled auto-merge (squash) March 26, 2025 17:52
@aljo242 aljo242 merged commit b31498d into release/v1.2.x Mar 26, 2025
@aljo242 aljo242 deleted the mergify/bp/release/v1.2.x/pr-1063 branch March 26, 2025 17:52