Skip to content

Fix random export failure#90

Merged
TxCorpi0x merged 9 commits into
masterfrom
mehdi/fix-random-upgrade-test-failure
Mar 5, 2026
Merged

Fix random export failure#90
TxCorpi0x merged 9 commits into
masterfrom
mehdi/fix-random-upgrade-test-failure

Conversation

@TxCorpi0x

@TxCorpi0x TxCorpi0x commented Feb 23, 2026

Copy link
Copy Markdown
Contributor

Description

This pull request refines the logic for synchronizing app heights in the syncAppsHeights function within the integration-tests/export/export_test.go file. The update improves robustness when handling exported app versions and ensures both the exported and initiated apps are consistently advanced to the target height.

Improvements to app height synchronization:

  • Updated syncAppsHeights to attempt loading the exported app at initialHeight, and if unavailable, load the previous version and replay a block to reach the correct height, handling cases where the export was taken before the block.
  • Ensured both exported and initiated apps are committed at the same height, using consistent error handling with requireT.NoError.

Reviewers checklist:

  • Try to write more meaningful comments with clear actions to be taken.
  • Nit-picking should be unblocking. Focus on core issues.

Authors checklist

  • Provide a concise and meaningful description
  • Review the code yourself first, before making the PR.
  • Annotate your PR in places that require explanation.
  • Think and try to split the PR to smaller PR if it is big.

This change is Reviewable

@TxCorpi0x TxCorpi0x marked this pull request as ready for review February 27, 2026 08:21
@TxCorpi0x TxCorpi0x requested a review from a team as a code owner February 27, 2026 08:21
@TxCorpi0x TxCorpi0x requested review from a team, masihyeganeh, metalarm10, miladz68 and ysv and removed request for a team February 27, 2026 08:21

@miladz68 miladz68 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miladz68 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, TxCorpi0x, and ysv).


a discussion (no related file):
I don't understand why a rollback would fix the issue. I think we should discuss what caused the problem.

@TxCorpi0x TxCorpi0x left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TxCorpi0x made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, miladz68, and ysv).


a discussion (no related file):

Previously, miladz68 (milad) wrote…

I don't understand why a rollback would fix the issue. I think we should discuss what caused the problem.

The error comes up when any of the node fails or has corrupted db, this error is the same error that we faced in the devnet, and two nodes failed because of lack of enough resource. In a normal chain, we need to use the rollback command of the cosmos-sdk appchain's binary to roleback one block and restart the node so it can catchup again with the recent network data.

So the same behaviour happened here, we rollback one block to make sure there is no failed storage read.

panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1]) [recovered]
	panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1])

@miladz68 miladz68 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miladz68 made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, TxCorpi0x, and ysv).


a discussion (no related file):

Previously, TxCorpi0x wrote…

The error comes up when any of the node fails or has corrupted db, this error is the same error that we faced in the devnet, and two nodes failed because of lack of enough resource. In a normal chain, we need to use the rollback command of the cosmos-sdk appchain's binary to roleback one block and restart the node so it can catchup again with the recent network data.

So the same behaviour happened here, we rollback one block to make sure there is no failed storage read.

panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1]) [recovered]
	panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1])

Are we sure that the problem is caused by corrupted db ?

@TxCorpi0x TxCorpi0x left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TxCorpi0x made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, miladz68, and ysv).


a discussion (no related file):

Previously, miladz68 (milad) wrote…

Are we sure that the problem is caused by corrupted db ?

This issue is random, in tests, this PR is to make sure this failure don't come up in the rest of the development and PRs.
But about the actual issue, I think we need to define an specific task to dig more into this issue, since I observed this issue after adding the PSE module, i am not sure which part of the code would cause this issue, but at least we should gain more info why this kind of error is apearing recently more often than before, these issues can be related:

  1. PSE logic and end blocker
  2. More repos (tx-xrpl-bridge and etc.) use the self-host runner recently, may be this happens when too many runs of different PRs are running.
  3. The recent upgrades of sdk and dependencies might be the reason as well.

@miladz68 miladz68 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miladz68 resolved 1 discussion.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on masihyeganeh, metalarm10, and ysv).

@TxCorpi0x TxCorpi0x merged commit ed16b4a into master Mar 5, 2026
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants