Fix random export failure#90
Conversation
miladz68
left a comment
There was a problem hiding this comment.
@miladz68 reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, TxCorpi0x, and ysv).
a discussion (no related file):
I don't understand why a rollback would fix the issue. I think we should discuss what caused the problem.
TxCorpi0x
left a comment
There was a problem hiding this comment.
@TxCorpi0x made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, miladz68, and ysv).
a discussion (no related file):
Previously, miladz68 (milad) wrote…
I don't understand why a rollback would fix the issue. I think we should discuss what caused the problem.
The error comes up when any of the node fails or has corrupted db, this error is the same error that we faced in the devnet, and two nodes failed because of lack of enough resource. In a normal chain, we need to use the rollback command of the cosmos-sdk appchain's binary to roleback one block and restart the node so it can catchup again with the recent network data.
So the same behaviour happened here, we rollback one block to make sure there is no failed storage read.
panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1]) [recovered]
panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1])
miladz68
left a comment
There was a problem hiding this comment.
@miladz68 made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, TxCorpi0x, and ysv).
a discussion (no related file):
Previously, TxCorpi0x wrote…
The error comes up when any of the node fails or has corrupted db, this error is the same error that we faced in the devnet, and two nodes failed because of lack of enough resource. In a normal chain, we need to use the
rollbackcommand of the cosmos-sdk appchain's binary to roleback one block and restart the node so it can catchup again with the recent network data.So the same behaviour happened here, we rollback one block to make sure there is no failed storage read.
panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1]) [recovered] panic: version 1810 was already saved to different hash from 3D8D3DB7F429156B6BCD81232657AF5C1691B3234E9273A91C63BDD56DF3D7B0 (existing nodeKey [0 0 0 0 0 0 7 18 0 0 0 1])
Are we sure that the problem is caused by corrupted db ?
TxCorpi0x
left a comment
There was a problem hiding this comment.
@TxCorpi0x made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on masihyeganeh, metalarm10, miladz68, and ysv).
a discussion (no related file):
Previously, miladz68 (milad) wrote…
Are we sure that the problem is caused by
corrupted db?
This issue is random, in tests, this PR is to make sure this failure don't come up in the rest of the development and PRs.
But about the actual issue, I think we need to define an specific task to dig more into this issue, since I observed this issue after adding the PSE module, i am not sure which part of the code would cause this issue, but at least we should gain more info why this kind of error is apearing recently more often than before, these issues can be related:
- PSE logic and end blocker
- More repos (tx-xrpl-bridge and etc.) use the self-host runner recently, may be this happens when too many runs of different PRs are running.
- The recent upgrades of sdk and dependencies might be the reason as well.
miladz68
left a comment
There was a problem hiding this comment.
@miladz68 resolved 1 discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on masihyeganeh, metalarm10, and ysv).
Description
This pull request refines the logic for synchronizing app heights in the
syncAppsHeightsfunction within theintegration-tests/export/export_test.gofile. The update improves robustness when handling exported app versions and ensures both the exported and initiated apps are consistently advanced to the target height.Improvements to app height synchronization:
syncAppsHeightsto attempt loading the exported app atinitialHeight, and if unavailable, load the previous version and replay a block to reach the correct height, handling cases where the export was taken before the block.requireT.NoError.Reviewers checklist:
Authors checklist
This change is