Skip to content

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies (backport #3234)#3240

Merged
sergio-mena merged 4 commits intov0.38.xfrom
mergify/bp/v0.38.x/pr-3234
Jun 11, 2024
Merged

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies (backport #3234)#3240
sergio-mena merged 4 commits intov0.38.xfrom
mergify/bp/v0.38.x/pr-3234

Conversation

@mergify
Copy link
Contributor

@mergify mergify bot commented Jun 11, 2024

Closes #3233

This PR is is to be interpreted commit by commit.

  • Commit 1: Changes in `e2e runner to repro the problem (these changes will stay).
  • Commit 2. Making the problem repro appear in this PR (look at the ❌ next to commit 2 below, and check the e2e logs)
  • Commit 3: Fix. (look at the ✅ next to commit 3 below)
  • Commit 4: Revert commit 2
  • Commit 5: Adapt nightlies to start running all regression manifests in (newly introduced) test/e2e/networks_regressions directory. Also, adapt manifest generator to create testnets with more evidences.
  • Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some evidences can be delayed in their transit to the next validator, long enough to become invalid (too old). The delay is achieved by having a validator with skewed clock, in such a way that every time it proposes (untimely proposal) other peers shut down their p2p connection with that validator. The validator having the skewed clock is the one receiving all evidences.

The root cause is that, in e2e, the max age of evidence is too restrictive (500ms in time, and 7 heights in terms blocks). So, if the gossiping of transactions is delayed, some of them are too old when they arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in e2e: 1500ms, or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I haven't been able to repro it after multiple attempts.


PR checklist

  • Tests written/updated
  • [ ] Changelog entry added in .changelog (we use unclog to manage our changelog)
  • [ ] Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

This is an automatic backport of pull request #3234 done by [Mergify](https://mergify.com).

…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit bb6c702)

# Conflicts:
#	.github/workflows/e2e-nightly-1x.yml
#	.github/workflows/e2e-nightly-38x.yml
#	test/e2e/generator/generate.go
#	test/e2e/networks/ci.toml
@mergify mergify bot requested a review from a team as a code owner June 11, 2024 08:11
@mergify mergify bot added the conflicts label Jun 11, 2024
@mergify

This comment was marked as resolved.

…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
@sergio-mena sergio-mena merged commit ee556e3 into v0.38.x Jun 11, 2024
@sergio-mena sergio-mena deleted the mergify/bp/v0.38.x/pr-3234 branch June 11, 2024 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant