Skip to content

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies (backport #3234)#3239

Merged
sergio-mena merged 4 commits intov1.xfrom
mergify/bp/v1.x/pr-3234
Jun 11, 2024
Merged

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies (backport #3234)#3239
sergio-mena merged 4 commits intov1.xfrom
mergify/bp/v1.x/pr-3234

Conversation

@mergify
Copy link
Contributor

@mergify mergify bot commented Jun 11, 2024

Closes #3233

This PR is is to be interpreted commit by commit.

  • Commit 1: Changes in `e2e runner to repro the problem (these changes will stay).
  • Commit 2. Making the problem repro appear in this PR (look at the ❌ next to commit 2 below, and check the e2e logs)
  • Commit 3: Fix. (look at the ✅ next to commit 3 below)
  • Commit 4: Revert commit 2
  • Commit 5: Adapt nightlies to start running all regression manifests in (newly introduced) test/e2e/networks_regressions directory. Also, adapt manifest generator to create testnets with more evidences.
  • Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some evidences can be delayed in their transit to the next validator, long enough to become invalid (too old). The delay is achieved by having a validator with skewed clock, in such a way that every time it proposes (untimely proposal) other peers shut down their p2p connection with that validator. The validator having the skewed clock is the one receiving all evidences.

The root cause is that, in e2e, the max age of evidence is too restrictive (500ms in time, and 7 heights in terms blocks). So, if the gossiping of transactions is delayed, some of them are too old when they arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in e2e: 1500ms, or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I haven't been able to repro it after multiple attempts.


PR checklist

  • Tests written/updated
  • [ ] Changelog entry added in .changelog (we use unclog to manage our changelog)
  • [ ] Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

This is an automatic backport of pull request #3234 done by [Mergify](https://mergify.com).

…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit bb6c702)

# Conflicts:
#	.github/workflows/e2e-nightly-1x.yml
@mergify mergify bot requested a review from a team as a code owner June 11, 2024 08:11
@mergify mergify bot requested a review from a team June 11, 2024 08:11
@mergify mergify bot added the conflicts label Jun 11, 2024
@mergify
Copy link
Contributor Author

mergify bot commented Jun 11, 2024

Cherry-pick of bb6c702 has failed:

On branch mergify/bp/v1.x/pr-3234
Your branch is up to date with 'origin/v1.x'.

You are currently cherry-picking commit bb6c7027e.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Changes to be committed:
	modified:   .github/workflows/e2e-manual-multiversion.yml
	modified:   .github/workflows/e2e-manual.yml
	modified:   .github/workflows/e2e-nightly-38x.yml
	modified:   .github/workflows/e2e-nightly-main.yml
	modified:   test/e2e/generator/generate.go
	modified:   test/e2e/networks/ci.toml
	new file:   test/e2e/networks_regressions/evidence_fail.toml
	modified:   test/e2e/pkg/testnet.go
	modified:   test/e2e/runner/evidence.go

Unmerged paths:
  (use "git add/rm <file>..." as appropriate to mark resolution)
	deleted by us:   .github/workflows/e2e-nightly-1x.yml

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
Copy link
Collaborator

@andynog andynog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

@sergio-mena sergio-mena merged commit 6ab83a8 into v1.x Jun 11, 2024
@sergio-mena sergio-mena deleted the mergify/bp/v1.x/pr-3234 branch June 11, 2024 15:59
hvanz pushed a commit that referenced this pull request Jun 20, 2024
…es (backport #3234) (#3239)

Closes #3233 

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3234 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Sergio Mena <sergio@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants