Skip to content

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies #3234

Merged
sergio-mena merged 7 commits intomainfrom
sergio/3233-evidence-missing
Jun 11, 2024
Merged

fix(e2e): reproduction an fixing of missing evidence in e2e nightlies #3234
sergio-mena merged 7 commits intomainfrom
sergio/3233-evidence-missing

Conversation

@sergio-mena
Copy link
Collaborator

@sergio-mena sergio-mena commented Jun 10, 2024

Closes #3233

This PR is is to be interpreted commit by commit.

  • Commit 1: Changes in `e2e runner to repro the problem (these changes will stay).
  • Commit 2. Making the problem repro appear in this PR (look at the ❌ next to commit 2 below, and check the e2e logs)
  • Commit 3: Fix. (look at the ✅ next to commit 3 below)
  • Commit 4: Revert commit 2
  • Commit 5: Adapt nightlies to start running all regression manifests in (newly introduced) test/e2e/networks_regressions directory. Also, adapt manifest generator to create testnets with more evidences.
  • Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some evidences can be delayed in their transit to the next validator, long enough to become invalid (too old). The delay is achieved by having a validator with skewed clock, in such a way that every time it proposes (untimely proposal) other peers shut down their p2p connection with that validator. The validator having the skewed clock is the one receiving all evidences.

The root cause is that, in e2e, the max age of evidence is too restrictive (500ms in time, and 7 heights in terms blocks). So, if the gossiping of transactions is delayed, some of them are too old when they arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in e2e: 1500ms, or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I haven't been able to repro it after multiple attempts.


PR checklist

  • Tests written/updated
  • [ ] Changelog entry added in .changelog (we use unclog to manage our changelog)
  • [ ] Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

@sergio-mena sergio-mena self-assigned this Jun 10, 2024
@sergio-mena sergio-mena requested a review from a team as a code owner June 10, 2024 12:53
@sergio-mena sergio-mena requested a review from a team June 10, 2024 12:53
@sergio-mena sergio-mena changed the title bug(e2e): reproduction an fixing of missing evidence in e2enightlies fix(e2e): reproduction an fixing of missing evidence in e2enightlies Jun 10, 2024
@cometbft cometbft deleted a comment from github-actions bot Jun 10, 2024
@sergio-mena sergio-mena changed the title fix(e2e): reproduction an fixing of missing evidence in e2enightlies fix(e2e): reproduction an fixing of missing evidence in e2e nightlies Jun 10, 2024
@sergio-mena sergio-mena added bug Something isn't working backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x backport-to-v1.x e2e Related to our end-to-end tests labels Jun 10, 2024
Copy link
Collaborator

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@sergio-mena sergio-mena added this pull request to the merge queue Jun 11, 2024
Merged via the queue into main with commit bb6c702 Jun 11, 2024
@sergio-mena sergio-mena deleted the sergio/3233-evidence-missing branch June 11, 2024 08:10
mergify bot pushed a commit that referenced this pull request Jun 11, 2024
…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit bb6c702)

# Conflicts:
#	.github/workflows/e2e-nightly-1x.yml
mergify bot pushed a commit that referenced this pull request Jun 11, 2024
…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit bb6c702)

# Conflicts:
#	.github/workflows/e2e-nightly-1x.yml
#	.github/workflows/e2e-nightly-38x.yml
#	test/e2e/generator/generate.go
#	test/e2e/networks/ci.toml
sergio-mena added a commit that referenced this pull request Jun 11, 2024
sergio-mena added a commit that referenced this pull request Jun 11, 2024
…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
sergio-mena added a commit that referenced this pull request Jun 11, 2024
sergio-mena added a commit that referenced this pull request Jun 11, 2024
…es (#3234)

Closes #3233

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
sergio-mena added a commit that referenced this pull request Jun 11, 2024
…es (backport #3234) (#3239)

Closes #3233 

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3234 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Sergio Mena <sergio@informal.systems>
sergio-mena added a commit that referenced this pull request Jun 11, 2024
…es (backport #3234) (#3240)

Closes #3233 

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3234 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Sergio Mena <sergio@informal.systems>
hvanz pushed a commit that referenced this pull request Jun 20, 2024
…es (backport #3234) (#3239)

Closes #3233 

This PR is is to be interpreted commit by commit.

* Commit 1: Changes in `e2e runner to repro the problem (these changes
will stay).
* Commit 2. Making the problem repro appear in this PR (look at the ❌
next to commit 2 below, and check the `e2e` logs)
* Commit 3: Fix. (look at the ✅ next to commit 3 below)
* Commit 4: Revert commit 2
* Commit 5: Adapt nightlies to start running all regression manifests in
(newly introduced) `test/e2e/networks_regressions` directory. Also,
adapt manifest generator to create testnets with more evidences.
* Further commits: fix e2e infra

The reproduction consists in coming up with a testnet where some
evidences can be delayed in their transit to the next validator, long
enough to become invalid (too old). The delay is achieved by having a
validator with skewed clock, in such a way that every time it proposes
(untimely proposal) other peers shut down their p2p connection with that
validator. The validator having the skewed clock is the one receiving
all evidences.

The root cause is that, in `e2e`, the max age of evidence is too
restrictive (500ms in time, and 7 heights in terms blocks). So, if the
gossiping of transactions is delayed, some of them are too old when they
arrive to the first validator that can propose them.

The fix consists in relaxing the max age of evidences in `e2e`: 1500ms,
or 14 heights.

Without the fix, the e2e run fails almost every time. With the fix, I
haven't been able to repro it after multiple attempts.

---

#### PR checklist

- [x] Tests written/updated
- ~~[ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)~~
- ~~[ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments~~
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3234 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Sergio Mena <sergio@informal.systems>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x bug Something isn't working e2e Related to our end-to-end tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

e2e nightly fail sometimes because some injected evidence is lost

3 participants