feat(conductor): implement restart logic#1463
Conversation
SuperFluffy
left a comment
There was a problem hiding this comment.
Requesting changes because checking the error condition for a restart is flaky, and because the way a new conductor is spun up is prone to refactoring errors:
I would prefer if a conductor restart used the same Conductor::new into Conductor::run_until_stopped logic that is used to spin it up in the first place.
The easiest way to accomplish this might be to change Conductor into a wrapper around a conductor task like Conductor(JoinHandle<ConductorImpl>), and then ConductorImpl signals the reason for its exit, letting the public-facing Conductor wrapper decide to restart or not.
SuperFluffy
left a comment
There was a problem hiding this comment.
Changes are going in the right direction, thank you! I think we should split the public-facing conductor and the inner business logic by modules. There is quite a bit of duplicated naming for both objects, and going through the single module makes that confusing.
Also have a second look at the cancellation tokens. If you clone a token, then cancelling it will automatically cancel the other, which might not be what you intended. Instead, check out https://docs.rs/tokio-util/0.7.12/tokio_util/sync/struct.CancellationToken.html#method.child_token, which allows you to cancel tokens only at the level of the child token (or below).
| report_exit(&exit_reason, message); | ||
| let shutdown_res = self.shutdown().await; | ||
|
|
||
| if should_restart { | ||
| return Ok(RestartToken::Restart); | ||
| } | ||
|
|
||
| if let RestartToken::Restart = shutdown_res { | ||
| return Ok(RestartToken::Restart); | ||
| } | ||
| exit_reason?; | ||
| Ok(RestartToken::Shutdown) |
There was a problem hiding this comment.
Can you streameline these lines? Right now I see 4 different ways to exit on 8 lines, which is a bit much.
For example, the shutdown method could take an extra argument that isFailedTask { name: String, error: JoinError<eyre::Result<()>>, } and do all the checks in one function.
There was a problem hiding this comment.
Tried to streamline a fair bit in 4f605ea. Let me know if there's more to be done in this regard!
| } | ||
|
|
||
| impl Conductor { | ||
| impl InnerConductorTask { |
There was a problem hiding this comment.
This mixes nomenclature a bit since "task" for us refers to a task spawned on the tokio runtime, whereas this object implements the conductor business logic. ConductorInner or ConductorImpl would be common names to give to this:
| impl InnerConductorTask { | |
| impl ConductorImpl { |
There was a problem hiding this comment.
I chose ConductorInner since I thought it was slightly more appropriate, and also to my mind Impl kind of implies a trait. But, if using {MODULENAME}Impl is standard I'm happy to change it
There was a problem hiding this comment.
It's common to use Thing in the public API, and ThingImpl (or ThingInner) for the implementation deatils.
I would have been ok with ConductorInner as I said :D I just took offense to using InnerConductorTask
| } | ||
|
|
||
| impl Conductor { | ||
| impl InnerConductorTask { |
There was a problem hiding this comment.
It's common to use Thing in the public API, and ThingImpl (or ThingInner) for the implementation deatils.
I would have been ok with ConductorInner as I said :D I just took offense to using InnerConductorTask
SuperFluffy
left a comment
There was a problem hiding this comment.
This will be a nice feature to have. Thank you for implementing it!
* main: feat(conductor): implement restart logic (#1463) fix: ignore `RUSTSEC-2024-0370` (#1483) fix, refactor(sequencer): refactor ics20 logic (#1495) fix(ci): use commit SHA instead of PR number preview-env images (#1501) chore(bridge-withdrawer): pass GRPC and CometBFT clients to consumers directly (#1510) fix(sequencer): Fix incorrect error message from BridgeUnlock actions (#1505) fix(bridge-contracts): fix memo transaction hash encoding (#1428) fix: build docker when workflow explicitly includes component (#1498) chore(sequencer): migrate from `anyhow::Result` to `eyre::Result` (#1387) fix(ci): typo for required field in sequencer preview-env (#1500) feat(ci): provide demo/preview environments (#1406)
Summary
Added logic to restart conductor tasks if it receives a
PermissionDeniedcode from the execution layer.Background
Conductor previously needed to be restarted if a rollup was restarted. This change is meant to make it such that if a rollup goes down and comes back online, the conductor can restart internally and automatically as opposed to needing a manual restart.
Changes
InnerConductorTaskto shut down gracefully before restarting.RestartTokento be returned by theInnerConductorTaskon shutdown to indicate to the wrapper whether to restart.tonic::code::PermissionDeniedto initiate a restart.restart()to end the tasks and create new ones.Testing
Added blackbox test and helper functions to ensure conductor restarts after receiving a
PermissionDeniedstatus code.Breaking Changelist
Not breaking, but we should have documentation that reflects the fact that rollups need to return a
PermissionDeniederror if they receive anexecute_blockgRPC beforeget_genesis_infoorget_commitment_state. Similarly, if the type of code used for this changes, the conductor restart logic will need to be changed to reflect this. It's quite rigid right now.Related Issues
closes #906