Skip to content

feat(conductor): implement restart logic#1463

Merged
ethanoroshiba merged 13 commits intomainfrom
ENG-83/conductor_restart_logic
Sep 20, 2024
Merged

feat(conductor): implement restart logic#1463
ethanoroshiba merged 13 commits intomainfrom
ENG-83/conductor_restart_logic

Conversation

@ethanoroshiba
Copy link
Copy Markdown

@ethanoroshiba ethanoroshiba commented Sep 6, 2024

Summary

Added logic to restart conductor tasks if it receives a PermissionDenied code from the execution layer.

Background

Conductor previously needed to be restarted if a rollup was restarted. This change is meant to make it such that if a rollup goes down and comes back online, the conductor can restart internally and automatically as opposed to needing a manual restart.

Changes

  • Added wrapper task and corresponding handle around conductor to manage a restart, allowing the InnerConductorTask to shut down gracefully before restarting.
  • Added error propagation within the conductor to ensure tasks do not silently fail.
  • Added RestartToken to be returned by the InnerConductorTask on shutdown to indicate to the wrapper whether to restart.
  • Added checks for task errors with tonic::code::PermissionDenied to initiate a restart.
  • Implemented logic to check a returned error for a need to restart and function restart() to end the tasks and create new ones.

Testing

Added blackbox test and helper functions to ensure conductor restarts after receiving a PermissionDenied status code.

Breaking Changelist

Not breaking, but we should have documentation that reflects the fact that rollups need to return a PermissionDenied error if they receive an execute_block gRPC before get_genesis_info or get_commitment_state. Similarly, if the type of code used for this changes, the conductor restart logic will need to be changed to reflect this. It's quite rigid right now.

Related Issues

closes #906

@github-actions github-actions bot added the conductor pertaining to the astria-conductor crate label Sep 6, 2024
@ethanoroshiba ethanoroshiba marked this pull request as ready for review September 6, 2024 16:04
@ethanoroshiba ethanoroshiba requested a review from a team as a code owner September 6, 2024 16:04
@ethanoroshiba ethanoroshiba requested a review from noot September 6, 2024 16:04
Copy link
Copy Markdown
Contributor

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes because checking the error condition for a restart is flaky, and because the way a new conductor is spun up is prone to refactoring errors:

I would prefer if a conductor restart used the same Conductor::new into Conductor::run_until_stopped logic that is used to spin it up in the first place.

The easiest way to accomplish this might be to change Conductor into a wrapper around a conductor task like Conductor(JoinHandle<ConductorImpl>), and then ConductorImpl signals the reason for its exit, letting the public-facing Conductor wrapper decide to restart or not.

Copy link
Copy Markdown
Contributor

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are going in the right direction, thank you! I think we should split the public-facing conductor and the inner business logic by modules. There is quite a bit of duplicated naming for both objects, and going through the single module makes that confusing.

Also have a second look at the cancellation tokens. If you clone a token, then cancelling it will automatically cancel the other, which might not be what you intended. Instead, check out https://docs.rs/tokio-util/0.7.12/tokio_util/sync/struct.CancellationToken.html#method.child_token, which allows you to cancel tokens only at the level of the child token (or below).

Comment on lines +334 to +345
report_exit(&exit_reason, message);
let shutdown_res = self.shutdown().await;

if should_restart {
return Ok(RestartToken::Restart);
}

if let RestartToken::Restart = shutdown_res {
return Ok(RestartToken::Restart);
}
exit_reason?;
Ok(RestartToken::Shutdown)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you streameline these lines? Right now I see 4 different ways to exit on 8 lines, which is a bit much.

For example, the shutdown method could take an extra argument that isFailedTask { name: String, error: JoinError<eyre::Result<()>>, } and do all the checks in one function.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to streamline a fair bit in 4f605ea. Let me know if there's more to be done in this regard!

}

impl Conductor {
impl InnerConductorTask {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mixes nomenclature a bit since "task" for us refers to a task spawned on the tokio runtime, whereas this object implements the conductor business logic. ConductorInner or ConductorImpl would be common names to give to this:

Suggested change
impl InnerConductorTask {
impl ConductorImpl {

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose ConductorInner since I thought it was slightly more appropriate, and also to my mind Impl kind of implies a trait. But, if using {MODULENAME}Impl is standard I'm happy to change it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's common to use Thing in the public API, and ThingImpl (or ThingInner) for the implementation deatils.

I would have been ok with ConductorInner as I said :D I just took offense to using InnerConductorTask

}

impl Conductor {
impl InnerConductorTask {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's common to use Thing in the public API, and ThingImpl (or ThingInner) for the implementation deatils.

I would have been ok with ConductorInner as I said :D I just took offense to using InnerConductorTask

Copy link
Copy Markdown
Contributor

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a nice feature to have. Thank you for implementing it!

@ethanoroshiba ethanoroshiba added this pull request to the merge queue Sep 20, 2024
@SuperFluffy SuperFluffy removed this pull request from the merge queue due to a manual request Sep 20, 2024
@ethanoroshiba ethanoroshiba added this pull request to the merge queue Sep 20, 2024
Merged via the queue into main with commit f11f900 Sep 20, 2024
@ethanoroshiba ethanoroshiba deleted the ENG-83/conductor_restart_logic branch September 20, 2024 14:46
steezeburger added a commit that referenced this pull request Sep 23, 2024
* main:
  feat(conductor): implement restart logic (#1463)
  fix: ignore `RUSTSEC-2024-0370` (#1483)
  fix, refactor(sequencer): refactor ics20 logic (#1495)
  fix(ci): use commit SHA instead of PR number preview-env images (#1501)
  chore(bridge-withdrawer): pass GRPC and CometBFT clients to consumers directly (#1510)
  fix(sequencer): Fix incorrect error message from BridgeUnlock actions (#1505)
  fix(bridge-contracts): fix memo transaction hash encoding (#1428)
  fix: build docker when workflow explicitly includes component (#1498)
  chore(sequencer): migrate from `anyhow::Result` to `eyre::Result` (#1387)
  fix(ci): typo for required field in sequencer preview-env (#1500)
  feat(ci): provide demo/preview environments (#1406)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

conductor pertaining to the astria-conductor crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: conductor restart logic

2 participants