Skip to content

test(sequencer-relayer): reinstate black box tests#1033

Merged
Fraser999 merged 10 commits intomainfrom
fraser/restore-relayer-black-box-tests
May 3, 2024
Merged

test(sequencer-relayer): reinstate black box tests#1033
Fraser999 merged 10 commits intomainfrom
fraser/restore-relayer-black-box-tests

Conversation

@Fraser999
Copy link
Copy Markdown
Contributor

Summary

The black box tests were removed in a recent PR. This PR reinstates them using the new mock gRPC framework astria-grpc-mock.

Background

The tests would have needed heavy refactoring in #963, and we wanted to avoid adding a lot more code to that already-large PR. We decided to temporarily delete them and reinstate them using astria-grpc-mock to mock responses from the Celestia app.

Changes

The tests removed in #963 and a single test briefly added in #1001 then removed again have been restored. I also added a new test to check the relayer shuts down in a timely manner.

Previously the tests leaned heavily on counts of blobs received by the mock Celestia app. The current tests retain these checks, but also query the /status endpoint of the sequencer-relayer to confirm its state. There was also a single test previously which checked the state of the postsubmit.json file. I didn't think this was necessary given that we're querying the state in a more "black box" way now (via the http server), but I didn't remove the function to perform this check (TestSequencerRelayer::assert_state_files_are_as_expected) pending a decision on whether to reinstate that check or not inside the later_height_in_state_leads_to_expected_relay test.

As @SuperFluffy predicted, all the new protobuf packages added in #963 had to be added to the pbjson_build builder to generate the serde impls required for using them in astria-grpc-mock. This accounts for the vast majority of the new LoC in this PR.

I made a few small changes to the mock framework:

  • added Mock::up_to_n_times to avoid failures due to a single-use mock response being sent multiple times
  • added DynamicResponse to support constructing mock responses which can be passed the relevant gRPC request
  • changed MockGuard::wait_until_satisfied to take self by ref rather than value, since if it consumes self and wait_until_satisfied times out, the MockGuard panics in its Drop impl, meaning we don't get a good indication from e.g. a tokio::time::timeout as to what went wrong

Most of the new functionality lives in MockCelestiaAppServer and TestSequencerRelayer. For the former, all gRPCs except for BroadcastTx and GetTx are set up to always return valid-enough responses - i.e. these don't need to be mounted individually in every test case. In the case of MockCelestiaAppServer, the new functionality is a combination of support for mounting mock responses and functions with timeouts to query the /status, /healthz and /readyz http endpoints of the sequencer-relayer.

Testing

These changes are tests.

Closes #1008.

@Fraser999 Fraser999 requested a review from a team as a code owner May 1, 2024 21:42
@Fraser999 Fraser999 requested a review from SuperFluffy May 1, 2024 21:42
@github-actions github-actions bot added proto pertaining to the Astria Protobuf spec sequencer-relayer pertaining to the astria-sequencer-relayer crate labels May 1, 2024
Comment on lines +236 to +238
let maybe_block = SequencerBlock::try_from_raw(block.into_inner())
.wrap_err("failed to parse raw proto block from grpc response");
state.set_sequencer_connected(maybe_block.is_ok());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only change to production code, so worth highlighting :)

I think this is the correct behaviour, since if we return an error here, we exit the select! loop in Relayer::run and are hung until externally killed. While in that state, I think it makes sense to have the sequencer noted as disconnected so that the /healthz endpoint reports degraded health.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we hanging until externally killed or do we just shut down gracefully?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hanging here. We could do self.shutdown_token.cancel(); just before that if we do want to actually shut down?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see - yeah. Good catch. We should do that. It's outside the scope of this PR I'd say, but relayer knows that if the read task fails no progress will be made. So it should just shut everything down.

Copy link
Copy Markdown
Contributor

@SuperFluffy SuperFluffy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very nice to have these back. I also the additions like the custom timeout_ms that reports /status in case of failures (the contents of which we should probably expand on :-) ).

I believe some mocks that currently return guards should not do so (I am thinking of the ABCI info one and also the sequencer-block fetch; I might have missed one). In my mind, "intermittent" mocks should just be mounted without any extra requirements when and in what order they are picked up by relayer - only the final result (blocks posted to Celestia) should actually be awaited. This would leave the order and timing "inside" relayer opaque.

Also, I think the Celestia-App checks (e.g. how many blocks were received) could be replaced by mocks.

But I also realize that both requirements are not necessarily trivial and might be beyond the scope of this PR, which stays very close to the original hand-written mock. I think these would be good to cover in future PRs.

I think with this PR you can also remove astria-celestia-client and astria-celestia-mock?

// Mount a bad block next, so the relayer will fail to fetch the block.
let _guard = sequencer_relayer.mount_abci_response(2).await;
let block_to_mount = SequencerBlockToMount::BadAtHeight(2);
let block_guard = sequencer_relayer
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not really necessary to have this guard here - if we confirm /healthz before (which we do), then mount the abci + block responses, and finally observe 500 on /healthz that's enough. What do you think?

That would allow removing the guard returned by mount_sequencer_block_response

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did the same for the abci one (i.e. not return a guard for it), I think that should work.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I take that back :) Without the scoped guards being awaited, the test exits before the mocks have registered that they've been consumed, so they fail the test.

I think I can get away with making the abci one not scoped, and have two methods for the sequencer block - one scoped and the other not.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in d64481e.

}

#[must_use = "a mock must be mounted on a server to be useful"]
pub fn up_to_n_times(mut self, n: u64) -> Self {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the original wiremock functionality that we hadn't yet ported over right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, indeed.


impl MockGuard {
pub async fn wait_until_satisfied(self) {
pub async fn wait_until_satisfied(&self) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes it necessary to take this by reference? The timeout?

I actually thought of it as a feature that it gets dropped at the end of the scope, immediately returning the panic/report. But I suppose that wasn't useful and/or blocking some other functionality?

Copy link
Copy Markdown
Contributor Author

@Fraser999 Fraser999 May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we had e.g.

tokio::time::timeout(duration, guard.wait_until_satisfied()).expect("shouldn't timeout");

then what was happening if wait_until_satisfied exceeded the duration was:

  • tokio::time::timeout expired and cancelled the wait_until_satisfied future
  • wait_until_satisfied had consumed the MockGuard, so when it was cancelled, the guard was dropped
  • the Drop impl of the guard panicked due to unsatisfied condition
  • we never get as far as executing .expect("shouldn't timeout") since we've panicked

I found this unexpected initially - I was expecting to see "shouldn't timeout" in the panic message, and then the report about the unsatisfied mock. That's what happens now by taking &self in wait_until_satisfied.

}
}

pub fn dynamic_response<I, O, F>(responder: F) -> DynamicResponse<I, O, F>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can implement e.g.:

impl<I, O, F> Respond for F,
where
    F: Fn(&I) -> O,
    O: erased_serde::Serialize + prost::Name + Clone + 'static,

Maybe another indirection would be necessary like MakeDynamicResponse and then implemeent Respond for MakeDynamicResponse. Gonna think about how to do that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - that was my initial approach, but I and O are unconstrained type parameters so it doesn't fly :)

I played with this for a while but the explicit dynamic_response function seemed like the path of least resistance. Maybe there's a cleaner way though?

Comment on lines +236 to +238
let maybe_block = SequencerBlock::try_from_raw(block.into_inner())
.wrap_err("failed to parse raw proto block from grpc response");
state.set_sequencer_connected(maybe_block.is_ok());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we hanging until externally killed or do we just shut down gracefully?

@Fraser999 Fraser999 requested a review from a team as a code owner May 3, 2024 15:57
@Fraser999 Fraser999 requested a review from aajimal May 3, 2024 15:57
@Fraser999
Copy link
Copy Markdown
Contributor Author

@SuperFluffy

I think with this PR you can also remove astria-celestia-client and astria-celestia-mock?

Done in c9a5841.

@Fraser999 Fraser999 added this pull request to the merge queue May 3, 2024
Merged via the queue into main with commit 83d23ab May 3, 2024
@Fraser999 Fraser999 deleted the fraser/restore-relayer-black-box-tests branch May 3, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

proto pertaining to the Astria Protobuf spec sequencer-relayer pertaining to the astria-sequencer-relayer crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Restore black box tests removed from sequencer-relayer

2 participants