Skip to content

routing: test flake in TestRouterPaymentStateMachine #5259

@yyforyongyu

Description

@yyforyongyu

Issue

Running go test -v -run=TestRouterPaymentStateMachine/MP_terminal inside routing package produces a test flake.

--- FAIL: TestRouterPaymentStateMachine (5.99s)
    --- FAIL: TestRouterPaymentStateMachine/MP_terminal (5.09s)
        payment_lifecycle_test.go:1015: got no payment result

Cause

Inside the test MP terminal, we fire four shards. The first shard will fail with a terminal error, while the rest will fail with a temp error. In the end, we want to have channeldb.FailureReasonPaymentDetails to be returned, while it's not always the case.

Inside the function resumePayment(), each time we fire a shard, we call shardHandler.collectResultAsync(attempt) in the end, which, as the name suggests it's asynced. Inside the function, relevant to the context, it has two steps,

  • Step one, fail the HTLCAttempt.
  • Step two, fail the payment if decide not to retry. It will send the expected error to channel p.shardErrors, block until the msg is received.

Back to the function resumePayment(), we can only get the expected error in the following lines,

  • When we call checkShards(), which reads the error from channel p.shardErrors.
  • When we hit case state.terminate && state.numShardsInFlight == 0, where the payment is terminated and we have no shards in flight.
  • When we hit case state.terminate || state.remainingAmt == 0, where either we are in the terminate state, or there are no more shards to fire. We would then call waitForShard() to get the error from p.shardErrors.
  • Finally here after we call RequestRoute and it finds no path plus we have shards in flight. We would then call waitForShard() to get the error from p.shardErrors.

Now image the edge case. We fire shard#1, which should have failed the HTLCAttempt and the payment. However, inside the collectResultAsync, it only finished step one but never step two, so it didn't send an error to p.shardErrors yet, which means,

  • checkShards() will return nil.
  • the HTLCAttempt is failed, we won't hit condition state.remainingAmt == 0.
  • the payment is not failed, hence we won't hit case state.terminate.

What's left is the last choice, we will call RequestRoute again. And this is where the issue comes from, we can't do it now because the test requires a routeRelease step, otherwise the function will block and the test will fail.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions