Issue
Running go test -v -run=TestRouterPaymentStateMachine/MP_terminal inside routing package produces a test flake.
--- FAIL: TestRouterPaymentStateMachine (5.99s)
--- FAIL: TestRouterPaymentStateMachine/MP_terminal (5.09s)
payment_lifecycle_test.go:1015: got no payment result
Cause
Inside the test MP terminal, we fire four shards. The first shard will fail with a terminal error, while the rest will fail with a temp error. In the end, we want to have channeldb.FailureReasonPaymentDetails to be returned, while it's not always the case.
Inside the function resumePayment(), each time we fire a shard, we call shardHandler.collectResultAsync(attempt) in the end, which, as the name suggests it's asynced. Inside the function, relevant to the context, it has two steps,
- Step one, fail the
HTLCAttempt.
- Step two, fail the payment if decide not to retry. It will send the expected error to channel
p.shardErrors, block until the msg is received.
Back to the function resumePayment(), we can only get the expected error in the following lines,
- When we call
checkShards(), which reads the error from channel p.shardErrors.
- When we hit
case state.terminate && state.numShardsInFlight == 0, where the payment is terminated and we have no shards in flight.
- When we hit
case state.terminate || state.remainingAmt == 0, where either we are in the terminate state, or there are no more shards to fire. We would then call waitForShard() to get the error from p.shardErrors.
- Finally here after we call
RequestRoute and it finds no path plus we have shards in flight. We would then call waitForShard() to get the error from p.shardErrors.
Now image the edge case. We fire shard#1, which should have failed the HTLCAttempt and the payment. However, inside the collectResultAsync, it only finished step one but never step two, so it didn't send an error to p.shardErrors yet, which means,
checkShards() will return nil.
- the
HTLCAttempt is failed, we won't hit condition state.remainingAmt == 0.
- the payment is not failed, hence we won't hit
case state.terminate.
What's left is the last choice, we will call RequestRoute again. And this is where the issue comes from, we can't do it now because the test requires a routeRelease step, otherwise the function will block and the test will fail.
Issue
Running
go test -v -run=TestRouterPaymentStateMachine/MP_terminalinside routing package produces a test flake.Cause
Inside the test
MP terminal, we fire four shards. The first shard will fail with a terminal error, while the rest will fail with a temp error. In the end, we want to havechanneldb.FailureReasonPaymentDetailsto be returned, while it's not always the case.Inside the function
resumePayment(), each time we fire a shard, we callshardHandler.collectResultAsync(attempt)in the end, which, as the name suggests it's asynced. Inside the function, relevant to the context, it has two steps,HTLCAttempt.p.shardErrors, block until the msg is received.Back to the function
resumePayment(), we can only get the expected error in the following lines,checkShards(), which reads the error from channelp.shardErrors.case state.terminate && state.numShardsInFlight == 0, where the payment is terminated and we have no shards in flight.case state.terminate || state.remainingAmt == 0, where either we are in the terminate state, or there are no more shards to fire. We would then callwaitForShard()to get the error fromp.shardErrors.RequestRouteand it finds no path plus we have shards in flight. We would then callwaitForShard()to get the error fromp.shardErrors.Now image the edge case. We fire shard#1, which should have failed the HTLCAttempt and the payment. However, inside the
collectResultAsync, it only finished step one but never step two, so it didn't send an error top.shardErrorsyet, which means,checkShards()will return nil.HTLCAttemptis failed, we won't hit conditionstate.remainingAmt == 0.case state.terminate.What's left is the last choice, we will call
RequestRouteagain. And this is where the issue comes from, we can't do it now because the test requires arouteReleasestep, otherwise the function will block and the test will fail.