pay hands back failures to gossipd#638
Conversation
9b4bfda to
0805958
Compare
|
Rebase on latest, remove |
0805958 to
450da48
Compare
|
Added saving of route nodes and route channels to db, rebased on master to get #670 into this PR also. Next step (actually report on failure) should be easier. @cdecker @rustyrussell I would like to request to review especially the new functions Sorry for the massive delay, this saving of route, was a bigger adventure than I expected! |
|
@rustyrussell : sorry, I would like to ask more detailed help on BOLT#4 and gossipd routing... Below is my understanding: The Now, when a node at |
450da48 to
b5d16e6
Compare
cdecker
left a comment
There was a problem hiding this comment.
Very nice PR so far, only minor nits.
gossipd/gossip.c
Outdated
| return daemon_conn_read_next(conn, &daemon->master); | ||
| } | ||
|
|
||
| static struct io_plan *routing_failure_req(struct io_conn *conn, |
There was a problem hiding this comment.
Nit: _req suggests that there is a reply to this, just routing_failure should be fine.
| { | ||
| struct pubkey erring_node; | ||
| struct short_channel_id erring_channel; | ||
| u16 failcode; |
There was a problem hiding this comment.
Wondering whether this can be enum onion_type right away. Would save us the cast down below.
There was a problem hiding this comment.
onion_type itself seems generated, and does not seem to play nice when used in a csv file source. Or maybe some other way?
There was a problem hiding this comment.
I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.
gossipd/routing.h
Outdated
| /* Is this connection active? */ | ||
| bool active; | ||
| /* Is this connection permanently inactive? */ | ||
| bool perminactive; |
There was a problem hiding this comment.
Permanent failure means that the channel will not be able to recover from this error, or the channel is unusable for our purposes, so we should rather remove them from our network view. That may result in us re-adding it when syncing gossip with a new peer, but we'll eventually not do the gossip dump.
| return true; | ||
| } | ||
| struct short_channel_id * | ||
| sqlite3_column_short_channel_id_array(const tal_t *ctx, |
There was a problem hiding this comment.
We could simplify this a lot by using the uint64 serialization format, so that each short_channel_id is exactly 8 bytes in size. No need for separator. That'd be a bit inconsistent with the way we serialize the scid in the channels table, but much simpler here.
There was a problem hiding this comment.
I think one concern is also byte order? That seems to be the main reason for why scid is serialized as string rather than directly as blob.
There was a problem hiding this comment.
The byte order is fixed in the protocol, otherwise we couldn't communicate them at all. I just used the string serialization because I had it handy at the time, and we were stringifying everything else as well. We have since migrated to the sqlite3_bind_* interfaces for binaries, which are much safer and more performant.
There was a problem hiding this comment.
I suppose I was confused by what was meant by "uint64 serialization format", are you referring to fromwire_short_channel_id and towire_short_channel_id? If so, probably I should indeed use that.
It seems I can chain multiple calls to towire_short_channel_id, so maybe:
for (i = 0; i < num; ++i) {
towire_short_channel_id(&allser, &id[i]);
}
b5d16e6 to
b3804a0
Compare
|
Fixed up as per review:
|
b3804a0 to
b6d0fce
Compare
|
Fixed up as per review:
Added commit:
|
|
Now report payment routing failures to gossip. I assume my thoughts in #638 (comment) are correct and that is what is implemented. Re
In specifics, I want to add a Boolean field, |
|
@rustyrussell one thing I want to bring up for |
|
@rustyrussell So I propose the below new result value for
|
|
Conflicts with #756 btw, nothing too bad |
Since I created the conflict, I'm also happy to fix it once one of the two gets merged ^^ |
|
I just crashed in routing_failure: some context: I was trying to pay an invoice and it was spamming: I killed the process and brought it up again, and then it crashed with the log above. |
rustyrussell
left a comment
There was a problem hiding this comment.
Generally this is really good; just some stylistic nits where you've written a little too much code :)
| { | ||
| struct pubkey erring_node; | ||
| struct short_channel_id erring_channel; | ||
| u16 failcode; |
There was a problem hiding this comment.
I think it works, just need with #include <wire/gen_onion_wire.h> in the CSV file.
| } | ||
|
|
||
| return NULL; | ||
| } |
There was a problem hiding this comment.
This is a very specialized function, which only makes sense because of the current caller.
How about a 'get_nc_from_node(rstate, node, scid)' which simply returns the pointer to the node_connection; then do the work below.
gossipd/routing.c
Outdated
|
|
||
| u64 num_ins = tal_count(node->in); | ||
| u64 num_outs = tal_count(node->out); | ||
| u64 num = num_ins + num_outs; |
There was a problem hiding this comment.
tal_count returns size_t, which is pretty natural for sizes; u64 looks a bit weird here.
gossipd/routing.c
Outdated
| * consideration. | ||
| * | ||
| */ | ||
| if ((failcode & NODE) != 0) |
There was a problem hiding this comment.
if (failcode & NODE) is a little more idiomatic.
gossipd/routing.c
Outdated
| get_all_node_connections_of(const tal_t *ctx, | ||
| struct routing_state *rstate, | ||
| const struct node *node) | ||
| { |
There was a problem hiding this comment.
I would open code this, see below.
gossipd/routing.c
Outdated
| if ((failcode & NODE) != 0) | ||
| ncs = get_all_node_connections_of(tmpctx, rstate, node); | ||
| else | ||
| ncs = get_out_node_connection_of(tmpctx, rstate, node, scid); |
There was a problem hiding this comment.
How about:
if (failcode & NODE) {
for (i = 0; i < tal_count(node->in); i++)
handle_route_fail(rstate, node->in[i], failcode);
for (i = 0; i < tal_count(node->out); i++)
handle_route_fail(rstate, node->out[i], failcode);
} else {
struct node_connection *nc = get_nc_from_node(node, scid);
/* Possible, if something else deleted nc since we sent payment */
if (!nc)
status_trace("Error %s on node %s scid %s: no such connection any more");
else
handle_route_fail(rstate, nc, failcode);
gossipd/routing.c
Outdated
| * this after deactivating, so that if the | ||
| * channel_update is newer it will be | ||
| * reactivated. */ | ||
| if ((failcode & UPDATE) != 0) { |
There was a problem hiding this comment.
Minor style nit: if (failcode & UPDATE) { is more idiomatic.
lightningd/pay.c
Outdated
| if (failure->origin_index >= tal_count(payment->route_nodes)) { | ||
| /* Is it permanent? */ | ||
| if ((failcode & PERM) != 0) | ||
| return false; |
There was a problem hiding this comment.
We should still hand to gossipd!
There was a problem hiding this comment.
But what will we hand to gossipd? In particular, what outgoing channel will we report via gossip_routing_failure message?
From my reading of BOLT 4, below is the list of possible errors a final payee node can return:
invalid_realm- permanent non-node error, but this "should" only happen if we send a realm that the recipient is ignorant of, and that indicates a bug on our side as presumably future realm types (that old nodes might not recognize) should be enabled in a global features bit. Arguably this should be treated a a node error, as apparently we think that node supports some new realm byte, but the node itself denies cognizance of that realm byte.temporary_node_failure- node error, so outgoing channel ignored, no problem.permanent_node_failure- node error, so outgoing channel ignored, no problem.required_node_feature_missing- similar toinvalid_realm(permanent non-node error, this time the receiver thinks we support some feature but we ourselves are ignorant of that feature and did not pay attention to the even feature bit).unknown_payment_hash- permanent non-node error, but may happen for example if payee is c-lightning and the invoice expired before the payment reached it.incorrect_payment_amount- permanent non-node error, but is arguably a bug on our side where some roundoff or some other issue made us go beyond the limits of the invoice being paid.final_expiry_too_soon- temporary non-node error, but we can actually reuse the exact same route, we just need to adjust the CLTVs.final_incorrect_cltv_expiry- temporary non-node error, but maybe this indicates a bug on our side?final_incorrect_htlc_amount- temporary non-node error, similar tofinal_incorrect_cltv_expiry.
So for now, I think, it is sensible to report node-level errors on the final node to gossip, but I am uncertain what channel to deactivate for the non-node errors; non-node errors at the final destination seem to suggest a bug or disagreement between us and the final node.
I will fix the other review items for now but I want to know your opinion for the final node case in more detail first.
lightningd/pay.c
Outdated
| failure->msg, NULL, | ||
| &channel_update)) | ||
| /* No channel update. */ | ||
| channel_update = NULL; |
There was a problem hiding this comment.
This is so ugly it needs to be in its own function, I think!
|
As to sendpay, no, we need to return an error when there's a routing problem. However, we should be jsonrpc 2.0 compliant, which allows a 'data' field in errors; we should put in the information about exactly what happened in this case. There remain two problems (which DO NOT need to be fixed in this series!):
|
094ad5d to
8ebe1be
Compare
|
@rustyrussell itl looks like we currently violate json rpc 2.0 as we return a string |
|
Okay, for |
d060bd8 to
3de856f
Compare
…e of immediate channeld error.
3de856f to
b10fa9a
Compare
|
#858 merged, so rebased on top of master. Should now be OK for merging if it passes CI. |
|
CI OK. @cdecker @rustyrussell please merge! |
|
Removing the WIP label, but it's too late for me to review now. Will do it first thing tomorrow. |
rustyrussell
left a comment
There was a problem hiding this comment.
Congratulations! This is an excellent patch set. I'm applying it now..
| (void) retry_plausible; | ||
|
|
||
| json_pay_failed(hout->cmd, NULL, failcode, "reply from remote"); | ||
| json_pay_failed(hout->cmd, failcode, failmsg); |
There was a problem hiding this comment.
Future work: we should provide more detailed failmsg: unlike logging, that's returned to the caller who is definitely interested!
|
good stuff @ZmnSCPxj ! |
Fixes: #550
Eventually. Remaining:
sendpayshould differ frompay. Presumablysendpayonly tries the specific route specified and should probably return the result of the attempt immediately without retrying, whilepaydoes a best-effort and keeps retrying until a permanent failure at the payee or no-route-found from gossipd.ImplementDo this in a later PR instead, it is big enough already!dataerror return onsendpayif a routing error occurred.