Better logic for sending `channel_updates` by pm47 · Pull Request #888 · ACINQ/eclair

pm47 · 2019-03-08T20:17:47Z

Previous logic was very simple but naive:

- every time a channel_update changed we would send it out
- we would always make a new channel_update with the disabled flag set
  at startup.

In case our node was simply restarted, this resulted in us re-sending a
channel_update with the disabled flag set, then a second one with the
disabled flag unset a few seconds later, for each public channel.

On top of that, this opened the way to a bug: if reconnection is very fast,
then the two successive channel_updates will have the same timestamp,
causing the router to not send the second one, which means that the
channel would be considered disabled by the network, and excluded from
payments.

The new logic is as follows:

- when we do NORMAL->NORMAL or NORMAL->OFFLINE or OFFLINE->NORMAL, we
  send out the new channel_update if it has changed
- in all other cases (e.g. WAIT_FOR_INIT_INTERNAL->OFFLINE) we do nothing
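
The transition rule above can be sketched as a small decision function. This is only an illustration in Python, not eclair's Scala code: the `ChannelUpdate` fields, the `update_to_broadcast` helper and the string state names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Transitions for which a changed channel_update is sent out (from the list above).
BROADCAST_TRANSITIONS = {
    ("NORMAL", "NORMAL"),
    ("NORMAL", "OFFLINE"),
    ("OFFLINE", "NORMAL"),
}


@dataclass(frozen=True)
class ChannelUpdate:
    # Hypothetical, simplified view of a channel_update.
    short_channel_id: int
    timestamp: int
    enabled: bool


def update_to_broadcast(
    prev_state: str,
    next_state: str,
    prev_update: ChannelUpdate,
    next_update: ChannelUpdate,
) -> Optional[ChannelUpdate]:
    """Return the update to send out, or None if nothing should be sent."""
    if (prev_state, next_state) not in BROADCAST_TRANSITIONS:
        return None  # e.g. WAIT_FOR_INIT_INTERNAL -> OFFLINE: do nothing
    if next_update == prev_update:
        return None  # unchanged: do not re-send
    return next_update
```

With this shape, a restart that goes through WAIT_FOR_INIT_INTERNAL->OFFLINE sends nothing, while a later OFFLINE->NORMAL only sends an update if it differs from the stored one.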

As a side effect, if we were connected to a peer, then we shut down
eclair, then the peer goes down, then we restart eclair: we will make a
new channel_update with the disabled flag set but we won't broadcast it.
If someone tries to make a payment to that node, we will return the
new channel_update with disabled flag set (and maybe the payer will then
broadcast that channel_update). So even in that corner case we are good.


In case of a disconnection-reconnection, we first generate a channel_update with disabled bit set, then after we reconnect we generate a second channel_update with disabled bit not set. If this happens very quickly, then both channel_updates will have the same timestamp, and the second one will get ignored by the network. A simple fix is to bump the second timestamp in this case.
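
A minimal sketch of the timestamp bump, in Python for illustration only. The function names are hypothetical; the staleness check reflects the BOLT 7 rule that a `channel_update` whose timestamp is not greater than the last one received for the same channel and direction should be ignored.

```python
def is_ignored_by_peers(last_seen_timestamp: int, new_timestamp: int) -> bool:
    """Peers drop an update whose timestamp is not strictly newer."""
    return new_timestamp <= last_seen_timestamp


def next_timestamp(now: int, previous_timestamp: int) -> int:
    """Use the current time, bumped by one second if it would not be newer."""
    if now > previous_timestamp:
        return now
    return previous_timestamp + 1


# Disconnect and reconnect within the same second (timestamps are in seconds).
disabled_ts = 1552076267
naive_enabled_ts = disabled_ts                               # same second
bumped_enabled_ts = next_timestamp(disabled_ts, disabled_ts)
```

Here the naive second update would be ignored by peers, while the bumped one is one second newer and gets through.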

We only care about this timer when connected anyway. We also cancel it when disconnecting. This has several advantages:

- having a static task resulted in unnecessary refreshes if the channel got disconnected/reconnected during the 2-week period
- better distribution of the channel_update refreshes over time, because at startup all channels were generated at the same time, causing all refresh tasks to be synchronized
- less overhead for the scheduler, because we cancel the refresh task for offline channels (minor, but still)
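
One way to picture a refresh timer that is started on connection and cancelled on disconnection is the sketch below. It is a simplified Python model with explicit clock values instead of a real scheduler; the `RefreshTimers` class and its method names are assumptions for illustration, not eclair's actual implementation.

```python
from typing import Dict, List


class RefreshTimers:
    """Tracks one pending channel_update refresh deadline per connected channel."""

    def __init__(self, interval: int) -> None:
        self.interval = interval
        self.deadlines: Dict[str, int] = {}  # channel id -> time of next refresh

    def on_connected(self, channel_id: str, now: int) -> None:
        # The timer starts when the channel connects, so deadlines are
        # spread out instead of all being aligned on node startup.
        self.deadlines[channel_id] = now + self.interval

    def on_disconnected(self, channel_id: str) -> None:
        # Offline channels have no pending refresh task.
        self.deadlines.pop(channel_id, None)

    def due(self, now: int) -> List[str]:
        """Channels whose refresh deadline has been reached."""
        return sorted(c for c, t in self.deadlines.items() if t <= now)
```

A channel that reconnects gets a fresh deadline counted from the reconnection, so a disconnected or recently reconnected channel does not trigger a refresh it does not need.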


sstone

LGTM, with a nitpick: comment in Channel.scala line 1566 is not valid anymore

When switching from `SYNCING`->`NORMAL`, instead of emitting a new `channel_update` with flag=enabled right away, we wait a little bit and send it later. This way, if a connection to a peer is unstable and we keep getting disconnected/reconnected, we won't spam the network. This extra delay allows us to remove the change made in #888, which was a workaround in case we generated `channel_update` too quickly. Also, increased refresh interval from 7 days to 10 days. There was no need to be so conservative.

The goal is to prevent sending a lot of updates for flappy channels.

Instead of sending a disabled `channel_update` after each disconnection, we now wait for a payment to try to route through the channel and only then reply with a disabled `channel_update` and broadcast it on the network. The reason is that in case of a disconnection, if no one cares about that channel then there is no reason to tell everyone about its current (disconnected) state.

In addition to that, when switching from `SYNCING`->`NORMAL`, instead of emitting a new `channel_update` with flag=enabled right away, we wait a little bit and send it later. We also don't send a new `channel_update` if it is identical to the previous one (except if the previous one is outdated). This way, if a connection to a peer is unstable and we keep getting disconnected/reconnected, we won't spam the network.

The extra delay allows us to remove the change made in #888, which was a workaround in case we generated `channel_update` too quickly.

Also, increased refresh interval from 7 days to 10 days. There was no need to be so conservative.

Note that on startup we still need to re-send `channel_update` for all channels in order to properly initialize the `Router` and the `Relayer`. Otherwise they won't know about those channels, and e.g. the `Relayer` will return `UnknownNextPeer` errors. But we don't need to create new `channel_update`s in most cases, so this should have little or no impact on gossip because our peers will already know the updates and will filter them out. On the other hand, if some global parameters (like relaying fees) are changed, it will cause the creation of a new `channel_update` for all channels.
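
The "identical unless outdated" rule and the lazy disabled update can be sketched as follows. This is an illustrative Python model, not the code from #950: the `Update` fields, both function names, and the choice to treat "outdated" as "older than the refresh interval" are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

REFRESH_INTERVAL_S = 10 * 24 * 3600  # 10 days, per the description above


@dataclass(frozen=True)
class Update:
    # Hypothetical, simplified channel_update.
    timestamp: int
    enabled: bool
    fee_base_msat: int


def should_broadcast(previous: Optional[Update], candidate: Update, now: int) -> bool:
    """Skip the candidate if it carries the same content as a still-fresh update."""
    if previous is None:
        return True
    same_content = (previous.enabled, previous.fee_base_msat) == (
        candidate.enabled,
        candidate.fee_base_msat,
    )
    outdated = now - previous.timestamp >= REFRESH_INTERVAL_S  # assumed threshold
    return outdated or not same_content


def disabled_update_for_failed_relay(previous: Update, now: int) -> Update:
    """Built only when a payment tries to use a disconnected channel."""
    return Update(
        timestamp=max(now, previous.timestamp + 1),  # must be strictly newer
        enabled=False,
        fee_base_msat=previous.fee_base_msat,
    )
```

Under these assumptions, a flapping connection that keeps producing the same enabled update sends nothing new, and the disabled update only exists once a payment actually needs it.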

* Fix eclair-cli to work with equal sign in arguments (#926)
  * Fix eclair cli argument passing
  * Modify eclair-cli to work with equals in arguments
  * Eclair-cli: show usage when wrong params are received
  * Remove deprecated call from eclair-cli help message [ci skip]
* Make Electrum tests pass on windows (#932)
  There was an obscure Docker error when trying to start an Electrum server in tests. [1] It appears that there is a conflict between Docker and Hyper-V on some range of ports. A workaround is to just change the port we were using.
  [1] docker/for-win#3171
* API: fix fee rate conversion (#936)
  Our `open` API call expects an optional fee rate in satoshi/byte, which is the most widely used unit, but failed to convert to satoshi/kiloweight which is the standard in LN. We also check that the converted fee rate cannot go below 253 satoshi/kiloweight.
* Expose the websocket over HTTP GET to work properly with basic auth (#934)
  * Expose the websocket over HTTP GET
  * Add test for basic auth over websocket endpoint
* Set max payment attempts from configuration (#931)
  With a default to `5`.
* Add a proper payments database (#885)
  There is no unique identifier for payments in LN protocol. Critically, we can't use `payment_hash` as a unique id because there is no way to ensure uniqueness at the protocol level. Also, the general case for a "payment" is to be associated to multiple `update_add_htlc`s, because of automated retries. We also routinely retry payments, which means that the same `payment_hash` will be conceptually linked to a list of lists of `update_add_htlc`s. In order to address this, we introduce a payment id, which uniquely identifies a payment, as in a set of sequential `update_add_htlc` managed by a single `PaymentLifecycle` that ends with a `PaymentSent` or `PaymentFailed` outcome. We can then query the api using either `payment_id` or `payment_hash`. The former will return a single payment status, the latter will return a set of payment statuses, each identified by their `payment_id`.
  * Add a payment identifier
  * Remove InvalidPaymentHash channel exception
  * Remove unused 'close' from paymentsDb
  * Introduce sent_payments in PaymentDB, bump db version
  * Return the UUID of the ongoing payment in /send API
  * Add api to query payments by ID
  * Add 'fallbackAddress' in /receive API
  * Expose /paymentinfo by paymentHash
  * Add id column to audit.sent table, add test for db migration
  * Add invoices to payment DB
  * Add license header to ExtraDirective.scala
  * Respond with HTTP 404 if the corresponding invoice/paymentHash was not found.
  * Left-pad numeric bolt11 tagged fields to have a number of bits multiple of five (bech32 encoding).
  * Add invoices API
  * Remove CheckPayment message
  * GUI: consume UUID reply from payment initiator
  * API: reply with JSON encoded response if the queried element wasn't found
  * Return a payment request object in /receive
  * Remove limit of pending payment requests!
  * Avoid printing "null" fields when serializing an invoice to json
  * Add index on paymentDb.sent_payments.payment_hash
  * Order results in descending order in listPaymentRequest
* Electrum: do not persist transaction locks (#953)
  Locks held on utxos that are used in unpublished funding transactions should not be persisted. If the app is stopped before the funding transaction has been published, the channel is forgotten and so should be the locks on its funding tx utxos.
* Added a timeout for channel open request (#928)
  Until now, if the peer is unresponsive (typically doesn't respond to `open_channel` or `funding_created`), we waited indefinitely, or until the connection closed. It translated to an API timeout for users, and uncertainty about the state of the channel. This PR:
  - adds an optional `--openTimeoutSeconds` timeout to the `open` endpoint, that will actively cancel the channel opening if it takes too long before reaching state `WAIT_FOR_FUNDING_CONFIRMED`.
  - makes the `ask` timeout configurable per request with a new `--timeoutSeconds`
  - makes the akka http timeout slightly greater than the `ask` timeout

  Ask timeout is set to 30s by default.
* Set `MAX_BUFFERED` to 1,000,000 (#948)
  Note that this doesn't mean that we will buffer 1M objects in memory: those are just pointers to (mostly) network announcements that already exist in our routing table. Routing table has recently gone over 100K elements (nodes, announcements, updates) and this causes the connection to be closed when peer requests a full initial sync.
* Fix Dockerfile maven binary checksum (#956)
  The Maven 3.6.0 SHA256 checksum was invalid and caused the docker build to fail.
* Add channel errors in audit db (#955)
  We now keep track of all local/remote channel errors in the audit db.
* Added simple plugin support (#927)
  Using org.clapper:classutil library and a very simple `Plugin` interface.
* Live channel database backup (#951)
  * Backup running channel database when needed
    Every time our channel database needs to be persisted, we create a backup which is always safe to copy even when the system is busy.
  * Upgrade sqlite-jdbc to 3.27.2.1
  * BackupHandler: use a specific bounded mailbox
    BackupHandler is now private, users have to call BackupHandler.props() which always specifies our custom bounded mailbox.
  * BackupHandler: use a specific threadpool with a single thread
  * Add backup notification script
    Once a new backup has been created, call an optional user defined script.
* Update readme with bitcoin 0.17 instructions (#958)
  This has somehow been missed by PR #826.
* Backup: explicitely specify move options (#960)
  * Backup: explicitly specify move options
    We now specify that we want to atomically overwrite the existing backup file with the new one (fixes a potential issue on Windows). We also publish a specific notification when the backup process has been completed.
* Print stack trace when crashing during boot sequence (#949)
  * Print stack trace when crashing during boot sequence
  * Use friendly message when db compatibility check fails
* ElectrumWallet should not send ready if syncing (#963)
  This commit is already embedded in version `0.2-android-beta22`.
* Channel: Log additional data (#943)
  * Channel: Log additional data
    Log local channel parameters, and our peer's open or accept message. This should be enough to recompute keys needed to recover funds in case of unilateral close.
* Electrum: make debug logs shorter (#964)
* Better handling of closed channels (#944)
  * Remove closed channels when application starts
    If the app is stopped just after a channel has transitioned from CLOSING to CLOSED, when the application starts again it will be restored as CLOSING. This commit checks channel data and removes closed channels instead of restoring them.
  * Channels Database: tag closed channels but don't delete them
    Instead we add a new `closed` column that we check when we restore channels.
  * Document how we check and remove closed channels on startup
* Do not print the stacktrace on stderr when there is an error at boot (#966)
  * Do not print the stacktrace on stdout when there is an error at boot
* Fix flaky test in PaymentLifecycleSpec (#967)
  * Use local random paymentHash for each test in paymentlifecyclespec, intercept the route request before the router.
* Rename `eclair.bak` to `eclair.sqlite.bak` (#968)
  This removes any ambiguity about what the content of the file is about.
* Fixed concurrency issue in `IndexedObservableList` (#961)
  Update map with new indexes after element is removed. Fixes #915
* Various fix and improvements in time/timestamp handling (#971)
  This PR standardizes the way we compute the current time as unix timestamp:
  - Scala's Platform is used and the conversion is done via scala's concurrent.duration facilities
  - Java's Instant has been replaced due to broken compatibility with android
  - AuditDB events use milliseconds (fixes #970)
  - PaymentDB events use milliseconds
  - Query filters for AuditDB and PaymentDB use seconds
* API: Support query by `channelId` or `shortChannelId` everywhere (#969)
  Add support for querying a channel information by its `shortChannelId`.
* Smarter strategy for sending `channel_update`s (#950)
* Fixed overflow issue with max duration (#975)
  This is a regression caused by #971, because `Duration` has a max value of `Long.MaxValue` *nanoseconds*, not *seconds*.
* Use proper closing type in `ChannelClosed` event (#977)
  There was actually a change introduced by #944 where we used `ClosingType.toString` instead of manually defining types, causing a regression in the audit database.
* Update bash autocompletion for eclair-cli (#983)
  * Update bash autocompletion file to suggest all the endpoints
  * Update list of commands in eclair-cli help message
* Replace `UnknownPaymentHash` and `IncorrectPaymentAmount` with `IncorrectOrUnknownPaymentDetails` (#984)
  See lightning/bolts#516 and lightning/bolts#544
* Wireshark dissector support (#981)
  * Transport: add support for encryption key logging.
    This is the format the wireshark lightning-dissector uses to be able to decrypt lightning messages.
* Enrich test for internal eclair API implementation (fr.acinq.eclair.Eclair.scala) (#938)
  * Add test to EclairImpl for `/send`, `/allupdates` and `/forceclose/`
* Set default chain to "mainnet" (#989)
  Eclair is now configured to run on mainnet by default.
* Set tcp client timeout to 20s (#990)
  So that it fails before the ask/api time out.
* Add bot support for code coverage (codecov) (#982)
  * Add scoverage-maven-plugin dependency
  * Update travis build to generate a scoverage report
  * Add custom codecov configuration to have nice PR comments
  * Add badge for test coverage in readme
* Accept `commit_sig` without changes (#988)
  LND sometimes sends a new signature without any changes, which is a (harmless) spec violation. Note that the test was previously not failing because it wasn't specific enough. The test now fails and has been ignored.
* Ignore subprojects eclair-node/eclair-node-gui in the codecov report (#991)
* Use bitcoind fee estimator first (#987)
  * use bitcoind fee provider first
  * set default `smooth-feerate-window`=6
  * Configuration: increase fee rate mismatch threshold
    We will accept fee rates that are up to 8x bigger or smaller than our local fee rate
* Updated license header (#992)
* Release v0.3 (#994)
  * gui: include javafx native libraries for windows, mac, linux
  * Release v0.3
  * Set version to 0.3.1-SNAPSHOT
* Improved test coverage of `io` package (#996)
  * improved test coverage of `NodeURI`
  * improved test coverage of `Peer`
* Fix TextUI
* BackupHandler: use renameTo() on Android
  Most Path methods are not available at our current API level
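
The fee rate conversion mentioned in the #936 entry of the commit message above comes down to simple arithmetic: one byte counts as 4 weight units, so 1000 weight units correspond to 250 bytes. A rough sketch in Python, where the function name is hypothetical and clamping to the 253 floor (rather than rejecting the value) is an assumption about how the check is applied:

```python
MIN_FEERATE_PER_KW = 253  # floor stated in the commit message, in satoshi/kiloweight


def feerate_byte_to_kw(sat_per_byte: int) -> int:
    """Convert satoshi/byte to satoshi/kiloweight, never going below the floor."""
    sat_per_kw = sat_per_byte * 1000 // 4  # 1 byte = 4 weight units
    return max(sat_per_kw, MIN_FEERATE_PER_KW)
```

For instance, 10 satoshi/byte becomes 2500 satoshi/kiloweight, while 1 satoshi/byte would give 250 and is raised to the 253 floor.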

pm47 added 3 commits March 8, 2019 20:39

pm47 requested a review from sstone March 12, 2019 12:33

pm47 added 3 commits March 14, 2019 12:44

Merge branch 'master' into restart-channel-updates

2e39f03

Merge branch 'restart-channel-updates' of github.com:ACINQ/eclair int…

76dc0b5

…o restart-channel-updates

more memory for tests

e103030

sstone requested changes Mar 18, 2019

fixed comment

35ffb74

sstone approved these changes Mar 18, 2019

pm47 merged commit cc3395a into master Mar 18, 2019

pm47 deleted the restart-channel-updates branch March 18, 2019 13:39

pm47 merged 7 commits into master from restart-channel-updates
