Skip to content

Prevent PolykeyAgent crashes from connection failures #592

@tegefaulkes

Description

@tegefaulkes

Specification

reffer to this comment #551 (comment)

In the service of making the PolykeyAgent more robust, we need to prevent connection failures from causing the PolykeyAgent to crash. This means that any connection or stream done in the background needs to be handled if it fails. Any errors need to be caught handled and possibly logged depending on the level of severity. There are some places where this needs to be handled.

  1. The NodeConnectionManager needs to handle any connection failures gracefully. Any connection failures that happen internally should not bubble up to the top.
  2. Stream handling needs to handle any errors. I think the RPC should be gracefully handing stream failures so errors shouldn't come from the streams themselves.
  3. Any connection related methods in the NodeConnectionManager such as getClosestNodes and findNodes should not throw anything by design. These need to be checked.
  4. Anything that uses nodeConnectionManager.withConnF needs to deal with connection failures at any time. Remember a connection can fail at any time for any reason. Just having access to a connection doesn't mean it's always safe to use.

On reflection on the problem. The stream factory is throwing and we need to handle that condition. That could be the long and short of it.

We need to add more testing for race conditions. Where normal operations are done concurrently with the connection ending. This should be a NodeConnection and NodeConnectionManager level test.

In two places we are using a hard coded magic number as an error code for closing a QUICConnection. We need to make a nodes domain enum for QUICConnection error codes.

On reflection there is only 1 place where a QUICConnection is force destroyed directly and that's the NodeConnection.destroy method. Two new enums need to be created in place of the hard coded reason and code used here.

enum ConnectionErrorCode {
  ForceClose = 1,
}

enum ConnectionErrorReason {
  ForceClose = 'NodeConnection is forcing destruction',
}

No new errors are made for this, We don't throw any error locally in this case and quic has it's own errors for connections that ended with an error.

Additional context

Tasks

  • 1. Prevent any connection failures in the background from bubbling up to the top of the program. The NCM should never throw when a connection fails.
  • 2. expand tests to include connection failures and concurrent connection failures.
  • 3. Remove magic number error codes from the nodes domain and use an enum to get the code and reason. all forced connection stops should use this.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions