Use default/dynamic active staker pagination size and standardize Porter execution timeouts#2732
Conversation
…kingEscrow, and standardize porter execution timeouts.
nucypher/blockchain/eth/agents.py
Outdated
| start_index, | ||
| pagination_size).call() | ||
| except Exception as e: | ||
| if 'timeout = 5s' not in str(e): |
There was a problem hiding this comment.
This seems a bit fragile, since it depends on the exception containing this string exactly. Not an RFC, just commentary.
There was a problem hiding this comment.
I hear ya - feels like we can do one of the following:
- simply check for the word "timeout" instead of including the length (it is currently a geth setting that can't be set)
OR - don't perform any specific check on the text, and assume that any exception is caused by the timeout (i.e. remove the "if" check')
OR
...?
Seems like a good approach for a minimal configuration. Let's not overthink it for now until we start getting some actual usage.
Simplicity is the best principal at this stage of Porter's development 👍🏻
Same as above. Use a high timeout by default, ans let usage dictate optimization. |
Type of PR:
Required reviews:
What this does:
Based over #2721 (Initial Porter Documentation), so only the latest commit is additional work.
Default/Dynamic Pagination Size
When attempting to get active stakers from StakingEscrow as part of sampling logic, this eth_call can be quite heavy due to network size, and can sometimes timeout
This causes Porter to sometimes fail when attempting to sample ursulas. The timeout occurs intermittently, probably because of some caching done by the underlying eth node. In my testing, I've been using Infura which uses Geth. Geth has a hard-coded timeout of 5s for contract calls - https://github.com/ethereum/go-ethereum/blob/master/internal/ethapi/api.go#L950.
I assume that since Porter and Alice use the same code, and since Alice doesn't specify a pagination size either, Alice will encounter this same problem. So instead of just fixing it for Porter, I made a generalized default solution.
By default, the pagination size used is 0 (unlimited) if no pagination size is specified which is currently the case for both Porter and Alice when creating the staker reservoir. Therefore the call attempts to obtain all active stakers, which is a heavy call and therefore sometimes times out. Instead, I propose that we use a default size if no pagination value (i.e.
pagination_sizeisNone) is provided to the API - 0 will still hold if explicitly provided as the pagination size.The value of this size is tricky and can vary by node types (geth etc.), and whether the node is a light node or not. Consequently, the pagination size for regular and light nodes have default values that can be overridden via environment variables (1000 for regular node, 30 for light node). The environment variable provides flexibility for us not to have to determine the "correct" value for a variety of scenarios.
Additionally, to provide some dynamism, if there is a failure in getting the active stakers based on the magnitude of the pagination size used, the size is reduced by half and then retried - process repeated for a maximum of 3 total attempts.
For example, if the pagination size value used is 1000
if attempt 3 fails, then raise failure exception. Note that this sequence will be attempted for any non-zero pagination size provided, including defaults.
Porter Standardized Execution Timeout
Standardized execution timeout used by Porter. All executions that take a timeout will use this default timeout (10s). It felt weird some executions using different values than others.
Issues fixed/closed:
Related to #1424.
Notes for reviewers: