Skip to content

Use default/dynamic active staker pagination size and standardize Porter execution timeouts#2732

Merged
derekpierre merged 2 commits intonucypher:porterfrom
derekpierre:timeouts
Jul 5, 2021
Merged

Use default/dynamic active staker pagination size and standardize Porter execution timeouts#2732
derekpierre merged 2 commits intonucypher:porterfrom
derekpierre:timeouts

Conversation

@derekpierre
Copy link
Copy Markdown
Member

@derekpierre derekpierre commented Jun 30, 2021

Type of PR:

  • Bugfix
  • Feature
  • Documentation
  • Other

Required reviews:

  • 1
  • 2
  • 3

What this does:
Based over #2721 (Initial Porter Documentation), so only the latest commit is additional work.

Default/Dynamic Pagination Size

When attempting to get active stakers from StakingEscrow as part of sampling logic, this eth_call can be quite heavy due to network size, and can sometimes timeout

In [51]: r = requests.get("http://127.0.0.1:9155/get_ursulas", json=payload)

In [52]: r
Out[52]: <Response [500]>

In [53]: r.text
Out[53]: "{'code': -32000, 'message': 'execution aborted (timeout = 5s)'}"

This causes Porter to sometimes fail when attempting to sample ursulas. The timeout occurs intermittently, probably because of some caching done by the underlying eth node. In my testing, I've been using Infura which uses Geth. Geth has a hard-coded timeout of 5s for contract calls - https://github.com/ethereum/go-ethereum/blob/master/internal/ethapi/api.go#L950.

I assume that since Porter and Alice use the same code, and since Alice doesn't specify a pagination size either, Alice will encounter this same problem. So instead of just fixing it for Porter, I made a generalized default solution.

By default, the pagination size used is 0 (unlimited) if no pagination size is specified which is currently the case for both Porter and Alice when creating the staker reservoir. Therefore the call attempts to obtain all active stakers, which is a heavy call and therefore sometimes times out. Instead, I propose that we use a default size if no pagination value (i.e. pagination_size is None) is provided to the API - 0 will still hold if explicitly provided as the pagination size.

The value of this size is tricky and can vary by node types (geth etc.), and whether the node is a light node or not. Consequently, the pagination size for regular and light nodes have default values that can be overridden via environment variables (1000 for regular node, 30 for light node). The environment variable provides flexibility for us not to have to determine the "correct" value for a variety of scenarios.

Additionally, to provide some dynamism, if there is a failure in getting the active stakers based on the magnitude of the pagination size used, the size is reduced by half and then retried - process repeated for a maximum of 3 total attempts.

For example, if the pagination size value used is 1000

  • try pagination size of 1000 (attempt 1.)
  • if 1000 fails, retry with pagination size 500 (attempt 2.)
  • if 500 fails, retry with pagination size 250 (attempt 3.)

if attempt 3 fails, then raise failure exception. Note that this sequence will be attempted for any non-zero pagination size provided, including defaults.

Porter Standardized Execution Timeout

Standardized execution timeout used by Porter. All executions that take a timeout will use this default timeout (10s). It felt weird some executions using different values than others.

Issues fixed/closed:
Related to #1424.

Notes for reviewers:

  • Thoughts about the environment variable approach?
  • Thoughts about how dynamic to make the retries and pagination size? - I went for simplicity
  • Thoughts on Porter default execution timeout, as opposed to using default timeout values where available.

…kingEscrow, and standardize porter execution timeouts.
@derekpierre derekpierre added Web Webpages Alice 👩 Effects the "Alice" development area ux design User experience enhancements labels Jun 30, 2021
@derekpierre derekpierre added this to the Porter v1 (MVP) milestone Jun 30, 2021
@derekpierre derekpierre self-assigned this Jun 30, 2021
@KPrasch KPrasch changed the base branch from porter to main July 3, 2021 23:04
@KPrasch KPrasch changed the base branch from main to porter July 3, 2021 23:04
start_index,
pagination_size).call()
except Exception as e:
if 'timeout = 5s' not in str(e):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit fragile, since it depends on the exception containing this string exactly. Not an RFC, just commentary.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear ya - feels like we can do one of the following:

  • simply check for the word "timeout" instead of including the length (it is currently a geth setting that can't be set)
    OR
  • don't perform any specific check on the text, and assume that any exception is caused by the timeout (i.e. remove the "if" check')
    OR
    ...?

@KPrasch
Copy link
Copy Markdown
Member

KPrasch commented Jul 3, 2021

Thoughts about the environment variable approach?

Seems like a good approach for a minimal configuration. Let's not overthink it for now until we start getting some actual usage.

Thoughts about how dynamic to make the retries and pagination size? - I went for simplicity

Simplicity is the best principal at this stage of Porter's development 👍🏻

Thoughts on Porter default execution timeout, as opposed to using default timeout values where available.

Same as above. Use a high timeout by default, ans let usage dictate optimization.

@derekpierre derekpierre requested a review from vepkenez July 3, 2021 23:45
@derekpierre derekpierre merged commit 25c406f into nucypher:porter Jul 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Alice 👩 Effects the "Alice" development area ux design User experience enhancements Web Webpages

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants