Porter - return meaningful error if there are not enough Ursulas#2772
Porter - return meaningful error if there are not enough Ursulas#2772derekpierre merged 9 commits intonucypher:mainfrom
Conversation
|
Interesting - Ran the following on Lynx. In [2]: payload = {'quantity': 20, 'duration_periods': 5}
In [4]: r = requests.get("http://localhost:9155/get_ursulas", json=payload)
In [5]: r
Out[5]: <Response [500]>
In [6]: r.text
Out[6]: "After 10 seconds and 30 rounds, didn't find 20 nodes"How did you test it? It may be related to Ibex being a bit of a shit show for nodes. |
|
I see why. On Ibex: In [16]: payload = {'quantity': 20, 'duration_periods': 1000}
In [17]: r = requests.get("http://localhost:9155/get_ursulas", json=payload)
In [18]: r
Out[18]: <Response [500]>
In [19]: r.text
Out[19]: '0 total failures recorded'The We probably have a better shot of more generally catching those exceptions instead, and raising something more informative. Analogous situation when publishing the treasure map - https://github.com/nucypher/nucypher/blob/main/nucypher/utilities/porter/porter.py#L129. |
|
@derekpierre I see, let me rewrite this. Btw. there are like 2 good Ursulas on Ibex |
b7ad9d8 to
908d4c0
Compare
|
I had a few thoughts after thinking about this some more:
Played around with modifying the WorkerPool messaging which I think was too vague to begin with for all uses of it (not just Porter) and this should help the original problem you had. See the PR I created to merge into this branch - piotr-roslaniec#1. Interested to get your thoughts. |
|
@piotr-roslaniec I fixed the failing concurrency tests due to my changes. Could you add a newsfragment? Since both of us have changes in this PR, let's get others to review. |
|
I liked the changes to the WorkerPool and I think it's a step in the right direction for error handling in Porter. I was wondering how much Porter API consumers should know about the reasons behind API call failures. For example (correct me if I'm wrong) the response will now contain WorkerPool error messages that are fairly vague to the caller ("something timed out", "something failed"). Is it the "right amount" of information? We could provide more ("Urulas timed-out"), or less (just an HTTP error code, 404). I prefer the latter (less complexity for the caller), with the assumptions that Porter API calls are "robust" or "idempotent" (API caller can just repeat the call without unintended consequences). Maybe I am overthinking this. Any thoughts? |
I hear ya. It's a tough balance - you want to return something to the client in the error case, that provides some context but doesn't overwhelm them. The "timed out" case is simple and in most cases users will understand and trying again suffices, so saying "Timed out" type message is probably sufficient. The "out of values" case is trickier because retrying doesn't necessarily mean it will work - depends on the reason - but we don't want the entire stack trace which the WorkerPool was doing previously. The current Porter API functions are idempotent, so retrying does not have unintended consequences, it's just whether you can identify the reason, and do something about it. It will also help us when people come to us and say Porter returned some status code, and we have no idea what to tell them because there is only a status code. Remember that clients that use a Porter service may not have access to Porter's logs because the service is being run by a separate entity. |
2314092 to
ba3ad64
Compare
Co-authored-by: Derek Pierre <derek.pierre@gmail.com>
…ve exception logging for web controllers.
…eptions raised. If callers need more context they can use the get_tracebacks() function on the exception to obtain all execution failure tracebacks.
…response message - ensure WorkerPoolExceptions provide access to relevant data.
ba3ad64 to
2620d55
Compare
Type of PR:
Required reviews:
What this does:
reservoir, Porter will return0 total failures recordederror fromconcurrency.pyNotes for reviewers: