Skip to content

Retry errors with subproblems in obtain_certificate with --allow-subset-of-names (#9251)#9272

Merged
alexzorin merged 24 commits into
certbot:masterfrom
pendo-io:9251-handle-caa-failure-on-renew
Jun 8, 2022
Merged

Retry errors with subproblems in obtain_certificate with --allow-subset-of-names (#9251)#9272
alexzorin merged 24 commits into
certbot:masterfrom
pendo-io:9251-handle-caa-failure-on-renew

Conversation

@JamesBalazs

@JamesBalazs JamesBalazs commented Apr 7, 2022

Copy link
Copy Markdown
Contributor

This is an attempt at retrying when subproblems are present on the error raised by obtain_certificate.

According to the spec:

ACME clients may choose to use the "identifier" field of a subproblem
as a hint that an operation would succeed if that identifier were
omitted

so here we can use acme.Identifier.values present in the subproblems, to remove failed domains from the call to obtain_certificate, to retry.

The first pass at this contains little in the way of checks to ensure the ACME error is handled correctly. I'm assuming we might want to check:

  • acme.Identifier.type is an IDENTIFIER_FQDN or IDENTIFIER_IP
  • there are no other subproblems that aren't CAA errors (or other errors that we think could succeed if the identifier is removed from the domains we're trying to renew)
  • the code of the top-level error

But I would like some feedback on what errors we expect to be worth retrying, and which checks are necessary to ensure it is safe to retry (so we're not just wasting retry attempts that may be limited by a quota in LetsEncrypt).

Pull Request Checklist

  • If the change being made is to a distributed component, edit the master section of certbot/CHANGELOG.md to include a description of the change being made.
  • Add or update any documentation as needed to support the changes in this PR.
  • Include your name in AUTHORS.md if you like.

@JamesBalazs

JamesBalazs commented Apr 19, 2022

Copy link
Copy Markdown
Contributor Author

Hey @bmw, do you have any advice RE the checks I suggested in the PR description?

We should definitely make sure a retry is likely to succeed before retrying, but I don't know exactly what errors are going to be recoverable (or if I should just stick to CAA errors).

Currently this rescues any ACME error, and retries with the domains minus the identifiers of any subproblems that might exist, which is almost certainly too general.

@bmw

bmw commented Apr 26, 2022

Copy link
Copy Markdown
Member

Hey @JamesBalazs. Sorry for the delayed response here. I've been busy.

Three high level thoughts:

  1. I think we should probably implement this at a lower level than renewal.py. I think with the way --allow-subset-of-names works in master, the flag works when obtaining a certificate for the first time. I think by catching the problem in renewal.py, we're only catching this problem on "renewal" and not initial issuance.
  2. As for what errors are worth retrying, my initial thought is we should be pretty aggressive about retrying here. That initial idea would be to do something like as long as it errors out with subproblems and that have at least 1 identifier field, remove that identifier and try again. A slightly more conservative approach would be to only do this if all subproblems have an identifier.
  3. I think we should add a changelog entry about this although you may want to wait to write this until we've nailed down the behavior here.

It sounds like you think (2) is too aggressive though. Why is that? What types of problems are you worried about here? I certainly could be overlooking something.

@JamesBalazs

Copy link
Copy Markdown
Contributor Author

Hey @bmw, in response to your points:

  1. Does obtain_certificate sound like the right place to move the catch / retry logic? It looks like that is called both on issuance of a new cert, and renewal. Presumably I can just try / catch around the call to obtain_certificate_from_csr then have obtain_certificate call itself to retry, similar to how allow_subset_of_names already works in that method, if there's at least one subproblem with an identifier.
  2. We have been quite cautious internally about retrying with LetsEncrypt due to a limit on retries for the same list of domains. That doesn't apply here as the list of domains will change for the retry, so I'm happy to be more aggressive with retrying if there aren't any other retry limits we could expect to run into.
  3. Yeah definitely - I was going to wait until the behaviour was better defined and we're closer to something mergeable before adding to the changelog. Just to reduce the chances of conflicts, the current version changing while this is in progress etc.

Thanks for taking a look at this!

@bmw

bmw commented Apr 26, 2022

Copy link
Copy Markdown
Member
  1. That sounds exactly right to me.
  2. OK cool. I think we should do the most aggressive version of what I described in my previous post and retry if there is at least one subproblem with an identifier. This feels safe to me based on Let's Encrypt's current behavior, it gives Certbot the best chance to issue a certificate for at least some domains, and seems appropriate based on this text from the RFC:

ACME clients may choose to use the "identifier" field of a subproblem
as a hint that an operation would succeed if that identifier were
omitted.

  1. SGTM!

@JamesBalazs JamesBalazs force-pushed the 9251-handle-caa-failure-on-renew branch from 5bda47c to 2484239 Compare May 13, 2022 09:50
@JamesBalazs

Copy link
Copy Markdown
Contributor Author

@bmw I've implemented the change we discussed along with a simple test for the behaviour in client_test.py.
Let me know if there's anything I've missed or any extra tests are needed for this.

@JamesBalazs JamesBalazs changed the title Handle CAA failure on finalize_order during renewal (#9251) Retry errors with subproblems in obtain_certificate_from_csr with --allow-subset-of-names (#9251) May 13, 2022
@alexzorin

Copy link
Copy Markdown
Collaborator

I'll review this soon :) !

@alexzorin alexzorin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Thanks.

Other than my feedback here, I'd like to:

  • try get subproblem support into Pebble (PR written)
  • double-check that Boulder is using subproblems in every instance where they reject identifiers. (it does!)

Comment thread certbot/certbot/_internal/client.py
Comment thread certbot/certbot/_internal/client.py Outdated
Comment thread certbot/certbot/_internal/client.py Outdated
@JamesBalazs JamesBalazs changed the title Retry errors with subproblems in obtain_certificate_from_csr with --allow-subset-of-names (#9251) Retry errors with subproblems in obtain_certificate with --allow-subset-of-names (#9251) May 17, 2022
@JamesBalazs JamesBalazs requested a review from alexzorin May 17, 2022 10:45

@alexzorin alexzorin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am super sorry about leaving this for so long. Here is a further review.

Comment thread certbot/certbot/_internal/client.py Outdated
Comment thread certbot/certbot/_internal/client.py Outdated
Comment thread certbot/CHANGELOG.md Outdated
@JamesBalazs JamesBalazs requested a review from alexzorin May 26, 2022 15:11

@alexzorin alexzorin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very close. The actual behavior seems to be exactly right now.

With the unit tests, I've provided a lot of feedback, but only because I know that trying to make sense of how they work is super daunting.

Comment thread certbot/tests/client_test.py Outdated
Comment thread certbot/tests/client_test.py
Comment thread certbot/tests/client_test.py
Comment thread certbot/tests/client_test.py Outdated
Comment thread certbot/tests/client_test.py
Comment thread certbot/certbot/_internal/client.py Outdated
@JamesBalazs JamesBalazs requested a review from alexzorin May 27, 2022 11:28
Comment thread certbot/tests/client_test.py Outdated
Comment thread certbot/certbot/_internal/client.py Outdated
@JamesBalazs JamesBalazs requested a review from alexzorin June 6, 2022 10:24

@alexzorin alexzorin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just some small nits, but this looks great. Thanks for sticking with it!

Comment thread certbot/CHANGELOG.md
Comment thread certbot/certbot/_internal/client.py Outdated
Comment thread certbot/certbot/_internal/client.py Outdated
Comment thread certbot/certbot/_internal/client.py
Comment thread certbot/certbot/_internal/client.py
@JamesBalazs JamesBalazs requested a review from alexzorin June 8, 2022 08:21

@alexzorin alexzorin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks. Sorry we missed the release by half a day 😆 .

@alexzorin alexzorin merged commit a73a86b into certbot:master Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants