PermsSyncer: report exact state of each provider in each job#44257
Conversation
|
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
unknwon
left a comment
There was a problem hiding this comment.
The approach does seem to make sense to me, will do a thorough review once marked as ready 👍
There was a problem hiding this comment.
I don't see any write to providerStates in this method. Do you plan to add it?
There was a problem hiding this comment.
There are several writes, e.g. on line 843:
providerStates = append(providerStates, providerState{There was a problem hiding this comment.
Should we add a log.Debug statement that would give us the provider states even if there is no error? E.g. something like "All providers succeeded"?
850a68d to
9d688b9
Compare
aa65fb7 to
764cca8
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 9322c58...e09ee8e.
|
764cca8 to
1e27ce1
Compare
9d688b9 to
1b5569d
Compare
638f992 to
63c8511
Compare
There was a problem hiding this comment.
I think we should just declare a const type ProviderState, and use the type everywhere ProviderStateError/ProviderStateSuccess
There was a problem hiding this comment.
We would need one for the job state as well, which feels a bit duplicative, and I'm not sure if there's a naming that would make this suitable for both 🤔
There was a problem hiding this comment.
I will leave this out for now: setting and checking are both internal implementation details, there are no consumers of this yet. Can revisit in a follow-up if the need arises!
18c457a to
c408d9c
Compare
c408d9c to
e09ee8e
Compare
Tracks perms sync job outcomes, notably per-provider states (#44257), in Redis. We set an aggressive TTL of 5 minutes by default because in practice, only the details for the last few sync jobs are relevant, and this prevents us from putting too much pressure on Redis on larger instances. This can be extended in the future to be database-backed, but for now I think Redis is the right choice - the volume of entries can be quite high when accumulated so we want something easy to expire, and realistically only the details for the last few sync jobs are relevant. Additionally, the records store and reader in package synclogs can easily be rewritten without too much impact to write to a database in the future.

Right now there are three ways to try and determine if an authz/authorization provider is happy, each with caveats:
PermsSyncerjob results, which only reports the overall state of a sync job run in a singleerror. It sometimes doesn't report any error from the provider at all, swallowing the provider's error with special handlingPermsSyncermetrics and logs. This gets us ~almost there, but is difficult to programmatically assess, requires infrastructure access, and requires knowledge of howPermsSynceris implemented.This makes assessing the health of authz difficult when:
This change propagates the "real" state of each individual provider from the various
runSyncimplementations, so that we can see the state of each provider involved in a permissions sync job. As a simple proof-of-concept, I've added a log field that lists the providers that failed when a job fails, and added some assertions to a few tests to demonstrate what the output looks like.On top of this, we can then implement a cache that tracks recent permission sync job outcomes: #44258, which can then open the door to allowing us to render these results via the GraphQL API, similarly to the external service sync status API.
Part of https://github.com/sourcegraph/customer/issues/1534
Stacked on https://github.com/sourcegraph/sourcegraph/pull/44251
Test plan
Updated unit and integration tests pass