repos: log corruption events per repo#45667
Conversation
|
Not notifying subscribers because the number of notifying subscribers (23) has exceeded the threshold (10). |
|
I'm not sure I understand the WHY here of introducing extra stuff on the request path. Looking at the linked issue this is saying that when we find a corrupt repo we show it in the UI / increment a metric. So setting a boolean won't really help here since we will then reclone the repo via the janitor job and likely not spot the corrupt repo. Seems like we need some sort of event/audit table which has an entry each time a repo is corrupted. The minimal change for that would be the janitor job (which runs regularly) can report finding a corrupt repo to a DB somewhere which can then be surfaced in the UI. Changing how and when we do corruption checks seems orthogonal to the ask. |
Yeah that makes sense - I added it here since that was where repo corruption was indicated previously, but you're right in the fact that it gets cleanup pretty quickly so you won't see it (which I was wondering about 😆 ). I'll create a new event table and add the details there 👍🏼 |
I agree with that being the place where we set What's wrong with setting the |
They also want to see the log messages captured for the corrupt repo which I think will help them notice a problematic repo that constantly gets corrupted. I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT? |
That sounds really complicated. The Wouldn't a two-column solution solve the same problem? |
Yeah that should be loads easier 🤔 Thanks! It also solves other stuff I was worried about which I won't mention for brevity! |
|
What's wrong with setting the `corrupted_at` to a value and then
setting it to null once it's repaired?
If I am not mistaken, we will pretty quickly reclone the repo so
I'm unsure the reclone button is useful since it is likely already
recloning. Then the other concern is they won't know that a
certain repo keeps getting corrupted due to the quick nature of
recloning. (except of course monorepos).
If you read the linked issue, the RFE is really just log output that is
easy to find so admins can dig in (I say that without double checking
the attached message since I am looking at email while on holiday). But
I do think we can do better than that if the complexity minimisation and
engineering capacity can be satisified.
|
The RFE says:
So it doesn't look like the repository is quickly recloned and fixed.
Well, historically logs haven't solved any admin problems 😄 Also this part of the RFE is likely important: "show log events in the UI" -- we already log when a repo is corrupted, which is how they found out about it, but what they want is to be able to see it in the UI. But we can make sure that what @burmudar is building here is actually solving a problem: we can corrupt a repo and see whether it shows up as corrupted or whether it will be quickly fixed. Then we can record a video and ask Mike whether that solves the problem and then he can show it to customers. |
|
@mrnugget @keegancsmith PTAL: I've reworked this to have a |
7ebe3bf to
5326607
Compare
|
Current dependencies on/for this PR:
This comment was auto-generated by Graphite. |
0e55728 to
430a2c7
Compare
- check for err - remove reason - set default timestamp to 0 value
- log repo corruption into a JSON array - limit repo corruption log to 10 elements - db tests
- warn instead of return an error during cleanup - delete repos so that the name can be reused
- sql query indentation
- use assert module
f96852b to
885fc19
Compare
- fix SQL issue on LastError
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff b1c32e9...198a4fb.
|
|
⏳ This pull request is set to merge as part of a Graphite merge job |

Adds corrputedAt and corruption_log columns. The corruption_log column is a JSON array that contains a object with the form
{ ts: "<timestamp>", rearon: "description of the corruption"}.When does corruption get logged?
sourcegraph.maybeCorruptin the repo. This classified as "git corrupted repo detected". This config value gets set by the server while handling aexecrequestHow do we know the repo isn't currently corrupted?
By looking at the
corruptedAtvalue. If the repo is currently regarded as corrupted the time will be non zero.When does the corruptedAt value get reset?
Part of: https://github.com/sourcegraph/sourcegraph/issues/43445
Questions
Follow up:
Test plan
Unit tests