repos: log corruption events per repo by burmudar · Pull Request #45667 · sourcegraph/sourcegraph-public-snapshot

burmudar · 2022-12-14T17:30:02Z

Adds corruptedAt and corruption_log columns. The corruption_log column is a JSON array of objects of the form { ts: "<timestamp>", reason: "description of the corruption" }.

When does corruption get logged?

- When the janitor finds a git config value sourcegraph.maybeCorrupt in the repo. This is classified as "git corrupted repo detected". This config value gets set by the server while handling an exec request.
- When the janitor inspects the repo and doesn't find a HEAD or the repo isn't a bare repo. This is classified as "sourcegraph detected repo corruption".
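The two cases above can be sketched as a small classification function. This is an illustration only: `repoState` and `classifyCorruption` are hypothetical names, the struct stands in for whatever the janitor actually reads from disk, and checking the config flag before the structural checks is an assumed ordering. The two reason strings are the ones quoted in the description.

```go
package main

import "fmt"

// repoState is a hypothetical summary of what the janitor observed for a repo.
type repoState struct {
	MaybeCorruptFlag bool // git config value sourcegraph.maybeCorrupt is set
	HasHEAD          bool // a HEAD was found in the repo
	IsBare           bool // the repo is a bare repo
}

// classifyCorruption returns the reason to log and whether the repo counts
// as corrupt. The precedence of the checks is an assumption.
func classifyCorruption(s repoState) (reason string, corrupt bool) {
	switch {
	case s.MaybeCorruptFlag:
		return "git corrupted repo detected", true
	case !s.HasHEAD || !s.IsBare:
		return "sourcegraph detected repo corruption", true
	default:
		return "", false
	}
}

func main() {
	states := []repoState{
		{MaybeCorruptFlag: true, HasHEAD: true, IsBare: true},
		{HasHEAD: false, IsBare: true},
		{HasHEAD: true, IsBare: true},
	}
	for _, s := range states {
		reason, corrupt := classifyCorruption(s)
		fmt.Printf("corrupt=%t reason=%q\n", corrupt, reason)
	}
}
```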

How do we know the repo isn't currently corrupted?

By looking at the corruptedAt value. If the repo is currently regarded as corrupted, the time will be non-zero.

When does the corruptedAt value get reset?

- Repo is cloned / Status is updated
- Repo is recloned/fetched

Part of: https://github.com/sourcegraph/sourcegraph/issues/43445

Questions

Should this be added to the repo_statistics?

Follow up:

- Emit a metric for when a repo is corrupted
- Make it visible on the frontend

Test plan

Unit tests

sourcegraph-bot · 2022-12-15T10:49:12Z

Not notifying subscribers because the number of notifying subscribers (23) has exceeded the threshold (10).

keegancsmith · 2022-12-15T14:14:08Z

I'm not sure I understand the WHY here of introducing extra stuff on the request path. Looking at the linked issue this is saying that when we find a corrupt repo we show it in the UI / increment a metric. So setting a boolean won't really help here since we will then reclone the repo via the janitor job and likely not spot the corrupt repo. Seems like we need some sort of event/audit table which has an entry each time a repo is corrupted.

The minimal change for that would be the janitor job (which runs regularly) can report finding a corrupt repo to a DB somewhere which can then be surfaced in the UI. Changing how and when we do corruption checks seems orthogonal to the ask.

burmudar · 2022-12-15T14:26:20Z

> I'm not sure I understand the WHY here of introducing extra stuff on the request path. Looking at the linked issue this is saying that when we find a corrupt repo we show it in the UI / increment a metric. So setting a boolean won't really help here since we will then reclone the repo via the janitor job and likely not spot the corrupt repo. Seems like we need some sort of event/audit table which has an entry each time a repo is corrupted.
>
> The minimal change for that would be the janitor job (which runs regularly) can report finding a corrupt repo to a DB somewhere which can then be surfaced in the UI. Changing how and when we do corruption checks seems orthogonal to the ask.

Yeah that makes sense - I added it here since that was where repo corruption was indicated previously, but you're right that it gets cleaned up pretty quickly so you won't see it (which I was wondering about 😆 ). I'll create a new event table and add the details there 👍🏼

mrnugget · 2022-12-19T07:13:30Z

> The minimal change for that would be the janitor job (which runs regularly)

I agree with that being the place where we set corrupted_at (that was my original idea the first time we talked about it, I think, because I vaguely remember that a janitor job already does some corruption checking), but I'm not sure we need a new events table.

What's wrong with setting the corrupted_at to a value and then setting it to null once it's repaired? As far as I understand the users only want to know whether a repo is corrupted or not and if it is, they want to hit that reclone button. That should "fix" the repo and unset the corruption state, right?

burmudar · 2022-12-19T07:34:20Z

> The minimal change for that would be the janitor job (which runs regularly)
>
> I agree with that being the place where we set corrupted_at (that was my original idea the first time we talked about it, I think, because I vaguely remember that a janitor job already does some corruption checking), but I'm not sure we need a new events table.
>
> What's wrong with setting the corrupted_at to a value and then setting it to null once it's repaired? As far as I understand the users only want to know whether a repo is corrupted or not and if it is, they want to hit that reclone button. That should "fix" the repo and unset the corruption state, right?

They also want to see the log messages captured for the corrupt repo which I think will help them notice a problematic repo that constantly gets corrupted.

I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?

mrnugget · 2022-12-19T08:18:25Z

> I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?

That sounds really complicated. The gitserver_repos and repo state machine itself is already complicated and adding a separate table that records state over time strikes me as something we need to be really careful about.

Wouldn't a two-column solution solve the same problem? corrupted_at and corruption_log or something?

burmudar · 2022-12-19T08:42:19Z

> I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?
>
> That sounds really complicated. The gitserver_repos and repo state machine itself is already complicated and adding a separate table that records state over time strikes me as something we need to be really careful about.
>
> Wouldn't a two-column solution solve the same problem? corrupted_at and corruption_log or something?

Yeah that should be loads easier 🤔 Thanks! It also solves other stuff I was worried about which I won't mention for brevity!

keegancsmith · 2022-12-19T09:25:59Z

> What's wrong with setting the `corrupted_at` to a value and then setting it to null once it's repaired?

If I am not mistaken, we will pretty quickly reclone the repo so I'm unsure the reclone button is useful since it is likely already recloning. Then the other concern is they won't know that a certain repo keeps getting corrupted due to the quick nature of recloning (except of course monorepos). If you read the linked issue, the RFE is really just log output that is easy to find so admins can dig in (I say that without double checking the attached message since I am looking at email while on holiday). But I do think we can do better than that if the complexity minimisation and engineering capacity can be satisfied.

mrnugget · 2022-12-19T09:32:43Z

> If I am not mistaken, we will pretty quickly reclone the repo so I'm unsure the reclone button is useful since it is likely already recloning.

The RFE says:

> https://github.com/sourcegraph/accounts/issues/6716 found a corrupt repository weeks after it was initially corrupt. Initially, it looked like the repository was corrupt after an upgrade.

So it doesn't look like the repository is quickly recloned and fixed.

> the RFE is really just log output that is easy to find so admins can dig in

Well, historically logs haven't solved any admin problems 😄 Also this part of the RFE is likely important: "show log events in the UI" -- we already log when a repo is corrupted, which is how they found out about it, but what they want is to be able to see it in the UI.

But we can make sure that what @burmudar is building here is actually solving a problem: we can corrupt a repo and see whether it shows up as corrupted or whether it will be quickly fixed. Then we can record a video and ask Mike whether that solves the problem and then he can show it to customers.

burmudar · 2022-12-20T13:40:00Z

@mrnugget @keegancsmith PTAL: I've reworked this to have a corruption_log column which kind of works like the events table (TY @jhchabran for the idea) - check the PR description for details.

unknwon

Drive-by review

burmudar · 2022-12-21T14:11:51Z

Current dependencies on/for this PR:

main
- PR repos: log corruption events per repo #45667 👈
  - PR repos: check for corruption during backgroud repo update #46014
    - PR Expose corruptedAt and CorruptionLog on Graphql API #45988
      - PR show repo corruption events #46004

This comment was auto-generated by Graphite.

sourcegraph-bot · 2023-01-09T14:47:24Z

Codenotify: Notifying subscribers in CODENOTIFY files for diff b1c32e9...198a4fb.

Notify	File(s)
@eseliger	internal/database/repos_test.go
@indradhanush	cmd/gitserver/server/cleanup.go cmd/gitserver/server/server.go cmd/gitserver/server/server_test.go
@ryanslade	cmd/gitserver/server/cleanup.go cmd/gitserver/server/server.go cmd/gitserver/server/server_test.go
@sashaostrikov	cmd/gitserver/server/cleanup.go cmd/gitserver/server/server.go cmd/gitserver/server/server_test.go
@unknwon	internal/database/repos_test.go

burmudar · 2023-01-09T15:24:30Z

⏳ This pull request is set to merge as part of a Graphite merge job
Stack job ID: hSUUYP6wzZj7kdyI1hms.
See details on graphite.dev

burmudar requested review from keegancsmith, mrnugget and ryanslade December 14, 2022 17:30

burmudar self-assigned this Dec 14, 2022

cla-bot Bot added the cla-signed label Dec 14, 2022

eseliger reviewed Dec 14, 2022

burmudar commented Dec 14, 2022

Comment thread internal/database/schema.md Outdated

burmudar marked this pull request as ready for review December 15, 2022 10:47

burmudar requested a review from eseliger December 15, 2022 12:25

burmudar changed the title ~~repos: set corruptedAt time when gitserver detects a repo is corrupted~~ repos: log corruption events per repo Dec 20, 2022

unknwon reviewed Dec 21, 2022

Comment thread migrations/frontend/squashed.sql Outdated

Comment thread internal/database/gitserver_repos.go Outdated

mrnugget reviewed Dec 21, 2022

Comment thread cmd/gitserver/server/cleanup.go Outdated

Comment thread internal/database/gitserver_repos.go Outdated

Comment thread internal/database/gitserver_repos_test.go Outdated

burmudar force-pushed the wb/repo/corruption branch from 7ebe3bf to 5326607 Compare December 21, 2022 14:11

burmudar marked this pull request as draft December 21, 2022 16:52

burmudar marked this pull request as ready for review December 21, 2022 16:52

burmudar requested review from mrnugget and unknwon December 21, 2022 17:19

burmudar mentioned this pull request Dec 28, 2022

Expose corruptedAt and CorruptionLog on Graphql API #45988

Merged

burmudar force-pushed the wb/repo/corruption branch from 0e55728 to 430a2c7 Compare December 28, 2022 15:13

burmudar added 23 commits January 9, 2023 15:08

store plumbing for SetCorruptedAt

d6dfc38

debug

3d8a0df

move to markIfCorrupted + test

16e1d84

add additional checks for repo corruption

0a519dd

add test cases for server repo corruption detection

974dd16

add migration

fc02b5f

update types for CorrruptedAt

334e797

fix sql query and remove field for now

7483c75

review comments

9b1658b

review feedback

0106d93

- check for err - remove reason - set default timestamp to 0 value

reset stiched migration graph

d910b46

add corruption_lod column and rename migration

842fbac

go gen

75c24ff

add logCorruption method

cada296

- log repo corruption into a JSON array - limit repo corruption log to 10 elements - db tests

fix wording and remove test

9f14ff7

Fix more indentation 🤦🏻

62c59d5

review feedback: update test

f13ed6e

run sg generate again

48ee040

test fixes

c384e3f

- warn instead of return an error during cleanup - delete repos so that the name can be reused

fix bug

f84882f

- repo was corrupt but did not set the reason so the repo never got cleaned up

review feedback

383a7b3

- sql query indentation

review feedback

be1bc85

- use assert module

do not clear corruption status on err

885fc19

burmudar force-pushed the wb/repo/corruption branch from f96852b to 885fc19 Compare January 9, 2023 13:12

add test for LastError and corruptedAt

198a4fb

- fix SQL issue on LastError

burmudar merged commit 7dbfb87 into main Jan 9, 2023

burmudar deleted the wb/repo/corruption branch January 9, 2023 15:24

burmudar mentioned this pull request Feb 15, 2023

Mark a repo as corrupted if it fetch / clone fails as a result of repo corruption #42213

Closed

4 tasks
