Skip to content
This repository was archived by the owner on Sep 30, 2024. It is now read-only.

repos: log corruption events per repo#45667

Merged
burmudar merged 72 commits into
mainfrom
wb/repo/corruption
Jan 9, 2023
Merged

repos: log corruption events per repo#45667
burmudar merged 72 commits into
mainfrom
wb/repo/corruption

Conversation

@burmudar

@burmudar burmudar commented Dec 14, 2022

Copy link
Copy Markdown
Contributor

Adds corrputedAt and corruption_log columns. The corruption_log column is a JSON array that contains a object with the form { ts: "<timestamp>", rearon: "description of the corruption"}.

When does corruption get logged?

  • When the janitor finds a git config value sourcegraph.maybeCorrupt in the repo. This classified as "git corrupted repo detected". This config value gets set by the server while handling a exec request
  • When the janitor inspects the repo and doesn't find a HEAD or the repo isn't a bare repo. This is classified as "sourcegraph detected repo corruption"

How do we know the repo isn't currently corrupted?

By looking at the corruptedAt value. If the repo is currently regarded as corrupted the time will be non zero.

When does the corruptedAt value get reset?

  • Repo is cloned / Status is updated
  • Repo is recloned/fetched

Part of: https://github.com/sourcegraph/sourcegraph/issues/43445

Questions

  • Should this be added to the repo_statistics?

Follow up:

  • Emit a metric for when a repo is corrupted
  • Make it visible on the frontend

Test plan

Unit tests

@burmudar burmudar self-assigned this Dec 14, 2022
@cla-bot cla-bot Bot added the cla-signed label Dec 14, 2022
Comment thread migrations/frontend/1670934184_add_repo_corrupted_at/up.sql Outdated
Comment thread internal/database/gitserver_repos_test.go Outdated
Comment thread internal/database/gitserver_repos_test.go Outdated
Comment thread internal/types/types.go Outdated
Comment thread internal/database/gitserver_repos_test.go Outdated
Comment thread internal/database/gitserver_repos.go Outdated
Comment thread cmd/gitserver/server/cleanup.go Outdated
Comment thread cmd/gitserver/server/cleanup.go Outdated
Comment thread cmd/gitserver/server/server.go Outdated
Comment thread internal/database/schema.md Outdated
@burmudar burmudar marked this pull request as ready for review December 15, 2022 10:47
@sourcegraph-bot

sourcegraph-bot commented Dec 15, 2022

Copy link
Copy Markdown
Contributor

Not notifying subscribers because the number of notifying subscribers (23) has exceeded the threshold (10).

@burmudar burmudar requested a review from eseliger December 15, 2022 12:25
@keegancsmith

Copy link
Copy Markdown
Member

I'm not sure I understand the WHY here of introducing extra stuff on the request path. Looking at the linked issue this is saying that when we find a corrupt repo we show it in the UI / increment a metric. So setting a boolean won't really help here since we will then reclone the repo via the janitor job and likely not spot the corrupt repo. Seems like we need some sort of event/audit table which has an entry each time a repo is corrupted.

The minimal change for that would be the janitor job (which runs regularly) can report finding a corrupt repo to a DB somewhere which can then be surfaced in the UI. Changing how and when we do corruption checks seems orthogonal to the ask.

@burmudar

Copy link
Copy Markdown
Contributor Author

I'm not sure I understand the WHY here of introducing extra stuff on the request path. Looking at the linked issue this is saying that when we find a corrupt repo we show it in the UI / increment a metric. So setting a boolean won't really help here since we will then reclone the repo via the janitor job and likely not spot the corrupt repo. Seems like we need some sort of event/audit table which has an entry each time a repo is corrupted.

The minimal change for that would be the janitor job (which runs regularly) can report finding a corrupt repo to a DB somewhere which can then be surfaced in the UI. Changing how and when we do corruption checks seems orthogonal to the ask.

Yeah that makes sense - I added it here since that was where repo corruption was indicated previously, but you're right in the fact that it gets cleanup pretty quickly so you won't see it (which I was wondering about 😆 ). I'll create a new event table and add the details there 👍🏼

@mrnugget

Copy link
Copy Markdown
Contributor

The minimal change for that would be the janitor job (which runs regularly)

I agree with that being the place where we set corrupted_at (that was my original idea the first time we talked about it, I think, because I vaguely remember that a janitor job already does some corruption checking), but I'm not sure we need a new events table.

What's wrong with setting the corrupted_at to a value and then setting it to null once it's repaired? As far as I understand the users only want to know whether a repo is corrupted or not and if it is, they want to hit that reclone button. That should "fix" the repo and unset the corruption state, right?

@burmudar

Copy link
Copy Markdown
Contributor Author

The minimal change for that would be the janitor job (which runs regularly)

I agree with that being the place where we set corrupted_at (that was my original idea the first time we talked about it, I think, because I vaguely remember that a janitor job already does some corruption checking), but I'm not sure we need a new events table.

What's wrong with setting the corrupted_at to a value and then setting it to null once it's repaired? As far as I understand the users only want to know whether a repo is corrupted or not and if it is, they want to hit that reclone button. That should "fix" the repo and unset the corruption state, right?

They also want to see the log messages captured for the corrupt repo which I think will help them notice a problematic repo that constantly gets corrupted.

I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?

@mrnugget

Copy link
Copy Markdown
Contributor

I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?

That sounds really complicated. The gitserver_repos and repo state machine itself is already complicated and adding a separate table that records state over time strikes me as something we need to be really careful about.

Wouldn't a two-column solution solve the same problem? corrupted_at and corruption_log or something?

@burmudar

Copy link
Copy Markdown
Contributor Author

I'm still going to keep the corrupted_at since when it is combined with the events table, it will be the indicator that the repo isn't currently corrupted. Otherwise the events table needs to store when the repo has been "fixed" as an event too which can probably be done by a trigger on the repo table as the Clone Status is changed/Last Fetched is updated, WDYT?

That sounds really complicated. The gitserver_repos and repo state machine itself is already complicated and adding a separate table that records state over time strikes me as something we need to be really careful about.

Wouldn't a two-column solution solve the same problem? corrupted_at and corruption_log or something?

Yeah that should be loads easier 🤔 Thanks! It also solves other stuff I was worried about which I won't mention for brevity!

@keegancsmith

keegancsmith commented Dec 19, 2022 via email

Copy link
Copy Markdown
Member

@mrnugget

Copy link
Copy Markdown
Contributor

If I am not mistaken, we will pretty quickly reclone the repo so I'm unsure the reclone button is useful since it is likely already recloning.

The RFE says:

https://github.com/sourcegraph/accounts/issues/6716 found a corrupt repository weeks after it was initially corrupt. Initially, it looked like the repository was corrupt after an upgrade.

So it doesn't look like the repository is quickly recloned and fixed.

the RFE is really just log output that iseasy to find so admins can dig in

Well, historically logs haven't solved any admin problems 😄 Also this part of the RFE is likely important: "show log events in the UI" -- we already log when a repo is corrupted, which is how they found out about it, but what they want is to be able to see it in the UI.

But we can make sure that what @burmudar is building here is actually solving a problem: we can corrupt a repo and see whether it shows up as corrupted or whether it will be quickly fixed. Then we can record a video and ask Mike whether that solves the problem and then he can show it to customers.

@burmudar burmudar changed the title repos: set corruptedAt time when gitserver detects a repo is corrupted repos: log corruption events per repo Dec 20, 2022
@burmudar

Copy link
Copy Markdown
Contributor Author

@mrnugget @keegancsmith PTAL: I've reworked this to have a corruption_log column which kind of works like the events table (TY @jhchabran for the idea) - check the PR description for details.

@unknwon unknwon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by review

Comment thread migrations/frontend/squashed.sql Outdated
Comment thread internal/database/gitserver_repos.go Outdated
Comment thread cmd/gitserver/server/cleanup.go Outdated
Comment thread internal/database/gitserver_repos.go Outdated
Comment thread internal/database/gitserver_repos_test.go Outdated

burmudar commented Dec 21, 2022

Copy link
Copy Markdown
Contributor Author

@burmudar burmudar marked this pull request as draft December 21, 2022 16:52
@burmudar burmudar marked this pull request as ready for review December 21, 2022 16:52
@burmudar burmudar requested review from mrnugget and unknwon December 21, 2022 17:19
@burmudar burmudar force-pushed the wb/repo/corruption branch from f96852b to 885fc19 Compare January 9, 2023 13:12
- fix SQL issue on LastError
@sourcegraph-bot

Copy link
Copy Markdown
Contributor

Codenotify: Notifying subscribers in CODENOTIFY files for diff b1c32e9...198a4fb.

Notify File(s)
@eseliger internal/database/repos_test.go
@indradhanush cmd/gitserver/server/cleanup.go
cmd/gitserver/server/server.go
cmd/gitserver/server/server_test.go
@ryanslade cmd/gitserver/server/cleanup.go
cmd/gitserver/server/server.go
cmd/gitserver/server/server_test.go
@sashaostrikov cmd/gitserver/server/cleanup.go
cmd/gitserver/server/server.go
cmd/gitserver/server/server_test.go
@unknwon internal/database/repos_test.go

burmudar commented Jan 9, 2023

Copy link
Copy Markdown
Contributor Author

⏳ This pull request is set to merge as part of a Graphite merge job
Stack job ID: hSUUYP6wzZj7kdyI1hms.
See details on graphite.dev

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants