[BUG] Database deadlocks #220
5 participants
Reference
forgejo/forgejo#220
This one is not exactly new to Gitea (it was present at least during the 1.17 cycle), but I'm now in the situation that I cannot drop a big testing repo (lots of activity, and thus many database relations).
A POST to https://codeberg.org/fnetX/wikitest/settings with the delete-repo action returns an error 500.
Log excerpt:
Is this a transient failure? Or can it be reproduced?
I tried 10 times. Retrying now:
But my browser received "504 Gateway Time-out: The server didn't respond in time." (from the reverse proxy, I suppose).
So yes, seems to be fully reproducible.
What also happens in the log (not sure for which repos):
Both issues combined:
The repo is gone now, so the last error 500 / timeout was apparently a success after all.
Also see Codeberg/Community#632 (was "resolved" by the locks that are now the problem).
Looks like it's going to be one of those race conditions that take a very long time to figure out. Could you save as much of the current logs as you can somewhere, for forensic analysis? Once there is more evidence of the same problem it will help cross-reference and figure out the root cause. There are so many possible race conditions in this codepath that could lead to a deadlock that I'm not sure where to begin with the information we have right now. But it will resurface, I'm sure.
I'm not sure this is a race condition. It rather seems to me that the transaction is trying to lock too many tables and that MySQL is simply saying "too busy, we cannot lock some of these tables, let's return a deadlock". So we rather want to know which queries are being run roughly at the same time?
Can MySQL return deadlock when there is no deadlock?
I tend to believe that when a DB engine said it was a deadlock, it really was.
The underlying problem is probably related to the transaction isolation level chosen (or left at the default)1,2.
I don't know where the isolation level chosen by Gitea/Forgejo is defined, either for their DB model as a whole or for particular transactions, but it has a huge impact on concurrency performance in heavy-load environments.
Here is one possible deadlock scenario.
In that particular case this code is run by one goroutine, which locks one of the many tables (call it T1). While holding the lock on T1 it also tries to acquire a lock on another table, T2. But another goroutine holds the lock on T2, so it must wait. Unfortunately (and here is the race), that other goroutine is waiting for the lock on T1 to be released before it releases its own lock on T2.
There is no architectural design in the codebase to prevent that kind of race, therefore it is bound to happen. And the odds of it happening increase when the system is under heavy load.
I'm by no means an expert at (scaling / high-load) databases, but I would assume there's some strategy to avoid these deadlocks from happening? Maybe changing the transaction isolation level that @fsologureng mentioned?
Deadlocks are the result of bad code design. In order to figure out the root cause of a given deadlock, all you need is a stack trace of both goroutines and a dose of patience to analyze the associated codepath.
At present there is only a location in the code (web/repo/setting.go:757) and no stack trace. That's why I'm saying that collecting more evidence from future similar occurrences of a deadlock is necessary.
Hmm, why specifically a stack trace? Wouldn't it be more interesting to have the SQL queries that are being run roughly around the same time? Either way, I can brew up custom patches for Codeberg to collect such information.
The SQL queries alone won't tell you much (unless they are super specific) about the two goroutines that ended up deadlocking.
Yes, but it's not totally code dependent; the consistency model chosen (part of the design) can be supported by the DB engine too. Not all consistency models need a serialization of all changes; that extreme case scales very badly. Instead, it is possible to relax some conditions. For example, if the code goes to delete a repo from its dedicated table (and all the related ones) while at the same time another thread is counting the repos, it is possible that the count returns while the delete is occurring (or even after), reporting a former state of the DB (because it started with shared access to the table). But the deleting thread could not report a bad state of the DB after (or inside) its transaction. This prevents inconsistency in the scope of each thread, but not necessarily at the global scope. Those details about consistency have a huge impact on performance because they define when a lock (begin transaction) can be acquired.
Looking at the documentation of the ORM used, I suppose the transaction isolation level is not defined per transaction, mainly because the syntax for doing so is very engine dependent. Furthermore, I can't find a definition at the engine level in the code either, so apparently the defaults of each installation are being used. I suppose autocommit is being used too.
Obtaining the whole SQL transaction allows analyzing the deadlock, because it occurs purely at the DB level.
I have experience with this kind of debugging in PostgreSQL. There are ways to log errors together with the SQL command involved. In MariaDB I suppose it's very possible too, without code modification.
I think we're misunderstanding each other here; AFAIK there isn't another SQL query that's deadlocking at the same time.
There's an option in app.ini to enable SQL logging.
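If I remember the configuration correctly (worth double-checking against the config cheat sheet), that is the `LOG_SQL` switch in the `[database]` section:

```ini
; Sketch of the relevant app.ini fragment; verify key names against the
; Gitea/Forgejo configuration cheat sheet before deploying.
[database]
LOG_SQL = true
```

With it enabled, every statement the ORM issues is written to the log, which should make it possible to see which queries ran around the time of a deadlock.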
I don't get how a single SQL statement can deadlock, could you give me an example?
I didn't mean that; I meant that there is no other SQL statement deadlocking at the same time. There's only one goroutine executing a SQL query that receives a deadlock error from MySQL. Your comment implied to me that two goroutines receive the deadlock error at the same time.
Thanks for explaining, I understand what you meant now.
For Codeberg I imagine that is a nightmare. Are you planning to replicate the case on the Codeberg test instance?
PS: sorry, I pinned notifications for this issue and still didn't receive any notifications 🤦
Hi all, this issue is still super serious for Codeberg. Basically, you cannot remove two repositories or issues simultaneously, and the first users have even started to complain that this violates privacy guidelines. (Users can still ask us to delete things for them, but it's a terrible situation nevertheless.)
Further, we just now got a report about even posting issue comments resulting in deadlocks here: Codeberg/Community#1092
From a scaling point of view, this might be the most serious bug Codeberg is currently facing (database deadlocks).
I continued in the Matrix channel at https://matrix.to/#/%23forgejo-development%3Amatrix.org/%24FqVOejU_MoGYi_mP3e1DjAKlO5Dxg2cqRsYwLQW3gt0?via=matrix.tu-berlin.de&via=matrix.org&via=aria-net.org&via=exozy.me because I wasn't able to post further comments to this thread.
Original comment:
The content I am trying to add:
Funnily enough, while trying to add the "scaling" label, I caused a db deadlock.
Server log from the past 30 minutes:
Label deadlocks also have some history (see ...), although they were "fixed" multiple times already. It sounds like there is a fundamental flaw somewhere?
Changed title from "[BUG] Database deadlock (error 500) when removing repo" to "[BUG] Database deadlocks".
@fnetX could you add more logs that you collected in the past few hours? They would provide more clues for finding a workaround.
A proper fix is unlikely, as explained above, because the codebase lacks the proper mechanisms to avoid the race conditions that lead to deadlocks under heavy load.
The most likely candidates for increasing the odds of a race are complex changes that involve modifying a number of tables such as deleting repositories, issues or pull requests.
A possible fix would be to have locks at the repository level, in a middleware for web / api routes, blocking all operations while expensive operations such as deletion of issues or pull requests are in progress because they are the most likely candidates for creating the conditions for a deadlock.
A new one appeared:
for
I took a look; that's probably because it uses some commit status database lookups, which are prone to deadlocks, as seen in the previously shared deadlock logs.
Some activity around this topic, https://github.com/go-gitea/gitea/pull/26055
When was the last "Deadlock found when trying to get lock" in the logs?
About one minute ago.
Here are some that seem to occur often (picked at "random")
very very often and apparently batched.
I also have
and
The last one was
Not exactly resolved, but mostly worked around.