Implement HALT lock & write forwarding#277

Merged
benbjohnson merged 9 commits into main from forwarding
Feb 28, 2023
Conversation

@benbjohnson
Collaborator

@benbjohnson benbjohnson commented Feb 3, 2023

Overview

This pull request implements write forwarding for WAL mode only. With write forwarding, any node can write to the database by borrowing the write lock from the primary, catching up, performing the transaction, and then shipping the changeset LTX file back to the primary. The transaction is then propagated out to the other replicas just like any other transaction.

Write forwarding is useful for infrequent writes and should not be used for regular, write-heavy usage. The main use case is being able to run migration scripts from any node. This matters because deployments can roll over nodes in an unexpected order, so we want migrations to run on the first node deployed regardless of whether it is the primary.

Fixes #56

Performance

Please note that the SQLite write lock is held for one round trip plus the time it takes to execute the transaction on the replica. This means that performing a forwarded write from Chennai while the primary is in Denver could hold the write lock for ~250ms, which would limit your write throughput to only 4 writes per second.

If you use forwarding for regular writes, such as from a background jobs instance, ensure the node(s) are in the same region as the primary. For example, you can restrict candidacy to your Denver nodes by setting it in litefs.yml:

lease:
  candidate: ${FLY_REGION == "den"}

TODO

  • Refactor duplicate "apply LTX" code
  • Add test for gracefully failing when using the rollback journal.
  • Add failure test for when serializability is broken
  • Add tests for heavy read/write load
  • Add tests for failure scenarios

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from d31bae9 to 0b63fa0 on February 3, 2023 01:14
@kentcdodds

I was planning on recording some more videos today, but then I saw this and now I'm thinking maybe I should wait because this impacts the migration script I would be teaching. I'm ok trying an experimental version of this if need be. How long before that is available do you suppose? (No pressure of course 😅, I can wait).

@benbjohnson
Collaborator Author

How long before that is available do you suppose?

Assuming I don't hit any major roadblocks, I expect to merge this early next week. It needs some cleanup and a bunch of testing, but it's not doing anything too crazy. Also, I can always merge fixes in with later PRs.

@kentcdodds

Sounds good! I'll wait. Thanks a lot!

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from c515539 to a675d3d on February 5, 2023 19:02
@legionxiong

Testing the "forwarding" branch, I encountered a "remote tx confirmation failed" error. The database was updated on node-2, which is a replica node.
Logs of node-2 (replica):

commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)

Logs of node-3 (primary):

http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c

@benbjohnson
Collaborator Author

@legionxiong Thanks for giving the branch a try. It's still really rough around the edges. I'll move the PR out of "Draft" state once I do some decent testing.

@legionxiong

@legionxiong Thanks for giving the branch a try. It's still really rough around the edges. I'll move the PR out of "Draft" state once I do some decent testing.

Well, that would be encouraging. I will try it again and give some feedback once you remove the "Draft" state.

@benbjohnson
Collaborator Author

@kentcdodds @legionxiong I fixed up some issues on write forwarding, but I'm still having a lingering issue with read queries picking up an invalid version of the WAL index header and getting into a retry loop. I was hoping to have this merged by now, but it looks like there's still some debugging to do. Just wanted to give y'all a heads up.

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from 1e66023 to be0433d on February 10, 2023 17:23
@benbjohnson benbjohnson marked this pull request as ready for review February 10, 2023 18:46
@benbjohnson
Collaborator Author

@kentcdodds @legionxiong There's still an outstanding issue on write forwarding in that it performs a checkpoint on the writing replica after the WAL commits but the replica client doesn't have the CKPT lock so there could be problems under higher concurrency. I went ahead and marked this as "ready for review" in case anyone wants to play around with it. I've mostly done limited local testing but I'm going to deploy and try it on some geographically distant nodes and see what breaks.

This is definitely still "alpha" quality so don't deploy it into production please. I'm looking at you, Kent. :)

I also need to add timeouts but I'll do that in a separate PR.

@kentcdodds

Awesome! So this has been pushed to dockerhub? Which version should I use?

@benbjohnson
Collaborator Author

So this has been pushed to dockerhub? Which version should I use?

@kentcdodds Yes, it's on DockerHub. If you want to use the latest version built from this PR then you can use the PR tagged version:

COPY --from=flyio/litefs:pr-277 /usr/local/bin/litefs /usr/local/bin/litefs

@legionxiong

legionxiong commented Feb 12, 2023

@benbjohnson It works well for my use case, so AWESOME! Great Job Ben!

@benbjohnson
Collaborator Author

@legionxiong Thanks for testing it out! I had some issues when I was doing load testing on Friday so there's definitely still some kinks in there. I'm going to be doing some more testing & fixes today though.

@benbjohnson
Collaborator Author

@legionxiong @kentcdodds Just a heads up that I'm having a hard time getting this stable when trying to handle everything behind the scenes. I'm going to rework this to separate the lock acquisition from the actual write forwarding. Basically, you'll need to wrap your write transaction code in a LiteFS-specific lock.

Right now, there are a couple different options for how to implement this:

Command-line migrations

If someone is running a one-off command in a temporary VM to perform a migration then they could acquire the lock for the life of the mount:

# For JS/Prisma apps
$ litefs mount -with-lock -- prisma migrate deploy

# For Rails apps
$ litefs mount -with-lock -- bin/rails db:migrate

If you have a migration script then it could acquire the lock for a single command:

$ litefs run -with-lock -- prisma migrate deploy

or it could acquire the lock for the life of the script:

#!/bin/bash

litefs lock
prisma migrate deploy
... do other stuff ...
litefs unlock

The litefs unlock would be optional as the lock could be released when the script exits.

Library migrations

We can also support acquiring the lock within a library. @kentcdodds has a litefs-js library that could hook into this lock mechanism so it could look like:

import { withWriteLock } from 'litefs-js'

withWriteLock(() => {
  // do transaction stuff here
})

Please excuse my terrible JavaScript. :)

I'm planning on implementing this as another lock byte on the database file so it should be easy for libraries to hook into the functionality. It should just be an fcntl() call with F_SETLKW on a given byte, and that'll acquire the remote lock behind the scenes. It can be released explicitly with F_UNLCK on the same byte or by exiting the process.

@kentcdodds

That makes sense to me. 👍

@legionxiong

@benbjohnson It's fine for me, I have nothing to migrate. Or does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

@benbjohnson
Collaborator Author

does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

Yes, SQLite is a single-writer database so only one node can write at a time. If it's a replica, it'll essentially borrow the lock from the primary, perform a write, and then push that change set back to the primary. It's a lot slower than simply writing on the primary directly but it makes it simpler for situations like running migrations or low-write scenarios.

@legionxiong

does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

Yes, SQLite is a single-writer database so only one node can write at a time. If it's a replica, it'll essentially borrow the lock from the primary, perform a write, and then push that change set back to the primary. It's a lot slower than simply writing on the primary directly but it makes it simpler for situations like running migrations or low-write scenarios.

Thanks for your explanation! We are developing a maintenance tool using sqlite3 for a distributed system. Two or more maintainers will never perform the same operation on different nodes at the same time, so there is no concurrency and no heavy writing. It is basically a single-user scenario. LiteFS with write forwarding is exactly what we were hoping to use for database replication, thanks again! What are we supposed to do for this single-user scenario? Do we need to add some wrapper to our tool?

@benbjohnson
Collaborator Author

Ok, I refactored the internals into two separate parts: the "halt" lock & the transaction forwarding.

There's an API on the HTTP server to acquire the halt lock (POST /halt) & release it (DELETE /halt). When the halt lock is acquired, it implicitly acquires write locks on the database and returns a randomly generated ID along with the current database position on the primary. The replica holds onto that ID and waits to catch up to the primary's position. While the replica holds the halt lock, it can perform as many transactions as it wants until it releases the halt lock. To release the lock, it passes the original ID back to the primary. This avoids race conditions where the lock could expire and the primary could then receive a stale release call.

On the write forwarding side, the primary checks that the caller is holding the current halt lock ID and it verifies that the pre/post checksums and TXIDs match too.

Next, I need to add the application's interface via FUSE. It'll use an F_SETLKW on a byte in the database file (maybe 72, since that's unused space in the header, not that it really matters too much). When LiteFS receives that lock request, it'll call the halt API internally and hold onto the lock ID itself. As long as the process acquiring the lock byte is using open file description (OFD) locks, multiple threads can attempt to acquire the halt lock. The other benefit of the file lock approach is that FUSE can detect when the file descriptor is closed (or when the calling process exits) and release the lock automatically.

This PR still needs a bunch of cleanup and testing. It isn't smart about recovery, it doesn't have a timeout on the halt lock, etc.

@benbjohnson
Collaborator Author

@legionxiong In your application code, you'll need to acquire a lock on a file byte before performing your write transaction on your replica. What language are you writing in?

@legionxiong

@legionxiong In your application code, you'll need to acquire a lock on a file byte before performing your write transaction on your replica. What language are you writing in?

golang :)


@benbjohnson
Collaborator Author

@legionxiong I added the FUSE API for the HALT lock. I also created a Go client library called litefs-go for it so there's an API to run a block of code with a HALT lock if it's on a replica (or it'll just run it normally if it's the primary). This feature still has some bugs to work out but I'm open to feedback on the API.

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from bb126c5 to d1b229a on February 27, 2023 23:24
@benbjohnson
Collaborator Author

This is doing pretty well in our long-running test. I'm going to go ahead and merge this, although it still needs additional testing before release. I still need to add some other functionality like pings and a shorter TTL. Otherwise, if a replica unexpectedly shuts down while holding the HALT lock, the primary will be blocked from writing for 30 seconds, which isn't ideal.

@benbjohnson benbjohnson changed the title Implement write forwarding Implement HALT lock & write forwarding Feb 28, 2023
@benbjohnson benbjohnson merged commit 404e00f into main Feb 28, 2023
@benbjohnson benbjohnson deleted the forwarding branch February 28, 2023 03:05
@benbjohnson
Collaborator Author

Also, I ended up moving non-SQLite locks (which is just the HALT lock right now) to a separate -lock file that exists for each database. The file doesn't show up when listing the directory, but it can be opened. I figured it'd otherwise just be extra clutter.

The reference implementation is in the litefs-go library. It's pretty simple: it's just an F_OFD_SETLKW/F_WRLCK on byte 72 of the lock file to halt, and an F_OFD_SETLKW/F_UNLCK on the same byte to unhalt. The benefit of using a file descriptor is that it should automatically unhalt when the application closes.

@benbjohnson benbjohnson added this to the v0.4.0 milestone Mar 3, 2023
@legionxiong

So far so good! 👍
