Implement HALT lock & write forwarding#277

Merged
benbjohnson merged 9 commits into main from forwarding
Feb 28, 2023
Conversation

@benbjohnson
Collaborator

@benbjohnson benbjohnson commented Feb 3, 2023

Overview

This pull request implements write forwarding for WAL mode only. With write forwarding, any node can write to the database by borrowing the write lock from the primary, catching up, performing the transaction, and then shipping the changeset LTX file back to the primary. The transaction is then propagated out to the other replicas just like any other transaction.

Write forwarding is useful for infrequent writes and should not be used for regular, write-heavy usage. The main use case is being able to run migration scripts from any node. This matters because deployments can roll over nodes in an unexpected order, so we want migrations to run on the first node deployed regardless of whether it is the primary.

Fixes #56

Performance

Please note that the SQLite write lock is held for one round trip plus the time it takes to execute the transaction on the replica. This means that performing a forwarded write from Chennai while the primary is in Denver could hold the write lock for ~250ms, which would limit your write throughput to only 4 writes per second.

If you use forwarding for regular writes, such as from a background jobs instance, ensure the node(s) are in the same region as the primary. For example, you can restrict candidacy to your Denver nodes by setting it in litefs.yml:

lease:
  candidate: ${FLY_REGION == "den"}

TODO

  • Refactor duplicate "apply LTX" code
  • Add test for gracefully failing when using the rollback journal.
  • Add failure test for when serializability is broken
  • Add tests for heavy read/write load
  • Add tests for failure scenarios

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from d31bae9 to 0b63fa0 on February 3, 2023 01:14
@kentcdodds

I was planning on recording some more videos today, but then I saw this and now I'm thinking maybe I should wait because this impacts the migration script I would be teaching. I'm ok trying an experimental version of this if need be. How long before that is available do you suppose? (No pressure of course 😅, I can wait).

@benbjohnson
Collaborator Author

How long before that is available do you suppose?

Assuming I don't hit any major roadblocks, I expect to merge this early next week. It needs some cleanup and a bunch of testing, but it's not doing anything too crazy. Also, I can always merge fixes in with later PRs.

@kentcdodds

Sounds good! I'll wait. Thanks a lot!

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from c515539 to a675d3d on February 5, 2023 19:02
@legionxiong

Testing the "forwarding" branch, I encountered a "remote tx confirmation failed" error. The database was updated on node-2, which is a replica node.
Logs of node-2 (replica):

commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)
commit wal error: remote commit: remote tx confirmation failed (1684108385)

Logs of node-3 (primary):

http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c
http: POST /tx: error: database checksum a620b5bf1493b2c2 does not match LTX post-apply checksum b33b37868377a23c

@benbjohnson
Collaborator Author

@legionxiong Thanks for giving the branch a try. It's still really rough around the edges. I'll move the PR out of "Draft" state once I do some decent testing.

@legionxiong

@legionxiong Thanks for giving the branch a try. It's still really rough around the edges. I'll move the PR out of "Draft" state once I do some decent testing.

Well, that would be encouraging. I will try it again and give some feedback once you remove the "Draft" state.

@benbjohnson
Collaborator Author

@kentcdodds @legionxiong I fixed up some issues on write forwarding, but I'm still having a lingering issue with read queries picking up an invalid version of the WAL index header and getting into a retry loop. I was hoping to have this merged by now, but it looks like there's still some debugging to do. Just wanted to give y'all a heads up.

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from 1e66023 to be0433d on February 10, 2023 17:23
@benbjohnson benbjohnson marked this pull request as ready for review February 10, 2023 18:46
@benbjohnson
Collaborator Author

@kentcdodds @legionxiong There's still an outstanding issue on write forwarding in that it performs a checkpoint on the writing replica after the WAL commits but the replica client doesn't have the CKPT lock so there could be problems under higher concurrency. I went ahead and marked this as "ready for review" in case anyone wants to play around with it. I've mostly done limited local testing but I'm going to deploy and try it on some geographically distant nodes and see what breaks.

This is definitely still "alpha" quality so don't deploy it into production please. I'm looking at you, Kent. :)

I also need to add timeouts but I'll do that in a separate PR.

@kentcdodds

Awesome! So this has been pushed to dockerhub? Which version should I use?

@benbjohnson
Collaborator Author

So this has been pushed to dockerhub? Which version should I use?

@kentcdodds Yes, it's on DockerHub. If you want to use the latest version built from this PR then you can use the PR tagged version:

COPY --from=flyio/litefs:pr-277 /usr/local/bin/litefs /usr/local/bin/litefs

@legionxiong

legionxiong commented Feb 12, 2023

@benbjohnson It works well for my use case, so AWESOME! Great Job Ben!

@benbjohnson
Collaborator Author

@legionxiong Thanks for testing it out! I had some issues when I was doing load testing on Friday so there's definitely still some kinks in there. I'm going to be doing some more testing & fixes today though.

@benbjohnson
Collaborator Author

@legionxiong @kentcdodds Just a heads up that I'm having a hard time getting this stable when trying to handle everything behind the scenes. I'm going to rework this to separate the lock acquisition from the actual write forwarding. Basically, you'll need to wrap your write transaction code in a LiteFS-specific lock.

Right now, there are a couple different options for how to implement this:

Command-line migrations

If someone is running a one-off command in a temporary VM to perform a migration then they could acquire the lock for the life of the mount:

# For JS/Prisma apps
$ litefs mount -with-lock -- prisma migrate deploy

# For Rails apps
$ litefs mount -with-lock -- bin/rails db:migrate

If you have a migration script then it could acquire the lock for a single command:

$ litefs run -with-lock -- prisma migrate deploy

or it could acquire the lock for the life of the script:

#!/bin/bash

litefs lock
prisma migrate deploy
... do other stuff ...
litefs unlock

The litefs unlock would be optional as the lock could be released when the script exits.

Library migrations

We can also support acquiring the lock within a library. @kentcdodds has a litefs-js library that could hook into this lock mechanism so it could look like:

import { withWriteLock } from 'litefs-js'

withWriteLock(() => {
  // do transaction stuff here
})

Please excuse my terrible JavaScript. :)

I'm planning on implementing this as another lock byte on the database file so it should be easy for libraries to hook into the functionality. It should just be an fcntl() call with F_SETLKW on a given byte, and that'll acquire the remote lock behind the scenes. It can be released explicitly with F_UNLCK on the same byte or by exiting the process.

@kentcdodds

That makes sense to me. 👍

@legionxiong

@benbjohnson It's fine for me, I have nothing to migrate. Or does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

@benbjohnson
Collaborator Author

does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

Yes, SQLite is a single-writer database so only one node can write at a time. If it's a replica, it'll essentially borrow the lock from the primary, perform a write, and then push that change set back to the primary. It's a lot slower than simply writing on the primary directly but it makes it simpler for situations like running migrations or low-write scenarios.

@legionxiong

does that mean the app/tool on replica nodes needs to acquire the lock itself before writing data to the database?

Yes, SQLite is a single-writer database so only one node can write at a time. If it's a replica, it'll essentially borrow the lock from the primary, perform a write, and then push that change set back to the primary. It's a lot slower than simply writing on the primary directly but it makes it simpler for situations like running migrations or low-write scenarios.

Thanks for your explanation! We are developing a maintenance tool using sqlite3 for a distributed system. Two or more maintainers will never perform the same operation on different nodes at the same time, so there is no concurrency and no heavy writing. It is basically a single-user scenario. LiteFS with write forwarding is exactly what we were hoping to use for database replication, thanks again! What are we supposed to do for this single-user scenario? Do we need to add some wrapper to our tool?

@benbjohnson
Collaborator Author

Ok, I refactored the internals into two separate parts: the "halt" lock & the transaction forwarding.

There's an API on the HTTP server to acquire the halt lock (POST /halt) & release it (DELETE /halt). When the halt lock is acquired, it implicitly acquires write locks on the database and returns a randomly generated ID along with the current database position on the primary. The replica holds onto that ID and waits to catch up to the primary's position. While the replica holds the halt lock, it can perform as many transactions as it wants until it releases the halt lock. To release the lock, it passes the original ID back to the primary. This avoids race conditions where the lock could expire and the primary could then receive a stale release call.

On the write forwarding side, the primary checks that the caller is holding the current halt lock ID and it verifies that the pre/post checksums and TXIDs match too.

Next, I need to add the application's interface via FUSE. It'll use an F_SETLKW on a byte in the database file (maybe 72, since that's unused space in the header, not that it really matters too much). When LiteFS receives that lock request, it'll call the halt API internally and hold onto the lock ID itself. As long as the process acquiring the lock byte is using open file description (OFD) locks, multiple threads can attempt to acquire the halt lock. The other benefit of the file lock approach is that FUSE can detect when the file descriptor is closed (or when the calling process exits) and release the lock automatically.

This PR still needs a bunch of cleanup and testing. It isn't smart about recovery, it doesn't have a timeout on the halt lock, etc.

@benbjohnson
Collaborator Author

@legionxiong In your application code, you'll need to acquire a lock on a file byte before performing your write transaction on your replica. What language are you writing in?

@legionxiong

@legionxiong In your application code, you'll need to acquire a lock on a file byte before performing your write transaction on your replica. What language are you writing in?

golang :)


@benbjohnson
Collaborator Author

@legionxiong I added the FUSE API for the HALT lock. I also created a Go client library called litefs-go for it so there's an API to run a block of code with a HALT lock if it's on a replica (or it'll just run it normally if it's the primary). This feature still has some bugs to work out but I'm open to feedback on the API.

@benbjohnson benbjohnson force-pushed the forwarding branch 2 times, most recently from bb126c5 to d1b229a on February 27, 2023 23:24
@benbjohnson
Collaborator Author

This is doing pretty well in our long-running test. I'm going to go ahead and merge this, although it still needs additional testing before release. I still need to add some other functionality like pings and a shorter TTL. Otherwise, if a replica unexpectedly shuts down while holding the HALT lock, the primary will be blocked from writing for 30 seconds, which isn't ideal.

@benbjohnson benbjohnson changed the title Implement write forwarding Implement HALT lock & write forwarding Feb 28, 2023
@benbjohnson benbjohnson merged commit 404e00f into main Feb 28, 2023
@benbjohnson benbjohnson deleted the forwarding branch February 28, 2023 03:05
@benbjohnson
Collaborator Author

Also, I ended up moving non-SQLite locks (which is just the HALT lock right now) to a separate -lock file that exists for each database. The file doesn't show up when listing the directory, but it can be opened. I figured it'd otherwise just be extra clutter.

The reference implementation is in the litefs-go library. It's pretty simple: it's just an F_OFD_SETLKW/F_WRLCK on byte 72 of the lock file to halt, and an F_OFD_SETLKW/F_UNLCK on the same byte to unhalt. The benefit of using a file descriptor is that it should automatically unhalt when the application closes.

@benbjohnson benbjohnson added this to the v0.4.0 milestone Mar 3, 2023
@legionxiong

So far so good! 👍
