tx: add missing lock on meta page update#989
Conversation
|
Hi @roman-khimov. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/ok-to-test |
tx.go
Outdated
| lg := tx.db.Logger() | ||
| buf := make([]byte, tx.db.pageSize) | ||
| p := tx.db.pageInBuffer(buf, 0) | ||
| tx.db.metalock.Lock() |
There was a problem hiding this comment.
It isn't correct.
We don't need acquire metalock here, it's updating tx's local meta page.
There was a problem hiding this comment.
+1, i also thought the same, but u know better 👍🏽
There was a problem hiding this comment.
OK, I see now, we're filling the buf basically. But then how about the writeAt below? It'll flush the buf onto the page 0 or 1 and that's exactly where (*DB) meta() reads data from.
There was a problem hiding this comment.
Or that's the question of how pwrite() updates the page. Not sure about that.
There was a problem hiding this comment.
iiuc bbolt uses CoW, which means the updates will be written to a new pages, and only upon commit the metadata list will be updated to point to the new pages
@ahrtr know better these details 👍🏽
There was a problem hiding this comment.
But then how about the
writeAtbelow?
Only one write TXN is allowed at a time.
Or that's the question of how
pwrite()updates the page. Not sure about that.
pls refer to https://github.com/ahrtr/etcd-issues/blob/master/docs/cncf_storage_tag_etcd.md#storage-boltdb-feature
There was a problem hiding this comment.
Only one write TXN is allowed at a time.
That I know, that's rwlock. But my concern here is concurrent RO and RW transactions and initialization specifically. To initialize an RO transaction we need to get the meta page and copy it, like in
Line 53 in 092ee98
(*DB) meta(), but an RW transaction can be updating one of these pages via writeAt. The question is whether this update is atomic or not. To me it's not, because if we're to strip down all the layers of Go, unsafe, structures and other things the situation is not much different from this C snippet (crude one, sorry, the last time I wrote C was years ago, I can't even name page_size correctly now):
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
size_t const pageSize = 4096;
char *mm;
void *reader(void *arg)
{
for (;;) {
int i;
char first, next;
for (i = 0; i < pageSize; i++) {
if (i == 0) {
first = mm[i];
} else {
next = mm[i];
if (next != first) {
printf("reader mismatch, first %hhX, read %hhX at %d", first, next, i);
exit(11);
}
}
}
}
}
int main()
{
char b[pageSize];
int fd, r, i;
ssize_t written;
pthread_t pth;
fd = open("data", O_RDWR | O_CREAT);
if (fd < 0) {
printf("bad fd: %d\n", fd);
exit(10);
}
r = ftruncate(fd, pageSize);
if (r < 0) {
printf("bad ftruncate: %d\n", r);
exit(10);
}
mm = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fd, 0);
if (mm == NULL) {
printf("mmap failed\n");
exit(10);
}
r = pthread_create(&pth, NULL, reader, NULL);
if (r < 0) {
printf("bad pthread_create: %d\n", r);
exit(10);
}
for (i = 0; i < 1000; i++) {
memset(b, i%256, pageSize);
written = pwrite(fd, b, pageSize, 0);
if (written != pageSize) {
printf("can't write properly\n");
exit(10);
}
}
printf( "Done\n" );
return 0;
}With one reader thread doing its things via mmapped region and writer doing its job via pwrite() (which is what (*File) WriteAt() does under the hood). To me it gives things like reader mismatch, first 4, read 5 at 1478 or reader mismatch, first 6, read 7 at 1115 easily.
Is this a correct analogy or am I missing something?
There was a problem hiding this comment.
Yes, it's a valid point, although it's highly unlikely in practice. We have two meta pages, usually Readonly TXN won't read the same meta page as the RW TXN at the same time; even it does, if it reads some dirty data or partially written data, then the checksum won't match, so it still fallback to the other meta page. It's exactly the reason why we never see any issue in the concurrent test
Lines 1128 to 1140 in 092ee98
There was a problem hiding this comment.
it's highly unlikely in practice
I agree with that, the amount of data is small, there is validation, yet it's still a race that can happen. Threads can be stalled at any point on a busy machine as well in which case chances can get higher. Like consider validation done successfully in meta(), but then the reader thread paused and waiting in the queue while writers change the contents before or in the middle of reader doing a Copy().
metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (RO transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen. Signed-off-by: Roman Khimov <roman@nspcc.ru>
ahrtr
left a comment
There was a problem hiding this comment.
LGTM
thx
We need to backport the fix to release-1.4 and release-1.3.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahrtr, roman-khimov The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| tx.meta.Write(p) | ||
|
|
||
| // Write the meta page to file. | ||
| tx.db.metalock.Lock() |
There was a problem hiding this comment.
There might be some performance penalty under high concurrent readonly TXNs (especially short live TXNs)
There was a problem hiding this comment.
writeAt() should be fast, it's a single page. And the next patch will make readers work much better anyway. The thing is that I want to leverage this lock in that patch now.
Let's discuss this part of #967 (not related to the original problem in fact). But it can affect ed58abd rework, a better version of it is still in progress.
metalock is supposed to protect meta page, but it looks like the only place where we're modifying it is not protected in fact. Since page update is not atomic a concurrent reader (like transaction) can get an inconsistent page. It's likely to fall back to the other one in this case, but still we better not allow this to happen.