Race when accessing `replicaset.vclock` from applier thread

Found while debugging hangs in tarantool/tarantool-qa#166

One of the nodes logs contains the following output:
```console
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> Got a corrupted row:
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/102/main:reader V> 00000000: 81 00 00 81 26 81 01 09 02 01
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:(socket) I> RAFT: message {term: 4, vote: 2, leader: 2, leader is seen: true, state: follower} from 3
2022-04-26 17:32:07.767 [36183] main/117/applier/unix/:(socket) I> RAFT: vote request is skipped - the leader is already known - 2
2022-04-26 17:32:07.767 [36183] relay/unix/:(socket)/101/main xrow.c:229 E> ER_INVALID_MSGPACK: Invalid MsgPack - packet body
```

The issue is with bad vclock encoding:
```markdown
| header: map of 1 element
|    | type: IPROTO_OK
|   / \  | body: map of 1 element
V  /   \ V
81 00 00 81 26 81 01 09 02 01
               \            /
                   vclock
```

The vclock is incorrect: `81` states that the vclock contains a single replica_id : lsn pair, but two follow: `01 09` and `02 01`.

With the following patch one of the nodes occasionally fails with an assertion in `xrow_encode_vclock` inside `applier_thread_writer_f`:
```diff
diff --git a/src/box/xrow.c b/src/box/xrow.c
index bbd697f11..2447fb8f2 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -62,6 +62,7 @@ mp_sizeof_vclock_ignore0(const struct vclock *vclock)
 static inline char *
 mp_encode_vclock_ignore0(char *data, const struct vclock *vclock)
 {
+       int map_size = vclock_size_ignore0(vclock);
        data = mp_encode_map(data, vclock_size_ignore0(vclock));
        struct vclock_iterator it;
        vclock_iterator_init(&it, vclock);
@@ -72,6 +73,7 @@ mp_encode_vclock_ignore0(char *data, const struct vclock *vclock)
        for ( ; replica.id < VCLOCK_MAX; replica = vclock_iterator_next(&it)) {
                data = mp_encode_uint(data, replica.id);
                data = mp_encode_uint(data, replica.lsn);
+               assert(--map_size >= 0);
        }
        return data;
 }
```

Seems the reason is `replicaset.vclock` access from 2 threads:  TX updates vclock on a yet another WAL write, and applier tries to encode vclock at the same moment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race when accessing `replicaset.vclock` from applier thread #7089

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Race when accessing replicaset.vclock from applier thread #7089

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Race when accessing `replicaset.vclock` from applier thread #7089