raft node notifies configure when confChanged#15708
Conversation
20a9dba to
eba61de
Compare
|
Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advanced() |
Not easily. Inject a sleep should do the trick. |
|
It's hard to reproduce in few requests. It needs to send request one by one without pause. package verify
import (
"context"
"strings"
"testing"
"time"
integration2 "go.etcd.io/etcd/tests/v3/framework/integration"
"go.etcd.io/etcd/tests/v3/integration"
)
var lazyCluster = integration.NewLazyClusterWithConfig(
integration2.ClusterConfig{
Size: 3,
WatchProgressNotifyInterval: 200 * time.Millisecond,
DisableStrictReconfigCheck: true})
func TestVerify(t *testing.T) {
defer lazyCluster.Terminate()
clus := lazyCluster.Cluster()
leaderIdx := clus.WaitLeader(t)
members := clus.Members
followerIdx := (leaderIdx + 1) % 3
cli := clus.Client(leaderIdx)
// promoting any of the voting members in cluster should fail
expectedErrKeywords := "can only promote a learner member"
var err error
for i := 0; i < 10000; i++ {
ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Minute)
_, err = cli.MemberPromote(ctx, uint64(members[followerIdx].ID()))
cancel()
if err == nil {
t.Fatalf("expect promoting voting member to fail, got no error")
}
if strings.Contains(err.Error(), expectedErrKeywords) {
continue
}
t.Fatalf("unexpected error: %v", err)
}
} |
|
This PR not only changes MemberPromote, but introduces a big change for all membership changes. We should double check that all configuration changes are properly tested. If not we should add a large set of e2e tests that:
I would not want to rush this change, so I would recommend #15528 independently. |
Hi @ahrtr could you please help clarify how would integration test inject a sleep failpoint? I know with etcd binary built with gofail enabled it is achievable. etcd/tests/framework/e2e/etcd_process.go Lines 298 to 318 in b754994 |
Please refer to the followings, |
Hi @serathius I audited all the membership reconfiguration tests.
Number 1 & 2 tests does not expose issue mentioned in #15528 because it applies configuration change only once. common tests cover all the e2e test cases so there are some duplicates. Test case Number 4 example test cases share a lazy cluster that could have back to back membership reconfigurations. So that's why #12983 and #14040 is flaky. So My proposal is to enhance tests/integration/clientv3/cluster_test.go with failpoint injected with sleep 100ms on etcd/server/etcdserver/raft.go Line 310 in d81d3c3 MemberAdd, MemberUpdate, MemberRemove, MemberPromote repeatedly, what do you think? @ahrtr @serathius
|
eba61de to
cab837c
Compare
… to client Signed-off-by: Benjamin Wang <wachao@vmware.com>
1f4c4b8 to
2d4c8ec
Compare
1. rename confChangeCh to raftAdvancedC 2. rename waitApply to confChanged 3. add comments and test assertion Signed-off-by: Chao Chen <chaochn@amazon.com>
2d4c8ec to
6cdc9ae
Compare
|
Ping @serathius @ahrtr @fuweid @tjungblu @jmhbnz for review ~ |
jmhbnz
left a comment
There was a problem hiding this comment.
A comment to respond to review ping. I've read through this pr a few times and at a code level it looks good, however I'm abstaining from hitting approve as I don't understand overall system implications well enough.
tjungblu
left a comment
There was a problem hiding this comment.
/lgtm (non-binding)
Thanks for finally fixing this!
Fix #15528
The PR is built on top of #15703
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
PTAL @ahrtr @serathius, thanks!