raft node notifies configure when confChanged by chaochn47 · Pull Request #15708 · etcd-io/etcd

chaochn47 · 2023-04-13T07:10:07Z

The PR is built on top of #15703

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

ahrtr · 2023-04-13T07:18:15Z

Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advance()

chaochn47 · 2023-04-13T07:37:07Z

> Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advance()

Not easily. Injecting a sleep should do the trick.

fuweid · 2023-04-13T07:53:54Z

It's hard to reproduce with only a few requests. The requests need to be sent one after another without a pause.
FYI: You can use the following test code to reproduce it.

package verify

import (
        "context"
        "strings"
        "testing"
        "time"

        integration2 "go.etcd.io/etcd/tests/v3/framework/integration"
        "go.etcd.io/etcd/tests/v3/integration"
)

var lazyCluster = integration.NewLazyClusterWithConfig(
        integration2.ClusterConfig{
                Size:                        3,
                WatchProgressNotifyInterval: 200 * time.Millisecond,
                DisableStrictReconfigCheck:  true})

func TestVerify(t *testing.T) {
        defer lazyCluster.Terminate()

        clus := lazyCluster.Cluster()
        leaderIdx := clus.WaitLeader(t)
        members := clus.Members

        followerIdx := (leaderIdx + 1) % 3
        cli := clus.Client(leaderIdx)

        // promoting any of the voting members in cluster should fail
        expectedErrKeywords := "can only promote a learner member"
        var err error
        for i := 0; i < 10000; i++ {
                ctx, cancel := context.WithTimeout(context.TODO(), 1*time.Minute)
                _, err = cli.MemberPromote(ctx, uint64(members[followerIdx].ID()))
                cancel()
                if err == nil {
                        t.Fatalf("expect promoting voting member to fail, got no error")
                }

                if strings.Contains(err.Error(), expectedErrKeywords) {
                        continue
                }

                t.Fatalf("unexpected error: %v", err)
        }
}

serathius · 2023-04-13T08:23:15Z

This PR not only changes MemberPromote but also introduces a big change for all membership changes. We should double-check that all configuration changes are properly tested. If not, we should add a large set of e2e tests that:

* Grow cluster from 1 to 3 members (without learners)
* Grow cluster from 1 to 3 members (with learners), both one at a time and two at a time
* Reduce cluster from 3 to 1 member
* etc.

I would not want to rush this change, so I would recommend fixing #15528 independently.
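
The scenarios listed above could be written down as a table-driven plan. The following is only a sketch against a toy in-memory model: the `cluster`, `op`, and `scenario` types and the `run` helper are hypothetical names invented for illustration and are not part of etcd or its test framework. A real e2e test would replace `apply` with calls to the cluster API.

```go
package main

import "fmt"

// cluster is a toy membership model (hypothetical), not etcd's real type.
type cluster struct{ voters, learners int }

type op string

const (
	addVoter   op = "add-voter"
	addLearner op = "add-learner"
	promote    op = "promote-learner"
	remove     op = "remove-voter"
)

// apply performs one membership change on the toy model.
func (c *cluster) apply(o op) error {
	switch o {
	case addVoter:
		c.voters++
	case addLearner:
		c.learners++
	case promote:
		if c.learners == 0 {
			return fmt.Errorf("no learner to promote")
		}
		c.learners--
		c.voters++
	case remove:
		if c.voters <= 1 {
			return fmt.Errorf("cannot remove the last voter")
		}
		c.voters--
	default:
		return fmt.Errorf("unknown op %q", o)
	}
	return nil
}

type scenario struct {
	name       string
	start      int  // initial number of voting members
	ops        []op // membership changes applied back to back
	wantVoters int
}

var scenarios = []scenario{
	{"grow 1 to 3 without learners", 1, []op{addVoter, addVoter}, 3},
	{"grow 1 to 3 with learners, one at a time", 1, []op{addLearner, promote, addLearner, promote}, 3},
	{"grow 1 to 3 with learners, two at a time", 1, []op{addLearner, addLearner, promote, promote}, 3},
	{"reduce 3 to 1", 3, []op{remove, remove}, 1},
}

// run applies every op of a scenario and returns the final voter count.
func run(s scenario) (int, error) {
	c := &cluster{voters: s.start}
	for _, o := range s.ops {
		if err := c.apply(o); err != nil {
			return 0, fmt.Errorf("%s: %w", s.name, err)
		}
	}
	return c.voters, nil
}

func main() {
	for _, s := range scenarios {
		got, err := run(s)
		fmt.Printf("%-45s voters=%d want=%d err=%v\n", s.name, got, s.wantVoters, err)
	}
}
```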

chaochn47 · 2023-05-05T00:01:01Z

> Can #15528 be easily reproduced? If not, then we need to add an integration test; we can intentionally inject a sleep failpoint before r.Advance()

Hi @ahrtr, could you please clarify how an integration test would inject a sleep failpoint? I know it is achievable with an etcd binary built with gofail enabled.

etcd/tests/framework/e2e/etcd_process.go

Lines 298 to 318 in b754994

    
func (f *BinaryFailpoints) Setup(ctx context.Context, failpoint, payload string) error {
        host := fmt.Sprintf("127.0.0.1:%d", f.member.Config().GoFailPort)
        failpointUrl := url.URL{
                Scheme: "http",
                Host:   host,
                Path:   failpoint,
        }
        r, err := http.NewRequestWithContext(ctx, "PUT", failpointUrl.String(), bytes.NewBuffer([]byte(payload)))
        if err != nil {
                return err
        }
        resp, err := httpClient.Do(r)
        if err != nil {
                return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusNoContent {
                return fmt.Errorf("bad status code: %d", resp.StatusCode)
        }
        return nil
}

Thanks to @serathius as part of the robustness test work.

ahrtr · 2023-05-05T00:15:03Z

> Hi @ahrtr could you please help clarify how would integration test inject a sleep failpoint? I know with etcd binary built with gofail enabled it is achievable.

Please refer to the following:

chaochn47 · 2023-05-09T18:01:16Z

> This PR not only changes MemberPromote, but introduces a big change for all membership changes. We should double check that all configuration changes are properly tested. If not we should add a large set of e2e tests that:
>
> * Grow cluster from 1 to 3 members (without learners)
> * Grow cluster from 1 to 3 members (with learners) in both 1 at the time and two at one time scenario
> * Reduce cluster from 3 to 1 member
>   etc
>
> I would not want to rush this change, so I would recommend #15528 independently.

Hi @serathius, I audited all the membership reconfiguration tests.

Tests number 1 and 2 do not expose the issue mentioned in #15528 because they apply a configuration change only once. The common tests cover all the e2e test cases, so there are some duplicates.

The test case TestMemberPromoteMemberNotLearner in test number 3 covers the scenario where back-to-back member promotions time out, but it does not cover MemberAdd, MemberUpdate, or MemberRemove.

The example test cases in number 4 share a lazy cluster that can see back-to-back membership reconfigurations, which is why #12983 and #14040 are flaky.

So my proposal is to enhance tests/integration/clientv3/cluster_test.go with a failpoint that injects a 100ms sleep at

etcd/server/etcdserver/raft.go

Line 310 in d81d3c3

r.Advance()

and exercise MemberAdd, MemberUpdate, MemberRemove, and MemberPromote repeatedly. What do you think? @ahrtr @serathius

chaochn47 · 2023-06-27T06:29:26Z

Ping @serathius @ahrtr @fuweid @tjungblu @jmhbnz for review ~

jmhbnz

Commenting in response to the review ping. I've read through this PR a few times and at the code level it looks good; however, I'm abstaining from approving because I don't understand the overall system implications well enough.

tjungblu

/lgtm (non-binding)

Thanks for finally fixing this!

fuweid

LGTM

Thanks!

chaochn47 force-pushed the confchange_raft_node_notifies_apply branch from 20a9dba to eba61de Compare April 13, 2023 07:12

ahrtr mentioned this pull request Apr 13, 2023

Flaky Test TestMemberPromoteMemberNotLearner #15528

Closed

serathius mentioned this pull request Apr 13, 2023

etcdserver: wait for raft being notified on confChange before responding to client #15703

Closed

serathius reviewed Apr 13, 2023

Comment thread server/etcdserver/server.go Outdated

serathius reviewed Apr 13, 2023

Comment thread server/etcdserver/server.go Outdated

serathius mentioned this pull request Apr 13, 2023

Unexpected 'unhealthy cluster' when remove a downed member #15710

Closed

This was referenced Apr 14, 2023

tests/integration/clientv3: fix flaky TestMemberPromoteMemberNotLearner #15696

Closed

ARM E2E test consistently failed #15647

Closed

This was referenced May 9, 2023

add failpoint raftBeforeAdvance to reproduce TestMemberPromoteMemberNotLearner reliably #15865

Closed

Run tests/failpoint test suite by default #15879

Closed

chaochn47 mentioned this pull request Jun 14, 2023

Flake TestMaxLearnerInCluster #16078

Closed

chaochn47 force-pushed the confchange_raft_node_notifies_apply branch from eba61de to cab837c Compare June 19, 2023 18:54

This was referenced Jun 19, 2023

Enable failpoint in integration test #16099

Merged

add runtime reconfiguration tests #16127

Merged

etcdserver: wait for raft is notified on confChange before responding to client

ad3b6ee

Signed-off-by: Benjamin Wang <wachao@vmware.com>

chaochn47 force-pushed the confchange_raft_node_notifies_apply branch 2 times, most recently from 1f4c4b8 to 2d4c8ec Compare June 26, 2023 23:11

server/etcdserver/raft.go:

6cdc9ae

1. rename confChangeCh to raftAdvancedC
2. rename waitApply to confChanged
3. add comments and test assertion

Signed-off-by: Chao Chen <chaochn@amazon.com>

chaochn47 force-pushed the confchange_raft_node_notifies_apply branch from 2d4c8ec to 6cdc9ae Compare June 27, 2023 05:56

chaochn47 changed the title ~~raft node notifies apply when confChanged~~ raft node notifies configure when confChanged Jun 27, 2023

jmhbnz reviewed Jun 27, 2023

serathius approved these changes Jun 27, 2023

tjungblu approved these changes Jun 27, 2023

ahrtr reviewed Jun 27, 2023

Comment thread server/etcdserver/server.go

fuweid approved these changes Jun 28, 2023

ahrtr approved these changes Jun 28, 2023

ahrtr merged commit 22f9dac into etcd-io:main Jun 28, 2023

chaochn47 deleted the confchange_raft_node_notifies_apply branch June 28, 2023 16:03

This was referenced Oct 22, 2023

Flaked integration/clientv3/examples ExampleCluster_memberAddAsLearner #14040

Closed

[3.4] Backport clientv3 naming implementation #16800

Merged

serathius mentioned this pull request Feb 28, 2026

Don't reuse same ReadIndex in retries #21399

Merged

ahrtr merged 2 commits into etcd-io:main from chaochn47:confchange_raft_node_notifies_apply

6 participants
