Segfault when using outlier detection LB with pick_first #32967

@s-matyukevich

What version of gRPC and what language are you using?

v1.50.0

What operating system (Linux, Windows,...) and version?

Linux

What runtime / compiler are you using (e.g. python version or version of gcc)?

Python 3

What did you do?

  1. Use the following service config:
    {
        "loadBalancingConfig": [
            {
                "outlier_detection_experimental": {
                    "interval": "1s",
                    "baseEjectionTime": "30s",
                    "maxEjectionTime": "300s",
                    "maxEjectionPercent": 100,
                    "failurePercentageEjection": {
                        "enforcementPercentage": 100,
                        "requestVolume": 1,
                        "minimumHosts": 1
                    },
                    "childPolicy": [{
                        "pick_first": {}
                    }]
                }
            }
        ]
    }
  2. Create two upstream services: one that always returns errors and one that always succeeds.
  3. Run the test several times until pick_first picks the bad upstream. We need this step because our DNS server randomizes the order of addresses in its response; I guess you could also just make sure the bad server is always returned first in the DNS response.
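
For reference, here is a minimal sketch of how the config from step 1 might be passed to a Python client. It is not the code from the original test: the helper names `build_service_config` and `make_channel`, the use of an insecure channel, and the `target` value are all assumptions made for illustration. The only gRPC pieces relied on are `grpc.insecure_channel` and the `grpc.service_config` channel argument.

```python
import json


def build_service_config() -> dict:
    """Return the service config from step 1 as a Python dict."""
    return {
        "loadBalancingConfig": [
            {
                "outlier_detection_experimental": {
                    "interval": "1s",
                    "baseEjectionTime": "30s",
                    "maxEjectionTime": "300s",
                    "maxEjectionPercent": 100,
                    "failurePercentageEjection": {
                        "enforcementPercentage": 100,
                        "requestVolume": 1,
                        "minimumHosts": 1,
                    },
                    "childPolicy": [{"pick_first": {}}],
                }
            }
        ]
    }


def make_channel(target: str):
    """Create a channel that uses the config above (hypothetical helper).

    `target` is a placeholder such as "dns:///my-service.example:50051";
    an insecure channel is assumed here only to keep the sketch short.
    """
    import grpc  # imported lazily so the config helper works without grpcio

    options = [("grpc.service_config", json.dumps(build_service_config()))]
    return grpc.insecure_channel(target, options=options)
```

With a channel created this way, repeatedly issuing RPCs against a DNS name that resolves to both the failing and the healthy backend should exercise the ejection path described in this report.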

What did you expect to see?

The outlier detection LB policy should eject the failing host (this part actually happens), and pick_first should then pick the second, healthy host and keep using it.

What did you see instead?

Segfault.

Here are the most verbose trace logs at the time of failure:

I0427 19:14:08.574305009      15 outlier_detection.cc:827]   [outlier_detection_lb 0x55d01ab087b0] ejection timer running
I0427 19:14:08.574316433      15 outlier_detection.cc:868]   [outlier_detection_lb 0x55d01ab087b0] found 0 success rate candidates and 1 failure percentage candidates; ejected_host_count=0; success_rate_sum=0.000
I0427 19:14:08.574324699      15 outlier_detection.cc:944]   [outlier_detection_lb 0x55d01ab087b0] running failure percentage algorithm
I0427 19:14:08.574328305      15 outlier_detection.cc:950]   [outlier_detection_lb 0x55d01ab087b0] checking candidate 0x55d01b018860: success_rate=0.000
I0427 19:14:08.574334408      15 outlier_detection.cc:964]   [outlier_detection_lb 0x55d01ab087b0] random_key=95 ejected_host_count=0 current_percent=0.000
I0427 19:14:08.574339370      15 outlier_detection.cc:977]   [outlier_detection_lb 0x55d01ab087b0] ejecting candidate
I0427 19:14:08.574351486      15 subchannel_list.h:247]      [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): connectivity changed: old_state=READY, new_state=TRANSIENT_FAILURE, status=UNAVAILABLE: subchannel ejected by outlier detection, shutting_down=0, pending_watcher=0x55d01af8a7a0
I0427 19:14:08.574357970      15 pick_first.cc:315]          Pick First 0x55d01ac97a70 selected subchannel connectivity changed to TRANSIENT_FAILURE
I0427 19:14:08.574361493      15 child_policy_handler.cc:100] [child_policy_handler 0x55d01af8a6e0] started name re-resolving
I0427 19:14:08.574365196      15 child_policy_handler.cc:100] [child_policy_handler 0x55d01ae65b70] started name re-resolving
I0427 19:14:08.574368366      15 client_channel.cc:912]      chand=0x55d01ad83a70: started name re-resolving
I0427 19:14:08.574373779      15 polling_resolver.cc:231]    [polling resolver 0x55d01af89f70] in cooldown from last resolution (from 1003 ms ago); will resolve again in 28997 ms
I0427 19:14:08.574379794      15 timer_generic.cc:344]       TIMER 0x55d01af89fe0: SET 31006 now 2009 call 0x55d01af8a018[0x7f04362e25c0]
I0427 19:14:08.574382950      15 timer_generic.cc:381]         .. add to shard 19 with queue_deadline_cap=3007 => is_first_timer=false
I0427 19:14:08.574385976      15 subchannel_list.h:419]      [PickFirstSubchannelList 0x55d01ac97a70] Shutting down subchannel_list 0x55d01af8a670
I0427 19:14:08.574390582      15 subchannel_list.h:336]      [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): canceling connectivity watch (shutdown)
I0427 19:14:08.574401550      15 subchannel_list.h:293]      [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): unreffing subchannel (shutdown)
I0427 19:14:08.574407747      15 subchannel_list.h:400]      [PickFirstSubchannelList 0x55d01ac97a70] Destroying subchannel_list 0x55d01af8a670
I0427 19:14:08.574411896      15 outlier_detection.cc:760]   [outlier_detection_lb 0x55d01ab087b0] child connectivity state update: state=IDLE (OK) picker=0x7f042c002f20
I0427 19:14:08.574415391      15 outlier_detection.cc:508]   [outlier_detection_lb 0x55d01ab087b0] constructed new picker 0x7f042c000eb0 and counting is enabled
I0427 19:14:08.574418304      15 outlier_detection.cc:698]   [outlier_detection_lb 0x55d01ab087b0] updating connectivity: state=IDLE status=(OK) picker=0x7f042c000eb0
I0427 19:14:08.574422343      15 client_channel.cc:897]      chand=0x55d01ad83a70: update: state=IDLE status=(OK) picker=0x7f042c000eb0
I0427 19:14:08.574426787      15 connectivity_state.cc:160]  ConnectivityStateTracker client_channel[0x55d01ad83b28]: READY -> IDLE (helper, OK)
/app/domains/ephemera/apps/ephemera-probe-python/ephemera-probe-python: line 27:    11 Segmentation fault      $PYTHON_INTERPRETER_RUNFILE_PATH $PYTHON_BIN_RUNFILE_PATH "$@"

Anything else we should know about your project / environment?
