What version of gRPC and what language are you using?
v1.50.0
What operating system (Linux, Windows,...) and version?
Linux
What runtime / compiler are you using (e.g. python version or version of gcc)
python 3
What did you do?
- Use the following service config:
{
  "loadBalancingConfig": [
    {
      "outlier_detection_experimental": {
        "interval": "1s",
        "baseEjectionTime": "30s",
        "maxEjectionTime": "300s",
        "maxEjectionPercent": 100,
        "failurePercentageEjection": {
          "enforcementPercentage": 100,
          "requestVolume": 1,
          "minimumHosts": 1
        },
        "childPolicy": [{
          "pick_first": {}
        }]
      }
    }
  ]
}
- Create two upstream services: one that always returns errors and one that always succeeds.
- Run the test several times until pick_first picks the bad upstream. (This step is needed because our DNS server randomizes the order of its response. Alternatively, make sure the bad server is always returned first in the DNS response.)
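For anyone trying to reproduce this without our DNS setup, the steps above can be sketched roughly as follows. This is not our actual probe code: the `probe.Echo/Call` method name, the helper names, and the use of an `ipv4:` target (to force the bad server to be listed first instead of relying on DNS ordering) are all assumptions made for illustration, and I have not confirmed that this exact script triggers the crash.

```python
import json
import time
from concurrent import futures

# Service config from the steps above, expressed as a Python dict.
SERVICE_CONFIG = {
    "loadBalancingConfig": [{
        "outlier_detection_experimental": {
            "interval": "1s",
            "baseEjectionTime": "30s",
            "maxEjectionTime": "300s",
            "maxEjectionPercent": 100,
            "failurePercentageEjection": {
                "enforcementPercentage": 100,
                "requestVolume": 1,
                "minimumHosts": 1,
            },
            "childPolicy": [{"pick_first": {}}],
        }
    }]
}


def channel_options():
    # "grpc.service_config" is the channel argument that carries the JSON config.
    return [("grpc.service_config", json.dumps(SERVICE_CONFIG))]


def build_target(bad_port, good_port):
    # Assumption: list the bad server first so pick_first tries it first.
    return f"ipv4:127.0.0.1:{bad_port},127.0.0.1:{good_port}"


def start_server(always_fail):
    import grpc  # requires grpcio; imported lazily so the helpers above work without it

    def handle(request, context):
        if always_fail:
            context.abort(grpc.StatusCode.UNAVAILABLE, "bad upstream")
        return request  # echo the raw request bytes back

    # Hypothetical service/method name; no .proto is needed for raw bytes.
    handler = grpc.method_handlers_generic_handler(
        "probe.Echo", {"Call": grpc.unary_unary_rpc_method_handler(handle)})
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    server.add_generic_rpc_handlers((handler,))
    port = server.add_insecure_port("127.0.0.1:0")  # 0 lets the OS pick a free port
    server.start()
    return server, port


def run_probe(calls=20):
    import grpc

    bad, bad_port = start_server(always_fail=True)
    good, good_port = start_server(always_fail=False)
    results = []
    target = build_target(bad_port, good_port)
    with grpc.insecure_channel(target, options=channel_options()) as channel:
        call = channel.unary_unary("/probe.Echo/Call")
        for _ in range(calls):
            try:
                call(b"ping", timeout=2)
                results.append("OK")
            except grpc.RpcError as err:
                results.append(err.code().name)
            time.sleep(0.2)  # span at least one 1s ejection interval
    bad.stop(None)
    good.stop(None)
    return results
```

Calling `run_probe()` returns the per-call status names. The expectation described below is that the first calls fail with `UNAVAILABLE` and later calls succeed once the bad host has been ejected.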
What did you expect to see?
Outlier detection LB should eject the failing host (this part actually happens), and pick_first should then pick the second, healthy host and keep using it.
What did you see instead?
Segfault.
Here are the most verbose trace logs at the time of failure:
I0427 19:14:08.574305009 15 outlier_detection.cc:827] [outlier_detection_lb 0x55d01ab087b0] ejection timer running
I0427 19:14:08.574316433 15 outlier_detection.cc:868] [outlier_detection_lb 0x55d01ab087b0] found 0 success rate candidates and 1 failure percentage candidates; ejected_host_count=0; success_rate_sum=0.000
I0427 19:14:08.574324699 15 outlier_detection.cc:944] [outlier_detection_lb 0x55d01ab087b0] running failure percentage algorithm
I0427 19:14:08.574328305 15 outlier_detection.cc:950] [outlier_detection_lb 0x55d01ab087b0] checking candidate 0x55d01b018860: success_rate=0.000
I0427 19:14:08.574334408 15 outlier_detection.cc:964] [outlier_detection_lb 0x55d01ab087b0] random_key=95 ejected_host_count=0 current_percent=0.000
I0427 19:14:08.574339370 15 outlier_detection.cc:977] [outlier_detection_lb 0x55d01ab087b0] ejecting candidate
I0427 19:14:08.574351486 15 subchannel_list.h:247] [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): connectivity changed: old_state=READY, new_state=TRANSIENT_FAILURE, status=UNAVAILABLE: subchannel ejected by outlier detection, shutting_down=0, pending_watcher=0x55d01af8a7a0
I0427 19:14:08.574357970 15 pick_first.cc:315] Pick First 0x55d01ac97a70 selected subchannel connectivity changed to TRANSIENT_FAILURE
I0427 19:14:08.574361493 15 child_policy_handler.cc:100] [child_policy_handler 0x55d01af8a6e0] started name re-resolving
I0427 19:14:08.574365196 15 child_policy_handler.cc:100] [child_policy_handler 0x55d01ae65b70] started name re-resolving
I0427 19:14:08.574368366 15 client_channel.cc:912] chand=0x55d01ad83a70: started name re-resolving
I0427 19:14:08.574373779 15 polling_resolver.cc:231] [polling resolver 0x55d01af89f70] in cooldown from last resolution (from 1003 ms ago); will resolve again in 28997 ms
I0427 19:14:08.574379794 15 timer_generic.cc:344] TIMER 0x55d01af89fe0: SET 31006 now 2009 call 0x55d01af8a018[0x7f04362e25c0]
I0427 19:14:08.574382950 15 timer_generic.cc:381] .. add to shard 19 with queue_deadline_cap=3007 => is_first_timer=false
I0427 19:14:08.574385976 15 subchannel_list.h:419] [PickFirstSubchannelList 0x55d01ac97a70] Shutting down subchannel_list 0x55d01af8a670
I0427 19:14:08.574390582 15 subchannel_list.h:336] [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): canceling connectivity watch (shutdown)
I0427 19:14:08.574401550 15 subchannel_list.h:293] [PickFirstSubchannelList 0x55d01ac97a70] subchannel list 0x55d01af8a670 index 0 of 2 (subchannel 0x55d01ad32d10): unreffing subchannel (shutdown)
I0427 19:14:08.574407747 15 subchannel_list.h:400] [PickFirstSubchannelList 0x55d01ac97a70] Destroying subchannel_list 0x55d01af8a670
I0427 19:14:08.574411896 15 outlier_detection.cc:760] [outlier_detection_lb 0x55d01ab087b0] child connectivity state update: state=IDLE (OK) picker=0x7f042c002f20
I0427 19:14:08.574415391 15 outlier_detection.cc:508] [outlier_detection_lb 0x55d01ab087b0] constructed new picker 0x7f042c000eb0 and counting is enabled
I0427 19:14:08.574418304 15 outlier_detection.cc:698] [outlier_detection_lb 0x55d01ab087b0] updating connectivity: state=IDLE status=(OK) picker=0x7f042c000eb0
I0427 19:14:08.574422343 15 client_channel.cc:897] chand=0x55d01ad83a70: update: state=IDLE status=(OK) picker=0x7f042c000eb0
I0427 19:14:08.574426787 15 connectivity_state.cc:160] ConnectivityStateTracker client_channel[0x55d01ad83b28]: READY -> IDLE (helper, OK)
/app/domains/ephemera/apps/ephemera-probe-python/ephemera-probe-python: line 27: 11 Segmentation fault $PYTHON_INTERPRETER_RUNFILE_PATH $PYTHON_BIN_RUNFILE_PATH "$@"
Anything else we should know about your project / environment?