
Network/CBS auth issues can cause messages to be lost or delayed before reaching the caller during receive #20086

@enoeden

Description

  • Package Name: azure-servicebus
  • Package Version: 7.3.1
  • Operating System: Ubuntu
  • Python Version: 3.7.1

Describe the bug
We observed message delays and re-deliveries in our environment, so we wrote some simple send-receive code to reproduce them.
After running it for several days, we found the issue occurred many times, with several distinct root causes, few of which can be handled by the user. We are filing this ticket to report them and to ask whether there are any enhancement points or workarounds.

To Reproduce
Steps to reproduce the behavior:

  1. We set up a queue with max delivery count 2 and lock duration 2 min. The sender sends a batch of 20 messages (each message body contains the timestamp at send time) every 150 s; a receiver running in a container receives each message, settles it after roughly 20~30 s, and logs the duration between send and receive.
  2. We use ClientSecretCredential for Service Bus auth and websocket as the transport_type, with debug logging enabled for both servicebus and uamqp.
  3. By hunting for long-duration messages and analyzing that host's log for that time window, we can identify the abnormal point.
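The delay measurement in step 1 can be sketched in plain Python. This is a hedged, self-contained sketch: `make_body`/`classify` are hypothetical helper names, and the real harness's azure-servicebus send/receive calls are omitted so the snippet runs on its own. The 120 s threshold is the queue's 2 min lock duration.

```python
import json
import time

LOCK_DURATION_S = 120  # queue lock duration: 2 min

def make_body(now=None):
    """Sender side: embed the send timestamp in the message body."""
    return json.dumps({"sent_at": now if now is not None else time.time()})

def classify(body, received_at):
    """Receiver side: compute the send->receive duration and flag suspects.

    A duration at or beyond the lock duration (120 s) suggests the first
    delivery was never surfaced to the app and the message was re-delivered.
    """
    sent_at = json.loads(body)["sent_at"]
    delay = received_at - sent_at
    if delay >= LOCK_DURATION_S:
        return delay, "redelivered-or-delayed"
    return delay, "normal"

# Example: a message sent at t=0 but surfaced to the app 125 s later
delay, verdict = classify(make_body(now=0.0), 125.0)
print(delay, verdict)  # 125.0 redelivered-or-delayed
```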

From our test logs we can see some messages being re-delivered (their duration is ~120 s, matching the lock duration), and some messages are even delayed 180 s/300 s. It appears the underlying link receives the TRANSFER frame but never gets a chance to return it to the caller, which also causes a lock-expired error when we try to settle the message. In addition, some DISPOSITION frames fail to send while the SDK remains unaware of it.

We selected several cases and give a brief description of each so that you can reproduce them more easily.

[screenshot]

The timezone in the descriptions below and in the screenshot is UTC-8, while the raw logs use UTC; importing the logs into Elasticsearch or a similar tool helps a lot while digging.
The extra two lines in the screenshot are for comparison: they were captured with SDK 7.0.1, while we only provide logs for 7.3.1 here.
In the descriptions, host equals the host field in the log, and seq means the seq_number of the message (when a re-delivery happens, we look for the first time that seq_number appears).

  1. 2021-07-26 19:22:11.685 ~ 2021-07-26 19:30:45.873: host 743584b73a35 received the TRANSFER @ July 26th 2021, 19:25:14.998 but the app received it @ July 26th 2021, 19:27:57.671; the FLOW was issued @ July 26th 2021, 19:22:48.114 and a CBS refresh was attempted ~10 s later.
  2. 2021-07-27 01:54:16.968 ~ 2021-07-27 01:58:19.850: host 8c8853256ac9 received seq 38575 @ July 27th 2021, 01:55:03.135 but the link went down @ July 27th 2021, 01:55:32.268, which apparently caused the DISPOSITION to fail.
  3. 2021-07-27 03:07:48.910 ~ 2021-07-27 03:14:11.168: similar to case 1. Host fa72d36c25be received the TRANSFER @ July 27th 2021, 03:08:37.180 but the app received it @ July 27th 2021, 03:11:09.292, along with a CBS auth failure; the FLOW was issued @ July 27th 2021, 03:05:56.452 and a CBS refresh was attempted ~10 s later.
  4. 2021-07-27 06:48:12.438 ~ 2021-07-27 06:59:46.704: similar to case 2, seq 40893. Host 8c8853256ac9 did not notice the connection issue until the CBS refresh timed out.
  5. 2021-07-27 16:14:15.362 ~ 2021-07-27 16:20:53.051: similar to case 1. On host 743584b73a35 the normal receive operation seems to have been disturbed by the CBS refresh (?) around July 27th 2021 16:15:26.546, returning to normal after 5 min, which caused the 30 s delay.
  6. 2021-07-28 00:43:58.664 ~ 2021-07-28 01:00:05.928: host 743584b73a35, seq 49431, error message "Failed to complete the message as the handler has already been shutdown. Please use ServiceBusClient to create a new instance." Settling during handler re-connection caused the message to be re-dispatched; other settlements after the re-connection succeeded via the management link.
  7. 2021-07-28 14:10:50.323 ~ 2021-07-28 14:23:06.054: all hosts had connection errors and seq 55763 appeared for the second time; possibly the websocket link was lost during a transfer?
  8. 2021-07-28 16:08:19.154 ~ 2021-07-28 16:12:50.000: almost the same as case 7.
  9. 2021-07-29 01:08:09.143 ~ 2021-07-29 01:23:29.195: similar to case 4, on host fa72d36c25be.
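The re-delivery hunting described above (finding the first time a given seq_number appears) can be sketched as a small log scan. The `(timestamp, host, seq)` tuple format is a hypothetical stand-in for whatever your parsed log events look like:

```python
from collections import defaultdict

def find_redeliveries(events):
    """events: iterable of (timestamp, host, seq_number) receive events.

    Returns {seq: [timestamps...]} for every seq seen more than once,
    so the first entry is the original delivery and later ones are the
    re-deliveries whose surrounding logs are worth digging into.
    """
    seen = defaultdict(list)
    for ts, host, seq in events:
        seen[seq].append(ts)
    return {seq: ts_list for seq, ts_list in seen.items() if len(ts_list) > 1}

events = [
    ("01:55:03.135", "8c8853256ac9", 38575),
    ("01:56:00.000", "8c8853256ac9", 38576),
    ("01:58:10.000", "8c8853256ac9", 38575),  # second delivery of 38575
]
print(find_redeliveries(events))  # {38575: ['01:55:03.135', '01:58:10.000']}
```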

We classify the cases above into the 2 types below; neither seems to be handled properly.

  • 1. Underlying network error: we can see the message “b'on_underlying_ws_error called with error code 4' (b'/data/src/vendor/azure-uamqp-c/deps/azure-c-shared-utility/src/wsio.c':b'on_underlying_ws_error':450)”, and the issue can be:

  • 1.1 The DISPOSITION frame is lost while the SDK believes the message is settled, and we later see the message re-delivered.

  • 1.2 Sometimes it takes quite a long time before the underlying error is detected, and the operations on the broken link (receive and settle) do not work until it is reset.

  • 2. CBS-refresh-related issue: we see messages like "Unexpected error occurred (TimeoutError('Authorization timeout.')). Handler shutting down." and even get-token timeouts, and the issue can be:

  • 2.1 uamqp and ClientSecretCredential have a window gap when the token is about to expire: uamqp tries to refresh 6 min before expiry, while ClientSecretCredential only fetches a new token 5 min before expiry (before that it returns only the cached token). So uamqp refreshes the token rapidly (about once every 50 ms, for about 1 min) during the gap minute, the normal receive operation seems to stall during that time, and a TRANSFER frame arriving before the refresh is delayed or lost. We are also not sure whether the rapid refresh affects the connection and leads to the "Authorization timeout" later.

  • 2.2 Sometimes get-token hangs until the 300 s timeout (we see login.microsoft ... timeout errors in the log); in that case, TRANSFER frames not yet emitted are lost/delayed and the whole receiver hangs, which makes max wait time ineffective.
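The window gap in 2.1 can be illustrated with a small simulation, using the numbers from this report (uamqp refreshes when less than 360 s of validity remain; the credential only fetches a new token when less than 300 s remain, returning the cached token otherwise). `CachedCredential`/`uamqp_tick` are simplified stand-ins, not the real implementations:

```python
UAMQP_REFRESH_WINDOW_S = 6 * 60       # uamqp starts refreshing 6 min before expiry
CREDENTIAL_REFRESH_WINDOW_S = 5 * 60  # credential serves its cache until 5 min before expiry

class CachedCredential:
    """Mimics ClientSecretCredential's token caching behaviour."""
    def __init__(self, expires_on):
        self.expires_on = expires_on
        self.fetches = 0  # how many real token fetches happened

    def get_token(self, now):
        if self.expires_on - now < CREDENTIAL_REFRESH_WINDOW_S:
            self.fetches += 1
            self.expires_on = now + 3600  # fresh 1 h token
        return self.expires_on

def uamqp_tick(cred, now):
    """uamqp's CBS loop: attempt a refresh whenever < 6 min of validity remains."""
    if cred.expires_on - now < UAMQP_REFRESH_WINDOW_S:
        cred.get_token(now)
        return True  # a refresh attempt happened
    return False

cred = CachedCredential(expires_on=1000)
# In the gap (remaining validity between 300 s and 360 s) uamqp keeps asking,
# but the credential just hands back the cached, soon-to-expire token:
attempts = sum(uamqp_tick(cred, now) for now in range(650, 700))
print(attempts, cred.fetches)  # 50 refresh attempts, 0 real fetches
```

Only once the remaining validity drops below 300 s does `get_token` actually fetch, which matches the observed burst of rapid refresh attempts during the gap minute.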

For the underlying network errors, we have tried moving the receiver to an Azure VM in the same region as the Service Bus namespace, and that seems to help, while the CBS refresh error "Authorization timeout" still occurs.

Expected behavior
We expect the SDK to handle the errors in the cases above gracefully, avoiding "message lost" / "already-lock-expired message delivered to the client", and perhaps to give some advice for client users.
For the CBS refresh, we are not sure whether "rapid token refresh caused by the refresh-window gap" and "a get-token timeout blocking the whole receive operation" is the intended workflow.
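Until the SDK handles these cases internally, one user-side mitigation matching the error text in case 6 ("Please use ServiceBusClient to create a new instance") is to treat the receiver as disposable and rebuild it on failure. A minimal retry skeleton, as a sketch: `make_receiver` and `process` are hypothetical stand-ins for a call like `client.get_queue_receiver(...)` and the app's receive-and-settle logic:

```python
import time

def receive_with_rebuild(make_receiver, process, max_rebuilds=5, backoff_s=1.0):
    """Run the receive loop; on any handler failure, drop the receiver and
    build a fresh one instead of reusing the shut-down handler.

    make_receiver: () -> iterable of messages (e.g. a new queue receiver)
    process: message -> None (handle + settle; may raise on link errors)
    """
    rebuilds = 0
    while rebuilds <= max_rebuilds:
        receiver = make_receiver()
        try:
            for msg in receiver:
                process(msg)
            return  # clean end of stream
        except Exception:
            rebuilds += 1
            time.sleep(backoff_s)  # let transient network/CBS issues settle
    raise RuntimeError("receiver kept failing after %d rebuilds" % max_rebuilds)
```

Because a lost DISPOSITION (case 1.1) means completion is never guaranteed, `process` must also handle re-delivered messages idempotently; rebuilding the receiver alone does not remove duplicates.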


Additional context
rawlogs.zip

Metadata

Labels: Client, MQ, Messaging, Service Bus, customer-reported, needs-team-attention, question
