Skip to content

Fix #1110 Socket Mode disconnection issue with the aiohttp-based client#1112

Merged
seratch merged 4 commits intoslackapi:mainfrom
seratch:issue-1110-aiohttp-socket-mode
Sep 9, 2021
Merged

Fix #1110 Socket Mode disconnection issue with the aiohttp-based client#1112
seratch merged 4 commits intoslackapi:mainfrom
seratch:issue-1110-aiohttp-socket-mode

Conversation

@seratch
Copy link
Copy Markdown
Contributor

@seratch seratch commented Sep 9, 2021

Summary

This pull request fixes #1110 by improving the connection management of the AIOHTTP based Socket Mode client. I will add a few comments on the improvement.

Category (place an x in each of the [ ])

  • slack_sdk.web.WebClient (sync/async) (Web API client)
  • slack_sdk.webhook.WebhookClient (sync/async) (Incoming Webhook, response_url sender)
  • slack_sdk.socket_mode (Socket Mode client)
  • slack_sdk.signature (Request Signature Verifier)
  • slack_sdk.oauth (OAuth Flow Utilities)
  • slack_sdk.models (UI component builders)
  • slack_sdk.scim (SCIM API client)
  • slack_sdk.audit_logs (Audit Logs API client)
  • slack_sdk.rtm_v2 (RTM client)
  • /docs-src (Documents, have you run ./scripts/docs.sh?)
  • /docs-src-v2 (Documents, have you run ./scripts/docs-v2.sh?)
  • /tutorial (PythOnBoardingBot tutorial)
  • tests/integration_tests (Automated tests for this library)

Requirements (place an x in each [ ])

  • I've read and understood the Contributing Guidelines and have done my best effort to follow them.
  • I've read and agree to the Code of Conduct.
  • I've run python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.sh after making the changes.

@seratch seratch added this to the 3.11.0 milestone Sep 9, 2021
@seratch seratch self-assigned this Sep 9, 2021
self.wss_uri = await self.issue_new_wss_url()
self.current_session = await self.aiohttp_client_session.ws_connect(
self.wss_uri,
autoping=False,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the key change in this pull request. I've decided to stop relying on the AIOHTTP's ping-pong to better track the actual liveness of the connections on our end.

We've added manual ping-pong communication and last_ping_pong_time as the state of the current session. If the value is too old, this Socket Mode client tries to reconnect anyways. This is a proven way in the non-asyncio Socket Mode client. We haven't any similar bug reports for the non-asyncio client.

self.monitor_current_session()
)
if self.current_session_monitor is not None:
self.current_session_monitor.cancel()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, while doing more tests (=aggressively & repeatedly doing force reconnection), I found that the Task object may refer to old abandoned objects. Thus, cancelling the old Task this way makes this code much safer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth filing this issue upstream to the aio project?

Copy link
Copy Markdown
Contributor Author

@seratch seratch Sep 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some part of the session object may be related to the behavior but this part is our own code. Thus, no need to file an issue to the project. Basically, we've improved our code to be safer this time 😅

web_client: Optional[AsyncWebClient] = None,
auto_reconnect_enabled: bool = True,
ping_interval: float = 10,
trace_enabled: bool = False,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we added this flag to the parent class, this should be added to this websockets based implementation too.

elif message.type == WSMsgType.CLOSED:
await asyncio.sleep(self.ping_interval)
continue
elif message.type == WSMsgType.PING:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we handles ping / pong on our own, I've added these handlers.


if self.last_ping_pong_time is not None:
disconnected_seconds = int(time.time() - self.last_ping_pong_time)
if disconnected_seconds >= (self.ping_interval * 4):
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same logic with the non-asyncio client.

@seratch seratch requested a review from filmaj September 9, 2021 14:52
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 9, 2021

Codecov Report

Merging #1112 (07af1ee) into main (9f3240d) will decrease coverage by 0.37%.
The diff coverage is 18.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1112      +/-   ##
==========================================
- Coverage   85.71%   85.34%   -0.38%     
==========================================
  Files         110      110              
  Lines       10443    10506      +63     
==========================================
+ Hits         8951     8966      +15     
- Misses       1492     1540      +48     
Impacted Files Coverage Δ
slack_sdk/socket_mode/builtin/client.py 79.19% <ø> (-0.68%) ⬇️
slack_sdk/socket_mode/websocket_client/__init__.py 60.76% <0.00%> (ø)
slack_sdk/socket_mode/aiohttp/__init__.py 46.11% <9.18%> (-10.93%) ⬇️
slack_sdk/socket_mode/async_client.py 71.17% <73.33%> (-0.12%) ⬇️
slack_sdk/socket_mode/websockets/__init__.py 62.03% <100.00%> (+0.71%) ⬆️
slack_sdk/socket_mode/builtin/connection.py 35.71% <0.00%> (-0.43%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9f3240d...07af1ee. Read the comment docs.

should_reconnect = False
if self.current_session is None or self.current_session.closed:
self.logger.info(
"The session seems to be already closed. Going to reconnect..."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggesting "Reconnecting..." instead of "Going to reconnect..." to be more consistent with "Disconnecting..."


if should_reconnect is True or not await self.is_connected():
await self.connect_to_new_endpoint()
self.logger.info("Reconnection done.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest "Reconnected to endpoint" (with no period at end; other log messages don't end with punctuation)

elif message.type == WSMsgType.CLOSE:
if self.auto_reconnect_enabled:
self.logger.info(
"Received CLOSE event. Going to reconnect..."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggesting "Reconnecting..." instead of "Going to reconnect..." here too

try:
await self.connect_operation_lock.acquire()
if self.trace_enabled:
self.logger.debug("For reconnection, the connect_operation_lock was acquired")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this called in other cases than a reconnect? If so, suggest simpler "Acquired connect_operation_lock"

if self.connect_operation_lock.locked() is True:
self.connect_operation_lock.release()
if self.trace_enabled:
self.logger.debug("The connect_operation_lock for reconnection was released")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If changing above, suggest simpler "Released connection_operation_lock" here also

@seratch
Copy link
Copy Markdown
Contributor Author

seratch commented Sep 9, 2021

@eddyg Thanks for the suggestion! I agree with you on the log message consistency. I've updated them.

Copy link
Copy Markdown
Contributor

@filmaj filmaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

I wasn't able to determine from a quick ack through the project, but, are there tests for handling ping/pong messages? Possibly should we add tests for the monitoring / reconnection behaviours?

self.monitor_current_session()
)
if self.current_session_monitor is not None:
self.current_session_monitor.cancel()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth filing this issue upstream to the aio project?

try:
message: WSMessage = await self.current_session.receive()
if self.trace_enabled and self.logger.level <= logging.DEBUG:
type = WSMsgType(message.type)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next conditional below checks that message is not None - should that be done here too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type should always exists but just in case, we may want to have additional check before the assignment. will update this part later.

if is_stale:
self.logger.info(
"The connection seems to be stale. Disconnecting..."
"The connection seems to be stale. Reconnecting..."
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify and for my own understanding: so the next disconnect() call will end up actually re-establishing the connection via the monitoring loop you added?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. You're right. This change should be reverted as the situation here is not always auto-reconnection enabled.

@seratch
Copy link
Copy Markdown
Contributor Author

seratch commented Sep 9, 2021

@filmaj

are there tests for handling ping/pong messages?

Yes, this test runs the code of the patterns.

Possibly should we add tests for the monitoring / reconnection behaviours?

Right now, we don't have any tests for this scenario. The reconnection tests require more updates on the testing utility side. I did lots of manual testing this time. Let me add automated tests for it in future updates.

@filmaj
Copy link
Copy Markdown
Contributor

filmaj commented Sep 9, 2021

OK sounds good - I am not saying it is required, I am mainly asking questions to learn 😄

@seratch seratch merged commit dafdef0 into slackapi:main Sep 9, 2021
@seratch seratch deleted the issue-1110-aiohttp-socket-mode branch September 9, 2021 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Socket Mode disconnection issue with the aiohttp-based client

3 participants