Fix #1110 Socket Mode disconnection issue with the aiohttp-based client#1112
Conversation
| self.wss_uri = await self.issue_new_wss_url() | ||
| self.current_session = await self.aiohttp_client_session.ws_connect( | ||
| self.wss_uri, | ||
| autoping=False, |
There was a problem hiding this comment.
This is the key change in this pull request. I've decided to stop relying on the AIOHTTP's ping-pong to better track the actual liveness of the connections on our end.
We've added manual ping-pong communication and last_ping_pong_time as the state of the current session. If the value is too old, this Socket Mode client tries to reconnect anyways. This is a proven way in the non-asyncio Socket Mode client. We haven't any similar bug reports for the non-asyncio client.
| self.monitor_current_session() | ||
| ) | ||
| if self.current_session_monitor is not None: | ||
| self.current_session_monitor.cancel() |
There was a problem hiding this comment.
Also, while doing more tests (=aggressively & repeatedly doing force reconnection), I found that the Task object may refer to old abandoned objects. Thus, cancelling the old Task this way makes this code much safer.
There was a problem hiding this comment.
Is it worth filing this issue upstream to the aio project?
There was a problem hiding this comment.
Some part of the session object may be related to the behavior but this part is our own code. Thus, no need to file an issue to the project. Basically, we've improved our code to be safer this time 😅
| web_client: Optional[AsyncWebClient] = None, | ||
| auto_reconnect_enabled: bool = True, | ||
| ping_interval: float = 10, | ||
| trace_enabled: bool = False, |
There was a problem hiding this comment.
As we added this flag to the parent class, this should be added to this websockets based implementation too.
| elif message.type == WSMsgType.CLOSED: | ||
| await asyncio.sleep(self.ping_interval) | ||
| continue | ||
| elif message.type == WSMsgType.PING: |
There was a problem hiding this comment.
Now that we handles ping / pong on our own, I've added these handlers.
|
|
||
| if self.last_ping_pong_time is not None: | ||
| disconnected_seconds = int(time.time() - self.last_ping_pong_time) | ||
| if disconnected_seconds >= (self.ping_interval * 4): |
There was a problem hiding this comment.
This is the same logic with the non-asyncio client.
Codecov Report
@@ Coverage Diff @@
## main #1112 +/- ##
==========================================
- Coverage 85.71% 85.34% -0.38%
==========================================
Files 110 110
Lines 10443 10506 +63
==========================================
+ Hits 8951 8966 +15
- Misses 1492 1540 +48
Continue to review full report at Codecov.
|
| should_reconnect = False | ||
| if self.current_session is None or self.current_session.closed: | ||
| self.logger.info( | ||
| "The session seems to be already closed. Going to reconnect..." |
There was a problem hiding this comment.
Suggesting "Reconnecting..." instead of "Going to reconnect..." to be more consistent with "Disconnecting..."
|
|
||
| if should_reconnect is True or not await self.is_connected(): | ||
| await self.connect_to_new_endpoint() | ||
| self.logger.info("Reconnection done.") |
There was a problem hiding this comment.
Suggest "Reconnected to endpoint" (with no period at end; other log messages don't end with punctuation)
| elif message.type == WSMsgType.CLOSE: | ||
| if self.auto_reconnect_enabled: | ||
| self.logger.info( | ||
| "Received CLOSE event. Going to reconnect..." |
There was a problem hiding this comment.
Suggesting "Reconnecting..." instead of "Going to reconnect..." here too
| try: | ||
| await self.connect_operation_lock.acquire() | ||
| if self.trace_enabled: | ||
| self.logger.debug("For reconnection, the connect_operation_lock was acquired") |
There was a problem hiding this comment.
Is this called in other cases than a reconnect? If so, suggest simpler "Acquired connect_operation_lock"
| if self.connect_operation_lock.locked() is True: | ||
| self.connect_operation_lock.release() | ||
| if self.trace_enabled: | ||
| self.logger.debug("The connect_operation_lock for reconnection was released") |
There was a problem hiding this comment.
If changing above, suggest simpler "Released connection_operation_lock" here also
|
@eddyg Thanks for the suggestion! I agree with you on the log message consistency. I've updated them. |
filmaj
left a comment
There was a problem hiding this comment.
Looks good to me!
I wasn't able to determine from a quick ack through the project, but, are there tests for handling ping/pong messages? Possibly should we add tests for the monitoring / reconnection behaviours?
| self.monitor_current_session() | ||
| ) | ||
| if self.current_session_monitor is not None: | ||
| self.current_session_monitor.cancel() |
There was a problem hiding this comment.
Is it worth filing this issue upstream to the aio project?
| try: | ||
| message: WSMessage = await self.current_session.receive() | ||
| if self.trace_enabled and self.logger.level <= logging.DEBUG: | ||
| type = WSMsgType(message.type) |
There was a problem hiding this comment.
The next conditional below checks that message is not None - should that be done here too?
There was a problem hiding this comment.
The type should always exists but just in case, we may want to have additional check before the assignment. will update this part later.
| if is_stale: | ||
| self.logger.info( | ||
| "The connection seems to be stale. Disconnecting..." | ||
| "The connection seems to be stale. Reconnecting..." |
There was a problem hiding this comment.
Just to clarify and for my own understanding: so the next disconnect() call will end up actually re-establishing the connection via the monitoring loop you added?
There was a problem hiding this comment.
Thanks. You're right. This change should be reverted as the situation here is not always auto-reconnection enabled.
Yes, this test runs the code of the patterns.
Right now, we don't have any tests for this scenario. The reconnection tests require more updates on the testing utility side. I did lots of manual testing this time. Let me add automated tests for it in future updates. |
|
OK sounds good - I am not saying it is required, I am mainly asking questions to learn 😄 |
Summary
This pull request fixes #1110 by improving the connection management of the AIOHTTP based Socket Mode client. I will add a few comments on the improvement.
Category (place an
xin each of the[ ])/docs-src(Documents, have you run./scripts/docs.sh?)/docs-src-v2(Documents, have you run./scripts/docs-v2.sh?)/tutorial(PythOnBoardingBot tutorial)tests/integration_tests(Automated tests for this library)Requirements (place an
xin each[ ])python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.shafter making the changes.