refactor: prioritize control traffic and shape RoomTransform broadcasts #282

Merged
from2001 merged 3 commits into develop from
codex/implement-server.py-improvements-for-transform-i2297p
on Jan 12, 2026

Conversation

@from2001 (Collaborator)

Motivation

  • Ensure RPC and Network Variable (NV) delivery is not delayed by high-rate RoomTransform processing under load.
  • Reduce lock contention in receive/periodic paths by moving heavy transform work outside self._rooms_lock.
  • Limit transform CPU and bandwidth impact so control-plane messages remain responsive even with many clients.
  • Preserve latest-only semantics for transforms while avoiding control-plane starvation.

Description

  • Prioritize control messages by adding a dedicated control queue self._pub_queue_ctrl and draining it first in _publisher_loop with a CTRL_DRAIN_BATCH limit and backlog watermark CTRL_BACKLOG_WATERMARK.
  • Add transform shaping using a token-bucket style budget (TRANSFORM_BUDGET_BYTES_PER_SEC) and send coalesced transforms with zmq.DONTWAIT in _try_send_transform to avoid blocking the publisher on slow SUBs.
  • Move NV flush before transform generation in _periodic_loop and skip transform generation when control backlog is high via _control_backlog_exceeded.
  • Reduce lock hold time by snapshotting room clients under self._rooms_lock, extracting per-client transform body bytes at receive time (saved in self.client_transform_body_cache), and serializing room transforms from these snapshots using bytearray + struct.pack_into in _serialize_room_transform outside the rooms lock.
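The drain order described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual server.py code: the queue name, the `send` callback, and the dict used for coalesced transforms are assumptions; `send` returning False models the would-block condition (`zmq.Again` when sending with `zmq.DONTWAIT`).

```python
import queue

CTRL_DRAIN_BATCH = 256  # bound on control messages drained per publisher pass

def drain_once(ctrl_q: queue.Queue, coalesced: dict, send) -> int:
    """One publisher pass: control traffic first, then latest-only transforms.

    `send(topic, payload)` returns False to signal a would-block condition
    (the equivalent of zmq.Again on a non-blocking send).
    """
    drained = 0
    # Phase 1: drain control messages (RPC/NV) first, but bounded so
    # transforms are never starved indefinitely.
    for _ in range(CTRL_DRAIN_BATCH):
        try:
            topic, payload = ctrl_q.get_nowait()
        except queue.Empty:
            break
        send(topic, payload)
        drained += 1
    # Phase 2: send coalesced transforms; stop at the first would-block
    # so a slow subscriber cannot stall the publisher loop.
    for topic in list(coalesced):
        if send(topic, coalesced[topic]):
            del coalesced[topic]
        else:
            break
    return drained
```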

Testing

  • No automated tests were executed as part of this change (no pytest run).

Codex Task

@from2001 (Collaborator, Author)

@copilot Review this PR.


Copilot AI commented Jan 12, 2026

@from2001 I've opened a new pull request, #283, to work on those changes. Once the pull request is ready, I'll request review from you.

@from2001 from2001 marked this pull request as ready for review January 12, 2026 11:31
@from2001 (Collaborator, Author)

@claude Review the latest code.


claude bot commented Jan 12, 2026

Claude finished @from2001's task in 1m 28s


Code Review Complete ✓

Todo List

  • Read the modified server.py file
  • Compare changes against origin/develop
  • Review implementation details
  • Provide comprehensive feedback

Review Summary

This PR introduces significant improvements to the server's traffic handling architecture by prioritizing control messages (RPC/NV) over transforms and adding rate limiting to prevent control-plane starvation. The implementation is generally solid, but I've identified several areas that need attention.

Critical Issues

1. Potential Deadlock in _serialize_room_transform (High Priority)

Location: server.py:1545-1575

The _serialize_room_transform method is called from _broadcast_room (line 1538), which is invoked from _adaptive_broadcast_all_rooms (line 1525) outside the _rooms_lock. However, it calls binary_serializer._serialize_client_data_short(buffer, transform_data) which may access shared state without proper locking.

Concern: If _serialize_client_data_short or _pack_string access any shared server state, this could cause race conditions since the snapshot is taken under lock but serialization happens outside.

Recommendation:

  • Verify that binary_serializer._serialize_client_data_short and binary_serializer._pack_string are pure functions that don't access server state
  • Add documentation comments confirming the lock-free nature of serialization
  • Consider adding assertions in debug mode to detect any unexpected state access

2. Token Bucket Refund Logic May Under-Refund (Medium Priority)

Location: server.py:438-447

When send_multipart fails with zmq.Again, the code refunds the full cost. However, the cost is calculated as len(message_bytes) * sub_count (line 421). If the SUB count changes between the cost calculation and the refund, you might refund the wrong amount.

sub_count = max(1, self._get_sub_connection_count())  # Line 420
cost = len(message_bytes) * sub_count                  # Line 421
# ... later ...
self._transform_tokens += cost                         # Line 440 (refund)

Recommendation: Capture the amount actually deducted and refund exactly that value, rather than recalculating it:

deducted_cost = cost  # Save the actual deducted amount
# ... in error handlers ...
self._transform_tokens += deducted_cost
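For illustration, a minimal token bucket with an exact-refund path might look like this. The class and attribute names are hypothetical, not the server's actual fields:

```python
import time

class TransformBudget:
    """Minimal token bucket (bytes/sec) with an exact-refund path.

    Hypothetical sketch: not the server's implementation.
    """

    def __init__(self, bytes_per_sec: float) -> None:
        self.rate = bytes_per_sec
        self.tokens = bytes_per_sec
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_spend(self, cost: float) -> float:
        """Deduct `cost` if available; return the amount actually deducted."""
        self._refill()
        if self.tokens < cost:
            return 0.0
        self.tokens -= cost
        return cost

    def refund(self, deducted: float) -> None:
        """Return exactly what was deducted (e.g. after zmq.Again)."""
        self.tokens = min(self.rate, self.tokens + deducted)
```

The caller computes `cost = len(message_bytes) * sub_count` once, keeps the value `try_spend` returned, and passes that same value to `refund` on a failed send, so a changed `sub_count` can never skew the refund.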

3. Client Transform Body Cache Leak Risk (Medium Priority)

Location: server.py:978, 991, 1598

The client_transform_body_cache is cleaned up when clients timeout (line 1598), but if a client_no is reused (via _find_reusable_client_no), the old cached body might still be present. This could cause stale transform data to be broadcast.

Recommendation: Clear the cache entry when reusing a client_no in _find_reusable_client_no:

def _find_reusable_client_no(self, room_id: str) -> int:
    # ... existing code ...
    if found_reusable_client_no != -1:
        # Clear stale cache entry
        if found_reusable_client_no in self.client_transform_body_cache:
            del self.client_transform_body_cache[found_reusable_client_no]
    return found_reusable_client_no

Design Issues

4. Backlog Check After NV Flush Creates Priority Inversion (Medium Priority)

Location: server.py:1419-1420

The NV flush happens before the control backlog check:

# Flush Network Variables before other categories
# ... NV flush code ...
if not self._control_backlog_exceeded():
    self._adaptive_broadcast_all_rooms(current_time)

Problem: If the NV flush itself enqueues many control messages via _enqueue_pub, it could push the backlog over CTRL_BACKLOG_WATERMARK, but transforms would still be skipped even though the backlog was caused by the flush, not incoming traffic.

Recommendation: Check backlog before NV flush as well, or use separate watermarks for "skip transforms" vs "skip NV flush".
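As a sketch, the separate-watermark variant could look like this. The watermark constants and the callback shape are assumptions for illustration, not proposals for exact values:

```python
# Hypothetical watermarks: a lower one gates transform generation,
# a higher one gates even the NV flush.
CTRL_WM_SKIP_TRANSFORMS = 512
CTRL_WM_SKIP_NV_FLUSH = 2048

def periodic_step(backlog_size, flush_nv, broadcast_transforms) -> None:
    """One periodic pass using two watermarks instead of one.

    `backlog_size()` reports the current control-queue depth; it is
    re-read after the NV flush so messages the flush itself enqueued
    count toward the transform decision.
    """
    if backlog_size() < CTRL_WM_SKIP_NV_FLUSH:
        flush_nv()
    if backlog_size() < CTRL_WM_SKIP_TRANSFORMS:
        broadcast_transforms()
```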

5. Missing Coalesce Buffer Size Limit (Low-Medium Priority)

Location: server.py:209-211

The _coalesce_latest dictionary has no size limit. With many active rooms, this could grow unbounded if transforms are generated faster than the publisher can drain them.

Recommendation: Add a watermark check in _enqueue_pub_latest:

def _enqueue_pub_latest(self, topic_bytes: bytes, message_bytes: bytes):
    with self._coalesce_lock:
        if len(self._coalesce_latest) > 1000:  # Or make this a constant
            logger.warning("Coalesce buffer overflow, dropping oldest")
            # Drop oldest entry (FIFO)
            self._coalesce_latest.pop(next(iter(self._coalesce_latest)))
        self._coalesce_latest[topic_bytes] = message_bytes

Code Quality Issues

6. Inconsistent Lock Acquisition Patterns (Low Priority)

Location: Multiple places

Some methods acquire _rooms_lock at the top level (e.g., _get_or_assign_client_no:549), while others acquire it mid-function (e.g., _adaptive_broadcast_all_rooms:1483). This makes it harder to reason about lock ordering.

Recommendation: Establish and document a consistent pattern. Consider using context managers with named scopes:

with self._rooms_lock:  # LOCK: snapshot room state
    # ... snapshot code ...
# UNLOCK: now serialize outside lock

7. Magic Number for Batch Size (Low Priority)

Location: server.py:158

CTRL_DRAIN_BATCH = 256 - This value appears arbitrary. Add a comment explaining the rationale (e.g., "Balance between latency and throughput based on benchmarks").

8. Potential Type Inconsistency in _adaptive_broadcast_all_rooms (Low Priority)

Location: server.py:1489

The line:

last_broadcast = self.room_last_broadcast.get(room_id, 0.0)

Uses 0.0 as the default, but previously used 0 (line 1413). While functionally equivalent, using 0.0 consistently improves type clarity.

Status: Actually, this is now consistent with line 1383. Good fix!

Performance Observations

9. Good: Lock Hold Time Reduction

The snapshot-based approach in _adaptive_broadcast_all_rooms (lines 1508-1518) successfully reduces lock hold time by extracting client data under lock, then releasing before serialization. This is excellent for concurrency.

10. Good: Control Traffic Prioritization

The two-phase drain in _publisher_loop (lines 477-508) properly prioritizes control messages. The CTRL_DRAIN_BATCH limit prevents RPC/NV from completely starving transforms.

11. Good: Token Bucket Rate Limiting

The token bucket implementation (lines 398-447) correctly implements refill logic and considers fanout cost (sub_count). This should effectively prevent bandwidth exhaustion.

Testing Gaps

12. Missing Test Coverage (High Priority)

The PR description states: "No automated tests were executed as part of this change (no pytest run)."

Critical tests needed:

  1. Control message priority under high transform load
  2. Token bucket rate limiting behavior
  3. Transform coalescing correctness (latest-only semantics)
  4. Backlog watermark triggering
  5. Cache cleanup during client_no reuse

Recommendation: Before merging, run:

pytest tests/
pytest --cov=src/styly_netsync
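A first test along those lines, here for the latest-only coalescing semantics (test 3), might be sketched like this, using a plain dict as a stand-in for the server's coalesce map:

```python
def test_latest_only_coalescing():
    # Stand-in for the server's coalesce map; not the real _coalesce_latest.
    coalesced = {}

    def enqueue_pub_latest(topic: bytes, payload: bytes) -> None:
        coalesced[topic] = payload  # a newer payload replaces the older one

    enqueue_pub_latest(b"room/alpha", b"transform-v1")
    enqueue_pub_latest(b"room/alpha", b"transform-v2")
    enqueue_pub_latest(b"room/beta", b"transform-v1")

    # Only the latest payload per topic survives; distinct topics coexist.
    assert coalesced[b"room/alpha"] == b"transform-v2"
    assert len(coalesced) == 2
```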

Minor Issues

  1. Removed import: import logging was removed (line 18 in diff) but you added import struct (line 27). Good - struct is actually used.

  2. Renamed cache: client_binary_cache → client_transform_body_cache improves clarity about what's cached (the body without deviceId).

Security Considerations

  1. No security issues identified - The changes don't introduce new attack surfaces. Rate limiting actually improves DoS resilience.

Recommendations Before Merge

  1. Critical: Fix the token bucket refund logic (item #2) to use the actual deducted cost
  2. Critical: Add cache cleanup when reusing client_no (item #3)
  3. Important: Run pytest to verify no regressions (item #12)
  4. Important: Add unit tests for the new control priority logic (item #12)
  5. Consider: Add a coalesce buffer size limit (item #5)
  6. Consider: Adjust backlog check placement relative to the NV flush (item #4)

Positive Aspects

  • Excellent architecture: The separation of control and transform traffic is well-designed
  • Good concurrency patterns: Lock-free serialization from snapshots is a solid approach
  • Clear constants: The new constants (CTRL_DRAIN_BATCH, etc.) make tuning explicit
  • Backward compatible: Changes maintain protocol compatibility

Overall Assessment

This PR makes meaningful improvements to server scalability and responsiveness. The core design is sound, but testing is required before merge, and the token bucket refund logic needs a fix. With those addressed, this will be a solid enhancement.

Recommendation: Request changes for items #2, #3, and #12, then approve after fixes.


@from2001 from2001 merged commit e334c28 into develop Jan 12, 2026
@from2001 from2001 deleted the codex/implement-server.py-improvements-for-transform-i2297p branch January 12, 2026 12:07