refactor: prioritize control traffic and shape RoomTransform broadcasts #282

Merged
from2001 merged 3 commits into develop from
codex/implement-server.py-improvements-for-transform-i2297p
on Jan 12, 2026

Conversation

@from2001 (Collaborator)

Motivation

  • Ensure RPC and Network Variable (NV) delivery is not delayed by high-rate RoomTransform processing under load.
  • Reduce lock contention in receive/periodic paths by moving heavy transform work outside self._rooms_lock.
  • Limit transform CPU and bandwidth impact so control-plane messages remain responsive even with many clients.
  • Preserve latest-only semantics for transforms while avoiding control-plane starvation.

Description

  • Prioritize control messages by adding a dedicated control queue self._pub_queue_ctrl and draining it first in _publisher_loop with a CTRL_DRAIN_BATCH limit and backlog watermark CTRL_BACKLOG_WATERMARK.
  • Add transform shaping using a token-bucket style budget (TRANSFORM_BUDGET_BYTES_PER_SEC) and send coalesced transforms with zmq.DONTWAIT in _try_send_transform to avoid blocking the publisher on slow SUBs.
  • Move NV flush before transform generation in _periodic_loop and skip transform generation when control backlog is high via _control_backlog_exceeded.
  • Reduce lock hold time by snapshotting room clients under self._rooms_lock, extracting per-client transform body bytes at receive time (saved in self.client_transform_body_cache), and serializing room transforms from these snapshots using bytearray + struct.pack_into in _serialize_room_transform outside the rooms lock.
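The drain order described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual server.py code: the queue name, the `send` callback, and the dict used for coalesced transforms are assumptions; `send` returning False models the would-block condition (`zmq.Again` when sending with `zmq.DONTWAIT`).

```python
import queue

CTRL_DRAIN_BATCH = 256  # bound on control messages drained per publisher pass

def drain_once(ctrl_q: queue.Queue, coalesced: dict, send) -> int:
    """One publisher pass: control traffic first, then latest-only transforms.

    `send(topic, payload)` returns False to signal a would-block condition
    (the equivalent of zmq.Again on a non-blocking send).
    """
    drained = 0
    # Phase 1: drain control messages (RPC/NV) first, but bounded so
    # transforms are never starved indefinitely.
    for _ in range(CTRL_DRAIN_BATCH):
        try:
            topic, payload = ctrl_q.get_nowait()
        except queue.Empty:
            break
        send(topic, payload)
        drained += 1
    # Phase 2: send coalesced transforms; stop at the first would-block
    # so a slow subscriber cannot stall the publisher loop.
    for topic in list(coalesced):
        if send(topic, coalesced[topic]):
            del coalesced[topic]
        else:
            break
    return drained
```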

Testing

  • No automated tests were executed as part of this change (no pytest run).

Codex Task

@from2001 (Collaborator, Author)

@copilot Review this PR.


Copilot AI commented Jan 12, 2026

@from2001 I've opened a new pull request, #283, to work on those changes. Once the pull request is ready, I'll request review from you.

@from2001 from2001 marked this pull request as ready for review January 12, 2026 11:31
@from2001 (Collaborator, Author)

@claude Review the latest code.


claude bot commented Jan 12, 2026

Claude finished @from2001's task in 1m 28s


Code Review Complete ✓

Todo List

  • Read the modified server.py file
  • Compare changes against origin/develop
  • Review implementation details
  • Provide comprehensive feedback

Review Summary

This PR introduces significant improvements to the server's traffic handling architecture by prioritizing control messages (RPC/NV) over transforms and adding rate limiting to prevent control-plane starvation. The implementation is generally solid, but I've identified several areas that need attention.

Critical Issues

1. Potential Deadlock in _serialize_room_transform (High Priority)

Location: server.py:1545-1575

The _serialize_room_transform method is called from _broadcast_room (line 1538), which is invoked from _adaptive_broadcast_all_rooms (line 1525) outside the _rooms_lock. However, it calls binary_serializer._serialize_client_data_short(buffer, transform_data) which may access shared state without proper locking.

Concern: If _serialize_client_data_short or _pack_string access any shared server state, this could cause race conditions since the snapshot is taken under lock but serialization happens outside.

Recommendation:

  • Verify that binary_serializer._serialize_client_data_short and binary_serializer._pack_string are pure functions that don't access server state
  • Add documentation comments confirming the lock-free nature of serialization
  • Consider adding assertions in debug mode to detect any unexpected state access

2. Token Bucket Refund Logic May Under-Refund (Medium Priority)

Location: server.py:438-447

When send_multipart fails with zmq.Again, the code refunds the full cost. However, the cost is calculated as len(message_bytes) * sub_count (line 421). If the SUB count changes between the cost calculation and the refund, you might refund the wrong amount.

sub_count = max(1, self._get_sub_connection_count())  # Line 420
cost = len(message_bytes) * sub_count                  # Line 421
# ... later ...
self._transform_tokens += cost                         # Line 440 (refund)

Recommendation: Capture the amount actually deducted and refund exactly that value, rather than recalculating it:

deducted_cost = cost  # Save the actual deducted amount
# ... in error handlers ...
self._transform_tokens += deducted_cost
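For illustration, a minimal token bucket with an exact-refund path might look like this. The class and attribute names are hypothetical, not the server's actual fields:

```python
import time

class TransformBudget:
    """Minimal token bucket (bytes/sec) with an exact-refund path.

    Hypothetical sketch: not the server's implementation.
    """

    def __init__(self, bytes_per_sec: float) -> None:
        self.rate = bytes_per_sec
        self.tokens = bytes_per_sec
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_spend(self, cost: float) -> float:
        """Deduct `cost` if available; return the amount actually deducted."""
        self._refill()
        if self.tokens < cost:
            return 0.0
        self.tokens -= cost
        return cost

    def refund(self, deducted: float) -> None:
        """Return exactly what was deducted (e.g. after zmq.Again)."""
        self.tokens = min(self.rate, self.tokens + deducted)
```

The caller computes `cost = len(message_bytes) * sub_count` once, keeps the value `try_spend` returned, and passes that same value to `refund` on a failed send, so a changed `sub_count` can never skew the refund.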

3. Client Transform Body Cache Leak Risk (Medium Priority)

Location: server.py:978, 991, 1598

The client_transform_body_cache is cleaned up when clients timeout (line 1598), but if a client_no is reused (via _find_reusable_client_no), the old cached body might still be present. This could cause stale transform data to be broadcast.

Recommendation: Clear the cache entry when reusing a client_no in _find_reusable_client_no:

def _find_reusable_client_no(self, room_id: str) -> int:
    # ... existing code ...
    if found_reusable_client_no != -1:
        # Clear stale cache entry
        if found_reusable_client_no in self.client_transform_body_cache:
            del self.client_transform_body_cache[found_reusable_client_no]
    return found_reusable_client_no

Design Issues

4. Backlog Check After NV Flush Creates Priority Inversion (Medium Priority)

Location: server.py:1419-1420

The NV flush happens before the control backlog check:

# Flush Network Variables before other categories
# ... NV flush code ...
if not self._control_backlog_exceeded():
    self._adaptive_broadcast_all_rooms(current_time)

Problem: If the NV flush itself enqueues many control messages via _enqueue_pub, it could push the backlog over CTRL_BACKLOG_WATERMARK, but transforms would still be skipped even though the backlog was caused by the flush, not incoming traffic.

Recommendation: Check backlog before NV flush as well, or use separate watermarks for "skip transforms" vs "skip NV flush".
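As a sketch, the separate-watermark variant could look like this. The watermark constants and the callback shape are assumptions for illustration, not proposals for exact values:

```python
# Hypothetical watermarks: a lower one gates transform generation,
# a higher one gates even the NV flush.
CTRL_WM_SKIP_TRANSFORMS = 512
CTRL_WM_SKIP_NV_FLUSH = 2048

def periodic_step(backlog_size, flush_nv, broadcast_transforms) -> None:
    """One periodic pass using two watermarks instead of one.

    `backlog_size()` reports the current control-queue depth; it is
    re-read after the NV flush so messages the flush itself enqueued
    count toward the transform decision.
    """
    if backlog_size() < CTRL_WM_SKIP_NV_FLUSH:
        flush_nv()
    if backlog_size() < CTRL_WM_SKIP_TRANSFORMS:
        broadcast_transforms()
```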

5. Missing Coalesce Buffer Size Limit (Low-Medium Priority)

Location: server.py:209-211

The _coalesce_latest dictionary has no size limit. With many active rooms, this could grow unbounded if transforms are generated faster than the publisher can drain them.

Recommendation: Add a watermark check in _enqueue_pub_latest:

def _enqueue_pub_latest(self, topic_bytes: bytes, message_bytes: bytes):
    with self._coalesce_lock:
        if len(self._coalesce_latest) > 1000:  # Or make this a constant
            logger.warning("Coalesce buffer overflow, dropping oldest")
            # Drop oldest entry (FIFO)
            self._coalesce_latest.pop(next(iter(self._coalesce_latest)))
        self._coalesce_latest[topic_bytes] = message_bytes

Code Quality Issues

6. Inconsistent Lock Acquisition Patterns (Low Priority)

Location: Multiple places

Some methods acquire _rooms_lock at the top level (e.g., _get_or_assign_client_no:549), while others acquire it mid-function (e.g., _adaptive_broadcast_all_rooms:1483). This makes it harder to reason about lock ordering.

Recommendation: Establish and document a consistent pattern. Consider using context managers with named scopes:

with self._rooms_lock:  # LOCK: snapshot room state
    # ... snapshot code ...
# UNLOCK: now serialize outside lock

7. Magic Number for Batch Size (Low Priority)

Location: server.py:158

CTRL_DRAIN_BATCH = 256 - This value appears arbitrary. Add a comment explaining the rationale (e.g., "Balance between latency and throughput based on benchmarks").

8. Potential Type Inconsistency in _adaptive_broadcast_all_rooms (Low Priority)

Location: server.py:1489

The line:

last_broadcast = self.room_last_broadcast.get(room_id, 0.0)

Uses 0.0 as the default, but previously used 0 (line 1413). While functionally equivalent, using 0.0 consistently improves type clarity.

Status: Actually, this is now consistent with line 1383. Good fix!

Performance Observations

9. Good: Lock Hold Time Reduction

The snapshot-based approach in _adaptive_broadcast_all_rooms (lines 1508-1518) successfully reduces lock hold time by extracting client data under lock, then releasing before serialization. This is excellent for concurrency.

10. Good: Control Traffic Prioritization

The two-phase drain in _publisher_loop (lines 477-508) properly prioritizes control messages. The CTRL_DRAIN_BATCH limit prevents RPC/NV from completely starving transforms.

11. Good: Token Bucket Rate Limiting

The token bucket implementation (lines 398-447) correctly implements refill logic and considers fanout cost (sub_count). This should effectively prevent bandwidth exhaustion.

Testing Gaps

12. Missing Test Coverage (High Priority)

The PR description states: "No automated tests were executed as part of this change (no pytest run)."

Critical tests needed:

  1. Control message priority under high transform load
  2. Token bucket rate limiting behavior
  3. Transform coalescing correctness (latest-only semantics)
  4. Backlog watermark triggering
  5. Cache cleanup during client_no reuse

Recommendation: Before merging, run:

pytest tests/
pytest --cov=src/styly_netsync
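A first test along those lines, here for the latest-only coalescing semantics (test 3), might be sketched like this, using a plain dict as a stand-in for the server's coalesce map:

```python
def test_latest_only_coalescing():
    # Stand-in for the server's coalesce map; not the real _coalesce_latest.
    coalesced = {}

    def enqueue_pub_latest(topic: bytes, payload: bytes) -> None:
        coalesced[topic] = payload  # a newer payload replaces the older one

    enqueue_pub_latest(b"room/alpha", b"transform-v1")
    enqueue_pub_latest(b"room/alpha", b"transform-v2")
    enqueue_pub_latest(b"room/beta", b"transform-v1")

    # Only the latest payload per topic survives; distinct topics coexist.
    assert coalesced[b"room/alpha"] == b"transform-v2"
    assert len(coalesced) == 2
```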

Minor Issues

  1. Removed import: import logging was removed (line 18 in diff) but you added import struct (line 27). Good - struct is actually used.

  2. Renamed cache: client_binary_cache → client_transform_body_cache improves clarity about what's cached (the body without deviceId).

Security Considerations

  1. No security issues identified - The changes don't introduce new attack surfaces. Rate limiting actually improves DoS resilience.

Recommendations Before Merge

  1. Critical: Fix the token bucket refund logic (item #2) to use the actual deducted cost
  2. Critical: Add cache cleanup when reusing client_no (item #3)
  3. Important: Run pytest to verify no regressions (item #12)
  4. Important: Add unit tests for the new control priority logic (item #12)
  5. Consider: Add a coalesce buffer size limit (item #5)
  6. Consider: Adjust backlog check placement relative to the NV flush (item #4)

Positive Aspects

  • Excellent architecture: The separation of control and transform traffic is well-designed
  • Good concurrency patterns: Lock-free serialization from snapshots is a solid approach
  • Clear constants: The new constants (CTRL_DRAIN_BATCH, etc.) make tuning explicit
  • Backward compatible: Changes maintain protocol compatibility

Overall Assessment

This PR makes meaningful improvements to server scalability and responsiveness. The core design is sound, but testing is required before merge, and the token bucket refund logic needs a fix. With those addressed, this will be a solid enhancement.

Recommendation: Request changes for items #2, #3, and #12, then approve after fixes.


@from2001 from2001 merged commit e334c28 into develop Jan 12, 2026
@from2001 from2001 deleted the codex/implement-server.py-improvements-for-transform-i2297p branch January 12, 2026 12:07