
Make NetSync robust under low-bandwidth + high-client-count conditions #302

@from2001

Description


NetSync Low-Bandwidth Hardening — Implementation Instructions

1. Objective

Improve STYLY NetSync so that connecting many clients on a constrained network does not cause:

  • frequent JOIN/REMOVE churn (false timeouts)
  • multi-minute delays for Network Variables
  • RPC loss

The system must remain stable and responsive when the available bandwidth is below the aggregate message production rate.

2. High-Level Design

2.1 Separate QoS by channel (downlink)

Split the current downlink broadcast stream into two distinct PUB/SUB channels:

  1. Transform Downlink (Lossy, Latest-Only)

    • carries room pose/transform snapshots
    • configured to drop old messages under pressure
  2. State Downlink (Lossy, Latest-Only per key)

    • carries Network Variable sync snapshots/deltas and device ID mapping (and other “state” messages)
    • also configured to drop old snapshots under pressure, because the next snapshot contains the latest state

2.2 Reliable RPC over ROUTER/DEALER (not PUB/SUB)

RPC must not use PUB/SUB. Implement RPC delivery as:

  • client → server: ROUTER/DEALER request
  • server → client: ROUTER/DEALER delivery
  • client → server: ROUTER/DEALER ACK
  • server retries until ACK or max attempts reached

2.3 Decouple liveness from transforms (fix JOIN/REMOVE storms)

Currently, server timeouts are driven by client_data["last_update"], which is updated primarily by receiving transforms. Under bandwidth pressure, transform delivery can stall and clients are removed despite being alive.

Add a heartbeat message to keep last_update fresh independently of transform flow.

2.4 No backward compatibility is needed.

3. Protocol Changes (Binary Message Types)

Update both server and Unity client serializers.

Add message IDs (example)

  • MSG_HEARTBEAT = 13
  • MSG_RPC_DELIVERY = 14
  • MSG_RPC_ACK = 15

(IDs must be consistent across Python and C#.)

Heartbeat payload

Required fields:

  • deviceId (string)
  • clientNo (optional; the server can resolve it from deviceId)
  • timestamp (monotonic time)
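
As a sketch, the heartbeat could be packed with Python's struct module. The byte layout below (byte order, header sizes) is an assumption; only the fields themselves (deviceId, clientNo, timestamp) come from the list above:

```python
import struct
import time

MSG_HEARTBEAT = 13  # must match the C# constant

def serialize_heartbeat(device_id: str, client_no: int = 0) -> bytes:
    """Pack [msg_type:u8][client_no:u16][ts:f64][id_len:u16][device_id:utf8].
    The layout is illustrative, not the real wire format."""
    id_bytes = device_id.encode("utf-8")
    header = struct.pack("<BHdH", MSG_HEARTBEAT, client_no,
                         time.monotonic(), len(id_bytes))
    return header + id_bytes

def deserialize_heartbeat(payload: bytes) -> dict:
    msg_type, client_no, ts, id_len = struct.unpack_from("<BHdH", payload, 0)
    assert msg_type == MSG_HEARTBEAT
    device_id = payload[13:13 + id_len].decode("utf-8")  # header is 13 bytes
    return {"deviceId": device_id, "clientNo": client_no, "timestamp": ts}
```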

RPC delivery payload

Required fields:

  • rpcId (uint64 or GUID string; must be unique per RPC)
  • senderClientNo
  • functionName
  • args (array)
  • (optional) orderingKey if you plan per-sender ordering

RPC ACK payload

Required fields:

  • rpcId
  • receiverClientNo or deviceId
  • timestamp

4. Server-Side Implementation (Python)

4.1 Configuration updates

File: STYLY-NetSync-Server/src/styly_netsync/config.py

Add config fields:

  • transform_pub_port (new)

  • state_pub_port (new)

  • heartbeat_timeout (client_timeout could be reused, but separate semantics are recommended)

  • state_broadcast_rate_hz (e.g., 5–20 Hz depending on load)

  • heartbeat_expected_interval (e.g., 1.0s; used only for diagnostics)

  • RPC retry parameters:

    • rpc_retry_initial_ms (e.g., 50–100ms)
    • rpc_retry_max_ms (e.g., 1000ms)
    • rpc_retry_max_attempts (e.g., 30)
    • rpc_outbox_max_per_client (backpressure)
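
Collected as a dataclass, the new fields might look like the following sketch (all default values are illustrative, not prescriptive):

```python
from dataclasses import dataclass

@dataclass
class NetSyncConfig:
    # New downlink ports (port numbers are placeholders)
    transform_pub_port: int = 5556
    state_pub_port: int = 5557
    # Liveness
    heartbeat_timeout: float = 5.0            # seconds without any inbound message
    heartbeat_expected_interval: float = 1.0  # diagnostics only
    # State downlink
    state_broadcast_rate_hz: float = 10.0
    # RPC retry / backpressure
    rpc_retry_initial_ms: int = 100
    rpc_retry_max_ms: int = 1000
    rpc_retry_max_attempts: int = 30
    rpc_outbox_max_per_client: int = 256
```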

4.2 Create two PUB sockets

File: STYLY-NetSync-Server/src/styly_netsync/server.py

Replace the single self.pub with:

  • self.pub_transform
  • self.pub_state

Each should have independent HWM tuning:

  • Transform: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow
  • State: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow

Implementation rule:

  • Never block the server main loop on PUB sends.
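
A minimal pyzmq sketch of the two sockets and the non-blocking send rule; addresses and HWM values are assumptions. Note that PUB sockets already drop frames for a peer whose HWM is reached, so DONTWAIT here is a defensive guarantee that the main loop can never block:

```python
import zmq

def create_downlink_pubs(ctx: zmq.Context, transform_addr: str, state_addr: str):
    """Create the two downlink PUB sockets with tight HWMs so old frames
    are dropped under pressure instead of queueing."""
    pub_transform = ctx.socket(zmq.PUB)
    pub_transform.setsockopt(zmq.SNDHWM, 5)  # keep at most a few pending frames
    pub_transform.setsockopt(zmq.LINGER, 0)
    pub_transform.bind(transform_addr)

    pub_state = ctx.socket(zmq.PUB)
    pub_state.setsockopt(zmq.SNDHWM, 5)
    pub_state.setsockopt(zmq.LINGER, 0)
    pub_state.bind(state_addr)
    return pub_transform, pub_state

def pub_send_nonblocking(sock: zmq.Socket, topic: bytes, payload: bytes) -> bool:
    """Send without ever blocking; report whether the frame was accepted.
    On overflow zmq raises zmq.Again and we simply drop the frame."""
    try:
        sock.send_multipart([topic, payload], flags=zmq.DONTWAIT)
        return True
    except zmq.Again:
        return False
```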

4.3 Route messages to correct PUB

Server message routing:

  • Transform snapshots (MSG_ROOM_POSE_V2 / room transform) → pub_transform
  • NV sync (MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC) and device ID mapping (MSG_DEVICE_ID_MAPPING) → pub_state
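
The routing rule can be sketched as a single dispatch function. The numeric constant values below are placeholders, not the real protocol IDs; only the type-to-socket mapping comes from the list above:

```python
# Placeholder IDs for existing message types (the real values live in the serializers)
MSG_ROOM_POSE_V2 = 2
MSG_GLOBAL_VAR_SYNC = 10
MSG_CLIENT_VAR_SYNC = 11
MSG_DEVICE_ID_MAPPING = 12

def downlink_socket_for(msg_type, pub_transform, pub_state):
    """Pick the PUB socket for a downlink message, per the routing rule."""
    if msg_type == MSG_ROOM_POSE_V2:
        return pub_transform
    if msg_type in (MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC, MSG_DEVICE_ID_MAPPING):
        return pub_state
    return None  # not a broadcast message (e.g., RPC goes over ROUTER)
```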

4.4 Heartbeat handling and liveness update

Add heartbeat deserialization

File: STYLY-NetSync-Server/src/styly_netsync/binary_serializer.py

  • Implement serialize/deserialize for MSG_HEARTBEAT.

Update receive loop to refresh last_update

File: server.py, in the ROUTER receive block (where you handle msg_type):

  • On every message received (transform, rpc request, nv set, heartbeat):

    • Resolve device_id:

      • If message contains deviceId, use it
      • Else resolve from client_identity via current mapping
    • Ensure client exists in self.rooms[room_id]

    • Update:

      • self.rooms[room_id][device_id]["last_update"] = time.monotonic()
      • self.rooms[room_id][device_id]["identity"] = client_identity (refresh identity to be safe)

This directly prevents false timeouts due to congested transforms.
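
The update rule above, sketched as helper functions; the shape of self.rooms is assumed from the description:

```python
import time

def touch_liveness(rooms: dict, room_id: str, device_id: str,
                   client_identity: bytes) -> None:
    """Refresh last_update for ANY inbound message (transform, RPC request,
    NV set, heartbeat), so liveness no longer depends on transform flow."""
    client = rooms.setdefault(room_id, {}).setdefault(device_id, {})
    client["last_update"] = time.monotonic()
    client["identity"] = client_identity  # refresh identity to be safe

def expired_clients(rooms: dict, room_id: str, timeout: float) -> list:
    """Clients whose last_update is older than the heartbeat timeout."""
    now = time.monotonic()
    return [dev for dev, c in rooms.get(room_id, {}).items()
            if now - c.get("last_update", 0.0) > timeout]
```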

4.5 Network Variables: keep “latest-only per variable key”

Your design assumption is correct.

Important note from the current server implementation:

  • _buffer_global_var_set() already does “latest-wins per variableName”
  • _buffer_client_var_set() already does “latest-wins per (targetClientNo, variableName)”

Keep this behavior.

Delivery change: publish NV as state snapshots/deltas on state PUB

Goal: prevent multi-minute backlog.

  • Publish periodic NV sync frames containing the current state (or compact delta) for a room.
  • Send with DONTWAIT. If it fails, do nothing; the next tick sends the latest again.
  • Do not queue “every set” on the wire; only send the newest consolidated state.
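
A sketch of the latest-wins buffering plus periodic drain. The method names mirror the existing _buffer_global_var_set() / _buffer_client_var_set() behavior, but the class itself is hypothetical:

```python
class NetVarBuffer:
    """Latest-wins buffer: each set overwrites the pending value for its key,
    and a periodic tick publishes only the newest consolidated state."""

    def __init__(self):
        self._global_pending = {}   # variableName -> latest value
        self._client_pending = {}   # (targetClientNo, variableName) -> latest value

    def buffer_global_var_set(self, name, value):
        self._global_pending[name] = value  # latest-wins per variableName

    def buffer_client_var_set(self, target_client_no, name, value):
        self._client_pending[(target_client_no, name)] = value

    def drain(self):
        """Called at state_broadcast_rate_hz; returns the snapshot to publish
        (via pub_state with DONTWAIT) and clears the pending buffers."""
        snapshot = {"global": dict(self._global_pending),
                    "client": dict(self._client_pending)}
        self._global_pending.clear()
        self._client_pending.clear()
        return snapshot
```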

4.6 Reliable RPC (ROUTER/DEALER with ACK + retry)

Modify RPC flow

Current server uses _enqueue_pub() for RPC broadcast; replace it.

File: server.py

  • Replace _send_rpc_to_room():

    • For each target client in the room (excluding sender):

      • Create rpcId
      • Enqueue into an RPC outbox keyed by target device/client
      • Immediately attempt send via self.router.send_multipart([identity, room_id_bytes, payload])
    • Do not use PUB.

Add outbox + retry scheduler

Maintain:

  • self.rpc_outbox[(room_id, target_device_id)] = {rpcId: OutboxEntry(...)}

  • Each OutboxEntry tracks:

    • payload bytes
    • next_send_time
    • attempts

In main loop, periodically:

  • resend due messages
  • drop messages that exceed max attempts (log error + metrics)

Add ACK processing

When receiving MSG_RPC_ACK from client:

  • remove corresponding rpcId from outbox for that client
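
The outbox, retry scheduler, and ACK handling can be sketched as follows. The backoff shape and helper names are assumptions; real code would send via self.router.send_multipart and emit logs/metrics when dropping:

```python
import time
from dataclasses import dataclass

@dataclass
class OutboxEntry:
    payload: bytes
    next_send_time: float
    attempts: int = 0

class RpcOutbox:
    """Per-target retry state for reliable RPC; the send callback stands in
    for the ROUTER socket."""

    def __init__(self, initial_ms=100, max_ms=1000, max_attempts=30):
        self.initial = initial_ms / 1000.0
        self.max = max_ms / 1000.0
        self.max_attempts = max_attempts
        self.entries = {}  # (room_id, target_device_id) -> {rpc_id: OutboxEntry}

    def enqueue(self, room_id, device_id, rpc_id, payload, now=None):
        now = time.monotonic() if now is None else now
        self.entries.setdefault((room_id, device_id), {})[rpc_id] = \
            OutboxEntry(payload, next_send_time=now)

    def ack(self, room_id, device_id, rpc_id):
        """On MSG_RPC_ACK: forget the entry for that client."""
        self.entries.get((room_id, device_id), {}).pop(rpc_id, None)

    def tick(self, send, now=None):
        """Resend due entries with exponential backoff; drop after max attempts."""
        now = time.monotonic() if now is None else now
        for key, pending in self.entries.items():
            for rpc_id in list(pending):
                e = pending[rpc_id]
                if now < e.next_send_time:
                    continue
                if e.attempts >= self.max_attempts:
                    del pending[rpc_id]   # give up; log error + metrics here
                    continue
                send(key, e.payload)      # attempt delivery via ROUTER
                e.attempts += 1
                e.next_send_time = now + min(self.initial * (2 ** e.attempts),
                                             self.max)
```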

5. Unity Client Implementation (C# / NetMQ)

5.1 ConnectionManager: support two SUB sockets and receiving from DEALER

File: STYLY-NetSync-Unity/Packages/com.styly.styly-netsync/Runtime/Internal Scripts/ConnectionManager.cs

Add sockets

  • _transformSubSocket (SubscriberSocket)
  • _stateSubSocket (SubscriberSocket)
  • Keep _dealerSocket (DealerSocket)

Connect to three endpoints

  • Dealer connects to dealer_port (existing)
  • TransformSub connects to transform_pub_port
  • StateSub connects to state_pub_port
  • Subscribe room topic on both sub sockets

Network loop must receive from:

  • transform SUB
  • state SUB
  • dealer socket (for RPC deliveries and any future reliable server messages)

Implementation rule:

  • Use a poller or a manual loop that checks all sockets without blocking the thread indefinitely.
  • Ensure MessageProcessor.ProcessIncomingMessage(payload) is called for all incoming binary payloads.

5.2 BinarySerializer: add new message types

File: .../Runtime/Internal Scripts/BinarySerializer.cs

  • Add constants for new message IDs

  • Add serialize/deserialize methods for:

    • heartbeat
    • rpc delivery
    • rpc ack

5.3 Heartbeat sender

File candidates:

  • TransformSyncManager.cs or a new lightweight HeartbeatManager.cs

Specification:

  • Send heartbeat via DEALER every ~0.5–1.0s
  • Include deviceId and timestamp
  • Heartbeat must be small and should never block; if send fails, skip.

5.4 Reliable RPC receiver + ACK

File: .../Runtime/Internal Scripts/RPCManager.cs and MessageProcessor.cs

Changes:

  • When MSG_RPC_DELIVERY received:

    • invoke the RPC handler locally
    • send MSG_RPC_ACK back via DEALER immediately
  • Add deduplication:

    • maintain a fixed-size cache of recently seen rpcId values
    • if duplicate delivery arrives, ACK again but do not execute twice
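
The dedup logic is language-agnostic; here is a sketch in Python (the Unity side would mirror it in C#), using an LRU-style fixed-size set of seen rpcIds:

```python
from collections import OrderedDict

class RpcDedupCache:
    """Fixed-size cache of recently seen rpcIds: a duplicate delivery is
    ACKed again but must not trigger a second local execution."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._seen = OrderedDict()

    def should_execute(self, rpc_id) -> bool:
        if rpc_id in self._seen:
            self._seen.move_to_end(rpc_id)
            return False                    # duplicate: ACK, but do not run twice
        self._seen[rpc_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict oldest entry
        return True
```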

6. Acceptance Criteria

6.1 Stability

  • With 100+ clients on a constrained network, the server should not produce frequent “removed (timeout)” logs for healthy clients.
  • JOIN/REMOVE churn must be eliminated under steady-state operation.

6.2 Transform behavior under congestion

  • Old transforms are dropped; clients converge to the newest pose.
  • No uncontrolled latency growth.

6.3 Network Variables behavior

  • For each variable key:

    • clients converge to the latest value quickly
    • no multi-minute delays
  • Under severe congestion, temporary staleness is acceptable, but “eventual convergence” must occur.

6.4 RPC reliability

  • No RPC loss under induced congestion.
  • Duplicates may occur (due to retry) but must not cause double execution.

7. Test Plan

  1. Low-bandwidth simulation

    • artificially restrict bandwidth and introduce jitter/packet loss
  2. Soak test

    • 100–300 clients for 30–60 minutes
  3. RPC correctness test

    • generate RPCs at a controlled rate
    • verify all clients receive all intended RPCs (and exactly-once execution via dedup)
  4. NV convergence test

    • update same variable key frequently; confirm clients converge to latest value (not all intermediate values)
  5. Timeout robustness

    • confirm heartbeat prevents false removals even when transform send is heavily throttled

8. Implementation Order (Recommended)

  1. Heartbeat + liveness update (fastest win; stops JOIN/REMOVE storms)
  2. Split Transform PUB vs State PUB (prevents one stream from starving the other)
  3. NV sync as snapshot/delta over State PUB (eliminates backlog-driven delays)
  4. Reliable RPC over ROUTER/DEALER with ACK+retry (meets “must deliver” requirement)

Labels: enhancement (New feature or request)