NetSync Low-Bandwidth Hardening — Implementation Instructions
1. Objective
Improve STYLY NetSync so that connecting many clients on a constrained network does not cause:
- frequent JOIN/REMOVE churn (false timeouts)
- multi-minute delays for Network Variables
- RPC loss
The system must remain stable and responsive when the available bandwidth is below the aggregate message production rate.
2. High-Level Design
2.1 Separate QoS by channel (downlink)
Split the current downlink broadcast stream into two distinct PUB/SUB channels:
- Transform Downlink (Lossy, Latest-Only)
  - carries room pose/transform snapshots
  - configured to drop old messages under pressure
- State Downlink (Lossy, Latest-Only per key)
  - carries Network Variable sync snapshots/deltas and device ID mapping (and other “state” messages)
  - also configured to drop old snapshots under pressure, because the next snapshot contains the latest state
2.2 Reliable RPC over ROUTER/DEALER (not PUB/SUB)
RPC must not use PUB/SUB. Implement RPC delivery as:
- client → server: ROUTER/DEALER request
- server → client: ROUTER/DEALER delivery
- client → server: ROUTER/DEALER ACK
- server retries until ACK or max attempts reached
2.3 Decouple liveness from transforms (fix JOIN/REMOVE storms)
Currently, server timeouts are driven by client_data["last_update"], which is updated primarily by receiving transforms. Under bandwidth pressure, transform delivery can stall and clients are removed despite being alive.
Add a heartbeat message to keep last_update fresh independently of transform flow.
2.4 No backward compatibility is needed.
3. Protocol Changes (Binary Message Types)
Update both server and Unity client serializers.
Add message IDs (example)
MSG_HEARTBEAT = 13
MSG_RPC_DELIVERY = 14
MSG_RPC_ACK = 15
(IDs must be consistent across Python and C#.)
Heartbeat payload
Required fields:
- deviceId (string)
- clientNo (optional if known; the server can resolve it)
- timestamp (monotonic time)
RPC delivery payload
Required fields:
- rpcId (uint64 or GUID string; must be unique per RPC)
- senderClientNo
- functionName
- args (array)
- orderingKey (optional, if you plan per-sender ordering)
RPC ACK payload
Required fields:
- rpcId
- receiverClientNo or deviceId
- timestamp
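As a sketch of how a new payload might be laid out on the wire: the message ID below comes from this spec, but the field order, the length-prefix helper, and the fixed-width field sizes are assumptions that must be mirrored exactly in the C# serializer.

```python
import struct
import time

MSG_HEARTBEAT = 13  # ID from section 3; must match the C# side


def _pack_str(s: str) -> bytes:
    """Length-prefixed UTF-8 string (2-byte little-endian length) — assumed convention."""
    raw = s.encode("utf-8")
    return struct.pack("<H", len(raw)) + raw


def serialize_heartbeat(device_id: str, client_no: int = 0) -> bytes:
    # Assumed layout: [msgType:1][deviceId:len-prefixed][clientNo:2][timestamp:8]
    return (
        struct.pack("<B", MSG_HEARTBEAT)
        + _pack_str(device_id)
        + struct.pack("<H", client_no)
        + struct.pack("<d", time.monotonic())
    )


def deserialize_heartbeat(payload: bytes) -> dict:
    assert payload[0] == MSG_HEARTBEAT
    (str_len,) = struct.unpack_from("<H", payload, 1)
    device_id = payload[3:3 + str_len].decode("utf-8")
    off = 3 + str_len
    (client_no,) = struct.unpack_from("<H", payload, off)
    (ts,) = struct.unpack_from("<d", payload, off + 2)
    return {"deviceId": device_id, "clientNo": client_no, "timestamp": ts}
```

The RPC delivery and ACK payloads would follow the same pattern with the fields listed above.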
4. Server-Side Implementation (Python)
4.1 Configuration updates
File: STYLY-NetSync-Server/src/styly_netsync/config.py
Add config fields:
- transform_pub_port (new)
- state_pub_port (new)
- heartbeat_timeout (can reuse client_timeout, but separate semantics are recommended)
- state_broadcast_rate_hz (e.g., 5–20 Hz depending on load)
- heartbeat_expected_interval (e.g., 1.0 s; used only for diagnostics)
- RPC retry parameters:
  - rpc_retry_initial_ms (e.g., 50–100 ms)
  - rpc_retry_max_ms (e.g., 1000 ms)
  - rpc_retry_max_attempts (e.g., 30)
  - rpc_outbox_max_per_client (backpressure)
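A possible shape for these fields, assuming config.py uses (or can use) a dataclass; the field names come from the list above, the defaults are the example values, and the port numbers are placeholders:

```python
from dataclasses import dataclass


@dataclass
class NetSyncConfig:
    # ... existing fields elided; new fields from section 4.1 ...
    transform_pub_port: int = 5556            # placeholder port
    state_pub_port: int = 5557                # placeholder port
    heartbeat_timeout: float = 5.0            # seconds; separate from client_timeout
    state_broadcast_rate_hz: float = 10.0     # 5–20 Hz depending on load
    heartbeat_expected_interval: float = 1.0  # diagnostics only
    # RPC retry parameters
    rpc_retry_initial_ms: int = 100
    rpc_retry_max_ms: int = 1000
    rpc_retry_max_attempts: int = 30
    rpc_outbox_max_per_client: int = 256      # backpressure cap (assumed value)
```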
4.2 Create two PUB sockets
File: STYLY-NetSync-Server/src/styly_netsync/server.py
Replace the single self.pub with:
- self.pub_transform
- self.pub_state

Each should have independent HWM tuning:
- Transform: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow
- State: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow
Implementation rule:
- Never block the server main loop on PUB sends.
4.3 Route messages to correct PUB
Server message routing:
- Transform snapshots (MSG_ROOM_POSE_V2 / room transform) → pub_transform
- NV sync (MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC) and device ID mapping (MSG_DEVICE_ID_MAPPING) → pub_state
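The two-PUB setup (4.2) and the routing rule (4.3) can be sketched with pyzmq; the MSG_* numeric values and endpoint strings below are assumptions (the real values live in the existing serializer and config), and only the split itself comes from this spec:

```python
import zmq

# Assumed message-type constants; must match the existing binary_serializer values.
MSG_ROOM_POSE_V2 = 2
MSG_GLOBAL_VAR_SYNC = 10
MSG_CLIENT_VAR_SYNC = 11
MSG_DEVICE_ID_MAPPING = 12

STATE_TYPES = {MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC, MSG_DEVICE_ID_MAPPING}


def make_pub(ctx: zmq.Context, endpoint: str) -> zmq.Socket:
    sock = ctx.socket(zmq.PUB)
    sock.setsockopt(zmq.SNDHWM, 5)  # keep only a few queued messages per subscriber
    sock.bind(endpoint)
    return sock


def publish(sock: zmq.Socket, topic: bytes, payload: bytes) -> bool:
    """Non-blocking publish: drop on overflow instead of stalling the main loop."""
    try:
        sock.send_multipart([topic, payload], flags=zmq.DONTWAIT)
        return True
    except zmq.Again:
        return False  # queue full; the next snapshot carries newer data anyway


def route(msg_type: int, pub_transform, pub_state, topic: bytes, payload: bytes) -> bool:
    """Pick the downlink socket by message type (section 4.3)."""
    target = pub_state if msg_type in STATE_TYPES else pub_transform
    return publish(target, topic, payload)
```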
4.4 Heartbeat handling and liveness update
Add heartbeat deserialization
File: STYLY-NetSync-Server/src/styly_netsync/binary_serializer.py
- Implement serialize/deserialize for MSG_HEARTBEAT.
Update receive loop to refresh last_update
File: server.py, in the ROUTER receive block (where you handle msg_type):
- On every message received (transform, RPC request, NV set, heartbeat):
  - Resolve device_id:
    - if the message contains deviceId, use it
    - otherwise, resolve it from client_identity via the current mapping
  - Ensure the client exists in self.rooms[room_id]
  - Update:
    - self.rooms[room_id][device_id]["last_update"] = time.monotonic()
    - self.rooms[room_id][device_id]["identity"] = client_identity (refresh identity to be safe)
-
This directly prevents false timeouts due to congested transforms.
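A minimal sketch of the liveness refresh described above; the rooms dict layout follows section 4.4, and the setdefault-based creation is an assumption standing in for the server's real "ensure client exists" path:

```python
import time


def touch_liveness(rooms: dict, room_id: str, device_id: str, client_identity: bytes) -> None:
    """Refresh last_update for ANY inbound message, not just transforms.

    Called from the ROUTER receive loop for transforms, RPC requests,
    NV sets, and heartbeats alike, so congested transform delivery can
    no longer cause false timeouts.
    """
    room = rooms.setdefault(room_id, {})
    client = room.setdefault(device_id, {})
    client["last_update"] = time.monotonic()
    client["identity"] = client_identity  # refresh identity to be safe
```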
4.5 Network Variables: keep “latest-only per variable key”
Your design assumption is correct.
Important note from the current server implementation:
- _buffer_global_var_set() already does “latest-wins per variableName”
- _buffer_client_var_set() already does “latest-wins per (targetClientNo, variableName)”
Keep this behavior.
Delivery change: publish NV as state snapshots/deltas on state PUB
Goal: prevent multi-minute backlog.
- Publish periodic NV sync frames containing the current state (or a compact delta) for a room.
- Send with DONTWAIT. If the send fails, do nothing; the next tick sends the latest state again.
- Do not queue “every set” on the wire; send only the newest consolidated state.
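The coalescing rule above can be sketched as a small buffer that the state-broadcast tick drains; the class name and drain API are illustrative, not taken from the codebase:

```python
class LatestWinsBuffer:
    """Coalesce NV sets so each tick publishes only the newest value per key.

    Keys mirror the existing behavior: variableName for global vars,
    (targetClientNo, variableName) for client vars.
    """

    def __init__(self):
        self._pending = {}

    def set(self, key, value) -> None:
        self._pending[key] = value  # later sets overwrite earlier ones

    def drain(self) -> dict:
        """Take the consolidated state for this tick (runs at state_broadcast_rate_hz)."""
        snapshot, self._pending = self._pending, {}
        return snapshot
```

If the DONTWAIT publish of a drained snapshot fails, nothing is re-queued; the next tick drains whatever is newest at that point.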
4.6 Reliable RPC (ROUTER/DEALER with ACK + retry)
Modify RPC flow
Current server uses _enqueue_pub() for RPC broadcast; replace it.
File: server.py
- Replace _send_rpc_to_room():
  - For each target client in the room (excluding the sender):
    - create an rpcId
    - enqueue the delivery into an RPC outbox keyed by target device/client
    - immediately attempt a send via self.router.send_multipart([identity, room_id_bytes, payload])
- Do not use PUB.

Add outbox + retry scheduler
Maintain:
- self.rpc_outbox[(room_id, target_device_id)] = {rpcId: OutboxEntry(...)}
- Each OutboxEntry tracks:
  - payload bytes
  - next_send_time
  - attempts

In the main loop, periodically:
- resend due messages
- drop messages that exceed max attempts (log an error + update metrics)
Add ACK processing
When receiving MSG_RPC_ACK from a client:
- remove the corresponding rpcId from the outbox for that client
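The outbox, retry scheduler, and ACK handling can be sketched as follows. The OutboxEntry fields come from the spec; the injected send_fn (standing in for router.send_multipart), the counter-based rpcId (the spec allows uint64 or GUID), and the exponential backoff between rpc_retry_initial_ms and rpc_retry_max_ms are assumptions:

```python
import itertools
from dataclasses import dataclass


@dataclass
class OutboxEntry:
    payload: bytes
    next_send_time: float
    attempts: int = 0
    backoff_ms: float = 100.0  # starts at rpc_retry_initial_ms


class RpcOutbox:
    """Per-target RPC outbox: retry with exponential backoff until ACK or give-up."""

    def __init__(self, send_fn, retry_max_ms: float = 1000.0, max_attempts: int = 30):
        self.send_fn = send_fn          # e.g. wraps self.router.send_multipart(...)
        self.retry_max_ms = retry_max_ms
        self.max_attempts = max_attempts
        self.entries = {}               # rpc_id -> OutboxEntry
        self._ids = itertools.count(1)

    def enqueue(self, payload: bytes, now: float) -> int:
        rpc_id = next(self._ids)
        self.entries[rpc_id] = OutboxEntry(payload, next_send_time=now)
        return rpc_id

    def ack(self, rpc_id: int) -> None:
        self.entries.pop(rpc_id, None)  # MSG_RPC_ACK received: stop retrying

    def tick(self, now: float) -> None:
        """Called periodically from the main loop."""
        for rpc_id, e in list(self.entries.items()):
            if now < e.next_send_time:
                continue
            if e.attempts >= self.max_attempts:
                del self.entries[rpc_id]  # give up; log error + metrics in real code
                continue
            self.send_fn(rpc_id, e.payload)
            e.attempts += 1
            e.next_send_time = now + e.backoff_ms / 1000.0
            e.backoff_ms = min(e.backoff_ms * 2, self.retry_max_ms)
```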
5. Unity Client Implementation (C# / NetMQ)
5.1 ConnectionManager: support two SUB sockets and receiving from DEALER
File: STYLY-NetSync-Unity/Packages/com.styly.styly-netsync/Runtime/Internal Scripts/ConnectionManager.cs
Add sockets
- _transformSubSocket (SubscriberSocket)
- _stateSubSocket (SubscriberSocket)
- Keep _dealerSocket (DealerSocket)
Connect to three endpoints
- Dealer connects to dealer_port (existing)
- TransformSub connects to transform_pub_port
- StateSub connects to state_pub_port
- Subscribe to the room topic on both SUB sockets
Network loop must receive from:
- transform SUB
- state SUB
- dealer socket (for RPC deliveries and any future reliable server messages)
Implementation rule:
- Use a poller or a manual loop that checks all sockets without blocking the thread indefinitely.
- Ensure MessageProcessor.ProcessIncomingMessage(payload) is called for all incoming binary payloads.
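The Unity side would use NetMQ's poller for this; since the server codebase is Python, here is the equivalent polling pattern in pyzmq, with process standing in for the message handler:

```python
import zmq


def poll_once(transform_sub, state_sub, dealer, process, timeout_ms: int = 10) -> None:
    """Check all three sockets without blocking the thread indefinitely (cf. 5.1)."""
    poller = zmq.Poller()
    for sock in (transform_sub, state_sub, dealer):
        poller.register(sock, zmq.POLLIN)
    for sock, _event in poller.poll(timeout_ms):
        frames = sock.recv_multipart(flags=zmq.DONTWAIT)
        process(frames[-1])  # payload is the last frame; a topic frame precedes it on SUBs
```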
5.2 BinarySerializer: add new message types
File: .../Runtime/Internal Scripts/BinarySerializer.cs
- Add constants for the new message IDs
- Add serialize/deserialize methods for:
  - heartbeat
  - RPC delivery
  - RPC ACK
5.3 Heartbeat sender
File candidates:
TransformSyncManager.cs, or a new lightweight HeartbeatManager.cs
Specification:
- Send a heartbeat via DEALER every ~0.5–1.0 s
- Include deviceId and timestamp
- Heartbeats must be small and should never block; if a send fails, skip it.
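The actual sender is C#/NetMQ, but the pacing rule is language-neutral; a sketch in Python (the class name and injected send_fn are illustrative):

```python
class HeartbeatSender:
    """Fire-and-forget heartbeat pacing (cf. 5.3): never blocks, never retries."""

    def __init__(self, send_fn, interval: float = 1.0):
        self.send_fn = send_fn      # e.g. a non-blocking DEALER send; may fail silently
        self.interval = interval    # ~0.5–1.0 s per the spec
        self._next = 0.0

    def tick(self, now: float) -> None:
        if now < self._next:
            return
        self._next = now + self.interval
        try:
            self.send_fn()          # serialized heartbeat carrying deviceId + timestamp
        except Exception:
            pass                    # if the send fails, skip; the next tick sends a fresh one
```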
5.4 Reliable RPC receiver + ACK
File: .../Runtime/Internal Scripts/RPCManager.cs and MessageProcessor.cs
Changes:
- When MSG_RPC_DELIVERY is received:
  - invoke the RPC handler locally
  - send MSG_RPC_ACK back via DEALER immediately
- Add deduplication:
  - maintain a fixed-size cache of recently seen rpcId values
  - if a duplicate delivery arrives, ACK again but do not execute the RPC twice
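The dedup cache lives in the C# client, but the logic is language-neutral; a sketch in Python (class name, capacity, and the LRU-style eviction are assumptions):

```python
from collections import OrderedDict


class RpcDedupCache:
    """Fixed-size cache of recently seen rpcIds (cf. 5.4).

    Duplicate deliveries (caused by server retries) are ACKed again
    but must not execute the handler twice.
    """

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, rpc_id) -> bool:
        if rpc_id in self._seen:
            self._seen.move_to_end(rpc_id)  # keep hot entries from being evicted
            return False
        self._seen[rpc_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict the oldest entry
        return True
```

On MSG_RPC_DELIVERY the client would always send MSG_RPC_ACK, and invoke the handler only when first_time(rpcId) returns True.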
6. Acceptance Criteria
6.1 Stability
- With 100+ clients on a constrained network, the server should not produce frequent “removed (timeout)” logs for healthy clients.
- JOIN/REMOVE churn must be eliminated under steady-state operation.
6.2 Transform behavior under congestion
- Old transforms are dropped; clients converge to the newest pose.
- No uncontrolled latency growth.
6.3 Network Variables behavior
- For each variable key:
  - clients converge to the latest value quickly
  - no multi-minute delays
- Under severe congestion, temporary staleness is acceptable, but eventual convergence must occur.
6.4 RPC reliability
- No RPC loss under induced congestion.
- Duplicates may occur (due to retry) but must not cause double execution.
7. Test Plan
- Low-bandwidth simulation
  - artificially restrict bandwidth and introduce jitter/packet loss
- Soak test
  - 100–300 clients for 30–60 minutes
- RPC correctness test
  - generate RPCs at a controlled rate
  - verify all clients receive all intended RPCs (and exactly-once execution via dedup)
- NV convergence test
  - update the same variable key frequently; confirm clients converge to the latest value (not all intermediate values)
- Timeout robustness
  - confirm heartbeat prevents false removals even when transform sends are heavily throttled
8. Implementation Order (Recommended)
1. Heartbeat + liveness update (fastest win; stops JOIN/REMOVE storms)
2. Split Transform PUB vs State PUB (prevents one stream from starving the other)
3. NV sync as snapshot/delta over State PUB (eliminates backlog-driven delays)
4. Reliable RPC over ROUTER/DEALER with ACK + retry (meets the “must deliver” requirement)