
Make NetSync robust under low-bandwidth + high-client-count conditions #302

@from2001

Description


NetSync Low-Bandwidth Hardening — Implementation Instructions

1. Objective

Improve STYLY NetSync so that connecting many clients on a constrained network does not cause:

  • frequent JOIN/REMOVE churn (false timeouts)
  • multi-minute delays for Network Variables
  • RPC loss

The system must remain stable and responsive when the available bandwidth is below the aggregate message production rate.

2. High-Level Design

2.1 Separate QoS by channel (downlink)

Split the current downlink broadcast stream into two distinct PUB/SUB channels:

  1. Transform Downlink (Lossy, Latest-Only)

    • carries room pose/transform snapshots
    • configured to drop old messages under pressure
  2. State Downlink (Lossy, Latest-Only per key)

    • carries Network Variable sync snapshots/deltas and device ID mapping (and other “state” messages)
    • also configured to drop old snapshots under pressure, because the next snapshot contains the latest state

2.2 Reliable RPC over ROUTER/DEALER (not PUB/SUB)

RPC must not use PUB/SUB. Implement RPC delivery as:

  • client → server: ROUTER/DEALER request
  • server → client: ROUTER/DEALER delivery
  • client → server: ROUTER/DEALER ACK
  • server retries until ACK or max attempts reached

2.3 Decouple liveness from transforms (fix JOIN/REMOVE storms)

Currently, server timeouts are driven by client_data["last_update"], which is updated primarily by receiving transforms. Under bandwidth pressure, transform delivery can stall and clients are removed despite being alive.

Add a heartbeat message to keep last_update fresh independently of transform flow.

2.4 No backward compatibility is needed.

3. Protocol Changes (Binary Message Types)

Update both server and Unity client serializers.

Add message IDs (example)

  • MSG_HEARTBEAT = 13
  • MSG_RPC_DELIVERY = 14
  • MSG_RPC_ACK = 15

(IDs must be consistent across Python and C#.)

Heartbeat payload

Required fields:

  • deviceId (string)
  • clientNo (optional; the server can resolve it from deviceId)
  • timestamp (monotonic time)
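
As a sketch, the heartbeat could be packed with Python's struct module. The byte layout below (byte order, header sizes) is an assumption; only the fields themselves (deviceId, clientNo, timestamp) come from the list above:

```python
import struct
import time

MSG_HEARTBEAT = 13  # must match the C# constant

def serialize_heartbeat(device_id: str, client_no: int = 0) -> bytes:
    """Pack [msg_type:u8][client_no:u16][ts:f64][id_len:u16][device_id:utf8].
    The layout is illustrative, not the real wire format."""
    id_bytes = device_id.encode("utf-8")
    header = struct.pack("<BHdH", MSG_HEARTBEAT, client_no,
                         time.monotonic(), len(id_bytes))
    return header + id_bytes

def deserialize_heartbeat(payload: bytes) -> dict:
    msg_type, client_no, ts, id_len = struct.unpack_from("<BHdH", payload, 0)
    assert msg_type == MSG_HEARTBEAT
    device_id = payload[13:13 + id_len].decode("utf-8")  # header is 13 bytes
    return {"deviceId": device_id, "clientNo": client_no, "timestamp": ts}
```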

RPC delivery payload

Required fields:

  • rpcId (uint64 or GUID string; must be unique per RPC)
  • senderClientNo
  • functionName
  • args (array)
  • (optional) orderingKey if you plan per-sender ordering

RPC ACK payload

Required fields:

  • rpcId
  • receiverClientNo or deviceId
  • timestamp

4. Server-Side Implementation (Python)

4.1 Configuration updates

File: STYLY-NetSync-Server/src/styly_netsync/config.py

Add config fields:

  • transform_pub_port (new)

  • state_pub_port (new)

  • heartbeat_timeout (client_timeout could be reused, but separate semantics are recommended)

  • state_broadcast_rate_hz (e.g., 5–20 Hz depending on load)

  • heartbeat_expected_interval (e.g., 1.0s; used only for diagnostics)

  • RPC retry parameters:

    • rpc_retry_initial_ms (e.g., 50–100ms)
    • rpc_retry_max_ms (e.g., 1000ms)
    • rpc_retry_max_attempts (e.g., 30)
    • rpc_outbox_max_per_client (backpressure)
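
Collected as a dataclass, the new fields might look like the following sketch (all default values are illustrative, not prescriptive):

```python
from dataclasses import dataclass

@dataclass
class NetSyncConfig:
    # New downlink ports (port numbers are placeholders)
    transform_pub_port: int = 5556
    state_pub_port: int = 5557
    # Liveness
    heartbeat_timeout: float = 5.0            # seconds without any inbound message
    heartbeat_expected_interval: float = 1.0  # diagnostics only
    # State downlink
    state_broadcast_rate_hz: float = 10.0
    # RPC retry / backpressure
    rpc_retry_initial_ms: int = 100
    rpc_retry_max_ms: int = 1000
    rpc_retry_max_attempts: int = 30
    rpc_outbox_max_per_client: int = 256
```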

4.2 Create two PUB sockets

File: STYLY-NetSync-Server/src/styly_netsync/server.py

Replace the single self.pub with:

  • self.pub_transform
  • self.pub_state

Each should have independent HWM tuning:

  • Transform: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow
  • State: low SNDHWM (e.g., 1–10), use DONTWAIT, drop on overflow

Implementation rule:

  • Never block the server main loop on PUB sends.
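
A minimal pyzmq sketch of the two sockets and the non-blocking send rule; addresses and HWM values are assumptions. Note that PUB sockets already drop frames for a peer whose HWM is reached, so DONTWAIT here is a defensive guarantee that the main loop can never block:

```python
import zmq

def create_downlink_pubs(ctx: zmq.Context, transform_addr: str, state_addr: str):
    """Create the two downlink PUB sockets with tight HWMs so old frames
    are dropped under pressure instead of queueing."""
    pub_transform = ctx.socket(zmq.PUB)
    pub_transform.setsockopt(zmq.SNDHWM, 5)  # keep at most a few pending frames
    pub_transform.setsockopt(zmq.LINGER, 0)
    pub_transform.bind(transform_addr)

    pub_state = ctx.socket(zmq.PUB)
    pub_state.setsockopt(zmq.SNDHWM, 5)
    pub_state.setsockopt(zmq.LINGER, 0)
    pub_state.bind(state_addr)
    return pub_transform, pub_state

def pub_send_nonblocking(sock: zmq.Socket, topic: bytes, payload: bytes) -> bool:
    """Send without ever blocking; report whether the frame was accepted.
    On overflow zmq raises zmq.Again and we simply drop the frame."""
    try:
        sock.send_multipart([topic, payload], flags=zmq.DONTWAIT)
        return True
    except zmq.Again:
        return False
```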

4.3 Route messages to correct PUB

Server message routing:

  • Transform snapshots (MSG_ROOM_POSE_V2 / room transform) → pub_transform
  • NV sync (MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC) and device ID mapping (MSG_DEVICE_ID_MAPPING) → pub_state
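
The routing rule can be sketched as a single dispatch function. The numeric constant values below are placeholders, not the real protocol IDs; only the type-to-socket mapping comes from the list above:

```python
# Placeholder IDs for existing message types (the real values live in the serializers)
MSG_ROOM_POSE_V2 = 2
MSG_GLOBAL_VAR_SYNC = 10
MSG_CLIENT_VAR_SYNC = 11
MSG_DEVICE_ID_MAPPING = 12

def downlink_socket_for(msg_type, pub_transform, pub_state):
    """Pick the PUB socket for a downlink message, per the routing rule."""
    if msg_type == MSG_ROOM_POSE_V2:
        return pub_transform
    if msg_type in (MSG_GLOBAL_VAR_SYNC, MSG_CLIENT_VAR_SYNC, MSG_DEVICE_ID_MAPPING):
        return pub_state
    return None  # not a broadcast message (e.g., RPC goes over ROUTER)
```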

4.4 Heartbeat handling and liveness update

Add heartbeat deserialization

File: STYLY-NetSync-Server/src/styly_netsync/binary_serializer.py

  • Implement serialize/deserialize for MSG_HEARTBEAT.

Update receive loop to refresh last_update

File: server.py, in the ROUTER receive block (where you handle msg_type):

  • On every message received (transform, rpc request, nv set, heartbeat):

    • Resolve device_id:

      • If message contains deviceId, use it
      • Else resolve from client_identity via current mapping
    • Ensure client exists in self.rooms[room_id]

    • Update:

      • self.rooms[room_id][device_id]["last_update"] = time.monotonic()
      • self.rooms[room_id][device_id]["identity"] = client_identity (refresh identity to be safe)

This directly prevents false timeouts due to congested transforms.
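
The update rule above, sketched as helper functions; the shape of self.rooms is assumed from the description:

```python
import time

def touch_liveness(rooms: dict, room_id: str, device_id: str,
                   client_identity: bytes) -> None:
    """Refresh last_update for ANY inbound message (transform, RPC request,
    NV set, heartbeat), so liveness no longer depends on transform flow."""
    client = rooms.setdefault(room_id, {}).setdefault(device_id, {})
    client["last_update"] = time.monotonic()
    client["identity"] = client_identity  # refresh identity to be safe

def expired_clients(rooms: dict, room_id: str, timeout: float) -> list:
    """Clients whose last_update is older than the heartbeat timeout."""
    now = time.monotonic()
    return [dev for dev, c in rooms.get(room_id, {}).items()
            if now - c.get("last_update", 0.0) > timeout]
```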

4.5 Network Variables: keep “latest-only per variable key”

Your design assumption is correct.

Important note from the current server implementation:

  • _buffer_global_var_set() already does “latest-wins per variableName”
  • _buffer_client_var_set() already does “latest-wins per (targetClientNo, variableName)”

Keep this behavior.

Delivery change: publish NV as state snapshots/deltas on state PUB

Goal: prevent multi-minute backlog.

  • Publish periodic NV sync frames containing the current state (or compact delta) for a room.
  • Send with DONTWAIT. If it fails, do nothing; the next tick sends the latest again.
  • Do not queue “every set” on the wire; only send the newest consolidated state.
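
A sketch of the latest-wins buffering plus periodic drain. The method names mirror the existing _buffer_global_var_set() / _buffer_client_var_set() behavior, but the class itself is hypothetical:

```python
class NetVarBuffer:
    """Latest-wins buffer: each set overwrites the pending value for its key,
    and a periodic tick publishes only the newest consolidated state."""

    def __init__(self):
        self._global_pending = {}   # variableName -> latest value
        self._client_pending = {}   # (targetClientNo, variableName) -> latest value

    def buffer_global_var_set(self, name, value):
        self._global_pending[name] = value  # latest-wins per variableName

    def buffer_client_var_set(self, target_client_no, name, value):
        self._client_pending[(target_client_no, name)] = value

    def drain(self):
        """Called at state_broadcast_rate_hz; returns the snapshot to publish
        (via pub_state with DONTWAIT) and clears the pending buffers."""
        snapshot = {"global": dict(self._global_pending),
                    "client": dict(self._client_pending)}
        self._global_pending.clear()
        self._client_pending.clear()
        return snapshot
```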

4.6 Reliable RPC (ROUTER/DEALER with ACK + retry)

Modify RPC flow

Current server uses _enqueue_pub() for RPC broadcast; replace it.

File: server.py

  • Replace _send_rpc_to_room():

    • For each target client in the room (excluding sender):

      • Create rpcId
      • Enqueue into an RPC outbox keyed by target device/client
      • Immediately attempt send via self.router.send_multipart([identity, room_id_bytes, payload])
    • Do not use PUB.

Add outbox + retry scheduler

Maintain:

  • self.rpc_outbox[(room_id, target_device_id)] = {rpcId: OutboxEntry(...)}

  • Each OutboxEntry tracks:

    • payload bytes
    • next_send_time
    • attempts

In main loop, periodically:

  • resend due messages
  • drop messages that exceed max attempts (log error + metrics)

Add ACK processing

When receiving MSG_RPC_ACK from client:

  • remove corresponding rpcId from outbox for that client
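
The outbox, retry scheduler, and ACK handling can be sketched as follows. The backoff shape and helper names are assumptions; real code would send via self.router.send_multipart and emit logs/metrics when dropping:

```python
import time
from dataclasses import dataclass

@dataclass
class OutboxEntry:
    payload: bytes
    next_send_time: float
    attempts: int = 0

class RpcOutbox:
    """Per-target retry state for reliable RPC; the send callback stands in
    for the ROUTER socket."""

    def __init__(self, initial_ms=100, max_ms=1000, max_attempts=30):
        self.initial = initial_ms / 1000.0
        self.max = max_ms / 1000.0
        self.max_attempts = max_attempts
        self.entries = {}  # (room_id, target_device_id) -> {rpc_id: OutboxEntry}

    def enqueue(self, room_id, device_id, rpc_id, payload, now=None):
        now = time.monotonic() if now is None else now
        self.entries.setdefault((room_id, device_id), {})[rpc_id] = \
            OutboxEntry(payload, next_send_time=now)

    def ack(self, room_id, device_id, rpc_id):
        """On MSG_RPC_ACK: forget the entry for that client."""
        self.entries.get((room_id, device_id), {}).pop(rpc_id, None)

    def tick(self, send, now=None):
        """Resend due entries with exponential backoff; drop after max attempts."""
        now = time.monotonic() if now is None else now
        for key, pending in self.entries.items():
            for rpc_id in list(pending):
                e = pending[rpc_id]
                if now < e.next_send_time:
                    continue
                if e.attempts >= self.max_attempts:
                    del pending[rpc_id]   # give up; log error + metrics here
                    continue
                send(key, e.payload)      # attempt delivery via ROUTER
                e.attempts += 1
                e.next_send_time = now + min(self.initial * (2 ** e.attempts),
                                             self.max)
```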

5. Unity Client Implementation (C# / NetMQ)

5.1 ConnectionManager: support two SUB sockets and receiving from DEALER

File: STYLY-NetSync-Unity/Packages/com.styly.styly-netsync/Runtime/Internal Scripts/ConnectionManager.cs

Add sockets

  • _transformSubSocket (SubscriberSocket)
  • _stateSubSocket (SubscriberSocket)
  • Keep _dealerSocket (DealerSocket)

Connect to three endpoints

  • Dealer connects to dealer_port (existing)
  • TransformSub connects to transform_pub_port
  • StateSub connects to state_pub_port
  • Subscribe room topic on both sub sockets

Network loop must receive from:

  • transform SUB
  • state SUB
  • dealer socket (for RPC deliveries and any future reliable server messages)

Implementation rule:

  • Use a poller or a manual loop that checks all sockets without blocking the thread indefinitely.
  • Ensure MessageProcessor.ProcessIncomingMessage(payload) is called for all incoming binary payloads.

5.2 BinarySerializer: add new message types

File: .../Runtime/Internal Scripts/BinarySerializer.cs

  • Add constants for new message IDs

  • Add serialize/deserialize methods for:

    • heartbeat
    • rpc delivery
    • rpc ack

5.3 Heartbeat sender

File candidates:

  • TransformSyncManager.cs or a new lightweight HeartbeatManager.cs

Specification:

  • Send heartbeat via DEALER every ~0.5–1.0s
  • Include deviceId and timestamp
  • Heartbeat must be small and should never block; if send fails, skip.

5.4 Reliable RPC receiver + ACK

File: .../Runtime/Internal Scripts/RPCManager.cs and MessageProcessor.cs

Changes:

  • When MSG_RPC_DELIVERY received:

    • invoke the RPC handler locally
    • send MSG_RPC_ACK back via DEALER immediately
  • Add deduplication:

    • maintain a fixed-size cache of recently seen rpcId values
    • if duplicate delivery arrives, ACK again but do not execute twice
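
The dedup logic is language-agnostic; here is a sketch in Python (the Unity side would mirror it in C#), using an LRU-style fixed-size set of seen rpcIds:

```python
from collections import OrderedDict

class RpcDedupCache:
    """Fixed-size cache of recently seen rpcIds: a duplicate delivery is
    ACKed again but must not trigger a second local execution."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self._seen = OrderedDict()

    def should_execute(self, rpc_id) -> bool:
        if rpc_id in self._seen:
            self._seen.move_to_end(rpc_id)
            return False                    # duplicate: ACK, but do not run twice
        self._seen[rpc_id] = True
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # evict oldest entry
        return True
```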

6. Acceptance Criteria

6.1 Stability

  • With 100+ clients on a constrained network, the server should not produce frequent “removed (timeout)” logs for healthy clients.
  • JOIN/REMOVE churn must be eliminated under steady-state operation.

6.2 Transform behavior under congestion

  • Old transforms are dropped; clients converge to the newest pose.
  • No uncontrolled latency growth.

6.3 Network Variables behavior

  • For each variable key:

    • clients converge to the latest value quickly
    • no multi-minute delays
  • Under severe congestion, temporary staleness is acceptable, but “eventual convergence” must occur.

6.4 RPC reliability

  • No RPC loss under induced congestion.
  • Duplicates may occur (due to retry) but must not cause double execution.

7. Test Plan

  1. Low-bandwidth simulation

    • artificially restrict bandwidth and introduce jitter/packet loss
  2. Soak test

    • 100–300 clients for 30–60 minutes
  3. RPC correctness test

    • generate RPCs at a controlled rate
    • verify all clients receive all intended RPCs (and exactly-once execution via dedup)
  4. NV convergence test

    • update same variable key frequently; confirm clients converge to latest value (not all intermediate values)
  5. Timeout robustness

    • confirm heartbeat prevents false removals even when transform send is heavily throttled

8. Implementation Order (Recommended)

  1. Heartbeat + liveness update (fastest win; stops JOIN/REMOVE storms)
  2. Split Transform PUB vs State PUB (prevents one stream from starving the other)
  3. NV sync as snapshot/delta over State PUB (eliminates backlog-driven delays)
  4. Reliable RPC over ROUTER/DEALER with ACK+retry (meets “must deliver” requirement)

Labels: enhancement (New feature or request)