You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Three-agent cumulative review (Claude + Codex challenge + Kimi review) over
the 2026-04-27 batch (#34 docs / #22 bounded LRU / #18 Phase 0 instrumentation)
surfaced four emergent issues only visible reading the diffs together:
1. connect# off-by-one (Claude HIGH + Codex MEDIUM): in _group_subscribe_once
the counter was bumped AFTER _on_group_event was awaited, so the gap log
inside dispatch carried connect#=N-1 while the "subscribe stream live"
log moments later carried connect#=N for the same first-frame event —
defeating the in-line correlation goal. Fix: bump pre-dispatch (counter
+ ts represent "WS layer received a frame this cycle"; subscribe-live
log + backoff clamp stay post-dispatch as "operator-visible recovery").
2. time.time() vs time.monotonic() (Codex MEDIUM + Kimi MEDIUM independent):
wall-clock delta is unsafe under NTP step / leap-second / VM suspend-
resume — can produce spurious gap candidates. Switched arrival-time
tracking + since-reconnect correlation to monotonic. Wall-clock
correlation comes from the log line's own leading ISO timestamp.
3. _BoundedLRU shared cap blinds Phase 0 (Codex HIGH lead objection):
sharing the 500-entry cap with _last_arrival_time_per_room would
silently neuter the gap log under cardinality pressure (>=500 rooms) —
the very scenario the cap was meant to defend against. Fix: separate
cap arrival_time_cache_max_entries (default 10000, 20x the dedup cap),
docstring acknowledging arrival-time is observability-critical.
4. Anchor IDs on every paginated page (Kimi MEDIUM): list_group_messages
echoed start_message_id/end_message_id alongside pageToken on follow-up
pages — risks duplicate results once max_pages>1. Fix: anchors only on
first request, mirrors list_group_files.
Fold-in (#60): _fetch_group_message now uses max_pages=2 so a tight
start=end=message_id window that Kimi paginates around is found instead
of silently returning None. Safe given fix #4.
Docs: 4-dict bounded-LRU enumeration; monotonic semantic; README demoted
instrumentation under Diagnostics (Codex LOW: it's observability not a
feature).
Tests: 189 passing (+10 new, 1 modified). New: anchor-only-on-first-page,
monotonic clock-skew safety, 5x arrival cap independence, 2x
_fetch_group_message pagination, connect# ordering invariant, sentinel
assertion. Modified: subscribe handler-exception test reflects new
pre-dispatch state semantic.
Review artifacts preserved at .review/58-{claude-code-reviewer,
codex-challenge,kimi-review,synthesis}.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: README.md
+31-23Lines changed: 31 additions & 23 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -101,8 +101,7 @@ If you don't see the `register_platform_adapter` line, the hook isn't present in
101
101
|**Slash commands**| Pass-through to the Hermes runtime |`/new`, `/compact`, `/status`, etc. handled at the runtime layer. |
102
102
|**Tool calls**| Native streaming via session/update | Tool-call frames are forwarded to the Kimi UI without being filtered. |
103
103
|**Output modes**|`output_mode: tool_only \| passthrough`|`tool_only` suppresses agent text in favour of `SendMessage` tool calls (matches the hakimi pattern). Default is `passthrough`. See [Picking an output_mode](#picking-an-output_mode) below. |
104
-
|**Bounded room state**|`_BoundedLRU(maxsize=N)` on `_rooms`, `_last_message_id_per_room`, `_probe_msg_id_room_counts`| Per-room dicts cap at `room_cache_max_entries` entries (default 500). Eviction does NOT affect message-dispatch correctness (replay-dedup is owned by `_processed_set`, separately bounded). It does have observable side effects under cardinality pressure — see [Bounded room state](#bounded-room-state) below. |
105
-
|**Burst-drop instrumentation**| Gap-candidate INFO logs + reconnect counter + paginated `list_group_messages`| Gathers evidence for whether suspicious per-room timing gaps cluster around stream reconnects. Does NOT recover messages — that's a future Phase 1+. See [Burst-drop instrumentation](#burst-drop-instrumentation) below. |
104
+
|**Bounded room state**|`_BoundedLRU(maxsize=N)` on four per-room dicts |`_rooms`, `_last_message_id_per_room`, `_probe_msg_id_room_counts` share `room_cache_max_entries` (default 500); `_last_arrival_time_per_room` uses its own larger `arrival_time_cache_max_entries` (default 10000) because eviction would silently blind gap detection. Eviction never affects message-dispatch correctness — replay dedup is owned by `_processed_set`, separately bounded. See [Bounded room state](#bounded-room-state) below. |
106
105
|**Onboarding skill**| Embedded `optional-skills/communication/kimi-platform/`| Once enabled in skill settings, agents get a brief on Kimi-specific behaviours (group vs DM, slash semantics, etc.). |
107
106
108
107
### Picking an `output_mode`
@@ -132,50 +131,67 @@ This flag exists because the bridge's previous workaround (`HIDE_TOOL_CALLS=1`)
132
131
133
132
### Bounded room state
134
133
135
-
Three of the adapter's per-room dicts (`_rooms`, `_last_message_id_per_room`, `_probe_msg_id_room_counts`) used to grow without ceiling on long-running deployments. They're now backed by `_BoundedLRU` with a configurable cap (`room_cache_max_entries`, default 500). The cap is **per-dict**, but all three share the same `room_id` key space so cardinality is symmetric.
134
+
Four of the adapter's per-room dicts are backed by `_BoundedLRU` so cardinality can't grow without ceiling on long-running deployments:
135
+
136
+
-**Shared cap** (`room_cache_max_entries`, default 500) — `_rooms`, `_last_message_id_per_room`, `_probe_msg_id_room_counts`. All three share the same `room_id` key space and have non-load-bearing eviction semantics.
137
+
-**Larger separate cap** (`arrival_time_cache_max_entries`, default 10000) — `_last_arrival_time_per_room`. Eviction would silently blind the Phase 0 gap-candidate log (next message produces `prev_arrival=None` and the INFO log won't fire regardless of actual delay). Codex review #58 flagged this as a contradiction with the "no load-bearing state" promise; the separate higher cap resolves it without making memory unbounded — 10000 entries × ~16 bytes ≈ 160KB, effectively unbounded for any realistic deployment.
136
138
137
139
**What eviction never breaks:** message-dispatch correctness. Replay dedup is owned by `_processed_set`, which is bounded by `_DEDUP_MAXLEN=2000` independently. Whether a room's metadata is in `_rooms` or not has zero bearing on whether the agent sees a duplicate or dropped message.
138
140
139
-
**What eviction *does* affect** under cardinality pressure (≥ cap rooms ever encountered):
141
+
**What eviction *does* affect** under cardinality pressure (≥ shared cap rooms ever encountered):
140
142
141
143
1.**`first-seen` DEBUG log on resumed rooms.**`_last_message_id_per_room` is hoisted out of the DEBUG gate so toggling DEBUG on later doesn't produce a misleading `first-seen`. Eviction reintroduces a narrower version of that artifact: a quiet room evicted then resumed will log `first-seen` instead of a delta on its first message back. Misleading observability, not misleading state.
142
144
2.**Probe sample-phase reset.**`_probe_msg_id_room_counts` keys its DEBUG sampling phase off the count modulo `probe_msg_id_sample_rate`. After eviction the count restarts at 1, so the first `sample_rate - 1` resumed messages are skipped from the DEBUG sample.
143
145
3.**Cold-resume RPC failure.**`_rooms` re-fetches a missing entry via `GetRoom` + `ListMembers`; on transient `KimiAdapterError` the fallback returns `{"name": room_id, "type": "group"}` with no members. Without eviction, that fallback only runs for *never-cached* rooms; with eviction, a failed re-fetch on a previously-cached room can briefly degrade display name + members until the next successful refresh.
144
146
145
-
For Bloom's typical Pi deployment (~10 unique rooms over weeks) none of these ever fire. For deployments with hundreds-to-thousands of rooms, raise the cap:
147
+
For Bloom's typical Pi deployment (~10 unique rooms over weeks) none of these ever fire. For deployments with hundreds-to-thousands of rooms, raise the caps:
146
148
147
149
```yaml
148
150
platforms:
149
151
kimi:
150
152
extra:
151
153
room_cache_max_entries: 5000
154
+
arrival_time_cache_max_entries: 50000# only if >10000 distinct rooms
152
155
```
153
156
154
-
### Burst-drop instrumentation
157
+
## Production reference
158
+
159
+
Bloom (Xule's home Raspberry Pi) has been running this adapter against real Kimi traffic:
160
+
- 2026-04-23 → 2026-04-26: in-tree at `gateway/platforms/kimi.py` on `linxule/hermes-agent:feat/kimi-platform-adapter`
161
+
- 2026-04-26 → 2026-04-27: as this plugin in-tree on `linxule/hermes-agent:feat/kimi-plugin-variant`
162
+
- 2026-04-27 → present: as this **standalone plugin** symlinked into `~/.hermes/plugins/kimi`, against `linxule/hermes-agent:feat/platform-kimi-enum` (the upstream-equivalent base for both prospective PRs)
163
+
164
+
Validated end-to-end via gateway logs showing `hermes_plugins.kimi.kimi_adapter` as the connecting module (i.e., the plugin path from this repo, not any in-tree fallback shim).
165
+
166
+
## Diagnostics
167
+
168
+
These are **observability** signals, not capabilities — operators read them to diagnose, but the plugin never recovers messages on the basis of them. Recovery design (Phase 1+) is gated on a release cycle of these signals from real production data.
169
+
170
+
### Burst-drop instrumentation (Phase 0)
155
171
156
-
Three INFO-level signals gather evidence about message-loss patterns *without* recovering anything. They exist so a future Phase 1+ recovery design can be picked against real production data instead of assumed traffic patterns.
172
+
Three INFO-level signals gather evidence about message-loss patterns *without* recovering anything.
157
173
158
-
**1. Gap-candidate INFO log.** Tracks **wall-clock arrival time** per room (`time.time()` at the moment each `chatMessage` is processed by the adapter). When the delta between consecutive arrivals in a room is `≥ burst_drop_gap_log_threshold_s` (default 30s, set to 0 to disable):
174
+
**1. Gap-candidate INFO log.** Tracks **monotonic arrival time** per room (`time.monotonic()` at the moment each `chatMessage` is processed by the adapter). When the delta between consecutive arrivals in a room is `≥ burst_drop_gap_log_threshold_s` (default 30s, set to 0 to disable):
159
175
160
176
```
161
177
INFO Kimi groups: gap candidate room=<chat_id> id=<this_id> prev=<prev_id> delta_s=N.N (>=30.0s threshold) since_reconnect_s=K.K connect#=M
162
178
```
163
179
164
-
Note: this is **wall-clock arrival delta**, not Kimi-`messageId`-embedded timestamp delta. Kimi's production message ids are UUID v8 with a non-standard epoch (their first 48 bits decode to magnitudes ~16× wall-clock seconds), so id-based timestamps are unreliable for time-domain analysis. Wall-clock at receive time is the operator-meaningful signal.
165
-
166
-
The `since_reconnect_s` and `connect#` fields make correlation against reconnects implicit in a single line — operators don't need to grep two log streams.
180
+
`time.monotonic()` is the right primitive for "how long since last delivery" — it never goes backward, never jumps on NTP sync or VM suspend/resume, and is immune to leap-second adjustments. Wall-clock correlation comes from the log line's own leading ISO timestamp; the embedded `delta_s` is honest process-time. (Earlier wall-clock implementation switched to monotonic in #58 cumulative review after independent flags from Codex and Kimi reviewers.)
167
181
168
182
This is anomaly-spotting only — most gaps are legitimate (idle room, no traffic). The signal becomes useful when **clusters** of these correlate against reconnects (low `since_reconnect_s` values in burst). At Bloom's typical traffic patterns the default 30s threshold may produce one-off false positives from human conversational pauses; raise to 60-120s if INFO noise becomes distracting.
169
183
170
-
**2. Reconnect counter + snapshot log.** On the first `chatMessage` post-(re)connect:
184
+
Also note: Kimi's production message ids are UUID v8 with a non-standard epoch (their first 48 bits decode to magnitudes ~16× wall-clock seconds), so id-based timestamps are unreliable for time-domain analysis. That's why the gap delta is wall-time-based, not id-derived.
185
+
186
+
**2. Reconnect counter + snapshot log.** On the first `chatMessage` post-(re)connect, after the counter is bumped (the bump is now pre-dispatch in #58 to keep the gap log's `connect#` consistent with the snapshot log):
171
187
172
188
```
173
189
INFO Kimi groups: subscribe stream live (connect#N, rooms_tracked=M, prev_backoff=X.Xs)
174
190
```
175
191
176
192
`connect#N` is monotonic over the adapter lifetime (cold start = #1). The same value also appears in every gap-candidate log line for in-line correlation.
177
193
178
-
**3. Paginated `list_group_messages`.** The wrapper now follows `nextPageToken` up to a configurable `max_pages` (default 1, backward-compatible with every existing caller). Required for any future recovery design that fetches more than `limit` messages from a gap.
194
+
**3. Paginated `list_group_messages`.** The wrapper follows `nextPageToken` up to a configurable `max_pages` (default 1, backward-compatible with every existing caller). Anchor IDs (`startMessageId` / `endMessageId`) are sent only on the first request — subsequent requests carry the opaque cursor only, avoiding the duplicate-results / undefined-ordering risk Kimi review #58 flagged. Required for any future recovery design that fetches more than `limit` messages from a gap.
179
195
180
196
Tunable via `config.extra`:
181
197
@@ -184,20 +200,12 @@ platforms:
184
200
kimi:
185
201
extra:
186
202
burst_drop_gap_log_threshold_s: 30.0 # 0 disables the gap-candidate INFO
203
+
arrival_time_cache_max_entries: 10000 # see Bounded room state above
187
204
# (no config knob for the reconnect counter — always on)
188
205
# (max_pages is per-call; default 1)
189
206
```
190
207
191
-
This is **evidence-gathering**, not recovery. Recovery designs (Phase 1+) require a release cycle of this data first to choose between reconnect catch-up vs periodic poll vs gap-triggered.
192
-
193
-
## Production reference
194
-
195
-
Bloom (Xule's home Raspberry Pi) has been running this adapter against real Kimi traffic:
196
-
- 2026-04-23 → 2026-04-26: in-tree at `gateway/platforms/kimi.py` on `linxule/hermes-agent:feat/kimi-platform-adapter`
197
-
- 2026-04-26 → 2026-04-27: as this plugin in-tree on `linxule/hermes-agent:feat/kimi-plugin-variant`
198
-
- 2026-04-27 → present: as this **standalone plugin** symlinked into `~/.hermes/plugins/kimi`, against `linxule/hermes-agent:feat/platform-kimi-enum` (the upstream-equivalent base for both prospective PRs)
199
-
200
-
Validated end-to-end via gateway logs showing `hermes_plugins.kimi.kimi_adapter` as the connecting module (i.e., the plugin path from this repo, not any in-tree fallback shim).
208
+
Recovery designs (Phase 1+) require a release cycle of this data first to choose between reconnect catch-up vs periodic poll vs gap-triggered.
0 commit comments