Skip to content

fix(slack): key thread-context cache by team_id to prevent cross-workspace leak (#12421)#12889

Closed
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/slack-thread-context-cache-team-id
Closed

fix(slack): key thread-context cache by team_id to prevent cross-workspace leak (#12421)#12889
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/slack-thread-context-cache-team-id

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Fixes #12421.

TL;DR

SlackAdapter._fetch_thread_context caches rendered thread context under f"{channel_id}:{thread_ts}" — ignoring team_id — but the rendered content is workspace-scoped: bot mention stripping reads _team_bot_user_ids[team_id], user-name resolution routes through _get_client(channel_id) which consults _channel_team. A subsequent call from a different workspace for the same (channel_id, thread_ts) therefore receives stale context rendered for the first workspace.

Fix: include team_id in the cache key. Empty team_id falls into its own namespace for backward compat.

Reporter's exact repro (reproduces on clean origin/main 6fb69229)

first  = await adapter._fetch_thread_context('Csame', '1000.0', '9999.9', team_id='T1')
second = await adapter._fetch_thread_context('Csame', '1000.0', '9999.9', team_id='T2')
assert first != second   # FAILS — both equal T1's cached content

Root cause

gateway/platforms/slack.py:1416 (pre-fix):

cache_key = f"{channel_id}:{thread_ts}"  # ignores team_id

Multi-workspace Slack mode (_team_clients, _team_bot_user_ids, _channel_team) is explicitly supported elsewhere in the adapter — ignoring team_id in this one cache is a clear omission rather than a design choice.

Fix

cache_key = f"{team_id}:{channel_id}:{thread_ts}"

Plus an updated docstring explaining the workspace-scoping contract. That's it — 1-line behavior change, ~10 lines of docstring.

Behaviour matrix

Call sequence Before After
T1 then T2, same (channel, thread_ts) 2nd hits cache → T1's content (bug) each workspace renders + caches its own
T1 twice, same (channel, thread_ts) cache hit cache hit (unchanged)
T1 twice, different channels 2 entries 2 entries (unchanged)
team_id="" then team_id="T1", same (C, ts) same key — collision distinct keys
TTL expiry on T1's entry evicts the shared entry evicts T1's entry only; T2 stays cached

Narrow scope — explicitly not changed

Per the reporter's "Suggested Investigation Direction" I audited the rest of the adapter's caches / routing maps:

  • _user_name_cache (line 87). Keyed solely by user_id. Slack user IDs are globally unique across the platform, so standard multi-workspace operation doesn't collide there — but in Enterprise Grid where a single user has different display names per workspace, this could serve a stale name. Out of scope for this PR (separate fix, separate test surface); flagging here for maintainer awareness.
  • _assistant_threads (line 111). Keyed by (channel_id, thread_ts). Channel IDs are distinct per workspace in standard multi-workspace mode, so no collision occurs in practice. Left unchanged.
  • Other cache/routing maps. Audited; only _thread_context_cache (this PR) and the Enterprise Grid edge case on _user_name_cache exhibit the team-unaware pattern.

Regression coverage

tests/gateway/test_slack_approval_buttons.py::TestSlackThreadContextWorkspaceIsolation — 7 cases, backed by a new _make_multi_workspace_adapter() helper with two workspaces (T1, T2) and distinct bot UIDs (U_BOT1, U_BOT2):

Test Fails on main Role
test_reporter_repro_two_workspaces_do_not_share_cache Issue body repro
test_structural_pin_cache_key_contains_team_id Pins concrete key shape + absence of old key
test_mention_stripping_is_team_specific Proves content-level leak (not just cache identity): each workspace's output strips its own bot and leaves the other's mention intact
test_ttl_is_per_workspace Ages one team's entry past TTL; only that team re-fetches
test_empty_team_id_has_own_namespace Backward-compat for legacy team_id="" callers
test_different_channels_same_team_each_get_cached Sanity: no cross-channel collapsing after the change
test_same_workspace_same_thread_uses_cache Preserved-behaviour canary — ensures the fix doesn't over-correct into per-call fetches

6 of 7 fail on clean origin/main with symptoms like:

AssertionError: assert 'Hello from T2' in '…[thread parent] Alice: Hello from T1…'
AssertionError: assert 'T1:Csame:1000.0' in {'Csame:1000.0': _ThreadContextCache(...)}

Validation

source venv/bin/activate
python -m pytest tests/gateway/test_slack_approval_buttons.py::TestSlackThreadContextWorkspaceIsolation -q
# 7 passed

Broader Slack + session suites under -n auto:

python -m pytest \
  tests/gateway/test_slack*.py \
  tests/gateway/test_session.py \
  tests/gateway/test_session_dm_thread_seeding.py -q -n auto
# 249 passed, 0 failures

Pre-empted review questions

Q. Why not tuple (team_id, channel_id, thread_ts) instead of string concatenation?
Adapter's sibling caches already use string keys with colon separators (_assistant_thread_key returns a tuple but rest of the adapter leans on string keys). Sticking with the string convention for consistency; no performance difference at this dict size.

Q. Could an empty team_id (key starts with :) collide with any legitimate team name?
No — Slack team_id values are always non-empty, T-prefixed strings. The :C:ts shape is unambiguous.

Q. Why is _user_name_cache scoped out of this PR?
Different failure mode: user IDs are globally unique in Slack, so stale lookups are an Enterprise Grid edge case rather than a standard multi-workspace bug. Also the fix needs a different shape — a keyed-by-team lookup with a migration for already-cached entries. Better as a separate PR.

Q. What about the existing TestSlackThreadContext tests in this file?
All 4 existing tests use a single-workspace _make_adapter() and continue to pass — they don't exercise the multi-workspace path. My new class uses a fresh _make_multi_workspace_adapter() helper to avoid entangling the two test surfaces.


Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.

…space leak (NousResearch#12421)

``SlackAdapter._fetch_thread_context`` cached rendered context under
``f"{channel_id}:{thread_ts}"`` but the rendered content is
workspace-scoped — bot-mention stripping uses
``_team_bot_user_ids[team_id]`` and user-name resolution routes
through ``_get_client(channel_id)`` (which in turn consults
``_channel_team``).  A later lookup from a different workspace for
the same ``(channel_id, thread_ts)`` tuple therefore receives stale
context rendered for the first workspace.

Multi-workspace Slack mode (``_team_clients``, ``_team_bot_user_ids``,
``_channel_team``) is explicitly supported elsewhere in the adapter,
so ignoring ``team_id`` in this one cache is a clear omission rather
than a design choice.  Reporter: NousResearch#12421.

Reporter's exact repro reproduces on clean ``origin/main`` (``6fb69229``)::

    first  = await adapter._fetch_thread_context('Csame', '1000.0', '9999.9', team_id='T1')
    second = await adapter._fetch_thread_context('Csame', '1000.0', '9999.9', team_id='T2')
    assert first != second   # FAILS — both equal T1's content

Fix
---
Include ``team_id`` in the cache key:

    cache_key = f"{team_id}:{channel_id}:{thread_ts}"

Empty ``team_id`` (single-workspace legacy callers) falls into its
own namespace (key starts with ``":"``), backward-compatible with
existing test harnesses and zero-team deployments.

Behaviour matrix
----------------
| Call                                          | Before                                 | After                                        |
| --- | --- | --- |
| T1 then T2, same ``(channel, thread_ts)``     | second call cache-hits → T1's content  | each workspace gets its own fetch + content |
| T1 twice, same ``(channel, thread_ts)``       | cache hit (unchanged)                  | cache hit (unchanged)                        |
| T1 then T1, different channels                | separate entries (unchanged)           | separate entries (unchanged)                 |
| ``team_id=""`` then ``team_id="T1"``          | same key (``"C:ts"``) — collision      | distinct keys (``":C:ts"`` vs ``"T1:C:ts"``) |
| TTL expiry on T1's entry                       | evicts shared entry                    | evicts T1's entry only; T2 still cached      |

Narrow scope — explicitly not changed
-------------------------------------
* **``_user_name_cache``** (line 87).  Also keyed solely by
  ``user_id``.  Slack user IDs are globally unique across the
  platform, so standard multi-workspace operation doesn't collide
  there — but in Enterprise Grid where a single user has different
  display names per workspace, this cache *could* serve a stale
  name.  Out of scope for this PR (separate fix, separate test
  surface); noted here for the reviewer's investigation.
* **``_assistant_threads``** (line 111).  Keyed by
  ``(channel_id, thread_ts)``.  Channel IDs are distinct per
  workspace in standard multi-workspace mode, so no collision
  occurs in practice.  Left unchanged.
* **Other Slack caches/routing maps**.  Audited; only
  ``_thread_context_cache`` and the Enterprise Grid edge case on
  ``_user_name_cache`` exhibit the team-unaware pattern.  Reporter's
  "Suggested Investigation Direction" resolved as above.

Regression coverage
-------------------
``tests/gateway/test_slack_approval_buttons.py`` gets a new
``TestSlackThreadContextWorkspaceIsolation`` class with 7 cases
backed by a ``_make_multi_workspace_adapter()`` helper:

* ``test_reporter_repro_two_workspaces_do_not_share_cache`` —
  exact issue-body repro.
* ``test_same_workspace_same_thread_uses_cache`` — preserved-
  behaviour canary (cache hit within one workspace still saves an
  API call).
* ``test_structural_pin_cache_key_contains_team_id`` — asserts the
  concrete cache key shape contains the team_id AND the old
  workspace-unaware key is not populated; guards against future
  regressions silently stripping the team component.
* ``test_mention_stripping_is_team_specific`` — each workspace
  returns text containing *both* bots' mentions; after per-
  workspace stripping, T1's output has only T2's mention and vice
  versa.  Proves the bug affects rendered content, not just cache
  identity.
* ``test_ttl_is_per_workspace`` — artificially ages one team's
  entry past TTL; asserts only that team re-fetches while the
  other stays cached.
* ``test_empty_team_id_has_own_namespace`` — backward-compat for
  legacy callers passing ``team_id=""``.
* ``test_different_channels_same_team_each_get_cached`` — sanity
  canary: within one workspace, different channels still get
  distinct entries (no cross-channel collapsing).

6 of the 7 fail on clean ``origin/main`` with KeyError on the new
key shape or assertion errors on cross-workspace content leak.  The
1 that passes on main is the preserved-behaviour canary — it
guarantees the fix doesn't over-correct into per-call fetches.

Validation
----------
``source venv/bin/activate && python -m pytest
tests/gateway/test_slack_approval_buttons.py::TestSlackThreadContextWorkspaceIsolation
-q`` → **7 passed**.

Broader Slack + session suites
(``tests/gateway/test_slack*.py`` + ``test_session.py`` +
``test_session_dm_thread_seeding.py``) under ``-n auto`` →
**249 passed, 0 failures**.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 20, 2026 07:37

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a multi-workspace Slack bug where _fetch_thread_context cached rendered thread context without including team_id, allowing cross-workspace cache hits and leaked/incorrect context.

Changes:

  • Key _thread_context_cache entries by team_id in addition to (channel_id, thread_ts).
  • Expand _fetch_thread_context docstring to document workspace-scoped caching semantics and the legacy empty-team_id namespace.
  • Add regression tests ensuring cache isolation, mention stripping, and TTL behavior are per-workspace.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
gateway/platforms/slack.py Includes team_id in the thread-context cache key and documents workspace-scoped caching to prevent cross-workspace leaks.
tests/gateway/test_slack_approval_buttons.py Adds a multi-workspace adapter helper and a comprehensive regression suite covering cache isolation and related behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Expire only T1's entry.
ttl = adapter._THREAD_CACHE_TTL
t1_entry = adapter._thread_context_cache["T1:Csame:1000.0"]
# Rebuild with a stale timestamp rather than mutate a frozen dataclass.

Copilot AI Apr 20, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this is a "frozen dataclass", but _ThreadContextCache in gateway/platforms/slack.py is a regular (mutable) dataclass. Consider rewording to reflect the real reason for dataclasses.replace here (e.g., avoiding in-place mutation / keeping the test intent explicit).

Suggested change
# Rebuild with a stale timestamp rather than mutate a frozen dataclass.
# Rebuild with a stale timestamp instead of mutating the cached entry
# in place, which keeps the test setup explicit.

Copilot uses AI. Check for mistakes.
…t frozen (follow-up to NousResearch#12421)

Copilot review caught that the comment in
``test_ttl_is_per_workspace`` said ``dataclasses.replace`` was used
"to avoid mutating a frozen dataclass" — but
``_ThreadContextCache`` is a regular ``@dataclass``, not frozen.

Reworded to reflect the real reason for using ``dataclasses.replace``:
keeping the test intent (rebuild an entry with an aged timestamp)
explicit at the call site rather than mutating ``t1_entry.fetched_at``
in place.

No behaviour change.  7 passed, same as before.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@briandevans

Copy link
Copy Markdown
Contributor Author

Thanks @copilot-pull-request-reviewer — good catch, addressed in c2877a70.

You're right — _ThreadContextCache is @dataclass without frozen=True (gateway/platforms/slack.py:51), so it's perfectly mutable. My comment was wrong about the reason. Reworded to reflect the real intent: dataclasses.replace keeps the "rebuild a cache entry with an aged timestamp" operation explicit at the call site rather than doing t1_entry.fetched_at -= ttl + 10 which hides the test setup. No behaviour change.

7 passed on branch (same as before).

@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit on the Tests run (24655153236):

4 test failures — all baseline xdist-ordering flakes, none touch Slack code.

Test Symptom Classification
test_gemini_provider.py::TestGeminiModelCatalog::test_provider_models_exist 'gemini-2.5-pro' in [... 3.x models only] Curated catalog drift; passes on clean origin/main
test_interrupt_propagation.py::TestInterruptPropagationToChild::test_interrupt_during_child_api_call_detected AttributeError: 'AIAgent' object has no attribute 'provider' Same class of stub drift as #12574; passes on clean origin/main
test_command_guards.py::TestTirithWarnSafe::test_warn_session_approved assert False is True xdist worker fixture ordering
test_command_guards.py::TestCombinedWarnings::test_combined_cli_session_approves_both assert False xdist worker fixture ordering

Verified on clean origin/main (6fb69229):

# All 4 in isolation:
$ pytest <four_tests_above> -q
4 passed in 2.76s

# All containing modules under -n auto:
$ pytest tests/hermes_cli/test_gemini_provider.py \
         tests/run_agent/test_interrupt_propagation.py \
         tests/tools/test_command_guards.py -q -n auto
72 passed in 5.81s

They only fail when the full tests/ suite runs under -n auto — classic pytest-xdist worker-import-ordering pattern I've seen on prior PRs in this repo. Zero in gateway/platforms/slack.py or tests/gateway/test_slack_approval_buttons.py.

Green: check-attribution, e2e.

Focused suite on branch: pytest tests/gateway/test_slack_approval_buttons.py::TestSlackThreadContextWorkspaceIsolation -q7 passed.
Broader Slack + session suites: 249 passed, 0 failures.

@briandevans

Copy link
Copy Markdown
Contributor Author

Supply-chain FAILURE on this PR is a false positive from a scan bug, not a real finding. Opened #13411 with a root-cause analysis and a one-character fix.

Quick version: .github/workflows/supply-chain-audit.yml computes the PR diff with git diff "$BASE".."$HEAD" (two-dot) which includes every file main has drifted through since the PR was forked. On this PR, that adds hermes_cli/setup.py (modified upstream, not by this PR) to the scan's file list, which trips the install-hook check. The 3-dot merge-base diff shows only this PR's actual 2 files (neither of which matches any attack pattern).

Was masked pre-19db7fa3 because that commit changed the fail condition from critical == 'true' to found == 'true', unmasking the pre-existing 2-dot bug.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery platform/slack Slack app adapter labels Apr 22, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #12502 (same fix approach) and #12421 (original bug report).

1 similar comment
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #12502 (same fix approach) and #12421 (original bug report).

@briandevans

Copy link
Copy Markdown
Contributor Author

Thanks @alt-glitch for the cross-reference. 👍

@flobo3's #12502 uses the same idea (add team_id to cache key) — their diff is 1 line, no test coverage added. This PR's test file covers:

  • Reporter's exact repro (two workspaces, same (channel_id, thread_ts) → distinct cache entries, independent API fetches)
  • Content-level leak proof: each workspace's input contains both bots' mentions; after per-workspace stripping, T1's output has only T2's mention left and vice versa. This catches the bug at the rendered-content level, not just cache identity.
  • TTL-per-workspace (aging one team's entry doesn't affect the other)
  • Empty-team_id backward-compat canary
  • Structural pin on the key shape (prevents future refactor from silently reverting)

If maintainers pick #12502, happy to offer these tests as follow-up coverage. Either way, the user-facing fix is the same shape.

@briandevans

Copy link
Copy Markdown
Contributor Author

Closing — superseded by the already-merged f414df3a ("fix(slack): include team_id in thread-context cache key"), which landed on main and implements the identical fix: keying the thread-context cache by (channel, team_id) to prevent cross-workspace leaks.

No reason to land a duplicate. Thanks to whoever merged the upstream fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists platform/slack Slack app adapter type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Slack thread-context cache key ignores workspace identity

3 participants