feat(gateway): add CSV and JSON to supported document types with text injection#4109

Open
tjp2021 wants to merge 1 commit into NousResearch:main from tjp2021:feat/csv-json-document-support
@tjp2021 tjp2021 commented Mar 31, 2026

What does this PR do?

Adds .csv and .json to SUPPORTED_DOCUMENT_TYPES and introduces a centralized TEXT_INJECTABLE_EXTENSIONS constant so all gateway adapters consistently accept, cache, and optionally inline these file types — matching the behavior WhatsApp already had.

Closes #4105

Type of Change

  • New feature (non-breaking change which adds functionality)

Problem

CSV and JSON files uploaded on Slack, Discord, Telegram, and Feishu are skipped: they're not in SUPPORTED_DOCUMENT_TYPES, so the file handler's if ext not in SUPPORTED_DOCUMENT_TYPES: continue check ignores them. Telegram at least sends an "Unsupported document type" reply; the other three give no feedback at all.

The WhatsApp adapter already handles both types — its text injection list at line 775 of whatsapp.py includes .csv, .json, and several others. The other four adapters don't.

Each adapter also hardcodes its own if ext in (".md", ".txt") check for text injection. Adding a new injectable type means editing every adapter file individually.
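
The skip behavior described above can be sketched roughly as follows. This is an illustration, not the adapter code: filter_supported is a hypothetical helper, and the two-entry mapping (including its MIME strings) is an assumed stand-in for the real pre-PR SUPPORTED_DOCUMENT_TYPES, which contains more entries.

```python
import os

# Assumed stand-in for the pre-PR mapping; the real one in base.py is larger.
SUPPORTED_DOCUMENT_TYPES = {
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def filter_supported(filenames):
    """Hypothetical helper: keep only files whose extension is supported."""
    accepted = []
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        if ext not in SUPPORTED_DOCUMENT_TYPES:
            continue  # skipped: nothing is cached and nothing is injected
        accepted.append(name)
    return accepted


# Before this PR, the CSV and JSON uploads never reach the caching step.
print(filter_supported(["notes.md", "sales.csv", "config.json"]))
```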

Changes Made

Core (gateway/platforms/base.py)

  • Added .csv: "text/csv" and .json: "application/json" to SUPPORTED_DOCUMENT_TYPES
  • Added TEXT_INJECTABLE_EXTENSIONS frozenset ({".md", ".txt", ".csv", ".json"}) to centralize the set of extensions eligible for inline text injection
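
A minimal sketch of what the two constants might look like after this change. Only the .csv and .json entries and the frozenset contents come from this PR; the MIME strings shown for .md and .txt are assumptions, and the real mapping has further entries that are omitted here.

```python
# Sketch only: the real SUPPORTED_DOCUMENT_TYPES in base.py has more entries.
SUPPORTED_DOCUMENT_TYPES = {
    ".md": "text/markdown",       # assumed MIME string
    ".txt": "text/plain",         # assumed MIME string
    ".csv": "text/csv",           # added by this PR
    ".json": "application/json",  # added by this PR
}

# Single source of truth for extensions whose content may be inlined.
TEXT_INJECTABLE_EXTENSIONS = frozenset({".md", ".txt", ".csv", ".json"})

# Invariant the new tests check: every injectable extension is also supported.
assert TEXT_INJECTABLE_EXTENSIONS.issubset(SUPPORTED_DOCUMENT_TYPES)
```

Using a frozenset keeps the set immutable, so an adapter cannot accidentally extend it at runtime.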

Adapters (Slack, Discord, Telegram, Feishu)

  • Imported TEXT_INJECTABLE_EXTENSIONS from base
  • Replaced hardcoded if ext in (".md", ".txt") with if ext in TEXT_INJECTABLE_EXTENSIONS
  • Feishu: also added "text/csv" and "application/json" to the MIME-type fallback check in _maybe_extract_text_document()
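
The adapter-side check, including a MIME-type fallback in the spirit of the Feishu change, could look something like this. The function should_inject and the set TEXT_INJECTABLE_MIME_TYPES are hypothetical names for illustration; the actual logic lives inside each adapter and in _maybe_extract_text_document(), and may be structured differently.

```python
TEXT_INJECTABLE_EXTENSIONS = frozenset({".md", ".txt", ".csv", ".json"})

# Hypothetical fallback set. "text/csv" and "application/json" are the values
# this PR adds for Feishu; the other two entries are assumptions.
TEXT_INJECTABLE_MIME_TYPES = frozenset(
    {"text/plain", "text/markdown", "text/csv", "application/json"}
)


def should_inject(ext, mime_type=None):
    """Return True if a file qualifies for inline text injection."""
    # Primary check: the shared extension set replaces the old hardcoded tuple.
    if ext.lower() in TEXT_INJECTABLE_EXTENSIONS:
        return True
    # Fallback for files that arrive without a usable extension.
    return mime_type in TEXT_INJECTABLE_MIME_TYPES
```

With the shared set, making another extension injectable becomes a one-line change in base.py rather than an edit in every adapter.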

Tests (20 new tests across 4 files)

  • test_document_cache.py: Added .csv and .json to test_expected_extensions_present parametrize list (+2); added test_text_injectable_is_subset_of_supported (+1) and parametrized test_expected_text_injectable_extensions (+4)
  • test_slack.py: 4 new tests — CSV cached+injected, JSON cached+injected, large CSV (>100KB) cached but not injected, binary JSON cached but not injected
  • test_discord_document_handling.py: 4 new tests — same coverage as Slack
  • test_telegram_documents.py: 5 new tests — same as Discord plus MIME→extension fallback test for JSON without filename
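
The MIME→extension fallback exercised by the Telegram test can be pictured as a reverse lookup over the supported-types mapping. The sketch below assumes a hypothetical resolve_extension helper and a reduced mapping; the Telegram adapter's real resolution logic is not shown in this PR description and may differ.

```python
import os

# Reduced mapping for illustration; only the two entries added by this PR.
SUPPORTED_DOCUMENT_TYPES = {
    ".csv": "text/csv",
    ".json": "application/json",
}


def resolve_extension(filename, mime_type):
    """Hypothetical helper: prefer the filename's extension, else map the MIME type."""
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        if ext:
            return ext
    # No filename (or no extension): look the MIME type up in reverse.
    for ext, mime in SUPPORTED_DOCUMENT_TYPES.items():
        if mime == mime_type:
            return ext
    return ""  # unknown type; the caller would treat this as unsupported
```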

Not modified

  • whatsapp.py — already handles CSV/JSON; no changes needed

Testing

Live-tested (Slack): Tested on a running Slack gateway with real file uploads:

  • Single CSV/JSON file with @bot mention — file cached, content injected, agent responds with summary
  • Multi-file upload (4 files: 2 CSV + 2 JSON) in one message — all files cached and injected in order, agent summarized each one
  • Unicode content (Japanese, Portuguese, German characters in CSV) — decoded and injected correctly

Unit-tested (all adapters): 145 tests pass across 4 test files (125 before this PR, 145 after). Discord, Telegram, and Feishu were verified via unit tests only — the code change is the same one-line swap in each adapter.

./venv/bin/python -m pytest tests/gateway/test_document_cache.py tests/gateway/test_slack.py tests/gateway/test_discord_document_handling.py tests/gateway/test_telegram_documents.py -v -o "addopts="

Usage note

On Slack, files must be attached to a message that @mentions the bot. Uploading files without a mention sends a file_shared event, which the adapter does not handle. This is a pre-existing limitation of the Slack adapter's event handling, not introduced by this PR.

Security Notes

Text injection (decoding file content into event.text) already exists for .md and .txt. CSV and JSON use the same code path with the same guards:

  • 100 KB cap on injected content (MAX_TEXT_INJECT_BYTES)
  • UTF-8 decode with UnicodeDecodeError catch (binary files skip injection)
  • 20 MB download size limit
  • Path traversal protection in cache_document_from_bytes()

CSV/JSON content could contain adversarial text, but this is the same risk as .txt/.md — not a new attack class.
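
The first two guards can be sketched as below. This is an illustration under stated assumptions, not the adapters' code: maybe_inject_text is a hypothetical helper, the cap is assumed to be 100 * 1024 bytes, and the way content is appended to the message text (separated by a blank line) is invented for the example.

```python
MAX_TEXT_INJECT_BYTES = 100 * 1024  # assumed value for the "100 KB" cap
TEXT_INJECTABLE_EXTENSIONS = frozenset({".md", ".txt", ".csv", ".json"})


def maybe_inject_text(ext, raw, text):
    """Return text with the file content appended, or text unchanged if a guard trips."""
    if ext not in TEXT_INJECTABLE_EXTENSIONS:
        return text
    if len(raw) > MAX_TEXT_INJECT_BYTES:
        return text  # oversized: the file stays cached but is not inlined
    try:
        content = raw.decode("utf-8")
    except UnicodeDecodeError:
        return text  # binary payload behind a text extension: skip injection
    # Assumed formatting: append after a blank line if there is message text.
    return f"{text}\n\n{content}" if text else content
```

In both failure cases the message text is returned untouched, which matches the "cached but not injected" outcomes covered by the new tests.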

Code Checklist

  • Changes follow existing code patterns and conventions
  • All new code has corresponding test coverage
  • No breaking changes to existing functionality
  • Security implications considered and documented
  • Conventional Commits format used

… injection

CSV (.csv) and JSON (.json) files uploaded on Slack, Discord, Telegram, or
Feishu are currently dropped because they are not in
SUPPORTED_DOCUMENT_TYPES. This is inconsistent with the WhatsApp adapter,
which already handles these types, including text injection.

Changes:
- Add .csv (text/csv) and .json (application/json) to SUPPORTED_DOCUMENT_TYPES
- Introduce TEXT_INJECTABLE_EXTENSIONS constant in base.py to centralize the
  set of extensions eligible for inline content injection
- Update text injection conditions in Slack, Discord, Telegram, and Feishu
  adapters to use TEXT_INJECTABLE_EXTENSIONS instead of hardcoded tuples
- Add tests for CSV/JSON acceptance, text injection, oversized file handling,
  binary content graceful degradation, and MIME-based extension resolution

Security note: text injection carries the same prompt injection surface as
existing .txt/.md support. Existing mitigations apply (100KB cap, UTF-8
validation, UnicodeDecodeError handling, path traversal protection).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@tjp2021 force-pushed the feat/csv-json-document-support branch from d5992e5 to c5fd76a on March 31, 2026 02:03
@alt-glitch added the type/feature (New feature or request), comp/gateway (Gateway runner, session dispatch, delivery), and P3 (Low: cosmetic, nice to have) labels on May 2, 2026
@alt-glitch

Collaborator

Superseded by #13943 which addresses the same feature (CSV/JSON document support + centralized text-injectable rules) and explicitly closes #4105.

Development

Successfully merging this pull request may close these issues.

[Feature]: Add CSV and JSON to supported document types across all gateway adapters

2 participants