Support claude.ai privacy export format with sender field #605

Open
carlito1979 wants to merge 2 commits into MemPalace:develop from carlito1979:claude/fix-claude-ai-exports-1leWI

@carlito1979 carlito1979 commented Apr 11, 2026

Fixes #602

What does this PR do?

Extends Claude.ai JSON export support to handle the actual privacy export format, which uses `sender` instead of `role` and stores rendered messages in a top-level `text` field alongside structured `content` blocks.

Key changes:

  • Refactored message extraction into a `_extract_claude_ai_message()` helper that:
    • Accepts both `role` and `sender` fields for author identification
    • Falls back to the top-level `text` field when `content` blocks are empty
    • Handles both nested (privacy export) and flat message list formats
  • Increased `MAX_FILE_SIZE` from 10 MB to 100 MB to accommodate typical claude.ai privacy exports (20–50 MB)
  • Added user-visible warnings when files exceed the size limit instead of silently skipping them
  • Improved docstrings to document the supported formats
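
For readers following along, the extraction order listed above can be sketched roughly as follows. This is not the code in this PR: the names `extract_message` and `extract_all`, the `chat_messages` / `messages` container keys, the `{"type": "text", ...}` block shape, and the mapping of `human` to `user` are assumptions made for illustration only.

```python
from typing import Any, List, Optional, Tuple

# Assumed author mapping; the PR only states that exports use "human" / "assistant".
_AUTHOR_MAP = {"human": "user", "user": "user", "assistant": "assistant"}


def extract_message(item: Any) -> Optional[Tuple[str, str]]:
    """Return (author, text) for one message dict, or None if it is unusable."""
    if not isinstance(item, dict):
        return None

    # Accept either field: `role` first, then `sender`.
    raw_author = item.get("role") or item.get("sender") or ""
    author = _AUTHOR_MAP.get(str(raw_author).lower())
    if author is None:
        return None

    # Prefer structured content. Assumed block shape: {"type": "text", "text": "..."}.
    content = item.get("content")
    parts: List[str] = []
    if isinstance(content, str):
        parts.append(content)
    elif isinstance(content, list):
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                parts.append(block.get("text") or "")
    text = " ".join(p.strip() for p in parts if p and p.strip())

    # Fall back to the top-level `text` field when the blocks yield nothing.
    if not text:
        top_level = item.get("text")
        text = top_level.strip() if isinstance(top_level, str) else ""

    return (author, text) if text else None


def extract_all(data: Any) -> List[Tuple[str, str]]:
    """Handle a flat message list or a list of conversations with nested messages."""
    if not isinstance(data, list):
        return []
    results: List[Tuple[str, str]] = []
    for item in data:
        # Assumed container keys for the nested (privacy export) layout.
        if isinstance(item, dict) and ("chat_messages" in item or "messages" in item):
            candidates = item.get("chat_messages") or item.get("messages") or []
            if not isinstance(candidates, list):
                continue
        else:
            candidates = [item]
        for message in candidates:
            extracted = extract_message(message)
            if extracted:
                results.append(extracted)
    return results
```

Routing both layouts through one per-message function is what keeps the nested and flat paths from drifting apart.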

How to test

Run the test suite:

python -m pytest tests/test_normalize.py::test_claude_ai_privacy_export_sender_field -v
python -m pytest tests/test_normalize.py::test_claude_ai_privacy_export_text_field_fallback -v
python -m pytest tests/test_normalize.py::test_claude_ai_flat_messages_sender_field -v
python -m pytest tests/test_convo_miner_unit.py::TestScanConvos::test_scan_warns_on_oversized_file -v
python -m pytest tests/ -v

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

https://claude.ai/code/session_01GUH8MeAt6jAjKpbQ227AcC

Two bugs caused `mine --mode convos` to silently file zero drawers from
claude.ai privacy exports:

1. `_try_claude_ai_json` only looked at `role`, but the privacy export
   uses `sender` ("human" / "assistant"). Now accepts either field, and
   falls back to the message's top-level `text` when the structured
   `content` blocks yield nothing.

2. `convo_miner.MAX_FILE_SIZE` was 10 MB while real claude.ai exports
   routinely run 20–50 MB, so `conversations.json` was dropped before
   parsing with no diagnostic. The default cap is now 100 MB and
   oversize files emit a visible warning to stderr.

Adds unit tests covering the `sender` field, the `text` fallback, and
the new oversize-file warning.

@web3guru888 web3guru888 left a comment

This is exactly the fix I described in #602 — clean, well-scoped, and the implementation is more thorough than the minimal one-liner I suggested.

_extract_claude_ai_message() helper — the right abstraction

Factoring this out as a dedicated helper (rather than an inline item.get("role") or item.get("sender")) is the correct choice. Both the flat and nested paths now share the same extraction logic, which rules out the "fix one path and miss the other" class of bug that patching each call site separately would invite.

The fallback chain is right:

  1. Try role first (legacy exports)
  2. Fall back to sender (current privacy exports)
  3. Try structured content blocks
  4. Fall back to top-level text

That ordering handles all known format versions and will degrade gracefully for future variations.

MAX_FILE_SIZE = 100MB

In #602 I suggested 50 MB, but 100 MB is reasonable: claude.ai exports scale with conversation volume, and 50 MB is already uncomfortably close to the 38 MB exports being reported. 100 MB gives headroom without being reckless. The comment "20–50 MB JSON files" is accurate and useful context.

Warning on stderr rather than stdout

Correct: file=sys.stderr keeps stdout clean for piping. The warning format is human-readable and includes both actual size and limit, which is what users need to understand the skip.
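
A minimal sketch of that behavior, for illustration: the function name `within_size_limit` and the exact message wording are hypothetical, and only the 100 MB default and the use of `file=sys.stderr` come from this PR.

```python
import sys
from pathlib import Path
from typing import Optional

MB = 1024 * 1024
MAX_FILE_SIZE = 100 * MB  # new default described in this PR


def within_size_limit(path: Path, limit: Optional[int] = None) -> bool:
    """Return True if the file may be parsed; warn on stderr and return False otherwise."""
    # Read the module-level constant at call time so tests can patch it.
    if limit is None:
        limit = MAX_FILE_SIZE
    size = path.stat().st_size
    if size > limit:
        # Hypothetical wording: report both the actual size and the limit.
        print(
            f"WARNING: skipping {path.name}: "
            f"{size / MB:.1f} MB exceeds the {limit / MB:.1f} MB limit",
            file=sys.stderr,
        )
        return False
    return True
```

Because the warning goes to stderr, something like `mine --mode convos ... > out.txt` would still show the skip in the terminal while the redirected output stays clean.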

Tests

Three normalize tests + the oversized warning test cover the main cases. The patch.object(convo_miner, "MAX_FILE_SIZE", 1) pattern for the warning test is the right way to trigger the condition without writing a 100MB file.
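
For anyone unfamiliar with the pattern, here is a self-contained sketch. `fake_miner` and `scan_sizes` are stand-ins invented for this example, not the real `convo_miner` module or its scan function; the point is only how `patch.object` lowers the limit for the duration of the test and then restores it.

```python
import io
import sys
import types
from contextlib import redirect_stderr
from unittest.mock import patch

MB = 1024 * 1024

# Stand-in for the real module; only the attribute name matches the PR.
fake_miner = types.SimpleNamespace(MAX_FILE_SIZE=100 * MB)


def scan_sizes(sizes):
    """Hypothetical scanner: keep names whose size is within the current limit."""
    kept = []
    for name, size in sizes.items():
        # Look the limit up at call time so a patched value takes effect.
        if size > fake_miner.MAX_FILE_SIZE:
            print(f"WARNING: skipping {name} ({size} bytes)", file=sys.stderr)
            continue
        kept.append(name)
    return kept


def test_warns_on_oversized_file_sketch():
    captured = io.StringIO()
    # Shrink the limit to 1 byte instead of creating a 100 MB fixture.
    with patch.object(fake_miner, "MAX_FILE_SIZE", 1), redirect_stderr(captured):
        kept = scan_sizes({"conversations.json": 2})
    assert kept == []
    assert "conversations.json" in captured.getvalue()
    # patch.object restores the original value when the context exits.
    assert fake_miner.MAX_FILE_SIZE == 100 * MB
```

This only works if the code under test reads the constant through the module attribute at call time; a value bound as a default argument at import time would not see the patch.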

One minor note: the test_scan_default_limit_accepts_typical_claude_ai_export test only checks MAX_FILE_SIZE >= 50 MB, so it would still pass if someone accidentally lowered the limit to 51 MB. It is still useful as a regression guard for the constant.

This closes #602 cleanly. The original report named the 10 MB limit as the issue, and both root causes (the silent skip and the schema mismatch) are addressed.

LGTM. Approving.

@bensig bensig changed the base branch from main to develop April 11, 2026 22:21
@bensig bensig requested a review from igorls as a code owner April 11, 2026 22:21
@igorls igorls added the area/mining File and conversation mining label Apr 14, 2026
igorls commented May 8, 2026

Member

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

@igorls igorls added the needs-rebase PR has merge conflicts with develop and needs rebase label May 8, 2026
Development

Successfully merging this pull request may close these issues.

mine --mode convos silently skips claude.ai exports due to sender/role field mismatch and 10MB file size limit

4 participants