Skip to content

fix: handle large claude.ai exports and multi-conversation "messages" key#676

Open
z3tz3r0 wants to merge 1 commit into
MemPalace:developfrom
z3tz3r0:fix/claude-ai-export-mining
Open

fix: handle large claude.ai exports and multi-conversation "messages" key#676
z3tz3r0 wants to merge 1 commit into
MemPalace:developfrom
z3tz3r0:fix/claude-ai-export-mining

Conversation

@z3tz3r0

@z3tz3r0 z3tz3r0 commented Apr 12, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Root cause 1: MAX_FILE_SIZE in convo_miner.py was 10 MB — claude.ai exports routinely exceed this (21+ MB for active users). Files were silently skipped with zero feedback. Raised to 100 MB and added a warning when files are skipped.
  • Root cause 2: _try_claude_ai_json in normalize.py only detected multi-conversation exports using the "chat_messages" key (privacy export). Standard claude.ai exports use "messages" — these fell through to the flat-messages parser which failed silently (conversation dicts have no "role" at top level), producing 0 drawers.
  • Parser fix: Now checks for both "chat_messages" and "messages" at the conversation object level, and processes each conversation into a separate transcript section instead of concatenating all 844+ conversations into one.
  • Tests: 3 new test cases for multi-conversation parsing ("messages" key, per-conversation separation, short conversation filtering).

Note: #646 was closed via #667, but #667 addresses paginated export/read-back — it does not touch convo_miner.py or the MAX_FILE_SIZE skip, nor the "messages" key mismatch in the parser. The two root causes reported in #646 remain unfixed on main.

Test plan

  • pytest tests/ -v — 592 passed (589 base + 3 new), 0 failed
  • New tests verify: "messages" key parsing, per-conversation separation, short conversation filtering
  • Mine a real claude.ai conversations.json export (> 10 MB) and verify drawers are created per conversation

Addresses #646

… key

Two bugs in claude.ai export mining:

1. MAX_FILE_SIZE was 10 MB — claude.ai conversation exports routinely
   exceed this (21+ MB for active users). Files were silently skipped
   with no warning. Raised to 100 MB and added a warning message when
   files are skipped due to size.

2. _try_claude_ai_json only detected multi-conversation exports when
   conversations used the "chat_messages" key (privacy export format).
   Standard exports use "messages" instead — these fell through to the
   flat-messages parser which failed silently (conversation dicts have
   no "role" key at top level), producing 0 drawers.

   Now checks for both "chat_messages" and "messages" at the conversation
   level, and processes each conversation into a separate transcript
   section instead of concatenating all into one.

Adds 3 tests for multi-conversation parsing.

Addresses MemPalace#646
@igorls

igorls commented May 8, 2026

Copy link
Copy Markdown
Member

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

@igorls igorls added the needs-rebase PR has merge conflicts with develop and needs rebase label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mining File and conversation mining bug Something isn't working needs-rebase PR has merge conflicts with develop and needs rebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants