Skip to content

feat: add Gemini CLI / AI Studio session import support#204

Open
FBISiri wants to merge 1 commit into
MemPalace:developfrom
FBISiri:feat/gemini-cli-import
Open

feat: add Gemini CLI / AI Studio session import support#204
FBISiri wants to merge 1 commit into
MemPalace:developfrom
FBISiri:feat/gemini-cli-import

Conversation

@FBISiri

@FBISiri FBISiri commented Apr 8, 2026

Copy link
Copy Markdown

What does this PR do?

Adds Gemini CLI and Google AI Studio session import support to normalize.py.

MemPalace already has a Gemini CLI integration guide, but normalize.py couldn't parse Gemini session files. This PR fills that gap.

Supported formats

1. Gemini API contents format — used by Gemini CLI session files (~/.gemini/sessions/*.json):

{"contents": [{"role": "user", "parts": [{"text": "..."}]}, ...]}

2. Flat messages list with role: "model" (Gemini convention):

[{"role": "user", "content": "..."}, {"role": "model", "content": "..."}]

Design decisions

  • Disambiguation: requires at least one role: "model" entry to match. This prevents false positives against Claude/ChatGPT formats that use role: "assistant". Placed after ChatGPT parser and before Slack in the detection chain.
  • Multi-part text: Gemini API allows multiple text parts per turn — they're joined with spaces.
  • Non-text parts: inline_data, function_call, etc. are gracefully skipped (only text parts are extracted).
  • Zero new dependencies: pure JSON parsing, consistent with the existing parsers.

How to test

python -m pytest tests/test_normalize.py -v

6 new test cases:

  • test_gemini_contents_format — standard Gemini CLI session
  • test_gemini_flat_messages_format — simplified export
  • test_gemini_multi_part_text — multiple text blocks in one turn
  • test_gemini_skips_non_text_parts — images/functions ignored
  • test_gemini_single_message_returns_none — edge case

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Follows existing code patterns (same style as _try_claude_ai_json, _try_chatgpt_json)

@bgauryy

bgauryy commented Apr 8, 2026

Copy link
Copy Markdown

PR Review: feat: add Gemini CLI / AI Studio session import support

Executive Summary

Aspect Value
PR Goal Add Gemini CLI and Google AI Studio session import support to normalize.py
Files Changed 2 (mempalace/normalize.py, tests/test_normalize.py)
Risk Level 🟡 MEDIUM — parser precedence bug silently drops assistant responses for one layout
Review Effort 2/5 — focused change, clear scope
Recommendation 🔄 REQUEST_CHANGES

Affected Areas: normalize.py parser chain, _try_gemini_json new parser, test suite

Business Impact: Users importing Gemini CLI sessions via {"messages": [...]} wrapper with 2+ user turns would get silently broken transcripts (assistant responses dropped).

Flow Changes: New _try_gemini_json parser inserted into the JSON parser chain between _try_chatgpt_json and _try_slack_json. Parser ordering creates a precedence conflict with _try_claude_ai_json for one of the three documented layouts.

Ratings

Aspect Score
Correctness 3/5
Security 5/5
Performance 5/5
Maintainability 4/5

PR Health

  • Has clear description
  • References ticket/issue (if applicable)
  • Appropriate size (or justified if large)
  • Has relevant tests (if applicable)

High Priority Issues

(Must fix before merge)

🐛 #1: Parser precedence bug — _try_claude_ai_json silently claims Gemini {"messages": [...]} data

Location: mempalace/normalize.py:63 | Confidence: ✅ HIGH

_try_claude_ai_json runs before _try_gemini_json in the parser chain. When Gemini data uses the {"messages": [...]} wrapper (Layout 2a), _try_claude_ai_json extracts the list via data.get("messages", ...), captures all role="user" entries, but silently skips role="model" entries (not in its accepted roles: user, human, assistant, ai). If there are 2+ user messages, it returns a broken transcript with only user turns — the Gemini parser never executes.

Trace for input {"messages": [{"role": "user", "content": "Q1"}, {"role": "model", "content": "A1"}, {"role": "user", "content": "Q2"}, {"role": "model", "content": "A2"}]}:

  1. _try_claude_ai_json → extracts messages list → captures ("user", "Q1"), skips model, captures ("user", "Q2"), skips model
  2. len(messages) = 2 >= 2returns broken transcript: > Q1\n\n> Q2\n
  3. _try_gemini_json never runs

Fix (choose one):

Option A — Move Gemini parser before Claude AI in the chain (minimal change, safe due to has_model_role guard):

- for parser in (_try_claude_ai_json, _try_chatgpt_json, _try_gemini_json, _try_slack_json):
+ for parser in (_try_gemini_json, _try_claude_ai_json, _try_chatgpt_json, _try_slack_json):

Option B — Add assistant-role guard to _try_claude_ai_json (more defensive, prevents the class of bug):

+ has_assistant = any(r == "assistant" for r, _ in messages)
- if len(messages) >= 2:
+ if len(messages) >= 2 and has_assistant:
      return _messages_to_transcript(messages)

Option A is recommended — _try_gemini_json already has its own has_model_role discriminator that prevents false positives against Claude/ChatGPT data.


🔗 #2: Missing test for Layout 2a — the exact layout affected by bug #1

Location: tests/test_normalize.py (new tests) | Confidence: ✅ HIGH

The PR documents three supported layouts but only tests two:

  • ✅ Layout 1: {"contents": [...]} — tested in test_gemini_contents_format
  • ❌ Layout 2a: {"messages": [...]}NOT tested
  • ✅ Layout 2b: flat list [...] — tested in test_gemini_flat_messages_format

The missing test is for the exact layout affected by bug #1. Adding this test would have caught the parser precedence issue.

def test_gemini_messages_wrapper_format():
    """Gemini data wrapped in {"messages": [...]} should parse correctly."""
    data = {
        "messages": [
            {"role": "user", "content": "What is Python?"},
            {"role": "model", "content": "A programming language."},
            {"role": "user", "content": "And Java?"},
            {"role": "model", "content": "Also a programming language."},
        ]
    }
    # write to tmp file, normalize, assert both user and model content present

Medium Priority Issues

(Should fix, not blocking)

🔄 #3: Test file will have merge conflicts with current main

Location: tests/test_normalize.py | Confidence: ✅ HIGH

The PR's test diff targets an older base where tests/test_normalize.py has bare functions (test_empty()) with tempfile/os.unlink. Current main has been restructured into pytest classes (TestNormalizePlainText, TestClaudeAiJson, etc.) using the tmp_path fixture. The new tests will conflict on merge.

The new tests should be rebased onto current main and restructured as a TestGeminiJson class using tmp_path:

class TestGeminiJson:
    def test_contents_format(self, tmp_path):
        data = {"contents": [
            {"role": "user", "parts": [{"text": "Capital of France?"}]},
            {"role": "model", "parts": [{"text": "Paris."}]},
        ]}
        f = tmp_path / "gemini.json"
        f.write_text(json.dumps(data))
        result = normalize(str(f))
        assert "> Capital of France?" in result
        assert "Paris." in result

🎨 #4: Tests use tempfile/os.unlink instead of pytest tmp_path

Location: tests/test_normalize.py (all new test functions) | Confidence: ✅ HIGH

All 5 new tests use tempfile.NamedTemporaryFile with manual os.unlink(f.name). The codebase convention (every existing test) uses pytest's tmp_path fixture. The tempfile pattern has a cleanup risk: if an assertion fails before os.unlink(), the temp file leaks. tmp_path handles cleanup automatically.

Related to #3 — fixing this during rebase would address both issues.


Low Priority Issues

(Nice to have)

🎨 #5: elif not text: is always true — simplify to else:

Location: mempalace/normalize.py:+279 (in _try_gemini_json) | Confidence: ✅ HIGH

text = ""
parts = item.get("parts")
if isinstance(parts, list):
    # ... sets text from parts ...
elif not text:  # ← always True here since text="" and parts wasn't a list
    text = _extract_content(item.get("content", ""))

Since text = "" is assigned just above and the elif branch only executes when isinstance(parts, list) is False, not text is always True. This should be else: for clarity.

- elif not text:
+ else:
      text = _extract_content(item.get("content", ""))

🎨 #6: Module docstring mentions "Gemini CLI" but not "Google AI Studio"

Location: mempalace/normalize.py:10 | Confidence: ✅ HIGH

The PR title and body say this supports both "Gemini CLI" and "Google AI Studio" sessions. The docstring update only adds "Gemini CLI JSON sessions". Consider:

-    - Gemini CLI JSON sessions
+    - Gemini CLI / Google AI Studio JSON sessions

Flow Impact Analysis

normalize()
  └─ _try_normalize_json(content)
       ├─ _try_claude_code_jsonl(content)    # JSONL, runs on raw string
       ├─ json.loads(content) → data
       ├─ _try_claude_ai_json(data)          # ⚠️ Claims {"messages":[...]} with 2+ user msgs
       ├─ _try_chatgpt_json(data)            # Requires "mapping" key — no conflict
       ├─ _try_gemini_json(data)             # NEW — handles "contents", "messages", list
       └─ _try_slack_json(data)              # Requires type="message" items — no conflict

Conflict zone: _try_claude_ai_json and _try_gemini_json both handle {"messages": [...]} and top-level lists. The has_model_role guard in the Gemini parser is correct, but it never gets a chance to run when the Claude parser claims the data first.

Safe layouts (no precedence issue):

  • {"contents": [...]} — Claude parser doesn't check this key
  • [{"role": "user", ...}, {"role": "model", ...}] (flat, ≤1 user msg) — Claude parser returns None (< 2 messages)

Unsafe layout:

  • {"messages": [{"role": "user", ...}, {"role": "model", ...}, {"role": "user", ...}, ...]} with 2+ user messages

Created by Octocode MCP https://octocode.ai 🔍🐙

@web3guru888 web3guru888 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of #204feat: add Gemini CLI / AI Studio session import support

Scope: +150/−1 · 2 file(s)

  • mempalace/normalize.py (modified: +68/−1)
  • tests/test_normalize.py (modified: +82/−0)

Strengths

  • ✅ Includes test coverage

🟢 Approved — clean, well-structured PR. Good work @FBISiri!


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

@bensig bensig changed the base branch from main to develop April 11, 2026 22:23
@igorls igorls added area/cli CLI commands area/mining File and conversation mining enhancement New feature or request labels Apr 14, 2026
@igorls

igorls commented May 8, 2026

Copy link
Copy Markdown
Member

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

@igorls igorls added the needs-rebase PR has merge conflicts with develop and needs rebase label May 8, 2026
Adds _try_gemini_json parser to normalize.py for three layouts:

  1. Gemini API contents format (~/.gemini/sessions/*.json):
     {"contents": [{"role": "user", "parts": [{"text": "..."}]}, ...]}
  2. Messages-wrapper variant:
     {"messages": [{"role": "user", ...}, {"role": "model", ...}]}
  3. Flat top-level list with role="model".

This complements the existing _try_gemini_jsonl parser (which handles
~/.gemini/tmp/<hash>/chats/session-*.jsonl with session_metadata
sentinel) — JSONL covers Gemini CLI runtime sessions, JSON covers
exported / Studio-saved transcripts.

## Review feedback addressed (PR MemPalace#204)

bgauryy review:
- MemPalace#1 Parser-precedence bug: _try_gemini_json runs *before*
  _try_claude_ai_json so the {"messages":[..., role=model, ...]}
  layout is no longer silently claimed by the Claude parser. The
  Gemini parser's has_model_role guard prevents false-positives
  against Claude / ChatGPT data.
- MemPalace#2 Layout 2a coverage: TestGeminiJson.test_messages_wrapper_format
  + test_messages_wrapper_does_not_get_claimed_by_claude pin the
  fix in place.
- MemPalace#3 Test conflicts with current main: rebased onto develop;
  tests restructured into TestGeminiJson class.
- MemPalace#4 tempfile/os.unlink → pytest tmp_path everywhere.
- MemPalace#5 elif not text → else (the elif branch was dead).
- MemPalace#6 Module docstring updated to mention Google AI Studio.

Tests: 9 new cases in TestGeminiJson covering all three layouts,
multi-part text joining, non-text part skipping, has_model_role
disambiguation, dispatch-chain regression for review MemPalace#1.
@FBISiri FBISiri force-pushed the feat/gemini-cli-import branch from 1ed408a to a9a2c35 Compare May 9, 2026 06:05
@FBISiri

FBISiri commented May 9, 2026

Copy link
Copy Markdown
Author

@bgauryy thanks for the thorough review — all six points addressed in the rebase (a9a2c35).

High priority

  • Congratulations Milla #1 Parser precedence bug: went with Option A — _try_gemini_json now runs before _try_claude_ai_json in the dispatch chain. The Gemini parser's has_model_role guard (returns None if no role="model" entry is seen) prevents false-positives against Claude / ChatGPT data, so the reordering is safe.
  • Integrating MemPalace with SoulForge's code intelligence system #2 Layout 2a coverage: added two regression tests — test_messages_wrapper_format (unit-level) and test_messages_wrapper_does_not_get_claimed_by_claude (full normalize() pipeline). Both fail without the precedence fix.

Medium priority

  • feat: Hermes memory provider integration #3 Test conflicts with current develop: rebased the branch onto current develop. develop has since gained its own _try_gemini_jsonl (for ~/.gemini/tmp/.../chats/session-*.jsonl runtime sessions) — that's a complementary parser to the JSON one this PR adds. Coexistence intentional and called out in the docstring.
  • Using AAAK as language for agents #4 tempfile/os.unlinktmp_path: all new tests now use the tmp_path fixture inside a TestGeminiJson pytest class, matching the existing convention.

Low priority

tests/test_normalize.py runs clean locally (126 passed, including the 9 new TestGeminiJson cases). PR is back to mergeable — git pull should be a fast-forward.

@igorls heads up — rebased and force-pushed onto current develop per your earlier ask.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/cli CLI commands area/mining File and conversation mining enhancement New feature or request needs-rebase PR has merge conflicts with develop and needs rebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants