Skip to content

feat: Adds a miner mode for git log entries.#98

Open
MrDys wants to merge 1 commit into
MemPalace:developfrom
MrDys:feature-git-log
Open

feat: Adds a miner mode for git log entries.#98
MrDys wants to merge 1 commit into
MemPalace:developfrom
MrDys:feature-git-log

Conversation

@MrDys

@MrDys MrDys commented Apr 7, 2026

Copy link
Copy Markdown

What does this PR do?

This adds support for mining git log metadata (commit messages, hashes, file paths, etc.):

  • Runs git log with structured format output (no Python git libraries — stays zero-dependency)
  • One commit = one drawer, containing: subject + body + files changed
  • Room detection is hybrid: file-path matching against mempalace.yaml rooms first, then keyword scoring on commit message
  • Dedup via synthetic git://# URIs stored as source_file
  • Git-specific metadata preserved: commit_hash, commit_author, commit_date, files_changed
  • Supports --extract general to classify commits via the existing general extractor (DECISION/MILESTONE/PROBLEM/etc.)

How to test

  1. mempalace mine . --mode git-log --dry-run — should list commits with detected rooms
  2. mempalace mine . --mode git-log --dry-run --since 2025-01-01 — date filtering
  3. mempalace mine . --mode git-log — actually store, then mempalace search "some commit message" to verify retrieval
  4. mempalace status — should show git-log drawers in the counts

Checklist

  • Tests pass (python -m pytest tests/ -v)
  • No hardcoded paths
  • Linter passes (ruff check .)

@bgauryy

bgauryy commented Apr 8, 2026

Copy link
Copy Markdown

PR Review: feat: Adds a miner mode for git log entries

Executive Summary

Aspect Value
PR Goal Add --mode git-log to mine git commit history into the palace
Files Changed 3 (+654 / -3)
Risk Level 🟡 MEDIUM - Functional feature but bypasses core architectural rules
Review Effort 3/5
Recommendation 🔄 REQUEST_CHANGES

Affected Areas: mempalace/git_miner.py (new), mempalace/cli.py (integration), tests/test_git_miner.py (new)

Business Impact: Unlocks a new ingest source — git history becomes searchable memory. Good feature direction.

Flow Changes: New --mode git-log branch in cmd_mine(). Adds --since, --until, --branch CLI args (visible globally, not scoped to git-log mode).

Ratings

Aspect Score
Correctness 3/5
Security 4/5
Performance 4/5
Maintainability 2/5

PR Health

  • Has clear description
  • References ticket/issue (if applicable)
  • Appropriate size (or justified if large)
  • Has relevant tests (if applicable)

High Priority Issues

(Must fix before merge)

🏗️ #1: Direct chromadb import violates project architecture rule

Location: mempalace/git_miner.py:19 | Confidence: ✅ HIGH

The project has an explicit rule: "All ChromaDB access goes through palace_db.py — never import chromadb directly elsewhere." Currently only palace_db.py imports chromadb. Every other module (cli.py, convo_miner.py, miner.py, layers.py, mcp_server.py, palace_graph.py) uses from .palace_db import get_collection. This PR adds a second direct import and re-implements get_collection() locally (lines 255-261), creating a parallel ChromaDB access path that bypasses the singleton client, collection caching, and any future centralized logic.

- import chromadb
-
- def get_collection(palace_path):
-     os.makedirs(palace_path, exist_ok=True)
-     client = chromadb.PersistentClient(path=palace_path)
-     try:
-         return client.get_collection("mempalace_drawers")
-     except Exception:
-         return client.create_collection("mempalace_drawers")
+ from .palace_db import get_collection, file_already_mined

🐛 #2: sys.exit(1) in library code kills MCP server and breaks testability

Location: mempalace/git_miner.py:117,121,128 | Confidence: ✅ HIGH

run_git_log() calls sys.exit(1) in three error paths (git not found, timeout, git log failure). This is fine for a CLI-only script but mine_git_log is imported by cli.py and could be called from the MCP server or tests. A sys.exit() in a library function terminates the entire process. The existing miners (miner.py, convo_miner.py) don't use sys.exit in their core logic — they raise exceptions or return empty.

- except FileNotFoundError:
-     print("  ERROR: git not found. Is git installed?")
-     sys.exit(1)
- except subprocess.TimeoutExpired:
-     print("  ERROR: git log timed out (60s). Try --limit or --since to narrow the range.")
-     sys.exit(1)
+ except FileNotFoundError:
+     raise FileNotFoundError("git not found. Is git installed?")
+ except subprocess.TimeoutExpired:
+     raise TimeoutError("git log timed out (60s). Try --limit or --since to narrow the range.")

(Same pattern for the returncode != 0 case — raise RuntimeError instead of sys.exit.)


Medium Priority Issues

(Should fix, not blocking)

🔗 #3: commit_already_mined duplicates file_already_mined from palace_db.py

Location: mempalace/git_miner.py:264-269 | Confidence: ✅ HIGH

The new commit_already_mined(collection, source_uri) function is functionally identical to palace_db.file_already_mined(collection, source_file) — both do collection.get(where={"source_file": ...}, limit=1) and check if results are non-empty. Both miner.py and convo_miner.py import file_already_mined from palace_db.

- def commit_already_mined(collection, source_uri):
-     try:
-         results = collection.get(where={"source_file": source_uri}, limit=1)
-         return len(results.get("ids", [])) > 0
-     except Exception:
-         return False
+ # Use: from .palace_db import file_already_mined
+ # Then call: file_already_mined(collection, source_uri)

🔗 #4: TOPIC_KEYWORDS copy-pasted from convo_miner.py

Location: mempalace/git_miner.py:26-77 | Confidence: ✅ HIGH

50+ lines of keyword dictionaries are duplicated verbatim. The code comment even acknowledges this: "reused from convo_miner's TOPIC_KEYWORDS". If keywords are updated in one place, the other gets stale. Should be extracted to constants.py (where project constants live per AGENTS.md) or imported directly.

- # Keyword sets for room detection (reused from convo_miner's TOPIC_KEYWORDS)
- TOPIC_KEYWORDS = { ... }  # 50+ lines
+ from .constants import TOPIC_KEYWORDS  # or from .convo_miner import TOPIC_KEYWORDS

🎨 #5: Tests use tempfile.mkdtemp() instead of tmp_path fixture

Location: tests/test_git_miner.py:136,150,166,179 | Confidence: ✅ HIGH

All test functions create temp directories manually with tempfile.mkdtemp() and clean up with shutil.rmtree in try/finally. AGENTS.md states: "All tests use tmp_path — no home directory pollution". The conftest.py provides fixtures for this. Using tmp_path also means pytest handles cleanup automatically, even if tests crash.

- def test_mine_git_log_dry_run():
-     tmpdir = tempfile.mkdtemp()
-     try:
-         _make_git_repo(tmpdir)
-         palace_path = os.path.join(tmpdir, "palace")
-         mine_git_log(tmpdir, palace_path, dry_run=True)
-         assert not os.path.exists(palace_path)
-     finally:
-         shutil.rmtree(tmpdir)
+ def test_mine_git_log_dry_run(tmp_path):
+     _make_git_repo(str(tmp_path / "repo"))
+     palace_path = str(tmp_path / "palace")
+     mine_git_log(str(tmp_path / "repo"), palace_path, dry_run=True)
+     assert not os.path.exists(palace_path)

Low Priority Issues

(Nice to have)

🎨 #6: --since, --until, --branch args visible for all modes

Location: mempalace/cli.py:318-335 | Confidence: ⚠️ MED

These three CLI arguments are only meaningful for --mode git-log but are always visible in --help and accepted silently for projects/convos modes. Minor UX confusion — could add a note in help text or validate that they're only used with git-log mode.


🎨 #7: MD5 for drawer ID generation

Location: mempalace/git_miner.py:366 | Confidence: ⚠️ MED

Uses hashlib.md5() for generating drawer IDs. Not a security concern (IDs, not crypto), but MD5 has known collision weaknesses. Consider hashlib.sha256() for new code — same performance for short inputs, no collision risk discussion.


Created by Octocode MCP https://octocode.ai 🔍🐙

@web3guru888 web3guru888 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review of #98feat: Adds a miner mode for git log entries.

Scope: +654/−3 · 3 file(s)

  • mempalace/cli.py (modified: +33/−3)
  • mempalace/git_miner.py (added: +429/−0)
  • tests/test_git_miner.py (added: +192/−0)

Technical Analysis

  • 🪟 Windows compatibility — verify path handling works cross-platform

Suggestions

  • Magic number(s) 2025 — consider extracting to named constant(s)

Strengths

  • ✅ Includes test coverage

🟢 Approved — clean, well-structured PR. Good work @MrDys!


🏛️ Reviewed by MemPalace-AGI · Autonomous research system with perfect memory · Showcase: Truth Palace of Atlantis

@bensig bensig changed the base branch from main to develop April 11, 2026 22:23
@igorls igorls added enhancement New feature or request area/mining File and conversation mining labels Apr 14, 2026
bensig added a commit that referenced this pull request Apr 18, 2026
Draft plugin specification for source adapters, mirroring RFC 001's
role for storage backends. Formalizes the contract six community
ingester PRs (#274, #23, #169, #232, #567, #98, #702) plus #981's
metadata-only mode have been reinventing ad-hoc, so adapter authors
can build to a stable surface.

Key decisions:
- Single ingest() method; lazy adapters yield SourceItemMetadata
  ahead of drawers, eager adapters interleave
- Declared-transformation model (§1.4) replaces informal verbatim
  promise with a verifiable one; byte_preserving adapters declare
  the empty set, declared_lossy adapters enumerate. Existing
  miner.py and the convo_miner+normalize pipeline map cleanly
- Palace is the incremental cursor via is_current(item, metadata);
  no sidecar persistence
- Routing is adapter-owned; detect_room/detect_hall move into the
  filesystem adapter
- Flat metadata per ChromaDB (RFC 001 §1.4) — entity hints as
  json_string field, KG triples route to SQLite knowledge graph
- Closets stay core-built as a post-step; adapters may emit flat
  closet_hints. Closes existing gap where convo drawers get no
  closets
- No per-drawer field renames: source_file, filed_at, source_mtime,
  added_by, normalize_version, entities, ingest_mode all preserved.
  Spec adds adapter_name, adapter_version, privacy_class

§9 enumerates the cleanup PR prerequisites (mempalace/sources/
module, PalaceContext facade, KnowledgeGraph.add_triple gaining
backwards-compatible source_drawer_id + adapter_name params).

Tracking issue: #989
@igorls

igorls commented May 8, 2026

Copy link
Copy Markdown
Member

Hi, thanks for the contribution.

This PR has merge conflicts with develop, and the branch has not been updated in over 7 days, which puts it before our most recent release. The conflicts are likely against work that landed in that release.

Could you rebase onto develop so we can take another look?

If this change is no longer relevant, feel free to close the PR.

(This message is part of a periodic backlog pass, sent to all open PRs that match this state.)

@igorls igorls added the needs-rebase PR has merge conflicts with develop and needs rebase label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mining File and conversation mining enhancement New feature or request needs-rebase PR has merge conflicts with develop and needs rebase

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants