Skip to content

[RFC] Add MemoryProvider.health_check() ABC method for doctor-visible reachability probes (#36 follow-up) #42

@PowerCreek

Description

@PowerCreek

Background

After auditing every shipped memory provider's is_available() implementation (see #36 + PR #37), the picture is:

This RFC proposes a path to symmetric coverage.

Proposed addition to agent/memory_provider.py

class MemoryProvider(ABC):
    # ... existing methods ...

    def health_check(self) -> tuple[bool, str]:
        """Probe the backing service and report (healthy, reason).

        Default implementation delegates to ``is_available()`` — for
        provider implementations that haven't yet defined a real
        probe, the contract is preserved: the (False, "is_available
        returned False") case still fires when the basic config is
        missing.

        Implementations that CAN probe should override:

          * Return (True, "") when the backend responds successfully
            within a reasonable timeout (default ~5 seconds).
          * Return (False, "<one-line reason>") for any failure.
            Reason should distinguish auth vs network so the operator
            knows which fix to apply (rotate key vs wait / check
            connectivity). Strings like "auth: API key rejected",
            "unreachable: connection refused", "not_found: endpoint
            404", "rate_limited: 429" are good prefixes.

        Must never raise — doctor invokes this in a loop and a
        raised exception would crash the section.
        """
        try:
            return (self.is_available(), "" if self.is_available()
                                          else "is_available() returned False")
        except Exception as exc:  # noqa: BLE001
            return (False, f"is_available() raised: {exc}")

Doctor consumption (in hermes_cli/doctor.py)

The generic-memory else-branch becomes:

else:
    try:
        from plugins.memory import load_memory_provider
        _provider = load_memory_provider(_active_memory_provider)
        if not _provider:
            check_warn(f"{_active_memory_provider} plugin not found",
                       "run: hermes memory setup")
        else:
            healthy, reason = _provider.health_check()
            if healthy:
                check_ok(f"{_active_memory_provider} reachable")
            else:
                # Classify reason prefix → use existing _classify
                # logic (e.g. utils.classify_http_error analog for
                # message-prefix taxonomy).
                if reason.startswith("auth:"):
                    _fail_and_issue(
                        f"{_active_memory_provider} auth rejected",
                        reason, ...)
                elif reason.startswith("unreachable:"):
                    check_warn(
                        f"{_active_memory_provider} unreachable",
                        reason)
                else:
                    check_warn(
                        f"{_active_memory_provider} health check failed",
                        reason)
    except Exception as _e:
        check_warn(f"{_active_memory_provider} check failed", str(_e))

The Honcho + Mem0 specific branches (lines ~2200-2310 in doctor.py today) collapse into this generic path once both plugins implement health_check() natively. That removes ~120 lines of provider-specific doctor logic.

Provider migration sketch

Provider Probe call
Honcho get_honcho_client(hcfg) (already validates)
Mem0 MemoryClient(api_key).get_all(user_id, limit=1) (already in #35)
openviking httpx.get(endpoint + "/health") if endpoint has a health endpoint; otherwise httpx.head(endpoint)
supermemory __import__("supermemory"); client.search("ping", limit=1) if a search endpoint exists
hindsight depends on mode — for cloud, a small GET /health; for local, just check the runtime path
byterover subprocess.run([brv_path, "status"], timeout=3)
retaindb httpx.get(base_url + "/health") with bearer
holographic Trivial — local SQLite, always (True, "")

Each provider owns its probe; doctor stays generic.

Out of scope for this RFC

  • Async health checks. The probe runs once when hermes doctor invokes it; sync semantics keep the call sites simple. Mem0 + Honcho already use sync clients.
  • Caching results. Doctor is a one-shot diagnostic; cache invalidation isn't worth the complexity.
  • Removing is_available(). Backward-compat: is_available() stays the cheap pre-flight check (no network calls); health_check() is the deeper probe. Many call sites in the codebase already use is_available() in hot paths.

Status

This is an RFC — design only, no PR. Comments / counter-proposals welcome. If the shape is accepted, the plan is:

  1. Ship the ABC method with the default is_available() delegation (one commit, no behavior change).
  2. Migrate Honcho + Mem0 to implement health_check() natively. Collapse the provider-specific doctor branches.
  3. Migrate openviking + supermemory + hindsight + byterover + retaindb + holographic, one at a time. Each PR is small.
  4. Close hermes doctor: generic memory provider check overstates is_available() — "X provider active" implies reachability that wasn't probed #36 when the migration is complete.

Filed by hermes-maintainer (PowerCreek). Continues #36; does not close it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions