feat: track embedding model costs in budget system #997

@Aureliolo

Description

Context

The budget system (CostRecord, CostTracker, BudgetEnforcer) tracks LLM completion API calls -- input/output tokens, cost per call, categorized as productive/coordination/system. Embedding model calls, however, are completely invisible to the budget system.

Embedding calls happen inside the Mem0 SDK on every store() (the content is embedded before the vector is stored) and retrieve() (the query is embedded for similarity search). For cloud embedding APIs, each call has a cost that currently goes untracked. For local models (Ollama), the compute is free, but the calls are still worth tracking for observability.

This gap will become more significant as embedding usage grows.

Requirements

1. Embedding cost tracking

  • Add an EMBEDDING category or separate cost type to distinguish embedding calls from LLM completion calls
  • Instrument Mem0 SDK calls (or wrap them) to capture per-call metrics: model, token count (input only -- embeddings have no output tokens), cost
  • Record embedding costs as CostRecord entries (or a new EmbeddingCostRecord if the schema diverges too much)
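A minimal sketch of what an EMBEDDING category could look like -- the field names here are assumptions, not the project's actual CostRecord schema:

```python
from dataclasses import dataclass
from enum import Enum

class CostCategory(Enum):
    PRODUCTIVE = "productive"
    COORDINATION = "coordination"
    SYSTEM = "system"
    EMBEDDING = "embedding"  # new: distinguishes embedding calls from completions

@dataclass
class CostRecord:
    agent_id: str
    model: str
    category: CostCategory
    input_tokens: int
    output_tokens: int  # always 0 for embedding calls
    cost_usd: float

# Example: one embedding call, assuming $0.02 per 1M input tokens
record = CostRecord(
    agent_id="agent-7",
    model="text-embedding-3-small",
    category=CostCategory.EMBEDDING,
    input_tokens=512,
    output_tokens=0,
    cost_usd=512 / 1_000_000 * 0.02,
)
```

Keeping output_tokens in the schema (fixed at 0) avoids a separate EmbeddingCostRecord as long as nothing else diverges.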

2. Budget enforcement integration

  • Include embedding costs in budget totals and per-agent spend
  • Evaluate whether embedding calls should be gated by budget enforcement (currently only LLM completions are gated)
  • Consider: embedding is on the critical path for memory store/retrieve -- budget-gating it could break memory entirely
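One way to square that tension is soft enforcement: embedding costs count toward totals, but an overspend warns rather than blocks, since blocking would break memory. A sketch with a stand-in tracker (names assumed):

```python
class CostTracker:
    """Minimal stand-in for the project's CostTracker (interface assumed)."""
    def __init__(self):
        self._costs = []

    def add(self, cost_usd):
        self._costs.append(cost_usd)

    def total_spend(self):
        return sum(self._costs)

def record_embedding_cost(tracker, cost_usd, budget_limit_usd):
    """Record an embedding cost without gating the call itself.

    Embedding sits on the memory store/retrieve critical path, so an
    overspend logs a warning instead of raising or blocking the call.
    """
    tracker.add(cost_usd)
    total = tracker.total_spend()
    if total > budget_limit_usd:
        print(f"warning: embedding spend pushed total to ${total:.4f}")
    return total
```

LLM completions would keep their existing hard gate; only the embedding path takes this warn-only branch.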

3. Dashboard visibility

4. Fine-tuning cost tracking (#966 follow-up)

  • Stage 1 (synthetic data generation) makes LLM API calls -- these should flow through the provider system and get tracked as SYSTEM category costs
  • Stage 3 (GPU training) is compute-only, not an API cost -- consider whether to track duration/resource usage separately
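Since Stage 3 is compute-only, its usage fits a duration record rather than a CostRecord. A hypothetical sketch (all names invented here):

```python
import time
from dataclasses import dataclass

@dataclass
class TrainingUsageRecord:
    """Hypothetical record for compute-only fine-tuning stages."""
    run_id: str
    wall_seconds: float = 0.0

class TimedRun:
    """Context manager that captures wall-clock duration of a training run."""
    def __init__(self, run_id):
        self.record = TrainingUsageRecord(run_id)

    def __enter__(self):
        self._t0 = time.perf_counter()
        return self.record

    def __exit__(self, *exc):
        self.record.wall_seconds = time.perf_counter() - self._t0
```

GPU-seconds or peak memory could be added as fields later without touching the budget schema.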

Design Considerations

  • Mem0 SDK calls the embedding provider directly -- SynthOrg doesn't intercept these calls. Options:
    • Wrap the Mem0 client with a proxy that intercepts embedding calls
    • Use Mem0's callback/hook system if available
    • Estimate costs from known model pricing + input text length
  • For local models (Ollama), cost is zero but call count and latency are still useful metrics
  • CostRecord currently requires output_tokens >= 0 -- embedding calls have zero output tokens (just the vector), so this should work as-is
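The first and third options above can be combined: a thin proxy around the Mem0 client that estimates cost from pricing and input length before delegating. This sketch assumes the client exposes add() and search(), uses an assumed price table, and approximates tokens as len(text) // 4 -- exact counts would need the embedder's tokenizer:

```python
# Assumed pricing, USD per 1M input tokens
PRICE_PER_MTOKEN = {"text-embedding-3-small": 0.02}

class CostTrackingMemory:
    """Proxy around a Mem0-style client that estimates embedding cost per call."""

    def __init__(self, client, model, on_cost):
        self._client = client
        self._model = model
        self._on_cost = on_cost  # callback receiving (tokens, cost_usd)

    def _estimate(self, text):
        # Rough heuristic: ~4 characters per token
        tokens = max(1, len(text) // 4)
        cost = tokens / 1_000_000 * PRICE_PER_MTOKEN.get(self._model, 0.0)
        self._on_cost(tokens, cost)

    def add(self, text, **kwargs):
        self._estimate(text)  # store() embeds the content
        return self._client.add(text, **kwargs)

    def search(self, query, **kwargs):
        self._estimate(query)  # retrieve() embeds the query
        return self._client.search(query, **kwargs)
```

For Ollama the price lookup falls through to 0.0, so call counts still flow to the callback while cost stays zero.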

Metadata


Labels

  • prio:medium -- Should do, but not blocking
  • scope:medium -- 1-3 days of work
  • spec:memory -- DESIGN_SPEC Section 7 - Memory & Persistence
  • type:feature -- New feature implementation
  • v0.7 -- Minor version v0.7
  • v0.7.2 -- Patch release v0.7.2
