Observability for Ephemeral Sandboxes

Inspiration

We’re building a startup in the same spirit as Lovable — but instead of end-user apps, we focus on internal business tools. Our core hypothesis is that the future of internal tooling is ephemeral: short-lived, on-demand sandboxes that spin up when a user needs them and disappear when they’re done.

As our platform grew, this hypothesis started becoming very real.

At peak usage, we were running between 100 and 200 ephemeral containers at a time. Users would suddenly report issues like:

  • “My sandbox froze”
  • “This step keeps failing”
  • “It worked yesterday but not today”

The problem?
We had no idea what was actually going wrong inside those sandboxes.

Traditional logging and monitoring broke down completely in an ephemeral world. By the time we tried to debug, the container was often already gone.

That pain is what inspired this project.


What We Built

We built an observability dashboard specifically designed for ephemeral sandboxes.

Each sandbox integrates with Sentry, but instead of treating sandboxes as isolated apps, we aggregate everything into a single, centralized view that gives us:

  • Real-time error visibility across all active sandboxes
  • Insight into sandbox state (running, degraded, failing)
  • Resource usage signals (CPU / memory pressure)
  • A historical trail of failures, even after a sandbox is destroyed

This lets us move from reactive debugging to proactive support.

In many cases, we can now reach out to a user before they even report an issue, because we already see their sandbox failing.
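The sandbox state signal (running, degraded, failing) could be derived from a few per-sandbox measurements. The following is a minimal sketch under stated assumptions, not our production logic: the `SandboxSample` fields and every threshold are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SandboxState(Enum):
    RUNNING = "running"
    DEGRADED = "degraded"
    FAILING = "failing"


@dataclass
class SandboxSample:
    """One snapshot of a sandbox (hypothetical fields)."""
    errors_last_minute: int   # error events seen in the last 60 seconds
    cpu_fraction: float       # share of the CPU limit in use, 0.0 to 1.0
    memory_fraction: float    # share of the memory limit in use, 0.0 to 1.0


def classify(sample: SandboxSample) -> SandboxState:
    """Map a snapshot to a coarse state. Thresholds are illustrative, not tuned."""
    if sample.errors_last_minute >= 5 or sample.memory_fraction >= 0.95:
        return SandboxState.FAILING
    if (
        sample.errors_last_minute >= 1
        or sample.cpu_fraction >= 0.85
        or sample.memory_fraction >= 0.80
    ):
        return SandboxState.DEGRADED
    return SandboxState.RUNNING
```

A coarse three-way state like this is what makes proactive outreach practical: a support engineer can sort by state instead of reading individual events.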


How We Built It

At a high level, the system works like this:

  1. Lightweight instrumentation inside each sandbox
    Every ephemeral container initializes a minimal observability layer on startup.

  2. Shared Sentry ingestion
    Instead of generating a unique Sentry project or DSN per sandbox, we send all events through shared ingestion and tag each one with:

    • sandbox ID
    • user / workspace ID
    • lifecycle metadata (creation time, TTL, shutdown reason)

  3. Central aggregation layer
    We pull error events, performance data, and breadcrumbs into a unified backend that understands sandbox lifecycles, not just applications.

  4. Observability dashboard
    The dashboard shows:

    • Active vs terminated sandboxes
    • Error rates per sandbox
    • Memory-heavy or CPU-intensive workloads
    • Failure patterns across users and actions

This gives us something traditional APM tools don’t: context about short-lived compute.
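To make steps 2 and 3 concrete, here is a minimal sketch of tagging events with sandbox metadata and folding them into lifecycle-aware records. It uses only the standard library. The event shape, `tag_event`, `SandboxRecord`, and `Aggregator` are hypothetical names for illustration and do not reflect Sentry's payload format or our actual backend.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


def tag_event(message, level, sandbox_id, workspace_id, created_at, ttl_seconds):
    """Build an event dict carrying the sandbox tags (hypothetical shape)."""
    return {
        "message": message,
        "level": level,
        "timestamp": time.time(),
        "tags": {
            "sandbox_id": sandbox_id,
            "workspace_id": workspace_id,
            "created_at": created_at,
            "ttl_seconds": ttl_seconds,
        },
    }


@dataclass
class SandboxRecord:
    sandbox_id: str
    workspace_id: str
    created_at: float
    ttl_seconds: int
    shutdown_reason: Optional[str] = None  # None while the sandbox is alive
    events: List[dict] = field(default_factory=list)

    @property
    def error_count(self) -> int:
        return sum(1 for e in self.events if e["level"] == "error")


class Aggregator:
    """Groups events by sandbox and keeps them after the sandbox is gone."""

    def __init__(self) -> None:
        self.records: Dict[str, SandboxRecord] = {}

    def ingest(self, event: dict) -> None:
        tags = event["tags"]
        record = self.records.get(tags["sandbox_id"])
        if record is None:
            # First event from this sandbox: create its lifecycle record.
            record = SandboxRecord(
                sandbox_id=tags["sandbox_id"],
                workspace_id=tags["workspace_id"],
                created_at=tags["created_at"],
                ttl_seconds=tags["ttl_seconds"],
            )
            self.records[record.sandbox_id] = record
        record.events.append(event)

    def mark_terminated(self, sandbox_id: str, reason: str) -> None:
        # The record is kept, so the failure trail survives the container.
        self.records[sandbox_id].shutdown_reason = reason

    def active(self) -> List[SandboxRecord]:
        return [r for r in self.records.values() if r.shutdown_reason is None]

    def terminated(self) -> List[SandboxRecord]:
        return [r for r in self.records.values() if r.shutdown_reason is not None]
```

The key design choice in this sketch is that termination updates a record rather than deleting it, which is what lets the dashboard show both active and terminated sandboxes from the same store.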


Challenges We Faced

1. Ephemeral lifetimes

Sandboxes can live for minutes — sometimes seconds. Any observability system that assumes long-running processes simply doesn’t work here.

We had to design everything to:

  • initialize fast
  • emit value immediately
  • gracefully handle sudden termination
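Handling sudden termination usually comes down to flushing whatever is buffered before the process exits. The sketch below shows one way to do that with a SIGTERM handler, using only the standard library. `EventBuffer`, `make_sigterm_handler`, and `install` are hypothetical names, and the `send` callable stands in for whatever transport ships events out of the sandbox. With the Sentry Python SDK, `sentry_sdk.flush()` plays a similar role to `buffer.flush()` here.

```python
import signal
import sys
import time


class EventBuffer:
    """Holds events in memory until they are flushed to a transport."""

    def __init__(self, send):
        self._send = send      # callable that ships one event (assumption)
        self._pending = []

    def record(self, event):
        self._pending.append(event)

    def flush(self):
        # Swap the list first so each event is sent at most once.
        pending, self._pending = self._pending, []
        for event in pending:
            self._send(event)


def make_sigterm_handler(buffer, sandbox_id, exit_fn=sys.exit):
    """Return a signal handler that records the shutdown, flushes, then exits."""

    def on_sigterm(signum, frame):
        buffer.record({
            "type": "lifecycle",
            "sandbox_id": sandbox_id,
            "shutdown_reason": "sigterm",
            "timestamp": time.time(),
        })
        buffer.flush()
        exit_fn(0)

    return on_sigterm


def install(buffer, sandbox_id):
    """Register the handler; call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(buffer, sandbox_id))
```

A handler like this only helps with signals the process can catch. SIGKILL and out-of-memory kills cannot be intercepted, so events that matter should also be sent as early as possible rather than held until shutdown.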

2. Signal vs noise

With hundreds of containers, raw logs quickly become unusable. We focused on high-signal errors and performance indicators rather than full log streams.
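One way to favor high-signal errors is to filter events before they leave the sandbox. The sketch below drops low-severity events and suppresses repeats of the same error type within a time window. The function has the `(event, hint)` signature that the Sentry Python SDK accepts for its `before_send` option, but the event keys it reads (`level`, `tags`, `exception.values[].type`) are assumptions about the event shape that should be checked against the SDK version in use. `make_noise_filter` and the 60-second window are hypothetical.

```python
import time

HIGH_SIGNAL_LEVELS = {"error", "fatal"}


def make_noise_filter(window_seconds=60.0, clock=time.monotonic):
    """Build a filter that returns the event to keep it, or None to drop it."""
    last_seen = {}  # (sandbox_id, error_type) -> time the event was last kept

    def before_send(event, hint):
        # Drop anything below error severity.
        if event.get("level") not in HIGH_SIGNAL_LEVELS:
            return None

        values = event.get("exception", {}).get("values") or [{}]
        error_type = values[-1].get("type", "unknown")
        key = (event.get("tags", {}).get("sandbox_id"), error_type)

        # Suppress repeats of the same error inside the window.
        now = clock()
        previous = last_seen.get(key)
        if previous is not None and now - previous < window_seconds:
            return None

        last_seen[key] = now
        return event

    return before_send
```

Because each sandbox is short-lived, the `last_seen` map stays small in practice. A long-running process would need to evict old keys.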

3. Cost and cardinality

Ephemeral systems explode cardinality: every sandbox introduces new IDs, so the number of distinct tag values grows with every launch. We had to be intentional about:

  • which dimensions we indexed
  • what metadata we attached to each event
  • how much we retained
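A common way to keep cardinality in check is to index only an allow-list of dimensions and attach everything else as unindexed detail. The sketch below splits a metadata dict that way and replaces the raw TTL with a coarse bucket. The allow-list, the bucket boundaries, and the function names are hypothetical. With the Sentry Python SDK, the first dict could be applied through `sentry_sdk.set_tag` and the second through `sentry_sdk.set_context`.

```python
from typing import Any, Dict, Tuple

# Dimensions worth indexing and searching on (illustrative allow-list).
INDEXED_TAGS = {"sandbox_id", "workspace_id", "shutdown_reason"}


def ttl_bucket(ttl_seconds: int) -> str:
    """Collapse arbitrary TTL values into three searchable buckets."""
    if ttl_seconds < 60:
        return "<1m"
    if ttl_seconds < 600:
        return "1-10m"
    return ">=10m"


def split_metadata(metadata: Dict[str, Any]) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return (indexed tags, unindexed context) for one event."""
    tags: Dict[str, str] = {}
    context: Dict[str, Any] = {}
    for key, value in metadata.items():
        if key == "ttl_seconds":
            tags["ttl_bucket"] = ttl_bucket(value)  # low-cardinality, indexed
            context[key] = value                    # exact value kept unindexed
        elif key in INDEXED_TAGS:
            tags[key] = str(value)
        else:
            context[key] = value
    return tags, context
```

For example, `split_metadata({"sandbox_id": "sb-1", "ttl_seconds": 45, "request_id": "r-9"})` would index `sandbox_id` and `ttl_bucket`, and keep `ttl_seconds` and `request_id` as context only.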

4. Actionability

Observability is useless if it doesn’t change behavior. The hardest part wasn’t collecting data — it was turning it into something that let us help users faster.


What We Learned

  • Traditional observability tools are built for static infrastructure, not ephemeral compute.
  • In ephemeral systems, context matters more than volume.
  • Seeing sandbox state (resource pressure, lifecycle phase) is often more useful than raw logs.
  • Proactive support is only possible when observability is first-class, not an afterthought.

Most importantly, we learned that ephemeral infrastructure needs its own observability primitives.


What’s Next

We plan to:

  • Add predictive signals for sandbox failure
  • Surface “this is about to break” warnings to users
  • Automatically suggest fixes based on recurring error patterns

This project started as a way to debug faster — it’s turning into a core part of how we support users in an ephemeral world.
