Practical observability: foundations with OpenTelemetry

In this series, we explore how to bring observability into a Scala application using OpenTelemetry - the open standard for telemetry data - and otel4s, a purely functional OpenTelemetry library for the Typelevel (Cats Effect) ecosystem.

Series roadmap
  • Part 1 (this post): Foundations with OpenTelemetry.
  • Part 2: Distributed tracing with otel4s.
  • Part 3 (coming soon): Implementing metrics with otel4s.
  • Part 4 (coming eventually): Advanced configuration and operational tips.

This first post focuses on the conceptual foundation: what observability means, how telemetry signals work, and how they stay connected in distributed systems.

1. Introduction

There are countless variables that can affect how your system behaves in production.

Maybe a new marketing collaboration with a popular tech blogger triples your traffic overnight. A minor slowdown, such as a sluggish third-party API call or a slow database query, can cascade through your services: threads pile up, external calls time out, queues overflow, and soon unrelated components start failing. Such incidents often leave no clear trail. We may know that something is wrong somewhere, but logs alone rarely tell the full story.

  • Why did latency spike last night?
  • Why does one instance behave differently under the same load?
  • Where does the time actually go between the request entering the system and the response leaving?

This is where observability comes into the picture.

Observability isn't about collecting more data. It's about understanding your system through the data it produces. The key is capturing structured, correlated signals that let us reason about why something happens, not just what happened.

Traditionally, monitoring told us whether a service was healthy. Observability extends that idea: instead of asking Is it up?, we ask Why is it behaving this way? even when we didn't anticipate the failure mode.

To achieve that, we rely on telemetry - data emitted from our applications in the form of signals: traces, metrics, and logs. Together, these signals create a complete picture of how our system behaves.

2. From monitoring to observability

What problem are we actually trying to solve?

Monitoring is great when we know what to expect. We define a few metrics, set thresholds, and get alerts when they are crossed. Unfortunately, in modern distributed systems, a single user request can cross dozens of services and queues. Once systems become that complex, we can't predict every possible point of failure.

Monitoring is like a dashboard warning light - it catches a symptom and tells us something is wrong, for example, Latency has increased. Observability is a diagnostic toolkit - it helps uncover the cause and understand why something is wrong, for example, Database connection pool is exhausted in one region. Both are essential, but observability gives us the context to reason about system behavior, not just detect anomalies.

As Scala engineers, we often work with concurrent and effectful systems built on asynchronous computations across multiple services. In such environments, failures often have multiple reasons. Monitoring might tell us that a request is slower than usual, but observability helps us pinpoint the source of the slowdown in the execution chain, whether it's a downstream dependency, a blocked fiber, or an overloaded queue.

Observability solves the problem of having disconnected telemetry signals that, on their own, fail to tell a coherent story about what happened inside the system. By correlating traces, metrics, and logs around the execution of a single request, observability turns isolated symptoms into an explanation we can reason about.

3. Understanding telemetry signals

Think of your system like a body that ate a questionable gas-station sandwich: when discomfort hits later, you need different vital signs to explain what happened. Traces, metrics, and logs are those vital signs:

  • Traces show what the system did that led to an outcome: "Stopped for gas" -> "bought a sandwich" -> "ate it" -> "drove to work" -> "digestive regret".
  • Metrics describe continuous measurements, like a smartwatch reporting a pulse. You might see the moment when the heart rate spiked after lunch, but the metric alone doesn't explain why.
  • Logs capture the context in the moment: "This sandwich looks fine. What could possibly go wrong?", and a few hours later: "Regret. So much regret." Logs record real-time perception and events that help us understand why something happened.

Each signal offers a different perspective - traces supply the sequence, metrics capture the impact over time, and logs preserve moment-to-moment context. Together they tell us not only that something went wrong, but what happened, when, and why. That's observability: connecting separate telemetry pieces into a cohesive, meaningful view of system behavior.

3.1. Traces

What happened, step by step.

Traces capture the storyline of a request: a chronological view of how work propagates through a system.

Each trace is made up of spans. A span is a building block of a trace - it represents a specific operation and the time it takes to complete, such as handling an HTTP request or querying a database.

A span records things like:

  • When the operation started and ended
  • How it relates to other spans (parent/child relationships)
  • Whether it succeeded or failed

A trace then answers broader questions:

  • What did the request actually do?
  • Which components were involved?
  • Where was time spent?
  • Where did errors occur?

Let's dive into an example. A user requests a weather forecast:

  • The gateway-service receives an HTTP GET /weather?location=Kyiv request
  • The gateway-service calls weather-service to fetch the forecast
  • The gateway-service also publishes a Kafka event with request details to the warehouse-service (for analytics to consume later)
  • The weather-service retrieves the forecast and returns it

When we visualize this trace, we can see the entire journey of the request: how it moves through services, how long each step took, and where problems occurred.

gantt  
title Trace timeline: GET /weather?location=Kyiv  
dateFormat HH:mm:ss  
axisFormat %H:%M:%S

section gateway-service
HTTP request handling      :active, 10:00:00, 10s
Send request to weather-service    :         10:00:02, 6s
Publish Kafka event        :         10:00:02, 3s

section weather-service
Fetch forecast             :         10:00:03, 4s

section warehouse-service
Consume Kafka event        :         10:00:11, 2s
Persist analytics record   :         10:00:12, 1s

Traces reveal what happened, step by step, allowing us to follow a request across fibers, services, and boundaries.

3.2. Metrics

How the system behaves over time.

If traces tell stories, metrics show patterns and trends. Metrics are the numbers a system emits continuously, such as counts, durations, and resource usage, that show how the system performs over time.

They let us spot changes at scale: traffic spikes, regressions, or slow degradation that might not be visible from individual traces or logs.

Metrics are also essential for alerting and automation. They can detect outages, trigger alerts, or even drive auto-scaling when demand suddenly increases.

Metrics answer questions like:

  • How fast are requests being processed on average?
  • What's the 95th percentile latency?
  • How many errors occurred in the past 5 minutes?
  • How much CPU or memory is being used right now?

Here are the most common types of metrics:

  • Counter - a monotonically increasing total; no decrements, reset only on restart. Example: counting successful HTTP requests or emitted events.
  • UpDownCounter - a value that increments when work starts and decrements when it ends. Example: tracking in-flight HTTP requests or queued tasks.
  • Gauge/Observable - a snapshot gathered by a callback or poll. Example: reporting CPU utilization, heap usage, or queue depth every collection cycle.
  • Histogram - a distribution recorder that powers tail percentiles. Example: emitting latency and payload-size metrics to alert on p95 and p99 regressions.
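To make the Histogram row concrete, here is a deliberately naive sketch of what "recording a distribution and asking for p95" means. This is plain Java for illustration only, not the OpenTelemetry API; the class and method names are invented. Real histogram instruments aggregate values into buckets rather than storing every sample, which is what keeps them cheap at high throughput.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: keeps raw samples and computes percentiles by
// sorting. OpenTelemetry histograms instead aggregate into buckets.
final class NaiveHistogram {
    private final List<Double> samples = new ArrayList<>();

    void record(double value) {
        samples.add(value);
    }

    // Nearest-rank percentile, p in (0, 100].
    double percentile(double p) {
        List<Double> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int index = (int) Math.ceil(p / 100.0 * sorted.size()) - 1;
        return sorted.get(Math.max(index, 0));
    }
}
```

The point of the sketch: a counter could tell you *how many* requests were slow, but only a distribution lets you ask for the tail (p95, p99) where user pain usually hides.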

Back to our over-engineered weather service: it might expose metrics like:

  • http.server.active_requests - an UpDownCounter tracking the number of active HTTP requests
  • http.server.request.duration - a Histogram recording the distribution of request latencies
  • system.cpu.utilization - a Gauge reflecting the current CPU usage

---
config:
  themeVariables:
    xyChart:
      titleColor: "#ff0000"
---
xychart-beta
title "http.server.request.duration p95 latency"
x-axis ["12:00", "14:00", "16:00", "18:00", "20:00", "22:00"]
y-axis "Latency (ms)" 80 --> 260
bar [120, 130, 140, 150, 135, 110]
line [110, 125, 140, 240, 160, 120]

The chart shows how the weather-service histogram of request durations changes throughout a stormy afternoon. A spike around 18:00 signals downstream slowness despite traffic volume remaining roughly the same, precisely the kind of trend metrics excel at detecting early.

Metrics reveal how a system is performing over time, helping us observe long-term trends, spot regressions early, and detect anomalies.

3.3. Logs

The context around what happened.

If traces tell stories and metrics show trends, logs are the day-to-day journal of your system - recording events one at a time.

Logs capture what's happening inside your code: messages, events, and context. They're timestamped, detailed, and usually the first place to look when something goes wrong.

Logs answer questions such as:

  • What specific event occurred?
  • What values or state were involved?
  • What errors or exceptions were thrown?

For example, our trusty weather service might log something like:

2025-10-03 15:14:18.123 INFO 
  trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 
  Checking forecast for location [Kyiv]

Here, the trace_id and span_id link this log entry to a specific span within a trace, allowing us to see all related operations and understand them in context. When logs are tied to traces, they stop being isolated lines of text and instead help explain what the system was doing when something happened.
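As a sketch of that correlation, the log line above could be produced by something like the following. This is illustrative plain Java, not a real logging API; in practice a logging framework injects these fields automatically (e.g., via MDC) rather than by manual string building.

```java
// Illustrative sketch: a log message prefixed with the current trace
// context so the line can later be joined back to its span.
final class CorrelatedLog {
    static String format(String traceId, String spanId, String message) {
        return "trace_id=" + traceId + " span_id=" + spanId + " " + message;
    }
}
```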

3.4. Baggage

A global context along the path.

There's also baggage - not a separate telemetry signal, but metadata that rides inside the same context payload as the trace identifiers. Baggage is a lightweight key-value map that propagates across services and threads, so every instrumentation can read the same request-level hints, such as tenant, plan, experiment flag, and so on.

Think of it as a sticky note attached to the trace context: invisible to users yet available to each instrument. Because it travels with every outbound call, baggage should stay small and focused on attributes that truly help explain or partition the work.
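On the wire, W3C baggage travels as a header of comma-separated key=value pairs (e.g., `baggage: tenant=acme,plan=pro`). A minimal sketch of parsing that shape might look like this; it is illustrative only and skips parts of the real format, such as percent-encoding and per-entry properties:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified parser for the W3C "baggage" header shape:
// key1=value1,key2=value2. Real implementations also handle
// percent-encoding and per-entry properties (";k=v").
final class BaggageParser {
    static Map<String, String> parse(String header) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String entry : header.split(",")) {
            String[] kv = entry.trim().split("=", 2);
            if (kv.length == 2) {
                entries.put(kv[0].trim(), kv[1].trim());
            }
        }
        return entries;
    }
}
```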

4. Context propagation

The invisible glue that makes distributed tracing possible.

Now that we've looked at the what, when, and why of telemetry signals, one question remains: How do these signals stay connected?

The answer is context propagation - the invisible glue that makes distributed tracing possible. The context carries metadata, such as trace and span IDs, that identify the current span and link it to the overall trace. With context propagation, spans can be correlated and assembled into a single trace, regardless of where each span originated.

Modern applications are highly asynchronous and distributed. A single HTTP request might trigger a Kafka message, spawn a background fiber, and make a gRPC call before completing. Without context propagation, each of those operations would start a new trace, making it impossible to reconstruct the timeline of an individual request.

In practice, context propagation means passing along the trace context: serializing it before an outbound call and deserializing it on the other side.

Depending on the protocol, the serialized context travels in different carriers:

  • HTTP - passed via request headers
  • gRPC - passed via metadata entries (the gRPC equivalent of headers)
  • Messaging (Kafka, RabbitMQ) - attached to message headers

For example, the W3C Trace Context standard defines the traceparent header that carries tracing information in HTTP:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

where 4bf92f3577b34da6a3ce929d0e0e4736 is the trace ID, 00f067aa0ba902b7 is the span ID, and 01 is the trace flags.
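Splitting that header into its four fields can be sketched like this. The class is invented for illustration; real propagators additionally validate field lengths and reject all-zero IDs:

```java
// Minimal sketch of splitting a W3C traceparent header into its four
// fields: version, trace-id, parent-id (span id), and trace flags.
final class TraceParent {
    final String version;
    final String traceId;
    final String spanId;
    final String flags;

    private TraceParent(String version, String traceId, String spanId, String flags) {
        this.version = version;
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    static TraceParent parse(String header) {
        String[] parts = header.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException("malformed traceparent: " + header);
        }
        return new TraceParent(parts[0], parts[1], parts[2], parts[3]);
    }
}
```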

When a downstream service receives a request with this header, it extracts the context, restores the trace state, and starts a new child span linked to its parent, continuing the same trace seamlessly across service boundaries.

sequenceDiagram
    autonumber
    actor User
    participant G as gateway
    participant WS as weather-service
    participant WH as warehouse

    User->>G: HTTP GET /weather?location=Kyiv
    activate G

    G->>WS: gRPC: Fetch forecast (location=Kyiv)
    activate WS
    WS-->>WS: Retrieve forecast
    WS-->>G: gRPC: OK + Forecast
    deactivate WS

    G-->>WH: Publish event to Kafka (request details)
    
    #Note over WH: Consumed for analytics later

    G-->>User: 200 OK + Forecast
    deactivate G

In this example, the gateway service starts a new root trace and creates a child span for the gRPC request. When the weather-service handles the gRPC call, it extracts the trace context from the gRPC metadata and restores the trace state so the new span is linked to its parent. The warehouse-service consumes the Kafka event, extracts the context from the message headers, and either continues the parent span or links to it, depending on the ingestion style. That way, we can track the full journey of the request from start to finish.

5. OpenTelemetry

The standard behind it all.

Obligatory XKCD: "Standards"

At this point, we've explored the what, when, why, and how of observability signals. Now it's time to look at the standard that ties all these ideas together.

OpenTelemetry is an open-source, vendor-neutral framework for collecting, processing, and exporting telemetry data (traces, metrics, logs) from applications. It defines how telemetry data is structured, correlated, and exchanged between services so that your observability setup doesn't depend on any single vendor or tool.

Before OpenTelemetry, each observability vendor provided its own SDKs, APIs, and data formats. For application developers, that often meant choosing one monitoring stack and sticking with it. For library authors, it was more problematic: you'd have to provide instrumentation for every possible backend or force users into one ecosystem.

OpenTelemetry provides a set of standards and conventions that all languages and vendors can follow. Each language ships an API (the abstraction instrumentation code targets) and an SDK (the concrete implementation that manages exporters, processors, and context propagation). With that split, application code depends only on the stable API while the SDK can evolve or be swapped without touching business logic. On top of the API/SDK split, the project defines surrounding components that keep telemetry portable:

  • Common APIs and data models - consistent interfaces and data structures for generating telemetry signals in any language.
  • Context propagation standards - a unified approach (e.g., the W3C Trace Context) to share trace context across service boundaries.
  • OpenTelemetry Collector - a standalone service that receives, processes (transforms/filters), and routes telemetry data between your applications and observability backends.
  • Semantic conventions - standardized naming and attributes for common operations, such as HTTP requests, database calls, messaging systems, and more.
  • OTLP (OpenTelemetry Protocol) - a standard wire protocol for transmitting telemetry data between SDKs, collectors, and backend systems.

Together, these parts form an ecosystem that keeps your telemetry consistent and portable, regardless of language, framework, or vendor. For example, you might collect traces in your services and send them to a tracing backend like Grafana Tempo, export metrics to Prometheus, and forward logs to Grafana Loki. The OpenTelemetry Collector can route and export data to each of these systems using the standard OTLP format.
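As a rough sketch, a Collector configuration wiring that setup together might look like the following. The exporter names and endpoints are illustrative assumptions (they vary by Collector distribution and backend); consult the Collector documentation for the exact components you need:

```yaml
# Hypothetical OpenTelemetry Collector config: receive OTLP from apps,
# batch, and fan out each signal to a different backend.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/tempo:                 # traces to Grafana Tempo (placeholder endpoint)
    endpoint: tempo:4317
  prometheusremotewrite:      # metrics to Prometheus (placeholder endpoint)
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/loki:              # logs to Grafana Loki (placeholder endpoint)
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
```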

Concepts stick once you see the API surface. The following sections show the same tiny span/metric examples first with the OpenTelemetry Java SDK, then with otel4s, so you can see the trade-offs between thread-local and effect-aware approaches.

6. OpenTelemetry Java overview

The official OpenTelemetry Java SDK comes with instrumentation for many popular frameworks (e.g., Spring) and libraries (e.g., JDBC connection pools, HTTP clients). That works well for traditional Java or Kotlin applications.

However, it doesn't play nicely with purely functional Scala apps built on top of Cats Effect. The reason is that the Java SDK relies on thread-local context propagation, which doesn't align with how Cats Effect manages fiber-based concurrency and context.

Below, we can peek at the API and compare it with the otel4s counterparts later.

6.1. Tracing API

The OpenTelemetry Java API keeps instrumentation explicit: you build spans manually, make them current by opening a Scope, run the work, and end the span yourself. Thread-local context propagation links the span to the current thread, so Span.current() retrieves whatever you placed in scope.

static void doWorkWithSpan(Tracer tracer) {
  Span span = tracer.spanBuilder("do-work").setAttribute("example.attribute", "hello").startSpan();
  try (Scope scope = span.makeCurrent()) {
    doWork();
    span.setStatus(StatusCode.OK);
  } catch (Throwable e) {
    span.setStatus(StatusCode.ERROR);
    span.recordException(e);
    span.setAttribute("error.type", e.getClass().getName());
  } finally {
    span.end();
  }
}

static void doWork() {
  Span current = Span.current();
  System.out.println("Doing some work: " + current.getSpanContext().toString());
}

6.2. Metrics API

Metrics follow the same imperative pattern: you measure latency by hand and must remember to record failures as well.

static void doWorkWithMetrics(LongCounter counter, DoubleHistogram latencyMs) {
  Attributes attributes = Attributes.builder().put("example.attribute", "hello").build();
  long startNanos = System.nanoTime();
  try {
    doWork();
    double elapsedMs = (System.nanoTime() - startNanos) / 1_000_000.0;
    latencyMs.record(elapsedMs, attributes);
    counter.add(1, attributes);
  } catch (Throwable e) {
    double elapsedMs = (System.nanoTime() - startNanos) / 1_000_000.0;
    latencyMs.record(elapsedMs, attributes.toBuilder().put("error.type", e.getClass().getName()).build());
  }
}

static void doWork() {
  System.out.println("Doing some work");
}

7. otel4s overview

otel4s brings the OpenTelemetry specification to Scala in a purely functional form. It provides effect-aware context propagation and composable, idiomatic APIs for tracing, metrics, and logging. The API integrates seamlessly with Cats Effect applications, without resorting to thread-locals or unsafe side effects.

While otel4s works perfectly with the official OpenTelemetry Java SDK, it also offers an alternative SDK implementation written in Scala. This allows you to use the same observability model across platforms: JVM, Scala.js, and Scala Native. In other words, you can instrument Scala applications on the JVM, in the browser, or natively using the same OpenTelemetry concepts, all within the Typelevel ecosystem.

The library uses metaprogramming techniques to reduce runtime costs and allocations. Overhead is near zero when telemetry is disabled, so production performance is unaffected when tracing or metrics collection is not required.

---
config:
  look: handDrawn
  layout: elk
---
graph BT
otel-spec["OpenTelemetry Specification"]

otel4s-core["otel4s-core<br><br>Defines interfaces, such as Tracer, Meter, and Logger"] --> otel-spec
otel4s-sdk["otel4s-sdk<br><br>An independent implementation of the OpenTelemetry specification in Scala"] --> otel4s-core
otel4s-oteljava["otel4s-oteljava<br><br>Uses <a href="https://github.com/open-telemetry/opentelemetry-java">OpenTelemetry Java SDK</a> under the hood"] --> otel4s-core

An oversimplified otel4s architecture

A modular architecture allows including only the required components and makes it easier for library authors to implement instrumentation for their tools.

Now, let's get back to the API examples.

7.1. Tracing API

otel4s can automatically manage the lifecycle of spans, so you don't have to start and end them manually. If the effect fails, the error.type attribute is automatically set to the fully qualified name of the exception class, and the span is marked as errored. Propagation relies on cats.mtl.Local instead of thread-locals, which keeps the context in sync across fibers and async boundaries.

def doWorkWithSpan(using Tracer[IO]): IO[Unit] = 
  Tracer[IO].span("do-work", Attribute("example.attribute", "hello")).use { span =>
    doWork() *> span.setStatus(StatusCode.Ok)
  }

def doWork()(using Tracer[IO]): IO[Unit] = 
  for {
    span <- Tracer[IO].currentSpanOrThrow
    _    <- IO.println(s"Doing some work: ${span.context}")
  } yield ()

Nevertheless, you can still have full control over the span lifecycle by using a low-level API.

7.2. Metrics API

Meters mirror that philosophy. Instruments such as Histogram expose helpers like recordDuration that wrap an effect and capture its timing without manual timer bookkeeping.

def doWorkWithMetrics(counter: Counter[IO, Long], latencyMs: Histogram[IO, Double]): IO[Unit] = {
  val attributes = Attributes(Attribute("example.attribute", "hello"))
  val histogramAttributes: Resource.ExitCase => Attributes = {
    case Resource.ExitCase.Succeeded  => attributes
    case Resource.ExitCase.Errored(e) => attributes + Attribute("error.type", e.getClass.getName)
    case Resource.ExitCase.Canceled   => attributes + Attribute("error.type", "canceled")
  }

  latencyMs.recordDuration(TimeUnit.MILLISECONDS, histogramAttributes).surround {
    doWork() *> counter.add(1L, attributes)
  }
}

def doWork(): IO[Unit] =
  IO.println("Doing some work")

As with the tracing API, you can still use the low-level API if you need more control.

8. Conclusion

Observability gives us a way to understand systems through the signals they produce: traces, metrics, and logs.
We've looked at how these signals complement each other, how context propagation keeps them connected, and how OpenTelemetry provides a shared foundation for collecting and exchanging them.

For Scala developers, otel4s brings these ideas into the functional world, allowing telemetry to flow naturally through effectful code while staying consistent with OpenTelemetry standards.

This post focused on the concepts behind observability: what the signals mean, how they relate, and why they matter. In the next part of the series, we'll move from theory to practice and see how to use otel4s to instrument the application.