feat: Added host.name to OTel traces#2307
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Codecov Report❌ Patch coverage is
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
avitenzer
left a comment
There was a problem hiding this comment.
Review: feat: Added host.name to OTel traces
Reviewed the change end-to-end and verified the two load-bearing OTel-SDK behaviors (detector merge precedence and schema-URL handling) against the go.opentelemetry.io/otel/sdk v1.38.0 source. The full protocol/tracing suite passes, and go vet / gofmt are clean.
Verdict: clean, well-scoped, well-documented. The mechanism is correct. One robustness edge case is worth considering plus a few trivial notes — none blocking.
What's done well
- The precedence comment is accurate and earns its place. Detector merge-ordering in
resource.Newis genuinely non-obvious. Confirmed:detectfolds each detector viaMerge(res, r)andMerge(a, b)letsbwin on key collisions, so later detectors win —WithFromEnv()being last is what gives operators the override.TestNewResource_HostNameOverridableViaEnvexercises this. - No schema-URL conflict — checked, because it's the classic footgun here. Only
WithHost()contributes a non-empty schema URL (semconv.SchemaURL);WithAttributesandWithFromEnvproduce schemaless resources, and merging empty-with-non-empty never triggersErrSchemaURLConflict. Verified empirically (newResourcereturns no error in the tests). - The hermeticity fix is correct and important. Adding
OTEL_RESOURCE_ATTRIBUTEStoclearTracingEnvstops the override test from leakinghost.name=eu-west-1-consumerinto sibling tests. Easy to forget. - The
newResourceextraction is the right move — it makes resource construction unit-testable without standing up the full exporter/SDK pipeline.
Potential issue (worth considering, non-blocking)
Adding WithHost() introduces a new, environmental trigger for a fatal startup path.
resource.WithHost() calls os.Hostname(). On error, the SDK's stringDetector.Detect returns a plain error (not ErrPartialResource); detect then joins it and continues with the error still set, so resource.New returns a usable partial resource plus a non-nil error. That error flows up through newResource → tracing.New, and all three call sites abort startup:
// rpcconsumer.go:684, rpcprovider.go:900, rpcsmartrouter.go:1589
traceManager, err := tracing.New(ctx, traceCfg)
if err != nil {
return err // returned from cobra RunE → the lavap binary fails to start
}The point isn't probability — os.Hostname() essentially never fails on Linux (realistically only ENAMETOOLONG, a >64-byte hostname, which normal pod/host names won't hit). The point is asymmetry: before this PR, newResource could only fail on operator-malformed OTEL_RESOURCE_ATTRIBUTES (a config error). This adds the first environmental, non-config way for it to fail — and the consequence is a node that won't boot, with a failed to create OTel resource message that gives no hint the culprit is a hostname quirk. Since the PR makes this unconditional, the blast radius is every binary, all to protect an observability label that should never be able to take a node down.
Suggested fix (preferred): log-and-continue. This also matches OTel's guidance to treat resource.New errors as non-fatal. You already get a usable partial resource back, so in New():
res, err := newResource(ctx, cfg)
if err != nil {
// resource.New returns a usable partial resource even on error (e.g. if
// os.Hostname() fails). Warn and proceed rather than aborting node startup
// over a missing observability attribute.
utils.LavaFormatWarning("OTel resource detection incomplete; continuing", err)
}(res is always non-nil here — resource.New allocates it before detection and mutates in place.) Keeps WithHost(), no new imports.
Narrower alternative: detect the hostname yourself with os.Hostname() and add it via WithAttributes only on success, dropping WithHost() — a failed lookup is then silently skipped and newResource keeps its old config-error-only failure profile. Touches only the new path, but the log-and-continue version is more robust since it also covers any future detector.
Minor / informational
- Schema-URL side-effect (benign): before this PR the resource's schema URL was empty; after, it becomes the SDK's bundled semconv version (from
WithHost()). Harmless — the attribute keys are stable across semconv versions — but backends keying on schema URL will now see one. Worth a note, not a change. - semconv version mix (cosmetic): the code pins
semconv/v1.21.0forServiceNamewhileWithHost()uses the SDK's newer internal semconv for the host attribute. Not a bug, just slightly inconsistent. - Test DRY: both new tests repeat the same
for … range res.Attributes()scan; a smallgetAttr(res, "host.name") (string, bool)helper would tighten them.TestNewResource_HostNameOverridableViaEnvalso asserts only the value, not presence — a dropped attr would give a slightly less clear"" != "eu-west-1-consumer"failure. - Coverage gap (optional): no test asserts that
service.nameandhost.namecoexist, nor thatOTEL_SERVICE_NAMEoverride still works now that a second detector is in the chain. Low value if service-name override is covered elsewhere.
Summary
| Area | Assessment |
|---|---|
| Correctness | ✅ Precedence + schema-URL behavior verified against SDK source & tests |
| Conventions | ✅ Matches the file's env-driven, well-commented style |
| Tests | ✅ Pass; hermetic; minor DRY/coverage polish possible |
| Risk | os.Hostname() |
I'd suggest the log-and-continue tweak before merge, but it's a judgment call — the change is otherwise solid.
70189ee to
15d25a6
Compare
What
Attach the host's name as a resource attribute (
host.name) to every OTel span emitted bylavap(consumer, provider, and smart router). This lets us map collected traces back tothe region a consumer/provider runs in.
How
protocol/tracing/trace_manager.go: factored resource construction intonewResource()and added OTel's built-in host detector (resource.WithHost()), which sets the standardhost.nameattribute fromos.Hostname(). Enabled unconditionally — no flag, no opt-in.WithFromEnv()last so operator config wins on conflicts:OTEL_SERVICE_NAMEoverrides the in-code service name, andOTEL_RESOURCE_ATTRIBUTES=host.name=...overrides the detected hostname (e.g. to label a clean region instead of a pod hostname).
Override example
OTEL_SERVICE_NAME=lava-consumer
OTEL_RESOURCE_ATTRIBUTES=host.name=eu-west-1-consumer-3,deployment.environment=production
Tests
TestNewResource_HostName— default resource carriesos.Hostname().TestNewResource_HostNameOverridableViaEnv—OTEL_RESOURCE_ATTRIBUTESoverrides the detected hostname.OTEL_RESOURCE_ATTRIBUTEStoclearTracingEnvso the suite stays hermetic.No protocol-surface or config changes; standard OTel SDK behavior only.
Example