OTEP: Thread Context: Sharing Thread-Level Information with external readers by scottgerring · Pull Request #4947 · open-telemetry/opentelemetry-specification

scottgerring · 2026-03-16T10:56:41Z

Changes

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot collect information about active OpenTelemetry traces running within the process they are sampling. We (@ivoanjo and @scottgerring) propose a mechanism for OpenTelemetry SDKs to publish thread-level attributes — including trace ID, span ID, and configurable custom attributes — through a standard format based on the ELF Thread Local Storage (TLSDESC) dialect.

Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently (or undesirable, maintenance-wise) to implement it.

When a request context is attached or detached from a thread, the SDK publishes select information to a thread-local variable that external readers such as the eBPF profiler can discover and read. This enables correlation of profiling samples with request context, even when the active span was not sampled by the SDK.

This builds on and extends OTEP 4719: Process Context, using its process context mechanism to store the static, process-scoped reference data (schema version and attribute key map) that the thread-local records reference.

Why open as draft: OTEP 4719 is a dependency for this OTEP, and thus we'll need to wait for that OTEP to land to ensure we have a solid underpinning to build on.

This OTEP is based and heavily inspired on the custom-labels work by [Polar Signals](https://www.polarsignals.com/) and the universal profiling integration by Elastic — big thanks to them for the prior art and inspiration.and everyone that collaborated with us on the draft google doc this is based on.

Related issues #
Related OTEP(s): Profiling Vision #212, Profiles Data Model #239, Process Context #4719
Links to the prototypes (when adding or changing features)
CHANGELOG.md file updated for non-trivial changes
Spec compliance matrix updated if necessary

…elemetry eBPF Profiler

… for eBPF profiler

scottgerring · 2026-04-08T07:49:16Z

Now that #4719 has been accepted, this one is ready for review!

Co-authored-by: Attila Szegedi <szegedi@users.noreply.github.com>

linux-foundation-easycla · 2026-04-13T08:58:42Z

The committers listed above are authorized under a signed CLA.

✅ login: ivoanjo / name: Ivo Anjo (31dbc10, 5c9f504, f529e91)
✅ login: scottgerring / name: Scott Gerring (01dcdb4, 047db5a, 0504443, 0e4bc01, 0e7c7c8, 22e7d28, 4e16475, 6400099, 6b8f73a, 6fad842, 8ba154e, 92164ae, b2aa8e3, bc4958d, e5d5798, 5c9f504, 8e89fe8, f529e91, ffc4c4e)
✅ login: szegedi / name: Attila Szegedi (ffc4c4e)

Copilot

Pull request overview

This PR adds a new Profiles OTEP proposing a Linux TLS/TLSDESC-based mechanism for SDKs to publish thread-level context for external readers such as the OpenTelemetry eBPF Profiler.

Changes:

Adds 4947-thread-ctx.md describing motivation, data layout, publication/reading protocols, trade-offs, runtime feasibility, and prototypes.
Adds a changelog entry for the new OTEP.
Updates textlint terminology exclusions for “Local Storage”.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`oteps/profiles/4947-thread-ctx.md`	Adds the Thread Context OTEP proposal.
`CHANGELOG.md`	Records the new OTEP in the release changelog.
`.textlintrc.yml`	Adds a terminology exclusion related to Local Storage wording.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

christos68k

I'll poke at this some more

christos68k · 2026-05-28T16:50:35Z

+
+This mechanism is designed to achieve the following goals:
+
+* **Reader flexibility**: Readers are not limited to eBPF-based implementations; any external process with sufficient system permissions to read `/proc/<pid>/maps` to identify loaded libraries, as well as sufficient permission to read from process memory should be able to use this mechanism  (nb: [OTEP-4719](4719-process-ctx.md) also requires access to this resource)


Suggested change

* **Reader flexibility**: Readers are not limited to eBPF-based implementations; any external process with sufficient system permissions to read `/proc/<pid>/maps` to identify loaded libraries, as well as sufficient permission to read from process memory should be able to use this mechanism (nb: [OTEP-4719](4719-process-ctx.md) also requires access to this resource)

* **Reader flexibility**: Readers are not limited to eBPF-based implementations; any external process with sufficient system permissions to read `/proc/<pid>/maps` to identify loaded libraries and read from target process memory should be able to use this mechanism (nb: [OTEP-4719](4719-process-ctx.md) also requires access to this resource)

Agreed this is a bit wordy, althoughto me the suggestion makes it sound like /prod/<pid>/maps is also the mechanism to 'read from target process memory'. I'll try think of something else

Reworded inline. lmk what you think!

christos68k · 2026-05-28T17:11:01Z

+
+### Thread Local Context Record
+
+This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.


Suggested change

This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.

This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support string (length prefixed utf-8 bytes) attributes only.

I think stringy is definitely too informal here, and the addition of utf-8 bytes makes sense to me. The 'length prefixed' feels a bit overspecified for this bit in the doc, though - it feels like it implies something more than what we have in the field layout section below. Reworded inline - wdyt @christos68k ?

I think either one works... And I have a small suggestion -- perhaps we could include a "hexdump" of a tiny example to clarify how it looks in practice?

We didn't do this for process context since that's mostly protobuf, but folks bootstrapping their own readers could use as a first test to check their parsing.

christos68k · 2026-05-28T17:26:29Z

+**Allocation:** Implementations may choose to preallocate storage for some fixed set of **Thread Local Reference Data** instances. This removes the need to allocate in the hot path.
+
+**Concurrency model**: Unlike the process context (OTEP 4719), where the writer races asynchronously with the reader and CPU memory barriers (`atomic_thread_fence` with `seq_cst`) are required, thread context assumes signal handler-like semantics.
+In practice, context reads are expected to behave as if the thread whose context is being read is stopped or otherwise interrupted, and thus there can't be any concurrency hazards between reads and writes.


Clarifying:

Suggested change

In practice, context reads are expected to behave as if the thread whose context is being read is stopped or otherwise interrupted, and thus there can't be any concurrency hazards between reads and writes.

In practice, context reads are expected to behave as if the thread whose context is being read is stopped or otherwise interrupted, and thus there can't be any concurrency hazards between reads and writes (as the writer logic will always execute as part of the thread).

Good point! I've rejigged it slightly more as i'm not happy with how we wrote it originally either - will push change shortly - lmk what you think

scottgerring · 2026-05-29T09:38:08Z

+
+This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.
+
+| Name            |            | Data type                          | Notes                                                                                                                                                    |


@florianl @christos68k @felixge @ivoanjo @brancz - as discussed in the prof sig yesterday, here is where we could consider adding add-hoc (i.e. not pre-configured in process context) labels.

Our discussion yesterday, I believe, came to the conclusion we probably shouldnt do this for now because:

we want to discourage high cardinality keys / churn

we don't expect that this is likely to serve a common use case

we want to err on the side of keeping the spec simple and opinionated

we can always add this later if we see demand

... but let's use this thread for more complete discussion as we need!

For context, relevant meeting notes are on the May 28th, 2026 meeting.

+1 I agree with err'ing on the side of simplicity for now, for the reasons Scott stated, especially as I can see no blockers with this being added in the future as an extension if needed.

umanwizard · 2026-05-28T19:58:25Z

+
+Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently implement it.
+
+The TLSDESC-based publication mechanism and the in-memory **Thread Local Context Record** format are intentionally separable. Runtimes that cannot efficiently expose context through TLSDESC are not required to implement this mechanism. However, we highly recommend reusing the same in-memory record format when publishing equivalent context through a runtime-specific discovery mechanism, rather than defining a runtime-specific payload format. Such mechanisms are out of scope for this OTEP, but reusing the record layout where practical allows readers to share parsing logic across runtimes.


Runtimes that cannot efficiently expose context, or runtimes for whom thread-local context is not relevant (e.g. in Go, Goroutine-local context is more interesting, and in Node, we want something that follows async execution, etc.)

This is a great point; there it's more about the unit of execution context being fundamentally different. Let me think about how to word this

Both go and node do end up mapping their own high-level things onto OS threads, so there is a world where they could use this thread-local context as well.

Thus at least for me the distinction is about what we can be implemented efficiently as a library -- it doesn't matter that "theoretically" someone could change go/node directly to update the thread-local context when they swap async execution scopes or goroutines, because we need a solution that works today.

This to say, in terms of wording, I think that's what's interesting to get across?

umanwizard · 2026-05-29T17:40:22Z

+| attrs-data      |            | uint8[]                            | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`.                                                      |
+|                 | [x].key    | uint8 (*See below for alternative) | Index into the key table                                                                                                                                 |
+|                 | [x].length | uint8 (*See below for alternative) | Length of val string                                                                                                                                     |
+|                 | [x].val    | uint8[length] (utf-8 bytes)        | Inline array of the string value itself. Exactly 'length' bytes appear here, before the next record begins.                                              |


before the next record begins

...unless we are assuming a 2-byte alignment, in which case there will be an extra padding byte if the entire record has an odd length.

I was discussing this with Scott and I realized I wasn't quite sure I fully understood this suggestion.

Are you suggesting:

That the overall "Thread-Local Context Record" should have a total size that is aligned (so that e.g. people don't create one that is "exactly 613 bytes")?

That inside attrs-data the key/length/bytes entries should be aligned?

(Something else?)

Maybe there's confusion around

before the next record begins.

We're actually talking about the next attribute in attrs-data here, right? And not the next ThreadContextRecord (which needs not exist, needs not be placed after this one, and is not 1-byte aligned indeed). On the other hand, everything in attrs-data are individual bytes or byte arrays, so there's no reason for padding or alignment here.

If so, maybe substituting "record" for "attribute" or at least "attribute record" could dispel the confusion?

I think that's a good idea to disambiguate. Have added:

Inline array of the string value itself. Exactly length bytes appear here, before the next attribute entry begins.

umanwizard · 2026-05-29T18:04:01Z

+
+When a request context is attached to a thread, the SDK:
+
+1. Sets the TLS pointer to 0 to ensure readers see no record during construction


Why is this initial setting to 0 necessary; can't we just set the pointer to the new record once it's done being constructed? Then the reader will see either the old record or the new record, never "no record" (at least in this code path; the mutate-in-place path is different).

I reckon the tradeoff here is whether we want to prefer "some record" over "no record". Said otherwise, at the point we are trying to set a new record, the old record is no longer valid for the thread and we should clear it quickly. Maybe something like:

Suggested change

1. Sets the TLS pointer to 0 to ensure readers see no record during construction

1. Sets the TLS pointer to 0 to dettach the old record from the thread

wdyt @umanwizard ?

We could restrict this a bit; yet I think allowing "no record" does make things easier on the writer. It allows a new webserver request to start by setting a record and to end by clearing it (before a thread goes back to the idle thread pool or similar).

Requiring always an active record would disallow an implementation of this kind. So I'd classify the "be able to go back to no record" as a nice to have, although yes we could make do without it.

I agree that we should not always require an active record. I guess my point is that either approach will work, and which one should be preferred probably depends on the application, so I don't think the OTEP needs to be prescriptive about it.

Ah, yeah, good point -- indeed the text can suggest detaching for in-place updates OR swapping if an "update" is done by creating an entire new record to replace the previous point.

In general I agree without that the spec should not be overly prescriptive and should try to leave space for implementations to do what makes sense for them.

Ok so i've dropped the "Sets the TLS pointer to zero ..." bit, and added a comment below:

If reusing storage that may already be visible to readers, the SDK MAY first set the TLS pointer to NULL to ensure readers see no record during construction.

umanwizard · 2026-05-29T18:11:20Z

+| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| trace-id        |            | uint8[16]                          | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set.                        |
+| span-id         |            | uint8[8]                           | In W3C Trace Context format.                                                                                                                             |
+| valid           |            | uint8                              | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set  when they read.                          |


Thinking about this more, I'm not convinced valid is needed.

If we didn't have the valid field, then when modifying a record in-place, we could simply detach it first, e.g.:

ThreadContextRecord *r = otel_thread_ctx_v1; otel_thread_ctx_v1 = NULL; BARRIER; // do some mutations on r... BARRIER; otel_thread_ctx_v1 = r;

AFAICT, this would not be any slower or worse than the current approach of mutating valid, but I might be missing something.

Lifting @ivoanjo 's words out of a slack chat:

What you're saying is correct -- the thinking about having this bit of extra flexibility is that if you're exposing the thread context from C/C++/Rust to another language, if there's a valid field in there you can "just" expose a byte-buffer and allow the high-level language to directly update the context without needing to also somehow teach it how to update the TLS (and expose something like "a raw pointer"). Thus it's a bit of extra redundancy for a bit of extra implementation flexibility for writers -- at least that's how I'd describe it

While we've been PoCing this internally, we found it to be useful for Java in particular where the cost of twiddling the TLS pointer itself is relatively high.

Yeah, this :D

We actually already have some text about this later on in the doc:

A SDK should choose to either set/unset the TLS pointer itself, or the valid flag, but not both.

The intention of this design is to enable flexibility for the writers. We envision that some writers may choose to keep a fixed record for a given thread and can thus mutate it in-place, and that other writers may instead keep the record associated with some other high-level concept (coroutine, request, etc) and swap out the pointer as needed when they become active on any given thread.

We discussed this a bit more in Slack, but it's probably better to take the discussion to a public place. I am still not fully convinced, for two main reasons:

First, why is it expensive in Java? Even if the FFI required to access TLS variables in Java is high, it seems to me you could save a pointer to the TLS variable in each Java thread, and access the variable through that pointer, rather than redoing the TLS lookup every time. I don't know Java beyond the basics, so please let me know if I'm making any erroneous assumptions.

Even if it is in fact true that the valid flag makes it possible for the Java implementation to be slightly faster, this has other efficiency drawbacks. We have to check the valid flag on every trip through the unwinder (that's only one branch but it's not nothing), whereas any disadvantage on the writing side is only incurred when the label set is actually changed. Furthermore, having this flag makes the struct 2 bytes longer and the ABI more complicated.

I could be convinced that this flag is useful, but I'd be interested in seeing a more concrete example of how it's used from Java and what it would look like from Java if we didn't have the valid flag and only could set/unset the TLS variable to NULL.

Those are good points, let me try to address them ;)

First, why is it expensive in Java? Even if the FFI required to access TLS variables in Java is high, it seems to me you could save a pointer to the TLS variable in each Java thread, and access the variable through that pointer, rather than redoing the TLS lookup every time. I don't know Java beyond the basics, so please let me know if I'm making any erroneous assumptions.

Indeed what you're saying it's possible. The way I think about it is:

You always need to touch the thread-local context record to update it, so if you're exposing it to a high-level language as a byte-buffer ish (e.g. Java, Ruby), you're always going to need to expose that memory location. Exposing a byte-buffer to the high-level language is useful because the JIT can then kick in and make updating the context something you can do in a very optimized way, without needing to jump over the FFI. (In the JVM in specific, calling across the FFI is more expensive than regular Java methods, and these calls cannot be optimized by the JIT)

Thinking about CPU caches and whatnot, with valid sitting in the middle of the record, it would be pulled into the CPU cache already.

The context record structure itself has only data and no pointers. So a high-level bit of code that's limited to writing in a sized byte-buffer can't do much damage if there's any bugs.

Removing the valid and forcing the thread-local storage to always be written would mean:

Two pointers need to be exposed to the high-level code: the TLS and the context record. (Or at least, there would need to be an FFI call to "set TLS to null" && "set it back to the context")

The TLS pointer is almost for sure in a different memory location from the context, so that's an extra cache miss

Having the high-level code be able to update a pointer sounds like it might be a bit more error-prone. E.g. what if it sets the pointer to something else that's not a context record?

As part of validating this work, we've implemented the current thread context sharing proposal in Datadog's Java Profiler (used by dd-trace-java, Datadog's Java SDK) in DataDog/java-profiler#347 and it takes advantage of the things I mentioned above:

It initializes the TLS when a new thread gets created and never touches it again

It exposes the thread-local context record to the high-level Java code, and thus you never need to cross the FFI back to C++ code to update the context

In our internal benchmarks, this approach was even cheaper than the one we had before (there's a markdown file in that PR with our results), which was really great news, since I think we want as many OTel and other SDKs to adopt this feature, on by default, and being able to say "look the cost is ultra-low, you don't need to worry about this impacting performance".

(To be clear, my claim is not that we couldn't have done this without valid, only that it worked out quite well there -- and you can see the history of all our discussions and that we tried several designs and went quite a lot back-and-forth on that PR)

Even if it is in fact true that the valid flag makes it possible for the Java implementation to be slightly faster, this has other efficiency drawbacks. We have to check the valid flag on every trip through the unwinder (that's only one branch but it's not nothing), whereas any disadvantage on the writing side is only incurred when the label set is actually changed. Furthermore, having this flag makes the struct 2 bytes longer and the ABI more complicated.

I won't claim this does not add a bit more work on the reader side -- so that's definitely a trade-off for the gains I mention above.

Yet, I think there's an important flip-side to -- "We have to check the valid flag on every trip through the unwinder (that's only one branch but it's not nothing), whereas any disadvantage on the writing side is only incurred when the label set is actually changed.": the writer must always incur the cost of setting, otherwise, data will be missing/incomplete. If spans are installed for 100ms or even more at a time, that's not a lot of writes. But on a high-throughput system with requests in the single-ms range, it's possible there'll be even more writes than reads. And while the reader can always decide to self-throttle by tweaking the sampling rate, the writer doesn't have a good mechanism for this.

Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co> Co-authored-by: umanwizard <brennan@umanwizard.com>

ivoanjo · 2026-06-02T13:44:43Z

+
+Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently implement it.
+
+The TLSDESC-based publication mechanism and the in-memory **Thread Local Context Record** format are intentionally separable. Runtimes that cannot efficiently expose context through TLSDESC are not required to implement this mechanism. However, we highly recommend reusing the same in-memory record format when publishing equivalent context through a runtime-specific discovery mechanism, rather than defining a runtime-specific payload format. Such mechanisms are out of scope for this OTEP, but reusing the record layout where practical allows readers to share parsing logic across runtimes.


Both go and node do end up mapping their own high-level things onto OS threads, so there is a world where they could use this thread-local context as well.

Thus at least for me the distinction is about what we can be implemented efficiently as a library -- it doesn't matter that "theoretically" someone could change go/node directly to update the thread-local context when they swap async execution scopes or goroutines, because we need a solution that works today.

This to say, in terms of wording, I think that's what's interesting to get across?

ivoanjo · 2026-06-02T13:51:43Z

+
+### Thread Local Context Record
+
+This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.


I think either one works... And I have a small suggestion -- perhaps we could include a "hexdump" of a tiny example to clarify how it looks in practice?

We didn't do this for process context since that's mostly protobuf, but folks bootstrapping their own readers could use as a first test to check their parsing.

ivoanjo · 2026-06-02T13:53:31Z

+| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| trace-id        |            | uint8[16]                          | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set.                        |
+| span-id         |            | uint8[8]                           | In W3C Trace Context format.                                                                                                                             |
+| valid           |            | uint8                              | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set  when they read.                          |


Yeah, this :D

We actually already have some text about this later on in the doc:

A SDK should choose to either set/unset the TLS pointer itself, or the valid flag, but not both.

The intention of this design is to enable flexibility for the writers. We envision that some writers may choose to keep a fixed record for a given thread and can thus mutate it in-place, and that other writers may instead keep the record associated with some other high-level concept (coroutine, request, etc) and swap out the pointer as needed when they become active on any given thread.

ivoanjo · 2026-06-02T13:55:49Z

+| attrs-data      |            | uint8[]                            | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`.                                                      |
+|                 | [x].key    | uint8 (*See below for alternative) | Index into the key table                                                                                                                                 |
+|                 | [x].length | uint8 (*See below for alternative) | Length of val string                                                                                                                                     |
+|                 | [x].val    | uint8[length] (utf-8 bytes)        | Inline array of the string value itself. Exactly 'length' bytes appear here, before the next record begins.                                              |


I was discussing this with Scott and I realized I wasn't quite sure I fully understood this suggestion.

Are you suggesting:

That the overall "Thread-Local Context Record" should have a total size that is aligned (so that e.g. people don't create one that is "exactly 613 bytes")?

That inside attrs-data the key/length/bytes entries should be aligned?

(Something else?)

ivoanjo · 2026-06-02T13:59:08Z

+
+When a request context is attached to a thread, the SDK:
+
+1. Sets the TLS pointer to 0 to ensure readers see no record during construction


We could restrict this a bit; yet I think allowing "no record" does make things easier on the writer. It allows a new webserver request to start by setting a record and to end by clearing it (before a thread goes back to the idle thread pool or similar).

Requiring always an active record would disallow an implementation of this kind. So I'd classify the "be able to go back to no record" as a nice to have, although yes we could make do without it.

ivoanjo · 2026-06-02T14:05:40Z

+This means CPU memory ordering is not a concern.
+Writers need only use compiler fences (`atomic_signal_fence` or equivalent) and/or volatile writes to prevent compilers from reordering writes to the context with those to `valid` or the TLS pointer.
+These fences are expected to carry no runtime cost.
+Readers conforming to this specification MUST only read thread context while the target thread is stopped or interrupted (e.g. via eBPF perf events, ptrace-stop, or equivalent mechanisms).


I think mandating this seems a bit too strong?

E.g. if a reader wants to race the writer, why not? They'd be signing up for any of the usual problems, but I think that's par-for-the-course for that path, so I think the spec itself can suggest "you pro'lly don't wanna do that", but I think we can leave some space and not outright mandate it.

Suggested change

Readers conforming to this specification MUST only read thread context while the target thread is stopped or interrupted (e.g. via eBPF perf events, ptrace-stop, or equivalent mechanisms).

Readers conforming to this specification SHOULD read thread context while the target thread is stopped or interrupted (e.g. via eBPF perf events, ptrace-stop, or equivalent mechanisms).

I think MUST is fine. It doesn't mean you will go to jail if you violate it, it just means "whatever you are doing is outside the scope of this specification, so we make no guarantees about what will happen". Which seems appropriate to me. E.g. if you violate this constraint you'll possibly read valid == true and nevertheless get torn reads in the rest of the record, and the spec can't really assign any useful semantics in that case.

I was indeed thinking along the lines you're suggesting. Going by the definition of the terms in https://datatracker.ietf.org/doc/html/rfc2119 "should" looked closer to me to what we want here, but I don't mind keeping the must if we want the stronger language to "discourage" racy implementations.

1. MUST This word, > or the terms "REQUIRED" or "SHALL", mean that the
definition is an absolute requirement of the specification.

3. SHOULD This > word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

I'm going to leave this as it is until someone has a stronger opinion one way or the other; I kinda like being overly restrictive here because it feels like it constrains readers in a sensible fashion but agree it could go either way

reyang · 2026-06-04T15:56:09Z

+
+## Explanation
+
+We propose a mechanism for OpenTelemetry SDKs to publish thread-level information reflecting the context of the active request, if any, through a standard format based on the Linux-specific ELF Thread-Local Storage (TLS) TLSDESC dialect.


Does this mean it will be targeting Linux only? (e.g., what about MacOS and Windows)

With this first version yes, for a few reasons:

We expect the first consumers of this to be the eBPF Profiler and OBI both of which are Linux-specific

This mechanism is quite low-level and with very tight efficiency goals, so we believe it's better to evaluate how exactly they would map to other OS with more care, than just going "they also have some thread-locals right? just use whatever they have there"

While all three OS provide thread-locals, there is a very different security model across them (especially on macOS), and because of that it's not clear that this option would be the best one there

florianl · 2026-06-05T09:04:19Z

+
+**Why this layout**: It is scalable; if a writer is configured to publish only the required attributes and no custom fields, it can set `attrs-data-size` to 0; having read the first portion of the structure, the reader stops without having to read the rest.
+
+**Cache Impact:** Likewise, a frugal writer may aim to keep the entire record under 64 bytes (the typical size of a cache line) in which case we might expect the entire record is in cache after the first read.  This format takes 28 bytes for the lead in, then an overhead of 2 bytes per key-value pair.  For a single pair, we'd have 28 bytes for the value; for two, 26 bytes total. If we expected to track attributes such as `path` and `method`, this means we'd realistically expect to fit in a single cache line of 64 bytes. We recommend keeping at or under 640 bytes for the total record [to match the limit](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/tree/main/design-docs/00002-custom-labels#proposed-solution) on the OTel eBPF Profiler.


For a single pair, we'd have 28 bytes for the value; for two, 26 bytes total.
This sounds unexpected to me. Could we rephrase it to make it clear how we get to 28 bytes for a single value but 26 bytes for two values?

florianl · 2026-06-05T09:09:14Z

+| valid           |            | uint8                              | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set  when they read.                          |
+| _reserved       |            | uint8                              | One spare byte here to align attrs-data-size at two byte boundary.                                                                                       |
+| attrs-data-size |            | uint16                             | Size of `attrs-data`. This lets the reader know when it has consumed all `attrs-data` records within the TLS buffer.                                       |
+| attrs-data      |            | uint8[]                            | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`.                                                      |


Do we want to have (key+length+val) to be aligned? And if so, should there be some padding?

Would alignment be useful here? My thinking is that in general the reader will want to read the whole structure in one go, so alignment wouldn't make a big difference to the performance-sensitive part; and for the writer, it doesn't seem like it'd make a big difference either?

(E.g. maybe this is a really long-winded way of asking "I'm curious what would be the advantage of alignment for these fields")

I think we should err on the side of preserving bytes here - similar intuition to @ivoanjo that we'd likely read the whole thing in a single stride if possible anyway.

I've added this to the field description to clarify as this has come up a couple times - lmk what you all think!

Entries in attrs-data are packed consecutively with no padding between entries;

florianl · 2026-06-05T09:11:18Z

+| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| trace-id        |            | uint8[16]                          | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set.                        |
+| span-id         |            | uint8[8]                           | In W3C Trace Context format.                                                                                                                             |
+| valid           |            | uint8                              | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set  when they read.                          |


What about making valid future proof and allow us to make some changes in the future by adding more flags?

Suggested change

| valid | | uint8 | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. |

| flag | | uint8 | This flag is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. All other values are reserved and treated as invalid |

In practice, we already have the _reserved we could use for flags, so that could be an alternative as well. (Although we're not mandating it should be all zeros for the first version of the spec and perhaps we should?)

(just adding that i'm actively not going to touch this one until we're all aligned on the need for valid)

Co-authored-by: Ivo Anjo <ivo@ivoanjo.me> Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

jsuereth · 2026-06-09T15:09:35Z

+
+> **Note:** The `threadlocal.*` keys are defined here rather than as semantic conventions because they are inter-process coordination metadata, not telemetry attributes, and are not expected to appear in OTLP exports.
+
+The exact format used will be the `repeated KeyValue` protobuf structure from the `ProcessContext.attributes` field standardized in OTEP-4719. A stringified representation of this showing the usage of the elements of that schema along with some example values:


Should we be preferring a write-optimised format here?

Protobufs are size-prefixed - so if you go to mutate this dictionary, you may need to rewrite the entire block or otherwise keep the block in memory with various optimisations. Is that desired?

jsuereth · 2026-06-09T15:11:03Z

+```
+
+**Why:** this mechanism separates static, process-scoped data from the TLS storage, so that a reader can read it once and not every time it samples a thread.
+This reduces the cost of both writing and reading thread samples, while retaining flexibility to store an arbitrary set of extra attributes on samples as required.


My concern here is whether or not we know the dictionary ahead of time - What does that look like in instrumentation?

I realize we may be able to do this in OBI - how does this work generically in an API/SDK component?

jsuereth · 2026-06-09T15:11:57Z

+| :-------------- | :--------- | :--------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| trace-id        |            | uint8[16]                          | In W3C Trace Context format. Zeroes can be used to indicate none active. If either of trace-id/span-id are set, both must be set.                        |
+| span-id         |            | uint8[8]                           | In W3C Trace Context format.                                                                                                                             |
+| valid           |            | uint8                              | This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set  when they read.                          |


You should also include trace flags here somewhere

jsuereth · 2026-06-09T15:13:31Z

+| _reserved       |            | uint8                              | One spare byte here to align attrs-data-size at two byte boundary.                                                                                       |
+| attrs-data-size |            | uint16                             | Size of `attrs-data`. This lets the reader know when it has consumed all `attrs-data` records within the TLS buffer. The total record is recommended to stay at or under 640 bytes. |
+| attrs-data      |            | uint8[]                            | A byte buffer containing the attributes themselves. Its total length is given by `attrs-data-size`.                                                      |
+|                 | [x].key    | uint8 (*See below for alternative) | Index into the key table. Readers MUST ignore entries whose key index is outside `threadlocal.attribute_key_map`.                                         |


IIUC - this means we need to update the overall dictionary from ANY thread.

I have concerns about that - how do we do it safely and notfiy any reader that this memory has changed? How often does this lead to a cache flush in hot paths?

jsuereth · 2026-06-09T15:15:15Z

+  * If it does, collect them
+* Note down the TLS offset for **Thread-Local Reference Data** discovered
+
+If the `threadlocal.*` keys are not present in the process context at this point, the reader should defer TLS symbol discovery and re-run step 1.2 when the process context is next updated. Readers can detect process context updates using the polling or prctl-hook mechanisms described in OTEP-4719.


This is the bit that makes me the most nervous - Does this mean we need to understand all possible keys at startup? Our APIs are dynamic here for what can go into context - does this come with a mechanism to lock that down?

OTEP: Thread Context: Sharing Thread-Level Information with the OpenT…

92164ae

…elemetry eBPF Profiler

scottgerring requested review from a team as code owners March 16, 2026 10:56

scottgerring marked this pull request as draft March 16, 2026 10:56

scottgerring force-pushed the profiling-tls-context branch from b0f4345 to 580100a Compare March 16, 2026 15:10

nsavoire mentioned this pull request Mar 20, 2026

feat: Add support for reading OTel thread context from native TLS open-telemetry/opentelemetry-ebpf-profiler#1229

Open

scottgerring and others added 5 commits March 26, 2026 10:46

Update OTEP number based on PR number

0e7c7c8

chore: address lints

e5d5798

Minor: Clarify trace-id/span-id are expected in W3C Trace Context format

31dbc10

chore(otep-4947): make it clear this is a generic mechanism, not only…

01dcdb4

… for eBPF profiler

chore(otep-4947): clarify memory ordering semantics

047db5a

scottgerring force-pushed the profiling-tls-context branch from 3cf6426 to 047db5a Compare March 26, 2026 09:47

chore(otep-4947): address text lints

0504443

scottgerring force-pushed the profiling-tls-context branch from e687c4b to 0504443 Compare March 26, 2026 13:28

Gui774ume mentioned this pull request Mar 26, 2026

[WIP][WP] Implementation of Otel 4947 thread context DataDog/datadog-agent#48446

Open

chore(otep-4947): introduce proper name for TLS var

4e16475

This was referenced Apr 2, 2026

feat!: root_span_id handling in otel thread ctx DataDog/libdatadog#1834

Merged

feat: replace v1 custom labels with OTEP 4947 TLS record format scottgerring/custom-labels#1

Closed

scottgerring and others added 3 commits April 8, 2026 09:39

Merge branch 'main' into profiling-tls-context

8e89fe8

update to reflect accepted status of OTEP-4719

6fad842

update textlint config to allow Local Storage

bc4958d

scottgerring marked this pull request as ready for review April 8, 2026 07:49

szegedi reviewed Apr 9, 2026

View reviewed changes

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

szegedi reviewed Apr 9, 2026

View reviewed changes

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

scottgerring changed the title ~~OTEP: Thread Context: Sharing Thread-Level Information with the OpenTelemetry eBPF Profiler~~ OTEP: Thread Context: Sharing Thread-Level Information with external readers Apr 9, 2026

Apply suggestions from code review

ffc4c4e

Co-authored-by: Attila Szegedi <szegedi@users.noreply.github.com>

scottgerring mentioned this pull request Apr 14, 2026

feat: Implement process context publishing (OTEP-4719) open-telemetry/opentelemetry-rust#3460

Open

4 tasks

NimrodAvni78 mentioned this pull request Apr 15, 2026

🏆 Epic: OTel API/SDK Integration open-telemetry/opentelemetry-ebpf-instrumentation#1140

Open

pellared requested a review from Copilot May 27, 2026 08:25

Copilot started reviewing on behalf of pellared May 27, 2026 08:25 View session

Copilot AI reviewed May 27, 2026

View reviewed changes

Comment thread CHANGELOG.md Outdated

Comment thread oteps/profiles/4947-thread-ctx.md

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

Comment thread oteps/profiles/4947-thread-ctx.md Outdated

florianl reviewed May 27, 2026

View reviewed changes

Comment thread oteps/profiles/4947-thread-ctx.md

address review feedback

47acef3

github-actions Bot removed the Stale label May 28, 2026

christos68k reviewed May 28, 2026

View reviewed changes

scottgerring commented May 29, 2026

View reviewed changes

umanwizard reviewed May 29, 2026

View reviewed changes

scottgerring and others added 2 commits June 1, 2026 11:47

Apply suggestions from code review

1c364ba

Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co> Co-authored-by: umanwizard <brennan@umanwizard.com>

more code review changes

7c7bd2c

scottgerring force-pushed the profiling-tls-context branch from bc54200 to 7c7bd2c Compare June 1, 2026 09:53

NimrodAvni78 mentioned this pull request Jun 1, 2026

[OTEP] Process Context: Sharing Registered Instrumentations with External Readers #5116

Open

6 tasks

review feedback

8c2ab83

ivoanjo reviewed Jun 2, 2026

View reviewed changes

mention not updating semconv

283eceb

gnurizen mentioned this pull request Jun 4, 2026

feat(rtld): refresh process state on dlopen via uprobe open-telemetry/opentelemetry-ebpf-profiler#1492

Closed

reyang reviewed Jun 4, 2026

View reviewed changes

florianl reviewed Jun 5, 2026

View reviewed changes

Apply suggestions from code review

7c7ef8d

Co-authored-by: Ivo Anjo <ivo@ivoanjo.me> Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

scottgerring requested a review from a team as a code owner June 9, 2026 11:39

address review comments

4e8cfd3

reyang approved these changes Jun 9, 2026

View reviewed changes

szegedi mentioned this pull request Jun 9, 2026

Port OTEP-4947 thread-context writer from polarsignals/custom-labels DataDog/pprof-nodejs#347

Draft

jsuereth reviewed Jun 9, 2026

View reviewed changes

jmacd approved these changes Jun 9, 2026

View reviewed changes


		This mechanism is designed to achieve the following goals:

		* Reader flexibility: Readers are not limited to eBPF-based implementations; any external process with sufficient system permissions to read `/proc/<pid>/maps` to identify loaded libraries, as well as sufficient permission to read from process memory should be able to use this mechanism (nb: [OTEP-4719](4719-process-ctx.md) also requires access to this resource)


		### Thread Local Context Record

		This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.

	In practice, context reads are expected to behave as if the thread whose context is being read is stopped or otherwise interrupted, and thus there can't be any concurrency hazards between reads and writes.
	In practice, context reads are expected to behave as if the thread whose context is being read is stopped or otherwise interrupted, and thus there can't be any concurrency hazards between reads and writes (as the writer logic will always execute as part of the thread).


		This is the attached thread record itself. SDK-side implementations may choose to hold multiple instances of this for active spans, and attach/detach them by setting the TLS to point to the appropriate entry. We err on the side of simplicity and support stringy (utf-8 bytes) attributes only.

		\| Name \| \| Data type \| Notes \|


		Because this mechanism relies on having a native component and knowing when a runtime switches contexts, we consider it optional for SDKs to support, as some runtimes (or even runtime versions) may not be able to feasibly/efficiently implement it.

		The TLSDESC-based publication mechanism and the in-memory Thread Local Context Record format are intentionally separable. Runtimes that cannot efficiently expose context through TLSDESC are not required to implement this mechanism. However, we highly recommend reusing the same in-memory record format when publishing equivalent context through a runtime-specific discovery mechanism, rather than defining a runtime-specific payload format. Such mechanisms are out of scope for this OTEP, but reusing the record layout where practical allows readers to share parsing logic across runtimes.


		When a request context is attached to a thread, the SDK:

		1. Sets the TLS pointer to 0 to ensure readers see no record during construction

	1. Sets the TLS pointer to 0 to ensure readers see no record during construction
	1. Sets the TLS pointer to 0 to dettach the old record from the thread

	Readers conforming to this specification MUST only read thread context while the target thread is stopped or interrupted (e.g. via eBPF perf events, ptrace-stop, or equivalent mechanisms).
	Readers conforming to this specification SHOULD read thread context while the target thread is stopped or interrupted (e.g. via eBPF perf events, ptrace-stop, or equivalent mechanisms).


		## Explanation

		We propose a mechanism for OpenTelemetry SDKs to publish thread-level information reflecting the context of the active request, if any, through a standard format based on the Linux-specific ELF Thread-Local Storage (TLS) TLSDESC dialect.


		Why this layout: It is scalable; if a writer is configured to publish only the required attributes and no custom fields, it can set `attrs-data-size` to 0; having read the first portion of the structure, the reader stops without having to read the rest.

		Cache Impact: Likewise, a frugal writer may aim to keep the entire record under 64 bytes (the typical size of a cache line) in which case we might expect the entire record is in cache after the first read. This format takes 28 bytes for the lead in, then an overhead of 2 bytes per key-value pair. For a single pair, we'd have 28 bytes for the value; for two, 26 bytes total. If we expected to track attributes such as `path` and `method`, this means we'd realistically expect to fit in a single cache line of 64 bytes. We recommend keeping at or under 640 bytes for the total record [to match the limit](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/tree/main/design-docs/00002-custom-labels#proposed-solution) on the OTel eBPF Profiler.

	\| valid \| \| uint8 \| This value is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. \|
	\| flag \| \| uint8 \| This flag is set to 1 when the record is valid. Consumers should ignore this record if any other value is set when they read. All other values are reserved and treated as invalid \|


		> Note: The `threadlocal.*` keys are defined here rather than as semantic conventions because they are inter-process coordination metadata, not telemetry attributes, and are not expected to appear in OTLP exports.

		The exact format used will be the `repeated KeyValue` protobuf structure from the `ProcessContext.attributes` field standardized in OTEP-4719. A stringified representation of this showing the usage of the elements of that schema along with some example values:

Conversation

scottgerring commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

scottgerring commented Apr 8, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

linux-foundation-easycla Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

christos68k left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scottgerring Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scottgerring Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

scottgerring Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yannham Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

scottgerring commented Mar 16, 2026 •

edited

Loading

linux-foundation-easycla Bot commented Apr 13, 2026 •

edited

Loading

scottgerring Jun 1, 2026 •

edited

Loading

scottgerring Jun 1, 2026 •

edited

Loading

scottgerring Jun 2, 2026 •

edited

Loading

yannham Jun 4, 2026 •

edited

Loading