Capture GenAI prompts and completions as events or attributes

The GenAI SIG has been discussing how to capture prompts and completion for a while and there are several issues that are blocked on this discussion (#1913, #1883, #1556) 

What we have today on OTel semconv is a set of structured events exported separately from spans. This was implemented in #980 and discussed in #834 and #829. The motivation to use events was to
- overcome size limits on attribute values by using event body
- use a signal that supports structured body and attributes
- have a clear 1:1 relationship between event name and structure (as opposed to polymorphic types or arrays of heterogeneous objects)
- make it possible and easy to consume individual events and prompts/completions without spans
- have verbosity controls

Turns out that:
- after ~9 months events are still not adopted by GenAI-focused tracing tools and their external instrumentation libs including [Arize](https://docs.arize.com/arize/llm-tracing/tracing/semantic-conventions), [Traceloop](https://www.traceloop.com/docs/openllmetry/contributing/semantic-conventions#llm-foundation-models), [Langtrace](https://github.com/Scale3-Labs/langtrace-trace-attributes/blob/main/schemas/llm_span_attributes.json) - all these providers use span attributes to capture prompts and completions.
- These backends consume prompts and completions along with spans and don't envision separating them - they store and visualize this data altogether

So, the GenAI SIG is re-litigating this decision taking into account backends' feedback and other relevant issues: #1621, #1912, https://github.com/open-telemetry/opentelemetry-specification/issues/4414

-------------------

**The fundamental question is whether this data belongs on the span or is a separate thing useful without a span.**

How it can be useful without a span:
- audit logs - https://cloud-native.slack.com/archives/C06KR7ARS3X/p1742322601090389?thread_ts=1741895340.932419&cid=C06KR7ARS3X - we could capture them on the request-response payloads where they are not unified/filtered/altered in other ways. But also audit logs have different delivery guarantees/storage/retention needs than telemetry
- some applications don't use tracing and rely on logs

To be useful without a span, events should probably duplicate some of the span attributes - endpoint, model used, input parameters, etc - it's not the case today

Are prompts/completions point-in-time telemetry?
- they don't really have a timestamp - prompts are input parameters, completion comes at the end of non-streaming call, buffered completion comes at the end of the streaming call. (#1701, https://github.com/open-telemetry/semantic-conventions/issues/1621#issuecomment-2552335590)
- Streaming chunks, if captured at all, would have timestamps (#1964)


**Arguably, from what we've seen so far, GenAI prompts and completion are used along with the spans and there is no great use-case for standalone events**

------------------

**Another fundamental question is how and if to capture unbounded (text, video, audio, etc) data on telemetry**

It's problematic because of:
- privacy - prompts can contain health concerns, ssns, addresses, names, etc. Apps that remain compliant with different regulators would have a problem of sharing this data with a broad audience of DevOps humans. The data should be accessible for evaluations, audit, but access should be restricted
- size - non-GenAI specific backends are not optimized for this and it's expensive to store such data in hot storage. 

Imagine, we had a solution that allowed us to store chat history somewhere and added a deep-link to that specific conversation to the telemetry - would we consider reporting this link as an event? We might, but we'd most likely have added this link as attribute on the span.

**Arguably, the long term solution to this problem is having this data stored separately from telemetry, but recorded by reference (e.g. URL on span that points to the chat history)**

------------------

TL;DR: 
- current approach doesn't work, we're blocked and need to find path forward. 
- GenAI-focused backends, innerloop scenarios, non-production apps would benefit from having prompts/completions stamped on the spans directly
- General-purpose observability backends and high-scale applications would have a problem with sensitive/large/binary data coming from end-users on telemetry anyway


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Capture GenAI prompts and completions as events or attributes #2010

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Capture GenAI prompts and completions as events or attributes #2010

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions