prototype OpContext by davepacheco · Pull Request #331 · oxidecomputer/omicron

davepacheco · 2021-10-21T19:05:26Z

This PR is experimental. I'd like feedback about how this looks. If it's promising, I can go make the rest of the changes (which will be extensive) to plumb this through in more places.

Motivation

There are a bunch of things that I think we might want to do in the middle of virtually any code path in Nexus:

Log a message with key-value pairs related to whatever operation we're doing. For example, it would be really nice if async-bb8-diesel's load_async could log a message that includes the current HTTP request id. Of course, it doesn't know anything about HTTP requests today, so this needs to be baked into whatever logger it gets. Slog already has a good pattern for this. This whole thing is kind of a generalization of that pattern.
Bump a counter (or any other Oximeter metric) with key-value pairs related to whatever operation we're doing. Similarly, we could imagine bumping a counter for queries executed inside async-bb8-diesel. Like with logging, we might want that to use key-value pairs related to things async-bb8-diesel doesn't know anything about.
Fire DTrace probes. Again, these might want access to current key-value pairs like loggers.
Do an authorization check. We don't do these today, but it'd be nice to get to a point where we're able to easily check these anywhere. I think it's important to get these deep in the stack. For example, datastore operations could require that the caller has checked that the user is allowed to do something that interacts with the database.

Proposal

This PR prototypes a unified OpContext that bundles this stuff together. I've seen and used similar patterns a bunch in other places but this is the first time it's been unified and the first time I'm trying to apply it to Rust.

The basic idea is that you create an OpContext when you start an operation:

implemented here: a background task (like saga recovery)
implemented here: handling an external API request (I only did this for "list organizations" so far)
not yet implemented: handling an internal API request
not yet implemented: applying this to sagas

The OpContext includes:

a slog::Logger so that you can always log a message, and it will include whatever key-value pairs are specific to this operation (e.g., request id)
an authn::Context so that you can always do an authz check (not yet implemented -- this is kind of a first step)
a creation Instant (which we can use to measure latency of the overall operation or to any point in the request)
a creation wall time (this can be useful in the debugger or in logs when correlating events across machines)
arbitrary key-value metadata. This is intended for the same reason you can put key-value pairs into slog::Loggers. But instead of going to the log, these could be used in DTrace probes or tracing spans
kind, which describes what operation we're doing (currently either handling an external API request or running a background operation)

The idea is that we pass the OpContext around basically everywhere. Think of passing this to every Datastore function so that it could log messages and record metrics. Maybe even down to async-bb8-diesel's load_async() and having it log the query in a message that includes the current request_id. This particular example is tricky though -- see below.

Lots of context

We already have a bunch of contexts in Nexus:

While handling an HTTP request, dropshot::RequestContext gives you information related to the current HTTP request
While handling an HTTP request, omicron_nexus::context::Context gives you access to server-wide state (like a reference to Nexus itself)
While inside a saga action, steno::ActionContext gives you information related to the current saga action and some control interfaces
In any authenticated code path, authn::Context tells you about the actor that's currently authenticated. Eventually this will provide a handle for doing authorization checks.

I think this pattern basically makes sense -- there's just a lot of different state, it's organized by its scope, and different scopes might have different combinations of context. I could also see the naming being confusing though.

Things I'm punting on

I want to defer all of this because I think we need experience with a prototype to make informed decisions here. But if folks think this stuff is important to nail down first (maybe because this feels like a bad direction), please say so.

Would it be useful to have child contexts? Slog and distributed tracing already have notions of this and it seems pretty useful. I hope this would be pretty straightforward to add but I haven't done it here because I wanted to start as small as possible.
How should this integrate with the existing contexts above? I toyed with the idea of putting the other contexts into OpContext. That way, you'd only need one object to reach all the others that were relevant for what you were doing. The ugly thing about this: either OpContext takes type parameter(s) describing the contexts inside it -- in which case all the general-purpose functions that accept an OpContext need to be parameterized, which feels annoying -- or else the other contexts are only available sometimes (e.g., if kind == ExternalApiRequest, you get a dropshot::RequestContext), in which case it feels like a lot of code is going to be doing unwrap() because it knows its handling, say, an external API request, but that's not reflected in the type signature.
How will this work with DTrace probes? I'm hoping at the very least, code could use the metadata inside the OpContext in probe arguments if they want.
How will this work with Oximeter-collected metrics? I'm hoping at the very least, we could provide handles to things like LatencyTracker in global Context (like we do today) and use key-value pairs from the metadata.
How will this work with distributed tracing? I haven't looked into this much at all, but I hope it doesn't look a lot worse than DTrace probes or metrics.
How does this work with components outside of Omicron? Right now, OpContext includes the authn context, which is Nexus-specific. In principle, we might want similar functionality for very general-purpose crates. You could imagine passing this to reqwest so that it can log and record metrics about client requests. But obviously it won't know about authn. So do we create some kind of general-purpose thing that can also store special-purpose data? This problem is very theoretical at this point because most general-purpose crates don't even accept a slog Logger today, let alone something else for metrics, etc. So I want to punt on it.

Trying to solve all of these simultaneously seems really hard. I'm just looking to take a step in a useful direction here.

smklein

I have a couple first impressions:

Passing around the authn context seems great
Passing around sloggers (and using them more!) seems great
The question of "what should be handled within the OpContext object" vs "what should be handled by the slog KV pairs + slog drains" seems much fuzzier to me. Some thoughts left as comments below.

smklein · 2021-10-21T19:19:49Z

+
+    created_instant: Instant,
+    created_walltime: SystemTime,
+    metadata: BTreeMap<String, String>,


Why not just use the KV list that exists within the logger? Why make this one distinct?

A perk of using the KV from slog is that it minimizes coupling across API boundaries. As a concrete example...

If we want to pass a "log" object to async-bb8-diesel, it only needs to add a dependency on slog. It can add whatever KV pairs it wants, and will implicitly log the KV pairs we have previously attached (such as request ID!).

If we want to pass a "OpContext" object (or just the metadata field) to async-bb8-diesel, it will need to add a dependency on Nexus - or we'll need to factor OpContext into a new crate. Additionally, it'll need to do all logging through the OpContext interface to capture these variables - and frankly, why bother, when slog already exists?

I think what you're getting at is that the common functionality between most of these things is the key-value metadata that lives in the slog::Logger. That's true -- in an ideal world, I would have a more general purpose crate that gives you hierarchical key-value contexts. Then I'd have slog use it. As it is today, slog takes ownership of that metadata. It's only available to you via a Drain, and then only when someone has logged a message. It feels very strange to me to phrase operations like bumping a counter, firing a DTrace probe, or recording timers as "send a log message". A bunch of things related to log messages don't apply (like the "level" and the "message" itself), and we'd need to build some programmatic logic with side effects atop the log messages. All of this feels very counter-intuitive to me as a user.

The problem you mention about wanting to use this in dependencies that don't know about Nexus -- that's a real problem. If this proves useful, we could separate out the Nexus-specific parts and pass a more general object down to those dependencies. There are also other ways to solve the problem, though. If we want to instrument database queries, for example, we can just create a common point in Nexus that does so.

smklein · 2021-10-21T20:06:35Z

+        let created_instant = Instant::now();
+        let created_walltime = SystemTime::now();
+        let apictx = rqctx.context();
+        let log = apictx.log.new(o!());


Related to the comment above - seems a little silly to make this with an empty K/V pair, (o!()) and then immediately fill the metadata structure with string pairs. Seems like it might be handy to just have these all exist within [https://docs.rs/slog/2.7.0/slog/struct.OwnedKV.html

I'm curious about @bnaecker 's input here - https://github.com/oxidecomputer/slog-dtrace/blob/main/src/lib.rs seemed like an example where the drain itself can make more complex decisions about "how to use these K/V pairs", whereas the front-end part of the logger can just be concerned with dumping whatever context seems relevant.

Yeah, this behavior is essentially trying to duplicate the slog behavior, but in a way that allows us to control what happens with the data. I do like the idea of using OwnedKV rather than BTreeMap here (which I know isn't exactly what you don't like about this, but is a good point about reusing a more complete interface here).

smklein · 2021-10-21T20:12:49Z

+    created_instant: Instant,
+    created_walltime: SystemTime,
+    metadata: BTreeMap<String, String>,
+    kind: OpKind,


Similar question here. I think there's a general tension between "structure vs extensibility" here:

On the "fully structured" side, we would have all the possible values that could be inserted into OpContext fully enumerated and strongly typed. Want to do an internal operation? It must be added it to the OpKind enum.

On the "fully extensible" side, we'd just keep track of all these K/V pairs as strings. Anyone can add them, and OpContext would need no modifications for new strings to be added + dumped out through the logging/tracing backends.

"Where do we fit in on this scale" is what I'm grappling with here - why bother making the "Op Kind" a strongly typed field instead of an arbitrary K/V pair attached to the slogger? What would fit the bill for one side or the other?

In my opinion, we should be using strongly-typed data when we want to perform meaningful transformations of the data before sending it to a logging backend. As an example: If we want the logs to be filled with "relative time" operations, I think keeping tracked of the "created_time" values makes total sense - but I would still emit them to the logger as strings.

All that being said - slog drains can totally parse the K/V pairs to make meaningful decisions, which may be nice for keeping this work centralized in the drain itself. You mentioned a use case like "counters", "firing dtrace probes" , etc - certain values of keys can instruct drains to parse or ignore records in a particular way. By using drains here, we also would get the benefit that this information can be modulated entirely by log-levels.

Yeah, this tension is real, and I feel I need more experience programming with a prototype to propose useful guidelines. Certainly, using types allows us to enforce some invariants like: the field must be present, it must conform to some known structure, it must include some other piece of data, etc. As you said, it also allows us to use it programmatically in a robust way, as we might expect to do for the create timestamps. You could make an argument that we should not support arbitrary key-value pairs at all, that we should instead have a (potentially huge) enum of all the specific contexts we might find ourselves in and the associated metadata for each one. I don't think that's as off-the-rails as it sounds. But it would be rather a pain to work with and I'm not sure it's worth it. Right now, I think I draw the line at fields that we know we're not programmatically consuming in any way -- they're just being serialized to DTrace or a log file. In that case, the benefits of easy changeability far outweigh the benefits of structure.

smklein · 2021-10-21T21:51:30Z

+    created_instant: Instant,
+    created_walltime: SystemTime,


https://github.com/slog-rs/perf is a thing that exists - although I actually don't really love the implementation, I think they got one part right: tracking the "duration of an operation", and logging how long things happened at the end, is something that could be refactored to a new structure and embedded in this structure.

(I think this is, perhaps, functionally similar to what you're trying to do here, but the key part is that it could look like a "separate structure embedded within the Context", to help reduce coupling further)

bnaecker · 2021-10-21T22:51:26Z

So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% _not_ a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.

…

On 21 Oct 2021, at 17:51, Sean Klein ***@***.***> wrote: @smklein commented on this pull request. In nexus/src/context.rs: > + created_instant: Instant, + created_walltime: SystemTime, https://github.com/slog-rs/perf is a thing that exists - although I actually don't really love the implementation, I think they got one part right: tracking the "duration of an operation", and logging how long things happened at the end, is something that could be refactored to a new structure and embedded in this structure. (I think this is, perhaps, functionally similar to what you're trying to do here, but the key part is that it could look like a "separate structure embedded within the Context", to help reduce coupling further) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

davepacheco · 2021-10-21T22:55:26Z

So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% not a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.

Thanks for that. I had come to the same conclusion in chat as well.

smklein · 2021-10-21T23:03:05Z

So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% not a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.

Thanks for that. I had come to the same conclusion in chat as well.

We followed-up on this a bit in the control-plane chat, but https://tokio.rs/blog/2019-08-tracing has some examples of "trace-providers" - akin to slog drains- from the tokio tracing crate, that do appear to support improved parsing functionality of supplied K/V pairs.

davepacheco · 2021-10-21T23:11:55Z

So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% not a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.

Thanks for that. I had come to the same conclusion in chat as well.

We followed-up on this a bit in the control-plane chat, but https://tokio.rs/blog/2019-08-tracing has some examples of "trace-providers" - akin to slog drains- from the tokio tracing crate, that do appear to support improved parsing functionality of supplied K/V pairs.

Yeah, we can use drains or subscribers. I think @bnaecker was responding to my comment about using OwnedKV to store our metadata. I don't think we can do that if our goal is to get it back out in order to pass it to DTrace or the like. (We can still pursue the drain-based approach for all this.)

bnaecker · 2021-10-22T01:59:57Z

Yeah Dave is right, my comment was directed at the question, why our own string map if we already have slog’s KV pairs? They’re not equivalent, which it sounds like y’all have discussed. The only way to get data back out is to serialize it, such as to a JSON object.

…

On 21 Oct 2021, at 19:12, David Pacheco ***@***.***> wrote: So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% not a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that. Thanks for that. I had come to the same conclusion in chat as well. We followed-up on this a bit in the control-plane chat, but https://tokio.rs/blog/2019-08-tracing has some examples of "trace-providers" - akin to slog drains- from the tokio tracing crate, that do appear to support improved parsing functionality of supplied K/V pairs. Yeah, we can use drains or subscribers. I think @bnaecker was responding to my comment about using OwnedKV to store our metadata. I don't think we can do that if our goal is to get it back out in order to pass it to DTrace or the like. (We can still pursue the drain-based approach for all this.) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe. Triage notifications on the go with GitHub Mobile for iOS or Android.

davepacheco · 2021-11-05T00:07:30Z

Superseded by #346 and #366.

prototype OpContext

df9133c

davepacheco requested review from bnaecker, smklein and teisenbe October 21, 2021 19:05

smklein reviewed Oct 21, 2021

View reviewed changes

davepacheco mentioned this pull request Oct 28, 2021

authz: first cut #346

Merged

5 tasks

davepacheco mentioned this pull request Nov 4, 2021

use OpContext during saga recovery #366

Merged

davepacheco closed this Nov 5, 2021

davepacheco deleted the opcontext branch November 5, 2021 00:07

Conversation

davepacheco commented Oct 21, 2021

Motivation

Proposal

Lots of context

Things I'm punting on

Uh oh!

smklein left a comment

Choose a reason for hiding this comment

Uh oh!

smklein Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

davepacheco Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

smklein Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

davepacheco Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

smklein Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

davepacheco Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

smklein Oct 21, 2021

Choose a reason for hiding this comment

Uh oh!

bnaecker commented Oct 21, 2021 via email

Uh oh!

davepacheco commented Oct 21, 2021

Uh oh!

smklein commented Oct 21, 2021

Uh oh!

davepacheco commented Oct 21, 2021

Uh oh!

bnaecker commented Oct 22, 2021 via email

Uh oh!

davepacheco commented Nov 5, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants