prototype OpContext#331
Conversation
smklein
left a comment
There was a problem hiding this comment.
I have a couple first impressions:
- Passing around the authn context seems great
- Passing around sloggers (and using them more!) seems great
- The question of "what should be handled within the
OpContextobject" vs "what should be handled by the slog KV pairs + slog drains" seems much fuzzier to me. Some thoughts left as comments below.
|
|
||
| created_instant: Instant, | ||
| created_walltime: SystemTime, | ||
| metadata: BTreeMap<String, String>, |
There was a problem hiding this comment.
Why not just use the KV list that exists within the logger? Why make this one distinct?
A perk of using the KV from slog is that it minimizes coupling across API boundaries. As a concrete example...
- If we want to pass a "log" object to
async-bb8-diesel, it only needs to add a dependency onslog. It can add whatever KV pairs it wants, and will implicitly log the KV pairs we have previously attached (such as request ID!). - If we want to pass a "OpContext" object (or just the
metadatafield) toasync-bb8-diesel, it will need to add a dependency on Nexus - or we'll need to factorOpContextinto a new crate. Additionally, it'll need to do all logging through theOpContextinterface to capture these variables - and frankly, why bother, whenslogalready exists?
There was a problem hiding this comment.
I think what you're getting at is that the common functionality between most of these things is the key-value metadata that lives in the slog::Logger. That's true -- in an ideal world, I would have a more general purpose crate that gives you hierarchical key-value contexts. Then I'd have slog use it. As it is today, slog takes ownership of that metadata. It's only available to you via a Drain, and then only when someone has logged a message. It feels very strange to me to phrase operations like bumping a counter, firing a DTrace probe, or recording timers as "send a log message". A bunch of things related to log messages don't apply (like the "level" and the "message" itself), and we'd need to build some programmatic logic with side effects atop the log messages. All of this feels very counter-intuitive to me as a user.
The problem you mention about wanting to use this in dependencies that don't know about Nexus -- that's a real problem. If this proves useful, we could separate out the Nexus-specific parts and pass a more general object down to those dependencies. There are also other ways to solve the problem, though. If we want to instrument database queries, for example, we can just create a common point in Nexus that does so.
| let created_instant = Instant::now(); | ||
| let created_walltime = SystemTime::now(); | ||
| let apictx = rqctx.context(); | ||
| let log = apictx.log.new(o!()); |
There was a problem hiding this comment.
Related to the comment above - seems a little silly to make this with an empty K/V pair, (o!()) and then immediately fill the metadata structure with string pairs. Seems like it might be handy to just have these all exist within [https://docs.rs/slog/2.7.0/slog/struct.OwnedKV.html
I'm curious about @bnaecker 's input here - https://github.com/oxidecomputer/slog-dtrace/blob/main/src/lib.rs seemed like an example where the drain itself can make more complex decisions about "how to use these K/V pairs", whereas the front-end part of the logger can just be concerned with dumping whatever context seems relevant.
There was a problem hiding this comment.
Yeah, this behavior is essentially trying to duplicate the slog behavior, but in a way that allows us to control what happens with the data. I do like the idea of using OwnedKV rather than BTreeMap here (which I know isn't exactly what you don't like about this, but is a good point about reusing a more complete interface here).
| created_instant: Instant, | ||
| created_walltime: SystemTime, | ||
| metadata: BTreeMap<String, String>, | ||
| kind: OpKind, |
There was a problem hiding this comment.
Similar question here. I think there's a general tension between "structure vs extensibility" here:
- On the "fully structured" side, we would have all the possible values that could be inserted into OpContext fully enumerated and strongly typed. Want to do an internal operation? It must be added it to the
OpKindenum. - On the "fully extensible" side, we'd just keep track of all these K/V pairs as strings. Anyone can add them, and
OpContextwould need no modifications for new strings to be added + dumped out through the logging/tracing backends.
"Where do we fit in on this scale" is what I'm grappling with here - why bother making the "Op Kind" a strongly typed field instead of an arbitrary K/V pair attached to the slogger? What would fit the bill for one side or the other?
In my opinion, we should be using strongly-typed data when we want to perform meaningful transformations of the data before sending it to a logging backend. As an example: If we want the logs to be filled with "relative time" operations, I think keeping tracked of the "created_time" values makes total sense - but I would still emit them to the logger as strings.
All that being said - slog drains can totally parse the K/V pairs to make meaningful decisions, which may be nice for keeping this work centralized in the drain itself. You mentioned a use case like "counters", "firing dtrace probes" , etc - certain values of keys can instruct drains to parse or ignore records in a particular way. By using drains here, we also would get the benefit that this information can be modulated entirely by log-levels.
There was a problem hiding this comment.
Yeah, this tension is real, and I feel I need more experience programming with a prototype to propose useful guidelines. Certainly, using types allows us to enforce some invariants like: the field must be present, it must conform to some known structure, it must include some other piece of data, etc. As you said, it also allows us to use it programmatically in a robust way, as we might expect to do for the create timestamps. You could make an argument that we should not support arbitrary key-value pairs at all, that we should instead have a (potentially huge) enum of all the specific contexts we might find ourselves in and the associated metadata for each one. I don't think that's as off-the-rails as it sounds. But it would be rather a pain to work with and I'm not sure it's worth it. Right now, I think I draw the line at fields that we know we're not programmatically consuming in any way -- they're just being serialized to DTrace or a log file. In that case, the benefits of easy changeability far outweigh the benefits of structure.
| created_instant: Instant, | ||
| created_walltime: SystemTime, |
There was a problem hiding this comment.
https://github.com/slog-rs/perf is a thing that exists - although I actually don't really love the implementation, I think they got one part right: tracking the "duration of an operation", and logging how long things happened at the end, is something that could be refactored to a new structure and embedded in this structure.
(I think this is, perhaps, functionally similar to what you're trying to do here, but the key part is that it could look like a "separate structure embedded within the Context", to help reduce coupling further)
|
So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace.
They are 100% _not_ a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation.
I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.
… On 21 Oct 2021, at 17:51, Sean Klein ***@***.***> wrote:
@smklein commented on this pull request.
In nexus/src/context.rs:
> + created_instant: Instant,
+ created_walltime: SystemTime,
https://github.com/slog-rs/perf is a thing that exists - although I actually don't really love the implementation, I think they got one part right: tracking the "duration of an operation", and logging how long things happened at the end, is something that could be refactored to a new structure and embedded in this structure.
(I think this is, perhaps, functionally similar to what you're trying to do here, but the key part is that it could look like a "separate structure embedded within the Context", to help reduce coupling further)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
Triage notifications on the go with GitHub Mobile for iOS or Android.
|
Thanks for that. I had come to the same conclusion in chat as well. |
We followed-up on this a bit in the control-plane chat, but https://tokio.rs/blog/2019-08-tracing has some examples of "trace-providers" - akin to slog drains- from the tokio tracing crate, that do appear to support improved parsing functionality of supplied K/V pairs. |
Yeah, we can use drains or subscribers. I think @bnaecker was responding to my comment about using |
|
Yeah Dave is right, my comment was directed at the question, why our own string map if we already have slog’s KV pairs? They’re not equivalent, which it sounds like y’all have discussed. The only way to get data back out is to serialize it, such as to a JSON object.
… On 21 Oct 2021, at 19:12, David Pacheco ***@***.***> wrote:
So it’ll take me some time to look at this and formulate a more nuanced response. But I wanted to point out something that @smklein alluded to, about slog’s KV pairs that I learned when building slog-dtrace. They are 100% not a general purpose map-like structure. I was expecting for example, a method to iterate over the keys or values. That does not exist. There is, as far as I could tell, nothing useful you can do with the KV pairs except pass them indirectly to child loggers, and to serialize them, and even then not using serde, but slog’s own parallel implementation. I just want to make clear that if we need pairs of data that we need to inspect or do anything meaningful with really, slog will not provide that.
Thanks for that. I had come to the same conclusion in chat as well.
We followed-up on this a bit in the control-plane chat, but https://tokio.rs/blog/2019-08-tracing has some examples of "trace-providers" - akin to slog drains- from the tokio tracing crate, that do appear to support improved parsing functionality of supplied K/V pairs.
Yeah, we can use drains or subscribers. I think @bnaecker was responding to my comment about using OwnedKV to store our metadata. I don't think we can do that if our goal is to get it back out in order to pass it to DTrace or the like. (We can still pursue the drain-based approach for all this.)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or unsubscribe.
Triage notifications on the go with GitHub Mobile for iOS or Android.
|
This PR is experimental. I'd like feedback about how this looks. If it's promising, I can go make the rest of the changes (which will be extensive) to plumb this through in more places.
Motivation
There are a bunch of things that I think we might want to do in the middle of virtually any code path in Nexus:
load_asynccould log a message that includes the current HTTP request id. Of course, it doesn't know anything about HTTP requests today, so this needs to be baked into whatever logger it gets. Slog already has a good pattern for this. This whole thing is kind of a generalization of that pattern.async-bb8-diesel. Like with logging, we might want that to use key-value pairs related to things async-bb8-diesel doesn't know anything about.Proposal
This PR prototypes a unified
OpContextthat bundles this stuff together. I've seen and used similar patterns a bunch in other places but this is the first time it's been unified and the first time I'm trying to apply it to Rust.The basic idea is that you create an
OpContextwhen you start an operation:The OpContext includes:
slog::Loggerso that you can always log a message, and it will include whatever key-value pairs are specific to this operation (e.g., request id)authn::Contextso that you can always do an authz check (not yet implemented -- this is kind of a first step)Instant(which we can use to measure latency of the overall operation or to any point in the request)kind, which describes what operation we're doing (currently either handling an external API request or running a background operation)The idea is that we pass the
OpContextaround basically everywhere. Think of passing this to every Datastore function so that it could log messages and record metrics. Maybe even down toasync-bb8-diesel'sload_async()and having it log the query in a message that includes the currentrequest_id. This particular example is tricky though -- see below.Lots of context
We already have a bunch of contexts in Nexus:
dropshot::RequestContextgives you information related to the current HTTP requestomicron_nexus::context::Contextgives you access to server-wide state (like a reference toNexusitself)steno::ActionContextgives you information related to the current saga action and some control interfacesauthn::Contexttells you about the actor that's currently authenticated. Eventually this will provide a handle for doing authorization checks.I think this pattern basically makes sense -- there's just a lot of different state, it's organized by its scope, and different scopes might have different combinations of context. I could also see the naming being confusing though.
Things I'm punting on
I want to defer all of this because I think we need experience with a prototype to make informed decisions here. But if folks think this stuff is important to nail down first (maybe because this feels like a bad direction), please say so.
OpContext. That way, you'd only need one object to reach all the others that were relevant for what you were doing. The ugly thing about this: eitherOpContexttakes type parameter(s) describing the contexts inside it -- in which case all the general-purpose functions that accept anOpContextneed to be parameterized, which feels annoying -- or else the other contexts are only available sometimes (e.g., ifkind == ExternalApiRequest, you get a dropshot::RequestContext), in which case it feels like a lot of code is going to be doingunwrap()because it knows its handling, say, an external API request, but that's not reflected in the type signature.LatencyTrackerin globalContext(like we do today) and use key-value pairs from the metadata.reqwestso that it can log and record metrics about client requests. But obviously it won't know about authn. So do we create some kind of general-purpose thing that can also store special-purpose data? This problem is very theoretical at this point because most general-purpose crates don't even accept a slog Logger today, let alone something else for metrics, etc. So I want to punt on it.Trying to solve all of these simultaneously seems really hard. I'm just looking to take a step in a useful direction here.