Log external requests#44286
Conversation
feab632 to
75a16a0
Compare
Bundle size report 📦
Look at the Statoscope report for a full comparison between the commits b232e8a and 3010217 or learn more. Open explanation
|
34ac077 to
36d8aea
Compare
|
Codenotify: Notifying subscribers in CODENOTIFY files for diff 3010217...b232e8a.
|
mrnugget
left a comment
There was a problem hiding this comment.
I did a first pass and left a bunch of comments with suggestions/questions. I didn't look at the frontend code yet, because I assume that my comments on GraphQL schema etc. might change frontend.
One comment on the screenshot though: can we make this really dense? 😄 We send a lot of requests and I think we should maybe have a single line per request and then uncollapse on click or something.
I'm very excited for this! :)
There was a problem hiding this comment.
It'd be consistent with the rest of our API if you name this after. See: https://sourcegraph.sourcegraph.com/search?q=context:global+content:%22after:+String%22+file:schema%5C.graphql+repo:sourcegraph%24&patternType=standard&sm=0&groupBy=repo
There was a problem hiding this comment.
Just like the comment on lastKey: I think OutBoundRequestConnection would be more idiomatic to return. Especially since you can then return the pageInfo with the next cursor to use.
There was a problem hiding this comment.
Hmm, at first I thought this was just a rename, but if you mean that I should change the structure to WebhookLogConnection's then my decision not to do that was intentional.
We always want to return all items (optionally filtered by "after"), we have no pagination and the total count is pretty redundant as well because we're always getting the items. So while I realized that was the most common solution, in this case I saw no value added by matching the format of WebhookLogConnection.
On the other hand, I found examples like pendingInvitations and outOfBandMigrations which followed the pattern I used, so this also seemed idiomatic.
WDYT?
There was a problem hiding this comment.
Fair enough! I assumed lastKey is cursor-based pagination, since it effectively works like endCursor.
There was a problem hiding this comment.
We always want to return all items (optionally filtered by "after")
Is this because the backend caps the responses automatically? What if we ever change the backend and decide to keep a longer history of these requests?
There was a problem hiding this comment.
Doesn't have to be in this PR, but just FYI: we also have this type https://github.com/sourcegraph/sourcegraph/blob/7d234ffe2f73177f6bcc8418499397fc4c71f858/cmd/frontend/graphqlbackend/schema.graphql#L8112-L8125
We could probably replace WebhookLogHeader with HTTPHeader in a follow-up PR.
There was a problem hiding this comment.
I don't think we should use pointers for a context.Context here. It's an identity-less object that can be wrapped/unwrapped and no caller/callee should care about a specific instance.
If you want to make it optional I suggest defining a different method.
There was a problem hiding this comment.
That's exactly the conclusion I came to last night after going to bed 😄
Yes, the original intention was to make it optional, but the solution didn't feel right.
Also, I'm not big on deciding pointers vs no pointers in Go yet. This is probably the area where I feel a lack of knowledge of best practices the most often in Go.
There was a problem hiding this comment.
Short docstring please :)
Also: would IsRiskyHeader be a better name? Key is awfully generic
There was a problem hiding this comment.
See above, I think you should just pass in context.Background() instead of making ctx optional
There was a problem hiding this comment.
I would assume that if logging-to-redis was turned off that this middleware would be a noop. What it does instead is to save the request and then delete it immediately, right?
There was a problem hiding this comment.
Ah, true, thank you! I added an if limit==0 check to the getter, but not the setter.
How much fun this function could've had saving and deleting all those items. 😿
I've added the check now.
|
@mrnugget I’ve responded to all your comments so far, and pushed all my commits. I’ll now work on the UI, remembering your ask “can we make this really dense? 😄” which also makes a lot of sense to me. |
There was a problem hiding this comment.
Reviewed the bits pertaining to the redis cache, as per my mention in the PR :)
On 2022-11-16, @bobheadxi added a new Redis function called ListKeys() here. This causes some duplication in functionality with my former solution GetAll(). I've decided to de-duplicate it, trying to keep the good parts of both. I'll reach out to Robert to make sure that my changes don't alter his solution too much. If he disagrees with my decision, it'll be easy to revert.
There was a problem hiding this comment.
| // ListKeysWithPrefix lists all keys associated with this cache, filtered by a prefix. | |
| // Use with care if you have long TTLs or no TTL configured. | |
| func (r *Cache) ListKeysWithPrefix(ctx context.Context, prefix string) (results []string, err error) { | |
| return r.listKeys(ctx, prefix) | |
| } | |
| // ListKeysWithPrefix lists all keys associated with this cache, filtered by a prefix in addition to the prefix already associated with this Cache instance. | |
| // Use with care if you have long TTLs or no TTL configured. | |
| func (r *Cache) ListKeysWithPrefix(ctx context.Context, prefix string) (results []string, err error) { | |
| return r.listKeys(ctx, prefix) | |
| } |
There was a problem hiding this comment.
Also see https://github.com/sourcegraph/sourcegraph/pull/44286#discussion_r1031756025 , I don't think we need these prefix variants at all - a redis cache instance in this case is already designed to represent a prefixed namespace within the cache. The cache you re-use is supposed to be for conditional requests, I would just create a new one for this new type of cache data
There was a problem hiding this comment.
The redisCache instance in package httpcli is used for caching results of APIs that support conditional requests: https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@a49c8c8b7a374deafb6841723f0afa535561c0b6/-/blob/internal/httpcli/client.go?L86-92
If you want to store a different type of data, in this case outbound requests, you should create a new outboundRequestsRedisCache and set a site-configurable TTL for it. Then you don't need to implement the various prefix variants of functions on the redis cache - each cache instance is already designed to represent a prefixed namespace within the cache, and creating new ones is not expensive or anything
There was a problem hiding this comment.
I absolutely agree, did these changes, and the code is of course a lot cleaner this way.
The only thing is that I don't see much value in making the TTL site-configurable. TTL for redisCache is also hard-coded. If customers need to configure it later, it's very easy to add a setting for this, so I'll just hard code this for now.
There was a problem hiding this comment.
For reference, here's how I made a live-configurable TTL: https://sourcegraph.com/github.com/sourcegraph/sourcegraph@81082be686593f286efe4febe54d3ec06df807cb/-/blob/enterprise/internal/authz/syncjobs/records_store.go?L44:1-63:2
We could potentially bundle this type of behaviour into redisCache itself as well. Configuration is useful in case e.g. the load of this is higher than expected, or you want a larger window for debugging data - up to you :)
There was a problem hiding this comment.
By default, only the last fifty requests are stored, and the intention is that they won't set this to a number higher than 1,000. Considering this, and that we are logging a request for each ping, the 7-day limit seems to be impossible to reach. So I think it's fine to leave it hard-coded.
|
Wow, awesome tips, @bobheadxi, thank you! I'll fix them tomorrow! |
There was a problem hiding this comment.
Should this be gql.DateTime? In practice we want users to provide a timestamp since the redis keys are all timestamps, that they are compared as strings and/or that the timestamp is the key is an implementation detail (so we should remove mentions of the key from the API documentation too)
There was a problem hiding this comment.
I think it's more an opaque after cursor, as I argued above. I think exposing it as DateTime might be a bit leaky, since it's irrelevant for the client. The client wants to say "give me the requests after this one if there are new ones" and shouldn't have to worry about timestamps etc., IMHO
There was a problem hiding this comment.
I don't have a strong stance on this, but I never intended to provide a "give me the items after date X" feature.
I even considered just an incremental int64 key; I went with dates because I found them better suited for the purpose, but the intention is still only to have "some incremental value", not a date.
There was a problem hiding this comment.
Ah, in that case can we change the docstring here to match other instances in the GraphQL API where we have this type of after cursor? e.g.
"""
An opaque cursor that is used for pagination.
"""
after: String
1f7e330 to
84d49af
Compare
There was a problem hiding this comment.
Do we (can we) filter them so that httpcli doesn't show up? Like here: https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/database/dbconn/dynamic_metadata.go?L106-117
There was a problem hiding this comment.
Sure, I definitely can, I just wasn't sure which one is more helpful info to show:
a.) the caller of NewExternalClientFactory (which, in some cases, is internal/httpcli/client.go:96)
b.) the caller of the internal/httpcli package.
I went with a.), but I can easily go with b.) – I guess your stance is that b.) is more useful?
There was a problem hiding this comment.
Hmmmm. I think I want to know where the request comes from, not where the client was instantiated, you know? But now that you mention it, both are useful.
There was a problem hiding this comment.
I agree that knowing where the call comes from is more useful. I just thought it's also worth logging which client it uses.
There was a problem hiding this comment.
Just in case this might be useful, but you could take a look at this code https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/internal/database/dbconn/dynamic_metadata.go?L59 which shows how to exclude certain package from the stacktrace.
There was a problem hiding this comment.
Thanks, @jhchabran! Thorsten pointed me to Eric's PR that added this, so I'm familiar with what this code does.
I think what we have now is already useful. I prefer to release the feature and collect feedback on it before further improving it. Then we'll see what improvements we see as useful.
philipp-spiess
left a comment
There was a problem hiding this comment.
Sorry I saw you mentioned the connection/pagination stuff so I got curious 😅 I’m mostly asking out of curiosity because I don't fully understand what's going on yet.
There was a problem hiding this comment.
This part needs to be inside a useEffect otherwise you risk refetching often when something outside of this component re-renders it for unknown reasons.
Also I’m not quite sure what's going on here:
- We make a network request and get back some data and some items.
- In case there are no items, we continue to poll the same request
- However if there are items, we update the network request with a new
aftercursor so that we only get newer items.
Is this correct?
What I miss is where do we append items to the existing array? Do they just disappear? 🤔
There was a problem hiding this comment.
@philipp-spiess Thanks, I've wrapped it into a useEffect!
You are absolutely right that this is weird! @mrnugget also didn't get this at first, so I added this comment to avoid further confusion. Basically, the refetch is only there to apply the change to the variables, that's all. We're throwing away the result for the refetch (which is, by the way, most likely empty because we've just loaded the data milliseconds before it—but it doesn't actually matter whether it's empty or not).
We're fetching the data with useQuery here.
There was a problem hiding this comment.
Thanks! Polling in Apollo is such a mess 🤕
There was a problem hiding this comment.
The number of items is somehow capped, right? Otherwise this might get really expensive.
There was a problem hiding this comment.
I’m not quite sure why we use a
FilteredConnectionhere if we don't have a pagination of some form (maybe I misread the comments)?
Just to provide the dropdown filter for success/failure, and the free-text search pretty much for free, without doing a custom implementation of a data display component with filters etc.
The number of items is somehow capped, right? Otherwise this might get really expensive.
Yes, it's capped at a limit set by the admin, 50 being the suggested default.
There was a problem hiding this comment.
We always want to return all items (optionally filtered by "after")
Is this because the backend caps the responses automatically? What if we ever change the backend and decide to keep a longer history of these requests?
Yes, the backend only stores a number of items (N=50 by default), and returns all of them for the first call (after being null), then only the ones with IDs larger than after. Even if the site admin sets it to something like 500, the data is transmitted only once, and each piece is fairly small.
There was a problem hiding this comment.
Just to provide the dropdown filter for success/failure, and the free-text search pretty much for free, without doing a custom implementation of a data display component with filters etc.
Makes sense! Yeah we really need the new filter components to be usable in a standalone way 😅
There was a problem hiding this comment.
Haha, I just proved that FilteredConnection can be abused to do client-side filtering 😆
The only annoying part is that ugly delay(200) here.
There was a problem hiding this comment.
We always want to return all items (optionally filtered by "after")
Is this because the backend caps the responses automatically? What if we ever change the backend and decide to keep a longer history of these requests?
There was a problem hiding this comment.
| """ | |
| Filters the requests by key, only returning keys greater than the given one. | |
| """ | |
| after: String | |
| """ | |
| An opaque cursor that is used for pagination. | |
| """ | |
| after: String |
There was a problem hiding this comment.
nit: GetMulti uses varargs, so we might want to keep consistency here with varargs as well:
| func (r *Cache) DeleteMulti(keys []string) { | |
| func (r *Cache) DeleteMulti(keys ...string) { |
There was a problem hiding this comment.
log.Error does not create a log entry, it creates a log field, so you need a logger here to actually log this log field - see https://docs.sourcegraph.com/dev/how-to/add_logging#writing-log-messages
There was a problem hiding this comment.
Logging in httpcli is complicated because clients are often instantiated as a package-level variable, before loggers are available
There was a problem hiding this comment.
It's a bit complicated, but you could set the error in request context and have it be picked up by the logging middleware:
There was a problem hiding this comment.
Same here, log.Error does not create a log entry, only a log field
- Also added RedisLogItem with a non-final format - At this point, it successfully logs to Redis - It seems to be causing some errors, though (I see weird stuff in the console)
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
| styles[node.method.toLowerCase() as keyof typeof styles] | ||
| )} | ||
| > | ||
| ● |
There was a problem hiding this comment.
Since this is only used for visual guidance, it should probably by wrapped with <VisuallyHidden/>
I've incorporated Robert's changes (https://github.com/sourcegraph/sourcegraph/pull/44861) in my final version of this PR.
eseliger
left a comment
There was a problem hiding this comment.
sorry for being late to the party! I had a few comments pending that I didn't get around finishing yet - nothing critical at all, just some comments/advice around GraphQL
| pageInfo: PageInfo! | ||
| } | ||
| """ | ||
| A single logged webhook delivery. |
| """ | ||
| type OutboundRequestConnection { | ||
| """ | ||
| A list of webhook logs. |
| nodes: [OutboundRequest!]! | ||
|
|
||
| """ | ||
| The total number of webhook logs in the connection. |
| """ | ||
| type OutboundRequest { | ||
| """ | ||
| The request log item ID. Looks like "2021-01-01T00_00_00.000000" – Same as the Redis keys. |
There was a problem hiding this comment.
No need to explain the format here, ids are meant to be opaque. Also, should this type implement Node? Otherwise it can't be queried by ID.
There was a problem hiding this comment.
Oh, true. It actually already implements Node, just I missed the implements Node here.
| """ | ||
| The total time the request took to complete. | ||
| """ | ||
| duration: Float! |
There was a problem hiding this comment.
what's the unit of this float? ms?
More so, does it have to be float or is ms resolution good enough for us?
There was a problem hiding this comment.
@eseliger Thanks for pointing it out! I also hate when an interface doesn't specify the unit of measurement; it's embarrassing that I feel into this trap now. It's seconds. Current precision is nanoseconds, but a millisecond resolution would probably be totally enough.
I'm seeing these solutions:
a.) Just clarify this in the comment. That is, add "The total time the request took to complete, in seconds."
b.) Change the name in the GraphQL to durationInSeconds
c.) Like b.), but also on the back end (it starts getting problematic here because to do this properly, I'd need to migrate the old format in Redis
d.) Change the graphQL and front-end to milliseconds and int, and change the name to durationMs.
e.) like d.), but back end also (also needs migrating old data)
I'd go with a.). b.) and d.) are also fine with me if you believe it helps much. I probably wouldn't put more effort into it than that.
I'll go with a.) until you respond!
There was a problem hiding this comment.
Yeah d feels good to me as it would also get around potential floating point precision issues, but if it's a ton of work don't bother with it.
| """ | ||
| An opaque cursor that is used for pagination. | ||
| """ | ||
| after: ID |
There was a problem hiding this comment.
after should be of type String usually as per relay, also would be good to stay consistent and also allow the first: Int property here - that'll help with the pagination effort
There was a problem hiding this comment.
after
Hmm, it was a string at first, but when I implemented the Node interface I had to change it to ID because I needed to unmarshal the base64 encoding. I can look into how to make this a string if you believe this is a mistake.
first
Context: In earlier stages, this endpoint only returned an array, I intentionally didn't want to match the GraphQL Cursor Connection Spec because it seemed like introducing too much dead code we won't use. But reviewers started requesting it to be more like our other endpoints, so by now it is almost like a normal endpoint. At this point, I agree that it'd make sense to add first, too. I'll add it!
There was a problem hiding this comment.
Hmm, it was a string at first, but when I implemented the Node interface I had to change it to ID because I needed to unmarshal the base64 encoding. I can look into how to make this a string if you believe this is a mistake.
I think it should be as simple as doing graphql.ID(arg.ID) when passing to the unmarshaller, the reason is simply that it should not be a client concern nor should the client be encouraged to pass an ID to the after field, so we can simply change what the cursor contains later, should we have a need for that. Doing that case is pretty standard in our code base fwiw, so I don't feel bad about it :)
Context: In earlier stages, this endpoint only returned an array, I intentionally didn't want to match the GraphQL Cursor Connection Spec because it seemed like introducing too much dead code we won't use. But reviewers started requesting it to be more like our other endpoints, so by now it is almost like a normal endpoint. At this point, I agree that it'd make sense to add first, too. I'll add it!
Perfect, thanks!
| "WebhookLog": func(ctx context.Context, id graphql.ID) (Node, error) { | ||
| return webhookLogByID(ctx, db, id) | ||
| }, | ||
| "OutboundRequest": func(ctx context.Context, id graphql.ID) (Node, error) { |
There was a problem hiding this comment.
This only works once the type also implements the Node interface in the graphql schema
|
@eseliger Thanks for the review! I'll fix these in a follow-up PR! |
|
@eseliger addressed everything in https://github.com/sourcegraph/sourcegraph/pull/44937 |
|
@vdavid looking at the code here, is there a reason you don't use redis list operations? It looks like you are basically simulating that, but relying on |
|
@jhchabran ☝🏼 |
|
@keegancsmith @vdavid yes, I wrote the |
|
@vdavid I might just port this code to |
|
@keegancsmith I agree. It was not the optimal choice and I considered switching to a LIST a few times during the feature development, but as the list is always short (usually 50 items, but there's a hard cap at 500) and it's an in-memory store, the gains didn't seem enough to justify the refactoring at the time. Thanks a lot if you're doing this, and let me know if I should jump in and help, happy to clean this up if it's a blocker. |
|
Just to make it clearer, the gains are in fact worth it since ListKeys will look at every key in the entire redis store, not just the keys related to slow requests. These aren't marshalled over the wire luckily, since match runs server side on redis. But if I am not mistaken since redis is just a big hash table, it doesn't have an effecient range query / prefix matching on keys. |


This PR starts logging all outgoing requests from Sourcegraph, and makes them discoverable on the web UI.
The feature will be off by default, except for development. Can be enabled in site config.
Notes
Back end
outboundRequestLogLimitsetting to a non-zero value in the site config. The suggested default is 50.dev-privateafteris optional):Front end
FilteredComponentfor the list.useQuery()withpollIntervaland a small hack to go around Variable recalculation for each polling request apollographql/apollo-client#3053.FilteredComponent, and a weird race condition started occurring insideFilterComponent, so after an hour of debugging, I found the problem and added an uglydelay(200)to the query function. Of course, it still feels quite fast, but... 🤦♂️Test plan
Tested these aspects:
Also created this 5-min Loom where I went through the above point. It's not great, not terrible, and probably isn't worth a watch 😄