hubble-relay: implement flows reordering#11397
Conversation
gandro
left a comment
There was a problem hiding this comment.
The implementation looks good to me, one change requested (and one nit).
While I agree with the commit description that we want to test this in CI, it seems to me that this is also something that should be very easily testable by unit tests as well. This probably needs a bit of boiler-plate first (i.e. we want to be able to mock the different peers and the gRPC client), so I'm fine with merging this feature without unit tests, but I do believe that we should add unit tests for this.
| } | ||
| } | ||
| pq.Push(flow) | ||
| case <-time.After(1 * time.Second): // make sure to drain the queue when no new flow responses are received |
There was a problem hiding this comment.
nit: Since we are calling this for each flow and it is presumably creating a channel each time, I wonder if creating a time.NewTimer and calling Reset at the end of the loop might be slightly more efficient. But I actually don't know, just something that stood out to me.
There was a problem hiding this comment.
I could add a benchmark in the tests and actually measure this.
I'll mark this PR as draft while I implement unit tests. We're not in a hurry to merge it so I can dedicate time for that. |
99caa70 to
fda1329
Compare
|
@gandro I started to give a go at implementing unit tests but it would require a lot of refactoring as the existing code is not easily mockable. I'll create an issue and implement tests for the whole |
| // Copyright 2019 Authors of Hubble | ||
| // Copyright 2020 Authors of Cilium |
gandro
left a comment
There was a problem hiding this comment.
The cancellation logic looks good to me now, I don't see any point which can block forever.
There is one case where the ctx.Done() check will be delayed (see comment) however.
fda1329 to
893ffe5
Compare
|
@glibsm I made these values configurable (see |
|
test-me-please |
I disagree with that. We can provide sane defaults and have documentation for what these values do, but fine tuning is a per-environment thing. There is no way we can provide a default that works for everybody |
glibsm
left a comment
There was a problem hiding this comment.
Exposing through CLI is not blocking this particular PR though
|
test-me-please |
This priority queue implementation is not yet used and only relevant for hubble-relay. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Hubble-relay needs a priority queue for GetFlowsResponse objects, not Events. Note that making this priority queue more generic by using an interface instead is not possible as one would have to perform nil checks on an interface type (which would be asking for trouble). Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This field is pretty useless if not set. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This field is pretty useless if not set. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
This commits implements flows reordering for hubble-relay. The goal is to reorder flows based on their timestamp in a "best-effort" basis as waiting for all flows and buffering would have serious drawbacks such as extreme memory usage in some scenarios (and it would not work in follow-mode anyway as the number of flows is virtually infinite in such a case). This implementation uses a priority queue where Pop() returns the oldest element in the queue. In order to ensure that elements are popped out from the queue even when the queue is not full (typically for requests in follow-mode that don't result in a large number of flows), an element is tentatively popped out from the queue at least every second. I have tested this implementation on a 3 nodes cluster and both in follow-mode and non follow-mode. Unit tests need to be implemented but this is true for the relay package in general which is not easily mockable with the current implementation. Signed-off-by: Robin Hahling <robin.hahling@gw-computing.net>
893ffe5 to
faca702
Compare
|
test-me-please |
|
test-gke |
|
retest-4.9 |
|
retest-runtime |
|
retest-net-next |
|
In case of runtime tests, there is something else going on: |
|
And in case of |
|
No runtime tests are failing with this: And |
@errordeveloper I think this one is #11512 The other one seems to be a transient failure (alpine pkg database could not be reached thus building the docker image failed). |
|
retest-netnext |
|
retest-net-next |
|
retest-runtime |
Please, see individual commits for details (this PR also includes 2 fixes for hubble observer).
Closes #11223.