[Agent] Support Node and Service autodiscovery in k8s provider #26801
ChrsMark merged 25 commits into elastic:master
Conversation
Signed-off-by: chrismark <chrismarkou92@gmail.com>
💚 Build Succeeded
MichaelKatsoulis
left a comment
LGTM. It is nice splitting the different resource watchers into their respective files.
Pinging @elastic/integrations (Team:Integrations)
Opening this for review. I tested it manually with the scenarios mentioned in this PR's description. I plan to add unit tests for the provider too, but the implementation should be ready for review now.
Note: the latest commit (43f5136) tries to improve data emission handling.
@ChrsMark thanks for working on this. Before going deep into implementation details, I would like to raise a concern: do we have plans to support discovery of multiple resources? With the current configuration it doesn't seem to be possible. There is no way to declare multiple providers of the same kind as in Beats, and …
I would see two options for that; one would be to make … Of course, with the current implementation there is also the option of running multiple agents, one for each kind of resource, but this may be a bit overkill and cumbersome for users.
That's a fair point @jsoriano, thanks for bringing this up! Especially now that the same Agent will run collectors for logs+metrics+uptime at the same time, using the same config, we should be able to provide such flexibility to our users. I think that splitting into different providers per k8s resource would make sense, but fields would be reported under a per-provider namespace like ${kubernetes_service.service.name} == 'kube-controller-manager', which I'm not sure is something we would like.
Another way to tackle this is to enable support for defining the k8s provider multiple times, but this most probably should happen at the controller provider's layer. @exekias any thoughts on this?
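For illustration, the two options being discussed could look roughly like this in the Agent configuration. This is a hypothetical sketch only — neither syntax is implemented in this PR, and all field names here are made up:

```yaml
# Option A: one provider per k8s resource kind, each with its own settings.
providers:
  kubernetes_pod:
    node: ${NODE_NAME}
  kubernetes_service:
    scope: cluster

# Option B: a single provider that accepts a list of resources to watch.
providers:
  kubernetes:
    resources:
      - kind: pod
      - kind: service
```

Option A gives per-resource field namespaces (the `kubernetes_service.*` naming concern raised above), while Option B keeps a single `kubernetes.*` namespace but requires the provider to multiplex several watchers internally.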
```go
// Check if resource is service. If yes then default the scope to "cluster".
if c.Resources.Service != nil {
	if c.Scope == "node" {
		logp.L().Warnf("can not set scope to `node` when using resource `Service`. resetting scope to `cluster`")
```
Should this logger be namespaced? I also see `logger` widely used in Agent; not sure if there is any preference here.
```go
type ResourceConfig struct {
	KubeConfig     string        `config:"kube_config"`
	Namespace      string        `config:"namespace"`
	SyncPeriod     time.Duration `config:"sync_period"`
	CleanupTimeout time.Duration `config:"cleanup_timeout" validate:"positive"`

	// Needed when resource is a Pod or Node
	Node string `config:"node"`
}
```
I'm wondering about use cases for resource-specific settings, do you have any in mind?
Now that I'm thinking of it again, maybe it is over-engineering to provide this option at the moment, since the base config shared by all the resources should cover the cases. Flexibility for different access settings per resource, or variant options, would be nice to think about, but we can wait and see if users actually need them. So, I will change it and move to a single config for all of the resources.
```go
func (c *Config) Validate() error {
	// Check if resource is service. If yes then default the scope to "cluster".
	if c.Resources.Service != nil {
		if c.Scope == "node" {
```
It's interesting that you can override almost all settings per resource, but not scope.
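The behavior under discussion can be sketched as follows. This is a hedged, trimmed-down reconstruction (hypothetical types, not the actual Agent code): a cluster-wide resource such as Service cannot be watched with a node-local scope, so validation falls back to "cluster" instead of failing.

```go
package main

import "fmt"

// ResourceConfig stands in for the real per-resource settings struct.
type ResourceConfig struct{}

// Config is a minimal stand-in for the provider config: a global scope
// plus an optional Service resource section.
type Config struct {
	Scope   string
	Service *ResourceConfig
}

// Validate resets an invalid scope rather than returning an error:
// watching Services only makes sense cluster-wide.
func (c *Config) Validate() error {
	if c.Service != nil && c.Scope == "node" {
		// The real code emits a logp warning here; a print stands in for it.
		fmt.Println("warn: can not set scope to `node` when using resource `Service`; resetting scope to `cluster`")
		c.Scope = "cluster"
	}
	return nil
}

func main() {
	c := &Config{Scope: "node", Service: &ResourceConfig{}}
	if err := c.Validate(); err != nil {
		panic(err)
	}
	fmt.Println(c.Scope) // scope has been reset to "cluster"
}
```

The design choice here (silent reset plus a warning, rather than a hard validation error) is what makes scope the one setting that cannot be overridden per resource.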
```go
	Node: c.Node,
}
if c.Resources.Pod == nil {
	c.Resources.Pod = baseCfg
```
What if the user only overrides resources.pod.namespace? Does that mean that the rest of the settings will be empty?
```go
func (n *node) emitRunning(node *kubernetes.Node) {
	data := generateNodeData(node)
	data.mapping["scope"] = n.scope
	if data == nil {
```
Can this happen, taking the previous line into account?
```go
n.emitStopped(node)
n.emitRunning(node)
```
In theory emitRunning should be enough here, right? This will AddOrUpdate.
```go
	return false
}

func getAddress(node *kubernetes.Node) string {
```
An explanatory comment here would help.
```go
// Pass annotations to all events so that it can be used in templating and by annotation builders.
annotations := common.MapStr{}
for k, v := range node.GetObjectMeta().GetAnnotations() {
	safemapstr.Put(annotations, k, v)
```
Should we be dedotting these?
I was thinking about adding this when we deal with metadata in general, but it's ok to add it now. Adding.
```go
"node": map[string]interface{}{
	"uid":    string(node.GetUID()),
	"name":   node.GetName(),
	"labels": node.GetLabels(),
```
Same for labels, should we dedot?
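For context, the dedotting being discussed typically means rewriting dots in label and annotation keys before indexing, since Elasticsearch interprets dots in field names as object separators. A minimal sketch (hypothetical helper name, not the actual Beats/Agent implementation, which uses libbeat utilities):

```go
package main

import (
	"fmt"
	"strings"
)

// deDotLabels rewrites "." to "_" in every key, so a Kubernetes label key
// like "app.kubernetes.io/name" no longer creates a nested object
// ("app" -> "kubernetes" -> "io/name") when the event is indexed.
func deDotLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		out[strings.ReplaceAll(k, ".", "_")] = v
	}
	return out
}

func main() {
	labels := map[string]string{"app.kubernetes.io/name": "nginx"}
	fmt.Println(deDotLabels(labels))
}
```

The flattened field type discussed later in this thread is the alternative that avoids rewriting keys at all.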
```go
providerDataChan := make(chan providerData)
done := make(chan bool, 1)
go generateContainerData(pod, containers, containerstatuses, providerDataChan, done)
```
Why use a channel for this? What would you think about emitting directly from the function?
Channel usage helps to isolate the generator function so that it can be tested with unit tests, following a yield-like approach.
Understood. You could also build a mocked emitter passed as a parameter to retrieve the results in the tests, right?
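The two testing approaches compared above can be sketched side by side. This is an illustrative reconstruction with hypothetical names (`providerData` mirrors the struct in the diff; the generator bodies are simplified):

```go
package main

import "fmt"

// providerData mirrors the event payload from the diff above, simplified.
type providerData struct {
	mapping map[string]interface{}
}

// generateViaChannel is the yield-like pattern: results are written to a
// channel and the channel is closed to signal completion. A test ranges
// over the channel to collect results.
func generateViaChannel(names []string, out chan<- providerData) {
	for _, n := range names {
		out <- providerData{mapping: map[string]interface{}{"container": n}}
	}
	close(out)
}

// generateViaEmitter is the mocked-emitter pattern: the same logic, but
// results are pushed through a callback, which a unit test can swap for a
// closure that appends to a slice.
func generateViaEmitter(names []string, emit func(providerData)) {
	for _, n := range names {
		emit(providerData{mapping: map[string]interface{}{"container": n}})
	}
}

func main() {
	ch := make(chan providerData)
	go generateViaChannel([]string{"nginx", "sidecar"}, ch)
	for d := range ch {
		fmt.Println(d.mapping["container"])
	}

	var collected []providerData // the "mocked emitter" sink
	generateViaEmitter([]string{"nginx"}, func(d providerData) {
		collected = append(collected, d)
	})
	fmt.Println(len(collected))
}
```

The emitter variant avoids goroutine plumbing in both production code and tests, at the cost of losing the natural backpressure a channel provides.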
Comments addressed and tested locally following the updated scenarios listed in the PR's description.
jsoriano
left a comment
This is looking good. Added a question about the future of dedotting and the possibility of using the flattened type instead.
Also I think we still need to polish some use cases that we recently handled in Beats, such as discovery of short-lived pods, crashing containers, ephemeral containers and so on, but this can be done as follow-ups. elastic/e2e-testing#1090 will help to validate these use cases 😇
```go
// Scope of the provider (cluster or node)
Scope            string `config:"scope"`
LabelsDedot      bool   `config:"labels.dedot"`
AnnotationsDedot bool   `config:"annotations.dedot"`
```
@exekias @ChrsMark wdyt about starting to use the flattened type in Agent instead of dedotting? There are still some trade-offs but they should eventually be addressed. Hopefully flattened is the future for this kind of fields.
More context here: https://github.com/elastic/obs-dc-team/issues/461
I'm not completely aware of the pros/cons, but at a quick view the flattened type looks better than dedotting to me, and timing-wise this sounds like a good moment to do the change.
Taking this into account I'm inclined to leave dedotting out of this PR and investigate the experience of using flattened for these fields, any thoughts? Also, in case we end up introducing it, I would like to be opinionated here and avoid adding any config parameter for it.
I'm particularly concerned about doing things like grouping some metric by a label, which is a valid use. I'm less concerned about annotations...
I'm +1 on leaving this out for now and opening a follow-up issue to work on dedotting/flattening.
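For reference, the flattened alternative discussed above keeps the original dotted keys and maps the whole labels/annotations object as a single Elasticsearch field of type `flattened`, whose leaf values are all indexed as keywords. A minimal sketch of such a mapping (the field paths are illustrative, not what Agent actually ships):

```json
{
  "mappings": {
    "properties": {
      "kubernetes": {
        "properties": {
          "labels": { "type": "flattened" },
          "annotations": { "type": "flattened" }
        }
      }
    }
  }
}
```

This sidesteps both mapping explosion from arbitrary label keys and the key rewriting that dedotting requires, at the cost of the flattened type's query limitations (e.g. all values treated as keywords).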
```go
// InitDefaults initializes the default values for the config.
func (c *Config) InitDefaults() {
	c.SyncPeriod = 10 * time.Minute
	c.CleanupTimeout = 60 * time.Second
```
We may be hitting this issue #20543; not sure how we can do something now depending on the kind of data we are collecting.
Yeap, the provider is not really aware of the inputs right now, but maybe it could be handled better in the future if we introduce a "smart" controller which enables providers according to the inputs.
```go
p.emitContainers(pod, pod.Spec.Containers, pod.Status.ContainerStatuses)

// TODO deal with init containers stopping after initialization
p.emitContainers(pod, pod.Spec.InitContainers, pod.Status.InitContainerStatuses)
```
TODO: add support for ephemeral containers.
Thanks! Testing should be improved for sure, and with a more complete/e2e approach. We have elastic/e2e-testing#1090 in our backlog, which I think can cover this need, and I'm thinking this would be better implemented after we have a more complete codebase, including metadata handling too.
jsoriano
left a comment
I think this can be merged, and we can iterate on details and specific use cases in future PRs.
blakerouse
left a comment
Looks good, thanks for all the fixes and changes!
(cherry picked from commit 6635acb)
What does this PR do?
This PR adds more resources in the `kubernetes` dynamic provider.

Why is it important?
So as to support Node and Service discovery via the `kubernetes` dynamic provider of Agent.

Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`

How to test this PR locally
- Test with `node` resource: use the `inspect` command to check what the compiled configuration is (use the proper path for the elastic-agent.yml config file): `./elastic-agent -c /Users/ubuntu/Desktop/elastic-agent.yml inspect output -o default`
- Test with `service` resource.
- Test with `pod` …
- Test with `service` resource and `node` scope. Verify that the scope is reset to `cluster` (`can not set scope to node when using resource Service. resetting scope to cluster`), and that the `kubernetes.scope` field is populated with the `cluster` value.
- Test with `pod` at node scope and define `node` …

Related issues