Call fleet-server audit/unenroll endpoint on uninstall#5302
Call fleet-server audit/unenroll endpoint on uninstall#5302michel-laterman merged 23 commits intoelastic:mainfrom
Conversation
|
This pull request does not have a backport label. Could you fix it @michel-laterman? 🙏
NOTE: |
Uninstalling a fleet-managed elastic-agent instance will now do a best-effort attempt to notify fleet-server of the agent removal so the agent may not appear as offiline.
34c721e to
bb4d789
Compare
|
This pull request is now in conflicts. Could you fix it? 🙏 |
|
@michel-laterman now that elastic/fleet-server#3818 has been merged, are you able to make progress on this PR here? |
|
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane) |
| func (e *AuditUnenrollRequest) Validate() error { | ||
| if e.Timestamp.IsZero() { | ||
| return &ReqError{fmt.Errorf("request timestamp not set")} | ||
| } | ||
| switch e.Reason { | ||
| case ReasonUninstall: | ||
| default: | ||
| return &ReqError{fmt.Errorf("unsupported reason: %s", e.Reason)} | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Shouldn't this validation be performed on the server side ? What happens if different version of agent and fleet have different validations?
There was a problem hiding this comment.
There is also request validation done server side; currently the agent's is more restrictive (only allowing one reason)
| mux := http.NewServeMux() | ||
| path := fmt.Sprintf(auditUnenrollPath, agentInfo.AgentID()) | ||
| mux.HandleFunc(path, authHandler(func(w http.ResponseWriter, r *http.Request) { | ||
| w.WriteHeader(http.StatusOK) |
There was a problem hiding this comment.
Why do we write a 200 OK header before validating the request body with the requires ?
maybe this handler can verify the request with asserts and return a 200 if everything is ok and something else (maybe a 400) if something is not what we expect ?
There was a problem hiding this comment.
this is a dummy handler; i don't expect that that the fleetapi level we care about the response (as the uninstall function controls retries), i'll move the write header to the last step
| response, err := info.ESClient.Get(".fleet-agents", agentID, info.ESClient.Get.WithContext(ctx)) | ||
| require.NoError(t, err) | ||
| defer response.Body.Close() | ||
| p, err := io.ReadAll(response.Body) | ||
| require.NoError(t, err) | ||
| require.Equalf(t, http.StatusOK, response.StatusCode, "ES status code expected 200, body: %s", p) | ||
| var res struct { | ||
| Source struct { | ||
| AuditUnenrolledReason string `json:"audit_unenrolled_reason"` | ||
| } `json:"_source"` | ||
| } |
There was a problem hiding this comment.
Isn't there a fleet endpoint to check the audit reason for the agent? I would prefer not to query directly a fleet index from agent...
There was a problem hiding this comment.
In an earlier attempt I tried using info.KibanaClient.SendWithContext(ctx, http.MethodGet, "/api/fleet/agents/"+agentID, nil, nil, nil) to get the agent, however the fleet api's return information does not include the audit_unenrolled_reason attribute
There was a problem hiding this comment.
I agree with @pchila that we should avoid directly querying ES indices from Agent. I've created elastic/kibana#194884 for GET /api/fleet/agents/:id to return audit_unenrolled_reason so we can update this test in the future to use that API when it's ready.
There was a problem hiding this comment.
Also created #5694 to add a TODO comment here to make the switch when the Fleet API is ready.
| case http.StatusOK: | ||
| pt.Describe("notify fleet-server success") | ||
| return | ||
| case http.StatusBadRequest, http.StatusUnauthorized, http.StatusConflict: |
There was a problem hiding this comment.
Why not retry on StatusConflict?
Added comment on why each of these are not retryable.
Co-authored-by: Paolo Chilà <paolo.chila@elastic.co> Co-authored-by: Blake Rouse <blake.rouse@elastic.co>
|
|
|
|
| } | ||
|
|
||
| if cfg != nil && !configuration.IsStandalone(cfg.Fleet) { | ||
| ai, err = info.NewAgentInfo(ctx, false) |
There was a problem hiding this comment.
The context here is set as ctx := context.Background(), the context should probably be an input into the Uninstall and tied to the lifetime of the uninstall command.
If you are 30s into a retry loop for the uninstall endpoint does CTRL-C correctly cause it to terminate?
There was a problem hiding this comment.
👍 , now passing cmd.Context() to Uninstall
|
blakerouse
left a comment
There was a problem hiding this comment.
Looks like all the review comments have been resolved, you have added the cmd.Context() in the places required to allow the ctrl-c to work.
Looks good!
* Call fleet-server audit/unenroll endpoint on uninstall Uninstalling a fleet-managed elastic-agent instance will now do a best-effort attempt to notify fleet-server of the agent removal so the agent may not appear as offiline. --------- Co-authored-by: Paolo Chilà <paolo.chila@elastic.co> Co-authored-by: Blake Rouse <blake.rouse@elastic.co> (cherry picked from commit 07c2a92)
* Call fleet-server audit/unenroll endpoint on uninstall Uninstalling a fleet-managed elastic-agent instance will now do a best-effort attempt to notify fleet-server of the agent removal so the agent may not appear as offiline. --------- Co-authored-by: Paolo Chilà <paolo.chila@elastic.co> Co-authored-by: Blake Rouse <blake.rouse@elastic.co> (cherry picked from commit 07c2a92) Co-authored-by: Michel Laterman <82832767+michel-laterman@users.noreply.github.com>




What does this PR do?
Uninstalling a fleet-managed elastic-agent instance will now do a best-effort attempt to notify fleet-server of the agent removal so the agent may not appear as offiline in the UI.
Requires fleet-server PR elastic/fleet-server#3818 to be merged in order for integration tests to succeed.
Why is it important?
Uninstalling the agent leaves inactive (offline) entries in the UI that clutter up the list.
This is an attempt to treat these agents similarly to agents that are unenrolled.
Checklist
I have made corresponding changes to the documentationI have made corresponding change to the default configuration files./changelog/fragmentsusing the changelog toolDisruptive User Impact
No disruptive impact, the change is a best-effort api call.
How to test this PR locally
TODO
Related issues