Skip to content
This repository was archived by the owner on Sep 21, 2023. It is now read-only.

Add Elasticsearch output status reporter#239

Merged
fearful-symmetry merged 7 commits intoelastic:mainfrom
fearful-symmetry:es-output-report
Feb 14, 2023
Merged

Add Elasticsearch output status reporter#239
fearful-symmetry merged 7 commits intoelastic:mainfrom
fearful-symmetry:es-output-report

Conversation

@fearful-symmetry
Copy link
Copy Markdown
Contributor

@fearful-symmetry fearful-symmetry commented Feb 9, 2023

What does this PR do?

closes #174

This adds a reporting utility to the elasticsearch output, which accepts a callback that updates the unit state if the output fails for a given number of seconds.

I've tested this, but between #240 and elastic/beats#34319 it's a tad hard to test.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md or CHANGELOG-developer.md.

@fearful-symmetry fearful-symmetry added enhancement New feature or request Team:Elastic-Agent Label for the Agent team labels Feb 9, 2023
@fearful-symmetry fearful-symmetry requested a review from a team as a code owner February 9, 2023 22:49
@fearful-symmetry fearful-symmetry self-assigned this Feb 9, 2023
@fearful-symmetry fearful-symmetry requested review from leehinman and rdner and removed request for a team February 9, 2023 22:49
@mergify
Copy link
Copy Markdown
Contributor

mergify bot commented Feb 9, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Copy Markdown
Contributor

elasticmachine commented Feb 9, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-02-14T19:22:36.617+0000

  • Duration: 16 min 39 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

es.healthWatcher.Fail(err.Error())
es.logger.Errorf("couldn't add to bulk index request: %v", err)
// This event couldn't be attempted, so mark it as finished.
batch.Done(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the Done call here? IIRC it was needed for bookkeeping or else the batch won't be fully acknowledged.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, probably removed that by accident...

time.Sleep(time.Millisecond * 100)
watcher.Fail("simulated failure")
// should fail
time.Sleep(time.Millisecond * 600)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sleeps in these tests are some heavy costs to impose on every unit test run. Can we instead make the time callback a parameter of the watcher? Typical use could pass in time.Now (or maybe that could be the default) while unit tests could pass in a mocked timer so it can be tested deterministically without any delay.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I tinkered with this for a bit, but after I started running into race issues between the tests and the main watch loop, I decided it would be easier to just shift things around to make the sleeps faster, which is hopefully good enough.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm I see what you mean but it makes me sad. Probably the Right thing to do here is still to pass in a mockable helper to generate timer channels or something so we can make it deterministic, but that would require a lot of revisions for something that already mostly works. Maybe we can do a more robust generic helper when we have more components with health to report; I'll approve this one for now :-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, agreed, I'm not a fan of it either, and I experimented with adding a few helpers/callbacks, but it just felt like the added code complexity just wasn't worth making the tests fully deterministic, particularly after shortening the sleeps in the test.

@fearful-symmetry
Copy link
Copy Markdown
Contributor Author

/test

Copy link
Copy Markdown
Contributor

@faec faec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm pending some minor tweaks

select {
case <-ctx.Done():
return
default:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about replace default with <-time.After(hw.waitInterval) and then remove the sleep at the end of the loop? Similar effect but then sleeps still respect context cancellation.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good idea!

time.Sleep(time.Millisecond * 100)
watcher.Fail("simulated failure")
// should fail
time.Sleep(time.Millisecond * 600)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm I see what you mean but it makes me sad. Probably the Right thing to do here is still to pass in a mockable helper to generate timer channels or something so we can make it deterministic, but that would require a lot of revisions for something that already mostly works. Maybe we can do a more robust generic helper when we have more components with health to report; I'll approve this one for now :-)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

enhancement New feature or request Team:Elastic-Agent Label for the Agent team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The shipper should report its output unit as DEGRADED when it cannot publish events to Elasticsearch

3 participants