agent.flush() waits for inflight spans and errors by trentm · Pull Request #2372 · elastic/apm-agent-nodejs

trentm · 2021-10-14T17:23:02Z

fix: improve Agent#flush() to wait for inflight ended spans and errors

With this change an apm.flush() will wait for "inflight" spans and errors.
These are spans that have ended, but have not yet been encoded and sent
to the transport; and errors that are still capturing. Both of these are
necessarily asynchronous, mainly for stacktrace collection. This allows one
to synchronously call flush like this and have it work as expected:

span.end()
apm.captureError(err)
apm.flush()

This is helpful for flushing APM data for a finished Lambda function. This
change also includes some related (and unrelated) Lambda instrumentation
improvements:

Add trace logging for start/end/flush of lambda invocations.
captureError and set transaction outcome on lambda function errors
Ensure captured errors are associated with the transaction for this invoke.
Catch a top-level thrown exception in a non-async-function lambda handler.

Fixes: #2294

Checklist

Implement code
Remove debugging XXXs
Add tests
Add CHANGELOG.asciidoc entry

- missing some tests - doesn't yet handle inflight *errors* Will eventually fix #2294

…cleaner

ghost · 2021-10-14T17:45:39Z

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

Build stats

Start Time: 2021-10-21T15:52:37.003+0000
Duration: 25 min 19 sec
Commit: 295f798

Test stats 🧪

Test	Results
Failed	0
Passed	20
Skipped	0
Total	20

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

/test : Re-trigger the build.
run module tests for <modules> : Run TAV tests for one or more modules, where <modules> can be either a comma separated list of modules (e.g. memcached,redis) or the string literal ALL to test all modules
run benchmark tests : Run the benchmark test only.

…) sequencing; they both use the same write buffer/queue in the http client

… get transaction.id association; do NOT pass callback to captureError because it results in double-flush (first in agent.captureError, second from lambda wrapper handling). With inflight event handling in agent.flush we no longer need to wait for the error to process before calling agent.flush() when the lambda completes.

…transport.sendError callback -- it led to the Agent's default handleUncaughtException behaviour of process.exit(1)

…ed before agent.flush() handled this

astorm

Overall looks like a reasonable best effort solution -- if I'm reading the code correctly it looks like the flush will be passed on to the client as soon as the last item is deleted from the inflight set OR the timeout (currently one second) for the inflight set is reached.

Also, I gave this a spin with the last known working lambda extension and data flowed into APM Server correctly. I've got a few questions below for thoroughness, but this looks good as is. Approving.

astorm · 2021-10-19T23:24:41Z

lib/agent.js

-        cb(new Error('cannot capture error before agent is started'), id)
+        agent._transport.sendError(apmError)
+        inflightEvents.delete(id)
+        if (!handled || cb) {


I don't quite follow the reason for this change here -- it looks like this code has been removed from the sendError callback function, which (may?) chance the timing here subtly. Can you talk a bit about why this change was necessary?

Good question.

First, I removed the use of the sendError callback, because both the sent apmError and the flush token added to the http client Client by agent.flush() use the same queue: the Client stream (it inherites from stream.Writable). So there is no need to sequence apm.flush() asynchronously after sendError.

Second, waiting for the callback from sendError can actually delay the start of the intake request:

Client.prototype.sendError = function (error, cb) { if (this._isUnsafeToWrite() || this._shouldDropEvent()) { return } this._maybeCork() return this.write({ error }, Client.encoding.ERROR, cb) }

That _maybeCork will delay the start of an intake request, if there isn't one already by bufferWindowTime, which defaults to 20ms. I restored the "use the sendError callback" and added some logging and re-ran this test in test/agent.test.js:

test.only('#captureError()', function (t) { ... // Passing a callback to `captureError` means agent.flush() will be called. t.test('with callback', function (t) {

The logging shows a ~20ms delay from the "Sending error to Elastic APM" message to "intake request start":

[2021-10-20T22:11:14.313Z] INFO (elastic-apm-node): Sending error to Elastic APM: {"id":"52b6dfb48f99b95c923b3c186ca1df3e"} XXX corking XXX sync after sendError XXX uncorking1 (after bufferWindowTime) [2021-10-20T22:11:14.338Z] TRACE (elastic-apm-node): intake request start event.module: "apmclient" reqId: "69557af9d5c862a96a45eb4119397f5d" [2021-10-20T22:11:14.340Z] TRACE (elastic-apm-node): _write: encode object event.module: "apmclient" fullTimeMs: 5.267449 numEvents: 1 numBytes: 6831 [2021-10-20T22:11:14.341Z] TRACE (elastic-apm-node): intakeReq "socket": unref it event.module: "apmclient" reqId: "69557af9d5c862a96a45eb4119397f5d" XXX in callback from sendError [2021-10-20T22:11:14.341Z] TRACE (elastic-apm-node): _writeFlush event.module: "apmclient" active: true

The apm.flush() call comes after that in once sendError calls back.

With the currently proposed code the intake request is starts right away (a few ms of processing time):

[2021-10-20T22:04:42.229Z] INFO (elastic-apm-node): Sending error to Elastic APM: {"id":"96ef84fc56d978ab34e3201b392c1de0"} [2021-10-20T22:04:42.234Z] TRACE (elastic-apm-node): intake request start event.module: "apmclient" reqId: "33fd40c31c5ad99af7dd9de8ab8b0e40" [2021-10-20T22:04:42.235Z] TRACE (elastic-apm-node): _writeBatch event.module: "apmclient" encodeTimeMs: 0.690205 fullTimeMs: 5.48049 numEvents: 1 numBytes: 6871

because the apm.flush() called right after the sendError:

[2021-10-20T22:23:39.171Z] INFO (elastic-apm-node): Sending error to Elastic APM: {"id":"affa9a3c0dc1c8958515255657c3ea95"} XXX corking XXX sync after sendError XXX uncorking2 [2021-10-20T22:23:39.177Z] TRACE (elastic-apm-node): intake request start event.module: "apmclient" reqId: "6e2b4c7c250e96a3f5f9ea6e056a4e9b" [2021-10-20T22:23:39.178Z] TRACE (elastic-apm-node): _writeBatch

Because the Client.flush uncorks immediately:

Client.prototype.flush = function (cb) { this._maybeUncork()

Last, I'll admit to not being 100% sure there isn't some interaction difference if there an ongoing intake request and enough data on the Client buffer such that the just sent apmError will need to wait for one or more separate intake requests to get fully flushed. I haven't followed what happens with the this._write(error, cb) callback when it is passed to the stream chopper.

astorm · 2021-10-19T23:35:26Z

lib/agent.js

+//      inflightEvents.delete(id)
 Agent.prototype.flush = function (cb) {
-  if (this._transport) {
-    // TODO: Only bind the callback if the transport can't use AsyncResource from async hooks


It looks like we lost this comment in the new version. Is this something that's still relevant?

This was intentional. That TODO was added a long while back and there is no details that I found that explains why AsyncResource would be better and how it would be used. I think _instrumentation.bindFunction(cb) is the right thing to do to ensure that cb executes in the correct run context. If at some point our Run Context Manager uses AsyncResource to handle tracking, then great, but code like this should just use the Instrumentation class API.

trentm added 4 commits October 13, 2021 10:10

first pass at waiting for inflight events in agent.flush()

2f23501

- missing some tests - doesn't yet handle inflight *errors* Will eventually fix #2294

add support for inflight error events; move API to Agent where it is …

f43abb9

…cleaner

comment tweaks

f74a44b

Merge branch 'master' into trentm/lambda

9f91ea1

trentm self-assigned this Oct 14, 2021

github-actions bot added the agent-nodejs Make available for APM Agents project planning. label Oct 14, 2021

trentm added 14 commits October 14, 2021 12:15

fix 'make check'

3080a3a

fix tests for new required 'id' arg

a287305

fix AgentMock used in lambda tests to have an agent.logger

a9b28c9

do not need to use the sendError callback for subsequent agent.flush(…

b67e291

…) sequencing; they both use the same write buffer/queue in the http client

fix tests failures due to the change in captureError to not wait for …

236623f

…transport.sendError callback -- it led to the Agent's default handleUncaughtException behaviour of process.exit(1)

fix captureError arg order; improve mocks for lambda testing

ea2e962

catch a top-level exception in non-async-function handler

d4fa6a7

add unit tests for InflightEventSet class

ea0d80b

cleaning up logging, removing XXXs

ad4621f

ensure do not wrap lambda functions when non-active

acd2312

remove setTimeout 'wait for span to process' hacks in test suite need…

571aea9

…ed before agent.flush() handled this

fix lambda tests' agent mock for lambda.js usage

339484a

changelog entry

8bef489

trentm requested a review from astorm October 18, 2021 22:51

trentm marked this pull request as ready for review October 18, 2021 22:51

astorm previously approved these changes Oct 20, 2021

View reviewed changes

trentm added 2 commits October 20, 2021 15:56

Merge branch 'master' into trentm/lambda

0bed578

reduce some logging

295f798

trentm dismissed astorm’s stale review via 295f798 October 21, 2021 15:52

trentm merged commit e08ae4e into master Oct 21, 2021

trentm deleted the trentm/lambda branch October 21, 2021 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent.flush() waits for inflight spans and errors#2372

agent.flush() waits for inflight spans and errors#2372
trentm merged 20 commits intomasterfrom
trentm/lambda

trentm commented Oct 14, 2021 •

edited

Loading

Uh oh!

ghost commented Oct 14, 2021 •

edited by ghost

Loading

Build stats

Test stats 🧪

Uh oh!

astorm left a comment

Uh oh!

astorm Oct 19, 2021

Uh oh!

trentm Oct 20, 2021 •

edited

Loading

Uh oh!

astorm Oct 19, 2021

Uh oh!

trentm Oct 20, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

trentm commented Oct 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Checklist

Uh oh!

ghost commented Oct 14, 2021 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

💚 Build Succeeded

Build stats

Test stats 🧪

🤖 GitHub comments

Uh oh!

astorm left a comment

Choose a reason for hiding this comment

Uh oh!

astorm Oct 19, 2021

Choose a reason for hiding this comment

Uh oh!

trentm Oct 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

astorm Oct 19, 2021

Choose a reason for hiding this comment

Uh oh!

trentm Oct 20, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

trentm commented Oct 14, 2021 •

edited

Loading

ghost commented Oct 14, 2021 •

edited by ghost

Loading

trentm Oct 20, 2021 •

edited

Loading