Skip to content

[8.17] (backport #17135) TBS: Log pubsub errors at error or warn level#17177

Merged
mergify[bot] merged 4 commits into8.17from
mergify/bp/8.17/pr-17135
Jun 10, 2025
Merged

[8.17] (backport #17135) TBS: Log pubsub errors at error or warn level#17177
mergify[bot] merged 4 commits into8.17from
mergify/bp/8.17/pr-17135

Conversation

@mergify
Copy link
Copy Markdown
Contributor

@mergify mergify bot commented Jun 10, 2025

Motivation/summary

  • Surface pubsub subscriber errors by logging at Warn level for 429 and Error level for others, instead of Debug level. Do not log context canceled. This may introduce noise to logs, e.g. when ES returns 429, but at worst it will only log one line every search interval (default = 1m/2 = 30s) which is acceptable.
  • Also change publisher logs from debug level to error level.

Checklist

- [ ] Update CHANGELOG.asciidoc to be done on release

  • Documentation has been updated

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Covered by unit tests

Related issues

Fixes #17117


This is an automatic backport of pull request #17135 done by [Mergify](https://mergify.com).

* Surface pubsub subscriber errors by logging at Warn level for 429 and Error level for others, instead of Debug level. Do not log context canceled. This may introduce noise to logs, e.g. when ES returns 429, but at worst it will only log one line every search interval (default = 1m/2 = 30s) which is acceptable.
* Also change publisher logs from debug level to error level.

(cherry picked from commit 6d41432)

# Conflicts:
#	x-pack/apm-server/sampling/pubsub/checkpoints.go
#	x-pack/apm-server/sampling/pubsub/pubsub.go
#	x-pack/apm-server/sampling/pubsub/pubsub_test.go
@mergify mergify bot added the backport label Jun 10, 2025
@mergify mergify bot requested a review from a team as a code owner June 10, 2025 20:40
@mergify mergify bot added the conflicts There is a conflict in the backported pull request label Jun 10, 2025
@mergify
Copy link
Copy Markdown
Contributor Author

mergify bot commented Jun 10, 2025

Cherry-pick of 6d41432 has failed:

On branch mergify/bp/8.17/pr-17135
Your branch is up to date with 'origin/8.17'.

You are currently cherry-picking commit 6d414320.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	both modified:   x-pack/apm-server/sampling/pubsub/checkpoints.go
	both modified:   x-pack/apm-server/sampling/pubsub/pubsub.go
	both modified:   x-pack/apm-server/sampling/pubsub/pubsub_test.go

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@mergify mergify bot merged commit 47c0018 into 8.17 Jun 10, 2025
12 of 14 checks passed
@mergify mergify bot deleted the mergify/bp/8.17/pr-17135 branch June 10, 2025 21:23
@ericywl ericywl mentioned this pull request Jun 17, 2025
1 task
@ericywl ericywl self-assigned this Jun 20, 2025
@ericywl
Copy link
Copy Markdown
Contributor

ericywl commented Jun 20, 2025

Results

Tested for v8.17.8 that some of the logs are showing as expected.

Search trace IDs error logged as WARN if TOO_MANY_REQUESTS:

{"log.level":"warn","@timestamp":"2025-06-20T16:14:47.404+0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling/pubsub.(*Pubsub).SubscribeSampledTraceIDs","file.name":"pubsub/pubsub.go","file.line":166},"message":"error searching for trace IDs","service.name":"apm-server","error":{"message":"index refresh request failed with status code 429: stop spamming!\n"},"position":{},"ecs.version":"1.6.0"}

Search trace IDs error logged as ERROR otherwise:

{"log.level":"error","@timestamp":"2025-06-20T16:18:32.393+0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling/pubsub.(*Pubsub).SubscribeSampledTraceIDs","file.name":"pubsub/pubsub.go","file.line":168},"message":"error searching for trace IDs","service.name":"apm-server","error":{"message":"index refresh request failed with status code 500: serious error!!!\n"},"position":{},"ecs.version":"1.6.0"}

The other logs look to be covered by the unit tests.

Testing Method

First, switch to 8.17 branch.

  1. Modify apm-perf moxy to return different errors on /<index_name>/_refresh
  2. Run it using go run ./cmd/moxy
  3. Build APM server using make apm-server
  4. Modify apm-server.yml with sampling enabled:
     sampling.tail:
        enabled: true
        interval: 1m
        policies:
          - sample_rate: 1
    
  5. Run ./build/apm-server-<os>-<arch> -c apm-server.yml -e -d "*"
  6. Run cd systemtest && go run ./cmd/sendotlp to send data to APM server

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport conflicts There is a conflict in the backported pull request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants