pkg/util/log: introduce new metrics to the logging package #111126
craig[bot] merged 8 commits into cockroachdb:master
Conversation
dhartunian
left a comment
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)
-- commits line 85 at r5:
nit: sink
pkg/util/log/buffered_sink.go line 492 at r5 (raw file):
```go
b.messages = b.messages[1:]
b.sizeBytes -= uint64(firstMsg.Len())
logging.metrics.IncrementCounter(BufferedSinkMessagesDropped, 1)
```
any chance we can get a unit test for this?
pkg/util/log/clog.go line 271 at r6 (raw file):
```go
// can be exercised via pkg/util/log's *own* init() function, which is called first.
// Therefore, we need to protect against this case.
if logging.metrics != nil {
```
do we need to worry about this conditional being added to a hot path?
pkg/util/log/clog.go line 272 at r6 (raw file):
```go
// Therefore, we need to protect against this case.
if logging.metrics != nil {
	logging.metrics.IncrementCounter(LogMessageCount, 1)
```
can we test that the counter gets incremented?
Force-pushed 2a68e7d to 0190f15
abarganier
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Previously, dhartunian (David Hartunian) wrote…
nit: sink
Done.
pkg/util/log/buffered_sink.go line 492 at r5 (raw file):
Previously, dhartunian (David Hartunian) wrote…
any chance we can get a unit test for this?
Similar to the thread where we increment LogMessageCount, is the goal to test that this is called? Or is it to test the functionality of IncrementCounter?
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
do we need to worry about this conditional being added to a hot path?
Good question - I just threw together a little toy benchmark:
```go
func doThing(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		if l != nil {
			l.IncrementCounter(log.LogMessageCount, 1)
		}
		work++
	}
}

func doThingWithoutConditional(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		l.IncrementCounter(log.LogMessageCount, 1)
		work++
	}
}

func doThingWithoutConditionalAndCounter() {
	work := 0
	for i := 0; i < 100; i++ {
		work++
	}
}

func BenchmarkCounters(b *testing.B) {
	l := newLogMetricsRegistry()
	b.Run("with conditional and counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThing(l)
			}
		}
	})
	b.Run("without conditional, but with counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditional(l)
			}
		}
	})
	b.Run("without conditional and no counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditionalAndCounter()
			}
		}
	})
}
```
Here are the results:
```
BenchmarkCounters
BenchmarkCounters/with_conditional_and_counter_increment
BenchmarkCounters/with_conditional_and_counter_increment-10            19623    159938 ns/op
BenchmarkCounters/without_conditional,_but_with_counter_increment
BenchmarkCounters/without_conditional,_but_with_counter_increment-10   10000    104556 ns/op
BenchmarkCounters/without_conditional_and_no_counter_increment
BenchmarkCounters/without_conditional_and_no_counter_increment-10     498702      2315 ns/op
```
...That's quite a big performance hit 😬.
This makes me think we should just drop this metric from the PR... I don't think the performance hit would be as drastic in the actual `outputLogEntry`, since it's doing a lot more than just incrementing a counter, but I don't think the value added here justifies a slower hot path for logging.
What do you think?
pkg/util/log/clog.go line 272 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
can we test that the counter gets incremented?
To clarify, is the test just to confirm that IncrementCounter is called here? Or do you wish to see the functionality within IncrementCounter tested?
If the latter, we have unit tests for this here: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/logmetrics/metrics_test.go#L22
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, abarganier (Alex Barganier) wrote…
(benchmark code, results, and discussion quoted above)
Your benchmark is flawed.
- That `go` statement in there is what is killing your performance.
- Don't do a `for` loop in your `doXXX` functions. Let the benchmark loop do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over `1..b.N`. The bench framework will adjust N automatically.

Once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.
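A sketch of that simplified shape, runnable as a plain program via `testing.Benchmark` (the counter here is a hypothetical `atomic.Int64` stand-in, since `LogMetricsRegistry` isn't reproduced in this thread):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// counter stands in for the real LogMetricsRegistry counter.
var counter atomic.Int64

// reg is a nil-able pointer, mirroring the injected logging.metrics.
var reg = &counter

func runBenchmarks() (withCheck, withoutCheck testing.BenchmarkResult) {
	// One flat loop over b.N: no goroutines, no nested loops. The
	// framework grows N until the timing stabilizes.
	withCheck = testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if reg != nil { // the conditional under discussion
				reg.Add(1)
			}
		}
	})
	withoutCheck = testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			reg.Add(1)
		}
	})
	return withCheck, withoutCheck
}

func main() {
	withCheck, withoutCheck := runBenchmarks()
	fmt.Println("with nil check:   ", withCheck)
	fmt.Println("without nil check:", withoutCheck)
}
```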
Thanks for the advice @knz! This is my first time writing Go benchmarks - it's pretty cool. Here are results from the revised benchmark:

This shows an ~11% performance hit. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer. @dhartunian @knz thoughts on this? These results feel acceptable to me. Also, for the curious, here's the code for the new benchmark:
Not quite yet -- as written, there's no guarantee that the same number of goroutines will be running by the time `wg.Wait` runs. Try this:

```go
func BenchmarkCountersWithoutConditionalAndCounter(b *testing.B) {
	doWork := make(chan struct{}, 10 /* at least 1 buffer entry per goroutine */)
	done := make(chan struct{})
	wg := &sync.WaitGroup{}
	doThingWithoutConditionalAndCounter := func() {
		defer wg.Done()
		for {
			select {
			case <-doWork:
				i := 0
				i++
			case <-done:
				return
			}
		}
	}
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go doThingWithoutConditionalAndCounter()
	}
	for i := 0; i < b.N; i++ {
		doWork <- struct{}{}
	}
	close(done)
	wg.Wait()
}
```

(also no need for a sub-test)
Thanks for the tweaks @knz - 3rd time's a charm? 😄 Here are the results given your suggestions:

This demonstrates a 15% hit to performance. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer.

So to summarize: if we keep this metric, we get easy observability into log volume (perhaps for the first time ever?). The cost is a bit of a performance hit. If we're not okay paying this cost, we can reduce the contention around the counters a bit by giving each log channel its own metric. Or, we can ditch the idea altogether. @knz @dhartunian what do y'all think?
More specifically, it corresponds to an extra 0.1µs per log call. Given the overall density of log calls (less than 100 per second on healthy clusters), the overhead is really 100 × 0.1µs / 1s = 0.001%. We can afford that.
knz
left a comment
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r7, 3 of 4 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
docs/generated/metrics/metrics.html line 1470 at r2 (raw file):
```html
<tr><td>SERVER</td><td>build.timestamp</td><td>Build information</td><td>Build Time</td><td>GAUGE</td><td>TIMESTAMP_SEC</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>go.scheduler_latency</td><td>Go scheduling latency</td><td>Nanoseconds</td><td>HISTOGRAM</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>log.fluent.sink.conn.errors</td><td>Number of connection errors experienced by fluent-server logging sinks</td><td>fluent-server log sink connection errors</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
```
You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.
Alternatively, you can also squash them together.
pkg/util/log/clog.go line 269 at r8 (raw file):
```go
// NB: Generally, we don't need to make this nil check. However, logging.metrics
// is injected from an external package's init() function, but this code path
// can be exercised via pkg/util/log's *own* init() function, which is called first.
```
how so? do you have a stack trace? I'd be more interested in making this non-nil in all cases to reduce complexity.
pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):
```go
l.mu.counters[log.MetricName(FluentSinkConnErrors.Name)] = l.mu.metricsStruct.FluentSinkConnErrors
l.mu.counters[log.MetricName(BufferedSinkMessagesDropped.Name)] = l.mu.metricsStruct.BufferedSinkMessagesDropped
l.mu.counters[log.MetricName(LogMessageCount.Name)] = l.mu.metricsStruct.LogMessageCount
```
If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:
- make this a slice
- create an enum that mirrors the counters
- use the enum in the increment API
- look up the metric by a simple slice index operation
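A minimal sketch of that enum-plus-slice shape (hypothetical names; the real registry stores `*metric.Counter` values rather than the bare atomics used here):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metric is an enum that mirrors the counters; its value doubles as an
// index, replacing the by-string map lookup.
type Metric int

const (
	FluentSinkConnErrors Metric = iota
	BufferedSinkMessagesDropped
	LogMessageCount
	numMetrics // keep last: sizes the backing array
)

// Registry is a toy stand-in for LogMetricsRegistry.
type Registry struct {
	counters [numMetrics]atomic.Int64
}

// IncrementCounter resolves the counter with a direct index operation.
func (r *Registry) IncrementCounter(m Metric, delta int64) {
	r.counters[m].Add(delta)
}

func main() {
	r := &Registry{}
	r.IncrementCounter(LogMessageCount, 1)
	r.IncrementCounter(LogMessageCount, 1)
	fmt.Println(r.counters[LogMessageCount].Load()) // prints 2
}
```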
Force-pushed 0190f15 to 6342cbd
Thanks for the feedback. I'll address soon.
Force-pushed 6342cbd to 74bc839
abarganier
left a comment
Thanks for the feedback - RFAL
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
docs/generated/metrics/metrics.html line 1470 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.
Alternatively, you can also squash them together.
Thanks for catching - I went ahead and regenerated properly for each commit.
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Your benchmark is flawed.
- That `go` statement in there is what is killing your performance.
- Don't do a `for` loop in your `doXXX` functions. Let the benchmark loop do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over `1..b.N`. The bench framework will adjust N automatically.

Once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.
Done.
pkg/util/log/clog.go line 269 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
how so? do you have a stack trace? I'd be more interested in makign this non-nil in all cases to reduce complexity.
I'm not aware of how we can possibly do this, given order of initialization. LMK if I'm overlooking something. I'll explain my thinking below.
The `LogMetricsRegistry` is injected into `pkg/util/log` via `pkg/util/log/logmetrics`. `pkg/util/log` is the one that's imported, so it has to be initialized first. As long as any logging call can be made within one of `pkg/util/log`'s own `init()` functions before `pkg/util/log/logmetrics` injects the `LogMetricsRegistry`, this nil possibility exists.
For the other metrics (buffered sink msg dropped, fluent conn errors), a way to exercise those code paths during init() doesn't exist, since these structures are only initialized after the customer's log config file is applied. Of course, if we some day changed our barebones default log config to include one of these, we could end up exercising these code paths within an init() logging call and hit this nil case.
I'm thinking we should define a package-level function within pkg/util/log with the same API that gets used, instead of using logging.metrics directly. We can make the nil check there, and then callers never have to think about it. Thoughts?
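A sketch of the package-level wrapper being proposed (hypothetical names, with an `atomic.Int64` standing in for the real counter type):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// registry is a toy stand-in for the injected LogMetrics implementation.
type registry struct {
	logMessageCount atomic.Int64
}

// metrics is injected by an external package's init(); it can still be
// nil while this package's own init() runs.
var metrics *registry

// incrementLogMessageCount is the proposed package-level wrapper: the
// nil check lives here once, so callers never think about it.
func incrementLogMessageCount() {
	if metrics != nil {
		metrics.logMessageCount.Add(1)
	}
}

func main() {
	incrementLogMessageCount() // before injection: a safe no-op
	metrics = &registry{}      // simulates the external injection
	incrementLogMessageCount()
	fmt.Println(metrics.logMessageCount.Load()) // prints 1
}
```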
pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:
- make this a slice
- create an enum that mirrors the counters
- use the enum in the increment API
- look up the metric by a simple slice index operation
Good idea - added a commit to the end that changes LogMetricsRegistry to work this way.
knz
left a comment
Reviewed 9 of 9 files at r12, 2 of 2 files at r13, 1 of 1 files at r14, 2 of 2 files at r15, 5 of 5 files at r16, 4 of 4 files at r17, 5 of 5 files at r18, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/clog.go line 269 at r8 (raw file):
As long as any logging call can be made within one of `pkg/util/log`'s own `init()` functions
yes I understand the cause effect relationship. I am just wondering why are there logging calls made in init code? I thought we got rid of that?
Have you observed this code failing due to the nil ptr? When it failed, what was the stack trace?
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
```go
l.mu.Lock()
defer l.mu.Unlock()
if metric < 0 || len(l.mu.counters) <= int(metric) {
```
the go runtime already does this check for you. no need to add extra code.
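To illustrate the point, the runtime's built-in bounds check panics on any out-of-range index, negative values included, without an explicit guard (a toy example, not the actual registry code):

```go
package main

import "fmt"

// increment has no explicit bounds guard: the Go runtime panics on any
// out-of-range index, negative values included.
func increment(counters []int64, metric int) {
	counters[metric]++
}

func main() {
	counters := make([]int64, 3)
	increment(counters, 1)
	fmt.Println(counters) // prints [0 1 0]

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	increment(counters, 5) // runtime does the check: index out of range
}
```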
Force-pushed 74bc839 to d7df1f2
abarganier
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/clog.go line 269 at r8 (raw file):
yes I understand the cause effect relationship.
Ack, sorry if that came across as patronizing at all!
I thought we got rid of that?
I think we did (assuming you're referring to #109901), but that doesn't mean some dev in the future can't reintroduce a stray logging call made somehow via init(). This nil check was meant to be defensive against such a case. I don't have a stack trace; I just added this check based on what we experienced in #109900.
I can see why panicking if logging.metrics is nil might be better. It would help protect against folks introducing more logging calls into init().
Would you prefer an explicit panic for such a thing? e.g. `if logging.metrics == nil { panic() }`? Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
the go runtime already does this check for you. no need to add extra code.
Oops - removed the `< 0` check.
knz
left a comment
Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/clog.go line 269 at r8 (raw file):
Ack, sorry if that came across as patronizing at all!
it did not! I just wanted to clarify where I was coming from.
but that doesn't mean some dev in the future can't reintroduce a stray logging call called somehow via
init().
This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.
Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?
Since the go runtime already does the panic call for you, adding another one looks like more work.
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Oops - removed the
< 0check.
ditto `metric >= len(l.mu.counters)` - go will panic for you if the value doesn't work.
We recently introduced metrics into the logging package, but unfortunately we did not prefix them properly. All metrics in the logging package should share the same `log.*` prefix, to clearly group them together. Luckily, only 1 log metric exists so far. This patch updates the metric name to have the `log.*` prefix.

Release note (ops change): This patch renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`. The addition of the `log.` prefix was to better group together logging-related metrics. The behavior and purpose of the metric remains unchanged.
The Measurement metadata for this metric was incorrect. This patch fixes it to better represent what's being measured.

Release note: none
The `log.fluent.sink.conn.errors` metric's metadata was missing the MetricType. This patch adds it.

Release note (ops change): This patch sets the Metric Type on the metric `log.fluent.sink.conn.errors`. Previously, the Metric Type was incorrectly left unset. Note that this is simply an update to the metric's metadata. The behavior and purpose of the metric remains unchanged.
Previously, a LogMetrics implementation was not provided to the logging package in tests. This could lead to tests that exercise code paths involving LogMetrics to experience problems like nil pointer errors. This patch assigns a dummy test implementation in the testing log scope setup, to avoid this case.

Release note: none
Previously, the MetricsStruct used by the logmetrics package was protected by the same mutex that protects the map of metric name to counter. However, the MetricsStruct is never written to after initialization. It's only read within `NewRegistry()` to dump the underlying counters into a new registry for in-process tenants. Since concurrent writes are not possible with this MetricsStruct (it's only read from), protection by this mutex is unnecessary.

In fact, the unnecessary mutex protection can cause a deadlock if a metric is incremented in the hot path for logging (e.g. something like once per log message as it passes through `outputLogEntry`). For example:

1. NewRegistry is called.
2. NewRegistry acquires and holds the mutex.
3. NewRegistry initializes a new registry, which eventually [makes a logging call](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/registry.go#L87).
4. The logging call makes its way through the logging code and attempts to increment a logmetrics counter.
5. `IncrementCounter` is called.
6. `IncrementCounter` attempts to acquire the mutex.
7. The mutex is already being held via step 2.
8. Deadlock!

By removing the unnecessary protection of the mutex for the MetricsStruct, we eliminate this possibility.

Release note: none
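The re-entrant acquisition described in those steps can be sketched as a toy model (hypothetical names; `TryLock` is used in place of `Lock` so the program reports the would-be deadlock instead of hanging):

```go
package main

import (
	"fmt"
	"sync"
)

// registry models the logmetrics registry with a single mutex.
type registry struct {
	mu sync.Mutex
}

// IncrementCounter models steps 5-7: it needs mu. TryLock lets this
// sketch detect the re-entrant acquisition rather than block forever.
func (r *registry) IncrementCounter() bool {
	if !r.mu.TryLock() {
		return false // mu already held: with Lock() this is a deadlock
	}
	defer r.mu.Unlock()
	return true
}

// NewRegistry models steps 1-4: it holds mu while work done under the
// lock indirectly triggers a logging call that increments a counter.
func (r *registry) NewRegistry() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.IncrementCounter() {
		fmt.Println("would deadlock: sync.Mutex is not re-entrant")
	}
}

func main() {
	r := &registry{}
	r.NewRegistry() // prints "would deadlock: sync.Mutex is not re-entrant"
}
```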
Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. If a writer attempts to append a log message to the buffer that would exceed this `max-buffer-size`, then the buffered log sink logic drops older messages to make room for the new. Previously, these dropped messages were not tracked in any way. A TODO was left to add a metric tracking them. This patch introduces a metric to do so: `log.buffered.messages.dropped`. It's shared across all buffered log sinks and counts the number of messages dropped from the buffer.

Release note (ops change): This patch introduces a new metric, `log.buffered.messages.dropped`. Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new. `log.buffered.messages.dropped` counts the number of messages dropped from the buffer. Note that the count is shared across all buffered logging sinks.
Force-pushed d7df1f2 to ac60323
abarganier
left a comment
Thanks @knz - ready for another look 👍
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/clog.go line 269 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Ack, sorry if that came across as patronizing at all!

it did not! I just wanted to clarify where I was coming from.

but that doesn't mean some dev in the future can't reintroduce a stray logging call made somehow via init().

This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.

Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?

Since the go runtime already does the panic call for you, adding another one looks like more work.
Sounds good! Removed the conditional here.
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
ditto `metric >= len(l.mu.counters)` - go will panic for you if the value doesn't work.
Ack - removed the conditional altogether
knz
left a comment
Nice
Reviewed 1 of 9 files at r20, 2 of 5 files at r25, 2 of 4 files at r26, 5 of 5 files at r27, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Force-pushed ac60323 to 94e211d
As discussed on Slack w/ @knz - a lot of tests that don't use the test server don't import the package that injects the log metrics registry. Reintroduced the nil check as a result.
This patch introduces the `log.messages.count` metric. The metric counts the number of messages logged, recording at the point of `outputLogEntry`, which all logging calls (e.g. `Info`, `Error`, etc.) commonly pass through. This metric will be helpful to better understand log volume and rates. Note that this does not capture the fanout of a single log message to multiple logging sinks.

Release note (ops change): This patch introduces the metric `log.messages.count`. This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.
This is a small optimization made to the logmetrics package. The log package previously provided a metric name string when incrementing a metric, which would prompt the logmetrics package to perform a map lookup. By using enum values instead, we can do direct index lookups. These log metrics are in the critical logging path, so these types of optimizations are worthwhile, especially when the effort is low (like here).

Release note: none
Force-pushed 94e211d to bdbb04d
TFTR! bors r=knz
Build succeeded.
See individual commit messages to review.
This PR introduces two new metrics to the logging package:

- `log.messages.count`
- `log.buffered.messages.dropped`

`log.messages.count`: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

`log.buffered.messages.dropped`: Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new.

This PR also renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`, for consistency.

Fixes: #72453
Release note (ops change): This patch sets the Metric Type on the metric `log.fluent.sink.conn.errors`. Previously, the Metric Type was incorrectly left unset. Note that this is simply an update to the metric's metadata. The behavior and purpose of the metric remains unchanged.
Release note (ops change): This patch introduces the metric `log.messages.count`. This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.
Release note (ops change): This patch introduces a new metric, `log.buffered.messages.dropped`. Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new. `log.buffered.messages.dropped` counts the number of messages dropped from the buffer. Note that the count is shared across all buffered logging sinks.