pkg/util/log: introduce new metrics to the logging package #111126
craig[bot] merged 8 commits into cockroachdb:master
Conversation
dhartunian
left a comment
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)
-- commits line 85 at r5:
nit: sink
pkg/util/log/buffered_sink.go line 492 at r5 (raw file):
```go
b.messages = b.messages[1:]
b.sizeBytes -= uint64(firstMsg.Len())
logging.metrics.IncrementCounter(BufferedSinkMessagesDropped, 1)
```
any chance we can get a unit test for this?
pkg/util/log/clog.go line 271 at r6 (raw file):
```go
// can be exercised via pkg/util/log's *own* init() function, which is called first.
// Therefore, we need to protect against this case.
if logging.metrics != nil {
```
do we need to worry about this conditional being added to a hot path?
pkg/util/log/clog.go line 272 at r6 (raw file):
```go
// Therefore, we need to protect against this case.
if logging.metrics != nil {
	logging.metrics.IncrementCounter(LogMessageCount, 1)
```
can we test that the counter gets incremented?
Force-pushed 2a68e7d to 0190f15
abarganier
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Previously, dhartunian (David Hartunian) wrote…
nit: sink
Done.
pkg/util/log/buffered_sink.go line 492 at r5 (raw file):
Previously, dhartunian (David Hartunian) wrote…
any chance we can get a unit test for this?
Similar to the thread where we increment LogMessageCount, is the goal to test that this is called? Or is it to test the functionality of IncrementCounter?
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
do we need to worry about this conditional being added to a hot path?
Good question - I just threw together a little toy benchmark:
```go
func doThing(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		if l != nil {
			l.IncrementCounter(log.LogMessageCount, 1)
		}
		work++
	}
}

func doThingWithoutConditional(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		l.IncrementCounter(log.LogMessageCount, 1)
		work++
	}
}

func doThingWithoutConditionalAndCounter() {
	work := 0
	for i := 0; i < 100; i++ {
		work++
	}
}

func BenchmarkCounters(b *testing.B) {
	l := newLogMetricsRegistry()
	b.Run("with conditional and counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThing(l)
			}
		}
	})
	b.Run("without conditional, but with counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditional(l)
			}
		}
	})
	b.Run("without conditional and no counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditionalAndCounter()
			}
		}
	})
}
```
Here are the results:
```
BenchmarkCounters
BenchmarkCounters/with_conditional_and_counter_increment
BenchmarkCounters/with_conditional_and_counter_increment-10            19623    159938 ns/op
BenchmarkCounters/without_conditional,_but_with_counter_increment
BenchmarkCounters/without_conditional,_but_with_counter_increment-10   10000    104556 ns/op
BenchmarkCounters/without_conditional_and_no_counter_increment
BenchmarkCounters/without_conditional_and_no_counter_increment-10     498702      2315 ns/op
```
...That's quite a big performance hit 😬.
This makes me think we should just drop this metric from the PR... I don't think the performance hit would be as drastic in the actual `outputLogEntry`, since it's doing a lot more than just incrementing a counter, but I don't think the value added here justifies a slower hot path for logging.
What do you think?
pkg/util/log/clog.go line 272 at r6 (raw file):
Previously, dhartunian (David Hartunian) wrote…
can we test that the counter gets incremented?
To clarify, is the test just to confirm that IncrementCounter is called here? Or do you wish to see the functionality within IncrementCounter tested?
If the latter, we have unit tests for this here: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/logmetrics/metrics_test.go#L22
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, abarganier (Alex Barganier) wrote…
(benchmark code, results, and discussion quoted above)
Your benchmark is flawed.
- That `go` statement in there is what is killing your performance.
- Don't do a `for` loop in your `doXXX` functions. Let the benchmark loop do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over `1..b.N`. The bench framework will adjust N automatically.

Once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.
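A sketch of that simplified shape, runnable as a plain program via `testing.Benchmark` (the counter here is a hypothetical `atomic.Int64` stand-in, since `LogMetricsRegistry` isn't reproduced in this thread):

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// counter stands in for the real LogMetricsRegistry counter.
var counter atomic.Int64

// reg is a nil-able pointer, mirroring the injected logging.metrics.
var reg = &counter

func runBenchmarks() (withCheck, withoutCheck testing.BenchmarkResult) {
	// One flat loop over b.N: no goroutines, no nested loops. The
	// framework grows N until the timing stabilizes.
	withCheck = testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if reg != nil { // the conditional under discussion
				reg.Add(1)
			}
		}
	})
	withoutCheck = testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			reg.Add(1)
		}
	})
	return withCheck, withoutCheck
}

func main() {
	withCheck, withoutCheck := runBenchmarks()
	fmt.Println("with nil check:   ", withCheck)
	fmt.Println("without nil check:", withoutCheck)
}
```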
Thanks for the advice @knz! This is my first time writing Go benchmarks - it's pretty cool. Here are results from the revised benchmark:

This shows an ~11% performance hit. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer. @dhartunian @knz thoughts on this? These results feel acceptable to me. Also, for the curious, here's the code for the new benchmark:
Not quite yet -- as written, there's no guarantee that the same number of goroutines will be running by the time `wg.Wait` runs. Try this:

```go
func BenchmarkCountersWithoutConditionalAndCounter(b *testing.B) {
	doWork := make(chan struct{}, 10 /* at least 1 buffer entry per goroutine */)
	done := make(chan struct{})
	wg := &sync.WaitGroup{}
	doThingWithoutConditionalAndCounter := func() {
		defer wg.Done()
		for {
			select {
			case <-doWork:
				i := 0
				i++
			case <-done:
				return
			}
		}
	}
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go doThingWithoutConditionalAndCounter()
	}
	for i := 0; i < b.N; i++ {
		doWork <- struct{}{}
	}
	close(done)
	wg.Wait()
}
```

(also no need for a sub-test)
Thanks for the tweaks @knz - 3rd time's a charm? 😄 Here are the results given your suggestions:

This demonstrates a 15% hit to performance. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer.

So to summarize: if we keep this metric, we get easy observability into log volume (perhaps for the first time ever?). The cost is a bit of a performance hit. If we're not okay paying this cost, we can reduce the contention around the counters a bit by giving each log channel its own metric. Or, we can ditch the idea altogether. @knz @dhartunian what do y'all think?
More specifically, it corresponds to an extra 0.1µs per log call. Given the overall density of log calls (less than 100 per second on healthy clusters), the overhead is really 100 × 0.1µs / 1s = 0.001%. We can afford that.
knz
left a comment
Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r7, 3 of 4 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
docs/generated/metrics/metrics.html line 1470 at r2 (raw file):
```html
<tr><td>SERVER</td><td>build.timestamp</td><td>Build information</td><td>Build Time</td><td>GAUGE</td><td>TIMESTAMP_SEC</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>go.scheduler_latency</td><td>Go scheduling latency</td><td>Nanoseconds</td><td>HISTOGRAM</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>log.fluent.sink.conn.errors</td><td>Number of connection errors experienced by fluent-server logging sinks</td><td>fluent-server log sink connection errors</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>
```
You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.
Alternatively, you can also squash them together.
pkg/util/log/clog.go line 269 at r8 (raw file):
```go
// NB: Generally, we don't need to make this nil check. However, logging.metrics
// is injected from an external package's init() function, but this code path
// can be exercised via pkg/util/log's *own* init() function, which is called first.
```
how so? do you have a stack trace? I'd be more interested in making this non-nil in all cases to reduce complexity.
pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):
```go
l.mu.counters[log.MetricName(FluentSinkConnErrors.Name)] = l.mu.metricsStruct.FluentSinkConnErrors
l.mu.counters[log.MetricName(BufferedSinkMessagesDropped.Name)] = l.mu.metricsStruct.BufferedSinkMessagesDropped
l.mu.counters[log.MetricName(LogMessageCount.Name)] = l.mu.metricsStruct.LogMessageCount
```
If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:
- make this a slice
- create an enum that mirrors the counters
- use the enum in the increment API
- look up the metric by a simple slice index operation
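A minimal sketch of that enum-plus-slice shape (hypothetical names; the real registry stores `*metric.Counter` values rather than the bare atomics used here):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metric is an enum that mirrors the counters; its value doubles as an
// index, replacing the by-string map lookup.
type Metric int

const (
	FluentSinkConnErrors Metric = iota
	BufferedSinkMessagesDropped
	LogMessageCount
	numMetrics // keep last: sizes the backing array
)

// Registry is a toy stand-in for LogMetricsRegistry.
type Registry struct {
	counters [numMetrics]atomic.Int64
}

// IncrementCounter resolves the counter with a direct index operation.
func (r *Registry) IncrementCounter(m Metric, delta int64) {
	r.counters[m].Add(delta)
}

func main() {
	r := &Registry{}
	r.IncrementCounter(LogMessageCount, 1)
	r.IncrementCounter(LogMessageCount, 1)
	fmt.Println(r.counters[LogMessageCount].Load()) // prints 2
}
```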
Force-pushed 0190f15 to 6342cbd
Thanks for the feedback. I'll address soon.
Force-pushed 6342cbd to 74bc839
abarganier
left a comment
Thanks for the feedback - RFAL
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
docs/generated/metrics/metrics.html line 1470 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.
Alternatively, you can also squash them together.
Thanks for catching - I went ahead and regenerated properly for each commit.
pkg/util/log/clog.go line 271 at r6 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Your benchmark is flawed.
- That `go` statement in there is what is killing your performance.
- Don't do a `for` loop in your `doXXX` functions. Let the benchmark loop do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over `1..b.N`. The bench framework will adjust N automatically.

Once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.
Done.
pkg/util/log/clog.go line 269 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
how so? do you have a stack trace? I'd be more interested in makign this non-nil in all cases to reduce complexity.
I'm not aware of how we can possibly do this, given order of initialization. LMK if I'm overlooking something. I'll explain my thinking below.
The `LogMetricsRegistry` is injected into `pkg/util/log` via `pkg/util/log/logmetrics`. `pkg/util/log` is the one that's imported, so it has to be initialized first. As long as any logging call can be made within one of `pkg/util/log`'s own `init()` functions before `pkg/util/log/logmetrics` injects the `LogMetricsRegistry`, this nil possibility exists.
For the other metrics (buffered sink msg dropped, fluent conn errors), a way to exercise those code paths during init() doesn't exist, since these structures are only initialized after the customer's log config file is applied. Of course, if we some day changed our barebones default log config to include one of these, we could end up exercising these code paths within an init() logging call and hit this nil case.
I'm thinking we should define a package-level function within pkg/util/log with the same API that gets used, instead of using logging.metrics directly. We can make the nil check there, and then callers never have to think about it. Thoughts?
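A sketch of the package-level wrapper being proposed (hypothetical names, with an `atomic.Int64` standing in for the real counter type):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// registry is a toy stand-in for the injected LogMetrics implementation.
type registry struct {
	logMessageCount atomic.Int64
}

// metrics is injected by an external package's init(); it can still be
// nil while this package's own init() runs.
var metrics *registry

// incrementLogMessageCount is the proposed package-level wrapper: the
// nil check lives here once, so callers never think about it.
func incrementLogMessageCount() {
	if metrics != nil {
		metrics.logMessageCount.Add(1)
	}
}

func main() {
	incrementLogMessageCount() // before injection: a safe no-op
	metrics = &registry{}      // simulates the external injection
	incrementLogMessageCount()
	fmt.Println(metrics.logMessageCount.Load()) // prints 1
}
```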
pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:
- make this a slice
- create an enum that mirrors the counters
- use the enum in the increment API
- look up the metric by a simple slice index operation
Good idea - added a commit to the end that changes LogMetricsRegistry to work this way.
knz
left a comment
Reviewed 9 of 9 files at r12, 2 of 2 files at r13, 1 of 1 files at r14, 2 of 2 files at r15, 5 of 5 files at r16, 4 of 4 files at r17, 5 of 5 files at r18, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/clog.go line 269 at r8 (raw file):
As long as any logging call can be made within one of `pkg/util/log`'s own `init()` functions
yes I understand the cause effect relationship. I am just wondering why are there logging calls made in init code? I thought we got rid of that?
Have you observed this code failing due to the nil ptr? When it failed, what was the stack trace?
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
```go
l.mu.Lock()
defer l.mu.Unlock()
if metric < 0 || len(l.mu.counters) <= int(metric) {
```
the go runtime already does this check for you. no need to add extra code.
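To illustrate the point, the runtime's built-in bounds check panics on any out-of-range index, negative values included, without an explicit guard (a toy example, not the actual registry code):

```go
package main

import "fmt"

// increment has no explicit bounds guard: the Go runtime panics on any
// out-of-range index, negative values included.
func increment(counters []int64, metric int) {
	counters[metric]++
}

func main() {
	counters := make([]int64, 3)
	increment(counters, 1)
	fmt.Println(counters) // prints [0 1 0]

	defer func() {
		if r := recover(); r != nil {
			fmt.Println("recovered:", r)
		}
	}()
	increment(counters, 5) // runtime does the check: index out of range
}
```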
Force-pushed 74bc839 to d7df1f2
abarganier
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/clog.go line 269 at r8 (raw file):
yes I understand the cause effect relationship.
Ack, sorry if that came across as patronizing at all!
I thought we got rid of that?
I think we did (assuming you're referring to #109901), but that doesn't mean some dev in the future can't reintroduce a stray logging call made somehow via init(). This nil check was meant to be defensive against such a case. I don't have a stack trace; I just added this check based on what we experienced in #109900.
I can see why panicking if logging.metrics is nil might be better. It would help protect against folks introducing more logging calls into init().
Would you prefer an explicit panic for such a thing? e.g. `if logging.metrics == nil { panic() }`? Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
the go runtime already does this check for you. no need to add extra code.
Oops - removed the `< 0` check.
knz
left a comment
Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/clog.go line 269 at r8 (raw file):
Ack, sorry if that came across as patronizing at all!
it did not! I just wanted to clarify where I was coming from.
but that doesn't mean some dev in the future can't reintroduce a stray logging call called somehow via
init().
This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.
Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?
Since the go runtime already does the panic call for you, adding another one looks like more work.
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, abarganier (Alex Barganier) wrote…
Oops - removed the
< 0check.
ditto `metric >= len(l.mu.counters)` - go will panic for you if the value doesn't work.
We recently introduced metrics into the logging package, but unfortunately we did not prefix them properly. All metrics in the logging package should share the same `log.*` prefix, to clearly group them together. Luckily, only 1 log metric exists so far. This patch updates the metric name to have the `log.*` prefix.

Release note (ops change): This patch renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`. The addition of the `log.` prefix was to better group together logging-related metrics. The behavior and purpose of the metric remains unchanged.
The Measurement metadata for this metric was incorrect. This patch fixes it to better represent what's being measured.

Release note: none
The `log.fluent.sink.conn.errors` metric's metadata was missing the MetricType. This patch adds it.

Release note (ops change): This patch sets the Metric Type on the metric `log.fluent.sink.conn.errors`. Previously, the Metric Type was incorrectly left unset. Note that this is simply an update to the metric's metadata. The behavior and purpose of the metric remains unchanged.
Previously, a LogMetrics implementation was not provided to the logging package in tests. This could lead to tests that exercise code paths involving LogMetrics to experience problems like nil pointer errors. This patch assigns a dummy test implementation in the testing log scope setup, to avoid this case.

Release note: none
Previously, the MetricsStruct used by the logmetrics package was protected by the same mutex that protects the map of metric name to counter. However, the MetricsStruct is never written to after initialization. It's only read within `NewRegistry()` to dump the underlying counters into a new registry for in-process tenants. Since concurrent writes are not possible with this MetricsStruct (it's only read from), protection by this mutex is unnecessary.

In fact, the unnecessary mutex protection can cause a deadlock if a metric is incremented in the hot path for logging (e.g. something like once per log message as it passes through `outputLogEntry`). For example:

1. NewRegistry is called.
2. NewRegistry acquires and holds the mutex.
3. NewRegistry initializes a new registry, which eventually [makes a logging call](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/registry.go#L87).
4. The logging call makes its way through the logging code and attempts to increment a logmetrics counter.
5. `IncrementCounter` is called.
6. `IncrementCounter` attempts to acquire the mutex.
7. The mutex is already being held via step 2.
8. Deadlock!

By removing the unnecessary protection of the mutex for the MetricsStruct, we eliminate this possibility.

Release note: none
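The re-entrant acquisition described in those steps can be sketched as a toy model (hypothetical names; `TryLock` is used in place of `Lock` so the program reports the would-be deadlock instead of hanging):

```go
package main

import (
	"fmt"
	"sync"
)

// registry models the logmetrics registry with a single mutex.
type registry struct {
	mu sync.Mutex
}

// IncrementCounter models steps 5-7: it needs mu. TryLock lets this
// sketch detect the re-entrant acquisition rather than block forever.
func (r *registry) IncrementCounter() bool {
	if !r.mu.TryLock() {
		return false // mu already held: with Lock() this is a deadlock
	}
	defer r.mu.Unlock()
	return true
}

// NewRegistry models steps 1-4: it holds mu while work done under the
// lock indirectly triggers a logging call that increments a counter.
func (r *registry) NewRegistry() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.IncrementCounter() {
		fmt.Println("would deadlock: sync.Mutex is not re-entrant")
	}
}

func main() {
	r := &registry{}
	r.NewRegistry() // prints "would deadlock: sync.Mutex is not re-entrant"
}
```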
Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. If a writer attempts to append a log message to the buffer that would exceed this `max-buffer-size`, then the buffered log sink logic drops older messages to make room for the new. Previously, these dropped messages were not tracked in any way. A TODO was left to add a metric tracking them. This patch introduces a metric to do so: `log.buffered.messages.dropped`. It's shared across all buffered log sinks and counts the number of messages dropped from the buffer.

Release note (ops change): This patch introduces a new metric, `log.buffered.messages.dropped`. Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new. `log.buffered.messages.dropped` counts the number of messages dropped from the buffer. Note that the count is shared across all buffered logging sinks.
Force-pushed d7df1f2 to ac60323
abarganier
left a comment
Thanks @knz - ready for another look 👍
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/clog.go line 269 at r8 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Ack, sorry if that came across as patronizing at all!

it did not! I just wanted to clarify where I was coming from.

but that doesn't mean some dev in the future can't reintroduce a stray logging call made somehow via init().

This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.

Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?

Since the go runtime already does the panic call for you, adding another one looks like more work.
Sounds good! Removed the conditional here.
pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
ditto `metric >= len(l.mu.counters)` - go will panic for you if the value doesn't work.
Ack - removed the conditional altogether
knz
left a comment
Nice
Reviewed 1 of 9 files at r20, 2 of 5 files at r25, 2 of 4 files at r26, 5 of 5 files at r27, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Force-pushed ac60323 to 94e211d
As discussed on Slack w/ @knz - a lot of tests that don't use the test server don't import the package that injects the log metrics registry. Reintroduced the nil check as a result.
This patch introduces the `log.messages.count` metric. The metric counts the number of messages logged, recording at the point of `outputLogEntry`, which all logging calls (e.g. `Info`, `Error`, etc.) commonly pass through. This metric will be helpful to better understand log volume and rates. Note that this does not capture the fanout of a single log message to multiple logging sinks.

Release note (ops change): This patch introduces the metric `log.messages.count`. This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.
This is a small optimization made to the logmetrics package. The log package previously provided a metric name string when incrementing a metric, which would prompt the logmetrics package to perform a map lookup. By using enum values instead, we can do direct index lookups. These log metrics are in the critical logging path, so these types of optimizations are worthwhile, especially when the effort is low (like here).

Release note: none
Force-pushed 94e211d to bdbb04d
TFTR! bors r=knz
Build succeeded.
See individual commit messages to review.
This PR introduces two new metrics to the logging package:

- `log.messages.count`
- `log.buffered.messages.dropped`

`log.messages.count`: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

`log.buffered.messages.dropped`: Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new.

This PR also renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`, for consistency.

Fixes: #72453
Release note (ops change): This patch sets the Metric Type on the metric `log.fluent.sink.conn.errors`. Previously, the Metric Type was incorrectly left unset. Note that this is simply an update to the metric's metadata. The behavior and purpose of the metric remains unchanged.
Release note (ops change): This patch introduces the metric `log.messages.count`. This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.
Release note (ops change): This patch introduces a new metric, `log.buffered.messages.dropped`. Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full, and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops older messages in the buffer in order to make room for the new. `log.buffered.messages.dropped` counts the number of messages dropped from the buffer. Note that the count is shared across all buffered logging sinks.