
pkg/util/log: introduce new metrics to the logging package #111126

Merged
craig[bot] merged 8 commits into cockroachdb:master from abarganier:log-metrics on Oct 2, 2023
Conversation

@abarganier (Contributor) commented Sep 22, 2023

See individual commit messages to review.

This PR introduces two new metrics to the logging package:

  • log.messages.count
  • log.buffered.messages.dropped

log.messages.count: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

log.buffered.messages.dropped: Buffered network logging sinks have a max-buffer-size attribute, which limits, in bytes, how much log data can be buffered. Any fluent-server or http-server log sink that uses the buffering attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer is full and an additional log message is sent to the buffered log sink, appending it would exceed the max-buffer-size. Therefore, the buffered log sink drops older messages from the buffer to make room for the new one.
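The drop-oldest behavior described above can be sketched as follows. This is a minimal illustration; the `buffer` type and its fields are stand-ins, not the actual pkg/util/log implementation:

```go
package main

import "fmt"

// buffer is an illustrative stand-in for a buffered log sink's queue.
type buffer struct {
	messages     []string
	sizeBytes    uint64
	maxSizeBytes uint64
	dropped      int64 // what log.buffered.messages.dropped would count
}

func (b *buffer) append(msg string) {
	// Drop the oldest messages until the new one fits under maxSizeBytes.
	for len(b.messages) > 0 && b.sizeBytes+uint64(len(msg)) > b.maxSizeBytes {
		b.sizeBytes -= uint64(len(b.messages[0]))
		b.messages = b.messages[1:]
		b.dropped++
	}
	b.messages = append(b.messages, msg)
	b.sizeBytes += uint64(len(msg))
}

func main() {
	b := &buffer{maxSizeBytes: 10}
	b.append("aaaa")
	b.append("bbbb")
	b.append("cccc") // would exceed 10 bytes: oldest ("aaaa") is dropped
	fmt.Println(b.dropped, len(b.messages))
}
```

Note the dropped counter increments once per evicted message, matching a metric that is shared across all buffered sinks.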

This PR also renames the metric fluent.sink.conn.errors to log.fluent.sink.conn.errors, for consistency.

Fixes: #72453


Release note (ops change): This patch sets the Metric Type
on the metric log.fluent.sink.conn.errors. Previously, the
Metric Type was incorrectly left unset.

Note that this is simply an update to the metric's metadata.
The behavior and purpose of the metric remains unchanged.


Release note (ops change): This patch introduces the metric,
log.messages.count.

This metric measures the count of messages logged on the
node since startup. Note that this does not measure the
fan-out of single log messages to the various configured
logging sinks.

This metric can be helpful in understanding log rates and
volumes.


Release note (ops change): This patch introduces a new metric,
log.buffered.messages.dropped.

Buffered network logging sinks have a max-buffer-size attribute,
which determines, in bytes, how many log messages can be buffered.
Any fluent-server or http-server log sink that makes use of
a buffering attribute in its configuration (enabled by default)
qualifies as a buffered network logging sink.

If this buffer becomes full, and an additional log message is sent
to the buffered log sink, the buffer would exceed this
max-buffer-size. Therefore, the buffered log sink drops older
messages from the buffer to make room for the new one.

log.buffered.messages.dropped counts the number of messages
dropped from the buffer. Note that the count is shared across all
buffered logging sinks.

@abarganier abarganier requested review from a team and dhartunian and removed request for a team September 22, 2023 18:04
@cockroach-teamcity (Member)

This change is Reviewable

@dhartunian (Collaborator) left a comment


Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 5 of 5 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier)


-- commits line 85 at r5:
nit: sink


pkg/util/log/buffered_sink.go line 492 at r5 (raw file):

	b.messages = b.messages[1:]
	b.sizeBytes -= uint64(firstMsg.Len())
	logging.metrics.IncrementCounter(BufferedSinkMessagesDropped, 1)

any chance we can get a unit test for this?


pkg/util/log/clog.go line 271 at r6 (raw file):

	// can be exercised via pkg/util/log's *own* init() function, which is called first.
	// Therefore, we need to protect against this case.
	if logging.metrics != nil {

do we need to worry about this conditional being added to a hot path?


pkg/util/log/clog.go line 272 at r6 (raw file):

	// Therefore, we need to protect against this case.
	if logging.metrics != nil {
		logging.metrics.IncrementCounter(LogMessageCount, 1)

can we test that the counter gets incremented?

@abarganier (Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


-- commits line 85 at r5:

Previously, dhartunian (David Hartunian) wrote…

nit: sink

Done.


pkg/util/log/buffered_sink.go line 492 at r5 (raw file):

Previously, dhartunian (David Hartunian) wrote…

any chance we can get a unit test for this?

Similar to the thread where we increment LogMessageCount, is the goal to test that this is called? Or is it to test the functionality of IncrementCounter?


pkg/util/log/clog.go line 271 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

do we need to worry about this conditional being added to a hot path?

Good question - I just threw together a little toy benchmark:

func doThing(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		if l != nil {
			l.IncrementCounter(log.LogMessageCount, 1)
		}
		work++
	}
}

func doThingWithoutConditional(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		l.IncrementCounter(log.LogMessageCount, 1)
		work++
	}
}

func doThingWithoutConditionalAndCounter() {
	work := 0
	for i := 0; i < 100; i++ {
		work++
	}
}

func BenchmarkCounters(b *testing.B) {
	l := newLogMetricsRegistry()
	b.Run("with conditional and counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThing(l)
			}
		}
	})
	b.Run("without conditional, but with counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditional(l)
			}
		}
	})
	b.Run("without conditional and no counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditionalAndCounter()
			}
		}
	})
}

Here are the results:

BenchmarkCounters
BenchmarkCounters/with_conditional_and_counter_increment
BenchmarkCounters/with_conditional_and_counter_increment-10         	   19623	    159938 ns/op
BenchmarkCounters/without_conditional,_but_with_counter_increment
BenchmarkCounters/without_conditional,_but_with_counter_increment-10         	   10000	    104556 ns/op
BenchmarkCounters/without_conditional_and_no_counter_increment
BenchmarkCounters/without_conditional_and_no_counter_increment-10            	  498702	      2315 ns/op

...That's quite a big performance hit 😬.

This makes me think we should just drop this metric from the PR... I don't think the performance hit would be as drastic in the actual outputLogEntry, since it's doing a lot more than just incrementing a counter, but I don't think the value added here justifies a slower hot path for logging.

What do you think?


pkg/util/log/clog.go line 272 at r6 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can we test that the counter gets incremented?

To clarify, is the test just to confirm that IncrementCounter is called here? Or do you wish to see the functionality within IncrementCounter tested?

If the latter, we have unit tests for this here: https://github.com/cockroachdb/cockroach/blob/master/pkg/util/log/logmetrics/metrics_test.go#L22

@knz (Contributor) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


pkg/util/log/clog.go line 271 at r6 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Good question - I just threw together a little toy benchmark:

func doThing(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		if l != nil {
			l.IncrementCounter(log.LogMessageCount, 1)
		}
		work++
	}
}

func doThingWithoutConditional(l *LogMetricsRegistry) {
	work := 0
	for i := 0; i < 100; i++ {
		l.IncrementCounter(log.LogMessageCount, 1)
		work++
	}
}

func doThingWithoutConditionalAndCounter() {
	work := 0
	for i := 0; i < 100; i++ {
		work++
	}
}

func BenchmarkCounters(b *testing.B) {
	l := newLogMetricsRegistry()
	b.Run("with conditional and counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThing(l)
			}
		}
	})
	b.Run("without conditional, but with counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditional(l)
			}
		}
	})
	b.Run("without conditional and no counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for i := 0; i < 10; i++ {
				go doThingWithoutConditionalAndCounter()
			}
		}
	})
}

Here are the results:

BenchmarkCounters
BenchmarkCounters/with_conditional_and_counter_increment
BenchmarkCounters/with_conditional_and_counter_increment-10         	   19623	    159938 ns/op
BenchmarkCounters/without_conditional,_but_with_counter_increment
BenchmarkCounters/without_conditional,_but_with_counter_increment-10         	   10000	    104556 ns/op
BenchmarkCounters/without_conditional_and_no_counter_increment
BenchmarkCounters/without_conditional_and_no_counter_increment-10            	  498702	      2315 ns/op

...That's quite a big performance hit 😬.

This makes me think we should just drop this metric from the PR... I don't think the performance hit would be as drastic in the actual outputLogEntry, since it's doing a lot more than just incrementing a counter, but I don't think the value added here justifies a slower hot path for logging.

What do you think?

Your benchmark is flawed.

  1. That go statement in there is what is killing your performance.
  2. don't do a for loop in your doXXX functions. Let the benchmark loops do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over 1..b.N. The bench framework will adjust N automatically.

once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.
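A simplified benchmark along the lines knz suggests, letting the framework drive a single b.N loop with no goroutines and no nested loops, might look like this. The `registry` type here is a stand-in for the real LogMetricsRegistry:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"testing"
)

// registry is a stand-in for the real LogMetricsRegistry; the point here
// is the shape of the benchmark, not the registry itself.
type registry struct{ count int64 }

func (r *registry) IncrementCounter(delta int64) {
	atomic.AddInt64(&r.count, delta)
}

// The benchmark framework adjusts b.N automatically; each iteration is
// exactly one instance of the operation under test.
func BenchmarkWithConditional(b *testing.B) {
	r := &registry{}
	for i := 0; i < b.N; i++ {
		if r != nil {
			r.IncrementCounter(1)
		}
	}
}

func BenchmarkWithoutConditional(b *testing.B) {
	r := &registry{}
	for i := 0; i < b.N; i++ {
		r.IncrementCounter(1)
	}
}

func main() {
	// testing.Benchmark lets us run these outside `go test -bench`.
	fmt.Println(testing.Benchmark(BenchmarkWithConditional))
	fmt.Println(testing.Benchmark(BenchmarkWithoutConditional))
}
```

Comparing the two ns/op figures isolates the cost of the nil check itself, rather than the goroutine-scheduling overhead that dominated the original numbers.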

@abarganier (Author) commented
Thanks for the advice @knz! This is my first time using Go benchmarks - it's pretty cool.

Here are results from the revised benchmark:

BenchmarkWithoutConditionalAndCounter
BenchmarkWithoutConditionalAndCounter/without_conditional_and_no_counter_increment
BenchmarkWithoutConditionalAndCounter/without_conditional_and_no_counter_increment-10         	 3339765	       360.6 ns/op

BenchmarkCountersWithoutConditional
BenchmarkCountersWithoutConditional/without_conditional,_but_with_counter_increment
BenchmarkCountersWithoutConditional/without_conditional,_but_with_counter_increment-10         	 3196917	       396.3 ns/op

BenchmarkCounters
BenchmarkCounters/with_conditional_and_counter_increment
BenchmarkCounters/with_conditional_and_counter_increment-10         	 2956760	       401.6 ns/op

This shows an ~11% performance hit. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer.

@dhartunian @knz thoughts on this? These results feel acceptable to me.

Also, for the curious, here's the code for the new benchmark:

func BenchmarkCountersWithoutConditionalAndCounter(b *testing.B) {
	doWork := make(chan struct{})
	done := make(chan struct{})
	wg := &sync.WaitGroup{}

	doThingWithoutConditionalAndCounter := func() {
		for {
			select {
			case <-doWork:
				wg.Add(1)
				i := 0
				i++
				wg.Done()
			case <-done:
				return
			}
		}
	}

	for i := 0; i < 10; i++ {
		go doThingWithoutConditionalAndCounter()
	}
	b.Run("without conditional and no counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			doWork <- struct{}{}
		}
		wg.Wait()
	})

	b.Cleanup(func() {
		close(done)
	})
}

func BenchmarkCountersWithoutConditional(b *testing.B) {
	l := newLogMetricsRegistry()
	doWork := make(chan struct{})
	done := make(chan struct{})
	wg := &sync.WaitGroup{}

	doThingWithoutConditional := func() {
		for {
			select {
			case <-doWork:
				wg.Add(1)
				i := 0
				l.IncrementCounter(log.LogMessageCount, 1)
				i++
				wg.Done()
			case <-done:
				return
			}
		}
	}

	for i := 0; i < 10; i++ {
		go doThingWithoutConditional()
	}

	b.Run("without conditional, but with counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			doWork <- struct{}{}
		}
		wg.Wait()
	})
	b.Cleanup(func() {
		close(done)
	})
}

func BenchmarkCounters(b *testing.B) {
	l := newLogMetricsRegistry()
	doWork := make(chan struct{})
	done := make(chan struct{})
	wg := &sync.WaitGroup{}

	doThing := func() {
		for {
			select {
			case <-doWork:
				wg.Add(1)
				i := 0
				if l != nil {
					l.IncrementCounter(log.LogMessageCount, 1)
				}
				i++
				wg.Done()
			case <-done:
				return
			}
		}
	}

	for i := 0; i < 10; i++ {
		go doThing()
	}

	b.Run("with conditional and counter increment", func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			doWork <- struct{}{}
		}
		wg.Wait()
	})
	b.Cleanup(func() {
		close(done)
	})
}

@abarganier abarganier requested a review from knz September 26, 2023 15:17
@knz (Contributor) commented Sep 26, 2023

Not quite yet -- as written, there's no guarantee that the same number of goroutines will be running by the time wg.Wait runs. That's because wg.Add should not be called from within the goroutine code -- there's no guarantee it runs at all before wg.Wait starts.

Try this:

func BenchmarkCountersWithoutConditionalAndCounter(b *testing.B) {
	doWork := make(chan struct{}, 10 /* at least 1 buffer entry per goroutine */)
	done := make(chan struct{})
	wg := &sync.WaitGroup{}

	doThingWithoutConditionalAndCounter := func() {
		defer wg.Done()
		for {
			select {
			case <-doWork:
				i := 0
				i++
			case <-done:
				return
			}
		}
	}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go doThingWithoutConditionalAndCounter()
	}

	for i := 0; i < b.N; i++ {
		doWork <- struct{}{}
	}
	close(done)
	wg.Wait()
}

(also no need for a sub-test)

@abarganier (Author) commented

Thanks for the tweaks @knz - 3rd time's a charm? 😄

Here are the results given your suggestions:

BenchmarkCounters
BenchmarkCounters-10    	 2972420	       406.7 ns/op
BenchmarkCountersWithoutConditional
BenchmarkCountersWithoutConditional-10    	 3013413	       395.9 ns/op
BenchmarkWithoutConditionalAndCounter
BenchmarkWithoutConditionalAndCounter-10    	 3408507	       350.8 ns/op

This demonstrates a 15% hit to performance. Again, I imagine the actual performance hit will be less significant as the real function's runtime involves a lot more than just incrementing an integer.

So to summarize, if we keep this metric, we get easy observability into log volume (perhaps for the first time ever?). The cost is a bit of a performance hit.

If we're not okay paying this cost, we can reduce the contention around the counters a bit by giving each log channel its own metric. Or, we can ditch the idea altogether.

@knz @dhartunian what do y'all think?

@knz (Contributor) commented Sep 26, 2023

This demonstrates a 15% hit to performance.

More specifically, it corresponds to an extra 0.1µs per log call.

Given the overall density of log calls (fewer than 100 per second on healthy clusters), this means the overhead is really 100 × 0.1µs / 1s = 0.001%.

We can afford that.

@knz (Contributor) left a comment


Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r7, 3 of 4 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


docs/generated/metrics/metrics.html line 1470 at r2 (raw file):

<tr><td>SERVER</td><td>build.timestamp</td><td>Build information</td><td>Build Time</td><td>GAUGE</td><td>TIMESTAMP_SEC</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>go.scheduler_latency</td><td>Go scheduling latency</td><td>Nanoseconds</td><td>HISTOGRAM</td><td>NANOSECONDS</td><td>AVG</td><td>NONE</td></tr>
<tr><td>SERVER</td><td>log.fluent.sink.conn.errors</td><td>Number of connection errors experienced by fluent-server logging sinks</td><td>fluent-server log sink connection errors</td><td>COUNTER</td><td>COUNT</td><td>AVG</td><td>NON_NEGATIVE_DERIVATIVE</td></tr>

You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.

Alternatively, you can also squash them together.


pkg/util/log/clog.go line 269 at r8 (raw file):

	// NB: Generally, we don't need to make this nil check. However, logging.metrics
	// is injected from an external package's init() function, but this code path
	// can be exercised via pkg/util/log's *own* init() function, which is called first.

how so? do you have a stack trace? I'd be more interested in making this non-nil in all cases to reduce complexity.


pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):

	l.mu.counters[log.MetricName(FluentSinkConnErrors.Name)] = l.mu.metricsStruct.FluentSinkConnErrors
	l.mu.counters[log.MetricName(BufferedSinkMessagesDropped.Name)] = l.mu.metricsStruct.BufferedSinkMessagesDropped
	l.mu.counters[log.MetricName(LogMessageCount.Name)] = l.mu.metricsStruct.LogMessageCount

If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:

  1. make this a slice
  2. create an enum that mirrors the counters
  3. use the enum in the increment API
  4. look up the metric by a simple slice index operation
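The slice-plus-enum approach knz describes can be sketched like so. The metric names and the `registry` type are illustrative, not the actual logmetrics code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Metric is an enum mirroring the available counters; its value doubles
// as the slice index, replacing the previous by-string map lookup.
type Metric int

const (
	FluentSinkConnErrors Metric = iota
	BufferedSinkMessagesDropped
	LogMessageCount
	numMetrics // keep last: total number of metrics
)

type registry struct {
	counters [numMetrics]int64
}

// IncrementCounter uses a direct index operation; the Go runtime
// bounds-checks the index, so no explicit range check is needed.
func (r *registry) IncrementCounter(m Metric, delta int64) {
	atomic.AddInt64(&r.counters[m], delta)
}

func main() {
	r := &registry{}
	r.IncrementCounter(LogMessageCount, 1)
	r.IncrementCounter(LogMessageCount, 1)
	fmt.Println(r.counters[LogMessageCount])
}
```

The enum keeps the increment API type-safe while turning each lookup into constant-time array indexing, which matters on the logging hot path.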

@abarganier (Author) commented

Thanks for the feedback. I'll address soon.

Introducing log.messages.count created a deadlock with the LogMetrics mutex, which I spent time tracking down & fixing. With that fixed, I can address the rest of the feedback.

@abarganier (Author) left a comment


Thanks for the feedback - RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


docs/generated/metrics/metrics.html line 1470 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

You might want to double check that you're regenerating this file properly in both r2 and r3. It seems to me that an update here really belongs to the previous commit.

Alternatively, you can also squash them together.

Thanks for catching - I went ahead and regenerated properly for each commit.


pkg/util/log/clog.go line 271 at r6 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Your benchmark is flawed.

  1. That go statement in there is what is killing your performance.
  2. don't do a for loop in your doXXX functions. Let the benchmark loops do it. In fact, avoid the 2 nested loops in the benchmarks and stick to the iteration over 1..b.N. The bench framework will adjust N automatically.

once you have simplified the benchmark, let's see how the results compare. I'm expecting something more reasonable.

Done.


pkg/util/log/clog.go line 269 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

how so? do you have a stack trace? I'd be more interested in makign this non-nil in all cases to reduce complexity.

I'm not aware of how we can possibly do this, given order of initialization. LMK if I'm overlooking something. I'll explain my thinking below.

The LogMetricsRegistry is injected into pkg/util/log via pkg/util/log/logmetrics. pkg/util/log is the one that's imported, so it has to be initialized first. As long as any logging call can be made within one of pkg/util/log's own init() functions before pkg/util/log/logmetrics injects LogMetricsRegistry, this nil possibility exists.

For the other metrics (buffered sink msg dropped, fluent conn errors), a way to exercise those code paths during init() doesn't exist, since these structures are only initialized after the customer's log config file is applied. Of course, if we some day changed our barebones default log config to include one of these, we could end up exercising these code paths within an init() logging call and hit this nil case.

I'm thinking we should define a package-level function within pkg/util/log with the same API that gets used, instead of using logging.metrics directly. We can make the nil check there, and then callers never have to think about it. Thoughts?
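A minimal sketch of the wrapper proposed above, with a hypothetical `IncrementLogCounter` function and registry type (not the actual pkg/util/log API):

```go
package main

import "fmt"

// metricsRegistry is a hypothetical stand-in for the injected registry.
type metricsRegistry struct{ counts map[string]int64 }

func (m *metricsRegistry) IncrementCounter(name string, delta int64) {
	m.counts[name] += delta
}

// metrics is nil until an external package's init() injects it, which is
// exactly the window the nil check guards against.
var metrics *metricsRegistry

// IncrementLogCounter is the package-level wrapper: callers never have to
// think about whether injection has happened yet.
func IncrementLogCounter(name string, delta int64) {
	if metrics != nil {
		metrics.IncrementCounter(name, delta)
	}
}

func main() {
	IncrementLogCounter("log.messages.count", 1) // safe no-op: not yet injected
	metrics = &metricsRegistry{counts: map[string]int64{}}
	IncrementLogCounter("log.messages.count", 1)
	fmt.Println(metrics.counts["log.messages.count"])
}
```

Centralizing the nil check this way trades one extra function call for removing the guard from every call site.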


pkg/util/log/logmetrics/metrics.go line 114 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

If you fully care about performance, an alternative to the overhead of a by-string map lookup is to:

  1. make this a slice
  2. create an enum that mirrors the counters
  3. use the enum in the increment API
  4. look up the metric by a simple slice index operation

Good idea - added a commit to the end that changes LogMetricsRegistry to work this way.

@abarganier abarganier requested a review from knz September 27, 2023 21:23
@knz (Contributor) left a comment


Reviewed 9 of 9 files at r12, 2 of 2 files at r13, 1 of 1 files at r14, 2 of 2 files at r15, 5 of 5 files at r16, 4 of 4 files at r17, 5 of 5 files at r18, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/util/log/clog.go line 269 at r8 (raw file):

As long as any logging call can be made within one of pkg/util/log's own init() functions

yes I understand the cause effect relationship. I am just wondering why are there logging calls made in init code? I thought we got rid of that?
Have you observed this code failing due to the nil ptr? When it failed, what was the stack trace?


pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):

	l.mu.Lock()
	defer l.mu.Unlock()
	if metric < 0 || len(l.mu.counters) <= int(metric) {

the go runtime already does this check for you. no need to add extra code.

@abarganier (Author) left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/util/log/clog.go line 269 at r8 (raw file):

yes I understand the cause effect relationship.

Ack, sorry if that came across as patronizing at all!

I thought we got rid of that?

I think we did (assuming you're referring to #109901), but that doesn't mean some dev in the future can't reintroduce a stray logging call called somehow via init(). This nil check was meant to be defensive against such a case. I don't have a stack trace, I just added this check based on what we experienced in #109900.

I can see why panicking if logging.metrics is nil might be better. It would help protect against folks introducing more logging calls into init().

Would you prefer an explicit panic for such a thing? e.g. if logging.metrics == nil { panic() }? Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?


pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

the go runtime already does this check for you. no need to add extra code.

Oops - removed the < 0 check.

@abarganier abarganier requested a review from knz September 28, 2023 13:37
@knz (Contributor) left a comment


Reviewed 2 of 2 files at r19, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/util/log/clog.go line 269 at r8 (raw file):

Ack, sorry if that came across as patronizing at all!

it did not! I just wanted to clarify where I was coming from.

but that doesn't mean some dev in the future can't reintroduce a stray logging call called somehow via init().

This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.

Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?

Since the go runtime already does the panic call for you, adding another one looks like more work.


pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):

Previously, abarganier (Alex Barganier) wrote…

Oops - removed the < 0 check.

ditto metric >= len(l.mu.counters) - go will panic for you if the value doesn't work.

We recently introduced metrics into the logging package,
but unfortunately we did not prefix them properly.

All metrics in the logging package should share the same
`log.*` prefix, to clearly group them together.

Luckily, only 1 log metric exists so far. This patch
updates the metric name to have the `log.*` prefix.

Release note (ops change): This patch renames the metric
`fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`.

The addition of the `log.` prefix was to better group
together logging-related metrics. The behavior and purpose
of the metric remains unchanged.

The Measurement metadata for this metric was incorrect.

This patch fixes it to better represent what's being
measured.

Release note: none

The log.fluent.sink.conn.errors metric's metadata
was missing the MetricType. This patch adds it.

Release note (ops change): This patch sets the Metric Type
on the metric `log.fluent.sink.conn.errors`. Previously, the
Metric Type was incorrectly left unset.

Note that this is simply an update to the metric's metadata.
The behavior and purpose of the metric remains unchanged.

Previously, a LogMetrics implementation was not provided
to the logging package in tests. This could lead to tests
that exercise code paths involving LogMetrics to experience
problems like nil pointer errors.

This patch assigns a dummy test implementation in the
testing log scope setup, to avoid this case.

Release note: none

Previously, the MetricsStruct used by the logmetrics
package was protected by the same mutex that protects
the map of metric name to counter.

However, the MetricsStruct is never written to after
initialization. It's only read within `NewRegistry()`
to dump the underlying counters into a new registry
for in-process tenants. Since concurrent writes are
not possible with this MetricsStruct (it's only read
from), protection by this mutex is unnecessary.

In fact, the unnecessary mutex protection can cause
a deadlock if a metric is incremented in the hot path
for logging (e.g. something like once per-log message
as it passes through `outputLogEntry`).

For example:
1. NewRegistry is called
2. NewRegistry acquires and holds mutex
3. NewRegistry initializes a new registry, which
eventually [makes a logging call](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/metric/registry.go#L87)
4. Logging call makes its way through the logging
code and attempts to increment a logmetrics counter.
5. `IncrementCounter` is called.
6. `IncrementCounter` attempts to acquire the mutex.
7. The mutex is already being held via step 2.
8. Deadlock!

By removing the unnecessary protection of the mutex
for the MetricsStruct, we eliminate this possibility.

Release note: none
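The pattern behind this deadlock, and the fix of reading the never-written struct without the lock, can be illustrated with a minimal sketch (names are illustrative, not the actual logmetrics code):

```go
package main

import (
	"fmt"
	"sync"
)

// Go's sync.Mutex is not reentrant: a callback invoked while the lock is
// held must not try to acquire the same lock, or it deadlocks.
type registry struct {
	mu       sync.Mutex
	counters map[string]int64
	// metricsStruct is written once at initialization and only read
	// afterwards, so it needs no mutex protection.
	metricsStruct map[string]int64
}

func (r *registry) IncrementCounter(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counters[name]++
}

// dump reads the immutable metricsStruct WITHOUT holding r.mu, so a
// logging call made mid-dump (modelled by logFn) can safely call
// IncrementCounter. Holding r.mu across logFn would reproduce steps 2-8.
func (r *registry) dump(logFn func()) map[string]int64 {
	out := make(map[string]int64, len(r.metricsStruct))
	for k, v := range r.metricsStruct {
		out[k] = v
	}
	logFn() // e.g. a log line emitted while building a new registry
	return out
}

func main() {
	r := &registry{
		counters:      map[string]int64{},
		metricsStruct: map[string]int64{"log.messages.count": 0},
	}
	// Had dump() held r.mu, this callback would deadlock.
	r.dump(func() { r.IncrementCounter("log.messages.count") })
	fmt.Println(r.counters["log.messages.count"])
}
```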
Buffered network logging sinks have a `max-buffer-size` attribute,
which determines, in bytes, how many log messages can be buffered.

If a writer attempts to append a log message to the buffer that
would exceed this `max-buffer-size`, then the buffered log sink
logic drops older messages to make room for the new.

Previously, these dropped messages were not tracked in any way.
A TODO was left to add a metric tracking them.

This patch introduces a metric to do so:
`log.buffered.messages.dropped`

It's shared across all buffered log sinks and counts the number
of messages dropped from the buffer.

Release note (ops change): This patch introduces a new metric,
`log.buffered.messages.dropped`.

Buffered network logging sinks have a `max-buffer-size` attribute,
which determines, in bytes, how many log messages can be buffered.
Any `fluent-server` or `http-server` log sink that makes use of
a `buffering` attribute in its configuration (enabled by default)
qualifies as a buffered network logging sink.

If this buffer becomes full, and an additional log message is sent
to the buffered log sink, the buffer would exceed this
`max-buffer-size`. Therefore, the buffered log sink drops older
messages from the buffer to make room for the new one.

`log.buffered.messages.dropped` counts the number of messages
dropped from the buffer. Note that the count is shared across all
buffered logging sinks.
@abarganier (Author) left a comment


Thanks @knz - ready for another look 👍

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/util/log/clog.go line 269 at r8 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Ack, sorry if that came across as patronizing at all!

it did not! I just wanted to clarify where I was coming from.

but that doesn't mean some dev in the future can't reintroduce a stray logging call called somehow via init().

This would most definitely be a bug/regression. I wish for these regressions to be caught during CI, before they make it to a release.

Or should we just ditch the nil check and let the nil pointer dereference trigger the panic?

Since the go runtime already does the panic call for you, adding another one looks like more work.

Sounds good! Removed the conditional here.


pkg/util/log/logmetrics/metrics.go line 143 at r18 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ditto metric >= len(l.mu.counters) - go will panic for you if the value doesn't work.

Ack - removed the conditional altogether

@knz (Contributor) left a comment


Nice

Reviewed 1 of 9 files at r20, 2 of 5 files at r25, 2 of 4 files at r26, 5 of 5 files at r27, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

@abarganier (Author) commented

As discussed on Slack w/ @knz - a lot of tests that don't use the test server don't import pkg/util/log/logmetrics, so removing the nil check in clog.go caused hundreds of tests to start failing.

Reintroduced the nil check as a result.

This patch introduces the `log.messages.count` metric.

The metric counts the number of messages logged, recording
at the point of `outputLogEntry`, which all logging calls
(e.g. `Info`, `Error`, etc.) commonly pass through.

This metric will be helpful to better understand log
volume and rates.

Note that this does not capture the fanout of a single
log message to multiple logging sinks.

Release note (ops change): This patch introduces the metric,
`log.messages.count`.

This metric measures the count of messages logged on the
node since startup. Note that this does not measure the
fan-out of single log messages to the various configured
logging sinks.

This metric can be helpful in understanding log rates and
volumes.

This is a small optimization made to the logmetrics package.
The log package previously provided a metric name string when
incrementing a metric, which would prompt the logmetrics package
to perform a map lookup. By using enum values instead, we can
do direct index lookups instead.

These log metrics are in the critical logging path, so these
types of optimizations are worthwhile, especially when the
effort is low (like here).

Release note: none
@abarganier (Author) commented

TFTR!

bors r=knz

@craig (bot) commented Oct 2, 2023

Build succeeded:

@craig craig bot merged commit 422b70f into cockroachdb:master Oct 2, 2023

Merging this pull request closes: log: introduce more metrics into log pkg for network logging (#72453)

4 participants