Skip to content

obsservice: support embedding in CRDB nodes#95703

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:obsservice.embed
Feb 22, 2023
Merged

obsservice: support embedding in CRDB nodes#95703
craig[bot] merged 1 commit intocockroachdb:masterfrom
andreimatei:obsservice.embed

Conversation

@andreimatei
Copy link
Copy Markdown
Contributor

@andreimatei andreimatei commented Jan 23, 2023

This patch adds a special value to the --obsservice-addr CRDB flag:
--obsservice-addr=embed means that an the respective node will run the
Obs Service RPC service internally and the events exporter will be
configured to connect to this internal service.

Release note: None
Fixes #88194
Epic: CRDB-16791

@andreimatei andreimatei requested review from a team as code owners January 23, 2023 22:59
@andreimatei andreimatei requested review from a team January 23, 2023 22:59
@andreimatei andreimatei requested a review from a team as a code owner January 23, 2023 23:00
@andreimatei andreimatei requested review from herkolategan, rharding6373 and srosenberg and removed request for a team January 23, 2023 23:00
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@andreimatei
Copy link
Copy Markdown
Contributor Author

Everything but the last commit is #95702 and #95632.

There's some things to discuss still, like for example the name of the database created for the embedded obs service (or more generally the location of the needed schema). But I think it's ready for a review.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the 1st commit: you've decided to create 1 different loopback listener per tenant. Is that what you wanted? Why not mirror what is done for the TCP service and use just 1 listener, with tenant routing instead?

(I'll review the other commits in later review rounds)

Reviewed 11 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)


-- commits line 18 at r1:
💖 💖 💖


pkg/server/server.go line 1904 at r1 (raw file):

// internal loopback interface.
func (s *Server) AcceptInternalClients(ctx context.Context) error {
	return s.stopper.RunAsyncTaskEx(ctx,

I recommend use netutil.MakeTCPServer and ServeWith here like in startServeSQL().

You might want to rename this to MakeStreamServer (since it's not good for just TCP and works for any stream socket) but the logic therein is appropriate as-is for the loopback listener too.


pkg/server/tenant.go line 763 at r1 (raw file):

	}

	return s.stopper.RunAsyncTaskEx(ctx,

ditto conn manager.


pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):

host  all all  all cert-password # built-in CockroachDB default
local all all      password      # built-in CockroachDB default
loopback all all all trust       # built-in CockroachDB default

nit: align these columns.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2nd commit LGTM

3rd commit - no architectural concerns, only a couple of minor points.

Reviewed 1 of 11 files at r1, 2 of 2 files at r2, 23 of 23 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)


-- commits line 48 at r3:
This is incorrect. The epic is CRDB-16791.
You mean this commit fixes (jira) issue CRDB-23641. You can write "Fixes: CRDB-23641" and "Epic: CRDB-16791".


pkg/cli/flags.go line 545 at r3 (raw file):

	for _, cmd := range telemetryEnabledCmds {
		f := cmd.Flags()
		cliflagcfg.StringFlag(f, &serverCfg.ObsServiceAddr, cliflags.ObsServiceAddr)

maybe MarkHidden here until we know where we're going.


pkg/obs/event_exporter.go line 173 at r3 (raw file):

	// Note that Dial is non-blocking.
	conn, err := grpc.Dial(targetAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))

Please add a TODO here and refer to a (new) issue to implement proper configuration of credentials.


pkg/obs/event_exporter.go line 228 at r3 (raw file):

	}))
	ctx, sp := s.tr.StartSpanCtx(ctx, "obsservice flusher", tracing.WithSterile())
	go func() {

Why no stopper.RunAsyncTask here?

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last commit: I'd like to revisit the configuration ergonomics, see my comments below.

And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?

Reviewed 9 of 9 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)


-- commits line 54 at r4:
Let's make it a separate boolean flag. You can add a check that the two flags are mutually exclusive for now. This will ensure we're not precluded from having both an embedded service and an exporter running side-by-side in the future.


pkg/base/constants.go line 58 at r4 (raw file):

	// ObsServiceEmbedFlagValue is the special value of the --obsservice-addr flag
	// configuring the CRDB node to run the Obs Service internally.
	ObsServiceEmbedFlagValue = "embed"

See above.


pkg/cli/cliflags/flags.go line 1709 at r4 (raw file):

		Description: `Address of an OpenTelemetry OTLP sink such as the
Observability Service or the OpenTelemetry Collector. If set, telemetry
events are exported to this address. The special value "embed" causes

See above.


pkg/obs/event_exporter.go line 130 at r4 (raw file):

// missing port with the default.
func ValidateOTLPTargetAddr(targetAddr string) (string, error) {
	otlpHost, otlpPort, err := net.SplitHostPort(targetAddr)

Here and in previous commits: use addrutil.SplitHostPort. The go version is broken for ipv6.


pkg/server/initial_sql.go line 40 at r4 (raw file):

	ctx context.Context, startSingleNode bool, adminUser, adminPassword string,
) error {
	if s.cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {

See comment at top.


pkg/server/server.go line 874 at r4 (raw file):

	var eventsExporter obs.EventsExporterInterface
	if cfg.ObsServiceAddr != "" {
		if cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {

ditto


pkg/server/server.go line 1831 at r4 (raw file):

		BinaryVersion: build.BinaryVersion(),
	})
	if s.cfg.ObsServiceAddr != base.ObsServiceEmbedFlagValue {

ditto


pkg/server/tenant.go line 1020 at r4 (raw file):

	var eventsExporter obs.EventsExporterInterface
	if baseCfg.ObsServiceAddr != "" {
		if baseCfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {

ditto

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything but the last commit is #95702 and #95632.

Let's please continue on these PRs.

And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?

Seems complicated to deal with creating a tenant, particularly in a cluster that otherwise doesn't have tenants. What would be the benefit?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)


-- commits line 48 at r3:

Previously, knz (Raphael 'kena' Poss) wrote…

This is incorrect. The epic is CRDB-16791.
You mean this commit fixes (jira) issue CRDB-23641. You can write "Fixes: CRDB-23641" and "Epic: CRDB-16791".

done in #95632


-- commits line 54 at r4:

Previously, knz (Raphael 'kena' Poss) wrote…

Let's make it a separate boolean flag. You can add a check that the two flags are mutually exclusive for now. This will ensure we're not precluded from having both an embedded service and an exporter running side-by-side in the future.

I'd rather keep a single flag. If we'll want to support both an embedded one and an external one (or, at that point, any number of targets), we can make the flag accept a list of addresses.
No?


pkg/cli/flags.go line 545 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

maybe MarkHidden here until we know where we're going.

done in #95632


pkg/obs/event_exporter.go line 173 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Please add a TODO here and refer to a (new) issue to implement proper configuration of credentials.

done in #95632


pkg/obs/event_exporter.go line 228 at r3 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Why no stopper.RunAsyncTask here?

because I want this to run as long as possible (or at least as long as can be done easily), to flush shutdown events too. That's why I hook it up as a closer, above.


pkg/obs/event_exporter.go line 130 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Here and in previous commits: use addrutil.SplitHostPort. The go version is broken for ipv6.

Done.


pkg/server/initial_sql.go line 40 at r4 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

See comment at top.

if there's a suggestion which would cause 5 other things to necessarily change, I don't think you need to comment on all of them :P


pkg/server/server.go line 1904 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

I recommend use netutil.MakeTCPServer and ServeWith here like in startServeSQL().

You might want to rename this to MakeStreamServer (since it's not good for just TCP and works for any stream socket) but the logic therein is appropriate as-is for the loopback listener too.

I had looked at those, but they have extraneous code that I don't want. MakeTCPServer integrates with the stopper and does something on quiescence which I don't want (*). ServeWith deals with some temporary errors from Accept that I don't understand (that code was introduced at the same time with cmux so maybe it has something to do with that?), but doesn't apply here and I'd rather not have it.
So I don't see what those provide that I want.

(*) In fact I'm not all that sure that what it does is kosher - it closes the active connections when the stopper quiesces, but that close races with the goroutine that deals with that connection. I don't see what makes racing closers OK.


pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: align these columns.

let's take this to #95702 - but I've been meaning to ask you if you think that adding this rule here and then also validating that such a rule exists is the right thing to do, as opposed to not surfacing rules that talk about loopback to users, and adding the rule by fiat to whatever config the user provides (or otherwise exempt loopback conns from hba through dedicated code).

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please continue on these PRs.

That works 👍

And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?

Seems complicated to deal with creating a tenant, particularly in a cluster that otherwise doesn't have tenants. What would be the benefit?

The benefit is twofold.

  • to not pollute the database namespace.
    1. the name you've chosen is generic and likely to conflict with other legit db names.
    2. users will ask questions. With a separate tenant, they would not see it at all unless they learn to care about it.
  • using a different tenant will also enable the obs service to evolve its system schema separately from the main customer's application (each tenant runs its version upgrades independently from others)

Reviewed 4 of 23 files at r7, 2 of 9 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)


-- commits line 54 at r4:

Previously, andreimatei (Andrei Matei) wrote…

I'd rather keep a single flag. If we'll want to support both an embedded one and an external one (or, at that point, any number of targets), we can make the flag accept a list of addresses.
No?

"embed" is a valid address. So would pretty much any string. Please, don't don't confuse types with values in the UX.


pkg/obs/event_exporter.go line 228 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

because I want this to run as long as possible (or at least as long as can be done easily), to flush shutdown events too. That's why I hook it up as a closer, above.

Discussion moved to the other PR.


pkg/server/initial_sql.go line 40 at r4 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

if there's a suggestion which would cause 5 other things to necessarily change, I don't think you need to comment on all of them :P

I do this as a reviewer so I can quickly seek back to my comments later to see how they have been addressed. Feel free to ignore the duplicates on your side.


pkg/server/server.go line 1904 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I had looked at those, but they have extraneous code that I don't want. MakeTCPServer integrates with the stopper and does something on quiescence which I don't want (*). ServeWith deals with some temporary errors from Accept that I don't understand (that code was introduced at the same time with cmux so maybe it has something to do with that?), but doesn't apply here and I'd rather not have it.
So I don't see what those provide that I want.

(*) In fact I'm not all that sure that what it does is kosher - it closes the active connections when the stopper quiesces, but that close races with the goroutine that deals with that connection. I don't see what makes racing closers OK.

Discussion moved to the other PR.


pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

let's take this to #95702 - but I've been meaning to ask you if you think that adding this rule here and then also validating that such a rule exists is the right thing to do, as opposed to not surfacing rules that talk about loopback to users, and adding the rule by fiat to whatever config the user provides (or otherwise exempt loopback conns from hba through dedicated code).

I am friendly to the idea. But discussion moved to the other PR.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to not pollute the database namespace.
users will ask questions. With a separate tenant, they would not see it at all unless they learn to care about it.

I think it's the other way around - a tenant is more invasive and more presumably visible in the cluster than one more database - particularly when going from 1 tenant to 2. As far as questions are concerned, I think a new tenant move would probably generate a lot more.

the name you've chosen is generic and likely to conflict with other legit db names.

Yeah, sorry, I didn't actually mean for obs to stay as the name. Replaced by crdb_observability.
I've also considered making it a schema in the system database, but I was dissuaded from that.

using a different tenant will also enable the obs service to evolve its system schema separately from the main customer's application (each tenant runs its version upgrades independently from others)

When the obs service is embedded in crdb, we want to evolve its schema on crdb upgrades. So the schema will evolve independently of the application, tenant or no tenant. So I don't see this point very clearly; perhaps you can say more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)


-- commits line 54 at r4:

Previously, knz (Raphael 'kena' Poss) wrote…

"embed" is a valid address. So would pretty much any string. Please, don't don't confuse types with values in the UX.

If someone really has a machine called embed, they can reference it with the port number, as embed:4318.
I think corner cases like this are worth it in order to keep the number of flags down.

Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong feeling about the tenancy question. I do think if the customer can see and manipulate the crdb_observability database, then we should perhaps be resilient against its truncation/dropping.

Reviewed 4 of 12 files at r12, 35 of 35 files at r13.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @knz)


pkg/server/initial_sql.go line 42 at r13 (raw file):

) error {
	if s.cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {
		if err := s.startEmbeddedObsService(ctx); err != nil {

why is this included in RunInitialSQL and in server.Start? can we just put it in one place?


pkg/server/initial_sql.go line 77 at r13 (raw file):

// it doesn't exist already), starts the internal RPC service for event
// ingestion and hooks up the event exporter to talk to the local service.
func (s *Server) startEmbeddedObsService(ctx context.Context) error {

regardless of comment above, can we keep obsservice stuff in its own files? I know this has to stay in server package, but I think mixing with internal CRDB stuff like this gets confusing especially since it has lines like migrations.RunDBMigrations which could confuse ppl.

Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think if the customer can see and manipulate the crdb_observability database, then we should perhaps be resilient against its truncation/dropping.

If someone messes with this database, event exporting might not work any more, but otherwise I think the effects are pretty contained - creating new events won't crash or anyhing.
If the database is dropped, restarting a node will recreate it. Beyond that, we could perhaps provide a command of some sort to recreate it at runtime; I'll ruminate but I think it can come later.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)


pkg/server/initial_sql.go line 42 at r13 (raw file):

Previously, dhartunian (David Hartunian) wrote…

why is this included in RunInitialSQL and in server.Start? can we just put it in one place?

yeah...
It's because server.Start only runs for TestServer; the CLI doesn't run it.
I got rid of this Start cause it was confusing, and inlined it in TestServer, and have that call RunInitialSQL.


pkg/server/initial_sql.go line 77 at r13 (raw file):

Previously, dhartunian (David Hartunian) wrote…

regardless of comment above, can we keep obsservice stuff in its own files? I know this has to stay in server package, but I think mixing with internal CRDB stuff like this gets confusing especially since it has lines like migrations.RunDBMigrations which could confuse ppl.

moved to dedicated file

Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the database is dropped, restarting a node will recreate it. Beyond that, we could perhaps provide a command of some sort to recreate it at runtime; I'll ruminate but I think it can come later.

agreed.

:lgtm:

Reviewed 1 of 12 files at r12, 4 of 5 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @knz)


pkg/server/initial_sql.go line 42 at r13 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

yeah...
It's because server.Start only runs for TestServer; the CLI doesn't run it.
I got rid of this Start cause it was confusing, and inlined it in TestServer, and have that call RunInitialSQL.

Nice.

@andreimatei
Copy link
Copy Markdown
Contributor Author

Rafa, I'm merging to have a foot in the door, but we can continue our discussions at length in the snow.
TFTR!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 9, 2023

Merge conflict.

This patch adds a special value to the --obsservice-addr CRDB flag:
--obsservice-addr=embed means that an the respective node will run the
Obs Service RPC service internally and the events exporter will be
configured to connect to this internal service.

Release note: None
Fixes: CRDB-19741
Epic: CRDB-16791
Copy link
Copy Markdown
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 22, 2023

Build succeeded:

@craig craig bot merged commit 2431b4b into cockroachdb:master Feb 22, 2023
@andreimatei andreimatei deleted the obsservice.embed branch February 22, 2023 20:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

obsservice: embed in CRDB

4 participants