obsservice: support embedding in CRDB nodes#95703
obsservice: support embedding in CRDB nodes#95703craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
knz
left a comment
There was a problem hiding this comment.
Regarding the 1st commit: you've decided to create 1 different loopback listener per tenant. Is that what you wanted? Why not mirror what is done for the TCP service and use just 1 listener, with tenant routing instead?
(I'll review the other commits in later review rounds)
Reviewed 11 of 11 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)
-- commits line 18 at r1:
💖 💖 💖
pkg/server/server.go line 1904 at r1 (raw file):
// internal loopback interface. func (s *Server) AcceptInternalClients(ctx context.Context) error { return s.stopper.RunAsyncTaskEx(ctx,
I recommend use netutil.MakeTCPServer and ServeWith here like in startServeSQL().
You might want to rename this to MakeStreamServer (since it's not good for just TCP and works for any stream socket) but the logic therein is appropriate as-is for the loopback listener too.
pkg/server/tenant.go line 763 at r1 (raw file):
} return s.stopper.RunAsyncTaskEx(ctx,
ditto conn manager.
pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):
host all all all cert-password # built-in CockroachDB default local all all password # built-in CockroachDB default loopback all all all trust # built-in CockroachDB default
nit: align these columns.
knz
left a comment
There was a problem hiding this comment.
2nd commit LGTM
3rd commit - no architectural concerns, only a couple of minor points.
Reviewed 1 of 11 files at r1, 2 of 2 files at r2, 23 of 23 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)
-- commits line 48 at r3:
This is incorrect. The epic is CRDB-16791.
You mean this commit fixes (jira) issue CRDB-23641. You can write "Fixes: CRDB-23641" and "Epic: CRDB-16791".
pkg/cli/flags.go line 545 at r3 (raw file):
for _, cmd := range telemetryEnabledCmds { f := cmd.Flags() cliflagcfg.StringFlag(f, &serverCfg.ObsServiceAddr, cliflags.ObsServiceAddr)
maybe MarkHidden here until we know where we're going.
pkg/obs/event_exporter.go line 173 at r3 (raw file):
// Note that Dial is non-blocking. conn, err := grpc.Dial(targetAddr, grpc.WithTransportCredentials(insecure.NewCredentials()))
Please add a TODO here and refer to a (new) issue to implement proper configuration of credentials.
pkg/obs/event_exporter.go line 228 at r3 (raw file):
})) ctx, sp := s.tr.StartSpanCtx(ctx, "obsservice flusher", tracing.WithSterile()) go func() {
Why no stopper.RunAsyncTask here?
knz
left a comment
There was a problem hiding this comment.
last commit: I'd like to revisit the configuration ergonomics, see my comments below.
And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?
Reviewed 9 of 9 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)
-- commits line 54 at r4:
Let's make it a separate boolean flag. You can add a check that the two flags are mutually exclusive for now. This will ensure we're not precluded from having both an embedded service and an exporter running side-by-side in the future.
pkg/base/constants.go line 58 at r4 (raw file):
// ObsServiceEmbedFlagValue is the special value of the --obsservice-addr flag // configuring the CRDB node to run the Obs Service internally. ObsServiceEmbedFlagValue = "embed"
See above.
pkg/cli/cliflags/flags.go line 1709 at r4 (raw file):
Description: `Address of an OpenTelemetry OTLP sink such as the Observability Service or the OpenTelemetry Collector. If set, telemetry events are exported to this address. The special value "embed" causes
See above.
pkg/obs/event_exporter.go line 130 at r4 (raw file):
// missing port with the default. func ValidateOTLPTargetAddr(targetAddr string) (string, error) { otlpHost, otlpPort, err := net.SplitHostPort(targetAddr)
Here and in previous commits: use addrutil.SplitHostPort. The go version is broken for ipv6.
pkg/server/initial_sql.go line 40 at r4 (raw file):
ctx context.Context, startSingleNode bool, adminUser, adminPassword string, ) error { if s.cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {
See comment at top.
pkg/server/server.go line 874 at r4 (raw file):
var eventsExporter obs.EventsExporterInterface if cfg.ObsServiceAddr != "" { if cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {
ditto
pkg/server/server.go line 1831 at r4 (raw file):
BinaryVersion: build.BinaryVersion(), }) if s.cfg.ObsServiceAddr != base.ObsServiceEmbedFlagValue {
ditto
pkg/server/tenant.go line 1020 at r4 (raw file):
var eventsExporter obs.EventsExporterInterface if baseCfg.ObsServiceAddr != "" { if baseCfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue {
ditto
304c0f8 to
3104a00
Compare
andreimatei
left a comment
There was a problem hiding this comment.
Let's please continue on these PRs.
And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?
Seems complicated to deal with creating a tenant, particularly in a cluster that otherwise doesn't have tenants. What would be the benefit?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)
Previously, knz (Raphael 'kena' Poss) wrote…
This is incorrect. The epic is CRDB-16791.
You mean this commit fixes (jira) issue CRDB-23641. You can write "Fixes: CRDB-23641" and "Epic: CRDB-16791".
done in #95632
Previously, knz (Raphael 'kena' Poss) wrote…
Let's make it a separate boolean flag. You can add a check that the two flags are mutually exclusive for now. This will ensure we're not precluded from having both an embedded service and an exporter running side-by-side in the future.
I'd rather keep a single flag. If we'll want to support both an embedded one and an external one (or, at that point, any number of targets), we can make the flag accept a list of addresses.
No?
pkg/cli/flags.go line 545 at r3 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
maybe
MarkHiddenhere until we know where we're going.
done in #95632
pkg/obs/event_exporter.go line 173 at r3 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Please add a TODO here and refer to a (new) issue to implement proper configuration of credentials.
done in #95632
pkg/obs/event_exporter.go line 228 at r3 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Why no stopper.RunAsyncTask here?
because I want this to run as long as possible (or at least as long as can be done easily), to flush shutdown events too. That's why I hook it up as a closer, above.
pkg/obs/event_exporter.go line 130 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Here and in previous commits: use
addrutil.SplitHostPort. The go version is broken for ipv6.
Done.
pkg/server/initial_sql.go line 40 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
See comment at top.
if there's a suggestion which would cause 5 other things to necessarily change, I don't think you need to comment on all of them :P
pkg/server/server.go line 1904 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
I recommend use
netutil.MakeTCPServerandServeWithhere like instartServeSQL().You might want to rename this to
MakeStreamServer(since it's not good for just TCP and works for any stream socket) but the logic therein is appropriate as-is for the loopback listener too.
I had looked at those, but they have extraneous code that I don't want. MakeTCPServer integrates with the stopper and does something on quiescence which I don't want (*). ServeWith deals with some temporary errors from Accept that I don't understand (that code was introduced at the same time with cmux so maybe it has something to do with that?), but doesn't apply here and I'd rather not have it.
So I don't see what those provide that I want.
(*) In fact I'm not all that sure that what it does is kosher - it closes the active connections when the stopper quiesces, but that close races with the goroutine that deals with that connection. I don't see what makes racing closers OK.
pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: align these columns.
let's take this to #95702 - but I've been meaning to ask you if you think that adding this rule here and then also validating that such a rule exists is the right thing to do, as opposed to not surfacing rules that talk about loopback to users, and adding the rule by fiat to whatever config the user provides (or otherwise exempt loopback conns from hba through dedicated code).
knz
left a comment
There was a problem hiding this comment.
Let's please continue on these PRs.
That works 👍
And then I'd like to raise a higher-level point. This code as-is uses an obs database in the system tenant. Is that the best we can do? I'm thinking here this would be a very good use case for using a separate secondary tenant to host just the observability data. WDYT?
Seems complicated to deal with creating a tenant, particularly in a cluster that otherwise doesn't have tenants. What would be the benefit?
The benefit is twofold.
- to not pollute the database namespace.
- the name you've chosen is generic and likely to conflict with other legit db names.
- users will ask questions. With a separate tenant, they would not see it at all unless they learn to care about it.
- using a different tenant will also enable the obs service to evolve its system schema separately from the main customer's application (each tenant runs its version upgrades independently from others)
Reviewed 4 of 23 files at r7, 2 of 9 files at r8, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @dhartunian)
Previously, andreimatei (Andrei Matei) wrote…
I'd rather keep a single flag. If we'll want to support both an embedded one and an external one (or, at that point, any number of targets), we can make the flag accept a list of addresses.
No?
"embed" is a valid address. So would pretty much any string. Please, don't don't confuse types with values in the UX.
pkg/obs/event_exporter.go line 228 at r3 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
because I want this to run as long as possible (or at least as long as can be done easily), to flush shutdown events too. That's why I hook it up as a closer, above.
Discussion moved to the other PR.
pkg/server/initial_sql.go line 40 at r4 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
if there's a suggestion which would cause 5 other things to necessarily change, I don't think you need to comment on all of them :P
I do this as a reviewer so I can quickly seek back to my comments later to see how they have been addressed. Feel free to ignore the duplicates on your side.
pkg/server/server.go line 1904 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I had looked at those, but they have extraneous code that I don't want.
MakeTCPServerintegrates with the stopper and does something on quiescence which I don't want (*).ServeWithdeals with some temporary errors fromAcceptthat I don't understand (that code was introduced at the same time withcmuxso maybe it has something to do with that?), but doesn't apply here and I'd rather not have it.
So I don't see what those provide that I want.(*) In fact I'm not all that sure that what it does is kosher - it closes the active connections when the stopper quiesces, but that close races with the goroutine that deals with that connection. I don't see what makes racing closers OK.
Discussion moved to the other PR.
pkg/sql/pgwire/hba_conf.go line 303 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
let's take this to #95702 - but I've been meaning to ask you if you think that adding this rule here and then also validating that such a rule exists is the right thing to do, as opposed to not surfacing rules that talk about
loopbackto users, and adding the rule by fiat to whatever config the user provides (or otherwise exemptloopbackconns from hba through dedicated code).
I am friendly to the idea. But discussion moved to the other PR.
3104a00 to
81fa335
Compare
andreimatei
left a comment
There was a problem hiding this comment.
to not pollute the database namespace.
users will ask questions. With a separate tenant, they would not see it at all unless they learn to care about it.
I think it's the other way around - a tenant is more invasive and more presumably visible in the cluster than one more database - particularly when going from 1 tenant to 2. As far as questions are concerned, I think a new tenant move would probably generate a lot more.
the name you've chosen is generic and likely to conflict with other legit db names.
Yeah, sorry, I didn't actually mean for obs to stay as the name. Replaced by crdb_observability.
I've also considered making it a schema in the system database, but I was dissuaded from that.
using a different tenant will also enable the obs service to evolve its system schema separately from the main customer's application (each tenant runs its version upgrades independently from others)
When the obs service is embedded in crdb, we want to evolve its schema on crdb upgrades. So the schema will evolve independently of the application, tenant or no tenant. So I don't see this point very clearly; perhaps you can say more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)
Previously, knz (Raphael 'kena' Poss) wrote…
"embed" is a valid address. So would pretty much any string. Please, don't don't confuse types with values in the UX.
If someone really has a machine called embed, they can reference it with the port number, as embed:4318.
I think corner cases like this are worth it in order to keep the number of flags down.
81fa335 to
07471b0
Compare
dhartunian
left a comment
There was a problem hiding this comment.
I don't have a strong feeling about the tenancy question. I do think if the customer can see and manipulate the crdb_observability database, then we should perhaps be resilient against its truncation/dropping.
Reviewed 4 of 12 files at r12, 35 of 35 files at r13.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @knz)
pkg/server/initial_sql.go line 42 at r13 (raw file):
) error { if s.cfg.ObsServiceAddr == base.ObsServiceEmbedFlagValue { if err := s.startEmbeddedObsService(ctx); err != nil {
why is this included in RunInitialSQL and in server.Start? can we just put it in one place?
pkg/server/initial_sql.go line 77 at r13 (raw file):
// it doesn't exist already), starts the internal RPC service for event // ingestion and hooks up the event exporter to talk to the local service. func (s *Server) startEmbeddedObsService(ctx context.Context) error {
regardless of comment above, can we keep obsservice stuff in its own files? I know this has to stay in server package, but I think mixing with internal CRDB stuff like this gets confusing especially since it has lines like migrations.RunDBMigrations which could confuse ppl.
07471b0 to
9491e60
Compare
andreimatei
left a comment
There was a problem hiding this comment.
I do think if the customer can see and manipulate the crdb_observability database, then we should perhaps be resilient against its truncation/dropping.
If someone messes with this database, event exporting might not work any more, but otherwise I think the effects are pretty contained - creating new events won't crash or anyhing.
If the database is dropped, restarting a node will recreate it. Beyond that, we could perhaps provide a command of some sort to recreate it at runtime; I'll ruminate but I think it can come later.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)
pkg/server/initial_sql.go line 42 at r13 (raw file):
Previously, dhartunian (David Hartunian) wrote…
why is this included in
RunInitialSQLand inserver.Start? can we just put it in one place?
yeah...
It's because server.Start only runs for TestServer; the CLI doesn't run it.
I got rid of this Start cause it was confusing, and inlined it in TestServer, and have that call RunInitialSQL.
pkg/server/initial_sql.go line 77 at r13 (raw file):
Previously, dhartunian (David Hartunian) wrote…
regardless of comment above, can we keep obsservice stuff in its own files? I know this has to stay in
serverpackage, but I think mixing with internal CRDB stuff like this gets confusing especially since it has lines likemigrations.RunDBMigrationswhich could confuse ppl.
moved to dedicated file
9491e60 to
a370e02
Compare
dhartunian
left a comment
There was a problem hiding this comment.
If the database is dropped, restarting a node will recreate it. Beyond that, we could perhaps provide a command of some sort to recreate it at runtime; I'll ruminate but I think it can come later.
agreed.
Reviewed 1 of 12 files at r12, 4 of 5 files at r14, 1 of 1 files at r15, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @abarganier, @andreimatei, and @knz)
pkg/server/initial_sql.go line 42 at r13 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
yeah...
It's becauseserver.Startonly runs forTestServer; the CLI doesn't run it.
I got rid of thisStartcause it was confusing, and inlined it inTestServer, and have that callRunInitialSQL.
Nice.
|
Rafa, I'm merging to have a foot in the door, but we can continue our discussions at length in the snow. bors r+ |
|
Merge conflict. |
a370e02 to
ffae982
Compare
This patch adds a special value to the --obsservice-addr CRDB flag: --obsservice-addr=embed means that an the respective node will run the Obs Service RPC service internally and the events exporter will be configured to connect to this internal service. Release note: None Fixes: CRDB-19741 Epic: CRDB-16791
ffae982 to
13acc58
Compare
andreimatei
left a comment
There was a problem hiding this comment.
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aadityasondhi, @abarganier, @dhartunian, and @knz)
|
Build succeeded: |
This patch adds a special value to the --obsservice-addr CRDB flag:
--obsservice-addr=embed means that an the respective node will run the
Obs Service RPC service internally and the events exporter will be
configured to connect to this internal service.
Release note: None
Fixes #88194
Epic: CRDB-16791