
pgwire: remove readTimeoutConn in favor of a channel #124373

Merged: craig[bot] merged 2 commits into cockroachdb:master from rafiss:remove-readTimeoutConn on May 20, 2024

Conversation

@rafiss (Collaborator) commented May 17, 2024

Rather than using a connection wrapper that polls for the context being done every second, we now spin up an additional goroutine that blocks until the connection context is done or the drain signal is received.

I wrote a simple benchmark of the idle connection loop to generate CPU profiles.

With the old readTimeoutConn:
[CPU profile image]

With the new goroutine approach:
[CPU profile image]

There's definitely less noise and overhead.

fixes #25585
Release note: None

@rafiss rafiss requested a review from yuzefovich May 17, 2024 23:01
@rafiss rafiss requested review from a team as code owners May 17, 2024 23:01
@rafiss rafiss requested a review from a team May 17, 2024 23:01
@cockroach-teamcity (Member)

This change is Reviewable

@rafiss rafiss force-pushed the remove-readTimeoutConn branch from ce2e425 to 17e388a on May 18, 2024 08:10
@yuzefovich (Member) left a comment


Thanks for looking into this! The change overall seems good to me.

I'm not too familiar with this code, so I want to double check my understanding. IIUC, currently for each connection we have two goroutines:

  • the "reader" goroutine that continuously loops in serveImpl and sees what the client is requesting us to do
  • the "processor" goroutine for the connExecutor that executes commands added into stmtBuf by the "reader" goroutine.

This PR introduces a third "watchdog" goroutine that monitors the state for cancellation and draining. In other words, we make a trade-off between 1) periodically forcing net.Conn to wake up in the "reader" due to a read deadline to see whether the "reader" should quit and 2) a separate goroutine that only monitors things with no read deadline set for the "reader". In practice, this "watchdog" goroutine should be idle most of the time, so it should remove some system calls. Does this sound right?

@@ -945,6 +942,35 @@ func (s *Server) serveImpl(
authOpt authOptions,
sessionID clusterunique.ID,
) {
@yuzefovich (Member):

nit: the last part of the comment for serveImpl needs an update.

@rafiss (Collaborator, Author):

done

tcpConn, _ = underConn.(*net.TCPConn)
}
if tcpConn == nil {
_ = c.conn.Close()
@yuzefovich (Member):

We didn't close the connection in both directions before this patch. Based on the comment to serveImpl, it seems that it's the connExecutor's job to do so. Why add this here?

@rafiss (Collaborator, Author):

hm good point. it seems safer to only close the read side.

this matches the previous behavior, where if the context was cancelled, then the readTimeoutConn would start returning errors when Read is called:

typ, n, err := c.readBuf.ReadTypedMsg(&c.rd)

that should cause the main loop of the reader goroutine to exit here:

// If we can't read data because of any one of the following conditions,
// then we should break:
// 1. the connection was closed.
// 2. the context was canceled (e.g. during authentication).
// 3. we reached an arbitrary threshold of repeated errors.
if netutil.IsClosedConnection(err) ||
errors.Is(err, context.Canceled) ||
repeatedErrorCount > maxRepeatedErrorCount {
break
}

@rafiss (Collaborator, Author) commented May 20, 2024:

although, thinking about it more, the risk is that if the type switch no longer works as expected (for example, if there's a new implementation of net.Conn to deal with later), then not having a default case would make the connection hang forever.

maybe it's safer to keep the Close as a fallback? WDYT?

// DrainRequest. This will make the processor quit whenever it finds a
// good time.
_ /* err */ = c.stmtBuf.Push(ctx, sql.DrainRequest{})
<-ctx.Done()
@yuzefovich (Member):

nit: a comment (that eventually the drain request will be processed by the server and all connections will be closed, which is achieved by cancelling the context) would be helpful.

@rafiss (Collaborator, Author):

done

@rafiss (Collaborator, Author) left a comment

Previously, yuzefovich (Yahor Yuzefovich) wrote…

This PR introduces a third "watchdog" goroutine that monitors the state for cancellation and draining. In other words, we make a trade-off between 1) periodically forcing net.Conn to wake up in the "reader" due to a read deadline to see whether the "reader" should quit and 2) a separate goroutine that only monitors things with no read deadline set for the "reader". In practice, this "watchdog" goroutine should be idle most of the time, so it should remove some system calls. Does this sound right?

yes, sounds right. it's worth noting that a goroutine that does nothing apart from waiting on a channel is very efficient in Go; the scheduler never tries to run it until the channel is ready.
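A toy illustration of that point (assumed names, not from the PR): goroutines blocked on a channel receive are parked by the runtime and consume no scheduler time until the channel becomes ready.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// countParked spawns n goroutines that each park immediately on a channel
// receive, then reports how many goroutines exist beyond the baseline.
func countParked(n int) int {
	done := make(chan struct{})
	base := runtime.NumGoroutine()
	for i := 0; i < n; i++ {
		go func() { <-done }() // parks immediately; never scheduled again
	}
	time.Sleep(50 * time.Millisecond) // let them all reach the receive
	parked := runtime.NumGoroutine() - base
	close(done) // wake every goroutine at once
	return parked
}

func main() {
	fmt.Println("parked goroutines:", countParked(100000))
}
```

Even with 100,000 parked goroutines the process stays essentially idle, which is why one extra watchdog goroutine per connection is cheap.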


@yuzefovich (Member) left a comment

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

although, thinking about it more, the risk is that if the type switch no longer works as expected (like for example, there's a new implementation of net.Conn to deal with later), then not having a default case would make the connection hang forever.

maybe it's safer to keep the Close as a fallback? WDYT?

Sounds reasonable to me.

I traced where Close is called currently, and I think it's a few layers up in the defer in the async task in netutil.TCPServer.ServeWith, so the comment (that "serveImpl always closes the network connection before returning") is misleading before your patch but will actually become closer to reality after the patch. It might be worth adjusting the comment.

@rafiss rafiss requested a review from yuzefovich May 20, 2024 18:14
@yuzefovich (Member) left a comment

Looks like the updates haven't been pushed yet?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @rafiss)

@rafiss (Collaborator, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Sounds reasonable to me.

I traced where Close is called currently, and I think it's a few layers up in the defer in in the async task in netutil.TCPServer.ServeWith, so the comment (that "serveImpl always closes the network connection before returning") is misleading before your patch but will actually become closer to reality after the patch. It might be worth adjusting the comment.

I was about to push the change, then realized that another fallback option is to call conn.SetReadDeadline(now()), which might be less disruptive. I'll do that instead.

@rafiss rafiss force-pushed the remove-readTimeoutConn branch from 17e388a to d7150f9 on May 20, 2024 18:29
@rafiss rafiss requested a review from yuzefovich May 20, 2024 18:30
@yuzefovich (Member) left a comment

:lgtm:

Reviewed 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)


pkg/sql/pgwire/server.go line 968 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

I was about to push the change, then realized that another fallback option is to call conn.SetReadDeadline(now()), which might be less disruptive. I'll do that instead.

Nice, I like it.

rafiss added 2 commits May 20, 2024 14:54
Rather than using a connection that polls for the context being done
every second, we now spin up an additional goroutine that blocks until
the connection context is done, or the drain signal was received.

Release note: None
This benchmark was only useful for capturing a CPU profile; it doesn't
measure anything else.

Release note: None
@rafiss rafiss force-pushed the remove-readTimeoutConn branch from d7150f9 to 1351385 on May 20, 2024 18:54
@rafiss (Collaborator, Author) commented May 20, 2024

tftr!

bors r=yuzefovich

@craig (Contributor) — craig bot commented May 20, 2024

Build failed:

@rickystewart (Collaborator):

bors r=yuzefovich

@craig (Contributor) — craig bot commented May 20, 2024



Development

Successfully merging this pull request may close these issues.

pgwire: replace deadline and connection polling with separate goroutine

4 participants