
streamingccl: allow stream ingestion processors to keep running on GenerationEvent#68195

Merged
craig[bot] merged 2 commits into cockroachdb:master from annezhu98:hang-processors-on-losing-client on Aug 10, 2021

Conversation


@annezhu98 annezhu98 commented Jul 28, 2021

Previously, a stream ingestion processor would shut down whenever it lost its connection to the stream client. With generation support, the processor should not immediately move to the draining state; instead, it should remain in StateRunning and poll for the cutover signal sent by the coordinator. Generation support will be implemented in the following PR: #67189

The first commit adds GenerationEvent as a valid event type that can be emitted over a cluster stream.
The second commit implements the mechanism that keeps processors running when the connection to the client is lost.

Add `GenerationEvent` as a possible event type to be emitted over a cluster stream. When a `GenerationEvent` is emitted, we should be able to get its topology as well as the start time of the new generation.

Release note: None
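The commit message above describes a GenerationEvent that exposes the new generation's topology and start time. A minimal sketch of that shape follows; every name here (`generationEvent`, `topology`, `GetTopology`, `GetStartTime`, `makeGenerationEvent`, the `int64` stand-in for an hlc.Timestamp) is invented for illustration and is not the actual streamingccl API:

```go
package main

import "fmt"

// topology is a hypothetical stand-in for the stream topology type.
type topology struct {
	Partitions []string
}

// generationEvent carries the new generation's topology and its start time.
// All names here are illustrative, not the real streamingccl API.
type generationEvent struct {
	topology  topology
	startTime int64 // stand-in for an hlc.Timestamp
}

func (e generationEvent) GetTopology() topology { return e.topology }
func (e generationEvent) GetStartTime() int64   { return e.startTime }

func makeGenerationEvent(t topology, start int64) generationEvent {
	return generationEvent{topology: t, startTime: start}
}

func main() {
	ev := makeGenerationEvent(topology{Partitions: []string{"p1", "p2"}}, 42)
	fmt.Println(len(ev.GetTopology().Partitions), ev.GetStartTime()) // 2 42
}
```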
@annezhu98 annezhu98 requested review from a team, adityamaru and pbardea July 28, 2021 19:07
@cockroach-teamcity
Member

This change is Reviewable

@annezhu98 annezhu98 changed the title from "Hang processors on losing client" to "streamingccl: allow stream ingestion processors to keep running on GenerationEvent" Jul 28, 2021
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch 2 times, most recently from 8950deb to d6adbe7 on July 28, 2021 20:54
case driver.ErrBadConn:
	select {
	case <-eventCh:
		eventCh <- streamingccl.MakeGenerationEvent()
Contributor

I wonder if it makes sense for the ingestion processor to be aware of the concept of generations at all. Would it be simpler if we just swallowed ErrBadConn from the driver in the client, which would hang the processor until its context was cancelled?

Also does errors.Is(err, driver.ErrBadConn) work here? I think that's the more standard way of checking for errors (see https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20190318_error_handling.md for more info).

Author

The processor would go to a different state if we swallowed ErrBadConn on the client side: since there would be no more values to read from eventCh, we'd consequently stop reading from cutoverCh. I'm not sure if there's a better way of differentiating the error returned by the client.

Contributor

@pbardea If I'm understanding correctly, the issue with swallowing the error is that we return from the sinkless client. The return then triggers the eventCh for the partition to be closed, which in turn causes the merged channel in the processor to be closed. Once the merged channel is closed, the proc drains and we're in a bit of a soup.

Thinking ahead to multi-partition streaming, where we probably want to do more than just wait for a cutover (ingest up until some time), a generation event would give us a place to trigger this custom logic? There might be a better way, though.

Contributor

Gotcha, my original thinking was that we wouldn't close the eventCh and the processors would be unaware that the client had disconnected, it would just hang waiting for its context to be canceled. (When the client sees the error it just reads from <-ctx.Done().)

The tl;dr of below is that given that the coordinator doesn't yet have a way to send processors messages, this approach should be okay for now, but I think it's worth keeping an eye on potentially moving this to the coordinator down the line.


A bit more consideration to why it may be good to move to a scheme where the coordinator is the only one aware of "cutovers times" and "generation events":

Such a scheme would allow each ingestion processor to not need to worry about global state like generations and cutover time, but only to ingest data from the stream, and to watch for a "ingest until at least " signal from the coordinator. (We don't have a way for the coordinator to provide a signal like that today).

The underlying concern is that on large clusters, we'll develop these many to 1 relationships that could put unnecessary pressure on some nodes. (We have this today with cutover time polling since all 100 nodes would poll the job record on avg 3 times a second and that would only increase as the polling interval is reduced).

Contributor

100% agree, filed #68475.

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from d6adbe7 to 729a504 on July 30, 2021 00:45
	return sip.flush()
case streamingccl.GenerationEvent:
	log.Info(sip.Ctx, "client disconnected")
	waitingForCutover = true
Contributor

I'd prefer if instead of this bool we just:

<-sip.cutoverCh
sip.internalDrained = true
return nil, nil

from below. This is also most likely going to change in the future when in multi-partition streaming we want to continue reading events up until some time in the future (as indicated by the GenerationEvent)?

Contributor

We'd probably also want to wait on ctx.Done() in addition to cutover channel here.

Author
@annezhu98 annezhu98 Aug 4, 2021

I updated the logic here, hope it makes sense now. When a generation event is received, we wait on sip.cutoverCh and sip.Ctx.Done(), whichever comes first. If we receive a value from sip.cutoverCh, the function returns. If the context gets cancelled before a cutover signal is sent, the function returns as well.


	return sip.flush()
case streamingccl.GenerationEvent:
	log.Info(sip.Ctx, "client disconnected")
Contributor

I like the idea of logging a generation event. This will be more useful when GE actually contains a generation time or something for multi partition. For the time being let's change the verbiage to "GenerationEvent received" since "client disconnected" is specific to the sinkless client.

return sip.flush()
}

// If we lost connection with the client, wait for cutover signal instead of
Contributor

I don't think we need any of this once we address the comment below?

}
}
if err := rows.Err(); err != nil {
if err := rows.Err(); errors.Is(err, driver.ErrBadConn) {
Contributor

mmm this seems a little wrong, it should probably be:

if err := rows.Err(); err != nil {
	// Maybe add a comment about why we are special casing a client disconnect.
	if errors.Is(err, driver.ErrBadConn) {
		select {
		case eventCh <- streamingccl.MakeGenerationEvent():
		case <-ctx.Done():
			errCh <- ctx.Err()
		}
	} else {
		errCh <- err
	}
}

Let's chat offline about how select blocks and channels work, just to make sure we're on the same page 🙂

testutils.IsError(meta.Err, "this client always returns an error")
})

t.Run("stream ingestion processor hangs on losing client connection", func(t *testing.T) {
Contributor
@adityamaru adityamaru Jul 30, 2021

I haven't reviewed the test yet, and I'm going to think about whether there is any other way we can structure this. I don't particularly like sleeps in tests 😋; time-dependent tests often flake and get skipped.

Contributor

Perhaps instead of testing if the client hangs, we could test if the processor returns an error or not when the client disconnects?

@pbardea pbardea self-requested a review August 3, 2021 12:37
@adityamaru adityamaru self-requested a review August 3, 2021 13:59
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch 2 times, most recently from e1455d8 to 2b63a09 on August 3, 2021 20:24
Contributor
@pbardea pbardea left a comment

Generally LGTM -- it would be nice if we could simplify the processor to not use waitingForCutover, like Aditya suggested.

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch 4 times, most recently from 990ec84 to f8b0365 on August 4, 2021 20:20
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch 2 times, most recently from eba6dfc to d645add on August 5, 2021 22:00
// Send a cutover signal to shut down the processor
sip.cutoverCh <- struct{}{}

// The processor should have been moved to draining state with a nil error
Contributor
@adityamaru adityamaru Aug 5, 2021

We need to move the wg.Wait() above this so we are guaranteed that `sip.Run()` has returned.

Can we then use `out` returned by getStreamIngestionProcessor to check:

if !out.ProducerClosed() {
	t.Fatalf("output RowReceiver not closed")
}

for {
	row := out.NextNoMeta(t)
	if row == nil {
		break
	}
	// We don't expect any rows so...
	t.Fatal(...)
}


<-interceptCh

// Send a cutover signal to shut down the processor
Contributor

Update comment to:
// The sip processor has received a gen event and is thus waiting for a cutover signal, so let's send one!

}
}

func markGenerationEventAsReceived(
Contributor

Can we pass a func() as a parameter here instead of a channel? This will give us flexibility in the future to add stuff to that func.

Maybe also rename the function to makeGenerationEventReceived
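The suggestion amounts to injecting a callback rather than a channel. A hypothetical sketch of the shape (the `mockClient` type and the `onGenerationEvent` hook name are invented for illustration; the real mock lives in the test file):

```go
package main

import "fmt"

// mockClient is a stand-in for the test stream client. Instead of exposing a
// channel, it accepts a callback invoked when a GenerationEvent is emitted,
// which leaves room to grow the hook's behavior later without changing its type.
type mockClient struct {
	onGenerationEvent func() error // hypothetical hook name
}

func (m *mockClient) emitGeneration() error {
	if m.onGenerationEvent != nil {
		return m.onGenerationEvent()
	}
	return nil
}

func main() {
	received := 0
	c := &mockClient{onGenerationEvent: func() error {
		received++
		return nil
	}}
	_ = c.emitGeneration()
	fmt.Println(received) // 1
}
```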

@adityamaru
Contributor

@annezhu98 I'm happy with the PR once the above comments are addressed and you've `make stressrace`'d the tests 🙂 Nice work on this!

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from d645add to 271de0a on August 6, 2021 16:33
}

// The processor should have been moved to draining state with a nil error
_, meta := sip.Next()
Contributor

I don't think we need this any longer. out is a row buffer that is filled by sip.Next(). The out.NextNoMeta checks that none of the output was a meta.

@adityamaru adityamaru self-requested a review August 6, 2021 17:13
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from 271de0a to 25cbeb7 on August 6, 2021 18:10
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from 25cbeb7 to f57ba74 on August 6, 2021 20:28
@adityamaru
Contributor

adityamaru commented Aug 9, 2021

@annezhu98 I don't think we should check that the eventCh is closed. Instead can you try applying this patch and seeing if it works. We can change the mock client to not return a closed, fixed-size channel but instead return a channel that needs to be read from for more events to be sent on it. This way, when the interceptor is hit we are guaranteed the sip has read from it:

diff --git a/pkg/ccl/streamingccl/streamclient/cockroach_sinkless_replication_client.go b/pkg/ccl/streamingccl/streamclient/cockroach_sinkless_replication_client.go
index e294c9b383..23bfb5c36c 100644
--- a/pkg/ccl/streamingccl/streamclient/cockroach_sinkless_replication_client.go
+++ b/pkg/ccl/streamingccl/streamclient/cockroach_sinkless_replication_client.go
@@ -134,6 +134,7 @@ func (m *sinklessReplicationClient) ConsumePartition(
 			} else {
 				errCh <- err
 			}
+			return
 		}
 	}()
 
diff --git a/pkg/ccl/streamingccl/streamingest/stream_ingestion_processor_test.go b/pkg/ccl/streamingccl/streamingest/stream_ingestion_processor_test.go
index 757a98def6..1ce4bd067d 100644
--- a/pkg/ccl/streamingccl/streamingest/stream_ingestion_processor_test.go
+++ b/pkg/ccl/streamingccl/streamingest/stream_ingestion_processor_test.go
@@ -44,7 +44,7 @@ import (
 	"github.com/stretchr/testify/require"
 )
 
-// mockStreamClient will return the slice of events associated to the stream
+// mockStreamClient will return a channel of events associated to the stream
 // partition being consumed. Stream partitions are identified by unique
 // partition addresses.
 type mockStreamClient struct {
@@ -73,7 +73,7 @@ func (m *mockStreamClient) GetTopology(
 
 // ConsumePartition implements the Client interface.
 func (m *mockStreamClient) ConsumePartition(
-	_ context.Context, address streamingccl.PartitionAddress, _ hlc.Timestamp,
+	ctx context.Context, address streamingccl.PartitionAddress, _ hlc.Timestamp,
 ) (chan streamingccl.Event, chan error, error) {
 	var events []streamingccl.Event
 	var ok bool
@@ -81,25 +81,32 @@ func (m *mockStreamClient) ConsumePartition(
 		return nil, nil, errors.Newf("no events found for paritition %s", address)
 	}
 
-	eventCh := make(chan streamingccl.Event, len(events))
+	errCh := make(chan error)
+	eventCh := make(chan streamingccl.Event)
 
-	for _, event := range events {
-		eventCh <- event
+	go func() {
+		defer close(eventCh)
+		for _, event := range events {
+			select {
+			case eventCh <- event:
+			case <-ctx.Done():
+				errCh <- ctx.Err()
+			}
 
-		func() {
-			m.mu.Lock()
-			defer m.mu.Unlock()
+			func() {
+				m.mu.Lock()
+				defer m.mu.Unlock()
 
-			if len(m.mu.interceptors) > 0 {
-				for _, interceptor := range m.mu.interceptors {
-					if interceptor != nil {
-						interceptor(event, address)
+				if len(m.mu.interceptors) > 0 {
+					for _, interceptor := range m.mu.interceptors {
+						if interceptor != nil {
+							interceptor(event, address)
+						}
 					}
 				}
-			}
-		}()
-	}
-	close(eventCh)
+			}()
+		}
+	}()
 
 	return eventCh, nil, nil
 }
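The core of the patch above is turning a pre-filled, closed channel into an unbuffered one fed by a goroutine, so that each send is a rendezvous with the consumer and the interceptor only fires after the processor has actually read the event. A stripped-down, self-contained sketch of that idea (`serveEvents` and the `string` event type are illustrative simplifications of the mock client):

```go
package main

import "fmt"

// serveEvents sends each event on an unbuffered channel and invokes the
// interceptor only after the consumer has received it, which is the
// guarantee the patch above is after.
func serveEvents(events []string, interceptor func(string)) <-chan string {
	eventCh := make(chan string) // unbuffered: each send blocks until received
	go func() {
		defer close(eventCh)
		for _, ev := range events {
			eventCh <- ev
			if interceptor != nil {
				interceptor(ev)
			}
		}
	}()
	return eventCh
}

func main() {
	var seen []string
	ch := serveEvents([]string{"a", "b"}, func(ev string) { seen = append(seen, ev) })
	for ev := range ch {
		fmt.Println(ev)
	}
	// The channel close happens-after the last interceptor call, so this
	// read of seen is safe.
	fmt.Println(seen)
}
```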

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch 2 times, most recently from 6805b11 to 7999186 on August 9, 2021 22:05
err := sip.checkForCutoverSignal(ctx, sip.closePoller)
if err != nil {
sip.pollingErr = errors.Wrap(err, "error while polling job for cutover signal")
sip.mu.pollingErr = errors.Wrap(err, "error while polling job for cutover signal")
Contributor

we need to sip.mu.Lock() and sip.mu.Unlock() the mutex around this.


if sip.pollingErr != nil {
sip.MoveToDraining(sip.pollingErr)
if sip.mu.pollingErr != nil {
Contributor

ditto comment as above

// mu is used to provide thread-safe read-write operations to ingestionErr
// and pollingErr.
mu struct {
sync.Mutex
Contributor

we usually use syncutil.Mutex

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from 7999186 to ff7b0eb on August 9, 2021 22:20
}
close(eventCh)
go func() {
defer close(eventCh)
Contributor

@annezhu98 one nit, though you can pick it up in your next PR to prevent another push since this is only testing code. We should close(errCh) too.

@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from ff7b0eb to 11e9a38 on August 10, 2021 14:35
…am client

Previously, if the sinkless client lost its connection, the processor would receive an error and move to draining. With the concept of generations, the sinkless client should send over a `GenerationEvent` once it has lost its connection. On receiving a `GenerationEvent`, the processor should wait for a cutover signal to be sent (the mechanism for issuing cutover signals on a new generation will be implemented in a following PR).

Release note: None
@annezhu98 annezhu98 force-pushed the hang-processors-on-losing-client branch from 11e9a38 to f5244f4 on August 10, 2021 14:41
@annezhu98
Author

bors r+

@craig
Contributor

craig bot commented Aug 10, 2021

Build succeeded:

@craig craig bot merged commit f28f98f into cockroachdb:master Aug 10, 2021
@annezhu98 annezhu98 deleted the hang-processors-on-losing-client branch August 10, 2021 20:05
@annezhu98 annezhu98 restored the hang-processors-on-losing-client branch August 17, 2021 18:23
@annezhu98 annezhu98 deleted the hang-processors-on-losing-client branch August 17, 2021 19:19