changefeedccl: add schema change events to nemeses#41842
changefeedccl: add schema change events to nemeses#41842craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
c98c944 to
fbc668e
Compare
fbc668e to
2f2497a
Compare
danhhz
left a comment
There was a problem hiding this comment.
high level first pass
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
eventMix: map[fsm.Event]int{ // eventOpenTxn opens an UPSERT or DELETE transaction. eventOpenTxn{}: 10,
why split this up? (not objecting, i just don't see offhand why it was necessary)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
// eventFeedMessage reads a message from the feed, or if the state machine // thinks there will be no message available, it falls back to // eventTransact.
nit: eventTransact is gone
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
// eventResume RESUMEs the changefeed. eventResume{}: 1,
ditto here, why split this up?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
} } else { if err := abort(fsm.Args{Ctx: ctx, Extended: ns}); err != nil {
abort is not quite the opposite of commit, we should add a rollback event too
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
if ns.currentTestColumnCount >= ns.maxTestColumnCount { // Do nothing if the table already has the maximum allowed columns.
bummer, i wonder if we can avoid generating this event by adding more state. IIRC there's some crazy pattern matching stuff in the fsm library that helps with the transition explosion, but I don't remember if it would be useful here. if that doesn't work out then i agree it's probably not worth the extra state
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
// Adding a column should trigger a full table scan. ns.db.QueryRow(`SELECT count(*) FROM foo`).Scan(&rows) ns.availableRows += rows
it should be twice this, right? the schema change itself does a kv backfill (which generates uninteresting rows at the changefeed level that we don't have a good way to filter) and then the changefeed runs a backfill after the schema change finishes
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 452 at r2 (raw file):
} if ns.currentTestColumnCount == 0 { // Do nothing if the table has no test columns.
ditto unfortunate
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 471 at r2 (raw file):
if ns.availableRows <= 0 { // No-op if there are no available rows to read.
if we're seeing this, it's because the availableRows accounting is wrong. we should probably look into that
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
ns := a.Extended.(*nemeses) defer func() { ns.txn = nil }() return ns.txn.Rollback()
hmm, i don't think we want to do this change? it was intentionally aborting a txn without its knowledge before
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
// table and then apply the row update. We do it this way to ensure that // the set of non-null columns in the scratch table is equal to the set of // columns in the original table.
I don't quite follow, do you have an example?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
// Ignore the fingerprint mismatch if there was an in-progress schema change job // on the table. // TODO(aayush): This is pretty hacky, how do we improve this?
it'd be nice to address this TODO before merge
ajwerner
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 291 at r2 (raw file):
func (eventFinished) Event() {} var compatibleStateTransitions = fsm.Pattern{
nit: just call this stateTransitions?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 300 at r2 (raw file):
stateRunning{Paused: fsm.False, TxnOpen: fsm.False}: { eventPause{}: { Next: stateRunning{Paused: fsm.True, TxnOpen: fsm.False},
This comment is more for my own edification than asking for a change. I feel like we lack principle about which transitions we're making explicitly possible in the FSM vs checking inside the implementation. For example, we're keeping track of row counts and column counts inside the extended state and then changing behavior inside of actions based on that extended state.
Moving more things into the state machine itself is heavy-weight due to the transition explosion but not moving it into the state machine makes the logic more implicit and harder to understand.
I wonder if it would be good to have some sort of mechanism to create predicates over the extended state and use them as pre-conditions for transitions.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 356 at r2 (raw file):
} var txnStateTransitions = fsm.Compile(compatibleStateTransitions)
nit: I know this is the old name but I don't love it. What about compiledStateTransitions?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
if deleteID == noDeleteSentinel { if err := txn.QueryRow( `UPSERT INTO foo VALUES ((random() * $1)::int, cluster_logical_timestamp()::string) RETURNING id, ts`,
This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be cluster_logical_timestamp()? Could it just be now()?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 514 at r2 (raw file):
// Don't error out if we got pushed, but don't increment availableRows no // matter what error was hit. if strings.Contains(err.Error(), `restart transaction`) {
Just out of curiosity I wonder if there's a more principled way to check if this is a serializable restart. Maybe by unwrapping the error into something which could hand you an error code? I have no idea. This comment is mostly just my own curiosity.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
it'd be nice to address this TODO before merge
It's honestly surprising to me (in a good way) that this works.
danhhz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
Previously, ajwerner wrote…
This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be
cluster_logical_timestamp()? Could it just benow()?
it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, ajwerner wrote…
It's honestly surprising to me (in a good way) that this works.
yeah, ditto. the workaround is incredibly clever
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
why split this up? (not objecting, i just don't see offhand why it was necessary)
It seemed easier to reason about if we split transact into an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
ditto here, why split this up?
This is just so I can avoid some code duplication. In nextEvent I'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from this eventMix map. Adding eventResume here allows me to do the same with pause/resume without separate logic.
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I don't quite follow, do you have an example?
Imagine that we have 4 columns in foo and 4 (non-null) columns in fprint. Now there is a removeColumn event that drops one of the columns in foo. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simply UPSERT them into fprint because that won't nullify the 4th column. So, we instead do a transactional DELETE, followed by an UPSERT to apply every row update.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
yeah, ditto. the workaround is incredibly clever
We could go one step further here and make sure that the table in question is indeed foo, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.
I explored ways to perhaps look at foo's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.
Are there any other ideas I can explore?
|
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I thought those "uninteresting" rows were being filtered out in |
danhhz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
It seemed easier to reason about if we split
transactinto an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.
wfm. I had this originally and for some reason switched it to this, but I can't remember why. As long as this doesn't cause any other compromised I prefer it to
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
This is just so I can avoid some code duplication. In
nextEventI'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from thiseventMixmap. AddingeventResumehere allows me to do the same with pause/resume without separate logic.
👍
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I thought those "uninteresting" rows were being filtered out in
cloudFeed.Next()(assuming they're not being emitted at a new timestamp), is this incorrect?
Oh! Yes you're right. That's really a bug in testFeed, so sounds like a good reminder/TODO to add to this comment.
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Imagine that we have 4 columns in
fooand 4 (non-null) columns infprint. Now there is aremoveColumnevent that drops one of the columns infoo. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simplyUPSERTthem intofprintbecause that won't nullify the 4th column. So, we instead do a transactionalDELETE, followed by anUPSERTto apply every row update.
we can't UPSERT (id, ts, col1, col2, col3) VALUES (...)?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
We could go one step further here and make sure that the table in question is indeed
foo, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.I explored ways to perhaps look at
foo's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.Are there any other ideas I can explore?
I'm missing some background. Why do the fingerprints not match?
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
we can't
UPSERT (id, ts, col1, col2, col3) VALUES (...)?
So if fprint contains col1, col2, col3, col4 and we UPSERT (id, ts, col1, col2, col3) VALUES (...) this only performs an update, and doesn't nullify col4. This would cause the validation to fail, even though it shouldn't.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I'm missing some background. Why do the fingerprints not match?
Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the fprint table still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.
The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.
|
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Could you explain why it's a bug? |
|
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I see. Would we say that the resultant state has |
2f2497a to
d56d02d
Compare
|
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
So there is a Let's say we add a boolean attribute to @ajwerner any ideas? |
d56d02d to
ae651a0
Compare
danhhz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
So there is a
fsm.Var("some_variable_name")that we can use for something like this. SeeeventPauseandeventResume. The problem is that the fsm library only supports bools.Let's say we add a boolean attribute to
stateRunningto keep track of whether test columns exist, call itTestColumnsExist. TheeventRemoveColumn, wouldn't always flip theTestColumnsExisttofalse. So the result state becomes non-deterministic.@ajwerner any ideas?
can we use fsm.Any somehow?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Could you explain why it's a bug?
the testFeeds are intended to remove (key,ts) duplicates, but I was lazy and removed (key, value) duplicates. see the numerous // TODO(dan): Track duplicates more precisely in sinklessFeed/tableFeed. comments in TestChangefeedSchemaChangeAllowBackfill
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
So if
fprintcontainscol1, col2, col3, col4and weUPSERT (id, ts, col1, col2, col3) VALUES (...)this only performs an update, and doesn't nullifycol4. This would cause the validation to fail, even though it shouldn't.
oh because fprint is a fixed number of columns, right. Then how about filling out all of col4...col10 with NULLs in the UPSERT?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the
fprinttable still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.
I'm worried about the hack because we lose coverage in a really tricky spot (during schema change backfill) so it'd be nice to address it if we can.
I'm still trying to wrap my head around why these don't match. If the original table is at version 3 and hasn't had the column dropped, shouldn't the fprint table also still have the column, and thus a matching fingerprint.
Side node: I'm incredibly surprised that the fingerprint code is actually resilient to the two tables having a different number of columns (as long as the one with extra columns has them filled with NULLs?). I did not intend that when writing it
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
can we use fsm.Any somehow?
We can make this work by providing variables in the event. For example we can include in eventAddColumn and eventRemoveColumn whether they will change the state.
For example:
type stateRunning struct {
...
CanRemoveColumn fsm.Bool
CanAddColumn fsm.Bool
}
type eventAddColumn struct {
CanAddColumnAfter fsm.Bool
}
type eventRemoveColumn struct {
CanRemoveColumnAfter fsm.Bool
}
Then in the transitions:
stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: {
eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: {
Next: stateRunning{
Paused: fsm.Var("paused"),
TxnOpen: fsm.False,
CanRemoveColumn: fsm.True,
CanAddColumn: fsm.Var("can_add_after"),
},
Action: logEvent(addColumn),
}
}
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, ajwerner wrote…
We can make this work by providing variables in the event. For example we can include in
eventAddColumnandeventRemoveColumnwhether they will change the state.For example:
type stateRunning struct { ... CanRemoveColumn fsm.Bool CanAddColumn fsm.Bool } type eventAddColumn struct { CanAddColumnAfter fsm.Bool } type eventRemoveColumn struct { CanRemoveColumnAfter fsm.Bool }Then in the transitions:
stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: { eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: { Next: stateRunning{ Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.True, CanAddColumn: fsm.Var("can_add_after"), }, Action: logEvent(addColumn), } }
This works because we can know when we create the event whether we'll reach the threshold
nvb
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1, 1 of 2 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: eventTransact is gone
nit: eventOpenTxn or eventCommit
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
abort is not quite the opposite of commit, we should add a rollback event too
+1
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review
I don't have any memory of this, but it does seem likely that this is to prevent the transaction from being auto-retried when being pushed.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, ajwerner wrote…
This works because we can know when we create the event whether we'll reach the threshold
Yeah, that would work. It still might be too heavyweight to justify though.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I see. Would we say that the resultant state has
TxnOpen = falsestill?
You mean TxnOpen = true? I think we would leave the transaction open.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):
eventResume{}: 1, // eventCommit commits the outstanding transaction.
nit: do you mind grouping all of the transaction-related events together?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 142 at r3 (raw file):
return nil, err } // Now push everything to make sure the initial scan can complete, otherwise
Why were we able to get rid of this?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 143 at r3 (raw file):
for i := 0; i < ns.maxTestColumnCount; i++ { testCol := fmt.Sprintf(", test%d STRING DEFAULT NULL", i) createFprintStmtBuf.WriteString(testCol)
fmt.Fprintf
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):
} // Filter out `eventFinished` and restore at the end of the method.
Woah, this is super gross. If eventFinished has a weight of 0, do we even need this?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 261 at r3 (raw file):
} // If there are no available rows, openTxn or commit outstanding txn instead of
nit: move this into the if _, ok := event.(eventFeedMessage); ok { block and panic("unreachable") below the for loop.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 264 at r3 (raw file):
// reading. if ns.availableRows < 1 { if ns.txn == nil {
Can't we get this from the state? Seems cleaner.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 273 at r3 (raw file):
type stateRunning struct { Paused fsm.Bool
FeedPaused?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 321 at r3 (raw file):
Action: logEvent(noteFeedMessage), }, eventSplit{}: {
I'm surprised we don't want to allow some of these while a transaction is ongoing.
pkg/ccl/changefeedccl/cdctest/validator.go, line 133 at r3 (raw file):
failures []string numColumns int
nit: we seem to keep failures []string separate, so much this up to the first group of fields.
pkg/sql/lease.go, line 989 at r2 (raw file):
// because of its use of `singleflight.Group`. See issue #41780 for how this has // happened. newCtx, cancel := m.stopper.WithCancelOnQuiesce(logtags.WithTags(context.Background(), logtags.FromContext(ctx)))
WithCancelOnQuiesce is a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?
pkg/util/fsm/fsm.go, line 145 at r3 (raw file):
} // GetExpandedStateTransitions returns the underlying expanded pattern map.
Doesn't this belong as a method on Transitions?
ae651a0 to
2a6c897
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Woah, this is super gross. If
eventFinishedhas a weight of 0, do we even need this?
Ah I didn't think of that. Done.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
If the original table is at version 3 and hasn't had the column dropped
At version 3, the original table will have had its last column dropped, but fprint would still have non-null values in that column. Those values for fprint would get nullified once the backfill is complete. Hence the mismatch.
2a6c897 to
2297b9a
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
eventOpenTxn or eventCommit
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
+1
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, that would work. It still might be too heavyweight to justify though.
I added this, I personally think this is pretty clean.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: do you mind grouping all of the transaction-related events together?
Done.
ajwerner
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 245 at r4 (raw file):
} log.Infof(context.TODO(), "curstate: %#v", state)
remove this debug line?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 269 at r4 (raw file):
// If there are no available rows, openTxn or commit outstanding txn instead // of reading. if ns.availableRows < 1 {
I don't think that you should do this as it's pretty heavy but we could avoid all of this special casing if we made the state machine aware of whether or not there were available rows. Unfortunately to do that we'd need to track even more things like probably turn openTransaction into explicit delete and upsert events and only delete if you have rows and well ... you get the picture.
pkg/sql/lease.go, line 989 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
WithCancelOnQuiesceis a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?
The old context was the clients context. That was the bug. Dan and I didn't have any suggestions on what a good timeout would be but if these calls hang (due to say the cluster being unavailable) we'll hang on shutdown. How do you see with cancel on quiesce as being problematic / a footgun here?
f025137 to
dc8c39c
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You mean
TxnOpen = true? I think we would leave the transaction open.
Done.
pkg/util/fsm/fsm.go, line 145 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Doesn't this belong as a method on
Transitions?
I wanted to avoid reCompileing the state transitions map. I restructured this bit so its a little bit cleaner. Let me know if this still seems lacking.
fc03b87 to
ec538d3
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
unnecessary now right?
also, why can't we add/remove columns while a txn is open? seems like a nice case to have tested if we can
I believe a schema change will block forever if a transaction is open.
3964be4 to
2f69548
Compare
|
This is RFAL! |
2f69548 to
9cb3a13
Compare
danhhz
left a comment
There was a problem hiding this comment.
modulo the one substantive comment about
fingerprint(row.updated.Prev())
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I believe a schema change will block forever if a transaction is open.
gotcha
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):
eventMix: map[fsm.Event]int{ // We don't want `eventFinished` to ever be returned by `nextEvent` so we set // it's weight to 0.
nit: its
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):
} var txnOpenBeforeInitialScan bool
nit: slightly more obvious to put txnOpenBeforeInitialScan := false here
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
}, stateRunning{ FeedPaused: fsm.False,
any reason this one and the next can't work while feed is paused?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):
Action: logEvent(abort), }, eventRollback{}: {
group this with eventCommit
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):
} func push(a fsm.Args) error {
move this down right above/below abort
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):
return err } ns.availableRows += 2 * rows
comment plz, something out each row once for kv backfill once for changefeed backfill
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):
return err } ns.availableRows += 2 * rows
ditto comment here plz
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):
} func commit(a fsm.Args) error {
group this and rollback up under openTxn
pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):
// exist before calling this constructor. func NewFingerprintValidator( sqlDB *gosql.DB, origTable, fprintTable string, partitions []string, testColumns int,
testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc
pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):
} // If we have processed all the row updates belonging to a certain timestamp,
"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."
pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):
if len(v.buffer) == 0 || v.buffer[0].updated != row.updated { lastFingerprintedAt = row.updated if err := v.fingerprint(row.updated); err != nil {
I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?
8f05236 to
3a50e29
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: its
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: slightly more obvious to put
txnOpenBeforeInitialScan := falsehere
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
any reason this one and the next can't work while feed is paused?
I'd disabled it because it made each individual run of the nemeses last substantially longer (up to like 2mins in some cases, as opposed to ~5s before), and there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
I suspect the reason it takes this takes much longer is because it now stays in FeedPaused=true much longer (because eventAddColumn and eventRemoveColumn now contend with eventResume in that state). We could tune the event weights to fix this but I'm not sure this is even a problem as of now. What do you think?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
group this with eventCommit
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
move this down right above/below abort
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
comment plz, something out each row once for kv backfill once for changefeed backfill
Done
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
ditto comment here plz
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
group this and rollback up under openTxn
Done.
pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc
Changed the name and added some explanation around what that parameter is. Let me know if it seems lacking in any way.
pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."
Done
pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?
Done
|
This is ready for a final look. Please let me know if the way I've handled the |
danhhz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO
there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should
pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):
// If we've processed all row updates belonging to the previous row's timestamp, // we fingerprint at `updated.Prev()` since want to catch cases where one or more
nit: since we want
1576f7a to
2ce3894
Compare
Currently, the changefeed nemeses system does not support schema changes. This has previously allowed some fairly critical bugs to creep past our testing. This PR adds support for schema change events to the changefeed nemeses. Release note: None.
2ce3894 to
c82d570
Compare
aayushshah15
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO
there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should
It seems to be taking less time after increasing the weight for eventResume, which makes sense.
pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: since we want
Done.
pkg/sql/lease.go, line 989 at r2 (raw file):
Previously, ajwerner wrote…
In that case we should either use context.Background or hang a context off of the lease manager so there's only one that it tied to the stopper.
Going to send out a new PR to address this.
ajwerner
left a comment
There was a problem hiding this comment.
Happy to see this going!
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
|
TFTRs! bors r+ |
41842: changefeedccl: add schema change events to nemeses r=aayushshah15 a=aayushshah15 Currently, the changefeed nemeses system does not support schema changes. This has previously allowed some fairly critical bugs to creep past our testing. This PR adds support for schema change events to the changefeed nemeses. Release note: None. Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Build succeeded |
|
\o/ |
Currently, the changefeed nemeses system does not support schema
changes. This has previously allowed some fairly critical bugs to creep past our testing.
This PR adds support for schema change events to the changefeed nemeses.
Release note: None.