Skip to content

Fix propagation of entries_read by calling streamPropagateGroupID unconditionally#12898

Merged
oranagra merged 5 commits into
redis:unstablefrom
enjoy-binbin:fix_entries_read
Feb 29, 2024
Merged

Fix propagation of entries_read by calling streamPropagateGroupID unconditionally#12898
oranagra merged 5 commits into
redis:unstablefrom
enjoy-binbin:fix_entries_read

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Dec 28, 2023

Copy link
Copy Markdown
Contributor

In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate
entries-read, entries-read will be inconsistent between master and replicas.
I.e. if no entries were claimed, it would have propagated correctly, but if some
were claimed, then the entries-read field would be inconsistent on the replica.

The fix was suggested by guybe7, call streamPropagateGroupID unconditionally,
so that we will normalize entries_read on the replicas. In the past, we would
only set propagate_last_id when NOACK was specified. And in #9127, XCLAIM did
not propagate entries_read in ACK, which would cause entries_read to be
inconsistent between master and replicas.

Another approach is add another arg to XCLAIM and let it propagate entries_read,
but we decided not to use it. Because we want minimal damage in case there's an
old target and new source (in the worst case scenario, the new source doesn't
recognize XGROUP SETID ... ENTRIES READ and the lag is lost. If we change XCLAIM,
the damage is much more severe).

In this patch, now if the user uses XREADGROUP .. COUNT 1 there will be an additional
overhead of MULTI, EXEC and XGROUPSETID. We assume the extra command in case of
COUNT 1 (4x factor, changing from one XCLAIM to MULTI+XCLAIM+XSETID+EXEC), is probably
ok since reading just one entry is in any case very inefficient (a client round trip
per record), so we're hoping it's not a common case.

Issue was introduced in #9127.

…onditionally

In XREADGROUP ACK, because streamPropagateXCLAIM does not propagate
entries-read, entries-read will be inconsistent between master and replicas.

The fix was suggested by guybe7, call streamPropagateGroupID unconditionally,
so that we will normalize entries_read on the replicas.

Issue was introduced in redis#9127.
Comment thread src/commands/xgroup-create.json
Comment thread tests/unit/type/stream-cgroups.tcl Outdated
@guybe7

guybe7 commented Dec 28, 2023

Copy link
Copy Markdown
Collaborator

revisiting the decision not to add another arg to XCLAIM: why is that a problem? it's only a problem if the master is new and replica is old (major-version-wise), which anyway isn't supported

@oranagra

@oranagra

Copy link
Copy Markdown
Member

revisiting the decision not to add another arg to XCLAIM: why is that a problem? it's only a problem if the master is new and replica is old (major-version-wise), which anyway isn't supported

@oranagra

@guybe7 you're right, but there are also other cases (like RL's replica-of, or loading an AOF), which can replicate a new version to and old one.
what bothers me is also the fact that even if we ignore the command failure, it means that the target skips the entire XCLAIM, not just the unrecognized option.

so if the current approach is viable, maybe it's a better one...
WDYT?

@enjoy-binbin

Copy link
Copy Markdown
Contributor Author

@oranagra @guybe7 avoid delaying for too long and losing the context. Do we have a final decision?

@guybe7

guybe7 commented Feb 20, 2024

Copy link
Copy Markdown
Collaborator

@oranagra ok, so what happens with the current code when the user uses NOACK? the server propagates the new form of XGROUP SETID (with ENTRIESREAD) and the receiving server (assuming it's an older version) will fail to execute it...

I mean, we probably have the same problem (old server fails to execute commands from a new server) in many other places, so I see no reason to avoid adding a new arg to XCLAIM

@oranagra

Copy link
Copy Markdown
Member

i already lost context 😞
trying to move forward without regaining it..

the difference could be if the command / combination that breaks it is new, or rarely used, in which case the replication won't break unless the user uses a certain feature.
does that help? or are both approaches similar in that respect?

@enjoy-binbin enjoy-binbin requested review from guybe7 and removed request for guybe7 February 29, 2024 02:17
@guybe7

guybe7 commented Feb 29, 2024

Copy link
Copy Markdown
Collaborator

talked about this with @oranagra and we decided to keep the current approach because we want minimal damage in case there's an old target and new source (in the worst case scenario, the new source doesn't recognize XGROUP SETID ... ENTRIES READ and the lag is lost. if we change XCLAIM, the damage is much more severe)

we also understand that if the user uses XREADGROUP .. COUNT 1 the will be an additional overhead of MULTI, EXEC and XGROUPSETID

@oranagra

Copy link
Copy Markdown
Member

to elaborate on that, we assume the extra command in case of COUNT 1 (4x factor, changing from one XCLAIM to MULTI+XCLAIM+XSETID+EXEC), is probably ok since reading just one entry is in any case very inefficient (a client round trip per record), so we're hoping it's not a common case.

@oranagra

Copy link
Copy Markdown
Member

@enjoy-binbin please ping me that it's ready for merge from your perspective

@enjoy-binbin

Copy link
Copy Markdown
Contributor Author

@oranagra I browsed through it again, it is ready to merge.

@oranagra oranagra merged commit f17381a into redis:unstable Feb 29, 2024
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Feb 29, 2024
@enjoy-binbin enjoy-binbin deleted the fix_entries_read branch February 29, 2024 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: No status
Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants