Skip to content

Conversation

@ny0312
Copy link
Contributor

@ny0312 ny0312 commented Jun 10, 2021

Fix flaky test case reported here: #8474

The root cause is that one test (5 keys in, 5 keys out) is leaking a volatile key that can expire while another later test(All TTL in commands are propagated as absolute timestamp in replication stream) is running. Such leaked expiration injects an unexpected DEL command into the replication command during the later test, causing it to fail.

The fixes are two fold:

  1. Plug the leak in the first test.
  2. Add FLUSHALL to the later test, to avoid future interference from other tests.

madolson
madolson previously approved these changes Jun 10, 2021
@madolson
Copy link
Contributor

@ny0312 I think it would be okay to merge it now so we can start watching the nightly runs, It's Oran's weekend now.

eliblight
eliblight previously approved these changes Jun 11, 2021
Copy link
Contributor

@eliblight eliblight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!!!
isn't it sufficient to just add a FLUSHALL to that test before attaching to the replication stream.
Ie. Instead of this massive LOC change?

@ny0312
Copy link
Contributor Author

ny0312 commented Jun 11, 2021

Good catch!!!
isn't it sufficient to just add a FLUSHALL to that test before attaching to the replication stream.
Ie. Instead of this massive LOC change?

That would also be sufficient. It depends on which convention we want to establish when it comes to isolating a test case from all side effects of other test cases:

  1. Use a dedicated server, or
  2. Do a "clean up" on the same server. Such "clean up" may include FLUSHALL and other things, for example, reverting/resetting configurations that have been set in earlier tests.

Personally I would prefer (1). It is slower. But it is explicit, simple and clean. It frees an author of a new test from having to understand the side effects of all preceding tests in order to properly "clean up" their side effects. I would encourage all contributors to follow this convention.

@oranagra
Copy link
Member

Spinning up a dedicated server for each test would slow down the test suite a lot.
We obviously use that on tests that cause crashes, or tests that are really very fragile, where any slight change can break them.
On other tests it is sufficient to do FLUSHALL.
There's also another concern which is the configuration, here we usually prefer that tests restore the original configuration of configs they changed.

@ny0312
Copy link
Contributor Author

ny0312 commented Jun 11, 2021

Spinning up a dedicated server for each test would slow down the test suite a lot.
We obviously use that on tests that cause crashes, or tests that are really very fragile, where any slight change can break them.
On other tests it is sufficient to do FLUSHALL.
There's also another concern which is the configuration, here we usually prefer that tests restore the original configuration of configs they changed.

So your preferred convention is "re-use servers across test cases unless it is impossible or too hard to make work".

I agree that this approach would give us the fastest tests therefore shortest feedback cycle. It would more friendly to contributors of Redis, which is very important.

I will change to FLUSHALL if no other objections.

@ny0312 ny0312 dismissed stale reviews from eliblight and madolson via efe45cf June 13, 2021 03:34
@ny0312 ny0312 force-pushed the fix-absolute-ttl-test branch from e751c0f to efe45cf Compare June 13, 2021 03:34
@ny0312
Copy link
Contributor Author

ny0312 commented Jun 13, 2021

Updated. Please review. @madolson @oranagra

@oranagra oranagra merged commit fb140a1 into redis:unstable Jun 13, 2021
@ny0312 ny0312 deleted the fix-absolute-ttl-test branch July 2, 2021 17:28
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
The root cause is that one test (`5 keys in, 5 keys out`) is leaking a volatile key
that can expire while another later test(`All TTL in commands are propagated
as absolute timestamp in replication stream`) is running.
Such leaked expiration injects an unexpected `DEL` command into the
replication command during the later test, causing it to fail.

The fixes are two fold:
1. Plug the leak in the first test.
2. Add FLUSHALL to the later test, to avoid future interference from other tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants