Skip to content

Intensify "xxx_one_in"'s default value in crash test#12127

Closed
hx235 wants to merge 1 commit intofacebook:mainfrom
hx235:aggresive_stress_test_value
Closed

Intensify "xxx_one_in"'s default value in crash test#12127
hx235 wants to merge 1 commit intofacebook:mainfrom
hx235:aggresive_stress_test_value

Conversation

@hx235
Copy link
Copy Markdown
Contributor

@hx235 hx235 commented Dec 7, 2023

Context/Summary:
My experimental stress runs with more frequent "xxx_one_in" surfaced a couple interesting bugs/issues with RocksDB or crash test framework in the past. We now consider changing the default value so they are run more frequently in production testing environment.

Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features e.g, manual compaction and file ingestion (increased by 3 orders) and expensive features e.g, checksum verification (increased by 1 order)

Test:
Monitor CI to see if it did surface more interesting bugs/issues. If not, we may consider intensify even more.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Copy Markdown
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I do generally believe this biases the bug finding toward places where bugs are most common, I believe it also widens some areas that could go under-stressed. In particular

  • The high occurrence of IO-heavy or DB mutex-heavy operations reduces stress on the fast write and read paths. With HCC, both write path and read path have lock-free algorithms for which stress test is our best defense against regression bugs.
  • Consistently high flush rates could mask issues that only show up with long-lived memtables or large SST files, etc.

Perhaps one way you could think about it is that there could be a higher-level random decision about where each run should be on a spectrum between "let the read and write paths flow as freely as possible" and "throw as many wrenches and curveballs into smooth DB operation as possible" and you could derive these other parameters from that one.

In other other words, there is risk in being too consistent with our randomness. So adding some higher-level randomness to the behavioral parameters could reproduce a larger suite of stress conditions. IMHO.

Copy link
Copy Markdown
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still overall a better config than before IMHO

@hx235
Copy link
Copy Markdown
Contributor Author

hx235 commented Dec 7, 2023

In other other words, there is risk in being too consistent with our randomness. So adding some higher-level randomness to the behavioral parameters could reproduce a larger suite of stress conditions. IMHO.

Yes - good idea. Let me do that as a follow up. An immediate follow-up could be lambda: random.choice() between the original value and intensified value as @akankshamahajan15 suggested.

@akankshamahajan15
Copy link
Copy Markdown
Contributor

akankshamahajan15 commented Dec 7, 2023

Yes - good idea. Let me do that as a follow up. An immediate follow-up could be lambda: random.choice() between 0 and intensified value as @akankshamahajan15 suggested.

I meant lambda: random.choice([current_value, intensified_value]) That way it only selects either the current value or the intensified value.

@hx235 hx235 force-pushed the aggresive_stress_test_value branch from b65af52 to c20a9cb Compare December 7, 2023 21:56
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@hx235
Copy link
Copy Markdown
Contributor Author

hx235 commented Dec 7, 2023

Yes - good idea. Let me do that as a follow up. An immediate follow-up could be lambda: random.choice() between 0 and intensified value as @akankshamahajan15 suggested.

I meant lambda: random.choice([current_value, intensified_value]) That way it only selects either the current value or the intensified value.

Yes - Sorry - I mistyped it and meant to say original value. Edited it

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pdillinger
Copy link
Copy Markdown
Contributor

I meant lambda: random.choice([current_value, intensified_value]) That way it only selects either the current value or the intensified value.

A good suggestion to reduce loss of coverage on the "let the read and write paths flow as freely as possible" side of the spectrum!

My speculation that there's value in covering even further on that side is... just speculation.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@hx235 merged this pull request in 179d2c7.

facebook-github-bot pushed a commit that referenced this pull request Dec 17, 2023
Summary:
**Context/Summary:**

Continued from #12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issue than just #12127.

I also randomly reduce the related parameters, write buffer size and target file base, to adapt to randomly lower number of # max key.  This creates 4 situations of testing, 3 of which are new:

1. **high** # max key with **high** write buffer size and target file base (existing)
2. **high** # max key with **low** write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files)
3. **low** # max key with **high** write buffer size and target file base (new, keys will stay in memory longer)
4. **low** # max key with **low** write buffer size and target file base (new, experimental runs show it surfaced even more issues)

Pull Request resolved: #12148

Test Plan:
- [Ongoing] Rehearsal stress test
- Monitor production stress test

Reviewed By: jaykorean

Differential Revision: D52174980

Pulled By: hx235

fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants