
roachprod: RAID0 only network disks if both local and network are present #119906

Merged
craig[bot] merged 1 commit into cockroachdb:master from nameisbhaskar:user/bhaskar/persistent_disk_issue_gce
Mar 13, 2024

Conversation

@nameisbhaskar
Contributor

@nameisbhaskar nameisbhaskar commented Mar 5, 2024

Today, if a machine has both local and network disks, all of them are selected for RAID'ing.

But RAID'ing disks of different types causes inconsistent performance across the array.

To address this, local disks are now excluded from RAID'ing whenever network disks are present.

Fixes: #98783
Epic: none
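The rule described above can be sketched in Go. This is a hypothetical illustration, not the actual roachprod implementation: if any network disks are present, local disks are dropped from the RAID0 candidate set so the array never mixes disk types.

```go
package main

import "fmt"

// filterForRAID sketches the disk-selection rule from this PR: when any
// network disks are present, local SSDs are ignored entirely; otherwise the
// local SSDs remain the RAID0 candidates. The device paths below are
// illustrative only.
func filterForRAID(local, network []string) []string {
	if len(network) > 0 {
		return network // network disks present: ignore local SSDs
	}
	return local // no network disks: fall back to local SSDs
}

func main() {
	local := []string{"/dev/nvme0n1", "/dev/nvme1n1"}
	network := []string{"/dev/sdb", "/dev/sdc"}
	fmt.Println(filterForRAID(local, network)) // only network disks remain
	fmt.Println(filterForRAID(local, nil))     // local SSDs used as before
}
```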

@nameisbhaskar nameisbhaskar requested a review from a team as a code owner March 5, 2024 07:42
@nameisbhaskar nameisbhaskar requested review from renatolabs and srosenberg and removed request for a team March 5, 2024 07:42
@cockroach-teamcity
Member

This change is Reviewable

@nameisbhaskar nameisbhaskar changed the title roachprod: don't RAID0 local SSDs when PDs are present roachprod: RAID0 only network disks if both local and network are present Mar 6, 2024
@srosenberg
Member

srosenberg commented Mar 10, 2024

We should resolve this TODO [1] and do a full (test) run in both GCE and AWS. Have you thought about how to test this definitively? Note that the startup script is captured by the journalctl log (see FetchJournalctl). However, we don't save these unless a test failed (see collectArtifacts). Maybe we should change that behavior, at least for the test runs? This way we can post-analyze all the startup scripts to ensure there was no "funny" RAID'ing. What do you think?

[1]

// Work around an issue that RAID0s local NVMe and GP3 storage together:
// https://github.com/cockroachdb/cockroach/issues/98783.
//
// TODO(srosenberg): Remove this workaround when 98783 is addressed.
// TODO(miral): This now returns an error instead of panicking, so even though
// we haven't panicked here before, we should handle the error. Moot if this is
// removed as per TODO above.
s.AWS.MachineType, _, _ = spec.SelectAWSMachineType(s.CPUs, s.Mem, false /* shouldSupportLocalSSD */, vm.ArchAMD64)
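The second TODO in the snippet above notes that the error returned by SelectAWSMachineType is silently discarded with `_`. A minimal sketch of handling it instead, using a stand-in function (the real signature in the spec package, including its middle return value, is assumed here):

```go
package main

import (
	"errors"
	"fmt"
)

// selectAWSMachineType is a hypothetical stand-in for spec.SelectAWSMachineType.
// Per the snippet above, the real function returns a machine type and an error;
// the machine type and validation below are illustrative only.
func selectAWSMachineType(cpus, mem int, shouldSupportLocalSSD bool) (string, error) {
	if cpus <= 0 {
		return "", errors.New("cpus must be positive")
	}
	return "m6i.xlarge", nil // hypothetical result
}

func main() {
	// Instead of discarding the error with `_`, surface it to the caller.
	mt, err := selectAWSMachineType(4, 16, false /* shouldSupportLocalSSD */)
	if err != nil {
		fmt.Println("error selecting machine type:", err)
		return
	}
	fmt.Println(mt)
}
```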

@nameisbhaskar
Contributor Author

Regarding testing, I have tested roachprod deployment with different disk configurations. Here are the scenarios and the expected behaviour as per my understanding:

  1. (local SSD = 0, network disk = 1): no RAID'ing
  2. (local SSD = 1, network disk = 0): no RAID'ing
  3. (local SSD >= 1, network disk = 1): no RAID'ing
  4. (local SSD > 1, network disk = 0): local SSDs selected for RAID'ing
  5. (local SSD >= 0, network disk > 1): network disks selected for RAID'ing

Please let me know if this is the correct assumption. Also, if there is a combination of local SSDs and one network disk, all the disks will be mounted. Is that also the expected behaviour?

@nameisbhaskar
Contributor Author

Made some changes to only use the network disks when present:

  1. (local SSD = 0, network disk = 1): no RAID'ing; mount the network disk
  2. (local SSD = 1, network disk = 0): no RAID'ing; mount the local SSD
  3. (local SSD >= 1, network disk = 1): no RAID'ing; mount the network disk
  4. (local SSD > 1, network disk = 0): local SSDs selected for RAID'ing
  5. (local SSD >= 0, network disk > 1): network disks selected for RAID'ing
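The five scenarios above boil down to a small decision: network disks take precedence over local SSDs, and RAID0 is used only when more than one disk of the chosen type exists. A hypothetical Go sketch (not the actual roachprod code) of that decision:

```go
package main

import "fmt"

// diskAction sketches the scenario table above: pick network disks when any
// are present, otherwise local SSDs; RAID0 only if more than one disk of the
// chosen type exists, else just mount it. The returned strings are
// illustrative labels, not roachprod output.
func diskAction(localSSDs, networkDisks int) string {
	switch {
	case networkDisks > 1:
		return "raid0 network disks"
	case networkDisks == 1:
		return "mount network disk"
	case localSSDs > 1:
		return "raid0 local ssds"
	case localSSDs == 1:
		return "mount local ssd"
	default:
		return "no disks"
	}
}

func main() {
	fmt.Println(diskAction(2, 1)) // scenario 3: network disk wins, no RAID'ing
	fmt.Println(diskAction(3, 0)) // scenario 4: local SSDs RAID'ed
}
```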

@srosenberg srosenberg added backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only labels Mar 12, 2024
Member

@srosenberg srosenberg left a comment

Looks good, thanks!

@nameisbhaskar
Contributor Author

Thanks for reviewing and approving the change @srosenberg !

@nameisbhaskar
Contributor Author

bors r=srosenberg

@craig
Contributor

craig bot commented Mar 13, 2024

@blathers-crl

blathers-crl bot commented Mar 13, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6fd7ff8 to blathers/backport-release-23.1-119906: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@nameisbhaskar nameisbhaskar deleted the user/bhaskar/persistent_disk_issue_gce branch March 13, 2024 06:50
@renatolabs

Nice work!

Sorry if I missed something, but should we do something for Azure as well?

@nameisbhaskar
Contributor Author

Azure does not have RAID'ing implemented.


Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only
backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachprod,roachtest: don't RAID0 local SSD and PD/EBS

4 participants