Skip to content

Allow VMM reservoir to be configured by sled-agent#3200

Merged
jordanhendricks merged 1 commit into
oxidecomputer:mainfrom
jordanhendricks:vmm-rsrvr
May 25, 2023
Merged

Allow VMM reservoir to be configured by sled-agent#3200
jordanhendricks merged 1 commit into
oxidecomputer:mainfrom
jordanhendricks:vmm-rsrvr

Conversation

@jordanhendricks

@jordanhendricks jordanhendricks commented May 23, 2023

Copy link
Copy Markdown
Contributor

Builds on top of #2684, with some slight changes, and with the addition of allowing the reservoir size to be configured as a percentage of sled DRAM, via sled-agent's config.toml.

Summary:

  • adds support in sled-agent to configure the bhyve VMM reservoir
  • the reservoir size is configured as a percentage of DRAM, rounded down to the nearest 2MiB (this is to make support of large pages easier later)
  • the percentage is an optional value in sled-agent's config.toml (/opt/oxide/sled-agent/package/config.toml)
  • sled-agent reports the last set reservoir size back to Nexus after startup
  • currently the default percentage for all config.toml files shipped in the repo (gimlet, standalone gimlet, non-gimlet) is 0, so with this change, the reservoir is by default not used without manual intervention

Allowing the size to be configured in this way is intentional; we know that we can't use the reservoir without swap devices configured, and there are potentially other issues to sort out there too. A PR for having sled-agent set up a swap device is in progress. But having the reservoir bits in main is valuable for testing; we can configure swap devices manually and test use of the reservoir without sled-agent knowing how to set those up yet.

To change the reservoir configured by sled-agent, modify this field in /opt/oxide/sled-agent/pkg/config.toml:

vmm_reservoir_percentage = 0

Then restart sled-agent: $ svcadm restart sled-agent
Note that sled-agent currently tears down existing zones on startup; restarting sled-agent is thus destructive to existing instances on the sled.

An illegal value (< 0 or > 100) will cause sled-agent to panic.

Note that the config.toml is in the ramdisk, so changes here are not persisted across reboot.

@jordanhendricks jordanhendricks force-pushed the vmm-rsrvr branch 3 times, most recently from 74345d8 to 51a8db6 Compare May 24, 2023 18:53
@jordanhendricks jordanhendricks marked this pull request as ready for review May 24, 2023 20:23
@jordanhendricks jordanhendricks removed the request for review from luqmana May 25, 2023 02:47

@jmpesp jmpesp left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deployed this to the Canada region and it looks good (incidentally, it's also how I learned that two of my machines report 130986 for prtconf -m, and two report 130984!). I was able to launch instances on all four machines.

A few questions outside of the comments I left:

  • should setting a reservoir value impact usable_physical_ram at all?
  • should it impact the accounting during instance allocation? there's currently a // TODO: We should also validate the reservoir space, when it exists. message in DataStore::sled_reservation_create.

Comment thread sled-agent/src/instance_manager.rs
Comment thread sled-agent/src/instance_manager.rs

@pfmooney pfmooney left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bhyve/propolis-adjacent bits look fine to me. I'm no useful judge of the more omicron focused stuff, though.

Co-authored-by: Sean Klein <sean@oxide.computer>
@jordanhendricks

Copy link
Copy Markdown
Contributor Author

Thanks @jmpesp, for doing some additional testing of provisioning instances.

Some testing notes around the configuration parameter validation. I did these tests by modifying the config.toml and restarting sled-agent.

Valid parameters:

No value specified in config

16:18:02.351Z WARN SledAgent: Not using VMM reservoir
    sled_id = 295fa623-8759-4a84-8c05-d61639b0659b

0%

vmm_reservoir_percentage = 0

16:16:57.094Z WARN SledAgent: Not using VMM reservoir (size 0 bytes requested)

Arbitrary value: 20%

vmm_reservoir_percentage = 20

16:18:58.572Z INFO SledAgent (InstanceManager): Setting reservoir size to 26186 MiB bytes (20% of 137297096704 total = 27459419340 bytes requested)

rsrvrctl reports the same:

jordan@dunkin ~/omicron $ pfexec /usr/lib/rsrvrctl -q
Free KiB:       26814464
Allocated KiB:  0
Transient Allocated KiB:        0
Size limit KiB: 127845635

Invalid parameters:

-1%
vmm_reservoir_percentage = -1

sled-agent: Failed to parse config from /opt/oxide/sled-agent/pkg/config.toml: TOML parse error at line 37, column 28
   |
37 | vmm_reservoir_percentage = -1
   |                            ^^
invalid value: integer `-1`, expected u8

101%
vmm_reservoir_percentage = 101

thread 'main' panicked at 'invalid requested VMM reservoir percentage: 101', /home/jordan/omicron/sled-agent/src/sled_agent.rs:266:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ May 25 16:20:31 Stopping because all processes in service exited. ]
[ May 25 16:20:31 Executing stop method (:kill). ]
[ May 25 16:20:31 Restarting too quickly, changing state to maintenance. ]

@jordanhendricks

Copy link
Copy Markdown
Contributor Author
  • should setting a reservoir value impact usable_physical_ram at all?
  • should it impact the accounting during instance allocation? there's currently a // TODO: We should also validate the reservoir space, when it exists. message in DataStore::sled_reservation_create.

Discussed offline -- we should look at both of these things before enabling the reservoir by default. Will file issues today.

@jordanhendricks

Copy link
Copy Markdown
Contributor Author

Filed #3228 and #3223 as follow ups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants