Fix Intel CPUID leaf 4 cache topology for SMT#1002
Conversation
|
oh nice, thanks! totally an oversight when I was in there. how visible is this under so, if this is legible in the guest, could you adjust that test to check (.. I also see that in retrospect that test assumes SMT siblings are adjacent in APIC ID, which is definitely wrong in general.) |
|
This is what I get before the patch which indeed shows each vCPU as its own private L1 and L2 cache. After the patch I get Let me add a test for this |
|
Also I just noticed the propolis/phd-tests/tests/src/cpuid.rs Line 326 in dacb53d shouldn't sibling_idx be idx/2 instead of idx/4?The thread_siblings documentation in linux is vague but I think it is a hex bitmask where bit N is set if CPU N is a sibling. With SMT, CPUs 0-1 set bits 0-1 which gives '3', CPUs 2-3 set bits 2-3 which gives 'c' and so on. Since each hex digit covers 4 CPUs and we are iterating over pairs of siblings, we go through 2 pairs before moving to the next hex digit, so idx/2 makes sense instead of idx/4? The current code would only advance sibling_idx every 8 CPUs.I just ran the test it fails for me locally. |
When SMT is enabled, L1/L2 caches should report being shared by 2 logical processors (the SMT siblings). Previously EAX[25:14] was always being set to 0, indicating no sharing which contradicts the SMT topology reported in leaf 0xB. As per [1] EAX[25:14] indicates maximum number of addressable IDs for logical processors sharing this cache. This mismatch causes linux guest to print "BUG: arch topology borken / the SMT domain not a subset of the CLS domain" during boot. Linux derives L2 cache sharing groups from leaf 4 and expects SMT siblings to share L2 but it was being informed that each vCPU has private L1/L2. This brings the SMT handling logic in CPUID inline with what being done for AMD in fix_amd_cache_topo() which sets the sharing count to 2 when has_smt is true. This fixes oxidecomputer#1001. [1]: Table 1-15. Reference for CPUID Leaf 04H https://cdrdv2-public.intel.com/775917/intel-64-architecture-processor-topology-enumeration.pdf Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
The existing test assertion would fail on hosts with SMT enabled due to incorrect index calculations. Also add has_smt() helper to skip thread_siblings checks on non-SMT hosts and remove the unused itertools import. Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
Verify that Linux guest observes correct cache sharing topology from /sys/devices/system/cpu/cpu0/cache/. With SMT enabled, L1 and L2 caches should report sharing by SMT siblings while L3 should be shared across all vCPUs. Signed-off-by: Amey Narkhede <ameynarkhede03@gmail.com>
|
I fixed the |
|
Also noticed this late
Do you mean index3 😅/L3 which should be shared across all cores? Index 0 and 1 is split among L1i and L1d. I added test for L1 and L2 to be shared among SMT siblings and L3 to be among all CPUs in Add cache topology verification to guest_cpu_topo_test |
iximeow
left a comment
There was a problem hiding this comment.
hello! apologies again for taking forever to get back to this :)
I have a few notes here for references (for myself and others, things I wished I'd thought to cite earlier), so I'll push another commit that adjusts some of the docs a bit, squash it all, and merge shortly. thanks again for the fix and adjusting the test to suit 🙏
separately I was taking a look at this locally with propolis-standalone and pretty confused by:
root@ubuntu:~# cat /sys/devices/system/cpu/cpu*/cache/index*/shared_cpu_list
0
0
0
0
1
1
1
1
2
2
2
2
3
3
3
3
which... comes down to propolis-standalone taking a different path to CPU profiles specifically if there is no CPUID profile set. totally different issue! in that case we take the bhyve default literally and bhyve indeed does not indicate that L1/L2 caches are shared. on a quick skim that all pleads ignorance of SMT... except that CPUID_HTT is set in leaf 1 so at least I'm left wondering what all this looks like on a HT-capable processor with hyperthreading disabled. it's probably not terribly important over in bhyve the default CPUID behavior should probably represent threads as siblings in the topology here too; that'd be an issue over at illumos.org.
ps, to to myself from a few months ago:
(.. I also see that in retrospect that test assumes SMT siblings are adjacent in APIC ID, which is definitely wrong in general.)
a more careful reading of CACHE TOPOLOGY ENUMERATION USING CPUID LEAF 04H from the topology enumeration doc you linked, and this is said similarly in the AMD APM as well, actually does require cache-sharing cores to be colocated in APIC ID. quote Intel:
The CACHE_IDs for each cache level can be extracted from the x2APIC ID for processors that report 32-bit x2APIC ID, or from the initial APIC ID for processors that do not report x2APIC ID. ... The list of CACEH_Masks[n] of all types and levels when a bitwise AND is performed against its own x2APIC ID/APIC ID will provide the CACHE_ID[n] for that level and type that can be used to match other processors sharing the same cache ...
so cores sharing L1 or L2 cache must have APIC IDs that are the same outside bits corresponding to bits 25-14 in cpuid 4.eax. in a topology where three cores shared an L1 cache on a six-core processor, that yields APIC IDs 0, 1, 2, 4, 5, 6.
AMD describes APIC IDs having similar relatedness in Volume 3, appendix E "Obtaining Processor Information Via the CPUID Instruction", in the section about Function 8000_001Dh:
ShareId = LocalApicId >> log2(NumSharingCache+1)
so same as Intel, you can slice up the APIC ID into bitmasks of who shares caches.
|
Thanks for getting back to this! Yes I was testing this on machine with HT capable CPU but HT explicitly disabled. The propolis-standalone observation is interesting. Just checking in, is there anything else you need from me on this? |
no - I'd wanted to see how the APIC IDs work out if the processor is hyperthreading-capable-but-disabled, but I'm clearly not going to get back to that promptly.. this is clearly more accurate in the typical case so I'll just hit the button. thanks again for the patch 🙏 |
Update Crucible from `7103cd3a` to `bd9a0e2a`, picking up the following PRs: - Use an explicit rev for oxidecomputer git deps (oxidecomputer/crucible#1936) - Add Clone and Deserialize to VolumeInfo et al (oxidecomputer/crucible#1935) - Update omicron/oximeter (oxidecomputer/crucible#1933) - [meta] update to drift 0.1.4 (oxidecomputer/crucible#1932) - Don't log if there is nothing to log (oxidecomputer/crucible#1930) - Add VolumeInfo (oxidecomputer/crucible#1928) - Remove bonus Volume layer (oxidecomputer/crucible#1927) - Add session and client id to panic messages (oxidecomputer/crucible#1926) - [crucible-agent-types] migrate to RFD 619 pattern (oxidecomputer/crucible#1899) - Background read-only region creation (oxidecomputer/crucible#1919) - [crucible-downstairs-repair] switch to RFD 619 pattern (oxidecomputer/crucible#1901) - [crucible-pantry] switch to RFD 619 pattern (oxidecomputer/crucible#1900) - Use separate in-memory types (oxidecomputer/crucible#1913) - Remove old field from dtrace action script (oxidecomputer/crucible#1917) - Retry data writes that return an IO error (oxidecomputer/crucible#1915) - Bump dropshot to 0.17.0 (oxidecomputer/crucible#1909) - Reject snapshot requests when read-only (oxidecomputer/crucible#1914) - update ringbuf method, fix clippy lint (oxidecomputer/crucible#1904) - bump vergen-v9 version too (oxidecomputer/crucible#1903) - update dropshot to 0.16.7, dropshot-api-manager to 0.5.2 (oxidecomputer/crucible#1851) - perf-vol.d updates (oxidecomputer/crucible#1898) - upgrade progenitor to 0.13, reqwest to 0.13 (oxidecomputer/crucible#1854) - Remove cargo nextest from github workflow, out of space (oxidecomputer/crucible#1846) - Add a test for VCR serialize/deserialize (oxidecomputer/crucible#1843) Update Propolis from `bc489ddf` to `58ab73bd`, picking up the following PRs: - Bump crucible to latest, update Omicron, use explicit revs (oxidecomputer/propolis#1141) - Add project and silo ids to VM attestation (oxidecomputer/propolis#1114) - Update escargot (oxidecomputer/propolis#1139) - Prefix shebang and mark D scripts as executable (oxidecomputer/propolis#1140) - Fix error in propolis-server README (oxidecomputer/propolis#1138) - [meta] update to drift 0.1.4 (oxidecomputer/propolis#1137) - Fix Intel CPUID leaf 4 cache topology for SMT (oxidecomputer/propolis#1002) - support NVMe Deallocate (oxidecomputer/propolis#1105) - viona: do not lose used/avail indices (oxidecomputer/propolis#1135) - viona: multiqueue device should stay multiqueue across migration (oxidecomputer/propolis#1121) - Bump crucible rev to latest (oxidecomputer/propolis#1132) - expand zerocopy IntoBytes/FromByes use in guest memory accesses (oxidecomputer/propolis#1130) - dropshot-api-manager 0.7.1 (oxidecomputer/propolis#1129) - improve slog component setting (oxidecomputer/propolis#1124) - wait for viona Poller to run before declaring device running (oxidecomputer/propolis#1118) - virtio: tolerate importing queues with adjusted size (oxidecomputer/propolis#1117) - Run viona unit tests in CI (oxidecomputer/propolis#1120) - feature gate Crucible-specific boot digest code (oxidecomputer/propolis#1119) Also: - ran `cargo update -p vergen` - removed the `reqwest012` dependency - removed `reqwest012_client` from Nexus - ran `cargo hakari generate` and `cargo hakari manage-deps` - replace use of `ProgenitorOperationRetry` with `retry_operation_while_indefinitely` - during the region replacement drive saga, consume the new `VolumeInfo` from Propolis and use that to determine when to consider a replacement done
When SMT is enabled, L1/L2 caches should report being shared by 2 logical processors (the SMT siblings). Previously EAX[25:14] was always being set to 0, indicating no sharing which contradicts the SMT topology reported in leaf
0xB. As per [1] EAX[25:14] indicates maximum number of addressable IDs for logical processors sharing this cache.This mismatch causes linux guest to print "BUG: arch topology borken / the SMT domain not a subset of the CLS domain" during boot. Linux derives L2 cache sharing groups from leaf 4 and expects SMT siblings to share L2 but it was being informed that each vCPU has private L1/L2.
This brings the SMT handling logic in CPUID inline with whats being done for AMD in
fix_amd_cache_topo()which sets the sharing count to 2 when has_smt is true. This fixes #1001.[1]: Table 1-15. Reference for CPUID Leaf 04H
https://cdrdv2-public.intel.com/775917/intel-64-architecture-processor-topology-enumeration.pdf