-
Notifications
You must be signed in to change notification settings - Fork 119
Rewrite Apple PAL using native APIs #336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@amari thanks for looking at this.
This was primarily introduced to stop people introducing Windows bugs on other platforms. For memory not to be part of the Working Set (Resident set), we need to Decommit the reserved memory. Any access to decommitted memory will cause a fault. Adding the For any memory access, the most recent call of We do not require linearity, we can repeatedly call I believe the code could be made to restrict this if required for the Apple PAL. But it would require some additional work to validate all the paths, and add cross platform testing.
Yes. If someone asks snmalloc for more than 4GiB of memory, then we ask for more than 4GiB.
Awesome @davidchisnall, @ryancinsight, @devnexen do you have any thoughts on moving away from the POSIX view of Apple to the native APIs? This seems a good reason for the PAL, but do you have any compatibility concerns? |
|
@mjp41. There was a formatting error. I don't think that the contributors were pinged. |
|
Just tried your branch on Mac M1 and just does not build. Fails on
|
|
I m not against removing getentropy support here, seems justifiable to avoid the app store issue, however I m less keen about the vm api revamp. but we ll see I m not against neither. |
@devnexen. Are you compiling using Rosetta? Otherwise If not, |
|
Did not use rosetta. By the way I retried also in my older mac intel same issue. Is there any prerequisite ? |
|
I found this. I don't think externs play nice with
Also, If we use |
|
More important |
|
Not sure for Intel about |
@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 |
|
Might be still needed for non apple tough ? Just curious is your gear the ARM64 flavor ? |
Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-) |
I'm using Intel. Between 10.15 and 11, Apple changed the definition of PAGE_MAX_SHIFT from 12 to 14 for Intel builds. I think that the assertion should be skipped when |
That s right, also same value for ARM thus my earlier assumption. |
I can't say I know enough about apple to forsee issues here, I don't currently use one either. That said, I would have thought arc4random_uniform/_buf would have been simpler but this route is pretty clever and I've see it pursued in swift. My only question is whether is an upperbound (only saw lower if there is one) to generated numbers or if there should be like there is in other implementations: https://opensource.apple.com/source/OpenSSH/OpenSSH-166/openssh/openbsd-compat/bsd-arc4random.c |
src/pal/pal_apple.h
Outdated
| mach_vm_offset_t mask = page_size - 1; | ||
|
|
||
| int flags = | ||
| VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE | VM_MAKE_TAG(PALAnonID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would putting back VM_MAKE_TAG into a default flag work for you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather leave it as an expression. If in the future Apple refactors VM_MAKE_TAG into a function or any other non constexpr it should still compile.
|
Also what's the benefit of this over CCrandomGeneratebytes? https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h |
What does Apple say?
I followed the documentation rabbit-hole, and Apple's own CommonCrypto docs direct developers to use SecRandomCopyBytes. Cryptographic interfaces#Common Crypto library -> Cryptographic Services Guide#GeneratingRandomNumbers -> Randomization Services Guide -> Call TracingDid some digging on opensource.apple.com, I traced the calls: Wishful thinkingThe most "correct" way to implement it would be to use corecrypto directly, but afaik it's an internal library: #include <corecrypto/ccrng.h> // ccrng, ccrng_state, ccrng_generate
static uint64_t get_entropy64() {
uint64_t result;
int status;
struct ccrng_state *rng = ccrng(&status);
if (status != 0 || ccrng_generate(rng, sizeof(result), reinterpret_cast<void*>(&result)) != CCERR_OK) {
error("Failed to get system randomness");
}
return result;
}RNG specsHere the specs of the RNG as of macOS 10.13: |
|
@amari thanks for explanation! these docs are definitely a rabbit hole! Based on trace results I personally still prefer to be as direct as possible especially if it's skipping rng: #elif defined(APPLE) And somehow use CCRandomGenerateBytes(buf, buf_len) == kCCSuccess; However, your example would indeed be better though very similar: https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h.auto.html |
I can't find any information about whether |
|
They shouldn't give you a hard time based on this: https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510 |
The performance regression(s)I've been benchmarking and digging through the XNU kernel on sourcegraph to find the performance regression(s). Here are my findings:
mimalloc-benchI ran mimalloc-bench with leanN and redis. Performance is on-par with master.
Where things standI refactored ctest is a little slower on x86_64 (~5 to 10 seconds). This is due to replacing |
Nice rework
Not sure about this statement, with madvise and MADV_FREE/MADV_FREE_REUSABLE it happened to me, rarely for sure tough.
Really 100x ? Not fighting it really glad finally mmap calls are back :-)
Too bad microquill is down for now :-) I filled an issue on the project. |
mjp41
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for such an in depth analysis, and rounds of changes. I am happy subject to a couple of minor things.
|
I like the new changes minus would like to see back the EAGAIN loops. |
I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe. |
Well macOs Big Sur (never saw it anywhere else at least until now), it s pretty much contextual, depends of the current overall nature of the workload. since january I have my M1 maybe it occured 4/5 times. |
I tried to find a root cause. I couldn't find one. Oh well. The |
|
The overall looks good I got nothing major else to say. |
mjp41
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
@davidchisnall I am planning to merge this if you don't complain today ;-) |
src/pal/pal_apple.h
Outdated
| /** | ||
| * Source of Entropy | ||
| * | ||
| * Apple platforms do not have a public getentropy implementation, so use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is misleading. It's a system call with a man page. It's completely public, Apple just has App Store policies preventing it from being used via some distribution channels. Do these apply to arc4random? This is documented as being periodically seeded from getentropy and should be fine for our purposes. I'm really nervous about a malloc implementation depending on a library that expects malloc to be working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
arc4random_buf can silently fail. CCGenerateRandomBytes returns an error code.
https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html
void
arc4random_buf(void *buf, size_t buf_size)
{
arc4_init();
// the error is ignored
ccrng_generate(rng, buf_size, buf);
}CCRNGStatus CCRandomGenerateBytes(void *bytes, size_t count)
{
int err;
struct ccrng_state *rng;
if (0 == count) {
return kCCSuccess;
}
if (NULL == bytes) {
return kCCParamError;
}
rng = ccDRBGGetRngState();
// the error is captured
err = ccrng_generate(rng, count, bytes);
if (err == CCERR_OK) {
return kCCSuccess;
}
return kCCRNGFailure;
}I don't know why Apple didn't reimplement getentropy with CoreCrypto.
Apple's own libc depends on CoreCrypto, and CCGenerateRandomBytes does not depend on malloc.
- libc -> CoreCrypto
- CommonCrypto -> CoreCrypto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll correct the comment so that it is more clear and concise.
Also, arc4random is not seeded with genentropy:
https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html
static struct ccrng_state *rng;
static void
arc4_init(void)
{
int err;
if (rng != NULL) return;
rng = ccrng(&err);
if (rng == NULL) {
#if OS_CRASH_ENABLE_EXPERIMENTAL_LIBTRACE
os_crash("arc4random: unable to get ccrng() handle (%d)", err);
#else
os_crash("arc4random: unable to get ccrng() handle");
#endif
}
}
uint32_t
arc4random(void)
{
uint32_t rand;
arc4random_buf(&rand, sizeof(rand));
return rand;
}
void
arc4random_buf(void *buf, size_t buf_size)
{
arc4_init();
ccrng_generate(rng, buf_size, buf);
}
uint32_t
arc4random_uniform(uint32_t upper_bound)
{
uint64_t rand;
arc4_init();
ccrng_uniform(rng, upper_bound, &rand);
return (uint32_t)rand;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?
If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dependency is CommonCrypto. CoreCrypto is private. Yes, ccrng_generate is a private function.
CommonCrypto (/usr/lib/system/libcommonCrypto.dylib) is implicitly linked by dyld (macOS linker.) We have no say in the matter as library developers. If there's any "library initialization" that is dependent on malloc in CommonCrypto, I'd argue that the behavior is undefined and is intrinsic to macOS in production.
I'd argue that Apple makes no guarantees about future behavior of standard system APIs, as we saw with the macOS 11.2 release (a minor release!) that broke mprotect (W^X) and many projects experienced issues.
The descriptions of ccrng and ccrng_generate:
https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADERS/corecrypto/ccrng.h
/*!
@function ccrng
@abstract Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object.
Getting a pointer to a ccrng has never been simpler!
Call this function, get an rng object and then pass the object to ccrng_generate() to generate randoms.
ccrng() may be called more than once. It returns pointer to the same object on all calls.
@result a cryptographically secure random number generator or NULL if fails
@discussion
- It is significantly faster than using the system /dev/random
- FIPS Compliant: NIST SP800-90A + FIPS 140-2
- Seeded from the system entropy.
- Provides at least 128bit security if the system provide 2bit of entropy / byte.
- Entropy accumulation
- Backtracing resistance
- Prediction break with frequent (asynchronous) reseed
*//*!
@function ccrng_generate
@abstract Generate `outlen` bytes of output, stored in `out`, using ccrng_state `rng`.
@param rng `struct ccrng_state` representing the state of the RNG.
@param outlen Amount of random bytes to generate.
@param out Pointer to memory where random bytes are stored, of size at least `outlen`.
@result 0 on success and nonzero on failure.
*/ccrng "Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object [emphasis mine]."
Standards-compliant entities/organizations deploy to macOS and iOS using this API. See, https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ccrng is closed source. ccrng_generate calls the function pointer in ccrng_state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is an error, we are killing the process.
I would much sooner kill the process. We build different versions of the library with and without the mitigations. If it fails with mitigations, then the user can try the version without mitigations. Silent failure seems like a problem for security auditing.
src/pal/pal.h
Outdated
| // define `PAGE_SIZE` as an extern making it unsuitable for this assertion. | ||
| #if !defined(__APPLE__) && defined(PAGE_SIZE) | ||
| static_assert( | ||
| PAGE_SIZE == OS_PAGE_SIZE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| PAGE_SIZE == OS_PAGE_SIZE, | |
| #if __has_builtin(__builtin_constant_p) | |
| !__builtin_constant_p(PAGE_SIZE) || (PAGE_SIZE == OS_PAGE_SIZE), | |
| #else | |
| true, | |
| #endif |
We can then get rid of the platform-specific ifdefs. At some point, the builtin can be replaced with something more portable (with C++17 there are template tricks to check for compile-time constants, with C++20 and consteval I think there are simpler ones).
| * spotted in various projects; might be a temporary fix). | ||
| */ | ||
| template<bool page_aligned = false> | ||
| static void zero(void* p, size_t size) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We seem to have lost this function. @mjp41, is it no longer needed? If so, we should update the porting document to remove it and remove it from all of the other PALs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is coming from the PALBSD one, but I think we should keep this version. Good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@devnexen, who opened #278, tested this branch on both Intel and M1 machines and found no regression.
About a week ago, I crawled XNU on sourcegraph in an attempt to diagnose root cause(s) for the M1 bugs, as well as reconcile the difference between the x86_64 and ARM codepaths of PALApple:
- The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM. There's quite a bit of "creative" engineering going on in XNU as of macOS 11.3 . On ARM, the kernel uses 4KiB pages internally, but requires userspace to use 16KiB pages, unless the process is "exotic" (i.e. rosetta).
- MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64). Apple's own malloc implementation as well as many ports for Apple Operating Systems use MADV_FREE_REUS{E, ABLE} instead of MADV_FREE. See https://bugs.chromium.org/p/chromium/issues/detail?id=713892 https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html
The reason why notify_using and zero look like duplicate functionality is b/c of optimization. Also related to another one of @davidchisnall comments below, when PALPOSIX::zero calls mmap, it can fail and call bzero on the same pages that mmap failed to overwrite.
In pseudocode, for brevity.
master:
PALPOSIX::zero<page_aligned>:
if constexpr (zero_mem == YesZero)
if page_aligned || is_aligned_block<OS::page_size>(p, size)
mmap
if r != MAP_FAILED
return;
bzero
PALPOSIX::notify_using:
mprotect(PROT_READ|PROT_WRITE)
zero<true>
pal-apple:
PALApple::notify_using:
if constexpr (zero_mem == YesZero)
mmap
if r != MAP_FAILED
return;
mprotect(PROT_READ|PROT_WRITE)
madvise(MADV_FREE_REUSE)
bzero
We are free to move the mmap before mprotect, thereby eliminating the other 2 syscalls. Otherwise it would be mprotect, mmap, which is wasteful.
Please note that I didn't want to deviate from the failure conditions of PALPOSIX::notify_using .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes a lot of sense. It would be great to have a comment explaining this.
| * Notify platform that we will be using these pages. | ||
| */ | ||
| template<ZeroMem zero_mem> | ||
| static void notify_using(void* p, size_t size) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function now looks almost identical to the POSIX PAL version. Why do we need a separate one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MADV_FREE != MADV_FREE_REUSABLE on XNU.
Please see the above comment. #336 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).
| * spotted in various projects; might be a temporary fix). | ||
| */ | ||
| template<bool page_aligned = false> | ||
| static void zero(void* p, size_t size) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?
| is_aligned_block<PALBSD::page_size>(p, size) || (zero_mem == NoZero)); | ||
| is_aligned_block<page_size>(p, size) || (zero_mem == NoZero)); | ||
|
|
||
| if constexpr (zero_mem == YesZero) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put this logic in zero()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see #336 (comment)
| if constexpr (zero_mem == YesZero) | ||
| zero<true>(p, size); | ||
| { | ||
| ::bzero(p, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can only reach this if the mmap call above failed. I don't think that mmap should be able to fail, so the error condition there should probably be PAL::error (which kills the process): if we can't mmap over some address space that we've previously allocated, then something is badly wrong and we're probably trying to mmap over some address space that we shouldn't be trying to mmap over so bzeroing it will make things even worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PALPOSIX::zero behaves in the same fashion. PALWindows calls error if VirtualAlloc2FromApp fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to deviate from PALPOSIX::notify_using's current behavior. PALPOSIX::notify_using will call PALPOSIX::zero. Currently if mmap fails, bzero is called on the pages that mmap failed to overwrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel that this is more of an issue of PALPOSIX::notify_using's behavior. I too would like PALPOSIX and PALApple to abort the current process, like PALWindows. But I feel that belongs in a different pull request; one that will also update PORTING.md to include the invariant "failed mmap calls will cause the process to abort" or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable to move that to a different PR.
src/pal/pal_apple.h
Outdated
| /** | ||
| * Source of Entropy | ||
| * | ||
| * Apple platforms do not have a public getentropy implementation, so use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?
If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?
e51fdde to
8235fa7
Compare
| * spotted in various projects; might be a temporary fix). | ||
| */ | ||
| template<bool page_aligned = false> | ||
| static void zero(void* p, size_t size) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that makes a lot of sense. It would be great to have a comment explaining this.
| * Notify platform that we will be using these pages. | ||
| */ | ||
| template<ZeroMem zero_mem> | ||
| static void notify_using(void* p, size_t size) noexcept |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).
|
@mjp41 The So it looks like by default macOS is intolerant of processes that eat dirty memory and swap memory in a short time-frame. (The current test in There are sysctl knobs that govern this behavior, but they are undocumented so I still need to experiment. |
Perfect thanks. We are happy to alter the test to trigger it fewer times on Apply platforms if that helps. |
This is a rewrite of the Apple PAL that implements the
AlignedAllocation,Entropy, andLazyCommitfeatures through native Apple APIs.It adds a dependency on
Security.frameworkviaSecRandomCopyBytes. Apple actively discourages use ofgetentropyand the symbol is not allowed on the App Store.TODO / Discussion
USE_POSIX_COMMIT_CHECKSI removed support for
USE_POSIX_COMMIT_CHECKSbecause the POSIX memory management APIs are no longer used but I'd like to avoid a regression. I need some context about what thememsetandmprotectcalls inpal_bsd.handpal_posix.hare supposed to accomplish. Is it supposed to emulate Windows' decommit behavior? If so, why do the pages need to be decommitted? What are the invariants?I hope to use the mach APIs to implement the behavior if possible and prevent any M1-related bugs:
Can the
sizeparameter ofreserve_alignedbe greater than4GiB?There are Apple equivalents of
ReclaimVirtualMemoryandOfferVirtualMemory, but they only work on regions that are4GiBand smaller. (These APIs have existed since macOS 10.6 and are present in the iOS SDK.)The
4GiBlimit is a hard limit in the XNU kernel even on 64-bit platforms.LowMemoryNotificationI have a working implementation behind the featureSNMALLOC_APPLE_LOWMEMORYNOTIFICATION. I'll push the change after we resolveUSE_POSIX_COMMIT_CHECKS.