Rewrite Apple PAL using native APIs #336

amari · 2021-06-12T02:53:24Z

This is a rewrite of the Apple PAL that implements the AlignedAllocation, Entropy, and LazyCommit features through native Apple APIs.

It adds a dependency on Security.framework via SecRandomCopyBytes. Apple actively discourages use of getentropy and the symbol is not allowed on the App Store.

TODO / Discussion

`USE_POSIX_COMMIT_CHECKS`

I removed support for USE_POSIX_COMMIT_CHECKS because the POSIX memory management APIs are no longer used but I'd like to avoid a regression. I need some context about what the memset and mprotect calls in pal_bsd.h and pal_posix.h are supposed to accomplish. Is it supposed to emulate Windows' decommit behavior? If so, why do the pages need to be decommitted? What are the invariants?

I hope to use the mach APIs to implement the behavior if possible and prevent any M1-related bugs:

Can the `size` parameter of `reserve_aligned` be greater than `4GiB`?

There are Apple equivalents of ReclaimVirtualMemory and OfferVirtualMemory, but they only work on regions that are 4GiB and smaller. (These APIs have existed since macOS 10.6 and are present in the iOS SDK.)

The 4GiB limit is a hard limit in the XNU kernel even on 64-bit platforms.

`LowMemoryNotification`

~~I have a working implementation behind the feature SNMALLOC_APPLE_LOWMEMORYNOTIFICATION. I'll push the change after we resolve USE_POSIX_COMMIT_CHECKS.~~

ghost · 2021-06-12T02:53:37Z

All CLA requirements met.

mjp41 · 2021-06-14T08:46:41Z

@amari thanks for looking at this.

USE_POSIX_COMMIT_CHECKS

This was primarily introduced to stop people introducing Windows bugs on other platforms. For memory not to be part of the Working Set (Resident set), we need to Decommit the reserved memory. Any access to decommitted memory will cause a fault. Adding the mprotect was done to ensure any memory that is accessed has notified the PAL is will be using it.

For any memory access, the most recent call of reserve, pal_notify_using and pal_notify_not_using that covers that region of memory was either pal_notify_using or reserve with the committed flag set.

We do not require linearity, we can repeatedly call pal_notify_using or pal_notify_not_using and that is not considered a problem, i.e. the following is not considered an error:

  pal_notify_using(p, size);
  pal_notify_using(p, size);

I believe the code could be made to restrict this if required for the Apple PAL. But it would require some additional work to validate all the paths, and add cross platform testing.

Can the size parameter of reserve_aligned be greater than 4GiB?

Yes. If someone asks snmalloc for more than 4GiB of memory, then we ask for more than 4GiB.

LowMemoryNotification

Awesome

@davidchisnall, @ryancinsight, @devnexen do you have any thoughts on moving away from the POSIX view of Apple to the native APIs? This seems a good reason for the PAL, but do you have any compatibility concerns?

amari · 2021-06-14T09:14:29Z

@mjp41. There was a formatting error. I don't think that the contributors were pinged.

@davidchisnall, @ryancinsight, @devnexen

devnexen · 2021-06-14T09:45:44Z

Just tried your branch on Mac M1 and just does not build. Fails on

pal.h: 125:5:static_assert expression PAGE_SIZE == OS_PAGE_SIZE ... #define PAGE_SIZE vm_page_size

devnexen · 2021-06-14T09:47:14Z

I m not against removing getentropy support here, seems justifiable to avoid the app store issue, however I m less keen about the vm api revamp. but we ll see I m not against neither.

amari · 2021-06-14T10:02:01Z

Just tried your branch on Mac M1 and just does not build. Fails on

pal.h: 125:5:static_assert expression PAGE_SIZE == OS_PAGE_SIZE ... #define PAGE_SIZE vm_page_size

@devnexen. Are you compiling using Rosetta? Otherwise vm_page_size should be 16KiB. See these WWDC notes Determining CPU page size.

If not, PAGE_SIZE should be replaced by PAGE_MAX_SIZE. But, vm_page_size shouldn't be 4KiB for native arm apps. (I don't have an M1 to test this on, I'm just following the documentation.)

devnexen · 2021-06-14T10:15:10Z

Did not use rosetta. By the way I retried also in my older mac intel same issue. Is there any prerequisite ?

amari · 2021-06-14T10:45:11Z

I found this. I don't think externs play nice with static_assert. Before, it was comparing vm_page_size with itself and therefore the comparison optimized away.

Declaration of vm_page_size in /usr/include/mach/vm_page_size.h

extern  vm_size_t       vm_page_size;

Before macOS 11

/usr/include/mach/i386/vm_param.h

#define PAGE_SIZE  I386_PGBYTES // 4KiB

macOS 11 and Later

/usr/include/mach/i386/vm_param.h

#define PAGE_SIZE  vm_page_size // extern

/usr/include/mach/arm/vm_param.h

#define PAGE_SIZE  vm_page_size // extern

Also, If we use PAGE_SIZE we lose any benefit of page_size being a constexpr. The replacement is PAGE_MAX_SIZE.

src/pal/pal.h

devnexen · 2021-06-14T10:49:23Z

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

devnexen · 2021-06-14T11:09:07Z

Not sure for Intel about PAGE_MAX_SIZE tough, the build fails, my Mac M1 I can t use it now but might work.

amari · 2021-06-14T11:09:43Z

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue. The solution is to eliminate the static_assert. There's no way to get the page size at compile time though a macro.

devnexen · 2021-06-14T11:13:40Z

Might be still needed for non apple tough ? Just curious is your gear the ARM64 flavor ?

devnexen · 2021-06-14T11:15:50Z

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

amari · 2021-06-14T11:25:55Z

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

I'm using Intel. Between 10.15 and 11, Apple changed the definition of PAGE_MAX_SHIFT from 12 to 14 for Intel builds.

I think that the assertion should be skipped when __APPLE__ is defined b/c there is only one kernel build, unlike Linux and the BSDs where the page size can be build dependent.

devnexen · 2021-06-14T12:21:08Z

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

I'm using Intel. Between 10.15 and 11, Apple changed the definition of PAGE_MAX_SHIFT from 12 to 14 for Intel builds.

That s right, also same value for ARM thus my earlier assumption.
Last commit builds, and benchmarks outputs for master and this branch seems more or less on part.

ryancinsight · 2021-06-14T12:46:32Z

@amari thanks for looking at this.

USE_POSIX_COMMIT_CHECKS

This was primarily introduced to stop people introducing Windows bugs on other platforms. For memory not to be part of the Working Set (Resident set), we need to Decommit the reserved memory. Any access to decommitted memory will cause a fault. Adding the mprotect was done to ensure any memory that is accessed has notified the PAL is will be using it.

For any memory access, the most recent call of reserve, pal_notify_using and pal_notify_not_using that covers that region of memory was either pal_notify_using or reserve with the committed flag set.

We do not require linearity, we can repeatedly call pal_notify_using or pal_notify_not_using and that is not considered a problem, i.e. the following is not considered an error:
  pal_notify_using(p, size);
  pal_notify_using(p, size);
I believe the code could be made to restrict this if required for the Apple PAL. But it would require some additional work to validate all the paths, and add cross platform testing.

Can the size parameter of reserve_aligned be greater than 4GiB?

Yes. If someone asks snmalloc for more than 4GiB of memory, then we ask for more than 4GiB.

LowMemoryNotification

Awesome

@davidchisnall, @ryancinsight, @devnexen do you have any thoughts on moving away from the POSIX view of Apple to the native APIs? This seems a good reason for the PAL, but do you have any compatibility concerns?

I can't say I know enough about apple to forsee issues here, I don't currently use one either. That said, I would have thought arc4random_uniform/_buf would have been simpler but this route is pretty clever and I've see it pursued in swift. My only question is whether is an upperbound (only saw lower if there is one) to generated numbers or if there should be like there is in other implementations: https://opensource.apple.com/source/OpenSSH/OpenSSH-166/openssh/openbsd-compat/bsd-arc4random.c

devnexen · 2021-06-14T13:54:59Z

src/pal/pal_apple.h

+        mach_vm_offset_t mask = page_size - 1;
+
+        int flags =
+          VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE | VM_MAKE_TAG(PALAnonID);


would putting back VM_MAKE_TAG into a default flag work for you ?

I'd rather leave it as an expression. If in the future Apple refactors VM_MAKE_TAG into a function or any other non constexpr it should still compile.

ryancinsight · 2021-06-14T17:58:31Z

Also what's the benefit of this over CCrandomGeneratebytes? https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h

amari · 2021-06-14T19:47:19Z

@ryancinsight

Also what's the benefit of this over CCrandomGeneratebytes? https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h

What does Apple say?

SecRandomCopyBytes is the only one of these functions that has documentation on Apple's developer site.

I followed the documentation rabbit-hole, and Apple's own CommonCrypto docs direct developers to use SecRandomCopyBytes.

Cryptographic interfaces#Common Crypto library -> Cryptographic Services Guide#GeneratingRandomNumbers -> Randomization Services Guide -> SecRandomCopyBytes

Call Tracing

Did some digging on opensource.apple.com, I traced the calls:
SecRandomCopyBytes(rnd, count, bytes) -> CCRandomCopyBytes(kCCRandomDefault, bytes, count) -> CCRandomGenerateBytes(bytes, count).
It looks like Apple ignores the RNG selection and uses the default.

Wishful thinking

The most "correct" way to implement it would be to use corecrypto directly, but afaik it's an internal library:

#include <corecrypto/ccrng.h> // ccrng, ccrng_state, ccrng_generate
 
static uint64_t get_entropy64() {
  uint64_t result;
  int status;

  struct ccrng_state *rng = ccrng(&status);

  if (status != 0 || ccrng_generate(rng, sizeof(result), reinterpret_cast<void*>(&result)) != CCERR_OK) {
   error("Failed to get system randomness"); 
  }
  
  return result;
}

RNG specs

Here the specs of the RNG as of macOS 10.13:
https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADERS/corecrypto/ccrng.h#L28-L45

ryancinsight · 2021-06-14T20:11:22Z

@amari thanks for explanation! these docs are definitely a rabbit hole! Based on trace results I personally still prefer to be as direct as possible especially if it's skipping rng:

#elif defined(APPLE)
#include <CommonCrypto/CommonRandom.h>

And somehow use CCRandomGenerateBytes(buf, buf_len) == kCCSuccess;
Which should be in the header above afaik

However, your example would indeed be better though very similar: https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h.auto.html

amari · 2021-06-14T22:08:52Z

@ryancinsight

@amari thanks for explanation! these docs are definitely a rabbit hole! Based on trace results I personally still prefer to be as direct as possible especially if it's skipping rng:

#elif defined(APPLE)
#include <CommonCrypto/CommonRandom.h>

And somehow use CCRandomGenerateBytes(buf, buf_len) == kCCSuccess;
Which should be in the header above afaik

However, your example would indeed be better though very similar: https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h.auto.html

I can't find any information about whether CCRandomGenerateBytes is allowed on the App Store. (As the saying goes, no news is good news.) If there's an issue in the future, SNMALLOC_APPLE_APPSTORE_COMPAT should be defined and use SecCopyRandomBytes.

ryancinsight · 2021-06-15T02:31:17Z

They shouldn't give you a hard time based on this: https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510

amari · 2021-06-16T10:06:43Z

@devnexen

The performance regression(s)

I've been benchmarking and digging through the XNU kernel on sourcegraph to find the performance regression(s). Here are my findings:

The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM.
MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64).
Unlike Linux, madvise and mprotect will never return EAGAIN.
mach_vm_protect and mach_vm_behavior_set are slower (100x) than mprotect and madvise. Apparently these functions pack their arguments into message buffers and use RPC to communicate with the kernel. (I had thought that Apple used registers for this.) This was slowing down func-malloc-16 significantly.

mimalloc-bench

I ran mimalloc-bench with leanN and redis. Performance is on-par with master.

master

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.81 466080 119.84 5.29 17 118441
      
redis sn 7.258 27696 3.47 0.17 0 7247

pal-apple

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.90 480648 119.34 5.18 20 122216
      
redis sn 7.109 27620 3.41 0.16 0 7226

Where things stand

I refactored PALApple to inherit from PALBSD. However, the only shared code is PALBSD::zero and PALPOSIX::error. notify_using, notify_not_using, and print_stack_trace have Apple-specific definitions (x86_64 now calls MADV_FREE_REUSABLE due to reasons above). I also removed all mach_* calls except for the single mach_vm_map in reserve_aligned. I also removed the address randomization from reserve_aligned, which makes virtual memory profiles easier to compare.

ctest is a little slower on x86_64 (~5 to 10 seconds). This is due to replacing MADV_FREE with MADV_FREE_REUSABLE.

devnexen · 2021-06-16T10:20:18Z

@devnexen

The performance regression(s)

I've been benchmarking and digging through the XNU kernel on sourcegraph to find the performance regression(s). Here are my findings:

Nice rework

The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM.

MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64).

Unlike Linux, madvise and mprotect will never return EAGAIN.

Not sure about this statement, with madvise and MADV_FREE/MADV_FREE_REUSABLE it happened to me, rarely for sure tough.

mach_vm_protect and mach_vm_behavior_set are slower (100x) than mprotect and madvise. Apparently these functions pack their arguments into message buffers and use RPC to communicate with the kernel. (I had thought that Apple used registers for this.) This was slowing down func-malloc-16 significantly.

Really 100x ? Not fighting it really glad finally mmap calls are back :-)

mimalloc-bench

I ran mimalloc-bench with leanN and redis. Performance is on-par with master.

master
# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.81 466080 119.84 5.29 17 118441
      
redis sn 7.258 27696 3.47 0.17 0 7247
pal-apple
# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.90 480648 119.34 5.18 20 122216
      
redis sn 7.109 27620 3.41 0.16 0 7226
Where things stand

I refactored PALApple to inherit from PALBSD. However, the only shared code is PALBSD::zero and PALPOSIX::error. notify_using, notify_not_using, and print_stack_trace have Apple-specific definitions (x86_64 now calls MADV_FREE_REUSABLE due to reasons above). I also removed all mach_* calls except for the single mach_vm_map in reserve_aligned. I also removed the address randomization from reserve_aligned, which makes virtual memory profiles easier to compare.

ctest is a little slower on x86_64 (~5 to 10 seconds). This is due to replacing MADV_FREE with MADV_FREE_REUSABLE.

Too bad microquill is down for now :-) I filled an issue on the project.

src/pal/pal_apple.h

mjp41

Thanks for such an in depth analysis, and rounds of changes. I am happy subject to a couple of minor things.

src/pal/pal_apple.h

devnexen · 2021-06-17T08:11:14Z

I like the new changes minus would like to see back the EAGAIN loops.

amari · 2021-06-17T08:15:30Z

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

devnexen · 2021-06-17T08:18:26Z

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

Well macOs Big Sur (never saw it anywhere else at least until now), it s pretty much contextual, depends of the current overall nature of the workload. since january I have my M1 maybe it occured 4/5 times.

amari · 2021-06-17T08:53:00Z

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

Well macOs Big Sur (never saw it anywhere else at least until now), it s pretty much contextual, depends of the current overall nature of the workload. since january I have my M1 maybe it occured 4/5 times.

I tried to find a root cause. I couldn't find one. Oh well. The EAGAIN loops are back! ;)

devnexen · 2021-06-17T08:55:44Z

The overall looks good I got nothing major else to say.

mjp41

LGTM

mjp41 · 2021-06-21T08:44:06Z

@davidchisnall I am planning to merge this if you don't complain today ;-)

davidchisnall · 2021-06-21T08:56:05Z

src/pal/pal_apple.h

+    /**
+     * Source of Entropy
+     *
+     * Apple platforms do not have a public getentropy implementation, so use


This comment is misleading. It's a system call with a man page. It's completely public, Apple just has App Store policies preventing it from being used via some distribution channels. Do these apply to arc4random? This is documented as being periodically seeded from getentropy and should be fine for our purposes. I'm really nervous about a malloc implementation depending on a library that expects malloc to be working.

arc4random_buf can silently fail. CCGenerateRandomBytes returns an error code.

https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html

void arc4random_buf(void *buf, size_t buf_size) { arc4_init(); // the error is ignored ccrng_generate(rng, buf_size, buf); }

https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60178.40.2/lib/CommonRandom.c.auto.html

CCRNGStatus CCRandomGenerateBytes(void *bytes, size_t count) { int err; struct ccrng_state *rng; if (0 == count) { return kCCSuccess; } if (NULL == bytes) { return kCCParamError; } rng = ccDRBGGetRngState(); // the error is captured err = ccrng_generate(rng, count, bytes); if (err == CCERR_OK) { return kCCSuccess; } return kCCRNGFailure; }

I don't know why Apple didn't reimplement getentropy with CoreCrypto.

Apple's own libc depends on CoreCrypto, and CCGenerateRandomBytes does not depend on malloc.

libc -> CoreCrypto

CommonCrypto -> CoreCrypto

I'll correct the comment so that it is more clear and concise.

Also, arc4random is not seeded with genentropy:

https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html

static struct ccrng_state *rng; static void arc4_init(void) { int err; if (rng != NULL) return; rng = ccrng(&err); if (rng == NULL) { #if OS_CRASH_ENABLE_EXPERIMENTAL_LIBTRACE os_crash("arc4random: unable to get ccrng() handle (%d)", err); #else os_crash("arc4random: unable to get ccrng() handle"); #endif } } uint32_t arc4random(void) { uint32_t rand; arc4random_buf(&rand, sizeof(rand)); return rand; } void arc4random_buf(void *buf, size_t buf_size) { arc4_init(); ccrng_generate(rng, buf_size, buf); } uint32_t arc4random_uniform(uint32_t upper_bound) { uint64_t rand; arc4_init(); ccrng_uniform(rng, upper_bound, &rand); return (uint32_t)rand; }

Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?

If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?

The dependency is CommonCrypto. CoreCrypto is private. Yes, ccrng_generate is a private function.

CommonCrypto (/usr/lib/system/libcommonCrypto.dylib) is implicitly linked by dyld (macOS linker.) We have no say in the matter as library developers. If there's any "library initialization" that is dependent on malloc in CommonCrypto, I'd argue that the behavior is undefined and is intrinsic to macOS in production.

I'd argue that Apple makes no guarantees about future behavior of standard system APIs, as we saw with the macOS 11.2 release (a minor release!) that broke mprotect (W^X) and many projects experienced issues.

The descriptions of ccrng and ccrng_generate:

https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADERS/corecrypto/ccrng.h

/*! @function ccrng @abstract Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object. Getting a pointer to a ccrng has never been simpler! Call this function, get an rng object and then pass the object to ccrng_generate() to generate randoms. ccrng() may be called more than once. It returns pointer to the same object on all calls. @result a cryptographically secure random number generator or NULL if fails @discussion - It is significantly faster than using the system /dev/random - FIPS Compliant: NIST SP800-90A + FIPS 140-2 - Seeded from the system entropy. - Provides at least 128bit security if the system provide 2bit of entropy / byte. - Entropy accumulation - Backtracing resistance - Prediction break with frequent (asynchronous) reseed */

/*! @function ccrng_generate @abstract Generate `outlen` bytes of output, stored in `out`, using ccrng_state `rng`. @param rng `struct ccrng_state` representing the state of the RNG. @param outlen Amount of random bytes to generate. @param out Pointer to memory where random bytes are stored, of size at least `outlen`. @result 0 on success and nonzero on failure. */

ccrng "Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object [emphasis mine]."

Standards-compliant entities/organizations deploy to macOS and iOS using this API. See, https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510

ccrng is closed source. ccrng_generate calls the function pointer in ccrng_state.

If there is an error, we are killing the process.

I would much sooner kill the process. We build different versions of the library with and without the mitigations. If it fails with mitigations, then the user can try the version without mitigations. Silent failure seems like a problem for security auditing.

davidchisnall · 2021-06-21T09:02:48Z

src/pal/pal.h

+  // define `PAGE_SIZE` as an extern making it unsuitable for this assertion.
+#if !defined(__APPLE__) && defined(PAGE_SIZE)
  static_assert(
    PAGE_SIZE == OS_PAGE_SIZE,


Suggested change

PAGE_SIZE == OS_PAGE_SIZE,

#if __has_builtin(__builtin_constant_p)

!__builtin_constant_p(PAGE_SIZE) || (PAGE_SIZE == OS_PAGE_SIZE),

#else

true,

#endif

We can then get rid of the platform-specific ifdefs. At some point, the builtin can be replaced with something more portable (with C++17 there are template tricks to check for compile-time constants, with C++20 and consteval I think there are simpler ones).

davidchisnall · 2021-06-21T10:46:02Z

src/pal/pal_apple.h

-     * spotted in various projects; might be a temporary fix).
-     */
-    template<bool page_aligned = false>
-    static void zero(void* p, size_t size) noexcept


We seem to have lost this function. @mjp41, is it no longer needed? If so, we should update the porting document to remove it and remove it from all of the other PALs.

It is coming from the PALBSD one, but I think we should keep this version. Good catch

Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?

@devnexen, who opened #278, tested this branch on both Intel and M1 machines and found no regression.

About a week ago, I crawled XNU on sourcegraph in an attempt to diagnose root cause(s) for the M1 bugs, as well as reconcile the difference between the x86_64 and ARM codepaths of PALApple:

The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM. There's quite a bit of "creative" engineering going on in XNU as of macOS 11.3 . On ARM, the kernel uses 4KiB pages internally, but requires userspace to use 16KiB pages, unless the process is "exotic" (i.e. rosetta).

MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64). Apple's own malloc implementation as well as many ports for Apple Operating Systems use MADV_FREE_REUS{E, ABLE} instead of MADV_FREE. See https://bugs.chromium.org/p/chromium/issues/detail?id=713892 https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html

The reason why notify_using and zero look like duplicate functionality is b/c of optimization. Also related to another one of @davidchisnall comments below, when PALPOSIX::zero calls mmap, it can fail and call bzero on the same pages that mmap failed to overwrite.

In pseudocode, for brevity.

master:

PALPOSIX::zero<page_aligned>: if constexpr (zero_mem == YesZero) if page_aligned || is_aligned_block<OS::page_size>(p, size) mmap if r != MAP_FAILED return; bzero PALPOSIX::notify_using: mprotect(PROT_READ|PROT_WRITE) zero<true>

pal-apple:

PALApple::notify_using: if constexpr (zero_mem == YesZero) mmap if r != MAP_FAILED return; mprotect(PROT_READ|PROT_WRITE) madvise(MADV_FREE_REUSE) bzero

We are free to move the mmap before mprotect, thereby eliminating the other 2 syscalls. Otherwise it would be mprotect, mmap, which is wasteful.

Please note that I didn't want to deviate from the failure conditions of PALPOSIX::notify_using .

Thanks, that makes a lot of sense. It would be great to have a comment explaining this.

davidchisnall · 2021-06-21T10:47:29Z

src/pal/pal_apple.h

+     * Notify platform that we will be using these pages.
     */
    template<ZeroMem zero_mem>
    static void notify_using(void* p, size_t size) noexcept


This function now looks almost identical to the POSIX PAL version. Why do we need a separate one?

MADV_FREE != MADV_FREE_REUSABLE on XNU.
Please see the above comment. #336 (comment)

Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).

src/pal/pal_apple.h

davidchisnall · 2021-06-21T16:05:25Z

src/pal/pal_apple.h

-     * spotted in various projects; might be a temporary fix).
-     */
-    template<bool page_aligned = false>
-    static void zero(void* p, size_t size) noexcept


Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?

davidchisnall · 2021-06-21T16:05:50Z

src/pal/pal_apple.h

-        is_aligned_block<PALBSD::page_size>(p, size) || (zero_mem == NoZero));
+        is_aligned_block<page_size>(p, size) || (zero_mem == NoZero));
+
+      if constexpr (zero_mem == YesZero)


Please put this logic in zero()

Please see #336 (comment)

davidchisnall · 2021-06-21T16:08:30Z

src/pal/pal_apple.h

      if constexpr (zero_mem == YesZero)
-        zero<true>(p, size);
+      {
+        ::bzero(p, size);


We can only reach this if the mmap call above failed. I don't think that mmap should be able to fail, so the error condition there should probably be PAL::error (which kills the process): if we can't mmap over some address space that we've previously allocated, then something is badly wrong and we're probably trying to mmap over some address space that we shouldn't be trying to mmap over so bzeroing it will make things even worse.

PALPOSIX::zero behaves in the same fashion. PALWindows calls error if VirtualAlloc2FromApp fails.

I didn't want to deviate from PALPOSIX::notify_using's current behavior. PALPOSIX::notify_using will call PALPOSIX::zero. Currently if mmap fails, bzero is called on the pages that mmap failed to overwrite.

I feel that this is more of an issue of PALPOSIX::notify_using's behavior. I too would like PALPOSIX and PALApple to abort the current process, like PALWindows. But I feel that belongs in a different pull request; one that will also update PORTING.md to include the invariant "failed mmap calls will cause the process to abort" or something similar.

That seems reasonable to move that to a different PR.

davidchisnall · 2021-06-21T16:14:31Z

src/pal/pal_apple.h

+    /**
+     * Source of Entropy
+     *
+     * Apple platforms do not have a public getentropy implementation, so use


Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?

If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?

davidchisnall · 2021-06-22T08:18:52Z

src/pal/pal_apple.h

-     * spotted in various projects; might be a temporary fix).
-     */
-    template<bool page_aligned = false>
-    static void zero(void* p, size_t size) noexcept


Thanks, that makes a lot of sense. It would be great to have a comment explaining this.

davidchisnall · 2021-06-22T08:21:19Z

src/pal/pal_apple.h

+     * Notify platform that we will be using these pages.
     */
    template<ZeroMem zero_mem>
    static void notify_using(void* p, size_t size) noexcept


Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).

amari · 2021-06-23T07:18:03Z

@mjp41 LowMemoryNotification will be a different PR.

The perf-low_memory-* tests pass, but will sometimes get SIGKILL'd due to memory usage even though I can see the reclamation happen on the dirty and swap graphs of the profiler.
macOS has 3 memory pressure states: Normal, Warning, and Critical. Not only is a process is picked to be killed based on memory usage, but also the number of times it has triggered the Warning and/or Critical states.

So it looks like by default macOS is intolerant of processes that eat dirty memory and swap memory in a short time-frame. (The current test in low_memory.cc calls reach_pressure then reduce_pressure 10 times in a loop.)

There are sysctl knobs that govern this behavior, but they are undocumented so I still need to experiment.

mjp41 · 2021-06-23T07:26:44Z

@mjp41 LowMemoryNotification will be a different PR.

The perf-low_memory-* tests pass, but will sometimes get SIGKILL'd due to memory usage even though I can see the reclamation happen on the dirty and swap graphs of the profiler.
macOS has 3 memory pressure states: Normal, Warning, and Critical. Not only is a process is picked to be killed based on memory usage, but also the number of times it has triggered the Warning and/or Critical states. The current test in low_memory.cc calls reach_pressure then reduce_pressure 10 times in a loop. There are sysctl knobs that govern this behavior, but they are undocumented so I still need to experiment.

Perfect thanks. We are happy to alter the test to trigger it fewer times on Apply platforms if that helps.

Rewrite Apple PAL using native APIs

bb9af79

devnexen reviewed Jun 14, 2021

View reviewed changes

src/pal/pal.h Outdated Show resolved Hide resolved

Fix builds on macOS 11, PAGE_SIZE is not a constant

757d522

amari force-pushed the pal-apple branch from 65e482d to 757d522 Compare June 14, 2021 11:00

Fix builds on macOS 11, trust Apple PAL page_size

e6684f6

devnexen reviewed Jun 14, 2021

View reviewed changes

Replace SecRandomCopyBytes with CCRandomGenerateBytes

0924ac1

amari force-pushed the pal-apple branch from da1b489 to 0924ac1 Compare June 14, 2021 23:23

Implement USE_POSIX_COMMIT_CHECKS for Apple PAL

77d4c20

devnexen reviewed Jun 16, 2021

View reviewed changes

src/pal/pal_apple.h Show resolved Hide resolved

mjp41 requested changes Jun 16, 2021

View reviewed changes

src/pal/pal_apple.h Outdated Show resolved Hide resolved

src/pal/pal_apple.h Show resolved Hide resolved

src/pal/pal_apple.h Outdated Show resolved Hide resolved

src/pal/pal_apple.h Outdated Show resolved Hide resolved

devnexen reviewed Jun 16, 2021

View reviewed changes

src/pal/pal_apple.h Outdated Show resolved Hide resolved

amari added 4 commits June 17, 2021 02:32

Deduplicate pal_apple.h and pal_posix.h

ffe11e5

Document Apple PAL

b25217e

Remove possible undefined behavior from Apple PAL

142ed96

Remove address randomization from Apple PAL

d957d91

Bring back the EAGAIN loops to Apple PAL

e08266f

mjp41 approved these changes Jun 17, 2021

View reviewed changes

davidchisnall reviewed Jun 21, 2021

View reviewed changes

Document PALApple::get_entropy64 and PALApple::reserve_aligned

59ab670

amari force-pushed the pal-apple branch 2 times, most recently from e51fdde to 8235fa7 Compare June 22, 2021 01:59

Refactor the platform-specific PAGE_SIZE check

12f560a

amari force-pushed the pal-apple branch from 8235fa7 to 12f560a Compare June 22, 2021 03:40

davidchisnall approved these changes Jun 22, 2021

View reviewed changes

Document peculiarities of Apple PAL

e1fff7a

mjp41 approved these changes Jun 23, 2021

View reviewed changes

mjp41 merged commit 85843e9 into microsoft:master Jun 23, 2021

-    PAGE_SIZE == OS_PAGE_SIZE,
+#if __has_builtin(__builtin_constant_p)
+    !__builtin_constant_p(PAGE_SIZE) || (PAGE_SIZE == OS_PAGE_SIZE),
+#else
+    true,
+#endif

Rewrite Apple PAL using native APIs #336

Rewrite Apple PAL using native APIs #336

Uh oh!

Conversation

amari commented Jun 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TODO / Discussion

USE_POSIX_COMMIT_CHECKS

Can the size parameter of reserve_aligned be greater than 4GiB?

LowMemoryNotification

Uh oh!

ghost commented Jun 12, 2021 • edited by ghost Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mjp41 commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amari commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devnexen commented Jun 14, 2021

Uh oh!

devnexen commented Jun 14, 2021

Uh oh!

amari commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devnexen commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amari commented Jun 14, 2021

Uh oh!

Uh oh!

devnexen commented Jun 14, 2021

Uh oh!

devnexen commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amari commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

devnexen commented Jun 14, 2021

Uh oh!

devnexen commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amari commented Jun 14, 2021

Uh oh!

devnexen commented Jun 14, 2021

Uh oh!

ryancinsight commented Jun 14, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ryancinsight commented Jun 14, 2021

Uh oh!

amari commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does Apple say?

Call Tracing

Wishful thinking

RNG specs

Uh oh!

ryancinsight commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

amari commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ryancinsight commented Jun 15, 2021

Uh oh!

amari commented Jun 16, 2021

The performance regression(s)

mimalloc-bench

Where things stand

Uh oh!

devnexen commented Jun 16, 2021

The performance regression(s)

amari commented Jun 12, 2021 •

edited

Loading

`USE_POSIX_COMMIT_CHECKS`

Can the `size` parameter of `reserve_aligned` be greater than `4GiB`?

`LowMemoryNotification`

ghost commented Jun 12, 2021 •

edited by ghost

Loading

mjp41 commented Jun 14, 2021 •

edited

Loading

amari commented Jun 14, 2021 •

edited

Loading

amari commented Jun 14, 2021 •

edited

Loading

devnexen commented Jun 14, 2021 •

edited

Loading

devnexen commented Jun 14, 2021 •

edited

Loading

amari commented Jun 14, 2021 •

edited

Loading

devnexen commented Jun 14, 2021 •

edited

Loading

amari commented Jun 14, 2021 •

edited

Loading

ryancinsight commented Jun 14, 2021 •

edited

Loading

amari commented Jun 14, 2021 •

edited

Loading

devnexen commented Jun 17, 2021 •

edited

Loading

amari Jun 21, 2021 •

edited

Loading

amari Jun 21, 2021 •

edited

Loading