Skip to content

Conversation

@amari
Copy link
Contributor

@amari amari commented Jun 12, 2021

This is a rewrite of the Apple PAL that implements the AlignedAllocation, Entropy, and LazyCommit features through native Apple APIs.

It adds a dependency on Security.framework via SecRandomCopyBytes. Apple actively discourages use of getentropy and the symbol is not allowed on the App Store.

TODO / Discussion

USE_POSIX_COMMIT_CHECKS

I removed support for USE_POSIX_COMMIT_CHECKS because the POSIX memory management APIs are no longer used but I'd like to avoid a regression. I need some context about what the memset and mprotect calls in pal_bsd.h and pal_posix.h are supposed to accomplish. Is it supposed to emulate Windows' decommit behavior? If so, why do the pages need to be decommitted? What are the invariants?

I hope to use the mach APIs to implement the behavior if possible and prevent any M1-related bugs:

Can the size parameter of reserve_aligned be greater than 4GiB?

There are Apple equivalents of ReclaimVirtualMemory and OfferVirtualMemory, but they only work on regions that are 4GiB and smaller. (These APIs have existed since macOS 10.6 and are present in the iOS SDK.)

The 4GiB limit is a hard limit in the XNU kernel even on 64-bit platforms.

LowMemoryNotification

I have a working implementation behind the feature SNMALLOC_APPLE_LOWMEMORYNOTIFICATION. I'll push the change after we resolve USE_POSIX_COMMIT_CHECKS.

@ghost
Copy link

ghost commented Jun 12, 2021

CLA assistant check
All CLA requirements met.

@mjp41
Copy link
Member

mjp41 commented Jun 14, 2021

@amari thanks for looking at this.

USE_POSIX_COMMIT_CHECKS

This was primarily introduced to stop people introducing Windows bugs on other platforms. For memory not to be part of the Working Set (Resident set), we need to Decommit the reserved memory. Any access to decommitted memory will cause a fault. Adding the mprotect was done to ensure any memory that is accessed has notified the PAL is will be using it.

For any memory access, the most recent call of reserve, pal_notify_using and pal_notify_not_using that covers that region of memory was either pal_notify_using or reserve with the committed flag set.

We do not require linearity, we can repeatedly call pal_notify_using or pal_notify_not_using and that is not considered a problem, i.e. the following is not considered an error:

  pal_notify_using(p, size);
  pal_notify_using(p, size);

I believe the code could be made to restrict this if required for the Apple PAL. But it would require some additional work to validate all the paths, and add cross platform testing.

Can the size parameter of reserve_aligned be greater than 4GiB?

Yes. If someone asks snmalloc for more than 4GiB of memory, then we ask for more than 4GiB.

LowMemoryNotification

Awesome

@davidchisnall, @ryancinsight, @devnexen do you have any thoughts on moving away from the POSIX view of Apple to the native APIs? This seems a good reason for the PAL, but do you have any compatibility concerns?

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

@mjp41. There was a formatting error. I don't think that the contributors were pinged.

@davidchisnall, @ryancinsight, @devnexen

@devnexen
Copy link
Collaborator

Just tried your branch on Mac M1 and just does not build. Fails on

pal.h: 125:5:static_assert expression PAGE_SIZE == OS_PAGE_SIZE ... #define PAGE_SIZE vm_page_size

@devnexen
Copy link
Collaborator

I m not against removing getentropy support here, seems justifiable to avoid the app store issue, however I m less keen about the vm api revamp. but we ll see I m not against neither.

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

Just tried your branch on Mac M1 and just does not build. Fails on

pal.h: 125:5:static_assert expression PAGE_SIZE == OS_PAGE_SIZE ... #define PAGE_SIZE vm_page_size

@devnexen. Are you compiling using Rosetta? Otherwise vm_page_size should be 16KiB. See these WWDC notes Determining CPU page size.

If not, PAGE_SIZE should be replaced by PAGE_MAX_SIZE. But, vm_page_size shouldn't be 4KiB for native arm apps. (I don't have an M1 to test this on, I'm just following the documentation.)

@devnexen
Copy link
Collaborator

devnexen commented Jun 14, 2021

Did not use rosetta. By the way I retried also in my older mac intel same issue. Is there any prerequisite ?

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

I found this. I don't think externs play nice with static_assert. Before, it was comparing vm_page_size with itself and therefore the comparison optimized away.

Declaration of vm_page_size in /usr/include/mach/vm_page_size.h

extern  vm_size_t       vm_page_size;

Before macOS 11

/usr/include/mach/i386/vm_param.h

#define PAGE_SIZE  I386_PGBYTES // 4KiB

macOS 11 and Later

/usr/include/mach/i386/vm_param.h

#define PAGE_SIZE  vm_page_size // extern

/usr/include/mach/arm/vm_param.h

#define PAGE_SIZE  vm_page_size // extern

Also, If we use PAGE_SIZE we lose any benefit of page_size being a constexpr. The replacement is PAGE_MAX_SIZE.

@devnexen
Copy link
Collaborator

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen
Copy link
Collaborator

devnexen commented Jun 14, 2021

Not sure for Intel about PAGE_MAX_SIZE tough, the build fails, my Mac M1 I can t use it now but might work.

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue. The solution is to eliminate the static_assert. There's no way to get the page size at compile time though a macro.

@devnexen
Copy link
Collaborator

Might be still needed for non apple tough ? Just curious is your gear the ARM64 flavor ?

@devnexen
Copy link
Collaborator

devnexen commented Jun 14, 2021

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

I'm using Intel. Between 10.15 and 11, Apple changed the definition of PAGE_MAX_SHIFT from 12 to 14 for Intel builds.

I think that the assertion should be skipped when __APPLE__ is defined b/c there is only one kernel build, unlike Linux and the BSDs where the page size can be build dependent.

@devnexen
Copy link
Collaborator

More important
pal/pal.h:124:3: error: static_assert failed due to requirement '(1 << 14) == OS_PAGE_SIZE' "Page size from system header does not match snmalloc config page size."

@devnexen Okay. Apple misrepesents the page size on macOS 11+. On x86_64 PAGE_MAX_SIZE is defined as 1 << 14, which is simply untrue.

Or maybe PAGE_MAX_SIZE is not really what it appears. May means more arch agnostic max value rather than arch specific, to be fair most of the time apple specifics in general are a bit of opaque :-)

I'm using Intel. Between 10.15 and 11, Apple changed the definition of PAGE_MAX_SHIFT from 12 to 14 for Intel builds.

That s right, also same value for ARM thus my earlier assumption.
Last commit builds, and benchmarks outputs for master and this branch seems more or less on part.

@ryancinsight
Copy link
Contributor

@amari thanks for looking at this.

USE_POSIX_COMMIT_CHECKS

This was primarily introduced to stop people introducing Windows bugs on other platforms. For memory not to be part of the Working Set (Resident set), we need to Decommit the reserved memory. Any access to decommitted memory will cause a fault. Adding the mprotect was done to ensure any memory that is accessed has notified the PAL is will be using it.

For any memory access, the most recent call of reserve, pal_notify_using and pal_notify_not_using that covers that region of memory was either pal_notify_using or reserve with the committed flag set.

We do not require linearity, we can repeatedly call pal_notify_using or pal_notify_not_using and that is not considered a problem, i.e. the following is not considered an error:

  pal_notify_using(p, size);
  pal_notify_using(p, size);

I believe the code could be made to restrict this if required for the Apple PAL. But it would require some additional work to validate all the paths, and add cross platform testing.

Can the size parameter of reserve_aligned be greater than 4GiB?

Yes. If someone asks snmalloc for more than 4GiB of memory, then we ask for more than 4GiB.

LowMemoryNotification

Awesome

@davidchisnall, @ryancinsight, @devnexen do you have any thoughts on moving away from the POSIX view of Apple to the native APIs? This seems a good reason for the PAL, but do you have any compatibility concerns?

I can't say I know enough about apple to forsee issues here, I don't currently use one either. That said, I would have thought arc4random_uniform/_buf would have been simpler but this route is pretty clever and I've see it pursued in swift. My only question is whether is an upperbound (only saw lower if there is one) to generated numbers or if there should be like there is in other implementations: https://opensource.apple.com/source/OpenSSH/OpenSSH-166/openssh/openbsd-compat/bsd-arc4random.c

mach_vm_offset_t mask = page_size - 1;

int flags =
VM_FLAGS_FIXED | VM_FLAGS_OVERWRITE | VM_MAKE_TAG(PALAnonID);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would putting back VM_MAKE_TAG into a default flag work for you ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather leave it as an expression. If in the future Apple refactors VM_MAKE_TAG into a function or any other non constexpr it should still compile.

@ryancinsight
Copy link
Contributor

Also what's the benefit of this over CCrandomGeneratebytes? https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

@ryancinsight

Also what's the benefit of this over CCrandomGeneratebytes? https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h

What does Apple say?

SecRandomCopyBytes is the only one of these functions that has documentation on Apple's developer site.

I followed the documentation rabbit-hole, and Apple's own CommonCrypto docs direct developers to use SecRandomCopyBytes.

Cryptographic interfaces#Common Crypto library -> Cryptographic Services Guide#GeneratingRandomNumbers -> Randomization Services Guide -> SecRandomCopyBytes

Call Tracing

Did some digging on opensource.apple.com, I traced the calls:
SecRandomCopyBytes(rnd, count, bytes) -> CCRandomCopyBytes(kCCRandomDefault, bytes, count) -> CCRandomGenerateBytes(bytes, count).
It looks like Apple ignores the RNG selection and uses the default.

Wishful thinking

The most "correct" way to implement it would be to use corecrypto directly, but afaik it's an internal library:

#include <corecrypto/ccrng.h> // ccrng, ccrng_state, ccrng_generate
 
static uint64_t get_entropy64() {
  uint64_t result;
  int status;

  struct ccrng_state *rng = ccrng(&status);

  if (status != 0 || ccrng_generate(rng, sizeof(result), reinterpret_cast<void*>(&result)) != CCERR_OK) {
   error("Failed to get system randomness"); 
  }
  
  return result;
}

RNG specs

Here the specs of the RNG as of macOS 10.13:
https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADERS/corecrypto/ccrng.h#L28-L45

@ryancinsight
Copy link
Contributor

ryancinsight commented Jun 14, 2021

@amari thanks for explanation! these docs are definitely a rabbit hole! Based on trace results I personally still prefer to be as direct as possible especially if it's skipping rng:

#elif defined(APPLE)
#include <CommonCrypto/CommonRandom.h>

And somehow use CCRandomGenerateBytes(buf, buf_len) == kCCSuccess;
Which should be in the header above afaik

However, your example would indeed be better though very similar: https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h.auto.html

@amari
Copy link
Contributor Author

amari commented Jun 14, 2021

@ryancinsight

@amari thanks for explanation! these docs are definitely a rabbit hole! Based on trace results I personally still prefer to be as direct as possible especially if it's skipping rng:

#elif defined(APPLE)
#include <CommonCrypto/CommonRandom.h>

And somehow use CCRandomGenerateBytes(buf, buf_len) == kCCSuccess;
Which should be in the header above afaik

However, your example would indeed be better though very similar: https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60061/include/CommonRandom.h.auto.html

I can't find any information about whether CCRandomGenerateBytes is allowed on the App Store. (As the saying goes, no news is good news.) If there's an issue in the future, SNMALLOC_APPLE_APPSTORE_COMPAT should be defined and use SecCopyRandomBytes.

@ryancinsight
Copy link
Contributor

They shouldn't give you a hard time based on this: https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510

@amari
Copy link
Contributor Author

amari commented Jun 16, 2021

@devnexen

The performance regression(s)

I've been benchmarking and digging through the XNU kernel on sourcegraph to find the performance regression(s). Here are my findings:

  1. The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM.
  2. MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64).
  3. Unlike Linux, madvise and mprotect will never return EAGAIN.
  4. mach_vm_protect and mach_vm_behavior_set are slower (100x) than mprotect and madvise. Apparently these functions pack their arguments into message buffers and use RPC to communicate with the kernel. (I had thought that Apple used registers for this.) This was slowing down func-malloc-16 significantly.

mimalloc-bench

I ran mimalloc-bench with leanN and redis. Performance is on-par with master.

master

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.81 466080 119.84 5.29 17 118441
      
redis sn 7.258 27696 3.47 0.17 0 7247

pal-apple

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.90 480648 119.34 5.18 20 122216
      
redis sn 7.109 27620 3.41 0.16 0 7226

Where things stand

I refactored PALApple to inherit from PALBSD. However, the only shared code is PALBSD::zero and PALPOSIX::error. notify_using, notify_not_using, and print_stack_trace have Apple-specific definitions (x86_64 now calls MADV_FREE_REUSABLE due to reasons above). I also removed all mach_* calls except for the single mach_vm_map in reserve_aligned. I also removed the address randomization from reserve_aligned, which makes virtual memory profiles easier to compare.

ctest is a little slower on x86_64 (~5 to 10 seconds). This is due to replacing MADV_FREE with MADV_FREE_REUSABLE.

@devnexen
Copy link
Collaborator

@devnexen

The performance regression(s)

I've been benchmarking and digging through the XNU kernel on sourcegraph to find the performance regression(s). Here are my findings:

Nice rework

  1. The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM.
  2. MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64).
  3. Unlike Linux, madvise and mprotect will never return EAGAIN.

Not sure about this statement, with madvise and MADV_FREE/MADV_FREE_REUSABLE it happened to me, rarely for sure tough.

  1. mach_vm_protect and mach_vm_behavior_set are slower (100x) than mprotect and madvise. Apparently these functions pack their arguments into message buffers and use RPC to communicate with the kernel. (I had thought that Apple used registers for this.) This was slowing down func-malloc-16 significantly.

Really 100x ? Not fighting it really glad finally mmap calls are back :-)

mimalloc-bench

I ran mimalloc-bench with leanN and redis. Performance is on-par with master.

master

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.81 466080 119.84 5.29 17 118441
      
redis sn 7.258 27696 3.47 0.17 0 7247

pal-apple

# --------------------------------------------------
# benchmark allocator elapsed rss user sys page-faults page-reclaims
      
leanN sn 34.90 480648 119.34 5.18 20 122216
      
redis sn 7.109 27620 3.41 0.16 0 7226

Where things stand

I refactored PALApple to inherit from PALBSD. However, the only shared code is PALBSD::zero and PALPOSIX::error. notify_using, notify_not_using, and print_stack_trace have Apple-specific definitions (x86_64 now calls MADV_FREE_REUSABLE due to reasons above). I also removed all mach_* calls except for the single mach_vm_map in reserve_aligned. I also removed the address randomization from reserve_aligned, which makes virtual memory profiles easier to compare.

ctest is a little slower on x86_64 (~5 to 10 seconds). This is due to replacing MADV_FREE with MADV_FREE_REUSABLE.

Too bad microquill is down for now :-) I filled an issue on the project.

Copy link
Member

@mjp41 mjp41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for such an in depth analysis, and rounds of changes. I am happy subject to a couple of minor things.

@devnexen
Copy link
Collaborator

I like the new changes minus would like to see back the EAGAIN loops.

@amari
Copy link
Contributor Author

amari commented Jun 17, 2021

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

@devnexen
Copy link
Collaborator

devnexen commented Jun 17, 2021

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

Well macOs Big Sur (never saw it anywhere else at least until now), it s pretty much contextual, depends of the current overall nature of the workload. since january I have my M1 maybe it occured 4/5 times.

@amari
Copy link
Contributor Author

amari commented Jun 17, 2021

@devnexen

I like the new changes minus would like to see back the EAGAIN loops.

I'm working on it. What OS version gave you EAGAIN? I going to look through sourcegraph to find any more undocumented errnos to be safe.

Well macOs Big Sur (never saw it anywhere else at least until now), it s pretty much contextual, depends of the current overall nature of the workload. since january I have my M1 maybe it occured 4/5 times.

I tried to find a root cause. I couldn't find one. Oh well. The EAGAIN loops are back! ;)

@devnexen
Copy link
Collaborator

The overall looks good I got nothing major else to say.

Copy link
Member

@mjp41 mjp41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mjp41
Copy link
Member

mjp41 commented Jun 21, 2021

@davidchisnall I am planning to merge this if you don't complain today ;-)

/**
* Source of Entropy
*
* Apple platforms do not have a public getentropy implementation, so use
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is misleading. It's a system call with a man page. It's completely public, Apple just has App Store policies preventing it from being used via some distribution channels. Do these apply to arc4random? This is documented as being periodically seeded from getentropy and should be fine for our purposes. I'm really nervous about a malloc implementation depending on a library that expects malloc to be working.

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arc4random_buf can silently fail. CCGenerateRandomBytes returns an error code.

https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html

void
arc4random_buf(void *buf, size_t buf_size)
{
	arc4_init();
        // the error is ignored
	ccrng_generate(rng, buf_size, buf);
}

https://opensource.apple.com/source/CommonCrypto/CommonCrypto-60178.40.2/lib/CommonRandom.c.auto.html

CCRNGStatus CCRandomGenerateBytes(void *bytes, size_t count)
{
    int err;
    struct ccrng_state *rng;

    if (0 == count) {
        return kCCSuccess;
    }

    if (NULL == bytes) {
        return kCCParamError;
    }

    rng = ccDRBGGetRngState();
    // the error is captured
    err = ccrng_generate(rng, count, bytes);
    if (err == CCERR_OK) {
        return kCCSuccess;
    }

    return kCCRNGFailure;
}

I don't know why Apple didn't reimplement getentropy with CoreCrypto.

Apple's own libc depends on CoreCrypto, and CCGenerateRandomBytes does not depend on malloc.

  • libc -> CoreCrypto
  • CommonCrypto -> CoreCrypto

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll correct the comment so that it is more clear and concise.

Also, arc4random is not seeded with genentropy:

https://opensource.apple.com/source/Libc/Libc-1439.40.11/gen/FreeBSD/arc4random.c.auto.html

static struct ccrng_state *rng;

static void
arc4_init(void)
{
	int err;

	if (rng != NULL) return;

	rng = ccrng(&err);
	if (rng == NULL) {
#if OS_CRASH_ENABLE_EXPERIMENTAL_LIBTRACE
		os_crash("arc4random: unable to get ccrng() handle (%d)", err);
#else
		os_crash("arc4random: unable to get ccrng() handle");
#endif
	}
}

uint32_t
arc4random(void)
{
	uint32_t rand;

	arc4random_buf(&rand, sizeof(rand));

	return rand;
}

void
arc4random_buf(void *buf, size_t buf_size)
{
	arc4_init();
	ccrng_generate(rng, buf_size, buf);
}

uint32_t
arc4random_uniform(uint32_t upper_bound)
{
	uint64_t rand;

	arc4_init();
	ccrng_uniform(rng, upper_bound, &rand);

	return (uint32_t)rand;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?

If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dependency is CommonCrypto. CoreCrypto is private. Yes, ccrng_generate is a private function.

CommonCrypto (/usr/lib/system/libcommonCrypto.dylib) is implicitly linked by dyld (macOS linker.) We have no say in the matter as library developers. If there's any "library initialization" that is dependent on malloc in CommonCrypto, I'd argue that the behavior is undefined and is intrinsic to macOS in production.

I'd argue that Apple makes no guarantees about future behavior of standard system APIs, as we saw with the macOS 11.2 release (a minor release!) that broke mprotect (W^X) and many projects experienced issues.

The descriptions of ccrng and ccrng_generate:

https://github.com/apple/darwin-xnu/blob/main/EXTERNAL_HEADERS/corecrypto/ccrng.h

/*!
 @function   ccrng
 @abstract   Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object.
             Getting a pointer to a ccrng has never been simpler!
             Call this function, get an rng object and then pass the object to ccrng_generate() to generate randoms.
             ccrng() may be called more than once. It returns pointer to the same object on all calls.
 @result  a cryptographically secure random number generator or NULL if fails
 @discussion
 - It is significantly faster than using the system /dev/random
 - FIPS Compliant: NIST SP800-90A + FIPS 140-2
 - Seeded from the system entropy.
 - Provides at least 128bit security if the system provide 2bit of entropy / byte.
 - Entropy accumulation
 - Backtracing resistance
 - Prediction break with frequent (asynchronous) reseed
 */
/*!
 @function   ccrng_generate
 @abstract   Generate `outlen` bytes of output, stored in `out`, using ccrng_state `rng`.
 @param rng  `struct ccrng_state` representing the state of the RNG.
 @param outlen  Amount of random bytes to generate.
 @param out  Pointer to memory where random bytes are stored, of size at least `outlen`.
 @result 0 on success and nonzero on failure.
 */

ccrng "Initializes an AES-CTR mode cryptographic random number generator and returns the statically-allocated rng object [emphasis mine]."

Standards-compliant entities/organizations deploy to macOS and iOS using this API. See, https://www.niap-ccevs.org/Documents_and_Guidance/view_td.cfm?TD=0510

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ccrng is closed source. ccrng_generate calls the function pointer in ccrng_state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error, we are killing the process.

I would much sooner kill the process. We build different versions of the library with and without the mitigations. If it fails with mitigations, then the user can try the version without mitigations. Silent failure seems like a problem for security auditing.

src/pal/pal.h Outdated
// define `PAGE_SIZE` as an extern making it unsuitable for this assertion.
#if !defined(__APPLE__) && defined(PAGE_SIZE)
static_assert(
PAGE_SIZE == OS_PAGE_SIZE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PAGE_SIZE == OS_PAGE_SIZE,
#if __has_builtin(__builtin_constant_p)
!__builtin_constant_p(PAGE_SIZE) || (PAGE_SIZE == OS_PAGE_SIZE),
#else
true,
#endif

We can then get rid of the platform-specific ifdefs. At some point, the builtin can be replaced with something more portable (with C++17 there are template tricks to check for compile-time constants, with C++20 and consteval I think there are simpler ones).

* spotted in various projects; might be a temporary fix).
*/
template<bool page_aligned = false>
static void zero(void* p, size_t size) noexcept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We seem to have lost this function. @mjp41, is it no longer needed? If so, we should update the porting document to remove it and remove it from all of the other PALs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is coming from the PALBSD one, but I think we should keep this version. Good catch

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devnexen, who opened #278, tested this branch on both Intel and M1 machines and found no regression.

About a week ago, I crawled XNU on sourcegraph in an attempt to diagnose root cause(s) for the M1 bugs, as well as reconcile the difference between the x86_64 and ARM codepaths of PALApple:

  1. The Apple silicon/ARM bugs were due to PALPOSIX::page_size being defined as Aal::smallest_page_size, which is 0x1000 for ARM. It needs to be 16KiB for Apple user-space on ARM. There's quite a bit of "creative" engineering going on in XNU as of macOS 11.3 . On ARM, the kernel uses 4KiB pages internally, but requires userspace to use 16KiB pages, unless the process is "exotic" (i.e. rosetta).
  2. MADV_FREE does not behave as expected. I'm not going to write the entire call tree, but the pages are never marked as "reusable" by the kernel and this can be observed through profiling. I tested MADV_FREE and MADV_FREE_REUSABLE. It is significant. At least ~75% to ~90% less dirty memory is used by func-malloc-16 (observed on x86_64). Apple's own malloc implementation as well as many ports for Apple Operating Systems use MADV_FREE_REUS{E, ABLE} instead of MADV_FREE. See https://bugs.chromium.org/p/chromium/issues/detail?id=713892 https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/magazine_malloc.c.auto.html

The reason why notify_using and zero look like duplicate functionality is b/c of optimization. Also related to another one of @davidchisnall comments below, when PALPOSIX::zero calls mmap, it can fail and call bzero on the same pages that mmap failed to overwrite.

In pseudocode, for brevity.

master:

PALPOSIX::zero<page_aligned>:
    if constexpr (zero_mem == YesZero)
        if page_aligned || is_aligned_block<OS::page_size>(p, size)
            mmap
            if r != MAP_FAILED
                return;
    bzero

PALPOSIX::notify_using:
    mprotect(PROT_READ|PROT_WRITE)
    zero<true>

pal-apple:

PALApple::notify_using:
    if constexpr (zero_mem == YesZero)
        mmap
        if r != MAP_FAILED
            return;
    mprotect(PROT_READ|PROT_WRITE)
    madvise(MADV_FREE_REUSE)
    bzero

We are free to move the mmap before mprotect, thereby eliminating the other 2 syscalls. Otherwise it would be mprotect, mmap, which is wasteful.

Please note that I didn't want to deviate from the failure conditions of PALPOSIX::notify_using .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes a lot of sense. It would be great to have a comment explaining this.

* Notify platform that we will be using these pages.
*/
template<ZeroMem zero_mem>
static void notify_using(void* p, size_t size) noexcept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function now looks almost identical to the POSIX PAL version. Why do we need a separate one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MADV_FREE != MADV_FREE_REUSABLE on XNU.
Please see the above comment. #336 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).

* spotted in various projects; might be a temporary fix).
*/
template<bool page_aligned = false>
static void zero(void* p, size_t size) noexcept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully the !Arm case has been fixed in the logic in the YesZero case in notify_using?

is_aligned_block<PALBSD::page_size>(p, size) || (zero_mem == NoZero));
is_aligned_block<page_size>(p, size) || (zero_mem == NoZero));

if constexpr (zero_mem == YesZero)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put this logic in zero()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see #336 (comment)

if constexpr (zero_mem == YesZero)
zero<true>(p, size);
{
::bzero(p, size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can only reach this if the mmap call above failed. I don't think that mmap should be able to fail, so the error condition there should probably be PAL::error (which kills the process): if we can't mmap over some address space that we've previously allocated, then something is badly wrong and we're probably trying to mmap over some address space that we shouldn't be trying to mmap over so bzeroing it will make things even worse.

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PALPOSIX::zero behaves in the same fashion. PALWindows calls error if VirtualAlloc2FromApp fails.

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to deviate from PALPOSIX::notify_using's current behavior. PALPOSIX::notify_using will call PALPOSIX::zero. Currently if mmap fails, bzero is called on the pages that mmap failed to overwrite.

Copy link
Contributor Author

@amari amari Jun 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel that this is more of an issue of PALPOSIX::notify_using's behavior. I too would like PALPOSIX and PALApple to abort the current process, like PALWindows. But I feel that belongs in a different pull request; one that will also update PORTING.md to include the invariant "failed mmap calls will cause the process to abort" or something similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable to move that to a different PR.

/**
* Source of Entropy
*
* Apple platforms do not have a public getentropy implementation, so use
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we depending on CommonCrypto or CoreCrypto? Is there an API contract that says that no future version of CCGenerateRandomBytes may ever call malloc or related APIs (including implicitly, as a result of depending on thread-local storage or as a result of running library constructors that must be run before we invoke the exported function)?

If there is an error, we are killing the process. Given that we use the entropy only for probabilistic mitigations, I'm not convinced that this is a better outcome. Under what situations can ccrng_generate fail? I presume we cannot call ccrng_generate directly because it's an internal function?

@amari amari force-pushed the pal-apple branch 2 times, most recently from e51fdde to 8235fa7 Compare June 22, 2021 01:59
* spotted in various projects; might be a temporary fix).
*/
template<bool page_aligned = false>
static void zero(void* p, size_t size) noexcept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that makes a lot of sense. It would be great to have a comment explaining this.

* Notify platform that we will be using these pages.
*/
template<ZeroMem zero_mem>
static void notify_using(void* p, size_t size) noexcept
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. We already have some template parameters for the generic PALs that allow tweaking things like the default mmap flags, I was wondering why this couldn't be done to make MADV_FREE something tunable, but it looks as if the order of operations is sufficiently different that this isn't possible (or, rather, would add a lot more complexity than having a custom implementation here).

@amari
Copy link
Contributor Author

amari commented Jun 23, 2021

@mjp41 LowMemoryNotification will be a different PR.

The perf-low_memory-* tests pass, but will sometimes get SIGKILL'd due to memory usage even though I can see the reclamation happen on the dirty and swap graphs of the profiler.
macOS has 3 memory pressure states: Normal, Warning, and Critical. Not only is a process is picked to be killed based on memory usage, but also the number of times it has triggered the Warning and/or Critical states.

So it looks like by default macOS is intolerant of processes that eat dirty memory and swap memory in a short time-frame. (The current test in low_memory.cc calls reach_pressure then reduce_pressure 10 times in a loop.)

There are sysctl knobs that govern this behavior, but they are undocumented so I still need to experiment.

@mjp41
Copy link
Member

mjp41 commented Jun 23, 2021

@mjp41 LowMemoryNotification will be a different PR.

The perf-low_memory-* tests pass, but will sometimes get SIGKILL'd due to memory usage even though I can see the reclamation happen on the dirty and swap graphs of the profiler.
macOS has 3 memory pressure states: Normal, Warning, and Critical. Not only is a process is picked to be killed based on memory usage, but also the number of times it has triggered the Warning and/or Critical states. The current test in low_memory.cc calls reach_pressure then reduce_pressure 10 times in a loop. There are sysctl knobs that govern this behavior, but they are undocumented so I still need to experiment.

Perfect thanks. We are happy to alter the test to trigger it fewer times on Apply platforms if that helps.

@mjp41 mjp41 merged commit 85843e9 into microsoft:master Jun 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants