c-deps: update jemalloc to 5.3.0#93045
Conversation
|
There were some memory leak concerns identified in #83289. Based on those, I took this SHA for a spin and compared it to The annotations, in order:
There was no observable increase in RSS in the period after the load test ran. Without patch: With patch: |
|
@nicktrav Did you build with |
Yikes! TIL. I did not build with that. Thank you for catching! |
|
Yeah, so our When we're confident that we do want to update to this version of |
|
The file |
|
Re-ran the same experiments with the correct build flags. There's a slight improvement in the fragmentation with the patch applied (red and orange lines closer together), and the total cgo memory is lower. No surprises here, given we're picking up a few years worth of improvements / optimizations. I didn't see any evidence of leaking memory, though I'm going to keep this cluster up for the next day or so. After: |
6598201 to
22b853f
Compare
|
Rebased on latest Any other testing / experimenting here that folks would like to see before landing this? Ideally this has a substantial amount of burn time on |
|
@nicktrav If you're comfortable with 5.3.0 I can kick off the archiving process so we can get this merged. |
Great - let's do that. |
|
Go ahead and apply the following patch to your PR and we'll check that CI is okay: |
|
@rickystewart - thanks for that. Just pushed a new commit with the changes. Happy to squash that into the first commit too. |
|
@nicktrav Let's squash into one commit, since all these changes should be atomic and bundled together. Let's check CI and make sure everything looks good. |
10ff22b to
df96697
Compare
Ack. Done. Also rebased on latest |
|
@rickystewart - any insight into these test failures I hit the same issue on my linux box doing a cross-compile for $ ./dev build short --cross=macos
...
external/go_sdk/pkg/tool/linux_amd64/link: running external/cross_x86_64_macos_toolchain/bin/x86_64-apple-darwin21.2-cc failed: exit status 1
Undefined symbols for architecture x86_64:
"_je_zone_register", referenced from:
__cgo_dba21935bf37_Cfunc_je_zone_register in 000013.o
(maybe you meant: __cgo_dba21935bf37_Cfunc_je_zone_register)
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
link: error running subcommand external/go_sdk/pkg/tool/linux_amd64/link: exit status 2 |
Is it just me, or are these two strings exactly the same? I'll try to figure out what's going on here, might take me a little bit though. |
Yeah ... goofy. |
|
OK. I can get a working build if I make this change. I have no idea why the |
|
Strange .. the upstream source doesn't have https://github.com/jemalloc/jemalloc/blob/54eaed1d8b56b1aa528be3bdd1877e59c56fa90c/src/zone.c#L438 I can push the patch to update the function call. I assume that wasn't failing prior to applying your initial patch as it was using the older artifacts? |
|
I found #35620. Possibly related? That issue was fixed on the upstream project from what I can tell. And we have that patch in 5.3.0. |
|
Is it possible that our previous build of jemalloc added that prefix? And now we're stripping it? @rickystewart - let me know whether you think it's worth removing the prefix in the Go code, or if you think it's worth digging into the build of jemalloc. I might need some help in that case. |
Yes, CI was not failing prior to my patch as it was still using the older prebuilt artifacts.
I think I am not the person you should ask this question to. From my perspective as long as the build is working and the executable runs, I'm happy. :) However there may be others on the team who are more acutely interested in the anomaly. Since the bugs you linked talk about the possibility of deadlocks, I would do some local testing on a macOS machine before making any decision. Testing the build from a normal local |
Not too worried about the deadlock issue in that linked issue. We have that patch now in upstream 5.3.0. Basic testing locally, on macOS, did not reveal anything weird.
Understood - could this be a build system thing though? Looking at the prebuilt artifacts in the patch you had me include, I see the # Old: https://storage.googleapis.com/public-bazel-artifacts/c-deps/20220708-170245/libjemalloc_foreign.macos.20220708-170245.tar.gz
$ nm -am ./libjemalloc-before.a 2> /dev/null | grep _zone_register
0000000000000000 (__TEXT,__text) external _je_zone_register
(undefined) external _malloc_zone_register
# New: https://storage.googleapis.com/public-bazel-artifacts/c-deps/20221208-201744/libjemalloc_foreign.macos.20221208-201744.tar.gz
$ nm -am ./libjemalloc-after.a 2> /dev/null | grep _zone_register
(undefined) external _malloc_zone_register
0000000000000000 (__TEXT,__text) external _zone_registerI also built this natively on macOS and I get the expected symbols: $ bazel build //pkg/cmd/cockroach-short:cockroach-short --force_build_cdeps
$ nm -am _bazel/bin/c-deps/libjemalloc_foreign/lib/libjemalloc.a 2> /dev/null | grep _zone_register
0000000000000000 (__TEXT,__text) external _je_zone_register
(undefined) external _malloc_zone_registerIn the jemalloc configure script, I see a I also found this commit in another project that has an explanation for why the symbols may not be prefixed. Is it possible our artifact creation scripts are different now, and aren't doing the mangling we need? |
Yes, I would say some build system thing is the likeliest explanation for the symbol name mismatch. I would expect the quirk to be inside of |
|
How does one build the artifacts? I'll run that locally and see if I can fix it to include the correct |
|
Note that the build in Bazel is single-threaded so you can't really quickly iterate on it. You might find it faster to just |
|
I think I'm getting closer to understanding the actual reason this isn't linking when doing a cross-compile for macOS.
Here's what I see in the build log from within the build container: $ bazel build //c-deps:libjemalloc_foreign --force_build_cdeps -s --verbose_failures --sandbox_debug --config=crossmacos
...
INFO: Build completed successfully, 2 total actions
$ cat _bazel/out/k8-fastbuild/bin/c-deps/libjemalloc_foreign_foreign_cc/Configure.log
...
nm -a src/jemalloc.sym.o | mawk -f include/jemalloc/internal/private_symbols.awk > src/jemalloc.sym
nm: src/jemalloc.sym.o: file format not recognized
...
$ file /home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/sandbox/processwrapper-sandbox/1/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/c-deps/libjemalloc_foreign.build_tmpdir/src/jemalloc.sym.o
/home/roach/.cache/bazel/_bazel_roach/cc377fc379544923cc7508dd261e4a48/sandbox/processwrapper-sandbox/1/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/c-deps/libjemalloc_foreign.build_tmpdir/src/jemalloc.sym.o: Mach-O 64-bit x86_64 object, flags:<|SUBSECTIONS_VIA_SYMBOLS>One solution is to make the Building on the above, we could set the prefix to Alternatively, if we can work out how to make the @rickystewart - any thoughts on any of this? |
To me this seems like a very probably huge amount of work with dubious benefit.
This seems like the sanest approach to me. Making sure the symbol name is consistent across all versions of the library (prebuilt, non-prebuilt) seems like a good idea. |
|
I was able to confirm that this upstream commit is the one that breaks things for us. Unsurprisingly, it introduces
I'll try this and report back. |
|
Ok I think I have a workable solution to this. LLVM's To test this out, I added diff --git a/build/bazelbuilder/Dockerfile b/build/bazelbuilder/Dockerfile
index 4360b2affb..7fda560fcf 100644
--- a/build/bazelbuilder/Dockerfile
+++ b/build/bazelbuilder/Dockerfile
@@ -76,6 +76,10 @@ RUN case ${TARGETPLATFORM} in \
google-chrome-stable ;; \
esac
+# Replace nm with llvm-nm
+RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends llvm \
+ && ln -sf /usr/bin/llvm-nm /usr/bin/nm
+
RUN apt-get purge -y \
apt-transport-https \
flex \I was then able to build jemalloc, and the resulting artifact had the symbols that the linker expects. Artifact with the current build container ( Artifact from the patched build container: Note the @rickystewart - what do you think about using LLVM's |
|
@nicktrav Yeah, please get the |
Currently, the build containers contain the standard GNU version of `nm`. This version of the tool is sufficient for the current build, which does not need to directly interact with binaries cross compiled for platforms other than Linux. When pulling in the latest version of `jemalloc` (`5.3.0`), builds for cross compiled for macOS requiring using `nm` to perform name mangling. As the GNU version of `nm` in the Linux build container does not know how to read Macho-O binaries, the name mangling does not occur, and the Cockroach binary will fail to link. Use the version of `nm` that ships with LLVM (`llvm-nm`) in place of GNU `nm`. This version of `nm` can understand Macho-O binaries and perform the name mangling as intended, resolving the linking issue. Update the build container version to `20230103-212639`. Touches cockroachdb#83289. Informs cockroachdb#93045. Release note: None.
94196: build: use llvm-nm in place of nm r=rickystewart a=nicktrav Currently, the build containers contain the standard GNU version of `nm`. This version of the tool is sufficient for the current build, which does not need to directly interact with binaries cross compiled for platforms other than Linux. When pulling in the latest version of `jemalloc` (`5.3.0`), builds for cross compiled for macOS requiring using `nm` to perform name mangling. As the GNU version of `nm` in the Linux build container does not know how to read Macho-O binaries, the name mangling does not occur, and the Cockroach binary will fail to link. Use the version of `nm` that ships with LLVM (`llvm-nm`) in place of GNU `nm`. This version of `nm` can understand Macho-O binaries and perform the name mangling as intended, resolving the linking issue. Touches #83289. Informs #93045. Release note: None. Epic: CRDB-20293 94700: kvserver: fix race in TestSchedulerBuffering r=erikgrinaker a=pavelkalinnikov In TestSchedulerBuffering there is a race between adding a new event to the queue and finishing the processing of the previous event. The scheduler processes this correctly, but when the ticks need to be dropped, the number of processed ticks is non-deterministic. This commit fixes the flake. It introduces a Raft ready interceptor which effectively allows blocking the processing loop while ticks are being added. This imitates the real world scenario. Fixes #94303 Epic: none 94709: replicastats: keep locality during mistmatch r=andrewbaptist a=kvoli Previously it was possibly for a panic to occur due to a nil ptr de-reference in `replicastats.PerLocalityDecayingRate()`. This situation arose when merging/splitting replica stats' with mismatched locality tracking; where one replicastats tracked the locality of requests and the other did not. This occurence was infrequent however could occur when merge or split is called with a replica which existed before the storepool initialization and a replica which was created after. This patch resolves this issue by ensuring that the the locality tracking on either side post-(split|merge) is consistent with the initial locality tracking, rather than attempting to reset either partially. fixes: #94696 Release note: None 94721: scripts: `dev generate` before updating dependencies r=jbowens a=jbowens See #94712 (review) Epic: None Release note: None Co-authored-by: Nick Travers <travers@cockroachlabs.com> Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com> Co-authored-by: Austen McClernon <austen@cockroachlabs.com> Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
df96697 to
6ff44d3
Compare
|
@rickystewart - rebased and pushed. I assume we'll need another patch with the pre-built c-deps, as above? |
|
Not sure if this will be sufficient or if we need more changes to the |
|
Try the following patch now. |
Update jemalloc to point to the upstream 5.3.0 release, hosted on our internal fork. This removes two custom patches that are no longer required: - Fix deadlock in multithreaded fork in OS X - fix upstreamed in jemalloc/jemalloc#954. - Fix JEMALLOC_MUTEX_INIT_CB to only be set if OSS_PINLOCK is false - spinlock support was removed upstream in jemalloc/jemalloc#1367 Touches cockroachdb#83289. Epic: CRDB-20293. Release note: None.
6ff44d3 to
1debfd3
Compare
|
I think we're looking good here. Everything green in CI. I also tested the following combinations:
Everything linked and I could run without issue on the appropriate target platform. Putting this into the queue ... TFTR! bors r=rickystewart |
|
Nice! |
|
Build failed (retrying...): |
|
Build succeeded: |





Update jemalloc to point to the upstream 5.3.0 release, hosted on our internal fork. This removes two custom patches that are no longer required:
Touches #83289.
Closes #17013. Closes #83289.
Epic: CRDB-20293.
Release note (performance improvement): The memory allocator,
jemallocwas updated to the latest available upstream version, 5.3.0, from 4.5.0. This update pulls in a number of upstream improvements, including reduced memory fragmentation for memory allocated outside of the Go runtime (i.e. the Pebble block and table caches), resulting in better memory utilization.