Skip to content

[core] Make preloading Jemalloc configurable for worker#47243

Merged
edoakes merged 1 commit intoray-project:masterfrom
Myasuka:jemalloc-fix
Jun 9, 2025
Merged

[core] Make preloading Jemalloc configurable for worker#47243
edoakes merged 1 commit intoray-project:masterfrom
Myasuka:jemalloc-fix

Conversation

@Myasuka
Copy link
Copy Markdown
Contributor

@Myasuka Myasuka commented Aug 21, 2024

Why are these changes needed?

The PR #39446 disables preloading Jemalloc for workers totally. However, Jemalloc is still useful in some cases, and we could make it configurable if user setting env RAY_LD_PRELOAD as 0.

The batch inference example code, using a TF model to infer the batch input of numpy's ndarray.

ds = ray.data.read_tfrecords(xxx)
ds.map_batches(BatchPredictor)
.map_batches(BatchPostProcessor)
.write_parquet(path=output_path

I did a inference test with limited memory, and we can see the OOM counts decrease from 900+ to 700.
image

Related issue number

Closes #47242

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Aug 22, 2024

@fishbone @rkooo567 could you please take a review?

@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) core Issues that should be addressed in Ray Core labels Aug 26, 2024
@hongchaodeng hongchaodeng self-assigned this Aug 26, 2024
@hongchaodeng hongchaodeng self-requested a review August 26, 2024 18:43
@rkooo567 rkooo567 self-assigned this Aug 26, 2024
@rkooo567 rkooo567 added the go add ONLY when ready to merge, run all tests label Aug 28, 2024
Copy link
Copy Markdown
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a new section in this page to explain how to enable jemalloc for workers? https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings

@hongchaodeng
Copy link
Copy Markdown
Member

Please add your inference example and benchmark as well.

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Aug 29, 2024

can you add a new section in this page to explain how to enable jemalloc for workers? https://docs.ray.io/en/master/ray-core/miscellaneous.html#tuning-ray-settings

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Aug 29, 2024

Please add your inference example and benchmark as well.

@hongchaodeng I have added the example in the description, and I think the previous pasted image could describe the benchmark well.

@rkooo567
Copy link
Copy Markdown
Contributor

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Sep 5, 2024

@rkooo567 Sure, I'll add a doc for this feature. However, why not adding the description in the memory-profiling doc?

I assume it is more of performance feature (that optimizes the memory usage). is enabling jemalloc allowing you to do mem profiling as well?

I see, enabling Jemalloc for workers could also benefit for memory profiling, I will mentation this in the tuning and profiling page.

@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 16, 2024
@MissiontoMars
Copy link
Copy Markdown

We also need to enable jemalloc for the worker processes because some C++ code is wrapped into Python workers, jemalloc is very effective for optimizing the memory of this part of c++ code.

@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@jjyao jjyao assigned dayshah and unassigned hongchaodeng and rkooo567 May 6, 2025
@dayshah
Copy link
Copy Markdown
Contributor

dayshah commented May 6, 2025

@Myasuka this makes sense to me, is this still something you want to merge?

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented May 12, 2025

@Myasuka this makes sense to me, is this still something you want to merge?

Sure, could you help to review and merge this PR? I will update this PR this week.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Jun 2, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 2, 2025
@Myasuka Myasuka requested a review from a team as a code owner June 2, 2025 18:19
@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Jun 2, 2025

Copy link
Copy Markdown
Contributor

@dstrodtman dstrodtman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RAY_LD_PRELOAD description was not parsable. Can you review/iterate on my edit to make sure it's technically correct?

Docs should never contain future tense ("will") or passive voice (be verb with a past tense verb). Thanks!

Copy link
Copy Markdown
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally lgtm, can you address the doc comments

Also, I would prefer if we rename the RAY_LD_PRELOAD env variable to something like RAY_LD_PRELOAD_ON_WORKERS, which gets set to 0 by default. This makes it easier for a user to understand without looking through code and docs.

@dstrodtman
Copy link
Copy Markdown
Contributor

@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.

@dayshah dayshah removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jun 2, 2025
@dayshah
Copy link
Copy Markdown
Contributor

dayshah commented Jun 2, 2025

@dayshah Just wondering, any reason why we're using 0/1 here instead of true/false? If this is a user-facing concept and the values are equivalent, the boolean operators are more friendly to humans.

I think it's because 0 and 1 usually translate to boolean values in C++, and the environment variables used in C++ are set using 0 or 1, like these

RAY_CONFIG(bool, report_actor_placement_resources, true)

I don't see a reason to not use true/false here though, since we're checking against the number.

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Jun 3, 2025

generally lgtm, can you address the doc comments

Also, I would prefer if we rename the RAY_LD_PRELOAD env variable to something like RAY_LD_PRELOAD_ON_WORKERS, which gets set to 0 by default. This makes it easier for a user to understand without looking through code and docs.

I think RAY_LD_PRELOAD_ON_WORKERS looks more readable, and we could setting it as false by default to align with previous behavior.

@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Jun 3, 2025

I have updated this PR, please take a look @dayshah @dstrodtman

@dayshah
Copy link
Copy Markdown
Contributor

dayshah commented Jun 3, 2025

@Myasuka Myasuka force-pushed the jemalloc-fix branch 2 times, most recently from 817c168 to 905ac3a Compare June 4, 2025 16:05
@Myasuka
Copy link
Copy Markdown
Contributor Author

Myasuka commented Jun 5, 2025

https://buildkite.com/ray-project/premerge/builds/41233 some lint failures

@dayshah updated and all tests passes.

Copy link
Copy Markdown
Contributor

@dayshah dayshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ty! lgtm @edoakes @jjyao to merge

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is "RAY_LD_PRELOAD" removed here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that RAY_LD_PRELOAD_ON_WORKERS looks better, if we still keep the undocumented RAY_LD_PRELOAD, it will make the code looks a bit weird.

generally lgtm, can you address the doc comments

Also, I would prefer if we rename the RAY_LD_PRELOAD env variable to something like RAY_LD_PRELOAD_ON_WORKERS, which gets set to 0 by default. This makes it easier for a user to understand without looking through code and docs.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did RAY_LD_PRELOAD only apply to workers previously? I think that might be my gap in understanding. If that's the case, then this change LGTM. I was assuming that RAY_LD_PRELOAD also applied to system-level processes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YES, RAY_LD_PRELOAD was first introduced in https://github.com/ray-project/ray/pull/39446/files and only be used within that PR.

We can set `RAY_LD_PRELOAD_ON_WORKERS` as `true` with `RAY_JEMALLOC_LIB_PATH` and `RAY_JEMALLOC_PROFILE` provided to also preload jemalloc for worker.
This is a fix after ray-project#39446

Signed-off-by: Yun Tang <myasuka@live.com>
@edoakes edoakes enabled auto-merge (squash) June 9, 2025 16:31
@edoakes edoakes merged commit 78e7c5a into ray-project:master Jun 9, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests P1 Issue that should be fixed within a few weeks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core] We should make preloading Jemalloc configurable for worker