ARROW-4296: [Plasma] Use one mmap file by default, prevent crash with -f #3490
pcmoritz wants to merge 4 commits into apache:master from
Conversation
2f5e7c6 to 117d03a
Remove the command line flag also.
Why replace 1 with SMALL_OBJECT_SIZE?
Because right now the memory capacity can (and will!) be off by some amount, due to the way dlmalloc computes the footprint limit.
We know that this won't cause the page to get unmapped?
I tried it out and it will, but upon remapping, dlmalloc will use the large granularity size, so this has the desired effect of using one large mmap file.
robertnishihara
left a comment
If we're truly using one huge page now then we should just send the relevant file descriptor to the clients as soon as they connect and never again.
We should do it in a different PR, but maybe file an issue?
Remove the f option from the getopt call above.
Add a note saying that this relies on implementation details of dlmalloc (after the initial memory-mapped file is unmapped, subsequent memory-mapped files will use the same granularity as the first page), and that if we switch to jemalloc this may need to change.
Alternatively, you could do

    void* pointer_big = plasma::dlmemalign(kBlockSize, system_memory - 128 * sizeof(size_t));
    // We do not deallocate this small object so that the memory-mapped file never gets unmapped.
    void* pointer_small = plasma::dlmalloc(1);
    plasma::dlfree(pointer_big);
While you're looking at this code, any idea about ray-project/ray#3670?
Somehow the way the object store computes the total system memory is different from the way we do it in Python (e.g., using psutil).
Not sure what is going on, but the only way to guarantee that the two sizes are consistent is to use the same method in both (either psutil or the system call we use here).
Well we can't really use psutil here because it is in C++, right?
This does not guarantee that we use only one memory-mapped file; it just makes it so that there is at least one that is large enough to fit all the objects.
robertnishihara
left a comment
LGTM once tests pass. Note that valgrind might complain about pointer_small getting leaked.
1e33c7c to 55c0f27
This is strange. It looks like the plasma client_tests are hanging. They are running in valgrind and get to a certain point, and then the rest of the valgrind output is never printed. On EC2 on Ubuntu 14.04, running the tests in valgrind works, however.
We recently switched our CI to Ubuntu 16.04 after the conda-forge compiler migration, so it may be specific to 16.04.
Thanks for the pointer. Wow, I can actually reproduce in 16.04! EDIT: It just takes a long time to finish, about 30 seconds on the EC2 machine.
I understand what is going on now. In verbose mode, valgrind prints where it is hanging. The memory consumption of the process goes up during that time (up to ~10 GB), so probably on Travis it started swapping and hanging. Reducing the plasma store memory in the test seems to fix it. However, I'd still like to understand where this PR introduces the 1 non-freed block (no such pointers seem to be found).
So apparently valgrind already did the search before this PR, but is now searching more memory because of the initial dlmemalign. So I'd say this is ready to merge if the tests pass now.
The failure is unrelated:
82f1dce to c447af5
This PR is similar to #3434 but also makes sure we have only one well-tested code path to go through.