Numpy alloc logging #301
Conversation
First step toward enabling allocation tracing for numpy arrays; this just exposes the functions, without any logging implemented yet.
This adds logging of allocations via the PyDataMem_NEW/FREE/RENEW
functions, controlled and accessed through
alloc_log_enable(), alloc_log_disable(), alloc_log_isenabled(),
alloc_log_fetch(), and alloc_log_clear().
alloc_log_fetch() returns a list of tuples of the form
(event_type, inptr, outptr, size),
where event_type is one of "malloc", "free", or "realloc",
corresponding to calls of the form:

    outptr = malloc(size)
    free(inptr)
    outptr = realloc(inptr, size)
Both alloc_log_fetch() and alloc_log_clear() clear the log.
TODO: python function documentation
TODO: place in numpy namespace somewhere other than multiarray
(perhaps numpy.testing?)
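A minimal usage sketch (the import path is an assumption -- the functions currently live in multiarray, and their final home is the TODO above):

    import numpy as np
    from numpy.core.multiarray import (alloc_log_enable, alloc_log_disable,
                                       alloc_log_fetch)

    alloc_log_enable()
    a = np.arange(10) * 2.0   # allocates via PyDataMem_NEW
    del a                     # frees via PyDataMem_FREE
    alloc_log_disable()

    for event_type, inptr, outptr, size in alloc_log_fetch():  # fetching clears the log
        print(event_type, inptr, outptr, size)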
Also needs some examples, but I have several ready.
This should probably call alloc_log_clear (so just return alloc_log_clear(NULL);)
I don't agree. The user may want to capture the log for a piece of code by enabling/disabling around it and fetching the log afterwards.
Then maybe alloc_log_enable should clear it. The point is that we don't want log entries to get mixed up between different logging sessions, since that would silently give misleading results.
Perhaps. I wonder if there are cases it makes more difficult, though. One example would be:
    for i in range(100):
        alloc_log_enable()
        result = my_func()
        alloc_log_disable()
        other_computation(result)
    log = alloc_log_fetch()
    # analyze log for problems in my_func()
It would be a bit more difficult to do this if enabling cleared the log, but not impossible.
Perhaps enable/disable should take a keyword argument clear_log that defaults to True.
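Something like this, if done as a thin wrapper (the clear_log argument and this wrapper are hypothetical, not part of the PR):

    def logging_session_start(clear_log=True):
        # proposed semantics: each enable starts a fresh session by default
        if clear_log:
            alloc_log_clear()
        alloc_log_enable()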
That loop could also just be written:

    log = []
    for i in range(100):
        alloc_log_enable()
        result = my_func()
        alloc_log_disable()
        log += alloc_log_fetch()
        other_computation(result)
Slightly less efficient to keep appending to a list like that, but memory overhead always creates time overhead anyway; no one will enable this and then expect maximum speed while it's turned on.
@cournape: Re: LD_PRELOAD: You really think it would be easier to add a new shared library (multiarray_alloc.so or something) containing just these functions and then arrange for it to be dynamically linked to multiarray.so on all supported platforms? IIUC that would require at least one of: screwing around with RPATH, installing the library into a systemwide shared directory, or using dlsym directly. And it still wouldn't work on Win32...
@thouis: I may not be understanding what you mean about complexity. IIUC, with the callback case, the data collection code looks like:
    import traceback

    _heap_log = []

    def record_heap_action(action, inptr, outptr, size):
        _heap_log.append((action, inptr, outptr, size,
                          traceback.extract_stack()))

    numpy.something.set_malloc_callback(record_heap_action)
With the log version, it's something like:

    _heap_log = []

    def poll_heap_log():
        log = np.something.alloc_log_fetch()
        if log:
            stack = traceback.extract_stack()
            for entry in log:
                _heap_log.append(entry + (stack,))
plus a bunch of arcane gunk to set up a trace function; I'd have to stare at the sys.settrace docs to work out how that would look. Plus this adds much more overhead (assuming most lines of code are not actually allocating any numpy arrays). So the callback version is significantly less complex on the user side, IMHO.
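Probably something in the vicinity of this, untested (a sketch using poll_heap_log from above):

    import sys

    def _tracer(frame, event, arg):
        # poll the allocation log as each line executes
        if event == 'line':
            poll_heap_log()
        return _tracer

    sys.settrace(_tracer)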
(I'm assuming that people will use some front-end tool that then produces reports based on these logs, or dumps kcachegrind-compatible output files, etc.)
There's still the question about re-entering Python from malloc/free call-sites, but really this should work. Every DECREF is already a potential Python re-entry point that might execute arbitrary code.
@njsmith: The complexity is in extracting the relevant parts of the stack trace, which, in the cases I've dealt with so far, means finding the first line in the library you care about. In the callback case, this complexity is higher, because you have to walk the stack. In the log case, you only poll on the relevant lines. This is the trace function I'm using for profiling CellProfiler, for instance:
    def trace_memory_usage(self, frame, event, arg):
        # We get line and return events just before they are executed.
        if self.path in frame.f_code.co_filename and \
                event in ('line', 'return'):
            new_state = (frame.f_code.co_filename, frame.f_lineno)
            if self.state != new_state:
                meminfo = get_memory()
                if meminfo['allocations'] + meminfo['frees'] > 16:
                    meminfo['lineinfo'] = self.state
                    meminfo['time'] = time.time()
                    self.trace.append(meminfo)
                self.state = new_state
        # continue tracing
        return self.trace_memory_usage
self.path is the path to the directory with the code being profiled. get_memory() fetches and condenses the log to something useful. The overhead is not particularly high, since it only triggers on lines within the relevant code, and if the log is empty, very little work is done. I'd estimate running np.test() while profiling numpy takes only about 1.2 times as long.
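In case it helps, get_memory() is along these lines (a condensed sketch, not the real version; np.something stands in for wherever the functions end up):

    def get_memory():
        # condense the raw log into summary counts for this span of lines
        log = np.something.alloc_log_fetch()   # fetching also clears the log
        return {
            'allocations': sum(1 for ev, _, _, _ in log if ev == 'malloc'),
            'frees': sum(1 for ev, _, _, _ in log if ev == 'free'),
            'bytes': sum(size for ev, _, _, size in log if ev != 'free'),
        }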
I expect most people could make use of a simple profiling class, sidestepping most of this. I have one based on https://github.com/fabianp/memory_profiler that I'll add to this PR in the tools directory when it's cleaned up.
That said, I'd be fine with either version, as an intended user of the code. I don't know whether it's better to push on the code to make sure it's not fragile (the callback version), or to avoid changing assumptions about the underlying behavior, even if those assumptions are incorrect (this version).
@njsmith: This is basically the complete script I'm using for profiling: https://gist.github.com/62ae8b026f82b45dc597
The best granularity of analysis is something to figure out at report time, not record time. I recommend playing with kcachegrind some time -- after you've "zoomed" in and out a few times on the fly, you'll never want to write that kind of stack-walking code again.
And anyway, if you are going to do line-by-line tracing in any case, while turning the memory profiler on and off as you go, then you can do this just as easily with either the callback or logging APIs. (Reimplementing the logging API using the callback API is like 5 lines of code, most of them function signatures.) So the complexity of the callback version is never higher.
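Concretely, something like this (set_malloc_callback being the hypothetical callback API from earlier in the thread):

    _log = []

    def alloc_log_enable():
        numpy.something.set_malloc_callback(lambda *event: _log.append(event))

    def alloc_log_disable():
        numpy.something.set_malloc_callback(None)

    def alloc_log_fetch():
        # return accumulated events and clear the log, like the C version
        fetched = list(_log)
        del _log[:]
        return fetched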
@njsmith: Good point. I think either version could produce that sort of output, with about the same amount of code.
I think that's all my comments on the code itself, but you should take the design discussion to the list too. I actually think the Python-callback version was somewhat clearer, not sure what was wrong with it, but I don't have a strong opinion. But "do we want this" and "is this the right API" are list questions, not code review questions.
Oh right, and one other standard code review point: needs docs and tests before it could be merged.
This should be reverted if npysort is merged into the multiarray module build, as the PyDataMem_NEW/FREE functions will be available again in that case.
@njsmith re LD_PRELOAD: it is clearly overkill for the feature being discussed, but having this implemented would be useful for other interesting use cases. In general, I am worried about adding a very specific API for use cases that can be supported with other tools on most platforms.
@cournape: I agree that "there should be one obvious way to do it" (to paraphrase), but I think there are factors that make numpy suitable for this feature:
I would also prefer a solution that wasn't specific to numpy. How about adding a PyDataMem_HOOK variable, NULL by default, with the same signature as _alloc_log_event, called on each malloc/free/realloc when non-NULL? I'm not sure whether global variables and the PyArray_API machinery (which I still don't completely understand) would support this, so perhaps there needs to be a PyDataMem_SET_HOOK() function. If, on the other hand, the consensus is eventually to add tracing facilities directly to numpy, I'd suggest that they spend a few releases behind an #ifdef gate, off by default, to limit their use to a more knowledgeable subset of the numpy user base until the support issues have been ironed out.
@cournape: Maybe I'm not understanding what you're proposing for LD_PRELOAD, because the version I'm understanding (moving specifically these three functions into a shared library so that they're called by indirection through the linker) seems just as specific to this problem as the current PR does, while being absurdly more complex. I'm not sure how useful I'd find this patch personally -- I know enough to work out where copies are being made on my own, and would find a general Python memory profiler more useful -- but it clearly has at least some user demand and is very low-hanging fruit. (Unlike a real Python memory profiler, which will probably never happen.) And to the best of my knowledge there simply isn't any good way to match up memory usage to specific lines of Python code on any popular platform, esp. Linux (which I'm most familiar with). If you know one, I'd be honestly interested in hearing about it.
@thouis: That sounds like an excess of caution to me -- there's no magic about 'malloc', it's just a C function. Anyone who uses this API directly will have to know what they're doing, but that's just the nature of low-level debugging/profiling APIs. I know you ran into a number of weird issues when you started doing this, but AFAICT that's just because there was existing buggy C code; those issues could have started showing up with any change, really, and you just got unlucky.
@njsmith: Actually, I think just adding a hook function may be better than either of the other approaches. Logging doesn't support calling back into Python (and so loses some immediacy), while Python callbacks may prevent effective debugging in multithreaded code or some large-scale or high-performance cases. Setting up a callback is pretty easy via Cython, so a hook facility is accessible to a fair number of numpy users, and examples of both logging and callback can be included with numpy or made available in other projects. I'm also looking for consensus, and I don't know that either the logging or callback approaches can get it. It's early enough that I don't know that either should, but putting in a hook allows easy experimentation.
@njsmith As for LD_PRELOAD, I meant that having the means to do this in general (private shared libraries) would be very beneficial outside this precise use case. I am not arguing for doing it just for this; that would be insane. For memory profiling, there is a newcomer that may be of some interest to you: https://github.com/sjagoe/pikos
@cournape OH, you mean, it would be good to patch the linker so that LD_PRELOAD was usable in more situations? Well, I guess I agree with that, but yeah, it doesn't seem very relevant here :-). Also, looking at the Pikos slides, it isn't a memory profiler at all -- it's just a front-end to existing tools, and the motivation for this patch in the first place is that there are no reliable existing tools.
@thouis: Sure, a C callback approach seems fine. It's even easy to write a C callback in pure Python using ctypes -- though unfortunately I guess there'd be no way to get at the hook-setting function using ctypes, so you'd need a C compiler for that.
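The callback-construction half via ctypes might look roughly like this (the hook signature is an assumption, mirroring the (action, inptr, outptr, size) log entries; installing c_hook would still need a small C or Cython shim):

    import ctypes

    # hypothetical hook signature: (action, inptr, outptr, size) -> void
    HOOKFUNC = ctypes.CFUNCTYPE(None, ctypes.c_char_p,
                                ctypes.c_void_p, ctypes.c_void_p,
                                ctypes.c_size_t)

    def my_hook(action, inptr, outptr, size):
        print(action, inptr, outptr, size)

    c_hook = HOOKFUNC(my_hook)  # keep a reference so it isn't garbage-collected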
Closed in favor of #309
A simpler version of the numpy allocation logging facility. This is pure C in the malloc/free/realloc functions, with a dynamic logging buffer that Python code can fetch on demand.
It still lacks documentation for the C allocation and Python logging-related functions, and a home in the namespace (numpy.testing?).