Conversation

@thouis (Contributor) commented Jun 8, 2012

A simpler version of the numpy allocation logging facility. This is pure C in the malloc/free/realloc functions, with a dynamic logging buffer that Python code can fetch on demand.

It still lacks documentation for the C allocation and Python logging-related functions, and a home in the namespace. numpy.testing?

thouis added 2 commits June 8, 2012 19:26
First step to enabling allocation tracing for numpy arrays; this just
exposes the functions without any logging implemented.
This adds logging of allocations via the PyDataMem_NEW/FREE/RENEW
functions, controlled and accessed by the functions
alloc_log_enable(), alloc_log_disable(), alloc_log_isenabled(),
alloc_log_fetch(), and alloc_log_clear().

alloc_log_fetch() returns a list of tuples of
  (event_type, inptr, outptr, size)
where event_type is one of "malloc", "free", or "realloc",
corresponding to calls of the form:
  outptr = malloc(size)
  free(inptr)
  outptr = realloc(inptr, size)

Both alloc_log_fetch() and alloc_log_clear() clear the log.

TODO: python function documentation
TODO: place in numpy namespace somewhere other than multiarray
    (perhaps numpy.testing?)
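
A minimal usage sketch of the API described above; the final namespace is still a TODO, so the import location below is only an assumption:

import numpy as np
from numpy.core.multiarray import alloc_log_enable, alloc_log_disable, alloc_log_fetch  # assumed location

alloc_log_enable()
a = np.ones(10**6)   # data buffer allocated via PyDataMem_NEW -> "malloc" event
del a                # data buffer released via PyDataMem_FREE -> "free" event
alloc_log_disable()

for event in alloc_log_fetch():
    # each entry is an (event_type, inptr, outptr, size) tuple
    print(event)
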
@thouis (Contributor Author) commented Jun 8, 2012

Also needs some examples, but I have several ready.

Member:

This should probably call alloc_log_clear (so just return alloc_log_clear(NULL);)

Contributor Author:

I don't agree. The user may want to capture the log for a piece of code by enabling/disabling around it and fetching the log afterwards.

Member:

Then maybe alloc_log_enable should clear it. The point is that we don't want log entries to get mixed up between different logging sessions, since that would silently give misleading results.

Contributor Author:

Perhaps. I wonder if there are cases it makes more difficult, though. One example would be:

for i in range(100):
    alloc_log_enable()
    result = my_func()
    alloc_log_disable()
    other_computation(result)
log = alloc_log_fetch()
# analyze log for problems in my_func()

It would be a bit more difficult to do this if enabling cleared the log, but not impossible.

Perhaps enable/disable should take a keyword argument clear_log that defaults to True.

Member:

That could also just be written

log = []
for i in range(100):
    alloc_log_enable()
    result = my_func()
    alloc_log_disable()
    log += alloc_log_fetch()
    other_computation(result)

Slightly less efficient to keep appending to a list like that, but memory overhead always creates time overhead; no one will be enabling this and then expecting maximum speed while it's turned on.

Member:

@cournape: Re: LD_PRELOAD: You really think it would be easier to add a new shared library (multiarray_alloc.so or something) containing just these functions and then arrange for it to be dynamically linked to multiarray.so on all supported platforms? IIUC that would require at least one of: screwing around with RPATH, installing the library into a systemwide shared directory, or using dlsym directly. And it still wouldn't work on Win32...

@thouis: I may not be understanding what you mean about complexity. IIUC, with the callback case, the data collection code looks like:

import traceback

_heap_log = []
def record_heap_action(action, inptr, outptr, size):
    _heap_log.append((action, inptr, outptr, size, traceback.extract_stack()))
numpy.something.set_malloc_callback(record_heap_action)

With the log version, it's something like

_heap_log = []
def poll_heap_log():
    log = np.something.alloc_log_fetch()
    if log:
        stack = traceback.extract_stack()
        for entry in log:
            _heap_log.append(entry + (stack,))

plus a bunch of arcane gunk to set up a trace function; I'd have to stare at the sys.settrace docs to work out how that would look. Plus this adds much more overhead (assuming most lines of code are not actually allocating any numpy arrays). So the callback version is significantly less complex on the user side, IMHO.

(I'm assuming that people will use some front-end tool that then produces reports based on these logs, or dumps kcachegrind-compatible output files, etc.)

There's still the question about re-entering Python from malloc/free call-sites, but really this should work. Every DECREF is already a potential Python re-entry point that might execute arbitrary code.

Contributor Author:

@njsmith: The complexity is in extracting the relevant parts of the stack trace, which, in the cases I've dealt with so far, means finding the first line in the library you care about. In the callback case, this complexity is higher, because you have to walk the stack. In the log case, you only poll on the relevant lines. This is the trace function I'm using for profiling CellProfiler, for instance:

def trace_memory_usage(self, frame, event, arg):
    # We get line and return events just before they are executed.
    if self.path in frame.f_code.co_filename and \
            event in ('line', 'return'):
        new_state = (frame.f_code.co_filename, frame.f_lineno)
        if self.state != new_state:
            meminfo = get_memory()
            if meminfo['allocations'] + meminfo['frees'] > 16:
                meminfo['lineinfo'] = self.state
                meminfo['time'] = time.time()
                self.trace.append(meminfo)
            self.state = new_state

    # continue tracing
    return self.trace_memory_usage

self.path is the path to the directory with the code being profiled. get_memory() fetches and condenses the log to something useful. The overhead is not particularly high, since it only triggers on lines within the relevant code, and if the log is empty, very little work is done. I'd estimate running np.test() while profiling numpy takes only about 1.2 times as long.
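
get_memory() itself isn't shown here; a rough sketch of what condensing the fetched log might look like, assuming the tuple format from the commit messages — the real version is in the script linked in a later comment:

def get_memory():
    log = np.core.multiarray.alloc_log_fetch()   # assumed location of the fetch function
    return {
        'allocations': sum(1 for event, _, _, _ in log if event in ('malloc', 'realloc')),
        'frees': sum(1 for event, _, _, _ in log if event == 'free'),
        'bytes_requested': sum(size for event, _, _, size in log if event != 'free'),
    }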

I expect most people could make use of a simple profiling class, sidestepping most of this. I have one based on https://github.com/fabianp/memory_profiler that I'll add to this PR in the tools directory when it's cleaned up.

That said, I'd be fine with either version, as an intended user of the code. I don't know whether it's better to push on the code to make sure it's not fragile (the callback version), or to avoid changing assumptions about underlying behavior, even if those assumptions are incorrect (this version).

Contributor Author:

@njsmith: This is basically the complete script I'm using for profiling: https://gist.github.com/62ae8b026f82b45dc597

Member:

Figuring out the best granularity of analysis is best done at report time, not record time. I recommend playing with kcachegrind some time -- after you've "zoomed" in and out a few times on the fly, you'll never want to write that kind of stack walking code again.

And anyway, if you are going to do line-by-line tracing in any case, while turning the memory profiler on and off as you go, then you can do this just as easily with either the callback or logging APIs. (Reimplementing the logging API using the callback API is like 5 lines of code, most of them function signatures.) So the complexity of the callback version is never higher.
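
For concreteness, a sketch of that reimplementation, using the hypothetical set_malloc_callback name from earlier in this thread:

_log = []

def _record(action, inptr, outptr, size):
    _log.append((action, inptr, outptr, size))

def alloc_log_enable():
    numpy.something.set_malloc_callback(_record)

def alloc_log_disable():
    numpy.something.set_malloc_callback(None)

def alloc_log_fetch():
    entries = list(_log)
    del _log[:]
    return entries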

Contributor Author:

@njsmith: Good point. I think either version could produce that sort of output, with about the same amount of code.

@njsmith (Member) commented Jun 8, 2012

I think that's all my comments on the code itself, but you should take the design discussion to the list too. I actually think the Python-callback version was somewhat clearer, not sure what was wrong with it, but I don't have a strong opinion. But "do we want this", "is this the right API" are list questions, not code review questions.

@njsmith (Member) commented Jun 8, 2012

Oh right, and one other standard code review point: needs docs and tests before it could be merged.

thouis added 3 commits June 9, 2012 22:29
This should be reverted if npysort is merged into the multiarray
module build, as the PyDataMem_NEW/FREE functions will be available
again in that case.
@cournape (Member) commented:

@njsmith re LD_PRELOAD, it is clearly overkill for the feature being discussed, but having this implemented would be useful for other interesting use cases. In general, I am worried about adding a very specific API for use cases that can be supported with other tools on most platforms.

@thouis (Contributor Author) commented Jun 11, 2012

@cournape: I agree that "there should be one obvious way to do it" (to paraphrase), but I think there are factors that make numpy suitable for this feature:

  • numpy arrays use orders of magnitude more memory than most Python objects, and this is often a limiting factor in algorithms relying on numpy.
  • numpy can behave in complicated ways with regard to memory management, e.g., views, OWNDATA, etc., making it sometimes difficult to know where memory usage problems are happening and why.
  • numpy attracts a large number of programmers who have only worked with high-level languages (Python/R/Matlab/etc.), who don't have the skills (or time to acquire them) to use other tools, but still need to be able to diagnose these sorts of problems.
  • Other tools are not well integrated with Python.

I would also prefer a solution that wasn't specific to numpy. How about adding a PyDataMem_HOOK variable, NULL by default, with the same signature as _alloc_log_event, called on each malloc/free/realloc when non-NULL? I'm not sure whether global variables and the PyArray_API stuff (which I still don't completely understand) would support this, so perhaps there needs to be a PyDataMem_SET_HOOK() function.

If, on the other hand, the consensus is eventually to add tracing facilities directly to numpy, I'd suggest that they spend a few releases behind an #ifdef gate, off by default, to limit their use to a more knowledgeable subset of the numpy user base until support issues have been ironed out.

@njsmith (Member) commented Jun 14, 2012

@cournape: Maybe I'm not understanding what you're proposing for LD_PRELOAD, because the version I'm understanding (moving specifically these three functions into a shared library so that they're called by indirection through the linker) seems just as specific to this problem as the current PR does, while being absurdly more complex.

I'm not sure how useful I'd find this patch personally - I know enough to work out where copies are being made on my own, and would find a general Python memory profiler more useful - but it clearly has at least some user demand and is very low-hanging fruit. (Unlike a real Python memory profiler, which will probably never happen.) And to the best of my knowledge there simply isn't any good way to match up memory usage to specific lines of Python code on any popular platform, esp. Linux (which I'm most familiar with). If you know one, I'd be honestly interested in hearing about it.

@thouis: That sounds like an excess of caution to me -- there's no magic about 'malloc', it's just a C function. Anyone who uses this API directly will have to know what they're doing, but that's just the nature of low-level debugging/profiling APIs. I know you ran into a number of weird issues when you started doing this, but AFAICT that's just because there was existing buggy C code; those issues could have started showing up on any change, really, and you just got unlucky.

@thouis (Contributor Author) commented Jun 14, 2012

@njsmith: Actually, I think just adding a hook function may be better than either of the other approaches. Logging doesn't support calling back into Python (and so loses some immediacy), while Python callbacks may prevent effective debugging in multithreaded code or some large-scale or high-performance cases. Setting up a callback is pretty easy via Cython, so a hook facility is accessible to a fair number of numpy users, and examples of both logging and callbacks can be included with numpy or made available in other projects.

I'm also looking for consensus, and I don't know that either the logging or callback approaches can get it. It's early enough that I don't know that either should, but putting in a hook allows easy experimentation.

@cournape (Member) commented:

@njsmith As for LD_PRELOAD, I meant that having the means to do this in general (private shared libraries) would be very beneficial outside this precise use case. I am not arguing for doing it just for this; that would be insane.

For memory profiling, there is a newcomer that may be of some interest to you: https://github.com/sjagoe/pikos

@njsmith (Member) commented Jun 14, 2012

@cournape OH, you mean, it would be good to patch the linker so that LD_PRELOAD was usable in more situations? Well, I guess I agree with that, but yeah, it doesn't seem very relevant here :-). Also, looking at the Pikos slides, it isn't a memory profiler at all -- it's just a front-end to existing tools, and the motivation for this patch in the first place is that there are no reliable existing tools.

@thouis: Sure, a C callback approach seems fine. It's even easy to write a C callback in pure Python using ctypes -- though unfortunately I guess there'd be no way to get at the hook setting function using ctypes, so you'd need a C compiler for that.
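
A rough sketch of what such a ctypes callback might look like; the hook signature here just mirrors the (action, inptr, outptr, size) form discussed above, and both it and the setter name are assumptions:

import ctypes

# assumed hook signature: void hook(const char *action, void *inptr, void *outptr, size_t size)
ALLOC_HOOK = ctypes.CFUNCTYPE(None, ctypes.c_char_p, ctypes.c_void_p,
                              ctypes.c_void_p, ctypes.c_size_t)

def _record(action, inptr, outptr, size):
    print(action, inptr, outptr, size)

hook = ALLOC_HOOK(_record)   # keep a reference so the callback isn't garbage collected
# Installing it would still need the C-level setter (e.g. a hypothetical
# PyDataMem_SET_HOOK) exposed through a small C or Cython extension module.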

@thouis (Contributor Author) commented Jun 15, 2012

Closed in favor of #309

@thouis closed this Jun 15, 2012