detectron2_maskrcnn OOMs on eager with A100 40G. #120115

@ysiraichi

Description

🐛 Describe the bug

It looks odd that the eager run is asking to allocate 20209.02 GiB of memory.

python benchmarks/dynamo/torchbench.py \
    --accuracy --no-translation-validation --inference --bfloat16 \
    --backend inductor --disable-cudagraphs --device cuda --no-skip \
    -k '^detectron2_maskrcnn$'
cuda eval  detectron2_maskrcnn
Traceback (most recent call last):
  File "benchmarks/dynamo/common.py", line 2171, in validate_model
    self.model_iter_fn(model, example_inputs)
  File "benchmarks/dynamo/torchbench.py", line 469, in forward_pass
    return mod(*inputs)
  File "torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 150, in forward
    return self.inference(batched_inputs)
  File "/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 213, in inference
    results, _ = self.roi_heads(images, features, proposals, None)
  File "torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 747, in forward
    pred_instances = self._forward_box(features, proposals)
  File "/lib/python3.8/site-packages/detectron2/modeling/roi_heads/roi_heads.py", line 798, in _forward_box
    box_features = self.box_pooler(features, [x.proposal_boxes for x in proposals])
  File "torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/poolers.py", line 261, in forward
    output.index_put_((inds,), pooler(x[level], pooler_fmt_boxes_level))
  File "torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/layers/roi_align.py", line 58, in forward
    return roi_align(
  File "/lib/python3.8/site-packages/torchvision-0.18.0a0+a52607e-py3.8-linux-x86_64.egg/torchvision/ops/roi_align.py", line 236, in roi_align
    return _roi_align(input, rois, spatial_scale, output_size[0], output_size[1], sampling_ratio, aligned)
  File "/lib/python3.8/site-packages/torchvision-0.18.0a0+a52607e-py3.8-linux-x86_64.egg/torchvision/ops/roi_align.py", line 168, in _roi_align
    val = _bilinear_interpolate(input, roi_batch_ind, y, x, ymask, xmask)  # [K, C, PH, PW, IY, IX]
  File "/lib/python3.8/site-packages/torchvision-0.18.0a0+a52607e-py3.8-linux-x86_64.egg/torchvision/ops/roi_align.py", line 62, in _bilinear_interpolate
    v1 = masked_index(y_low, x_low)
  File "/lib/python3.8/site-packages/torchvision-0.18.0a0+a52607e-py3.8-linux-x86_64.egg/torchvision/ops/roi_align.py", line 55, in masked_index
    return input[
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20209.02 GiB. GPU 0 has a total capacity of 39.39 GiB of which 34.52 GiB is free. Process 7680 has 4.86 GiB memory in use. Of the allocated memory 4.22 GiB is allocated by PyTorch, and 119.07 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "benchmarks/dynamo/common.py", line 3826, in run
    ) = runner.load_model(
  File "benchmarks/dynamo/torchbench.py", line 405, in load_model
    self.validate_model(model, example_inputs)
  File "benchmarks/dynamo/common.py", line 2173, in validate_model
    raise RuntimeError("Eager run failed") from e
RuntimeError: Eager run failed
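For scale, here is a rough sketch of how the intermediate gather tensor in torchvision's Python `_roi_align` path can blow up. The traceback annotates it as `[K, C, PH, PW, IY, IX]`, and with bfloat16 each element costs 2 bytes. The dimension values below are illustrative assumptions, not taken from the failing run; they are chosen only to show how a tens-of-TiB allocation can arise from a handful of plausible-looking sizes (e.g. a pathological interpolation grid from bad proposal boxes).

```python
def gather_bytes(K, C, PH, PW, IY, IX, bytes_per_elem=2):
    """Bytes needed for the intermediate gather tensor of shape
    [K, C, PH, PW, IY, IX] (bytes_per_elem=2 for bfloat16)."""
    return K * C * PH * PW * IY * IX * bytes_per_elem

# Hypothetical numbers: 1000 ROIs (K), 256 channels (C), a 7x7 pooled
# output (PH, PW), and a degenerate ~930x930 interpolation grid (IY, IX).
size_gib = gather_bytes(K=1000, C=256, PH=7, PW=7, IY=930, IX=930) / 2**30
print(f"{size_gib:.2f} GiB")  # lands in the ~20000 GiB range
```

The point is that the per-ROI interpolation grid multiplies into every other dimension, so a single oversized proposal box can push the gather tensor far past any real GPU's capacity.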

Versions

cc @ezyang @msaroufim @bdhirsh @anijain2305 @zou3519 @chauhang @miladm @lezcano

Metadata

    Labels

triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
