This repository was archived by the owner on Nov 17, 2023. It is now read-only.

CachedOp performance regression #15067

@lanking520

Description

I have recently been benchmarking CachedOp performance and observed a regression in the results. Please see the table below:

| Instance   | Module API | CachedOp with static | CachedOp without static |
|------------|------------|----------------------|-------------------------|
| p2.8xlarge | 43ms       | 42ms                 | 51ms                    |
| p3.2xlarge | 11ms       | 19ms                 | 16ms                    |
| c5.4xlarge | 36ms       | 38ms                 | 42ms                    |

I would like to highlight the GPU comparison: on P2 there is a performance gain with the flags set, but on P3 there is a regression.

imported_net.hybridize(static_alloc = True, static_shape = True)

In theory, setting these two flags should boost performance, since memory is reused across calls. However, on the larger GPU it does not appear to perform as expected.
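For reference, the three columns in the table correspond roughly to these setups (`imported_net` refers to the Gluon block in the script below; the Module API row uses the older symbol-binding path rather than hybridize):

```python
# CachedOp with static buffers (the flags under discussion):
imported_net.hybridize(static_alloc=True, static_shape=True)

# CachedOp without static buffers (default hybridize):
imported_net.hybridize()

# Module API baseline: the pre-Gluon mx.mod.Module bind/forward path,
# which does not go through CachedOp at all.
```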

I used the nightly builds:

pip3 install mxnet-cu92mkl --pre
pip3 install mxnet-mkl --pre

Benchmark Script

import mxnet as mx
from mxnet import ndarray as nd
import numpy as np
import json, time, os
from mxnet import gluon

path='http://data.mxnet.io/models/imagenet/'
[mx.test_utils.download(path+'resnet/152-layers/resnet-152-0000.params'),
mx.test_utils.download(path+'resnet/152-layers/resnet-152-symbol.json'),
mx.test_utils.download(path+'synset.txt')]


def compute_stats(perf_results, results):
  results["average"] = np.average(perf_results)
  results['tp50'] = np.percentile(perf_results, 50)
  results['tp90'] = np.percentile(perf_results, 90)
  results['tp99'] = np.percentile(perf_results, 99)

ctx_str = os.environ.get('BENCHMARK_CTX', 'CPU')

if ctx_str == 'GPU':
  ctx = mx.gpu(0)
else:
  ctx = mx.cpu()

benchmark = {}

prefix = 'resnet-152'

# Model load time
t1 = time.time()
imported_net = gluon.nn.SymbolBlock.imports(prefix + '-symbol.json', ['data', 'softmax_label'],
                                            prefix + '-0000.params')
t2 = time.time()
elapsed = (t2 - t1) * 1000

imported_net.hybridize(static_alloc = True, static_shape = True)

benchmark['ModelLoadTime'] = elapsed

fname = mx.test_utils.download('https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)


# convert into NCHW format (batch, channel, height, width)
img = mx.image.imresize(img, 300, 300) # resize to 300x300
img = img.transpose((2, 0, 1)) # HWC -> CHW (channel first)
img = img.expand_dims(axis=0) # add batch dimension
img = img.astype('float32')

sf_label = nd.ones((1,)) # dummy softmax label

if ctx_str == 'GPU':
  # all inputs must live on the same context
  img = img.as_in_context(ctx)
  sf_label = sf_label.as_in_context(ctx)

# First Inference
t1 = time.time()
op = imported_net(img, sf_label)
op.wait_to_read()
t2 = time.time()
elapsed = (t2 - t1) * 1000

benchmark['FirstInferCall'] = elapsed

times = 100
time_cost = []

for idx in range(0, times):
  t1 = time.time()
  op = imported_net(img, sf_label)
  op.wait_to_read()
  t2 = time.time()
  elapsed = (t2 - t1) * 1000
  time_cost.append(elapsed)
  print("time cost: ", elapsed, "ms")

# Extra cost of the first call (graph construction/caching) over a steady-state call;
# use a separate key so ModelLoadTime recorded above is not overwritten.
benchmark['FirstInferOverhead'] = benchmark['FirstInferCall'] - time_cost[0]
compute_stats(time_cost, benchmark)

output = json.dumps(benchmark)

with open('Inf.json', 'w') as f:
  f.write(output)
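As a quick sanity check, the compute_stats helper from the script can be exercised on its own; the latency values here are illustrative only:

```python
import numpy as np

def compute_stats(perf_results, results):
  # Same helper as in the benchmark script: summarize per-call latencies (ms).
  results["average"] = np.average(perf_results)
  results['tp50'] = np.percentile(perf_results, 50)
  results['tp90'] = np.percentile(perf_results, 90)
  results['tp99'] = np.percentile(perf_results, 99)

stats = {}
compute_stats([10.0, 11.0, 12.0, 50.0], stats)
print(stats["average"], stats["tp50"])  # 20.75 11.5
```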
