CachedOp performance regression #15067
Recently I have been running benchmarks on CachedOp performance and observed a regression in the results. Please see the table below:
| Instance | Module API | CachedOp with static | CachedOp without static |
|---|---|---|---|
| p2.8xlarge | 43 ms | 42 ms | 51 ms |
| p3.2xlarge | 11 ms | 19 ms | 16 ms |
| c5.4xlarge | 36 ms | 38 ms | 42 ms |
I would like to highlight the GPU comparison: on p2.8xlarge there is a performance gain with the flags set, but on p3.2xlarge there is a regression.
imported_net.hybridize(static_alloc=True, static_shape=True)
In theory, setting these two flags should give a performance boost, since memory is allocated once and reused across calls. However, on the larger GPU (p3.2xlarge) it does not seem to perform as well.
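For reference, the "CachedOp without static" column uses the same block hybridized with default settings; a minimal sketch of the two CachedOp variants being compared (the baseline call is shown here only for illustration):
# CachedOp without static: default hybridization, buffers are managed per forward call
imported_net.hybridize()
# CachedOp with static: memory is planned once and reused across calls
imported_net.hybridize(static_alloc=True, static_shape=True)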
I used the nightly builds:
pip3 install mxnet-cu92mkl --pre
pip3 install mxnet-mkl --pre
Benchmark Script
import mxnet as mx
from mxnet import ndarray as nd
import numpy as np
import json, time, os
from mxnet import gluon
path='http://data.mxnet.io/models/imagenet/'
[mx.test_utils.download(path+'resnet/152-layers/resnet-152-0000.params'),
mx.test_utils.download(path+'resnet/152-layers/resnet-152-symbol.json'),
mx.test_utils.download(path+'synset.txt')]
def compute_stats(perf_results, results):
    results['average'] = np.average(perf_results)
    results['tp50'] = np.percentile(perf_results, 50)
    results['tp90'] = np.percentile(perf_results, 90)
    results['tp99'] = np.percentile(perf_results, 99)
ctx_str = os.environ['BENCHMARK_CTX']
if ctx_str == 'GPU':
    ctx = mx.gpu(0)
elif ctx_str == 'CPU':
    ctx = mx.cpu()
benchmark = {}
prefix = 'resnet-152'
# Model load time: load the exported symbol and params onto the benchmark context
t1 = time.time()
imported_net = gluon.nn.SymbolBlock.imports(prefix + '-symbol.json', ['data', 'softmax_label'],
                                            prefix + '-0000.params', ctx=ctx)
t2 = time.time()
elapsed = (t2 - t1) * 1000
imported_net.hybridize(static_alloc=True, static_shape=True)
benchmark['ModelLoadTime'] = elapsed
fname = mx.test_utils.download('https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)
# Convert into format (batch, RGB, width, height)
img = mx.image.imresize(img, 300, 300)  # resize
img = img.transpose((2, 0, 1))          # channel first
img = img.expand_dims(axis=0)           # batchify
img = img.astype('float32')
# Both inputs have to live on the benchmark context
sf_label = nd.ones((1), ctx=ctx)
img = img.as_in_context(ctx)
# First Inference
t1 = time.time()
op = imported_net(img, sf_label)
op.wait_to_read()
t2 = time.time()
elapsed = (t2 - t1) * 1000
benchmark['FirstInferCall'] = elapsed
times = 100
time_cost = []
for idx in range(0, times):
    t1 = time.time()
    op = imported_net(img, sf_label)
    op.wait_to_read()
    t2 = time.time()
    elapsed = (t2 - t1) * 1000
    time_cost.append(elapsed)
    print("time cost: ", elapsed, "ms")
# Overhead of the first inference call relative to a steady-state call
benchmark['FirstInferOverhead'] = benchmark['FirstInferCall'] - time_cost[0]
compute_stats(time_cost, benchmark)
output = json.dumps(benchmark)
with open('Inf.json', 'w') as f:
    f.write(output)
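The device is selected through the BENCHMARK_CTX environment variable; assuming the script above is saved as benchmark.py (the file name here is only illustrative), the runs look like:
BENCHMARK_CTX=GPU python3 benchmark.py   # p2.8xlarge / p3.2xlarge
BENCHMARK_CTX=CPU python3 benchmark.py   # c5.4xlarge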