CachedOp performance regression #15067
Recently I have been running benchmarks on CachedOp performance and observed a regression in the results. Please see the table below:
| Instance | Module API | CachedOp with static | CachedOp without static |
|---|---|---|---|
| p2.8xlarge | 43 ms | 42 ms | 51 ms |
| p3.2xlarge | 11 ms | 19 ms | 16 ms |
| c5.4xlarge | 36 ms | 38 ms | 42 ms |
I would like to highlight the GPU comparison: on p2.8xlarge there is a performance gain with the flags set, but on p3.2xlarge there is a regression.
imported_net.hybridize(static_alloc=True, static_shape=True)
In theory, setting these two flags should give a performance boost, since memory is allocated once and reused across calls. However, on the larger GPU (p3.2xlarge) it does not seem to perform as well.
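For reference, the "CachedOp without static" column uses the same block hybridized with default settings; a minimal sketch of the two CachedOp variants being compared (the baseline call is shown here only for illustration):
# CachedOp without static: default hybridization, buffers are managed per forward call
imported_net.hybridize()
# CachedOp with static: memory is planned once and reused across calls
imported_net.hybridize(static_alloc=True, static_shape=True)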
I used the nightly builds:
pip3 install mxnet-cu92mkl --pre
pip3 install mxnet-mkl --pre
Benchmark Script
import mxnet as mx
from mxnet import ndarray as nd
import numpy as np
import json, time, os
from mxnet import gluon
path='http://data.mxnet.io/models/imagenet/'
[mx.test_utils.download(path+'resnet/152-layers/resnet-152-0000.params'),
mx.test_utils.download(path+'resnet/152-layers/resnet-152-symbol.json'),
mx.test_utils.download(path+'synset.txt')]
def compute_stats(perf_results, results):
    results['average'] = np.average(perf_results)
    results['tp50'] = np.percentile(perf_results, 50)
    results['tp90'] = np.percentile(perf_results, 90)
    results['tp99'] = np.percentile(perf_results, 99)
ctx_str = os.environ['BENCHMARK_CTX']
if ctx_str == 'GPU':
    ctx = mx.gpu(0)
elif ctx_str == 'CPU':
    ctx = mx.cpu()
benchmark = {}
prefix = 'resnet-152'
# Model load time: load the exported symbol and params onto the benchmark context
t1 = time.time()
imported_net = gluon.nn.SymbolBlock.imports(prefix + '-symbol.json', ['data', 'softmax_label'],
                                            prefix + '-0000.params', ctx=ctx)
t2 = time.time()
elapsed = (t2 - t1) * 1000
imported_net.hybridize(static_alloc=True, static_shape=True)
benchmark['ModelLoadTime'] = elapsed
fname = mx.test_utils.download('https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/python/predict_image/cat.jpg?raw=true')
img = mx.image.imread(fname)
# Convert into format (batch, RGB, width, height)
img = mx.image.imresize(img, 300, 300)  # resize
img = img.transpose((2, 0, 1))          # channel first
img = img.expand_dims(axis=0)           # batchify
img = img.astype('float32')
# Both inputs have to live on the benchmark context
sf_label = nd.ones((1), ctx=ctx)
img = img.as_in_context(ctx)
# First Inference
t1 = time.time()
op = imported_net(img, sf_label)
op.wait_to_read()
t2 = time.time()
elapsed = (t2 - t1) * 1000
benchmark['FirstInferCall'] = elapsed
times = 100
time_cost = []
for idx in range(0, times):
    t1 = time.time()
    op = imported_net(img, sf_label)
    op.wait_to_read()
    t2 = time.time()
    elapsed = (t2 - t1) * 1000
    time_cost.append(elapsed)
    print("time cost: ", elapsed, "ms")
# Overhead of the first inference call relative to a steady-state call
benchmark['FirstInferOverhead'] = benchmark['FirstInferCall'] - time_cost[0]
compute_stats(time_cost, benchmark)
output = json.dumps(benchmark)
with open('Inf.json', 'w') as f:
    f.write(output)
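The device is selected through the BENCHMARK_CTX environment variable; assuming the script above is saved as benchmark.py (the file name here is only illustrative), the runs look like:
BENCHMARK_CTX=GPU python3 benchmark.py   # p2.8xlarge / p3.2xlarge
BENCHMARK_CTX=CPU python3 benchmark.py   # c5.4xlarge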