[Quantization] fix quantization pass bug #355

Merged
Aalanli merged 2 commits into hidet-org:main from Aalanli:quant-improvement
Aug 29, 2023
Conversation

@Aalanli Aalanli commented Aug 24, 2023

For LLaMA 7B, decoding with no prefill:

fp16, 128 tokens:
org_t: 3.048099994659424 s
avg_t: 0.0044784750789403915 s

int8, 128 tokens:
org_t: 2.2229115962982178 s
avg_t: 0.0038476772606372833 s
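The PR does not show the timing harness itself. As a rough sketch of how such numbers are typically collected (names like `step_fn` are hypothetical stand-ins, not hidet API; `org_t` presumably includes one-time setup such as compilation, while `avg_t` is the per-token average over the decode loop):

```python
import time

def benchmark_decode(step_fn, num_tokens=128):
    """Time an autoregressive decode loop.

    Returns (total, avg): total wall-clock time for the whole loop
    (comparable to ``org_t`` above) and the average per-step latency
    (comparable to ``avg_t``). ``step_fn`` stands in for one decoding step.
    """
    start = time.perf_counter()
    per_step = []
    for _ in range(num_tokens):
        t0 = time.perf_counter()
        step_fn()  # one token of decoding
        per_step.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    avg = sum(per_step) / len(per_step)
    return total, avg

# Dummy workload standing in for a real decode step.
total, avg = benchmark_decode(lambda: sum(range(1000)), num_tokens=128)
```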

@Aalanli Aalanli merged commit 021b067 into hidet-org:main Aug 29, 2023
@Aalanli Aalanli deleted the quant-improvement branch August 29, 2023 00:03
vadiklyutiy added a commit that referenced this pull request Jul 22, 2024
Introduce `add_hint_pass`. It adds `__builtin_assume(...)` to the generated .cu code, which helps nvcc understand the bounds of `threadIdx` and `blockIdx` and optimize the code better.

**Performance improvements.**

Models

| model | latency | prev_latency | ratio (%) |
|--------|--------|--------|--------|
| bert-base-uncased | 19.8138 | 20.2316 | 2.109 |
| densenet121 | 35.1161 | 36.7627 | 4.689 |
| efficientnet_b0 | 18.9451 | 19.278 | 1.757 |
| mobilenet_v2 | 11.5944 | 11.8764 | 2.432 |
| resnet50 | 29.4878 | 29.9935 | 1.715 |
| vit_b_16 | 125.787 | 123.672 | -1.681 |

Operators

| operator | latency | prev_latency | ratio (%) |
|--------|--------|--------|--------|
| attn | 1.50402 | 1.50131 | -0.18 |
| attn | 0.219707 | 0.227568 | 3.578 |
| attn_mask_add | 1.5892 | 1.62516 | 2.263 |
| attn_mask_add | 0.226317 | 0.226507 | 0.084 |
| batch_matmul | 5.2399 | 5.11547 | -2.375 |
| batch_matmul | 0.0216016 | 0.0223425 | 3.43 |
| conv2d | 0.0347093 | 0.0341758 | -1.537 |
| conv2d | 0.310521 | 0.308458 | -0.664 |
| conv2d_gemm_f16 | 0.142542 | 0.146412 | 2.715 |
| conv2d_gemm_f16 | 2.0421 | 2.07043 | 1.387 |
| matmul_f16 | 2.22432 | 2.30458 | 3.608 |
| matmul_f16 | 0.00888628 | 0.00892615 | 0.449 |
| reduce | 0.01375 | 0.0138618 | 0.813 |
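As an illustration of the kind of hint such a pass emits (a hedged sketch, not the pass's actual output; the kernel, block size, and grid bound below are made up):

```cuda
// Hypothetical generated kernel. With the assume hints, nvcc knows the
// ranges of threadIdx.x and blockIdx.x at compile time and can prove
// tighter bounds on i, enabling e.g. cheaper index arithmetic or the
// elimination of redundant range checks.
__global__ void scale_kernel(float *x, float alpha, int n) {
    __builtin_assume(threadIdx.x < 256);   // block dim fixed at launch
    __builtin_assume(blockIdx.x < 4096);   // grid dim bounded at launch
    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) {
        x[i] *= alpha;
    }
}
```

`__builtin_assume` is a compiler hint only: passing a condition that is false at runtime is undefined behavior, so a pass like this must only emit bounds it can guarantee from the launch configuration.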
vadiklyutiy added a commit that referenced this pull request Jul 23, 2024
vadiklyutiy added a commit that referenced this pull request Dec 26, 2024