mixed-precision quantization milestone1: naive_intNwo + eval/benchmark framework #531
andrewor14 merged 12 commits into main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/531
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit e516f0b with merge base 00b76c4. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
andrewor14 left a comment:
Hi @Hanxian97, thanks for the PR. I think you can remove all the files except for mp_quant_eval.py, naive_intNwo.py, and test_naive_intNwo.py. Left a few minor comments other than that
wait
done
Hi @Hanxian97, I feel we don't want to push these experiment scripts to torchao. Can you remove them from the PR? (OK to keep in your own separate branch for now)
Thanks for the comment. I have removed experiment scripts and only kept mp_quant_eval.py, naive_intNwo.py, and test_naive_intNwo.py.
    ZeroPointDomain,
)

def intN_weight_only_asym(group_size=32, n=8):
Can you add a short docstring to describe what this is doing? Maybe add an example to use this with the quantize_ API? (same for intN_weight_only_sym)
also, should we limit this to n = [2, 3, 4, 5, 6, 8] for now? (throw error otherwise)
Added the docstring and an assertion to limit n to [2, 3, 4, 5, 6, 8] only
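For readers following along, the scheme under discussion can be sketched in a few lines. This is a minimal pure-Python illustration of asymmetric n-bit weight quantization with the agreed bit-width check, not the PR's actual naive_intNwo implementation; all names here are illustrative:

```python
# Sketch of asymmetric n-bit quantization: map floats onto the unsigned
# integer range [0, 2**n - 1] with a scale and zero point, with the
# bit-width restriction the reviewer asked for.
def quantize_asym(weights, n, eps=1e-6):
    assert n in (2, 3, 4, 5, 6, 8), f"unsupported bit width: {n}"
    qmin, qmax = 0, 2**n - 1
    wmin, wmax = min(weights), max(weights)
    scale = max((wmax - wmin) / (qmax - qmin), eps)  # avoid a zero scale
    zero_point = round(qmin - wmin / scale)
    # quantize, clamping to the representable range
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]
```

Round-tripping a weight through `quantize_asym`/`dequantize` should stay within roughly one scale step of the original value.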
@@ -0,0 +1,95 @@
import torch
might wanna call this file something else since you're about to do the real sensitivity analysis
Removed this file for now and will commit the real sensitivity analysis in milestone2
model = AutoModelForCausalLM.from_pretrained(repo_id).to(device="cpu", dtype=precision)

if quantization == "int8dq":
    quantize_(model.to(device=device), int8_dynamic_activation_int4_weight())
this seems wrong? On main it's int8_dynamic_activation_int8_weight (line 52 in 0e6c122). Actually we can probably just delete this case?
removed this for now since we will not use this
@@ -0,0 +1,27 @@
import torch
by the way I think we need to move this to torchao/test if we want it to run as part of CI
Moved test_naive_intNwo.py under test/quantization now
target_dtype = torch.int8
quant_min = 0
quant_max = 2**n - 1
should target_dtype be torch.uint8 for this?
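A quick check of the reviewer's point (illustrative arithmetic, not torchao code): with `quant_min = 0` and `quant_max = 2**n - 1`, the n == 8 case needs values up to 255, which overflows a signed int8 (max 127) but fits an unsigned uint8.

```python
# Maximum representable values of the two candidate dtypes
INT8_MAX, UINT8_MAX = 127, 255

def fits(n, dtype_max):
    # does the asymmetric range [0, 2**n - 1] fit in the dtype?
    return 2**n - 1 <= dtype_max

assert all(fits(n, INT8_MAX) for n in (2, 3, 4, 5, 6))  # int8 is fine below 8 bits
assert not fits(8, INT8_MAX)   # 255 > 127: int8 would overflow at n == 8
assert fits(8, UINT8_MAX)      # uint8 covers the full asymmetric range
```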
if sensi_bit == 8:
    quantize_(model.to(device=device), int8_weight_only(), filter_fn_sen)
elif sensi_bit == 4:
    quantize_(model.to(device=device), int4_weight_only(group_size=group_size), filter_fn_sen)
you could merge this logic into intN_weight_only_asym I think
merged them into intN_weight_only now
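The shape of that merge can be illustrated with toy stand-ins (these are not torchao's APIs): a single intN config parameterized by bit width, plus a filter function selecting which layers it applies to, instead of branching on sensi_bit at every call site.

```python
# Stand-in for applying a quantization config to the layers a filter accepts
def apply_quant(layer_names, config, filter_fn):
    return {name: config for name in layer_names if filter_fn(name)}

# Stand-in for a single bit-width-parameterized config factory
def intN_weight_only(n, group_size=32):
    return f"int{n}wo(group_size={group_size})"

layers = ["layers.0.attn", "layers.0.mlp", "lm_head"]
sensitive = lambda name: name == "lm_head"  # assumed sensitivity rule

# Sensitive layers get more bits; everything else gets fewer.
plan = apply_quant(layers, intN_weight_only(8), sensitive)
plan.update(apply_quant(layers, intN_weight_only(4), lambda n_: not sensitive(n_)))
```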
bit_zeropoint = 2  # Example value, please adjust as needed
bit_scale = 2  # Example value, please adjust as needed
Yes these are in bytes. Have fixed this. Thanks!
return total_size_gb

# Example usage
num_elements = 250945664  # number of elements per Llama3 linear layer
can this be calculated from the model instead of hardcoded? Also, I feel a better integration is to just fix and extend get_model_size_in_bytes (line 188 in 5787e9e).
Yes, this is a temporary solution for Llama3. Thanks for the suggestion! I will try to generalize it by extending get_model_size_in_bytes.
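The size estimate being discussed can be sketched generically: weight bits per element plus per-group scale/zero-point overhead. The function name and the 2-byte (bf16) scale/zero-point defaults are assumptions taken from this thread, not torchao's actual helper:

```python
# Rough size estimate for a groupwise-quantized weight tensor, in GB.
def quantized_size_gb(num_elements, weight_bits, group_size=32,
                      scale_bytes=2, zero_point_bytes=2):
    weight_bytes = num_elements * weight_bits / 8        # packed weights
    num_groups = num_elements / group_size               # one scale/zp per group
    overhead_bytes = num_groups * (scale_bytes + zero_point_bytes)
    return (weight_bytes + overhead_bytes) / 1024**3
```

For example, 2**30 int8 elements with group size 32 come to 1 GB of weights plus 0.125 GB of scale/zero-point overhead.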
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.fx_graph_cache = True

def intN_weight_only(group_size=32, n=8):
I'd suggest naming this in more detail, since you have different dtypes and asymmetric/symmetric; in this case it's uintN_asymmetric_weight_only (or probably pass asymmetric/symmetric around as an argument)
I passed asymmetric/symmetric as an argument and merged them into intN_weight_only
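The agreed interface can be sketched as one factory with the quantization scheme passed as an argument rather than separate `*_asym`/`*_sym` functions. This is an illustrative signature that returns a plain dict of settings, not torchao's real config object:

```python
def intN_weight_only(group_size=32, n=8, symmetric=False):
    # single entry point: scheme selected by argument, bit width restricted
    assert n in (2, 3, 4, 5, 6, 8), f"unsupported bit width: {n}"
    if symmetric:
        # symmetric: signed range centered on zero
        quant_min, quant_max = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    else:
        # asymmetric: full unsigned range
        quant_min, quant_max = 0, 2**n - 1
    return {"group_size": group_size, "n": n, "symmetric": symmetric,
            "quant_min": quant_min, "quant_max": quant_max}
```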
aadda53 to e9f56d4 (force-push)
andrewor14 left a comment:
Approving to unblock. Thanks!
eps = 1e-6
preserve_zero = False
zero_point_dtype = torch.bfloat16
zero_point_domain = ZeroPointDomain.FLOAT
I think this should be ZeroPointDomain.INT. FLOAT is mainly for the optimized int4 tinygemm kernel right now
Thanks for pointing it out. Just changed it to INT.
@@ -0,0 +1,46 @@
import torch
maybe call this test_mixed_precision.py to match your prototype folder and for your future test cases as well?
    test_weight_only_quant(i, False)
    print(f"Test passed for {i}-bit using naive intNwo asymmetric quantization implementation")
except Exception as e:
    print(f"Exception handled in test loop for {i}-bit asymmetric quantization. Details: {e}")
might want to actually raise this exception too? Otherwise it'll be hard to catch the test when it fails
import os
import sys
# append the path to the naive_intNwo.py file
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))), "torchao/quantization/prototype/mixed_precision/scripts"))
Does it work if you just add an empty __init__.py to torchao/quantization/prototype/mixed_precision? Then you won't need this line anymore?
Added __init__.py and removed the path append
75b55c2 to ec36a94 (force-push)
ec36a94 to f4fccf3 (force-push)
mixed-precision quantization milestone1: naive_intNwo + eval/benchmark framework (#531)

* milestone1: naive_intNwo + eval/benchmark
* remove experiment scripts
* remove exp files
* use default ZeroPointDomain.INT for int2/3/5/6
* renamed test_naive_intNwo.py to test_mixed_precision.py
* updated intNwo with _get_linear_subclass_inserter
* adjust sqnr threshold according to bit width
* fixed test for int4wo and add __init__.py
* skip test_aq_int8_weight_only_quant_3_subclass due to seg fault on nightly
* edit the sqnr threshold
* add unittest
* correct import path
Summary:
This is a prototype for mixed-precision quantization. It consists of a naive implementation of 2/3/5/6-bit integer weight-only quantization. Together with torchao's existing int4wo and int8wo, it provides an evaluation framework leveraging lm_eval for mixed-precision quantization on Llama3.
Test Plan:
To test the naive implementation of the quantization APIs: python test/quantization/test_naive_intNwo.py