mixed-precision quantization milestone1: naive_intNwo + eval/benchmark framework #531
andrewor14 merged 12 commits into main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/531
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit e516f0b with merge base 00b76c4. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
andrewor14 left a comment:
Hi @Hanxian97, thanks for the PR. I think you can remove all the files except for mp_quant_eval.py, naive_intNwo.py, and test_naive_intNwo.py. Left a few minor comments other than that
wait
done
Hi @Hanxian97, I feel we don't want to push these experiment scripts to torchao. Can you remove them from the PR? (OK to keep in your own separate branch for now)
Thanks for the comment. I have removed experiment scripts and only kept mp_quant_eval.py, naive_intNwo.py, and test_naive_intNwo.py.
    ZeroPointDomain,
)

def intN_weight_only_asym(group_size=32, n=8):
Can you add a short docstring to describe what this is doing? Maybe add an example to use this with the quantize_ API? (same for intN_weight_only_sym)
also, should we limit this to n = [2, 3, 4, 5, 6, 8] for now? (throw error otherwise)
Added the docstring and an assertion to limit n to [2, 3, 4, 5, 6, 8] only
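For readers following along, the scheme under discussion can be sketched in a few lines. This is a minimal pure-Python illustration of asymmetric n-bit weight quantization with the agreed bit-width check, not the PR's actual naive_intNwo implementation; all names here are illustrative:

```python
# Sketch of asymmetric n-bit quantization: map floats onto the unsigned
# integer range [0, 2**n - 1] with a scale and zero point, with the
# bit-width restriction the reviewer asked for.
def quantize_asym(weights, n, eps=1e-6):
    assert n in (2, 3, 4, 5, 6, 8), f"unsupported bit width: {n}"
    qmin, qmax = 0, 2**n - 1
    wmin, wmax = min(weights), max(weights)
    scale = max((wmax - wmin) / (qmax - qmin), eps)  # avoid a zero scale
    zero_point = round(qmin - wmin / scale)
    # quantize, clamping to the representable range
    q = [min(qmax, max(qmin, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]
```

Round-tripping a weight through `quantize_asym`/`dequantize` should stay within roughly one scale step of the original value.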
@@ -0,0 +1,95 @@
import torch
might wanna call this file something else since you're about to do the real sensitivity analysis
Removed this file for now and will commit the real sensitivity analysis in milestone2
model = AutoModelForCausalLM.from_pretrained(repo_id).to(device="cpu", dtype=precision)

if quantization == "int8dq":
    quantize_(model.to(device=device), int8_dynamic_activation_int4_weight())
this seems wrong? On main it's int8_dynamic_activation_int8_weight (line 52 in 0e6c122). Actually we can probably just delete this case?
removed this for now since we will not use this
@@ -0,0 +1,27 @@
import torch
by the way I think we need to move this to torchao/test if we want it to run as part of CI
Moved test_naive_intNwo.py under test/quantization now
target_dtype = torch.int8
quant_min = 0
quant_max = 2**n - 1
should target_dtype be torch.uint8 for this?
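A quick check of the reviewer's point (illustrative arithmetic, not torchao code): with `quant_min = 0` and `quant_max = 2**n - 1`, the n == 8 case needs values up to 255, which overflows a signed int8 (max 127) but fits an unsigned uint8.

```python
# Maximum representable values of the two candidate dtypes
INT8_MAX, UINT8_MAX = 127, 255

def fits(n, dtype_max):
    # does the asymmetric range [0, 2**n - 1] fit in the dtype?
    return 2**n - 1 <= dtype_max

assert all(fits(n, INT8_MAX) for n in (2, 3, 4, 5, 6))  # int8 is fine below 8 bits
assert not fits(8, INT8_MAX)   # 255 > 127: int8 would overflow at n == 8
assert fits(8, UINT8_MAX)      # uint8 covers the full asymmetric range
```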
if sensi_bit == 8:
    quantize_(model.to(device=device), int8_weight_only(), filter_fn_sen)
elif sensi_bit == 4:
    quantize_(model.to(device=device), int4_weight_only(group_size=group_size), filter_fn_sen)
you could merge this logic into intN_weight_only_asym I think
merged them into intN_weight_only now
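The shape of that merge can be illustrated with toy stand-ins (these are not torchao's APIs): a single intN config parameterized by bit width, plus a filter function selecting which layers it applies to, instead of branching on sensi_bit at every call site.

```python
# Stand-in for applying a quantization config to the layers a filter accepts
def apply_quant(layer_names, config, filter_fn):
    return {name: config for name in layer_names if filter_fn(name)}

# Stand-in for a single bit-width-parameterized config factory
def intN_weight_only(n, group_size=32):
    return f"int{n}wo(group_size={group_size})"

layers = ["layers.0.attn", "layers.0.mlp", "lm_head"]
sensitive = lambda name: name == "lm_head"  # assumed sensitivity rule

# Sensitive layers get more bits; everything else gets fewer.
plan = apply_quant(layers, intN_weight_only(8), sensitive)
plan.update(apply_quant(layers, intN_weight_only(4), lambda n_: not sensitive(n_)))
```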
bit_zeropoint = 2  # Example value, please adjust as needed
bit_scale = 2  # Example value, please adjust as needed
Yes these are in bytes. Have fixed this. Thanks!
return total_size_gb

# Example usage
num_elements = 250945664  # number of elements per Llama3 linear layer
can this be calculated from the model instead of hardcoded? Also, I feel a better integration is to just fix and extend get_model_size_in_bytes (line 188 in 5787e9e).
Yes, this is a temporary solution for Llama3. Thanks for the suggestion! I will try to generalize it by extending get_model_size_in_bytes.
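The size estimate being discussed can be sketched generically: weight bits per element plus per-group scale/zero-point overhead. The function name and the 2-byte (bf16) scale/zero-point defaults are assumptions taken from this thread, not torchao's actual helper:

```python
# Rough size estimate for a groupwise-quantized weight tensor, in GB.
def quantized_size_gb(num_elements, weight_bits, group_size=32,
                      scale_bytes=2, zero_point_bytes=2):
    weight_bytes = num_elements * weight_bits / 8        # packed weights
    num_groups = num_elements / group_size               # one scale/zp per group
    overhead_bytes = num_groups * (scale_bytes + zero_point_bytes)
    return (weight_bytes + overhead_bytes) / 1024**3
```

For example, 2**30 int8 elements with group size 32 come to 1 GB of weights plus 0.125 GB of scale/zero-point overhead.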
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.fx_graph_cache = True

def intN_weight_only(group_size=32, n=8):
I'd suggest naming this in more detail, since you have different dtypes and asymmetric/symmetric; in this case it's uintN_asymmetric_weight_only (or probably pass asymmetric/symmetric around as an argument)
I passed asymmetric/symmetric as an argument and merged them into intN_weight_only
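The agreed interface can be sketched as one factory with the quantization scheme passed as an argument rather than separate `*_asym`/`*_sym` functions. This is an illustrative signature that returns a plain dict of settings, not torchao's real config object:

```python
def intN_weight_only(group_size=32, n=8, symmetric=False):
    # single entry point: scheme selected by argument, bit width restricted
    assert n in (2, 3, 4, 5, 6, 8), f"unsupported bit width: {n}"
    if symmetric:
        # symmetric: signed range centered on zero
        quant_min, quant_max = -(2 ** (n - 1)), 2 ** (n - 1) - 1
    else:
        # asymmetric: full unsigned range
        quant_min, quant_max = 0, 2**n - 1
    return {"group_size": group_size, "n": n, "symmetric": symmetric,
            "quant_min": quant_min, "quant_max": quant_max}
```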
aadda53 to e9f56d4 (force-push)
andrewor14 left a comment:
Approving to unblock. Thanks!
eps = 1e-6
preserve_zero = False
zero_point_dtype = torch.bfloat16
zero_point_domain = ZeroPointDomain.FLOAT
I think this should be ZeroPointDomain.INT. FLOAT is mainly for the optimized int4 tinygemm kernel right now
Thanks for pointing it out. Just changed it to INT.
@@ -0,0 +1,46 @@
import torch
maybe call this test_mixed_precision.py to match your prototype folder and for your future test cases as well?
    test_weight_only_quant(i, False)
    print(f"Test passed for {i}-bit using naive intNwo asymmetric quantization implementation")
except Exception as e:
    print(f"Exception handled in test loop for {i}-bit asymmetric quantization. Details: {e}")
might want to actually raise this exception too? Otherwise it'll be hard to catch the test when it fails
import os
import sys
# append the path to the naive_intNwo.py file
sys.path.append(os.path.join(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))), "torchao/quantization/prototype/mixed_precision/scripts"))
Does it work if you just add an empty __init__.py to torchao/quantization/prototype/mixed_precision? Then you won't need this line anymore?
Added __init__.py and removed the path append
75b55c2 to ec36a94 (force-push)
ec36a94 to f4fccf3 (force-push)
mixed-precision quantization milestone1: naive_intNwo + eval/benchmark framework (#531)

* milestone1: naive_intNwo + eval/benchmark
* remove experiment scripts
* remove exp files
* use default ZeroPointDomain.INT for int2/3/5/6
* renamed test_naive_intNwo.py to test_mixed_precision.py
* updated intNwo with _get_linear_subclass_inserter
* adjust sqnr threshold according to bit width
* fixed test for int4wo and add __init__.py
* skip test_aq_int8_weight_only_quant_3_subclass due to seg fault on nightly
* edit the sqnr threshold
* add unittest
* correct import path
Summary:
This is a prototype for mixed-precision quantization. It consists of a naive implementation of 2/3/5/6-bit integer weight-only quantization. Together with torchao's existing int4wo and int8wo, it provides an evaluation framework leveraging lm_eval for mixed-precision quantization on Llama3.
Test Plan:
To test the naive implementation of the quantization APIs: python test/quantization/test_naive_intNwo.py