RMSE-optimized quants for all quantization types #1106
Conversation
Sounds like a good idea. For me personally, I/O is the bottleneck, since I store them on a NAS.
Force-pushed from 364c00a to 7ca90a8
It might be a good idea to get #953 merged first, which implements unit tests for the quantization. But that requires an improvement to the test samples.
sw left a comment
Please update SHA256SUMS, at the very least remove the files which are now different.
```c
    float scale = sumlx/suml2;
    return scale;
}

static float kquantize_q4_with_bound_plus(int n, int nmax, const float * restrict X, int nCandidates,
```
What does `_plus` mean?
Couldn't you re-use `kquantize_q4_with_bounds` with `nmin=0`?
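For readers without the full diff context: the `scale = sumlx/suml2` line above is the closed-form least-squares scale for a fixed assignment of integer levels. A minimal standalone sketch of that step (the function name `best_scale` and the framing are mine, not the PR's actual code):

```c
#include <stdint.h>

/* For fixed integer levels l[i], the scale minimizing
 * sum_i (x[i] - scale*l[i])^2 is the least-squares solution
 * scale = (sum_i l[i]*x[i]) / (sum_i l[i]*l[i]). */
static float best_scale(int n, const float * x, const int8_t * l) {
    float sumlx = 0.0f;
    float suml2 = 0.0f;
    for (int i = 0; i < n; ++i) {
        sumlx += x[i]*l[i];
        suml2 += (float)(l[i]*l[i]);
    }
    return suml2 > 0.0f ? sumlx/suml2 : 0.0f;
}
```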
I'm still a bit skeptical whether chasing after RMSE is the right thing to do. Let me explain what I mean: originally the Q4 methods calculate max(abs()) and divide that by 7. What if it actually helps perplexity to clip the largest values somewhat, even if that comes at a higher RMS error? The approach to find out would be to use #729, choose a value in the interesting range of maybe [7, 11], quantize the model, do a perplexity run, lather, rinse, repeat. A sketch of the clipped variant is below.
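To make the idea concrete, here is a hedged illustration (my own sketch, not the code from #729): the baseline picks `scale = max|x| / 7`; clipping amounts to dividing `max|x|` by a larger constant and saturating the quantized values to [-7, 7].

```c
#include <math.h>
#include <stdint.h>

/* Illustration of clipped Q4-style quantization: divisor = 7 reproduces
 * the baseline; divisor in (7, 11] clips the largest-magnitude values. */
static void quantize_block_clip(int n, const float * x, float divisor,
                                float * scale, int8_t * l) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float ax = fabsf(x[i]);
        if (ax > amax) amax = ax;
    }
    *scale = amax / divisor;
    const float id = *scale != 0.0f ? 1.0f/(*scale) : 0.0f;
    for (int i = 0; i < n; ++i) {
        int v = (int)roundf(x[i]*id);   /* values beyond +-7 saturate */
        if (v >  7) v =  7;
        if (v < -7) v = -7;
        l[i] = (int8_t)v;
    }
}
```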
Just made a full cuBLAS run on 13B. Will make another run, this time using RMSE optimization (i.e. the same as the one in the OP), and double-check the reported numbers.
My result for 13B, using …: I think this result makes more sense, since it is in line with the expectation I described here: #406 (reply in thread)
@ggerganov Are these results with or without the changes you made to …?
By default this new option is ON. One can turn it off by setting `LLAMA_NO_RMSE`. With this option enabled, `Q4_3` quantization results in a perplexity of 6.0344, i.e. 0.0273 lower than simple `Q4_3` quantization.
The test does not work with RMSE minimization enabled, so the test cases have to be put between ifdefs.
It includes all changes from today related to …
@ggerganov Rebased this branch on latest master, re-quantized, and re-ran the perplexity. Now I get the lower result with OpenBLAS as well. Is it possible this affects this comment you made in #729?

Details

```
perplexity : calculating perplexity over 655 chunks, batch_size=512
30.59 seconds per pass - ETA 5 hours 33 minutes
[1]3.7363,[2]4.1741,[3]4.9573,[4]5.3622, … ,[653]5.3024,[654]5.2969,[655]5.2961
llama_print_timings: load time = 31077.61 ms
```
I think we cannot expect cuBLAS and OpenBLAS to be exactly the same, because cuBLAS dequantizes …
That's not exactly the case: when multiplying q x f32, cuBLAS dequantizes to f32 and does an f32 x f32 mat mul. The only difference with OpenBLAS is when performing an f16 x f32 mat mul (…).
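For readers following along, a hedged sketch of the f32 path described above (my own illustration using OpenBLAS's `cblas_sgemm`, not llama.cpp's actual code): the 4-bit block is expanded to f32 first, and the GEMM itself then runs entirely in f32, which is why the cuBLAS and OpenBLAS results are expected to agree up to floating-point accumulation order.

```c
#include <cblas.h>

/* Illustration only: dequantize a Q4_0-style block (one f32 scale d,
 * 32 4-bit levels packed two per byte, offset by 8) into 32 floats. */
static void dequantize_block_q4(float d, const unsigned char * q, float * y) {
    for (int i = 0; i < 16; ++i) {
        y[2*i + 0] = d * (float)((q[i] & 0x0F) - 8);
        y[2*i + 1] = d * (float)((q[i] >>   4) - 8);
    }
}

/* C = A * B, with A already dequantized to f32 (row-major, no transposes). */
static void matmul_f32(int M, int N, int K,
                       const float * A, const float * B, float * C) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);
}
```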
Force-pushed from 7ca90a8 to 6fd49ed
@ggerganov I propose we close this PR. Although there is some benefit from RMSE minimization for …
You are minimizing error, so why should it be worse? It may be worse for one case but better for another, no?
By that I mean that perplexity for a wide range of other files (other than en-wikitext-whatever) may be better, and not for one model but for another... Quantization is here to compress the data as much as possible without affecting the model's quality much.
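For reference, the quantity under discussion, per block of $n$ weights (notation mine, hedged): RMSE minimization searches for a scale $d$ (and, for the `Q4_1`/`Q4_3` types, an offset $m$, taken as $0$ for `Q4_0`/`Q4_2`) together with 4-bit integer levels $\ell_i$ minimizing

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - (d\,\ell_i + m)\bigr)^2},$$

whereas the baseline fixes $d$ from $\max_i |x_i|$ and simply rounds each $x_i$ to the nearest level.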
The PR adds a new build option (`LLAMA_NO_RMSE`), which is off by default. When it is off, all current quantization types (`Q4_0`, `Q4_1`, `Q4_2`, `Q4_3`) are performed with RMSE minimization (on master, RMSE minimization is enabled for `Q4_2` only and cannot easily be disabled). This makes generation of quantized models take quite a bit longer, but still in the same ballpark as it did before quantization was multi-threaded in PR #1075.
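A hedged sketch of how such a compile-time switch is typically wired (the exact placement in this branch is an assumption; only the option name `LLAMA_NO_RMSE` comes from the PR, and both function names below are hypothetical):

```c
/* Assumed gating pattern: when LLAMA_NO_RMSE is defined, fall back to the
 * plain round-to-nearest quantizer; otherwise use the RMSE-minimizing one. */
#ifdef LLAMA_NO_RMSE
    quantize_row_q4_reference(x, y, k);   /* hypothetical baseline entry point */
#else
    quantize_row_q4_rmse(x, y, k);        /* hypothetical RMSE-optimized entry point */
#endif
```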
With RMSE minimization enabled, `Q4_3` gives a perplexity of 6.0344 for the 7B model, i.e. 0.0273 lower than simple `Q4_3` quantization as reported by @ggerganov in #406. If I also enable his trick of not quantizing the output tensors, perplexity becomes 6.0085.

The perplexity result for `Q4_3` without quantization of the output tensors for the 13B model is 5.3117.

Details for these perplexity runs can be found here (issue #406).
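For context on how these numbers are computed, a minimal sketch (my own, assuming the standard definition used by the perplexity tool: the exponential of the mean negative log-likelihood per token):

```c
#include <math.h>

/* perplexity = exp( (1/N) * sum_i -log p(token_i) ) */
double perplexity(int n_tokens, const double * neg_log_likelihood) {
    double sum = 0.0;
    for (int i = 0; i < n_tokens; ++i) {
        sum += neg_log_likelihood[i];
    }
    return exp(sum / n_tokens);
}
```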
As far as I can tell, we are now on par with the best known GPTQ result for 7B, and better for 13B by about 0.05.