Super thrilled to release our work on extreme quantization (1-bit and 2-bit)! We're starting with Llama2-7B since it's a well-understood model. Check out our detailed blog post: mobiusml.github.io/1bit_blog/
AI-powered computer vision technology that gives you deep visual analysis and unparalleled flexibility with a convenient interface that anyone can use. 🚀
Another quantization method dropped in the @huggingface transformers library! Half Quadratic Quantization 🔥
HQQ implements on-the-fly quantization via fast robust optimization. It doesn’t require calibration data and can be used to quantize any model, down to 1-bit precision!
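To make "no calibration data" concrete, here is a minimal sketch of group-wise affine quantization that looks only at the weights themselves. It uses plain round-to-nearest with a min/max range per group, which is NOT HQQ's half-quadratic solver; HQQ goes further and optimizes the quantization parameters. The function names, the group size and the toy weights are all illustrative assumptions, not part of the hqq library.

```python
# Illustrative sketch only: plain round-to-nearest, not HQQ's optimizer.
# All names here are hypothetical and do not come from the hqq library.

def quantize_group(weights, nbits):
    """Quantize one group of floats to integer codes in [0, 2**nbits - 1]."""
    levels = (1 << nbits) - 1
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    zero = w_min
    codes = [min(levels, max(0, round((w - zero) / scale))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


def quantize(weights, nbits=2, group_size=4):
    """Split a flat weight list into groups and quantize each one separately."""
    return [
        quantize_group(weights[start:start + group_size], nbits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Rebuild the flat weight list from (codes, scale, zero) triples."""
    out = []
    for codes, scale, zero in groups:
        out.extend(dequantize_group(codes, scale, zero))
    return out


if __name__ == "__main__":
    toy_weights = [0.0, 1.0, 2.0, 3.0, -1.0, -0.5, 0.5, 1.0]
    packed = quantize(toy_weights, nbits=2, group_size=4)
    restored = dequantize(packed)
    for original, approx in zip(toy_weights, restored):
        print(f"{original:+.3f} -> {approx:+.3f}")
```

Each group stores its own scale and zero point, so nothing beyond the weight tensor is needed. Smaller groups cut the rounding error but add more per-group metadata.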
Prioritize your visual content based on client needs to increase sales, identify additional revenue streams and reduce operational costs. 🚀 mobiuslabs.com/media-business
Early preview of the new backend (torchao int4) for HQQ with transformers. Llama2-7B now runs at 150 tokens/sec on an RTX 4090. More details and code coming later this week!
Enrich your content with visual intelligence that understands clients and serves them with the most relevant visuals. Leverage our next-gen AI solution to organize, tag and search for visual content with total ease and privacy.👇
🚀 Introducing two new kernels to HQQ! One is based on TorchAO and the other on Marlin. They currently support only 4-bit models, reaching speeds of up to 200 tokens/sec on a 4090 and 60 tokens/sec on an L4.
🔗 Get it here: github.com/mobiusml/hqq
📊 Colab: colab.research.google.com/drive/1uomMVKC…
Our #imagerecognition solution works out of the box to deliver market-leading precision & speed. Curious? Try our Superhuman Vision™ demo today. ct.mobiuslabs.com/register
A 2-bit model performs quite well. Specifically, the base Llama2-7B 2-bit model with HQQ+ outperforms the full-precision model on Wikitext. The chat model also exceeds its full-precision counterpart on GSM8K when given adequate math and reasoning data.
Thrilled to contribute to the amazing work by Answer.AI on efficiently training large models (70B) with consumer GPUs. Putting our hopes on the next large models of consequence, trained & shared by the GPU-poor, now that large per-GPU memory is no longer a prerequisite.