Vision–language models (VLMs) deliver state-of-the-art multimodal accuracy, yet deploying them on FPGA/ASIC accelerators remains costly: once general matrix multiply (GEMM) is quantized to low-bit integer arithmetic, nonlinear functions (NLFs) such as GELU and LayerNorm dominate resources and power. Prior NLF approximations either use high-precision piecewise-linear (PWL) models that still demand sizable logic/DSP budgets, or low-precision integer surrogates that require fine-tuning to recover accuracy. In this paper, we present hardware-efficient, training-free approximations for two representative NLFs, GELU and LayerNorm. First, for GELU, we propose a power-of-two (PoT) PWL scheme: we analytically study the trade-off between the number of LUT entries and accuracy under input clipping, introduce automatic clipping-point selection to meet a target error, and quantize segment slopes to PoT values so that multipliers are replaced with shifts. Second, for LayerNorm, we eliminate floating-point operations in quantized pipelines via a PoT-based mean estimator and a log-based shift-LUT approximation of the reciprocal square root used for variance normalization. Both designs map to a common shift-add datapath and co-optimize naturally with quantized GEMM. On quantized CLIP-ViT models, our approach is plug-and-play (no additional training) and incurs at most a 0.93% Top-1 accuracy drop on ImageNet. A prototype on a Xilinx FPGA reduces DSP usage by up to 100%, LUTs by 69.8%, and FFs by 96.0%. These results indicate that simple, PoT-driven approximations can cap NLF overheads and enable practical, resource-aware VLM acceleration on reconfigurable and custom silicon.
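To make the PoT PWL idea for GELU concrete, the sketch below builds a small segment table on a clipped input range and rounds each segment slope to a signed power of two, so that evaluation needs only a shift and an add. It is a floating-point illustration, not the code in this repository: the function names (`build_pot_pwl`, `pot_pwl_gelu`), the default clipping point and segment count, and the midpoint rule for the intercepts are all assumptions made for the example, and the automatic clipping-point selection is not reproduced.

```python
import math


def gelu(x):
    """Exact GELU, x * Phi(x), used here only as the reference."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def build_pot_pwl(clip=4.0, segments=32, min_exp=-16):
    """Return (clip, step, table) with table[i] = (sign, exp, intercept).

    clip, segments and min_exp are illustrative defaults (assumptions).
    """
    step = 2.0 * clip / segments
    table = []
    for i in range(segments):
        lo = -clip + i * step
        hi = lo + step
        slope = (gelu(hi) - gelu(lo)) / step  # secant slope of this segment
        sign = (slope > 0) - (slope < 0)      # -1, 0 or +1
        # Round |slope| to the nearest power of two in the log domain.
        if slope != 0:
            exp = max(min_exp, round(math.log2(abs(slope))))
        else:
            exp = min_exp
        mid = 0.5 * (lo + hi)
        # Assumption: pick the intercept so the line meets GELU at the midpoint.
        intercept = gelu(mid) - sign * math.ldexp(mid, exp)
        table.append((sign, exp, intercept))
    return clip, step, table


def pot_pwl_gelu(x, pwl):
    """Evaluate the PoT PWL approximation at a scalar x."""
    clip, step, table = pwl
    if x >= clip:    # right tail: GELU(x) is close to x
        return x
    if x <= -clip:   # left tail: GELU(x) is close to 0
        return 0.0
    i = min(int((x + clip) / step), len(table) - 1)
    sign, exp, intercept = table[i]
    # ldexp(x, exp) computes x * 2**exp, standing in for a hardware shift.
    return sign * math.ldexp(x, exp) + intercept


if __name__ == "__main__":
    pwl = build_pot_pwl()
    for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
        print(x, gelu(x), pot_pwl_gelu(x, pwl))
```

Because the slopes are restricted to powers of two, adjacent segments generally do not meet exactly at their boundaries. A fixed-point version would also quantize the intercepts and replace `math.ldexp` with an arithmetic shift.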
ViT-B/32
ViT-L/14
python clip_test.py --model {model_name} --train_set {ImageNet Trainset Directory} --test_set {ImageNet Testset Directory}
python clip_test.py --model {model_name} --train_set {ImageNet Trainset Directory} --test_set {ImageNet Testset Directory} --quant --calib
python clip_test.py --model {model_name} --train_set {ImageNet Trainset Directory} --test_set {ImageNet Testset Directory} --quant --calib --int_norm --int_act --int_softmax
- This project is based on the following repository:
- https://github.com/openai/CLIP
(Used as the base implementation; the model weights were also downloaded from this repository.)
