VLMs can compress text by rendering it as images, but accuracy collapses once images shrink below a certain resolution.
We introduce LensVLM, which teaches the model to scan compressed images and then selectively decompress only what it needs.
Paper: arxiv.org/abs/2605.07019











