> be me, training vlms
> use hf transformers because who wants to reimplement models if they don't have to
> low MFU
> `pytorch_profiler.py`
> ViT takes 40X longer than the LLM
> sus
> .item() in the forward pass triggers a cudaStreamSynchronize in every attention layer
> MFW
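
For anyone who hasn't hit this: a toy sketch of the pattern, not the actual transformers code. The function names and the rescaling logic are made up for illustration. The point is only that `.item()` pulls a scalar back to the host, so on a CUDA tensor the CPU has to wait for the queued kernels, and that the same decision can often stay on the device as a tensor op.

```python
import torch


def scale_with_sync(x: torch.Tensor) -> torch.Tensor:
    # .item() copies the scalar to the host. On a CUDA tensor this blocks
    # the CPU until the kernels producing `x` have finished.
    peak = x.abs().max().item()
    if peak > 1.0:  # Python-level branch on a host value
        return x / peak
    return x


def scale_without_sync(x: torch.Tensor) -> torch.Tensor:
    # Same result, but `peak` stays a tensor: clamping it to at least 1.0
    # makes the "no rescale" case a division by one, with no host round trip.
    peak = x.abs().max()
    return x / torch.clamp(peak, min=1.0)


if __name__ == "__main__":
    big = torch.tensor([0.5, -4.0, 2.0])
    small = torch.tensor([0.25, -0.5])
    print(scale_with_sync(big), scale_without_sync(big))
    print(scale_with_sync(small), scale_without_sync(small))
```

If you want to catch these without reading a profiler trace, `torch.cuda.set_sync_debug_mode("warn")` should flag synchronizing calls, though I haven't checked how complete its coverage is.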
I rarely do this, but this is an absolutely useless benchmark. The examples aren't problems I care whether my computer vision system can solve.
I'm going to treat ZeroBench performance as an anti-signal: a sign that someone made an overfit, test-taking VLM. I bet the next Phi will do great.
SimDINO: simplifying DINO/DINOv2.
They use an L2 loss on global <-> local features and a penalty on feature covariance to prevent collapse. With these additions, they can get rid of a lot of the bells and whistles in DINO/v2 training while improving representations (measured by ImageNet k-NN).
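
A rough NumPy sketch of that objective, to make the two terms concrete. This is my reading, not the authors' code: the log-det "coding rate" form of the covariance term, the `eps` and `gamma` values, and all function names are assumptions.

```python
import numpy as np


def l2_normalize(z: np.ndarray, tiny: float = 1e-8) -> np.ndarray:
    # Project each feature row onto the unit sphere.
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + tiny)


def coding_rate(z: np.ndarray, eps: float = 0.5) -> float:
    # 0.5 * logdet(I + d / (n * eps^2) * Z^T Z) for z of shape (n, d).
    # Large when features spread over many directions, small when they collapse.
    n, d = z.shape
    gram = z.T @ z
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * gram)
    return 0.5 * float(logdet)


def simdino_style_loss(global_feats: np.ndarray,
                       local_feats: np.ndarray,
                       gamma: float = 1.0) -> float:
    # Alignment term: squared L2 distance between matched global/local features.
    g = l2_normalize(global_feats)
    l = l2_normalize(local_feats)
    align = float(np.mean(np.sum((g - l) ** 2, axis=-1)))
    # Anti-collapse term: reward a high coding rate of the global features.
    return align - gamma * coding_rate(g)


if __name__ == "__main__":
    spread = np.eye(4)                            # four orthogonal features
    collapsed = np.tile(np.eye(4)[0], (4, 1))     # every feature identical
    print(simdino_style_loss(spread, spread))
    print(simdino_style_loss(collapsed, collapsed))
```

Both toy inputs have zero alignment error, so the only thing separating them is the covariance term: the collapsed batch gets the worse (higher) loss.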
DiMA: distilling a VLM into a planner for driving.
They use VAD as a baseline and pass tokens from its models into an LLM. The LLM is trained to do VQA, MAE, future prediction, and scene editing. You distill the LLM into the planner transformer, so you don't need the LLM at inference.
> $5.5M for Sonnet tier
it's unsurprising that they're proud of it, but it sure feels like they're rubbing it in. «$100M runs, huh? 30.84M H100-hours on 405B, yeah? Half-witted Western hacks, your silicon is wasted on you, your thoughts wouldn't reduce loss of your own models»