MQGAN Refiner
Combining the strengths of two networks to make the best possible model.
Note: This is a sequel to this model I made, although I’ll provide a quick TL;DR:
I made a model that uses a ResNet encoder-decoder and two discriminators to adversarially vector-quantize mel spectrograms with Finite Scalar Quantization (FSQ). The discriminators helped, but the reconstructions still suffered from too much oversmoothing.
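For context, FSQ replaces a learned codebook with a fixed grid: each latent channel is bounded and rounded, with a straight-through estimator so gradients still flow. A minimal sketch (the level count and bounding are illustrative, not the exact configuration from the original model):

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Finite Scalar Quantization: bound each latent channel, round it to a
    fixed grid of `levels` values, and pass gradients straight through."""
    # Bound the latent to (-1, 1) so the grid covers a fixed range.
    z = torch.tanh(z)
    # Round to `levels` evenly spaced values per dimension.
    half = (levels - 1) / 2
    z_q = torch.round(z * half) / half
    # Straight-through estimator: forward uses z_q, backward sees identity.
    return z + (z_q - z).detach()
```

No codebook, no commitment loss, no EMA updates: the "codebook" is implicit in the rounding grid.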
Architectural Review
The original model used a ResNet-style encoder-decoder to reconstruct mel spectrograms from FSQ codes (I assume you, dear reader, already know what a ResNet is). Solid, proven, and simple enough. However, as I found out, ResNets aren't exactly the best at pure image tasks.
So, what is good at image tasks? As Stable Diffusion proved: the UNet.
Originally designed for biomedical segmentation, UNets excel at modifying images while preserving their overall structure — which makes them a perfect match for refining blurry spectrograms.
A crucial component is the skip connections, which let the model recover detail that would otherwise be lost in downsampling. Those same skips make the UNet inadequate for compression, though: in an autoencoder the encoder and decoder must be independent, and skip connections would let information bypass the bottleneck entirely.
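To make the skip-connection idea concrete, here is a minimal toy UNet (purely illustrative, not the post's actual refiner): features computed at full resolution are concatenated back in after upsampling, so the output head sees both processed and original-resolution detail.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal UNet sketch: down, middle, up, with one skip connection."""

    def __init__(self, ch: int = 32):
        super().__init__()
        self.inp = nn.Conv2d(1, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
        # After the skip concat, channel count doubles back to 2 * ch.
        self.out = nn.Conv2d(ch * 2, 1, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        h0 = self.act(self.inp(x))     # full-resolution features
        h = self.act(self.down(h0))    # downsample 2x
        h = self.act(self.mid(h))
        h = self.act(self.up(h))       # upsample back
        h = torch.cat([h, h0], dim=1)  # skip connection restores detail
        return self.out(h)
```

Note how `h0` jumps straight from input to output: exactly the shortcut you want for refinement and exactly the one you can't allow across a compression bottleneck.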
MQGAN becomes MQGAN Refiner
Therefore, I just bolted on a small UNet as a refiner. It takes the ResNet-reconstructed spectrogram, concatenated with some hidden channels from the decoder's last hidden states (all detached), and outputs a residual for the reconstructed spectrogram. During training, the model outputs both the decoder reconstruction and the refined spectrogram.
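The hookup can be sketched in a few lines (function and argument names are mine; the UNet's input channels would match the spectrogram plus however many hidden channels the decoder exposes):

```python
import torch
import torch.nn as nn

def refine(decoder_out, decoder_hidden, unet: nn.Module):
    """Refiner pass: UNet sees the detached ResNet reconstruction plus
    detached decoder hidden states, and predicts a residual on top."""
    # Stop-gradient: nothing downstream can backprop into the base model.
    x = torch.cat([decoder_out.detach(), decoder_hidden.detach()], dim=1)
    residual = unet(x)
    return decoder_out.detach() + residual
```

Predicting a residual rather than a fresh spectrogram means the refiner only has to learn the detail the draft is missing, not the whole reconstruction.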
Now, instead of applying GAN losses to the decoder output itself, I only apply a simple MSE loss to it, and apply recon + GAN losses to the refined output. Since I apply the stop-gradient operator (.detach() in PyTorch) to the refiner's input, GAN gradients (should) never hit the base model itself. This provides a clear separation of tasks:
The ResNet encoder-decoder focuses on compressing the spectrogram and producing a plausible reconstruction, a draft of sorts if you will
The UNet refiner focuses on adding enough detail to fool the discriminators
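The loss split above can be sketched like this (loss weights omitted, and the hinge-style generator loss is my assumption; the post doesn't pin down the exact GAN loss):

```python
import torch
import torch.nn.functional as F

def generator_losses(mel, decoder_out, refined, disc_logits_refined):
    """Loss split: plain MSE on the decoder draft; reconstruction plus
    adversarial loss only on the refined output."""
    # Base model: simple MSE keeps the draft close to the target.
    base_loss = F.mse_loss(decoder_out, mel)
    # Refiner: reconstruction plus a hinge-style GAN generator loss
    # (one common choice, assumed here for illustration).
    recon_loss = F.mse_loss(refined, mel)
    gan_loss = -disc_logits_refined.mean()
    return base_loss + recon_loss + gan_loss
```

Because `refined` is built from a detached draft, `base_loss` is the only term that can update the encoder-decoder; the recon and GAN terms only shape the refiner.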
And it works quite beautifully. These results come after 9 epochs of generator pretraining and only one epoch so far of GAN training. And in case it wasn't clear enough: these two models are being trained jointly. No two-stage annoyance here, no sir/ma'am.
The model has, in total, 30M params in the generator, of which 11M belong to the UNet. All of this is being done on a single AMD Instinct MI300X, thanks to Hot Aisle.
This is a night and day difference compared to my previous design. You have to actually look closely to spot differences between the ground truth and post-refiner.
Note: The examples for the previous and current model were made using different mel scales, but this is irrelevant, since model performance is unaffected by the range.
Closing Statement
So yeah, what can I say? ResNet + UNet = god-tier VQGAN! This theoretically isn’t limited to spectrograms: you could build any image GAN with this design, like a proper picture VQGAN or even diffusion-free image generation from text.
If there are no catches, next step on the agenda is to make it causal someday.
This is still very WIP, but check out the code (and one day, pretrained models) here.
Once again, shoutout to Hot Aisle and AI at AMD for the compute!!