MQGAN Refiner
Combining the strengths of two networks to make the best possible model.
Note: This is a sequel to this model I made, although I’ll provide a quick TL;DR:
I made a model that uses a ResNet encoder-decoder and two discriminators to adversarially vector-quantize mel spectrograms with Finite Scalar Quantization (FSQ). The discriminators helped, but the reconstructions still suffered from too much oversmoothing.
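For context, FSQ replaces a learned codebook with a fixed grid: each latent channel is bounded and rounded, with a straight-through estimator so gradients still flow. A minimal sketch (the level count and bounding are illustrative, not the exact configuration from the original model):

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """Finite Scalar Quantization: bound each latent channel, round it to a
    fixed grid of `levels` values, and pass gradients straight through."""
    # Bound the latent to (-1, 1) so the grid covers a fixed range.
    z = torch.tanh(z)
    # Round to `levels` evenly spaced values per dimension.
    half = (levels - 1) / 2
    z_q = torch.round(z * half) / half
    # Straight-through estimator: forward uses z_q, backward sees identity.
    return z + (z_q - z).detach()
```

No codebook, no commitment loss, no EMA updates: the "codebook" is implicit in the rounding grid.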
Architectural Review
The original model used a ResNet-style encoder-decoder to reconstruct mel spectrograms from FSQ codes (I assume you, dear reader, already know what a ResNet is). Solid, proven, and simple enough. However, as I found out, ResNets aren't exactly the best at pure image tasks.
So, what is good at image tasks? As Stable Diffusion proved: the UNet.
Originally designed for biomedical segmentation, UNets excel at modifying images while preserving their overall structure — which makes them a perfect match for refining blurry spectrograms.
A crucial component is the skip connections, which let the model recover detail that would otherwise be lost in downsampling. Those same skips make the UNet inadequate for compression, though: in an autoencoder the encoder and decoder must be independent, and skip connections would let information bypass the bottleneck entirely.
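To make the skip-connection idea concrete, here is a minimal toy UNet (purely illustrative, not the post's actual refiner): features computed at full resolution are concatenated back in after upsampling, so the output head sees both processed and original-resolution detail.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal UNet sketch: down, middle, up, with one skip connection."""

    def __init__(self, ch: int = 32):
        super().__init__()
        self.inp = nn.Conv2d(1, ch, 3, padding=1)
        self.down = nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)
        self.mid = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1)
        # After the skip concat, channel count doubles back to 2 * ch.
        self.out = nn.Conv2d(ch * 2, 1, 3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        h0 = self.act(self.inp(x))     # full-resolution features
        h = self.act(self.down(h0))    # downsample 2x
        h = self.act(self.mid(h))
        h = self.act(self.up(h))       # upsample back
        h = torch.cat([h, h0], dim=1)  # skip connection restores detail
        return self.out(h)
```

Note how `h0` jumps straight from input to output: exactly the shortcut you want for refinement and exactly the one you can't allow across a compression bottleneck.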
MQGAN becomes MQGAN Refiner
Therefore, I just bolted on a small UNet as a refiner. It takes the ResNet-reconstructed spectrogram, concatenated with some hidden channels from the decoder's last hidden states (all detached), and outputs a residual for the reconstructed spectrogram. During training, the model outputs both the decoder reconstruction and the refined spectrogram.
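The hookup can be sketched in a few lines (function and argument names are mine; the UNet's input channels would match the spectrogram plus however many hidden channels the decoder exposes):

```python
import torch
import torch.nn as nn

def refine(decoder_out, decoder_hidden, unet: nn.Module):
    """Refiner pass: UNet sees the detached ResNet reconstruction plus
    detached decoder hidden states, and predicts a residual on top."""
    # Stop-gradient: nothing downstream can backprop into the base model.
    x = torch.cat([decoder_out.detach(), decoder_hidden.detach()], dim=1)
    residual = unet(x)
    return decoder_out.detach() + residual
```

Predicting a residual rather than a fresh spectrogram means the refiner only has to learn the detail the draft is missing, not the whole reconstruction.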
Now, instead of applying GAN losses to the decoder output itself, I only apply a simple MSE loss to it, and apply recon + GAN losses to the refined output. Since I apply the stop-gradient operator (.detach() in PyTorch) to the refiner's input, GAN gradients (should) never hit the base model itself. This provides a clear separation of tasks:
The ResNet encoder-decoder focuses on compressing the spectrogram and producing a plausible reconstruction, a draft of sorts if you will
The UNet refiner focuses on adding enough detail to fool the discriminators
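The loss split above can be sketched like this (loss weights omitted, and the hinge-style generator loss is my assumption; the post doesn't pin down the exact GAN loss):

```python
import torch
import torch.nn.functional as F

def generator_losses(mel, decoder_out, refined, disc_logits_refined):
    """Loss split: plain MSE on the decoder draft; reconstruction plus
    adversarial loss only on the refined output."""
    # Base model: simple MSE keeps the draft close to the target.
    base_loss = F.mse_loss(decoder_out, mel)
    # Refiner: reconstruction plus a hinge-style GAN generator loss
    # (one common choice, assumed here for illustration).
    recon_loss = F.mse_loss(refined, mel)
    gan_loss = -disc_logits_refined.mean()
    return base_loss + recon_loss + gan_loss
```

Because `refined` is built from a detached draft, `base_loss` is the only term that can update the encoder-decoder; the recon and GAN terms only shape the refiner.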
And it works quite beautifully. These results come after 9 epochs of generator pretraining and only one epoch so far of GAN training. And in case it wasn't clear enough: these two models are being trained jointly. No two-stage annoyance here, no sir/ma'am.
The model has, in total, 30M params in the generator, of which 11M belong to the UNet. All of this is being done on a single AMD Instinct MI300X, thanks to Hot Aisle.
This is a night and day difference compared to my previous design. You have to actually look closely to spot differences between the ground truth and post-refiner.
Note: The examples for the previous and current model were made using different mel scales, but this is irrelevant, since model performance is unaffected by the range.
Closing Statement
So yeah, what can I say? ResNet + UNet = god-tier VQGAN! This theoretically isn’t limited to spectrograms: you could build any image GAN with this design, like a proper picture VQGAN or even diffusion-free image generation from text.
If there are no catches, next step on the agenda is to make it causal someday.
This is still very WIP, but check out the code (and one day, pretrained models) here.
Once again, shoutout to Hot Aisle and AI at AMD for the compute!!