Schema

Inspiration

Learning creative tools is hard. Mastering Figma takes months. Learning sound design can take years. And right now, AI is mostly being used to replace creative work rather than help people understand it.

Schema came from a simple frustration:

"I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes."

We wanted to build something that uses AI as a teacher — not to generate art for you, but to break down existing art so you can learn how it was made.


What it does

Schema takes a piece of creative work and reverse-engineers it into a step-by-step tutorial.

It supports two formats out of the box:

Visual art — Upload a screenshot, image, or video. Schema analyzes the layout and styling, converts it into HTML/CSS internally, and generates a guided Figma tutorial showing you how to recreate the design from scratch.

Sound design — Record a sample or hum a sound. Schema runs it through a neural network that estimates the synthesizer settings needed to recreate it, then walks you through how to build that sound on a synth.

The point isn't to hand you the output. It's to show you how it was made.


How we built it

Schema has two separate pipelines:

Visual pipeline — We use LLMs to decompose images into design primitives (layout, spacing, typography, color), then translate those into concrete Figma steps.
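The final translation step can be sketched roughly like this. The primitive schema and the step wording below are illustrative stand-ins, not our actual internal format:

```python
# Illustrative sketch: turning LLM-extracted design primitives into ordered
# tutorial steps. The schema and phrasing are hypothetical examples.

def primitives_to_steps(primitives):
    """Convert a dict of design primitives into plain-language Figma steps."""
    steps = []
    layout = primitives.get("layout", {})
    steps.append(
        f"Create a frame of {layout.get('width', 1440)}x{layout.get('height', 900)} px."
    )
    for block in primitives.get("blocks", []):
        steps.append(
            f"Add a {block['type']} at ({block['x']}, {block['y']}) "
            f"with fill {block.get('fill', '#FFFFFF')}."
        )
    typo = primitives.get("typography")
    if typo:
        steps.append(f"Set the text style to {typo['family']} {typo['size']}px.")
    return steps

demo = {
    "layout": {"width": 800, "height": 600},
    "blocks": [{"type": "rectangle", "x": 40, "y": 40, "fill": "#1E90FF"}],
    "typography": {"family": "Inter", "size": 16},
}
steps = primitives_to_steps(demo)
```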

Music pipeline — This is where most of the ML work went. The goal was to take a raw audio clip and figure out what knobs to turn on the Vital synthesizer to reproduce that sound.

We built a custom dataset generator that renders hundreds of thousands of random Vital presets into audio clips, recording every parameter value. The model learns to map spectrograms back to those parameters.
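The core dataset loop looks roughly like this. `render_patch` here is a toy two-oscillator stand-in for the real Vital rendering API, and the parameter names are made up for illustration:

```python
import numpy as np

# Sketch of the dataset generator: sample random parameters, render audio,
# and store the (spectrogram, parameters) pair as one training example.
# `render_patch` is a toy stand-in for the actual Vital renderer.

rng = np.random.default_rng(0)

def render_patch(params, sr=16000, dur=0.5):
    """Toy patch: two slightly detuned sines controlled by two parameters."""
    t = np.arange(int(sr * dur)) / sr
    f0 = 55.0 * 2 ** (params["pitch"] * 4)       # continuous pitch control
    detune = 1.0 + 0.01 * params["detune"]
    return 0.5 * (np.sin(2 * np.pi * f0 * t) + np.sin(2 * np.pi * f0 * detune * t))

def spectrogram(audio, n_fft=512, hop=128):
    """Windowed magnitude spectrogram the model trains on."""
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))

dataset = []
for _ in range(4):  # hundreds of thousands in the real pipeline
    params = {"pitch": rng.random(), "detune": rng.random()}
    dataset.append((spectrogram(render_patch(params)), params))
```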

Our V1 used a ResNet-18 encoder, but it couldn't capture the kind of spectral and temporal patterns that matter for sound. V2 replaced it with an Audio Spectrogram Transformer (AST), a pretrained audio model that understands frequency relationships much better.
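Part of why AST works well is that it treats the spectrogram like an image: it cuts it into fixed-size patches that become transformer tokens, so attention can relate any time-frequency region to any other. This numpy sketch shows only that patch-tokenization step (not the transformer itself), with illustrative sizes:

```python
import numpy as np

# Sketch of AST-style patch tokenization: a (freq, time) spectrogram is split
# into fixed-size patches, each flattened into one transformer token.

def patchify(spec, patch=16):
    """Split a (freq, time) spectrogram into flattened patch tokens."""
    f, t = spec.shape
    f, t = f - f % patch, t - t % patch            # trim to a multiple of patch
    spec = spec[:f, :t]
    patches = spec.reshape(f // patch, patch, t // patch, patch)
    return patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

spec = np.random.rand(128, 100)    # e.g. 128 mel bins, 100 time frames
tokens = patchify(spec)            # one 256-dim token per 16x16 patch
```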

The harder problem was modulation — Vital has a modulation matrix where LFOs, envelopes, and other sources can be routed to control any parameter. This matrix is huge (32 sources × 400+ destinations) and over 99% empty. Instead of trying to predict the whole thing, we borrowed an idea from object detection: a DETR-style transformer decoder that predicts a small set of active modulation connections, matched to ground truth using the Hungarian algorithm.
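Here is a toy version of the matching step. The connection fields and cost terms are simplified, and the brute-force search below (exact only for tiny sets) stands in for the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`:

```python
import itertools

# Simplified set matching: the decoder emits a fixed number of connection
# "slots", each predicting (source, destination, depth). We assign predicted
# slots to ground-truth connections so that total cost is minimal.

def connection_cost(pred, true):
    """Cost = mismatch on routing plus error on modulation depth."""
    routing = (pred["src"] != true["src"]) + (pred["dst"] != true["dst"])
    return routing + abs(pred["depth"] - true["depth"])

def match(preds, truths):
    """Brute-force minimal-cost assignment (Hungarian algorithm in practice)."""
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(len(preds)), len(truths)):
        cost = sum(connection_cost(preds[p], truths[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

preds = [
    {"src": "lfo1", "dst": "cutoff", "depth": 0.8},
    {"src": "env2", "dst": "pitch", "depth": 0.1},
    {"src": "lfo2", "dst": "pan", "depth": 0.0},
]
truths = [{"src": "env2", "dst": "pitch", "depth": 0.2},
          {"src": "lfo1", "dst": "cutoff", "depth": 0.7}]
assignment, cost = match(preds, truths)
```

In DETR-style training, slots left unmatched are supervised toward a "no connection" class, which is how a fixed set of slots copes with the extreme sparsity.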

At inference time, we also run CMA-ES (a derivative-free optimizer) to further refine the predicted parameters by directly comparing the rendered output to the target audio spectrally.
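A simplified sketch of that refinement loop, with a plain elitist evolution strategy standing in for CMA-ES and a one-parameter sine "synth" standing in for Vital. The loss is a log-magnitude spectral distance; all names and constants here are illustrative:

```python
import numpy as np

# Derivative-free refinement: render candidate parameters, score each by
# spectral distance to the target audio, and move toward the best candidates.
# A basic evolution strategy stands in for CMA-ES in this sketch.

def render(freq, sr=8000, dur=0.25):
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq * t)

def spectral_distance(a, b):
    """L2 distance between log-magnitude spectra."""
    fa, fb = np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b))
    return float(np.linalg.norm(np.log1p(fa) - np.log1p(fb)))

def refine(initial, target_audio, iters=40, pop=16, sigma=15.0, seed=0):
    rng = np.random.default_rng(seed)
    mean = initial
    best, best_score = initial, spectral_distance(render(initial), target_audio)
    for _ in range(iters):
        cands = mean + sigma * rng.standard_normal(pop)
        scores = np.array([spectral_distance(render(c), target_audio) for c in cands])
        order = np.argsort(scores)
        if scores[order[0]] < best_score:            # keep the best ever seen
            best, best_score = float(cands[order[0]]), float(scores[order[0]])
        mean = float(cands[order[: pop // 4]].mean())  # move toward the elite
        sigma *= 0.93                                  # shrink the search radius
    return best

target = render(440.0)                 # "true" sound
refined = refine(400.0, target)        # start from the network's rough guess
```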


Challenges we ran into

The music pipeline was by far the hardest part.

The parameter space is massive. Vital has over 700 controls. Encoding all of them properly — figuring out which are continuous vs. categorical, which ones are conditionally active, which ones crash the synth engine — took a lot of trial and error with the Vital rendering API.
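The encoding boils down to a per-parameter spec table. The entries below are made up for illustration (Vital's real list has 700+ parameters with ranges and conditional dependencies we discovered by trial and error):

```python
# Hypothetical parameter specs: continuous controls get a range and are
# normalized; categorical controls get a choice list and are one-hot encoded.

PARAM_SPECS = {
    "filter_cutoff": {"kind": "continuous", "lo": 8.0, "hi": 136.0},  # semitones
    "osc1_wave":     {"kind": "categorical", "choices": ["saw", "square", "wavetable"]},
    "filter_on":     {"kind": "categorical", "choices": [0, 1]},
}

def encode(name, value):
    """Map a raw parameter value to the model's target representation."""
    spec = PARAM_SPECS[name]
    if spec["kind"] == "continuous":
        lo, hi = spec["lo"], spec["hi"]
        return (value - lo) / (hi - lo)           # normalize to [0, 1]
    onehot = [0.0] * len(spec["choices"])
    onehot[spec["choices"].index(value)] = 1.0    # one-hot for categoricals
    return onehot
```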

Many-to-one ambiguity. Completely different parameter settings can produce the same sound. This means you can't just minimize parameter error — you need to think about perceptual similarity. We added spectral distance metrics to actually measure whether predictions sound right.
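A toy illustration of the problem: two "presets" whose parameters differ but which sound identical. Parameter-space error reports a big miss, while a magnitude-spectrum distance correctly reports zero:

```python
import numpy as np

# Many-to-one ambiguity in miniature: a pure phase offset changes a
# parameter but is inaudible for a single sine tone.

t = np.arange(2000) / 8000

def patch(freq, phase):
    return np.sin(2 * np.pi * freq * t + phase)

a = patch(440.0, 0.0)
b = patch(440.0, np.pi / 2)       # phase differs: same sound to the ear

param_error = abs(0.0 - np.pi / 2)                    # looks like a big miss
spec = lambda x: np.abs(np.fft.rfft(x))
spectral_error = np.linalg.norm(spec(a) - spec(b))    # magnitude spectra match
```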

Modulation sparsity. A typical preset uses maybe 3-5 modulation connections out of ~13,000 possible slots. Training a model on a 99.85% sparse matrix doesn't work well. Switching to set prediction with Hungarian matching was the breakthrough that made modulation learning feasible.

Dataset quality. Randomly sampling synth parameters produces a lot of garbage — silent patches, clipping, degenerate timbres. We had to build rejection sampling and musically meaningful modulation templates (common patterns like envelope→filter cutoff or LFO→wavetable position) to get training data that actually teaches useful things.
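The silence and clipping filters can be sketched like this, assuming `audio` is a rendered float waveform in [-1, 1]. The thresholds are illustrative, not our tuned values:

```python
import numpy as np

# Rejection-sampling filters for rendered patches: drop near-silent audio
# and audio where too many samples sit at the clip ceiling.

def is_usable(audio, silence_rms=1e-3, clip_level=0.999, max_clip_frac=0.01):
    rms = np.sqrt(np.mean(audio ** 2))
    if rms < silence_rms:                        # reject silent patches
        return False
    clip_frac = np.mean(np.abs(audio) >= clip_level)
    if clip_frac > max_clip_frac:                # reject heavily clipped audio
        return False
    return True
```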


Accomplishments that we're proud of

We built an end-to-end system that can listen to a sound and produce a loadable Vital preset file. No published work targets Vital specifically — most inverse synthesis research focuses on simpler synths with a handful of parameters. Handling 700+ parameters with modulation routing and wavetable selection is a meaningfully harder problem than what's been done before.

We're also proud of how the tutorial generation ties everything together. The model's predictions aren't just numbers — they get translated into plain-language instructions that explain what each setting does and why it matters for the sound you're trying to make.


What we learned

Sound design is a deceptively hard ML problem. Parameter-level accuracy and perceptual accuracy are very different things, and optimizing for the wrong one can lead you nowhere. We spent a lot of time learning about audio representations, spectral losses, and why naive approaches to sparse structured prediction don't work.

On the engineering side, we learned a lot about building custom dataset pipelines, working with synthesizer APIs that weren't designed for automation, and the importance of getting your data representation right before worrying about model architecture.


What's next for Schema

  • Expand into new creative domains — photography, videography, 3D modeling, animation
  • Improve real-time guidance and interactive tutorials
  • Partner with educators, creators, and learning platforms
