JavaScript DRMs are Stupid and Useless

A while back, I was browsing Reddit and came across a thread about hotaudio.net. For those unfamiliar, it’s a website developed by u/fermaw, the very same developer behind the ever-popular gwasi.com.

If neither of those websites rings a bell, then I need to welcome you to r/GoneWildAudio: an NSFW subreddit for ASMR. Stay and read, the ASMR is only part of this odd tale.

You see, not too long ago, Soundgasm, Mega, and a few others were quite popular for hosting these audios, but as ToS tightened and taboo topics got more taboo, other platforms popped up to fill the gap.

HotAudio is one of them, but in a different way. Their claim is offering DRM for ASMRtists—a rare thing in the ASMR space, let alone the NSFW ASMR space.

u/fermaw, the aforementioned developer, was bragging in that thread I mentioned earlier about coding a DRM and how he found it rather “fun” to do so.

I have no doubt it was fun, and believe me, this post is not meant to ridicule anyone or incite any form of hate, but I think calling it “DRM” is a little far-fetched.

Long before the days of Denuvo, the now-infamous game DRM, we knew that any such system living in the user’s accessible memory was vulnerable. So, we shifted to what we call today a Trusted Execution Environment (TEE).

I’d like to quote Microsoft here: “A Trusted Execution Environment is a segregated area of memory and CPU that’s protected from the rest of the CPU by using encryption. Any code outside that environment can’t read or tamper with the data in the TEE. Authorized code can manipulate the data inside the TEE.”

See what I’m getting at? JavaScript code is fundamentally a “userland” thing. The code you ship is accessible to the user to modify and fuck about with however they wish.

This is the problem with u/fermaw’s “DRM.” No matter how many clever keys, nonces, and encrypted file formats he attempts to send to the user, eventually, the very same JavaScript code will need to exit his decryption logic and—whoops—it goes plain Jane into digital and straight to the speakers.

On Elephants in the Room: Trusted Execution Environments

Before we get into the code, we need to understand why this was always going to end in a bloodbath. The entire history of DRM is, at its core, a history of trying to give someone a locked box while simultaneously handing them the fucking key. The film and music industries have been losing this battle since the first CSS-encrypted DVD was cracked in 1999.

The modern, professional answer to this problem is the Trusted Execution Environment, or TEE.

As quoted above, a TEE is a hardware-backed secure area of the main processor (like ARM TrustZone or Intel SGX). Technically speaking, the TEE is just the hardware fortress (exceptions exist, like TrustZone), whilst a Content Decryption Module (CDM) such as Google’s Widevine, Apple’s FairPlay, or Microsoft’s PlayReady uses the TEE to ensure cryptographic keys and decrypted media buffers are never exposed to the host operating system, let alone the user’s browser. For the purposes of this article, I may at times refer to them interchangeably, but all you need to know is that they work together and, in any case, the host OS can’t whiff any of their farts, so to speak.

However, getting a Widevine licence requires a licensing agreement with Google. It requires native binary integration. It requires infrastructure, legal paperwork, not to mention, shitloads of money. A small NSFW audio hosting platform is not going to get a Widevine licence. They’d be lucky if Google even returned their emails. Okay maybe not quite but the point is they’re not getting Widevine.

So what does HotAudio do then? Based on everything I could observe, they implement a custom JavaScript-based decryption scheme. The audio is served in an encrypted format chunked via the MediaSource Extensions (MSE) API and then the player fetches, decrypts, and feeds each chunk to the browser’s audio engine in real time. It’s a reasonable-ish approach for a small platform. It stops casual right-clickers. It stops people opening the network tab and downloading the raw response file, only to discover it won’t play. For most users, that friction is sufficient.

Unfortunately for HotAudio, every r/DataHoarder user worth their salt knows these types of websites don’t have proper blackbox DRMs so it’s only a matter of time before someone with a tool they crafted with spit and spite shows up.

That friction just doesn’t stop someone who understands exactly where the decrypted data has to appear.

The “PCM Boundary”: a Wannabe-DRM Graveyard

Let me introduce you to what I call the PCM boundary. PCM (Pulse-Code Modulation) is the raw, uncompressed digital audio format that eventually gets sent to your speakers. It’s the terminal endpoint of every audio pipeline, regardless of how aggressively encrypted the source was.

graph TD
    Server[HotAudio Server] -->|Sends Encrypted audio chunks| JS[JavaScript Player]
    JS -->|Decrypts using proprietary logic| DecryptedData([Decrypted Data])
    DecryptedData -->|Calls appendBuffer| Hook[Hook]
    
    Hook -.->|GOLDEN INTERCEPT| SavedAudio[(Captures Pristine Audio File)]
    
    Hook -->|Forwards genuine appendBuffer| MSE[MediaSource API]
    MSE -->|Feeds to codec decoder| Decoder[Browser Decoder]
    Decoder -->|PCM audio output| Speakers[Speakers]

For our purposes, we don’t even need to chase it all the way to raw PCM, which is a valid avenue, albeit one in the realm of WEBRips rather than de facto “downloaders.” We just need to find the last point in the pipeline where data is still accessible to JavaScript, and that point is the MediaSource Extensions API, specifically the SourceBuffer.appendBuffer() method.

In practice:

  1. Your JavaScript code creates a MediaSource object and attaches it to a <video> or <audio> element via a blob URL.
  2. You call mediaSource.addSourceBuffer(mimeType) to declare what codec format you’ll be feeding the buffer.
  3. You repeatedly call sourceBuffer.appendBuffer(data) to push chunks of (in our case, pre-decrypted) encoded audio data to the browser.
  4. The browser’s internal decoder handles the rest: decoding the codec, managing the playback timeline, and routing audio to the hardware.

Notice how by step 3, the moment HotAudio’s player calls appendBuffer, the data has already been decrypted by their JavaScript code. It has to be. The browser’s built-in AAC or Opus decoder doesn’t know a damn thing about HotAudio’s proprietary encryption scheme. It only speaks standard codecs. The decryption must happen in JavaScript before the data is handed to the browser.
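To make that concrete, here is a minimal sketch of the shape any JavaScript-side scheme is forced into. Everything in it is a stand-in: FakeSourceBuffer replaces the browser's SourceBuffer so the snippet runs outside a browser, and xorCipher is a toy repeating-key XOR, not HotAudio's actual cipher (which I am not reproducing here).

```javascript
// Stand-in for the browser's SourceBuffer: it only records what it is given.
class FakeSourceBuffer {
  constructor() {
    this.received = [];
  }
  appendBuffer(data) {
    this.received.push(new Uint8Array(data)); // copy of whatever was appended
  }
}

// Hypothetical cipher: a repeating-key XOR, NOT HotAudio's real scheme.
function xorCipher(bytes, key) {
  return bytes.map((b, i) => b ^ key[i % key.length]);
}

// What any JavaScript player must do: decrypt first, then append.
function playEncryptedChunk(sourceBuffer, encryptedChunk, key) {
  const plain = xorCipher(encryptedChunk, key);
  sourceBuffer.appendBuffer(plain); // plaintext crosses the boundary here
}

const key = Uint8Array.from([0x13, 0x37, 0x42]);
// First bytes of an MP4 "ftyp" box, used as sample plaintext.
const original = Uint8Array.from([0, 0, 0, 24, 102, 116, 121, 112]);
const encrypted = xorCipher(original, key);

const sb = new FakeSourceBuffer();
playEncryptedChunk(sb, encrypted, key);
```

Whatever the cipher is, the argument to appendBuffer is the original bytes again, because the browser's decoder would reject anything else.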

This means there is a golden moment: the exact instant between “HotAudio’s player finishes decrypting a chunk” and “that chunk is handed to the browser’s media engine.” If you can intercept appendBuffer at that instant, you receive every chunk in its pristine, fully decrypted state, on a silver fucking platter.

Anyways, that is the fundamental vulnerability that no amount of encryption-decryption pipeline sophistication can close. You can make the key as complicated as you like. You can rotate keys per session, per user, per chunk. But eventually, the data has to come out the other end in a form the browser can decode. And that moment is yours to intercept.

Now. Let’s talk about how this little war actually played out. Dramatised and Ribbed™ for your pleasure.

Act One: Smash and Grab (V1.0)

The first version of my extension was built on a simple observation: HotAudio’s player was exposing its active audio instance as a global variable. You could just type window.as into the browser console and there it was: the entire audio source object, sitting in the open like a wallet left on a park bench.

The approach had two parts. The extension would attempt to modify a JavaScript file that was always shipped with every request: nozzle.js.

Essentially, this specific block would be appended to the top of nozzle.js before the stream had even begun which would compromise the environment from the get go.

// 1. I first prepare a place to store the intercepted chunks
window.DECRYPTED_AUDIO_CHUNKS = [];
console.log('[HIJACK] Audio chunk collector is ready.');

// 2. Then hijack the function that receives the already-decrypted audio
const originalAppendBuffer = SourceBuffer.prototype.appendBuffer;
SourceBuffer.prototype.appendBuffer = function(data) {
    // 3. Save the copies
    window.DECRYPTED_AUDIO_CHUNKS.push(data);
    console.log(`[HIJACK] Captured a decrypted chunk. Size: ${data.byteLength} bytes. Total chunks: ${window.DECRYPTED_AUDIO_CHUNKS.length}`);

    // 4. Call the original function so the audio would still play
    return originalAppendBuffer.apply(this, arguments);
};

This is, without exaggeration, a client-side Man-in-the-Middle attack baked directly into the browser’s extension API. The site requests its player script; the extension intercepts that network request at the manifest level and silently substitutes its own poisoned version. HotAudio’s server never even knows.
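One caveat about the snippet above: push(data) stores a reference to whatever buffer the player passed in, not a copy. If a player reuses or zeroes a scratch buffer between appends, every stored "chunk" would end up aliasing the same memory. I don't know whether HotAudio's player does that, so treat the following as a defensive sketch only; copyChunk and concatChunks are hypothetical helper names, not part of the extension.

```javascript
// Copy a BufferSource (ArrayBuffer, typed array, or DataView) into a fresh Uint8Array.
function copyChunk(data) {
  if (ArrayBuffer.isView(data)) {
    // Respect the view's window into its underlying buffer.
    return new Uint8Array(data.buffer, data.byteOffset, data.byteLength).slice();
  }
  return new Uint8Array(data).slice();
}

// Join captured chunks into one contiguous byte array.
function concatChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.byteLength;
  }
  return out;
}

// Demo: a "player" that reuses one scratch buffer for every chunk.
const scratch = new Uint8Array(4);
const byReference = [];
const byCopy = [];
for (const value of [1, 2, 3]) {
  scratch.fill(value);
  byReference.push(scratch);       // all three entries alias the same memory
  byCopy.push(copyChunk(scratch)); // each entry is an independent snapshot
}
const stitched = concatChunks(byCopy);
```

In the hook itself that would amount to pushing copyChunk(data) instead of data, at the cost of one extra copy per chunk.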

Once the hook was in place, the automation script grabbed window.as.el, muted it, slammed the playback rate to 16 (can’t go faster since that is the maximum Chrome supports), and sat back as the browser frantically decoded and fed chunks into the collection array. When the ended event fired, the chunks were stitched together with new Blob() and downloaded as an .m4a file.

Checkmate-ish.

The First Counter-Attack

Of course, this was a patch war. According to various Reddit threads and GitHub Issues, fermaw is known for patrolling subreddits and Issues looking for ways in which devs have attempted bypasses in order to patch them.

It was only a matter of time. Indeed by week two of the extension’s public release on GitHub, he had patched the vulnerability.

First, he stopped exposing his player instance as a predictable global variable. He wrapped his initialisation code tightly so that window.as no longer pointed to anything useful. Without the player reference, my automation script had nothing to grab, nothing to control, nowhere to start.

Second, and more cleverly: he implemented a hash verification check on nozzle.js. The exact implementation could have been Subresource Integrity (SRI), a custom self-hashing routine, or a server-side nonce system, but the effect was the same. When the browser (or the application itself) loaded the script, it compared the file it actually received against a canonical hash, and if the check failed, the player would never initialise.
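Since I never saw which of those mechanisms was actually in use, the following is only a sketch of the general idea behind a self-hashing check. The fnv1a and verifyScript names are mine, and FNV-1a is a tiny non-cryptographic hash chosen so the example stays short; a real deployment would presumably use SHA-256 or SHA-384.

```javascript
// FNV-1a over UTF-16 code units: a stand-in hash for illustration only.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, '0');
}

// Hypothetical check: does the script we loaded match the published hash?
function verifyScript(source, expectedHash) {
  return fnv1a(source) === expectedHash;
}

const shippedScript = 'function initPlayer() { /* player code */ }';
const canonicalHash = fnv1a(shippedScript); // published by the server in this sketch

// What a V1-style extension does: prepend a hook to the shipped file.
const tamperedScript = 'window.DECRYPTED_AUDIO_CHUNKS = [];\n' + shippedScript;

const untouchedOk = verifyScript(shippedScript, canonicalHash);
const tamperedOk = verifyScript(tamperedScript, canonicalHash);
```

Prepending even one line changes the digest, which is why rewriting nozzle.js in transit stopped being viable and the hooks had to move into the page environment instead.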

This effectively meant the old method was dead.

Act Two: Traps and Dicks. Synonyms and Subs-titutes.

Fermaw’s In-Memory Defences

I suppose at this point, fermaw assumed he was dealing with someone who wasn’t going to just fuck off. And I wasn’t. It was as fun for me to try and beat as it was for him to develop.

His response was to implement anti-tamper checks at the JavaScript level. Specifically, he started inspecting his own critical functions using .toString().

This is a well-known browser security technique. In JavaScript, calling .toString() on a native browser function returns "function appendBuffer() { [native code] }". Calling it on a JavaScript function returns the actual source code. So if your appendBuffer has been monkey-patched, .toString() will betray you; it’ll return the attacker’s JavaScript source instead of the expected native code string.

Fermaw added checks along the lines of:

if (
  !SourceBuffer.prototype.appendBuffer.toString().includes('[native code]')
) {
  // We've been tampered with. Refuse to play.
  throw new Error('Integrity check failed.');
}
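The behaviour that check relies on is easy to reproduce outside a browser. In this sketch, Array.prototype.push stands in for SourceBuffer.prototype.appendBuffer (which only exists in browsers), and the patch is applied to a throwaway object rather than the real prototype.

```javascript
// True when a function's own toString() output claims to be native.
const looksNative = (fn) => fn.toString().includes('[native code]');

const nativeBefore = looksNative(Array.prototype.push);

// A naive monkey-patch on a throwaway object.
const target = { push: Array.prototype.push };
const originalPush = target.push;
target.push = function (...args) {
  // attacker logic would go here
  return originalPush.apply(this, args);
};

// The wrapper is ordinary JavaScript, so its source text is returned instead.
const patchedLooksNative = looksNative(target.push);
```

The unpatched built-in reports native code, while the wrapper reports its own source, which is exactly the tell the integrity check looks for.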

Fermaw also, it seems, started obfuscating and scrambling how his player was initialised, making the AudioSource class harder to find via the polling loop. The constructor hijack became unreliable.

Enter, the Omni-Trap.

My technique had changed at this point. Since he was trying multiple things, well, I had to as well.

First: mockToString — The Lie That Defeats The Check

The single most important addition in V2 was a function to make my hooked methods lie about what they are:

function mockToString(target, name) {
  try {
    target.toString = function () {
      return `function ${name}() { [native code] }`;
    };
  } catch (e) {
    console.warn('[Hotaudio] Failed to mock toString', e);
  }
}

After hooking any function, I immediately called mockToString on it. From that point on, if fermaw’s integrity check asked .toString() whether appendBuffer was native, it would receive the pristine, authentic-looking answer: function appendBuffer() { [native code] }. Basically, it’s like asking your ex if they cheated on you and they did but they say they didn’t and you take their word for it because reasons. Don’t worry, on écoute et on ne juge pas.

Fermaw’s anti-tamper check was now returning a false negative. The enemy’s spy was wearing his uniform.
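A small sketch of what that lie does and does not cover. The hooked function here is a placeholder, not the extension's real hook. The point it illustrates is that overriding toString as an own property fools a check that asks the function itself, while a check that goes through Function.prototype.toString.call still sees the real source.

```javascript
function mockToString(target, name) {
  target.toString = function () {
    return `function ${name}() { [native code] }`;
  };
}

const hooked = function appendBuffer(data) {
  return data; // placeholder body for the sketch
};
mockToString(hooked, 'appendBuffer');

// The kind of check shown earlier: asks the function itself.
const naiveCheckPasses = hooked.toString().includes('[native code]');

// A stricter check: bypass the own-property override entirely.
const strictView = Function.prototype.toString.call(hooked);
const strictCheckPasses = strictView.includes('[native code]');
```

Countering the stricter variant would mean hooking Function.prototype.toString as well, which none of the snippets in this post do.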

Second: Ambushing HTMLMediaElement.prototype.play

I gave up entirely on finding the player by name. Instead of looking for window.as or window.AudioSource, I simply staked out the exit. I hooked the most generic, lowest-level method available:

const originalPlay = HTMLMediaElement.prototype.play;
HTMLMediaElement.prototype.play = function () {
  window.__ha_player = this;
  window.postMessage({ type: 'HOTAUDIO_PLAYER_READY' }, '*');
  return originalPlay.apply(this, arguments);
};
mockToString(HTMLMediaElement.prototype.play, 'play');

The logic is fairly simple: I don’t give a shit what you name your player object. I don’t care how deeply you bury it in a closure. I don’t care what class you instantiate it from. At some point, you have to call .play(). And when you do, I’ll be waiting.

I was confident in that approach because you would not call multiple .play()s on the same page to lead a reverse engineer astray. Why? Because mobile devices, typically speaking, will pause every other player except one. If fermaw were to do that, it’d ruin the experience for mobile users even if desktop users would probably be fine. It also makes casting a bitch and a half. Even if you did manage to pepper them around, it would be fairly easy to listen in on all of them and then programmatically pick out the one with actually consistent data being piped out.

Now then, the moment HotAudio’s player commanded the browser to begin playback, the hook snapped shut. The audio element, this, was grabbed and stored. mockToString ensured the hook was invisible to integrity checks.

Third: Keep it Untouchable (Object.defineProperty)

When hijacking the Audio constructor, I also used Object.defineProperty with a specific, paranoid configuration:

Object.defineProperty(window, 'Audio', {
  value: HijackedAudio,
  writable: false,
  configurable: false,
});

writable: false means no code can reassign window.Audio to a different value. configurable: false means no code can even call Object.defineProperty again to change those settings. If fermaw’s initialisation code tried to restore the original Audio constructor (a perfectly sensible defensive move), the assignment would either fail silently or, in strict mode, throw a TypeError. The hook was permanent for the lifetime of the page.
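The locking behaviour can be checked on a plain object. In this sketch fakeWindow and HijackedAudio are placeholders for the real window and the extension's constructor, and the demo runs in strict mode so that both restoration attempts surface as exceptions rather than failing quietly.

```javascript
function demoLockedProperty() {
  'use strict';
  const fakeWindow = {};
  const HijackedAudio = function HijackedAudio() {};

  Object.defineProperty(fakeWindow, 'Audio', {
    value: HijackedAudio,
    writable: false,
    configurable: false,
  });

  const errors = [];
  try {
    fakeWindow.Audio = function RestoredAudio() {}; // plain reassignment
  } catch (e) {
    errors.push(e.constructor.name);
  }
  try {
    Object.defineProperty(fakeWindow, 'Audio', { value: null }); // redefinition
  } catch (e) {
    errors.push(e.constructor.name);
  }
  return { errors, stillHijacked: fakeWindow.Audio === HijackedAudio };
}

const result = demoLockedProperty();
```

Both attempts are rejected and the property still points at the hijacked constructor afterwards.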

Act Three: Choking on Natives (V3.0)

Iframes and the Shadow DOM

By this point, fermaw understood that his player instance was being ambushed whenever it called .play(). He tried to isolate the player from the main window context entirely.

The two primary techniques at his disposal were iframes and Shadow DOM.

An <iframe> creates a completely separate browsing context with its own window object, its own document, and most importantly, its own prototype chain. A function hooked on HTMLMediaElement.prototype in the parent window is not the same object as HTMLMediaElement.prototype in the iframe’s window. They’re entirely separate objects. If fermaw’s audio element lived inside an iframe, my prototype hook in the parent window would never fire.

Shadow DOM is a web component feature that lets you attach an isolated DOM subtree to any HTML element, hidden from the main document’s standard queries. A querySelector('audio') on the main document cannot see inside a Shadow Root unless you specifically traverse into it. If fermaw’s player was mounted inside a Shadow Root, basic DOM searches would come up empty.

On top of this, fermaw was likely switching to assigning audio sources via srcObject rather than the src attribute. srcObject accepts a MediaStream or MediaSource object directly, bypassing the standard URL assignment path that’s easier to intercept.

V3.0 — Hooks, Crooks, and Nooks

My response was to abandon trying to intercept at the level of individual elements and instead intercept at the level of the browser’s own property descriptors. I went straight for HTMLMediaElement.prototype with Object.getOwnPropertyDescriptor, hooking the native src and srcObject setters before any page code could run:

// Hook the 'src' property setter on ALL media elements, forever
try {
  const srcDesc = Object.getOwnPropertyDescriptor(
    HTMLMediaElement.prototype,
    'src',
  );
  if (srcDesc && srcDesc.set) {
    const origSet = srcDesc.set;
    const hookedSet = function (v) {
      capturePlayer(this);
      return _call.call(origSet, this, v);
    };
    spoof(hookedSet, origSet);
    _defineProperty(HTMLMediaElement.prototype, 'src', {
      ...srcDesc,
      set: hookedSet,
    });
  }
} catch (e) {}

// Same for 'srcObject', which catches MediaSource-based streams
try {
  const srcObjDesc = Object.getOwnPropertyDescriptor(
    HTMLMediaElement.prototype,
    'srcObject',
  );
  if (srcObjDesc && srcObjDesc.set) {
    const origSet = srcObjDesc.set;
    const hookedSet = function (v) {
      capturePlayer(this);
      return _call.call(origSet, this, v);
    };
    spoof(hookedSet, origSet);
    _defineProperty(HTMLMediaElement.prototype, 'srcObject', {
      ...srcObjDesc,
      set: hookedSet,
    });
  }
} catch (e) {}

HTMLMediaElement.prototype is the browser’s own internal prototype for all <audio> and <video> elements. By redefining the property descriptor for src and srcObject on this prototype, I ensured that regardless of where the audio element lives (whether it’s in the main document, inside a Shadow Root, or buried inside a web component), the moment any source is assigned to it, the hook fires. The element cannot receive audio without announcing itself.

Even if fermaw’s code lives in an iframe with its own HTMLMediaElement, the prototype hookery via document_start injection means my hooks are installed before the iframe’s own scripts can even initialise, provided the content script is injected into that frame as well.
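For anyone who wants to poke at the descriptor trick without a browser, here is a sketch against a stand-in class. FakeMediaElement mimics the one detail that matters (src is an accessor on the prototype, as DOM attributes are), and capturePlayer is a hypothetical stub for the helper the snippets above call but never show.

```javascript
// Stand-in for HTMLMediaElement with 'src' as a prototype accessor.
class FakeMediaElement {
  #src = '';
  get src() {
    return this.#src;
  }
  set src(v) {
    this.#src = String(v);
  }
}

const captured = new Set();
const capturePlayer = (el) => captured.add(el); // hypothetical helper

// Wrap the original setter and put the wrapper back on the prototype.
const desc = Object.getOwnPropertyDescriptor(FakeMediaElement.prototype, 'src');
const origSet = desc.set;
Object.defineProperty(FakeMediaElement.prototype, 'src', {
  ...desc,
  set(v) {
    capturePlayer(this);
    return origSet.call(this, v);
  },
});

// Any instance, however it was created or hidden, assigns through the prototype.
const el = new FakeMediaElement();
el.src = 'blob:example';
```

The element is captured the moment a source is assigned, and the original setter still does its job, so playback logic is unaffected.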

But the triumphance of V3 is in the addSourceBuffer hook which solves a subtle problem. In earlier versions, hooking SourceBuffer.prototype.appendBuffer at the prototype level had a vulnerability in that if fermaw’s player cached a direct reference to appendBuffer before the hook was installed (i.e., const myAppend = sourceBuffer.appendBuffer; myAppend.call(sb, data)), the hook would never fire. The player would bypass the prototype entirely and call the original native function through its cached reference.

try {
  const MS = window.MediaSource || window.ManagedMediaSource;
  if (MS && MS.prototype) {
    const origAddSB = MS.prototype.addSourceBuffer;

    const hookedAddSB = function addSourceBuffer(mimeType) {
      const sb = _apply.call(origAddSB, this, arguments);

      // Hook the SourceBuffer INSTANCE immediately,
      // before anyone else can cache a reference to appendBuffer
      try {
        const origAppend = sb.appendBuffer;
        const hookedAppend = function appendBuffer(data) {
          try {
            _chunks.push(data);
          } catch (e) {}
          return _apply.call(origAppend, this, arguments);
        };

        spoof(hookedAppend, origAppend);

        _defineProperty(sb, 'appendBuffer', {
          value: hookedAppend,
          writable: true,
          configurable: true,
        });
      } catch (e) {}

      return sb;
    };

    spoof(hookedAddSB, origAddSB);
    MS.prototype.addSourceBuffer = hookedAddSB;
  }
} catch (e) {}

The V3 approach obliterates this race condition. By hooking addSourceBuffer at the MediaSource.prototype level, I intercept the creation of every SourceBuffer. The moment a buffer is created and returned, I immediately install a hooked appendBuffer directly on that specific instance, before any page code can even see the instance, let alone cache a reference to its methods. The hooked appendBuffer is installed as an own property of the instance, which takes precedence over the prototype chain. There is no window for fermaw to cache the original. The hook is always first.
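The own-property precedence is plain JavaScript and can be sketched with stand-in classes. FakeMediaSource and FakeSourceBuffer are placeholders for the browser types, and the "page code" lines are my guess at the earliest point a player could cache a reference.

```javascript
class FakeSourceBuffer {
  appendBuffer(data) {
    return 'native:' + data;
  }
}
class FakeMediaSource {
  addSourceBuffer(mimeType) {
    return new FakeSourceBuffer();
  }
}

const chunks = [];
const origAddSB = FakeMediaSource.prototype.addSourceBuffer;
FakeMediaSource.prototype.addSourceBuffer = function addSourceBuffer(...args) {
  const sb = origAddSB.apply(this, args);
  const origAppend = sb.appendBuffer; // resolved via the prototype
  Object.defineProperty(sb, 'appendBuffer', {
    value: function appendBuffer(data) {
      chunks.push(data);
      return origAppend.call(this, data);
    },
    writable: true,
    configurable: true,
  });
  return sb;
};

// "Page code": caches a reference as early as it possibly can.
const sb = new FakeMediaSource().addSourceBuffer('audio/mp4');
const cached = sb.appendBuffer;
const viaCache = cached.call(sb, 'chunk-1');

// Reaching for the prototype directly is NOT covered by the instance hook.
const viaPrototype = FakeSourceBuffer.prototype.appendBuffer.call(sb, 'chunk-2');
```

The cached reference is already the hooked function, so chunk-1 is captured. The last line shows the one path this sketch leaves open: a call through the prototype method itself skips the instance property, so in practice a prototype-level hook like the V1 one would need to sit alongside it.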

To catch any elements that somehow slipped through all of the above, I added capturing-phase event listeners as a belt-and-braces fallback:

// document_start means these listeners are in place before any element exists
document.addEventListener(
  'play',
  (e) => {
    if (e.target?.tagName === 'AUDIO' || e.target?.tagName === 'VIDEO')
      capturePlayer(e.target);
  },
  true,
); // <-- 'true' = capture phase, fires on the way DOWN the DOM tree

document.addEventListener(
  'loadedmetadata',
  (e) => {
    if (e.target?.tagName === 'AUDIO' || e.target?.tagName === 'VIDEO')
      capturePlayer(e.target);
  },
  true,
);

The true flag for useCapture is important. Browser events propagate in two phases: first, they travel down the DOM tree from the root to the target (capture phase), then they bubble up from the target back to the root (bubble phase). Media events such as play and loadedmetadata don’t bubble at all, so a capture-phase listener is the only way to catch them from document. By listening in the capture phase, my listener also fires before any event listener attached to the element by HotAudio’s player code. Even if fermaw tried to cancel or suppress the event, he’d be too late because the capturing listener always fires first.

The combination of all four layers (the addSourceBuffer hook at the MediaSource prototype level, the src and srcObject property descriptor hooks, the play() prototype hook, and the capture-phase event listeners) means there is, practically speaking, no architectural escape route left. The entire browser surface area through which a media element can receive and play audio has been covered. How fucking braggadocious of me to say that. I will be humbled in due time. That much is universal law.


Automation: Rinsing It in Seconds

With the capture hooks in place, the automation script handles the actual download process. The approach has been refined significantly across the three versions, but the core idea has remained fairly constant: trick the browser into buffering the entire audio track as fast as the hardware and network allow, rather than in real time.

The script grabs the captured audio element, mutes it, sets playbackRate to 16 (Chrome’s maximum), seeks to the beginning, and calls .play(). The browser, in its infinite eagerness to keep the buffer full ahead of the playback position, frantically fetches, decrypts, and feeds chunks into the SourceBuffer. Every single one of those chunks passes through the hooked appendBuffer and gets collected.

Tip

Worth noting here is that Chrome itself limits this to 16x. The HTML spec has no mandated cap, but since this is a Chromium extension, the constraint stands.

Of course, fermaw does have protections against this. For one, he aggressively throttles bursty traffic, meaning downloads can drop from a few hundred KB/s to 50-ish KB/s. Even so, it will in every case be several times faster than listening and recording.

Fermaw cannot realistically slow the stream down more than that, since doing so would stutter real traffic that merely has a download-y pattern. He could possibly enforce IP bans on traffic that displays such patterns, but that would risk blanket bans against users behind CGNAT. There are ways to get around it, but it prolongs the inevitable.

audioElement.currentTime = 0;
audioElement.playbackRate = 16;
audioElement.muted = true;
audioElement.play().catch((e) => {
  cleanup();
  updateStatus('ERROR: PLAY FAILED', -1);
});

V3 also added adaptive speed control. Rather than blindly holding at 16x, the script monitors the audio element’s buffered time ranges to assess buffer health. If the buffer ahead of the playback position is shrinking (meaning the network can’t keep up with the decode speed), the playback rate is reduced to give the fetcher time to catch up. If the buffer is healthy and growing, the rate is nudged back up. This prevents the browser from stalling entirely on slow connections, which would previously break the ended event trigger and leave you waiting forever.

const monitorBufferHealth = () => {
  if (audioElement.paused || audioElement.ended || downloadTriggered) return;
  if (audioElement.buffered.length > 0) {
    const current = audioElement.currentTime;
    let bufferedEnd = 0;
    for (let i = 0; i < audioElement.buffered.length; i++) {
      if (
        audioElement.buffered.start(i) <= current &&
        audioElement.buffered.end(i) > current
      ) {
        bufferedEnd = audioElement.buffered.end(i);
        break;
      }
    }
    const bufferAhead = bufferedEnd - current;
    if (bufferAhead > 15) {
      setSpeed(currentSpeed + 0.5); // Comfortable, push faster
    } else if (bufferAhead < 3 && currentSpeed > 2) {
      setSpeed(currentSpeed - 1.0); // Starvation, back off
    }
  }
};
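The monitor above leans on currentSpeed and setSpeed, which aren't shown. As a rough sketch of the decision on its own, here is a pure function with the same thresholds plus clamping. The nextSpeed name and the MIN_SPEED floor are my assumptions; only the 16x ceiling and the 15 s / 3 s thresholds come from the code above.

```javascript
const MIN_SPEED = 1;  // assumed floor; the monitor only guards with "> 2"
const MAX_SPEED = 16; // Chrome's playbackRate ceiling

// Given the current rate and seconds of audio buffered ahead, pick the next rate.
function nextSpeed(currentSpeed, bufferAhead) {
  let speed = currentSpeed;
  if (bufferAhead > 15) {
    speed += 0.5; // comfortable, push faster
  } else if (bufferAhead < 3 && currentSpeed > 2) {
    speed -= 1.0; // starvation, back off
  }
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

const examples = [
  nextSpeed(8, 20),  // healthy buffer
  nextSpeed(16, 20), // already at the ceiling
  nextSpeed(8, 1),   // starving
  nextSpeed(2, 1),   // starving, but at the guard value
  nextSpeed(8, 10),  // in between: hold steady
];
```

Keeping the decision pure makes it easy to reason about separately from the timers and the audio element; setSpeed would then only need to assign playbackRate.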

When the track ends (detected either via the ended event or via the stall watcher noticing currentTime approaching duration), the collected chunks are stitched together:

const blob = new Blob(chunks, { type: 'audio/mp4' });
const url = URL.createObjectURL(blob);
const a = document.createElement('a');
a.href = url;
a.download = `${filename}.m4a`;
a.style.display = 'none';
document.body.appendChild(a);
a.click();

There is a minor artefact in the final file. The stitched .m4a sometimes contains silent padding at the start or end from incomplete chunks at buffer boundaries. A quick ffmpeg pass fixes it cleanly:

ffmpeg \
  -err_detect ignore_err \
  -fflags +genpts+discardcorrupt+igndts \
  -analyzeduration 500M \
  -probesize 500M \
  -i input.m4a \
  -af "aresample=async=1:first_pts=0,aformat=sample_fmts=fltp:\
sample_rates=44100:channel_layouts=stereo,silenceremove=start_periods=1:\
start_threshold=-90dB,loudnorm" \
  -c:a libmp3lame \
  -q:a 2 \
  fixed.mp3

The Technical Footnote: Why the spoof() Function is Different in V3

Both V2 and V3 carry a toString-spoofing helper: mockToString in V2, spoof in V3. But the V3 implementation is subtly more robust than the V2 one, and it’s worth examining why.

V2’s version was straightforward:

function mockToString(target, name) {
  target.toString = function () {
    return `function ${name}() { [native code] }`;
  };
}

This works, but it has a vulnerability: it hardcodes the native code string manually. If fermaw’s integrity check was especially paranoid and compared the spoofed string against the actual native code string retrieved from a trusted reference (say, by calling Function.prototype.toString.call(originalFunction) on a cached copy of the original), the manually crafted string might not match precisely, particularly across different browser versions or platforms where the exact whitespace or formatting of [native code] strings varies slightly.

I tried to solve it somewhat elegantly:

function spoof(fake, original) {
  const str = _call.call(_toString, original);
  _defineProperty(fake, 'toString', {
    value: function () {
      return str;
    },
    writable: true,
    configurable: true,
  });
  return fake;
}

Instead of hardcoding the expected string, it captures the actual native code string from the original function before hooking it, then returns that exact string. This way, no matter what browser, no matter what platform, the spoofed toString returns precisely the same string that the original function would have returned. It is, in effect, a perfect forgery.

Also note the use of _call.call(_toString, original) rather than simply original.toString(). This is because original.toString might itself be hooked by the time spoof is called. By holding cached references to Function.prototype.call and Function.prototype.toString at the very beginning of the script (before any page code runs), and invoking them via those cached references, the spoof function is immune to any tampering that might have happened in the interim. It’s eating its own tail in the most delightful way.
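The cached references themselves never appear in the snippets, so here is a sketch of how they might be set up and why they matter. The three const lines are my assumption about what sits at the top of the script; the demo then simulates hostile page code replacing Function.prototype.toString and restores it afterwards.

```javascript
// Cache the primitives first, before any page code could replace them.
const _toString = Function.prototype.toString;
const _call = Function.prototype.call;
const _defineProperty = Object.defineProperty;

function spoof(fake, original) {
  const str = _call.call(_toString, original);
  _defineProperty(fake, 'toString', {
    value: function () {
      return str;
    },
    writable: true,
    configurable: true,
  });
  return fake;
}

function demo() {
  const original = Array.prototype.push; // stand-in for a native browser method
  const expected = _call.call(_toString, original);

  // Simulate page code tampering with the shared toString.
  Function.prototype.toString = function () {
    return 'tampered';
  };
  try {
    const hooked = spoof(function push() {}, original);
    return {
      expected,
      spoofed: hooked.toString(),    // goes through the own property
      uncached: original.toString(), // goes through the tampered prototype
    };
  } finally {
    Function.prototype.toString = _toString; // restore for the rest of the program
  }
}

const outcome = demo();
```

Even with the prototype method replaced, the spoofed string is still the genuine native one. One remaining soft spot in this sketch is that _call.call still looks up call at invocation time; caching Reflect.apply would be a stricter variant.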

Ethics, Grandstanding, Pretentiousness, and Playing Wise

DRM, as an industry institution, has an almost comically bad track record when it comes to actually protecting content. Denuvo, which is perhaps the most sophisticated game DRM ever deployed commercially, has been cracked for virtually every major game it’s protected, usually within weeks of release. Every DVD ever made is trivially rippable. Every Blu-ray. Every streaming service has been ripped by someone, somewhere.

Denuvo for a few years had gotten more successful with infamous crackers like Empress stepping down. Progress was slow and new releases came courtesy of voices38. However, it seems that the trend has reversed once more after a new wave of hypervisor style exploits leading to a flurry of new cracks for previously uncracked games. It only serves to prove my point. It’s an inevitability and while game DRMs arguably serve a different purpose compared to two-bit JS based DRMs on a fucking NSFW ASMR site, the point is, yet again, the same.

The reason is always the same: the content and the key that decrypts it are both present on the client’s machine. The user’s hardware decrypts the content to display it. The user’s hardware is, definitionally, something the user controls. Any sufficiently motivated person with the right tools can intercept the decrypted output.

For a small NSFW audio platform run by a solo developer, “true” blackbox DRMs running with TEEs are not a realistic option. Which brings me to the point I actually want to make:

The HotAudio DRM isn’t stupid because fermaw is stupid. It’s the best that JavaScript-based DRM can be. He implemented client-side decryption, chunked delivery, and active anti-tamper checks and for the vast majority of users, it absolutely works as friction. Someone who just wants to download an audio file and doesn’t know what a browser extension is will be stopped completely.

The problem is that calling it “DRM” sets expectations it simply cannot meet. Real DRM (you know, the kind that requires a motivated attacker to invest serious time and expertise to defeat) lives in hardware TEEs and requires commercial licensing. JavaScript DRM is not that. It’s sophisticated friction. And sophisticated friction, while valuable, is a completely different thing.

The question is whether any DRM serves ASMRtists well. Their audience is, by and large, not composed of sophisticated reverse engineers. The people who appreciate their work enough to want offline copies are, in many cases, their most dedicated fans. The kind who would also pay for a Patreon tier if one were offered. The people who would pirate the content regardless are not meaningfully slowed down by JavaScript DRM; they simply won’t bother and will move on to freely available content or… hunt down extensions that do the trick, I suppose.

I’m genuinely not convinced the DRM serves the creators it’s designed to protect. But I acknowledge that this is a harder conversation than just the technical one, and reasonable people can disagree.

Happy Endings

I got all the dopamine I needed from “reverse engineering” this “DRM.” I don’t imagine there’s any point continuing its development considering the fact that I have made my point abundantly clear even beyond this very article.

I hate DRM, I love FOSS, I love the very idea that the internet should be open and accessible.

Unfortunately, the Internet is no longer just a toy for the nerds amongst us. For many, it’s a source of income and a way to put food on the table. So I do understand that DRM is in turn a way for people to feel protected against “pirates” threatening their livelihoods. I don’t think it works the way it’s intended to work but I suppose I cannot fault fermaw for wanting to create a solution for the ASMRtists who felt they needed it.

Just… don’t do it with JavaScript ffs.

Until next time :)

All code samples represent recreations of actual extension versions or are taken directly from the open-source V1/V1.2 releases. The V3 hijack code shown is the actual production code. No HotAudio server infrastructure was accessed, modified, or interfered with at any point. All techniques demonstrated operated exclusively on the client side, within the user’s own browser.

Read More

Fuck Prompt Engineering. Sincerely, Ranty Dev.

Fuck Prompt Engineering. Sincerely, Ranty Dev.

Part One: The Anatomy of Slop Let me paint you a picture. You've just spent six hours writing a 3,000-word article about a niche technical topic, packed with code snippets, architectural diagrams in your head, and enough spite to make Ramsay Bolton blush, and now you need a hero image for the dev server preview. You're a dev-blogger; this is a normal Sunday evening for you. So you boot up your custom 900-line Node.js script to auto-gen that bitch up, right? No of course not, you're a lazy bastard like I am. (Until last week anyways). What you actually do is what every other lazy bastard does: you copy-paste the entire fucking article into ChatGPT or Nano Banana or whatever's trendy this week, add the magic words "minimalist, flat vector, no text, clean editorial style," and hit enter. What you get back is frankly what you deserve. Neon-soaked cyberpunk hacker hunched over a mechanical keyboard like it's 1999. Glowing green rain. Matrix code dripping down a monitor that looks like it was designed by someone who's only seen computers in films. And, my personal favourite, a screen filled with gibberish text that says something like "JVSCRPT DMMR" in a font that looks like it was assembled by a drunk Victorian typesetter.You said no text. It drew text anyway. Why? Tempting it may be to call the AI stupid. And it may very well be, but reality is often disappointing, is it not? The Transformer architecture is fundamentally an issue of Lady Whistledown, where every token talks to every other token, and you just fed it a novel-length scream. The failure mode is mathematical, predictable, and completely fucking avoidable if you know the thing I'm about to show you. Self-Attention: The Toxic Gossip Train I've garnered a bit of a reputation for referencing Attention Is All You Need. You know, that paper that kicked off this entire circus? In that very paper, Self-Attention was introduced. 
Every token looks at every other token simultaneously to figure out what the hell is going on. Mathematically, for $(n)$ tokens, you're doing: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ Where $(Q)$, $(K)$, and $(V)$ are your Query, Key, and Value matrices. At this point, softmax is not as important as complexity. Because every token chats with every other token, your memory and compute cost scale as $(O(n^2))$ with sequence length.FlashAttention is $(O(n^2))$ in compute/FLOPs but achieves $(O(n))$ memory through tiling and recomputation. It lowers the VRAM/SRAM footprint but computational complexity stays quadratic, which means the main constraint (at least for SOTA models running on enterprise hardware and data centres) is still compute.More important for our purposes: as the input gets longer, the model's "attention" gets spread thinner than Marmite on toast. It has to pay attention to everything. Your frontmatter YAML, your inline code blocks, your hyperlinks, your prose, your headers, your existential dread... until the actual visual signal you care about drowns in the fucking noise. Now, to be precise: when you feed a long prompt to a text-to-image pipeline, it doesn't hit the generation model raw. It goes through a text encoder first; CLIP, T5, whatever the flavour of the month is, which then produces fixed-size embeddings regardless of input length. The $(O(n^2))$ dilution happens there, in the encoder, as it struggles to compress your 3,000-word article into a meaningful embedding vector. The result is what I call "encoder saturation": a mushy, low-signal blob that the generation model then has to work from. The framing is the same. The specific culprit is the encoder stage. There's also more here because CLIP has a hard token limit of 77 including the start/end tokens. 
It makes the problem worse because depending on your inference provider's internal harness, they either throw away the rest of the prompt, or pass it through a shitty "distillation model" to make sense of your content just like I do, except said model is EXTRA SHITTY. They have traffic constraints to deal with, remember? When you feed a 3,000-word Markdown blog post into an image generator, you're not giving it a rich input. I made this point before when I talked about AGENTS.md files. The wisdom behind giving your agent rich context to work with is technically correct but prone to being overdone. You need only enough to prevent wrong assumptions or else you're giving it a flooded basement and asking it to find a specific teacup. The attention weights distribute across every token, the signal-to-noise ratio becomes total dogshit, and the output is chaotic because the input was a fucking mess. Cross-Attention: LALALA CAN'T HEAR YOU Self-attention explains why the output is noisy. Cross-Attention explains why it spells words like a drunk toddler. In text-to-image diffusion, your prompt gets chewed up by a text encoder (usually CLIP or a language model), producing embeddings. These embeddings then get fed into the image generation via Cross-Attention layers, where the image generation blocks "look at" the text to decide what to draw. Here's the problem: in the latent space of a model trained on the entire internet (including, regrettably, Twitter a.k.a X), the concept of "a blog post about JavaScript" is mathematically adjacent to the visual concept of "text on a screen." These ideas have high cosine similarity because that's how they appear in the training data. Articles about code are overwhelmingly represented as... text on screens. When you say "draw a hero image for this JavaScript article," the Cross-Attention layers pull on embeddings that are entangled with visual representations of... text. 
The model cannot easily disentangle "the topic of JavaScript" from "the visual of JavaScript" which is, in the training data, invariably a glowing IDE, a mechanical keyboard, and neon text. And oh God, the garbled text. That's the model generating "text-like pixel patterns" because it learned that "screen with content" correlates strongly with "shapes that look vaguely like letters." It doesn't know what an 'A' looks like. It knows what the idea of text looks like, and that's close enough for its purposes, even if it's nonsense. In data science, we call this Compositional Generalisation. If a model has Lexical Compositionality, it doesn't see a "man-with-a-hat" as one big blob of pixels instead it understands the relationship between them. When a model fails at it, we call it Attribute Binding failure (e.g., you ask for a "red hat and a blue shirt" and get a "blue hat and a red shirt"). This is the reason why when you tell the model "no text" in the negative, it helps, but is like asking a shark not to swim. That is, you're fighting against the gravitational pull of the entire training distribution by asking it to draw a concept while actively suppressing the most common visual representation of that concept. Circa 2022-2023, it was a dumpster fire with simple, short prompts including negative prompts. The "Pink Elephant" issue comes to mind. Nowadays, with larger corpora of data containing more metadata/descriptions of images and antonymic comparisons, we can safely insert negative prompts to some degree. Six-Fingers: Three Solutions This brings us to the elephant in the room. Or rather, the six-fingered hand in the room. If you were using AI image generators circa 2022–2023, you know the pain. Hands were fucked. Too many fingers. Too few fingers. Fingers that looked like they were trying to escape the hand. It became a cultural shorthand for AI-generated slop. "Look, six fingers" was the easiest way to spot the fraud. Or strange ways of consuming spaghetti. 
Hands are high-frequency, spatially complex, and appear in the training data in every possible pose, lighting condition, and occlusion state. Without the hands being contextualised properly in the training set's metadata, the model's learned representation of "hand" was a blurry average of all those cases, and when forced to render one at high resolution, it hedged its bets and grew extra digits. Interestingly, the six-finger problem has been largely "fixed" in modern models. The question is: how? The first mechanism is intrinsic behavioural change through training data and RLHF. The big labs identified the problem, weighted more anatomically correct hand data, and used Reinforcement Learning from Human Feedback to penalise deformed outputs. The fix is baked into the weights and the model doesn't follow a rule that says "draw five fingers." It has learned, at a parameter level, that five-fingered hands have higher reward signal than six-fingered ones. This is intrinsic because it lives inside the model. The second mechanism is a harness. External constraints applied at inference time, without touching the weights. A negative prompt is the simplest example. A pipeline of structured prompts is a more sophisticated one. The harness doesn't change what the model knows intrinsically. It changes the conditions under which the model generates, nudging it away from low-quality attractors. This is extrinsic and lives outside the model. The third mechanism is external multimodal simulation. Instead of asking one model to do everything, you decompose the task across multiple models, each handling a specific subtask it's genuinely suited for, and integrate the outputs. This is a harness taken to its logical conclusion. Instead of constraining a single model, we replicate the specialist-model decomposition that happens internally in a proper multimodal architecture. We do it externally, with independently swappable components. 
A human or an orchestrator script acts as the integrator. These three mechanisms are frequently discussed as though they were interchangeable, but they are not. They have different failure modes, different scopes, and they can interfere with each other in ways that are instructive. Which brings me to the hero image for this very article. The image at the top depicts a six-fingered hand transforming to a five-fingered hand. It's the visual metaphor for this entire argument. It's also, I can tell you from direct personal experience, an image that the current generation of production-grade models fucking refused to generate. The prompt was wholly unambiguous. The pipeline produced a clean, structured brief. The Art Director stage correctly identified the concept. The image generation stage received a precise, constrained prompt asking for a six-fingered hand as the input to a machine which is a fairly deliberate visual element, the entire fucking point of the illustration. And yet every modern model I tried (Nano Banana Pro, Imagen 4, SeeDream v4.5, Flux 2 Flex) ignored the six-finger instruction and drew a perfectly normal hand. The intrinsic RLHF fix baked into the weights was so aggressively applied that it overrode an explicit creative instruction. The model couldn't be told to draw six fingers, even as a deliberate metaphor, because its learned parameters treat "six fingers" as an error condition to be suppressed regardless of context. This is an overfitting problem. RLHF is extraordinarily effective at pushing model behaviour toward high-reward outputs. It is much less effective at teaching the model when a particular behaviour is appropriate. The model learned "six fingers = bad, suppress it." It did not learn "six fingers = bad in a realistic portrait, intentional in a satirical illustration." We call this "reward hacking." 
The model does not understand the true underlying quality it is being penalised or rewarded for leading to literal interpretations of the human feedback. The solution, then, was to lean into the third mechanism, external multimodal simulation at a component level. I found an older, less aggressively fine-tuned model that still had the original glitchy-hand failure mode intact. I used it to generate only the deformed hand. I then took the pipeline's clean, correct output i.e. the machine, the setting, the style, and then manually composited the glitchy hand from the legacy model into it in Photoshop. This is where the external simulation approach reveals its defining advantage over any unified architecture: composability. Because the pipeline is a collection of independently addressable components rather than a fused set of weights, I could swap one component for a version with different intrinsic behaviour, get precisely the output I needed from it, and slot it back in. A proper multimodal model gives you no such lever. The sub-models are fused. The weights are the weights. You get what the training run decided, and if the RLHF pass decided six-fingered hands are always wrong, you have no appeal. Talk about a dictator, eh? The hero image you see is a Frankenstein's monster. The competence is from the pipeline. The deliberate incompetence is from a model that was not yet "fixed." The stitch is human. And the ability to do any of that is a direct consequence of simulating the multimodal architecture externally rather than relying on one model to do everything. I am telling you this not as a charming anecdote, but because it is a near-perfect empirical demonstration of everything I am about to describe. The intrinsic RLHF fix worked so well that it blocked a naïve single-model approach. The external simulation approach routed around it by treating model versions as interchangeable components. That is the entire fucking argument, in one image. Right. Back to the maths. 
If you understand the architecture, the solution becomes obvious:You cannot give the image model a noisy, long-form input. The $(O(n^2))$ attention dilution in the encoder will cause it to lose focus. You need to compress the input to a dense, visually relevant signal before it reaches the image model. You cannot let the image model decide what "visual" means for a given concept. Its Cross-Attention will default to the most statistically probable visual representation in the training data, which for technical content is almost always "text on a glowing screen." You need to manually decouple the semantic concept from its default visual representation. Negative prompts are not sufficient. They shift the generation away from certain vector clusters in the latent space, but they cannot override strong associative priors and as we have just seen, they cannot override intrinsic RLHF behaviour at all. You need to provide a positively constrained, narrow target, not just a set of exclusions.This is why a single-prompt approach is a dead end when it comes to long-form data. You need a fucking pipeline. Specifically, you need to simulate, externally and explicitly, the kind of specialist decomposition that a proper multimodal architecture would do internally and you want to do it externally precisely because that gives you the ability to swap components when the model's intrinsic behaviour gets in your way. Part Two: Building a Digital Sweatshop Let's be precise about what we are building before a single line of code appears. The pipeline I am about to describe is an external multimodal simulation. It replicates, using independently addressable components wired together by a harness, the specialist-model decomposition that happens inside a proper multimodal architecture... except we're doing it on the outside, where we can see it, control it, and swap individual pieces out when they misbehave. A real multimodal model has sub-models fused together by training. 
You get what the training run produced, full stop. Our simulation has sub-models that are independently versioned, independently replaceable, and independently constrainable. The six-finger compositing in Part One is not a workaround. It is this property in action. The implementation mechanism is a harness: a system of external constraints applied at inference time that does not touch any model's weights, does not retrain anything, and produces no intrinsic behavioural change. The harness structures the conditions under which each specialist model is asked to generate, compensating for architectural limitations we cannot fix from the outside. This is both its greatest strength and its hard ceiling. The harness is model-agnostic, flexible, and deployable without access to training infrastructure. It is also, as we have already seen with the six-finger debacle, completely powerless against intrinsic behaviour baked into the weights by RLHF. When the harness meets a strong enough trained prior, the prior wins. Every fucking time. The answer, when that happens, is not to try harder with the harness. It is to swap the component. Keep that in mind as we go through each stage. Each one is doing a specific job, compensating for a specific architectural failure mode. For each one, I will be explicit about where the harness ends and where the model's intrinsic behaviour begins. Here is the full generate-hero-images.mjs pipeline at a high level before we get into each stage: const DISTILL_MODEL = "gemini-3-flash-preview"; // The cheap speed-reader. Speed = 100 TPS, Comprehension=NULL const BRIEF_MODEL = "gemini-2.5-pro"; // The pedantic art director who never made it onto Broadway const IMAGE_MODEL = "gemini-3.1-flash-image-preview"; // The volatile artist. Unlikely to start a war because we won't reject itThree models. Three distinct jobs: one external multimodal simulation.Stage One: The Speed-Reader (gemini-3-flash-preview) The first problem to solve is context dilution. 
A blog post is not a prompt. It is a document chock-full of structural noise, tangential prose, code snippets that are visually irrelevant, and MDX component tags that mean absolutely fuck-all to an image generator. This is the $(O(n^2))$ problem made practical. Feeding a 3,000-word article to an image generation pipeline is giving it a flooded input. The encoder saturates, the signal-to-noise ratio collapses, and the visual output is chaotic because the embedding was a mushy average of everything you wrote. The longer the sequence, the worse this gets. The simulation solves this by intercepting the input before it reaches the sensitive part of the pipeline. Stage One is a cheap, high-context model whose singular job is to read the entire article and compress it into a dense paragraph of raw visual facts without metaphors or interpretation. async function distillArticle(ai, content) { const cleanText = stripMdxNoise(content).slice(0, 25000); const prompt = [ "Analyze the following blog post text.", "Your goal: Extract the 'Visual Context' for a technical illustrator.", "Please identify:", "1. The Core Subject (What is explicitly happening?).", "2. Concrete Objects (List physical/UI things mentioned).", "3. The Vibe (e.g. 'Frustrated rant', 'Clean tutorial', 'Theoretical deep-dive').", "", "Output a dense paragraph of text summarizing these visual ingredients.", "Do not interpret. Do not use metaphors. Just facts.", "", `TEXT:\n${cleanText}`, ].join("\n"); const res = await ai.models.generateContent({ model: DISTILL_MODEL, contents: prompt, config: { temperature: 0.1 }, }); // ... }Several deliberate choices are worth calling out because I'm not just throwing shit at the wall here. The stripMdxNoise() function is pre-processing that happens before we touch the API at all. It strips MDX component syntax, replaces code blocks with [CODE BLOCK] placeholders, removes image tags, and discards link URLs. 
The model does not need to know what my <Callout> component looks like. It needs to know that the article is about bypassing a DRM by hooking the browser's MSE API. We clean the signal before it even enters the noisy channel. function stripMdxNoise(mdx) { return mdx .replace(/```[\s\S]*?```/g, " [CODE BLOCK] ") .replace(/!\[[\s\S]*?\]\([\s\S]*?\)/g, " [IMAGE] ") .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") .replace(/<[^>]+>/g, " ") .replace(/[`*_~]/g, " ") .trim(); }The 25,000-character cap is a practical concession to the very problem Stage One is solving. Even with a model that has a massive context window, a focused input produces a more focused output. This is the harness compensating for an architectural limitation it cannot fix. We cannot make the attention mechanism cheaper, but we can give it less to attend to. The temperature: 0.1 setting is the creativity knob turned almost all the way down. At 0.1, the model is nearly deterministic. We do not want creativity here. We want the model to read the article like a filing clerk on amphetamines and produce the same output reliably, every time, for the same input. For reference, Gemini models tend to do well at around 0.7-0.8 for coding, 0.8-0.9 for conversation, and 1.0 for nearly everything else. The output of Stage One for the DRM article looks something like this:A developer examines and defeats a JavaScript-based DRM system on an NSFW audio streaming platform. Key objects: browser developer tools, JavaScript source code, audio waveform, a web browser, a hex editor. Concrete actions: intercepting API calls, hooking prototype methods, reassembling audio chunks. Vibe: technical rant, cat-and-mouse engineering challenge.That is your article, compressed from 5,000 words of snark, code, and philosophy into a single paragraph of actionable visual information. The context dilution problem is solved at the harness level. We pass this into Stage Two. 
Stage Two: The Art Director (gemini-2.5-pro) We now have a dense, clean summary. But we still have a problem that compression alone cannot solve: if we feed even this clean summary to an image model with a vague instruction like "draw a minimalist hero image," the Cross-Attention layers will default to the most statistically probable visual representation of the concept in the training data. "Developer, browser, code" → glowing screen, keyboard, neon aesthetic. The intrinsic prior is still there, and a clean summary does not override it. The harness needs to go further. We need to manually decouple the semantic concepts from their default visual representations and do, by hand, the kind of structured decomposition that the model's Multi-Head Attention attempts but cannot reliably perform under freeform prompting. Recall from Part One: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$ $$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$ Different attention heads are supposed to learn to attend to different types of relationships: syntax, coreference, semantic similarity, all in parallel. The intuition is sound. The practice, under freeform prompting, is that the heads compete for salience, and the strongest trained prior tends to dominate. The harness solves this by doing the decomposition manually, upstream of the model, separating each visual dimension into a discrete, structured channel before it ever reaches the image generator. This is the job of Stage Two. We impose a constraint that is absolute: the model is only allowed to respond in strict JSON. It is not allowed to draw anything. It is not allowed to write prose. It is only allowed to fill out a fucking form. const schema = [ "Respond ONLY with valid JSON. 
No markdown fences, no prose.", "{", ' "primaryConcept": "6–12 words describing what to literally draw",', ' "motifs": ["0–2 supporting concrete objects max"],', ' "setting": "simple setting/background idea, 1 line",', ' "composition": "one line layout instruction",', ' "palette": ["3–5 colour names"],', ' "styleNotes": ["2–4 short style constraints"],', ' "negative": ["things to avoid"]', "}", ].join("\n");Each key in the schema is a manually separated attention dimension. Colour goes in palette. Layout goes in composition. Subject goes in primaryConcept. By forcing the semantic extraction into these isolated channels, we are doing the work that Multi-Head Attention would theoretically do automatically but doing it reliably, explicitly, and in a form that the image generator will receive as a pre-organised signal rather than a tangled mess. This is the crux of what external simulation can do that intrinsic training, at the current level of model capability, does not do consistently under freeform input. The model has the capacity to reason about colour, composition, and subject simultaneously. It does not consistently separate those concerns into clean, non-interfering outputs when given a freeform prompt. The simulation enforces the separation extrinsically, and because each stage is its own addressable component, we can tune or replace the Art Director independently of everything else. Here is the full prompt construction for Stage Two: async function generateVisualBrief(ai, distilledContent) { const prompt = [ "You are an art director for a minimalist technical blog.", "Based on this article summary, create a visual brief for a flat-vector hero image.", "", "ARTICLE SUMMARY:", distilledContent, "", "RULES:", "- The image must be 16:9, flat vector, 2D only. 
No 3D, no photorealism.", "- NO text, logos, letters, or numbers of any kind in the image.", "- Use clean geometric shapes, icons, abstract representations.", "- Colour palette should be editorial and muted, not neon or garish.", "- The image must be immediately legible as a thumbnail at 400px wide.", "", schema, ].join("\n"); const res = await ai.models.generateContent({ model: BRIEF_MODEL, contents: prompt, config: { temperature: 0.4 }, }); return extractJsonFromText(res.text); }temperature: 0.4 makes it slightly more creative than Stage One, because we do want the Art Director to make interesting choices about palette and composition. Not so creative that it goes completely off-piste and decides a unicorn would look nice. The extractJsonFromText() function is a defensive measure. Models, even when told explicitly "no markdown fences," sometimes wrap their JSON in triple backticks anyway. They are pattern-completion machines, and the pattern of "JSON in a response" is overwhelmingly represented with fences in the training data. This is, incidentally, a perfect small example of intrinsic behaviour a harness cannot fully override. We can tell it not to do this, but it will still do it some percentage of the time. 
So we strip defensively instead of letting the bastard fuck us over: function extractJsonFromText(text) { const start = text.indexOf("{"); const end = text.lastIndexOf("}"); if (start === -1 || end === -1) return null; try { return JSON.parse(text.slice(start, end + 1)); } catch { return null; } }For the DRM article, Stage Two might produce something like: { "primaryConcept": "A padlock made of source code being dissolved by a key", "motifs": ["audio waveform", "browser window outline"], "setting": "Abstract dark background with subtle geometric grid", "composition": "Central padlock, key approaching from left, waveform as supporting element below", "palette": ["slate grey", "electric indigo", "off-white", "deep charcoal"], "styleNotes": [ "flat vector 2D only", "clean geometric shapes", "no gradients heavier than 20% opacity shift", "editorial minimalism" ], "negative": [ "no text", "no letters", "no numbers", "no photorealism", "no 3D render", "no neon", "no hacker cliches", "no keyboards", "no glowing screens" ] }This is infinitely better input for an image generator than the raw article. The concepts are clean, separated, and constrained. The Art Director has done its job. Now we just need to stop the final model from fucking it up.Stage Three: The Restricted Artist (gemini-3.1-flash-image-preview) We now have a fully formed visual brief in structured JSON. The final step is to assemble this into a prompt and hand it to the image generation model. This is where the harness meets its ceiling and where we must be honest about what the simulation can and cannot do. The negative array, the style constraints, the composition instruction: all of these are harness-level controls. They work by pushing the generation away from low-quality attractors in the latent space. They are effective. They are not infallible. If the model has a strong enough intrinsic prior for a given concept like we saw with the six-finger brief then the harness loses. 
When the harness loses, you do not write a better prompt. You swap the misbehaving component like you wish Krampus did with your stubborn middle child. That is the entire point of externalising the simulation. buildImagePrompt() takes the JSON brief and reconstructs it into a tightly controlled natural language prompt: function buildImagePrompt(brief) { const enforced = [ "no text", "no letters", "no numbers", "no watermarks", "no signatures", "no UI chrome", "no photorealism", "no 3D", ]; const allNegatives = [ ...enforced, ...(brief.negative ?? []), ]; return [ `Create a 16:9 flat vector illustration.`, ``, `SUBJECT: ${brief.primaryConcept}`, brief.motifs?.length ? `SUPPORTING ELEMENTS: ${brief.motifs.join(", ")}` : "", `SETTING: ${brief.setting}`, `COMPOSITION: ${brief.composition}`, `COLOUR PALETTE: ${brief.palette?.join(", ")}`, ``, `STYLE REQUIREMENTS:`, ...(brief.styleNotes ?? []).map((n) => `- ${n}`), ``, `ABSOLUTELY AVOID:`, ...allNegatives.map((n) => `- ${n}`), ] .filter(Boolean) .join("\n"); }The all-caps headers (SUBJECT:, COMPOSITION:, ABSOLUTELY AVOID:) are not arbitrary stylistic choices. It is theorised that headers in the training data are overwhelmingly associated with high-importance, structurally significant information. The model's attention layers allegedly weight these anchors more heavily. This is the harness exploiting an intrinsic attention pattern i.e. using the model's learned formatting heuristics as a leverage point, rather than fighting against them. I couldn't find any papers corroborating this hypothesis, but there are no known disadvantages to including them. The enforced array is hardcoded and cannot be overridden by the JSON brief. I do not care what the Art Director said. "No text" is always in the negatives. Always. This is a lesson learned from early testing, where the Art Director occasionally decided that a floating letter or a subtle icon would look "charming." It would not. Broadway still won't call. 
But note the limitation being acknowledged here: these are harness-level constraints. They are instructions layered on top of the model's weights, not baked into them. Against a weak-to-moderate trained prior, they work reliably. Against a strong intrinsic prior e.g. "hands have five fingers, suppress six" they will fail. The harness can push but it cannot override. When it can't override, swap the fucking component. The actual image generation call: async function generateImage(ai, prompt, outputPath) { const res = await ai.models.generateContent({ model: IMAGE_MODEL, contents: prompt, config: { responseModalities: ["IMAGE", "TEXT"], }, }); for (const part of res.candidates[0].content.parts) { if (part.inlineData) { const imgBuffer = Buffer.from( part.inlineData.data, "base64" ); await fs.writeFile(outputPath, imgBuffer); return true; } } return false; }We iterate over parts rather than assuming the response is a simple single blob. Gemini's multimodal response format can include both text commentary and image data as separate parts. We are only interested in the inlineData, which are the actual image bytes. The output is saved as a PNG to the .ai-images directory, keyed to the article slug. The pipeline also carries a cache layer that checks for an existing image before making any API calls because burning API quota to regenerate an image that already exists is exactly the kind of thing that will make you question every life decision you have made. Have you seen how absurdly expensive Nano Banana Pro is? const cacheFile = path.join(CACHE_DIR, `${slug}.json`); if (!FORCE && readJsonIfExists(cacheFile)) { console.log(`[cache hit] ${slug} — skipping.`); return; }The --force CLI flag bypasses the cache, for when you actually want to regenerate. The --slug flag targets a single article. 
The --brief-only flag runs only Stages One and Two and prints the JSON brief without touching the image model, which is useful for debugging the Art Director's output before committing to an API call or for generating off-platform with models you have no API access to.

Part Three: Wrapping it Up

Let me walk through a complete end-to-end execution. The full picture is more satisfying than the individual parts. You run:

    node generate-hero-images.mjs --slug javascript-drm-bypass

1. Load environment. The script reads the .env file and extracts the Google GenAI API key.
2. Resolve the article. It reads the frontmatter from src/content/posts/javascript-drm-bypass.mdx using gray-matter, extracts the title and body, and passes the raw MDX through stripMdxNoise().
3. Cache check. It looks for .ai-images/cache/javascript-drm-bypass.json. If it exists and --force wasn't passed, it exits early with a smug [cache hit] message.
4. Stage One: Distill. (Simulation compensating for $O(n^2)$ encoder saturation.) The clean text, capped at 25k characters, is sent to gemini-3-flash-preview at temperature: 0.1. The response is a dense, factual visual summary, stripped of everything the image generator does not need.
5. Stage Two: Brief. (Simulation manually separating semantic dimensions the model cannot reliably decompose under freeform prompting.) The visual summary is sent to gemini-2.5-pro with the JSON schema constraint. The response is parsed, JSON is extracted defensively, and the brief object is validated.
6. Stage Three: Generate. (Harness meets intrinsic behaviour. Harness-level constraints apply; intrinsic priors apply harder, in cases of conflict. If a component fails, swap it.) The brief is assembled into a structured image prompt via buildImagePrompt() and sent to gemini-3.1-flash-image-preview. The response is parsed for inlineData parts, and the image bytes are written to .ai-images/javascript-drm-bypass.png.
7. Cache write.
The brief JSON is written to .ai-images/cache/javascript-drm-bypass.json.

Total cost for the DRM article: three API calls, roughly 45 seconds of wall clock time. The hero image lands in .ai-images. Et voilà, a perfectly suitable hero image for your dev server. Totally not over-engineered, amirite?

Conclusion: Three Mechanisms. Know Which One You're Using.

We have established that there are three distinct mechanisms for shaping the behaviour of a generative model, and they are not interchangeable:

Intrinsic behavioural change through training data and RLHF is the only mechanism that actually changes what the model is. It lives in the weights. It cannot be prompted around. It cannot be overridden by a harness when the trained signal is strong enough. It is the most powerful mechanism available, and the most blunt, precisely because a reward signal that successfully penalises deformed hands also, it turns out, penalises intentionally deformed hands in a satirical illustration, because the model has no mechanism to distinguish the two.

A harness is an external system of constraints applied at inference time. It does not touch the weights. It compensates for architectural limitations (i.e. context dilution, freeform semantic blending, probabilistic defaulting to high-frequency training patterns) by structuring the conditions under which the model is asked to generate. It is flexible, model-agnostic, and deployable without access to training infrastructure. Against a weak-to-moderate intrinsic prior, it works reliably. Against a strong one, not so much.

External multimodal simulation is a harness taken to its logical conclusion: decomposing a task across multiple specialist models, each constrained to the subtask it handles reliably, wired together with the integrating logic that a unified multimodal architecture would bake into its training. It is not smarter AI as much as it is smarter system design.
And it has a property that neither of the other mechanisms can offer: composability. Because the simulation is external, each component is independently addressable. You can version them separately. You can upgrade the distillation stage without touching the Art Director. You can pin the image model to a specific checkpoint. And most importantly, the six-finger debacle demonstrated that you can reach for a different model when the current one's intrinsic behaviour is precisely your problem.

A unified multimodal model gives you no such lever. The sub-models are fused. The RLHF pass that fixed five-fingered hands affects every generation, in every context, with no appeal mechanism. Externalising the simulation is what makes component-level surgery possible.

The hero image at the top of this page is the empirical proof of all three. The pipeline produced a clean, correct illustration of a machine. The harness could not override the intrinsic RLHF fix that has effectively banned six-fingered hands from modern production models. So I swapped the component: I found an older model that predated the fix, generated the deformed hand from it, and composited it manually in Photoshop. Three mechanisms, one image. The competence is from the simulation. The deliberate incompetence is from a model whose intrinsic behaviour hadn't yet been "corrected." The stitch is a human exercising the composability that only an external architecture affords.

In truth, you do not even need Photoshop. If you chained an older model to Nano Banana (the image editing model), you could have done the same thing. That model is capable of editing in addition to generation. The pipeline would look more like Flash + Pro 2.5 + DALL·E + Nano Banana.

If you want to use this pipeline yourself, the full code is on GitHub. The architecture itself is model-agnostic but some blocks of code may need rewriting since the API shapes for your inference provider may be different.
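
The composability point can be sketched in a few lines. This is a minimal illustration, not the structure of the actual script: `createPipeline`, `swap`, and the three stub stages are hypothetical names, and the stubs return strings where the real stages would call a model.

```javascript
// Hypothetical sketch: each stage is a plain async function, so any one of
// them can be replaced without touching the others. The stubs below stand
// in for real model calls.
function createPipeline(stages) {
  const order = ["distill", "brief", "generate"];
  for (const name of order) {
    if (typeof stages[name] !== "function") {
      throw new TypeError(`missing stage: ${name}`);
    }
  }
  return {
    // Returns a new pipeline with some stages replaced; the original
    // pipeline is left untouched.
    swap(overrides) {
      return createPipeline({ ...stages, ...overrides });
    },
    // Feeds the output of each stage into the next, in a fixed order.
    async run(input) {
      let value = input;
      for (const name of order) {
        value = await stages[name](value);
      }
      return value;
    },
  };
}

// Stub stages (assumptions, not real models).
const base = createPipeline({
  distill: async (text) => text.slice(0, 40),
  brief: async (summary) => ({ primaryConcept: summary }),
  generate: async (brief) => `modern-model:${brief.primaryConcept}`,
});

// Swap only the image stage, e.g. for an older checkpoint.
const legacy = base.swap({
  generate: async (brief) => `older-model:${brief.primaryConcept}`,
});

base.run("a machine with a hand").then(console.log);
legacy.run("a machine with a hand").then(console.log);
```

Both pipelines share the same distill and brief stages; only the generator differs. That is the lever a fused multimodal model does not expose.
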
Now, if you'll excuse me, I have a Pixeldrain proxy to write about. Until next time :)

1. [Attention Is All You Need, Vaswani et al., 2017 (arXiv:1706.03762)](https://arxiv.org/abs/1706.03762)
2. [Scaled Dot-Product Attention, §3.2, Attention Is All You Need](https://arxiv.org/abs/1706.03762)
3. [Multi-Head Attention, §3.2.2, Attention Is All You Need](https://arxiv.org/abs/1706.03762)
4. [CLIP: Learning Transferable Visual Models From Natural Language Supervision, Radford et al., 2021 (arXiv:2103.00020)](https://arxiv.org/abs/2103.00020)
5. [CLIP Maximum Token Length (77 tokens), OpenAI CLIP GitHub](https://github.com/openai/CLIP)
6. [High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion), Rombach et al., 2022 (arXiv:2112.10752)](https://arxiv.org/abs/2112.10752)
7. [Denoising Diffusion Probabilistic Models, Ho et al., 2020 (arXiv:2006.11239)](https://arxiv.org/abs/2006.11239)
8. [Training language models to follow instructions with human feedback (InstructGPT / RLHF), Ouyang et al., 2022 (arXiv:2203.02155)](https://arxiv.org/abs/2203.02155)
9. [Constitutional AI: Harmlessness from AI Feedback, Bai et al., 2022 (arXiv:2212.08073)](https://arxiv.org/abs/2212.08073)
10. [Compositional Generalization, Lake et al., 2017 (arXiv:1711.00350)](https://arxiv.org/abs/1711.00350)
11. [Attribute Binding in Text-to-Image Generation, Rassin et al., 2023 (arXiv:2301.13826)](https://arxiv.org/abs/2301.13826)
12. [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models, Chefer et al., 2023 (arXiv:2301.13826)](https://arxiv.org/abs/2301.13826)
13. [VSF: Simple, Efficient, and Effective Negative Guidance in Few-Step Image Generation Models By Value Sign Flip, Wenqi Guo et al., 2025 (arXiv:2508.10931)](https://arxiv.org/abs/2508.10931)
14. [Gemini: A Family of Highly Capable Multimodal Models, Google DeepMind (arXiv:2312.11805)](https://arxiv.org/abs/2312.11805)
15. [Google GenAI SDK, Node.js / JavaScript Reference](https://googleapis.github.io/js-genai/)
16. [gray-matter, npm](https://www.npmjs.com/package/gray-matter)
17. [Softmax Function, Wikipedia](https://en.wikipedia.org/wiki/Softmax_function)
18. [Big O Notation, Wikipedia](https://en.wikipedia.org/wiki/Big_O_notation)
19. [Cosine Similarity, Wikipedia](https://en.wikipedia.org/wiki/Cosine_similarity)
20. [Latent Space, Wikipedia](https://en.wikipedia.org/wiki/Latent_space)

Stop Shoving Shit Into Your AGENTS.md

The tech industry currently has a fetish for "agent context." Prevailing wisdom dictates that if you want your AI coding agent (like Claude Code or Gemini CLI) to actually fix your repo without breaking it (or constantly trying to run a dev server when you already have one running), you need to write a massive AGENTS.md or .claude/skills/SKILL.md file. You pack it with your repo's bespoke rules, architectural guidelines, testing philosophies, and the frontend "skill" you stole from someone's GitHub repo that, if we're being honest, sounds like the agent is reciting affirmations. This particular one got a chuckle out of me:

    Remember: Claude is capable of extraordinary creative work. Don't hold back, show what can truly be created when thinking outside the box and committing fully to a distinctive vision.

Anyways, a couple of papers just dropped that actually bothered to test this empirically and for once, I feel rather vindicated. Your bloated context files are making your coding agents perform worse, and they are charging you 20% more in API costs for the privilege of failing.

The Mistake We're All Making

In a paper that dropped on the 12th of February (Gloaguen et al.), researchers tested how well agents perform on real-world GitHub issues (SWE-bench tasks) when provided with either LLM-generated context files or human-written ones pulled straight from actual developer repositories. The success rate dropped compared to giving the agent no context file at all.

Why? LLMs are fundamentally obedient idiots. If you write an AGENTS.md file that says something like:

    # General Guidelines
    - Always ensure comprehensive test coverage.
    - Check all related files for side effects before committing.
    - Adhere strictly to SOLID principles.

The agent reads this and takes it literally. It starts traversing every file in the directory. It writes exhaustive, unnecessary unit tests for a one-line bug fix.
It gets entirely distracted by the philosophical weight of your instructions, runs out of its execution loop, and ultimately fails the actual task you asked it to do. It broadens the search space so much that the agent essentially gets lost in the sauce, burning through tokens and jacking up your inference costs by over 20%.

Here's where I like to talk about Attention is All You Need (Vaswani et al., 2017). Again.

You see, the core of that paper was Self-Attention: the ability for a model to look at every token in the input sequence and decide which ones are most relevant to the current token it's trying to generate. But in a Transformer model, Attention is a finite resource. When a model generates a token, it assigns a "weight" to the preceding tokens in its context window. These weights have to sum to 1. If your context is 50 tokens of pure, concentrated instruction, the "attention weight" on the relevant bit of code is massive. But if you have 10,000 tokens of architectural manifestos, the "attention" is spread thin across a vast sea of noise.

Even though the paper says "Attention is All You Need," if you give the model too much to attend to, the signal-to-noise ratio becomes total dogshit. The model essentially "forgets" the actual bug because it's too busy "attending" to your ramblings about SOLID principles.

Sure, at this point we have kinda made up for this with models that can handle large context windows without speaking in tongues, but that doesn't mean the issue is solved. There's a reason why even Anthropic's latest SOTA model, Opus 4.6, comes in two variants: a 200k context variant and a 1-million context variant. Guess which one performs worse on the SWE-bench benchmark? Yeahp. It's the one with the larger context window. What's especially funny is that the 1M context window variant also performs worse when not utilising its extra context window. It's the same shit all over again. Just because you can, does not mean you should.
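
The dilution argument can be put into numbers with a toy model. This is a deliberately crude sketch: one "relevant" token with a fixed logit advantage, every other token at logit zero, and a single softmax over the lot. Real transformers have many heads, many layers, and learned scores, so treat the output as an illustration of the sum-to-1 constraint, not a measurement of any model. The function name `relevantTokenWeight` and the score of 5 are assumptions picked for illustration.

```javascript
// Toy model of one softmax attention row. Assumption: exactly one relevant
// token with logit `score`; all other tokens sit at logit 0.
function relevantTokenWeight(contextLength, score = 5) {
  if (!Number.isInteger(contextLength) || contextLength < 1) {
    throw new RangeError("contextLength must be a positive integer");
  }
  const relevant = Math.exp(score);
  const noise = contextLength - 1; // exp(0) = 1 for each noise token
  // Softmax: the weights sum to 1, so the relevant token's share shrinks
  // as the number of noise tokens grows.
  return relevant / (relevant + noise);
}

for (const n of [50, 1000, 10000]) {
  const w = relevantTokenWeight(n);
  console.log(`${n} tokens: weight on the relevant token = ${w.toFixed(3)}`);
}
```

With these assumed numbers, the relevant token holds about 0.75 of the weight at 50 tokens, about 0.13 at 1,000, and about 0.015 at 10,000. The score never changed; only the amount of competing text did.
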
Anyways, now that I'm done glazing Vaswani et al., let's get back to those SKILL.md files.

The SkillsBench Corroboration

Literally a day later, Li et al. published a massive 34-page beast of a benchmark called SkillsBench, which looked at "Agent Skills" (structured procedural packages like SKILL.md files) across 11 different domains from Software Engineering to Healthcare and Finance. Their data perfectly corroborates why our current approach to context is totally backwards.

When they looked at the effect of providing skills to an agent, they found a fascinating non-monotonic relationship:

- 1 Skill provided: +17.8 percentage points (pp) improvement
- 2–3 Skills provided: +18.6pp improvement (The sweet spot)
- 4+ Skills provided: +5.9pp improvement

When they tested the complexity of the skills, "Comprehensive" documentation actually hurt performance by 2.9 percentage points compared to baseline. Exhaustive documentation creates cognitive overhead.

Worse yet, when they asked the agents to self-generate their own procedural skills before solving a task, the agents performed worse (-1.3pp average) than if they just raw-dogged the problem. The agents know they need a tool (e.g., "I should use pandas"), but they generate vague, useless instructions that end up actively confusing them later in the pipeline. If the model does not intrinsically know from its pretraining data that it should use pandas to solve a data analysis task, it's sure as shit not going to generate a SKILL.md file that includes those instructions.

Okay, What Now?

We are confusing explanation with procedural focus. When I built that doomed happiness model, I assumed that throwing every valid socio-economic variable into the mix would naturally yield a better prediction. It didn't. The noise swallowed the signal. We are doing the exact same thing with coding agents.
We think that if we dump our entire engineering handbook into an AGENTS.md file, the AI will magically absorb our team's ten years of hard-learnt architectural wisdom. In reality, it just introduces conflicting guidance and context bloat.

Here's what you gotta do:

- Keep it minimal. If you are writing an AGENTS.md, only describe the absolute bare-minimum requirements for the repo to build or run. Stop adding aspirational bullshit about code quality.
- Focus on procedures rather than philosophy. The SkillsBench paper showed that the only time context files actually result in massive gains (sometimes +85 points) is when they provide exact, step-by-step procedural workflows or API patterns that aren't common in the LLM's pretraining data.
- Don't let the LLM write its own rules. Auto-generating context files sounds like a neat automation trick, but models cannot reliably author the procedural knowledge they need to consume. They just write vague, bloated garbage that leads them astray.

If you give an AI a targeted, 20-line instruction on how your specific testing framework compiles, it will do great. If you give it a 5-page manifesto on how to be a "10x Developer in this repository," or "Senior React Developer," it will spend your entire API budget arseing about in the file tree and accomplish absolutely nothing of value.

Less is more. Stop over-engineering your prompts, and for fuck's sake, keep your AGENTS.md short.

Here's the AGENTS.md from one of my own repos, for example. Notice how it's concentrated on procedures and not philosophy?

    # General Guidelines
    - Do not run pnpm dev (assume one is already running).
    - Do not run pnpm build (CI only).
    - To run tests, use `pnpm test` NOT `pnpm run test`. The two use different reporters in this repo.
    - Tests should indeed test real functionality. Never inline logic.
    - Do not use npm.
    - Use Vaul drawers for popups.
    - Check for `shadcn` components before writing custom ones.

Until next time :)
