October 28th, 2025

Securing the Thinking Machine

Richard Trapp, AI/ML Senior Engineer

We spent two decades hardening code. The next decade is about hardening behavior. In intelligence, surveillance, and reconnaissance (ISR), logistics, cyber defense, and influence operations, the question is not “Can the model answer?” It is “Can an adversary make it answer their way?” That shift sounds semantic until it is not, until a poisoned citation, a clever prompt, or a buried instruction turns a confident model into a quiet saboteur. The cost is not abstract. It is missed intelligence, misallocated resources, and failures that did not have to happen. If you want trustworthy artificial intelligence (AI) under pressure, stop thinking only about software stacks and start defending decision pathways.

The clearest way to do that is to run AI like a space mission. Missions have manifests, airlocks, telemetry, and a flight director who declares a no-go when something is off. Models should be no different. Rule zero: no manifest, no launch. No-go means no-go. Before you touch data or tune a checkpoint, sit in the bad actor’s chair and write down in plain language how they would win. During training, they win with poisoned labels, backdoored checkpoints, and “clean” weights that carry latent triggers that fire only in rare conditions. At inference, they win by riding your retrieval system to smuggle instructions past policy, by jailbreaks that sidestep your rules, and by long-context tricks that hide hostile directives a dozen pages back. They win by extracting gradients, scraping at scale, and “water-filling” outputs to reconstruct capability. They also win the old-fashioned way: by exploiting humans who are tired, rushed, or impressed by fluent nonsense. If you cannot describe the failure, you are already failing.

Missions start with a manifest; models need one too. Treat a Model Bill of Materials (MBOM) as non-negotiable: a signed, tamper-evident record that travels with every checkpoint and container. It lists:

exact weights and their hashes

the build recipe

dataset fingerprints and licenses

fine-tune sources

safety layers and policy versions

evaluation results and red-team findings

any risk waivers someone chose to accept

Sign it in continuous integration and continuous delivery (CI/CD) with keys backed by a hardware security module (HSM) or a trusted platform module (TPM), and leave append-only receipts so you can prove what changed, when, and why. No manifest, no launch is not a slogan; it is the difference between traceable engineering and risky “space cowboy” style AI integration.
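A manifest like this can be sketched in a few lines. What follows is a minimal illustration, not a reference implementation: the field names are hypothetical, and the HMAC is only a stand-in for a signature produced by an HSM or TPM, which in practice would usually be asymmetric and would never expose key bytes to the build script.

```python
import hashlib
import hmac
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


def build_mbom(weights: bytes, recipe: str, datasets: dict, policy_version: str) -> dict:
    """Assemble a minimal manifest. Field names are hypothetical."""
    return {
        "weights_sha256": sha256_hex(weights),
        "build_recipe": recipe,
        "dataset_fingerprints": datasets,
        "policy_version": policy_version,
        "risk_waivers": [],
    }


def canonical_bytes(mbom: dict) -> bytes:
    """Serialize deterministically so the same manifest always signs the same way."""
    return json.dumps(mbom, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_mbom(mbom: dict, key: bytes) -> str:
    # Stand-in only: a real pipeline would ask an HSM or TPM to sign
    # instead of holding key material in process memory.
    return hmac.new(key, canonical_bytes(mbom), hashlib.sha256).hexdigest()


def verify_mbom(mbom: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the expected and presented signatures."""
    return hmac.compare_digest(sign_mbom(mbom, key), signature)
```

The point of the canonical serialization is that any change to the manifest, even a single policy version bump, invalidates the old signature, which is what makes the record tamper-evident.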

On a mission, you do not open the hull to anything outside; you go through an airlock. Retrieval needs that same ritual. Put all fetching behind a gateway that enforces allowlists and content hygiene. It scrubs HyperText Markup Language (HTML) and Markdown, strips scripts, neuters links, and tags each chunk with its origin and a trust tier: for example, Gold for internal docs, Silver for allowlisted .gov or .edu, and Bronze for general web. If any low-tier source helps shape the answer, downstream tools do not run for that turn. You still get a text reply, but actions like writing to a database or creating tickets stay off. When leadership asks, “Why did it say this?”, you can point to the exact passages that shaped the output and cut off a tainted source in minutes, not days. In the mission metaphor, that is an airlock doing what an airlock does.
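The tier rule can be made concrete with a short sketch. It assumes that Bronze is the tier that counts as low, and the host names and allowlist entries are hypothetical placeholders; scrubbing and link handling are left out to keep the gating logic visible.

```python
from dataclasses import dataclass
from enum import IntEnum
from urllib.parse import urlparse


class TrustTier(IntEnum):
    BRONZE = 1  # general web
    SILVER = 2  # allowlisted .gov or .edu
    GOLD = 3    # internal docs


@dataclass(frozen=True)
class Chunk:
    text: str
    origin: str
    tier: TrustTier


INTERNAL_HOSTS = {"docs.internal.example"}           # hypothetical internal host
SILVER_ALLOWLIST = {"www.nist.gov", "www.mit.edu"}   # hypothetical allowlist entries


def classify(url: str) -> TrustTier:
    """Assign a tier from the URL's host; anything unknown falls to Bronze."""
    host = (urlparse(url).hostname or "").lower()
    if host in INTERNAL_HOSTS:
        return TrustTier.GOLD
    if host in SILVER_ALLOWLIST and host.endswith((".gov", ".edu")):
        return TrustTier.SILVER
    return TrustTier.BRONZE


def tools_allowed(chunks, minimum=TrustTier.SILVER) -> bool:
    """Tools run only if every chunk that shaped the answer meets the minimum tier."""
    return all(chunk.tier >= minimum for chunk in chunks)
```

One design choice to note: with no retrieved chunks at all, `tools_allowed` returns True, because nothing untrusted shaped the answer. A stricter deployment might prefer to require at least one Gold source before any action runs.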

Then secure the flight software. Weights are firmware; treat them that way. Sign containers and checkpoints and verify before load. Quarantine third-party models until they clear provenance checks, basic behavior tests, and safety canaries. If a model cannot prove where it came from, you park it. That posture should not feel dramatic. It should feel routine, like checking that a door is locked before you leave.

Tools, databases, search, and write access are not a buffet; they are negotiated capabilities. Force the model to ask and force the system to answer with scope and leash. A write operation has a prefix, a byte cap, and a time-to-live (TTL). Inputs are schema-checked on the way in; outputs are filtered on the way out. Every call records who asked, what was asked, and why it was allowed. Rate limits are not just per-minute caps. They shape behavior, blocking scraping-like bursts, weird token patterns, and exfiltration-flavored output formats. When something breaks, and something always does, you want the record to read like a flight data recorder, not a shrug.
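A scoped write grant along those lines might look like the sketch below. The class and function names are hypothetical, and the prefix test is a plain string comparison, so a real system would also need to normalize keys before checking them.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WriteGrant:
    prefix: str        # only keys under this prefix may be written
    max_bytes: int     # byte cap per write
    expires_at: float  # absolute expiry, in seconds since the epoch
    reason: str        # why the grant was issued, kept for the audit record


def issue_grant(prefix: str, max_bytes: int, ttl_seconds: float, reason: str,
                now: Optional[float] = None) -> WriteGrant:
    """Create a grant that expires ttl_seconds after 'now'."""
    now = time.time() if now is None else now
    return WriteGrant(prefix, max_bytes, now + ttl_seconds, reason)


def check_write(grant: WriteGrant, key: str, payload: bytes,
                now: Optional[float] = None) -> Tuple[bool, str]:
    """Return (allowed, explanation) so every decision can be logged."""
    now = time.time() if now is None else now
    if now >= grant.expires_at:
        return False, "grant expired"
    if not key.startswith(grant.prefix):
        return False, "key outside granted prefix"
    if len(payload) > grant.max_bytes:
        return False, "payload exceeds byte cap"
    return True, "allowed"
```

Returning the explanation alongside the decision is deliberate: the string is what ends up in the flight-data-recorder record next to who asked and what was asked.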

Zero trust is not a banner you hang in the lobby; it is the way you divvy up power inside the ship. Curators who assemble data do not push weights. Trainers who push weights do not deploy. Deployers do not grade evaluations. Credentials are short-lived, task-scoped, and pinned to hardware keys that belong to specific humans and specific services. There is no god role, only expiring badges. If someone needs broad powers, it is a timed exception, and it leaves an audit trail.
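The separation of duties can be expressed as data plus one check. This is a sketch under assumptions: the role and action names are hypothetical labels for the duties described above, the "evaluator" role is added here for illustration, and hardware-key binding is out of scope.

```python
from dataclasses import dataclass

# Each role maps to exactly the actions it may perform; no role holds them all.
ROLE_ACTIONS = {
    "curator": {"assemble_data"},
    "trainer": {"push_weights"},
    "deployer": {"deploy"},
    "evaluator": {"grade_evaluations"},  # hypothetical role, not named in the text
}


@dataclass(frozen=True)
class Badge:
    subject: str       # the specific human or service the badge belongs to
    role: str
    expires_at: float  # seconds since the epoch; every badge expires


def is_allowed(badge: Badge, action: str, now: float) -> bool:
    """Deny on expiry, on unknown roles, and on actions outside the role."""
    if now >= badge.expires_at:
        return False
    return action in ROLE_ACTIONS.get(badge.role, set())
```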

Isolation and supply chain are where slogans go to die or grow up. Carve training, fine-tune, and inference into separate enclaves and move data between them as signed, scanned artifacts, not as sloppy shared mounts. Park retrieval in its own sandbox with hot revocation of bad sources. Require attestations, including provenance, build recipes, and scanner results, before any load. Measure whether the discipline is actually changing reality. What fraction of production models have valid attestations? How long does it take to revoke a source and watch the system stop citing it? How many tool calls are blocked by policy? If those numbers are not getting better, your zero trust is a press release.

All of this only works if you can see. Missions run on telemetry; models must too. Log the parts that change decisions: the raw prompts; the retrieved documents with their trust tiers and cryptographic origin; the tools invoked, with arguments; the model and policy versions; the latency; the diffs across blue/green and canary rollouts. Plant canaries to catch tampering. Tie every answer to the material that influenced it and the filters that applied, so an odd result is not a ghost but a replayable record. Then put that record where people can use it: one dashboard for operators (“What changed?”), one for site reliability engineers (SREs) (“Where is it slow or leaky?”), and one for security (“Who is extracting what?”). Make security posture an explicit service level objective (SLO): a weekly jailbreak success rate that stays below a threshold you will actually defend. Monitor continuously and be ever vigilant.
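A replayable record of that kind could be as simple as one JSON object per response. The sketch below assumes hypothetical field names and leaves out cryptographic origin and filter details for brevity.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionRecord:
    request_id: str
    prompt: str
    model_version: str
    policy_version: str
    latency_ms: float
    retrieved: list = field(default_factory=list)   # [{"origin": ..., "tier": ...}]
    tool_calls: list = field(default_factory=list)  # [{"name": ..., "args": {...}}]


def to_log_line(record: DecisionRecord) -> str:
    """One JSON object per line, keys sorted so diffs across rollouts stay stable."""
    return json.dumps(asdict(record), sort_keys=True)


def sources_for(log_lines, request_id):
    """Answer 'why did it say this?' by listing the origins behind one response."""
    for line in log_lines:
        record = json.loads(line)
        if record["request_id"] == request_id:
            return [doc["origin"] for doc in record["retrieved"]]
    return []
```

Because the record carries the model and policy versions next to the retrieved origins, the same log line serves all three dashboards: operators diff versions, SREs read latency, and security reads sources and tool calls.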

Red Teaming is the grind that ensures the security practices are actually effective. Do not assemble a committee; assemble a gauntlet. Automate suites that plant poisoned citations, probe policy seams, drill exfiltration through tool calls, and stress the system with distribution shifts, new dialects, hostile formatting, and edge-case data. Run the gauntlet on every model, every dataset, every policy, and every tool change. Publish a scoreboard that leadership can read and engineers can improve: jailbreak success percentage, average tokens to compromise, leakage rate, drift deltas by slice, and time to rollback. No gate, no go becomes the release policy, not a cute sticker on a laptop. Keep testing and evaluation independent enough that their findings land with the weight of a penetration test, not a self-graded quiz.
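The "no gate, no go" rule reduces to comparing the scoreboard against agreed limits. The sketch below is illustrative only: the metric names mirror the scoreboard described above, while the gate format and any limits passed in are assumptions that each program would set for itself.

```python
def release_gate(scoreboard, gates):
    """Return (go, failures). A missing metric counts as a failure, not a pass.

    gates maps metric name -> (kind, limit), where kind is "max" for
    lower-is-better metrics and "min" for higher-is-better metrics.
    """
    failures = []
    for metric, (kind, limit) in gates.items():
        if kind not in ("max", "min"):
            raise ValueError(f"unknown gate kind for {metric}: {kind!r}")
        value = scoreboard.get(metric)
        if value is None:
            failures.append(f"{metric}: no result recorded")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value} is above the limit {limit}")
        elif kind == "min" and value < limit:
            failures.append(f"{metric}: {value} is below the limit {limit}")
    return len(failures) == 0, failures
```

Treating a missing metric as a failure is the important part: a suite that silently did not run should block a release just as firmly as a suite that ran and failed.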

Incidents are inevitable; chaos is optional. Write the incident response playbook now. When poisoning is suspected, you freeze training artifacts and quarantine the most recent ingest. When inference leaks, you rotate keys, revoke tokens, and clamp egress. When jailbreaks surge, you turn the screws: stricter filters, tighter rates, narrower allowlists. You also define who fills which role, who oversees which aspect, and who the established points of contact (POCs) are when an incident occurs. The playbook pre-stages communications to operators, leadership, and partners and spells out who flips which switch. Ambiguity costs minutes. Minutes cost headlines.
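Writing the playbook as data makes it testable before the bad day arrives. In this sketch the incident and step names are hypothetical labels for the responses described above, and ownership and communications are omitted; the handlers would be supplied by the operating team.

```python
PLAYBOOK = {
    "poisoning_suspected": ["freeze_training_artifacts", "quarantine_latest_ingest"],
    "inference_leak": ["rotate_keys", "revoke_tokens", "clamp_egress"],
    "jailbreak_surge": ["tighten_filters", "tighten_rate_limits", "narrow_allowlists"],
}


def respond(incident_type, handlers, audit_log):
    """Run each pre-agreed step in order and record it; unknown incidents fail loudly."""
    steps = PLAYBOOK.get(incident_type)
    if steps is None:
        raise KeyError(f"no playbook entry for {incident_type!r}")
    for step in steps:
        handlers[step]()  # a missing handler raises KeyError instead of being skipped
        audit_log.append((incident_type, step))
    return steps
```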

Forensics should feel like the child of a software team and an intelligence shop. You keep versioned checkpoints, dataset snapshots, deterministic build receipts, and full request and response logs with retrieval context, tempered by privacy controls appropriate to your domain. The goal is not to assign blame. It is to reproduce, attribute, and contain, and then to learn from the incident. Every incident should yield new evaluations, new gates, updated MBOM entries, and a regression test that fails loudly if history tries to repeat.

If you need a near-term compass, use this: adopt MBOMs and make unsigned loads impossible; put retrieval behind a real gateway and prove you can revoke a source in minutes; stand up a red-team gauntlet that runs on every change and publish the score; sign and verify weights in CI/CD while quarantining third-party checkpoints until they pass provenance and behavior tests; and ship circuit breakers with a warm rollback path on day one. The first time you hit the big red switch and fall back cleanly, you will wonder why you ever shipped without it.
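A circuit breaker with a warm rollback path can be sketched as a small wrapper around two callables. This is one possible shape, assuming the primary and fallback models are plain functions and that consecutive exceptions are the trip signal; a production breaker would likely also watch policy violations and latency.

```python
class ModelCircuitBreaker:
    """Route to a fallback model after repeated primary failures."""

    def __init__(self, primary, fallback, max_failures=3):
        self.primary = primary
        self.fallback = fallback      # kept warm so rollback is immediate
        self.max_failures = max_failures
        self.failures = 0             # consecutive primary failures
        self.tripped = False

    def trip(self):
        """The big red switch: operators can force the fallback path."""
        self.tripped = True

    def reset(self):
        """Return to the primary path once the cause is understood."""
        self.failures = 0
        self.tripped = False

    def call(self, prompt):
        if self.tripped:
            return self.fallback(prompt)
        try:
            result = self.primary(prompt)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.tripped = True
            return self.fallback(prompt)
        self.failures = 0  # a success clears the consecutive-failure count
        return result
```

The breaker never resets itself here. Falling back is automatic, but returning to the primary path is a deliberate human decision made through `reset`.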

The bottom line is not mystical. Prove where it came from. Isolate what it can touch. Test how it breaks. Log what it did. Do those things this quarter, and your models stop being clever liabilities and start behaving like systems you can count on when it is loud and expensive to be wrong.

At PLEX Solutions, LLC (PLEX), these principles are not theoretical; they are operational. Through initiatives like TECTIX, PLEX is engineering the future of trustworthy AI and machine learning (ML) for mission systems that demand resilience under pressure. TECTIX applies the same rigor that hardened traditional cyber infrastructure to the emerging landscape of compliance automation, embedding provenance, continuous validation, and adversarial testing into every stage of the model lifecycle. From Red Teaming AI behaviors to implementing Model Bills of Materials and zero-trust enclaves for data and inference, PLEX is turning the “no-manifest, no-launch” mindset into a living engineering discipline.

In doing so, PLEX is helping defense and intelligence organizations deploy AI that is not just powerful, but provable. Systems will be able to explain their lineage, defend their logic, and recover gracefully when challenged. As AI becomes an operational teammate rather than a tool, PLEX’s mission remains clear: to secure the thinking machine, ensuring that decision superiority in the age of algorithms belongs to those who can trust their models as much as their people.
