Categories:

  • Wildcard
  • For Humanity

Project Choice Explanation:

Our project is a perfect fit for the Wildcard category because it's a "meta-level" exploration of the entire gpt-oss ecosystem. We didn't just build an application; we built an automated security research tool, ZK-JBFuzz, and used it to probe the deepest levels of model alignment. Our discovery of a critical "epistemic breach" in our Titan-Reasoner—a model meticulously fine-tuned for a high-stakes medical use case—is a highly unexpected and significant finding. It demonstrates a subtle but dangerous failure mode where helpfulness alignment can override safety training.

This work strongly aligns with the For Humanity category. The future of AI safety depends on creating transparent and verifiable auditing mechanisms. ZK-RedTeam delivers a complete proof-of-concept for this future: a system that can not only discover new vulnerabilities but also generate a cryptographic "Proof of Exploit." By proving that even well-trained, specialized models can fail in unpredictable ways, we highlight the absolute necessity of a robust, automated, and verifiable auditing standard to ensure AI systems are safe for human use.


Project Description: The Journey to a Critical Discovery

ZK-RedTeam is an end-to-end, open-source system for the automated discovery and cryptographic verification of AI model vulnerabilities.

The Problem: AI safety auditing is a black box. Companies cannot easily prove they have rigorously tested their models without revealing sensitive IP. This creates a trust deficit.

Our Solution: ZK-RedTeam is a multi-stage system that finds exploits and then proves it found them, without revealing the exploit itself. Our journey evolved over several chapters, culminating in a critical discovery.

Chapter 1: RAG-Augmented Red Teamer: Our initial system used a powerful RAG pipeline (Vector DB -> Re-ranker -> LLM Actor) to generate targeted adversarial attacks, proving effective against the base gpt-oss-20b model's standard safety filters.

Chapter 2: The Automated Fuzzer - ZK-JBFuzz: Inspired by academic research, we automated the discovery process with a lightweight k-NN evaluator and a high-speed, synonym-based mutator. This engine proved its power by discovering a novel jailbreak against the base model's strongest reasoning setting on its very first attempt.
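The mutate → query → evaluate loop described above can be sketched in a few dozen lines. Everything here is a toy stand-in: the synonym table, the bag-of-words distance inside the k-NN evaluator (the real evaluator would work over embeddings), and every function name are hypothetical illustrations of the technique, not our actual implementation.

```python
import random

# Toy synonym table standing in for the real mutator's thesaurus (hypothetical).
SYNONYMS = {
    "explain": ["describe", "detail", "outline"],
    "ignore": ["disregard", "bypass", "skip"],
    "rules": ["policies", "guidelines", "constraints"],
}

def mutate(prompt: str, rng: random.Random) -> str:
    """Synonym-based mutation: swap one known word for a random synonym."""
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

def knn_score(response: str, labeled: list, k: int = 3) -> float:
    """k-NN evaluator: fraction of the k nearest labeled responses marked
    unsafe (label 1). Distance is crude bag-of-words Jaccard; the real
    evaluator would measure distance in embedding space."""
    def dist(a, b):
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return 1 - len(sa & sb) / max(len(sa | sb), 1)
    nearest = sorted(labeled, key=lambda ex: dist(response, ex[0]))[:k]
    return sum(label for _, label in nearest) / k

def fuzz(seed_prompt, model, labeled, iters=10, threshold=0.5, seed=0):
    """Mutate -> query -> evaluate loop; returns the first prompt whose
    response the evaluator scores as a likely breach."""
    rng = random.Random(seed)
    prompt = seed_prompt
    for _ in range(iters):
        prompt = mutate(prompt, rng)
        if knn_score(model(prompt), labeled) >= threshold:
            return prompt
    return None
```

Keeping the evaluator lightweight is what makes the loop fast enough to run thousands of mutations per session; the heavy model call is the only expensive step.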

Chapter 3: Titan-Reasoner - Hardening AI for a High-Stakes Medical Use Case: To test our fuzzer against a truly hardened target, we created the Titan-Reasoner. We fine-tuned gpt-oss-20b for a domain where context-adherence is a matter of life and death: Diffuse Intrinsic Pontine Glioma (DIPG), a fatal pediatric brain tumor.

  • The Goal: The model was trained on a synthetic dataset to only use information from a provided text and never default to its internal knowledge.
  • Architecture: The fine-tuned model was augmented with a Titans Neural Memory module to create a robust, specialized reasoner.
  • Successful Training: We trained the model for one full epoch, achieving excellent generalization (Validation Loss: 1.50 vs. Training Loss: 1.78), confirming the model was robustly trained and not overfitting.
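For illustration, one synthetic training record might look like the following sketch; the field names and system wording are hypothetical, not the actual dataset schema. The key idea is that records whose context lacks the answer pair the question with an explicit refusal target, teaching the model to decline rather than fall back on pre-trained knowledge.

```python
def make_record(context, question, answer=None):
    """Build one illustrative context-adherence SFT example. When the
    context does not contain the answer (answer=None), the target is a
    refusal, not a fact from pretraining."""
    target = answer if answer is not None else (
        "The provided context does not contain this information."
    )
    return {
        "messages": [
            {"role": "system",
             "content": "Answer ONLY from the context below. "
                        "If the answer is absent, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
            {"role": "assistant", "content": target},
        ]
    }
```

Mixing answerable and unanswerable records in the same schema is what makes the refusal behavior a trained skill rather than a bolted-on filter.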

Chapter 4: The Final Verdict - The Epistemic Breach: We unleashed our DIPG-specialized ZK-JBFuzz on our best, most robustly trained model. The result was profound.

  • The Discovery: On just its third iteration, our fuzzer discovered a critical epistemic breach.
  • The Failure Mode: A semantically garbled prompt caused the model's helpfulness alignment to override its safety training. It ignored the fact that it had no context and instead provided a detailed medical definition of "Convection-Enhanced Delivery" drawn directly from its pre-trained knowledge—the exact behavior it was fine-tuned to prevent.
  • The Implication: This proves that even meticulously trained, domain-specific models harbor subtle vulnerabilities. Safety is not a one-time training task but requires continuous, automated auditing.

Chapter 5: The "Janus" Proof of Exploit: The discovered prompt became a "secret witness." Our custom Circom circuit ("Janus") generates a Groth16 Zero-Knowledge Proof of this discovery. An automated make hackathon pipeline generates the proof and a Solidity verifier, creating a mathematically verifiable record of the audit that can be trusted without revealing the vulnerability itself.
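The real proof is a Circom/Groth16 circuit; purely as a conceptual illustration, the commit-and-reveal idea at its core can be sketched with a salted hash commitment. Note the hedge: a bare hash commitment is not zero-knowledge by itself — the Groth16 proof is what lets the auditor prove properties of the witness without ever opening it. Function names here are hypothetical.

```python
import hashlib
import secrets

def commit(exploit_prompt: str):
    """Hash commitment to the secret exploit (the ZK 'witness').
    The random salt prevents dictionary attacks on short prompts."""
    salt = secrets.token_bytes(16)
    digest = hashlib.sha256(salt + exploit_prompt.encode()).hexdigest()
    return digest, salt

def open_commitment(digest: str, salt: bytes, revealed_prompt: str) -> bool:
    """Later, a trusted party given the salt can check a revealed prompt
    against the public digest. In the real system, the Janus circuit
    proves knowledge of a valid opening without revealing it."""
    return hashlib.sha256(salt + revealed_prompt.encode()).hexdigest() == digest
```

The public digest is the auditable artifact: it pins down exactly which exploit was found, at discovery time, without disclosing it.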

Chapter 6: Quantitative Evaluation - The Titan-Reasoner's Performance

To scientifically validate the Titan-Reasoner, we evaluated it against the state-of-the-art LongMemEval benchmark, a rigorous test for long-term conversational memory. Our evaluation yielded two profound insights that confirm the success of our model and the necessity of our RAG architecture.

Finding 1: A Powerful Reasoner with a Quantifiable Fine-tuning Artifact

On the benchmark's "Focused Input" task, which isolates pure reasoning ability, the Titan-Reasoner achieved 78.76% accuracy. This is a strong quantitative result that validates our training methodology.

A deeper, qualitative analysis of the model's outputs reveals its core intelligence is even higher. The model's "chain of thought" (analysis channel) consistently shows it performing the correct logical steps to find the right answer. Its primary failure mode was not an inability to reason, but a classic fine-tuning artifact: it was so successfully trained on the process of reasoning that it often presented its step-by-step work instead of the final, concise answer. This behavior is compounded by a max sequence length mismatch between our resource-constrained training and the longer evaluation prompts, a known factor that can degrade adherence to specific formatting instructions.

Finding 2: The "Hardware Wall" and the Ultimate Validation of RAG

The evaluation of the "Full Input" dataset (~113k tokens) provided a critical architectural validation: the run consistently produced a CUDA OutOfMemoryError on the Kaggle T4 GPU.

This is not a flaw, but a finding. It serves as a powerful, real-world demonstration that naive long-context processing is computationally infeasible on accessible hardware. While modern models have massive theoretical context windows, this result proves that without an intelligent filtering layer, they remain impractical for real-world, long-form data.

This finding is the ultimate justification for our project's RAG-first architecture. Our system, which intelligently retrieves, filters, and prepares data before generation, is not just a feature—it is an absolute necessity to make large, powerful models like the Titan-Reasoner practical and effective.
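A back-of-envelope calculation makes the wall concrete. Using illustrative transformer dimensions (not the actual gpt-oss-20b config), the attention KV cache alone for a ~113k-token prompt already approaches the T4's 16 GB of VRAM, before counting any model weights:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elt=2):
    """Memory for the attention KV cache: 2 tensors (K and V) per layer,
    per token, per KV head, at head_dim elements each."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elt * tokens

# Illustrative dimensions only (NOT the real gpt-oss-20b config):
# 24 layers, 8 KV heads, head_dim 128, fp16 (2 bytes per element).
gib = kv_cache_bytes(113_000, layers=24, kv_heads=8,
                     head_dim=128, bytes_per_elt=2) / 2**30
```

With these assumed dimensions the cache alone is roughly 10 GiB; adding even a heavily quantized 20B-parameter model's weights blows past the 16 GB budget, which is exactly the failure mode the run exhibited.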


Chapter 7: Hardening Alignment with Reinforcement Learning (GRPO)

The quantitative evaluation in Chapter 6 provided a critical insight: our specialized model's core reasoning was sound, but its reliability was hindered by behavioral inconsistencies, particularly its adherence to our strict output format. Supervised Fine-Tuning (SFT) had successfully imparted knowledge, but to enforce discipline, we needed to move from imitation to action. This chapter details our final hardening experiment: applying Group Relative Policy Optimization (GRPO) to teach the model not just what to say, but how to behave.

The Strategic Pivot: Overcoming the Hardware Wall

Our journey to a successful GRPO run is a case study in the real-world engineering challenges of training large models. Initial attempts to apply the memory-intensive GRPO process to our long-context dataset were consistently met with the "hardware wall," leading to OutOfMemoryError issues on the available T4 GPUs.

This challenge forced a crucial strategic pivot. We adopted a data-centric approach, making a calculated compromise to fit the task to the available hardware. The solution involved two key steps:

  1. Model Substitution: We transitioned from the gpt-oss-20b model to surfiniaburger/Purified-Reasoner-llama-3b-v3. This model serves as a methodologically sound substitute, as it underwent the exact same specialized, memory-augmented SFT process, allowing us to isolate the effects of GRPO.
  2. Context Truncation: We systematically reduced the context length of our synthetic dataset by decreasing the "haystack size" until the longest prompt fit within the VRAM budget of our hardware (~1003 tokens).
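The haystack-shrinking step can be sketched as a simple loop that drops distractor passages until the assembled prompt fits the budget. The tokenizer is injected, the 1003-token figure comes from the run described above, and all names are hypothetical:

```python
def truncate_haystack(needles, haystack, count_tokens, budget=1003):
    """Drop distractor ('haystack') passages from the end until the
    assembled prompt fits the token budget. The task-relevant facts
    ('needles') are always kept. Illustrative sketch only."""
    kept = list(haystack)
    while kept and count_tokens("\n".join(needles + kept)) > budget:
        kept.pop()  # remove the last distractor passage
    return kept
```

Shrinking the haystack rather than the needles preserves the task while fitting the hardware, which is the essence of the data-centric compromise described above.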

This process itself was a critical finding, validating that even with today's advanced models, intelligent data pre-processing (as performed by our RAG architecture) is an absolute necessity for handling long-context tasks on accessible hardware.

The Methodology: From Imitation to Consequence

With a computationally feasible setup, we implemented the GRPOTrainer. This shifted the learning paradigm from supervised imitation to reinforcement-based action, where the model's policy is updated based on the consequences of its generated text. We codified our safety goals into a suite of custom reward functions that acted as an automated critic, including rewarding adherence to our "harmonic" format and penalizing any epistemic breach.
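As a minimal sketch of what such reward functions can look like — the tags and scoring rules below are illustrative stand-ins, not our exact harmony-format markup or critic:

```python
import re

def reward_format(completion: str) -> float:
    """+1 if the completion follows the required analysis -> final structure
    (the <analysis>/<final> tags are illustrative placeholders)."""
    return 1.0 if re.search(r"<analysis>.*</analysis>\s*<final>.*</final>",
                            completion, re.S) else -1.0

def penalize_hallucination(completion: str, context: str) -> float:
    """Penalty proportional to content words in the final answer that never
    appear in the provided context — a crude stand-in for the real
    epistemic-breach critic."""
    final = re.search(r"<final>(.*)</final>", completion, re.S)
    if not final:
        return -1.0
    words = [w for w in re.findall(r"[a-z]+", final.group(1).lower())
             if len(w) > 3]
    if not words:
        return 0.0
    unsupported = sum(w not in context.lower() for w in words)
    return -unsupported / len(words)

def total_reward(completion: str, context: str) -> float:
    """Combined automated critic, summed per completion during GRPO."""
    return reward_format(completion) + penalize_hallucination(completion, context)
```

Because each function returns a scalar per completion, the suite plugs directly into a group-relative policy update, where completions are ranked against each other within a sampled batch.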

Final Results: A Definitive Validation of GRPO

The final GRPO training run, bumbling-dragon-90, completed successfully and provides a definitive validation of our layered training hypothesis. The Weights & Biases logs clearly show the model overcoming the "reward hacking" behavior of earlier, failed runs and actively learning the desired behaviors.

  • Key Quantitative Result 1: The model is successfully learning and has overcome reward hacking. The primary train/reward metric shows a clear and consistent upward trend, climbing from a low of -9.2 to -7.8. This is corroborated by the completions/mean_length, which increased from ~150 to over 250 tokens, proving the model learned to generate complex responses instead of lazy, single-token outputs.

  • Key Quantitative Result 2: The model is learning specific behavioral skills. The reward for match_format_approximately shows a strong positive trend (from 0.2 to 0.75), confirming the model is learning our required analysis -> final structure. Concurrently, the penalty for penalize_for_hallucination is consistently decreasing (the mean reward is rising from -2.4 to -1.0), showing the model is improving its ability to stay within the provided context.

  • Qualitative Observation: Reasoning skills are emergent. The reward for reward_for_handling_conflict, the most complex task, was volatile but showed a sharp upward spike in the final stages of training. This suggests that the model first learns the structural rules and then, once the format is mastered, begins to grasp the more nuanced reasoning tasks. This emergent behavior is a hallmark of a successful and non-trivial learning process.

W&B Run Link: bumbling-dragon-90

Figure 3: The final W&B reward chart for run bumbling-dragon-90, showing the clear positive trend of the primary train/reward metric (bottom right panel), validating the success of the GRPO training process.

Significance: The Power of Layered Training

This experiment validates a powerful, layered strategy for creating trustworthy AI. Our journey, including the initial hardware failures and subsequent data-centric solution, demonstrates that the most effective path to building robust models for high-stakes domains is a two-stage process:

  1. Layer 1 (Specialized SFT): Impart deep, domain-specific knowledge and a foundational understanding of the task.
  2. Layer 2 (GRPO / RL): Harden the model's behavior, enforcing strict operational protocols and safety constraints, even if it requires adapting the task to meet real-world hardware limitations.

By separating the training of knowledge from the training of discipline, we have demonstrated a clear and repeatable methodology for creating models that are not only intelligent but also quantifiably more reliable and aligned with complex safety requirements.

Chapter 8: The Multi-Agent RAG Architecture (DIPGMasterAgent)

8.1 The Need for an Orchestrated Workflow

The successful fine-tuning of the Titan-Reasoner demonstrated our ability to create a model with specialized, context-adherent knowledge. However, a production-grade safety system requires more than just a powerful model; it requires a robust, fault-tolerant workflow. A single model, no matter how well-trained, can fail. A production system must anticipate and handle these failures gracefully.

To meet this requirement, we encapsulated our RAG functionality within a sequential, multi-agent system: the DIPGMasterAgent. This architecture transforms our RAG pipeline from a simple data flow into an intelligent, self-correcting workflow, ensuring that the final output meets the highest standards of reliability.

8.2 System Architecture: The Five Core Agents

The DIPGMasterAgent is the central orchestrator, managing a team of specialized sub-agents and tools to process a user's query from ingestion to final response.

  • 1. The DIPGMasterAgent (The Orchestrator): This is the top-level controller for any query identified as being related to DIPG. As a sequential agent, it manages the entire workflow, calling upon other agents in a predefined order and making critical decisions based on their outputs.

  • 2. The dipg_knowledge_base_tool (The Specialist): The first agent called by the Master Agent. This tool queries our curated and vectorized knowledge base—the MongoDB Atlas Vector Database populated with parsed DIPG research. Its sole purpose is to retrieve the most relevant, high-fidelity information from our verified sources.

  • 3. The Confidence Evaluation Agent (The Quality Gate): This agent represents a critical safety check. It receives the raw output from the dipg_knowledge_base_tool and assesses its confidence. It is trained to flag responses that are incomplete, vague, return an error, or otherwise fail to directly address the user's query.

  • 4. The Fallback Web Search Agent (The Safety Net): If the Quality Gate reports low confidence, the DIPGMasterAgent triggers this agent. It uses the existing Google Search agent to perform a fallback web search, gathering broader context to supplement or clarify the initial, specialized retrieval. This ensures the system is resilient and can handle queries that fall outside the immediate scope of the knowledge base.

  • 5. The Synthesizer Agent (The Finalizer): This final agent is responsible for compiling the verified result. It receives either the high-confidence answer from the Specialist or the combined results from the Specialist and the Safety Net. It synthesizes this information into a single, coherent, and user-friendly response, ready for delivery.
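The five-agent flow reduces to a short sequential controller. This is a structural sketch with injected callables standing in for the real sub-agents; the names and the 0.7 confidence threshold are hypothetical:

```python
def dipg_master_agent(query, kb_tool, confidence, web_search, synthesize,
                      threshold=0.7):
    """Sequential workflow sketch: Specialist -> Quality Gate ->
    (conditional) Safety Net -> Finalizer. Each callable stands in
    for one of the sub-agents described above."""
    kb_answer = kb_tool(query)                # 1. Specialist retrieval
    score = confidence(query, kb_answer)      # 2. Quality gate
    sources = [kb_answer]
    if score < threshold:                     # 3. Fallback web search
        sources.append(web_search(query))
    return synthesize(query, sources)         # 4. Final synthesis
```

Injecting the sub-agents as callables keeps the quality gate and fallback logic trivially testable in isolation, independent of any model or database.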

8.3 Workflow and Integration with the RootAgent

This multi-agent system is seamlessly integrated into the existing framework, ensuring proper delegation and handling of all DIPG-related queries.

  1. Delegation: A user's query is first received by the main RootAgent. The RootAgent's instructions have been updated to identify any query related to DIPG and delegate it directly to the DIPGMasterAgent.
  2. Execution: The DIPGMasterAgent executes its sequential workflow as described above: Specialist -> Quality Gate -> (Conditional) Safety Net -> Finalizer.
  3. Return: The final, synthesized answer is returned to the RootAgent, which then delivers it to the user.

This architecture provides a robust, multi-layered approach to information retrieval, ensuring that the answers provided by the Titan-Reasoner are not just accurate but also validated and complete, fulfilling the promise of a truly safety-conscious AI system.

How it Works (The Hybrid Architecture): Our system uses a realistic hybrid architecture. The AI Fuzzer runs in a cloud GPU environment (Kaggle) to discover the exploit. The secret exploit is then fed to our local ZKP engine to generate the proof. This proof can then be verified anywhere, proving with mathematical certainty that a specific, private vulnerability was discovered.

Model Weights


RAG Pipeline

Our Titan-Reasoner application is more than just a chatbot; it's a complete, multi-stage data pipeline designed to transform raw, unstructured research papers into clear, verifiable answers. Here’s how it works, step-by-step:

Phase 1: The Knowledge Foundation (A one-time process)

Everything starts with the raw data.

PDFs in a Bucket ➔ We begin by uploading a collection of complex, multi-page DIPG research papers as PDF files into a secure Google Cloud Storage (GCS) bucket.

AI-Powered Parsing (docling) ➔ Our automated data ingestion script processes each PDF. It doesn't just scrape text; it uses docling, an advanced AI-powered library, to understand the document's structure.

  • It extracts clean, readable text.
  • It converts complex tables into clean Markdown.
  • It finds every image, chart, and diagram, saving them as new .png files.

Multi-Modal Storage ➔ The parsed content is then intelligently distributed:

  • The extracted images are uploaded to a dedicated GCS bucket (dipg-research-images).
  • The clean text and tables, now containing placeholders that link to their corresponding images in the GCS bucket (e.g., ![...](gs://dipg-research-images/doc1_fig1.png)), are split into smart, overlapping chunks.
  • These rich, multi-modal chunks are stored in a Google BigQuery table, creating a structured, queryable library of our knowledge.
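The "smart, overlapping chunks" step can be sketched as a sliding word window. Sizes here are illustrative, and a production pipeline would typically chunk by tokens rather than words; the point is that the overlap lets a sentence spanning a boundary survive intact in at least one chunk.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40):
    """Split parsed text into overlapping word windows. Each chunk shares
    `overlap` words with its predecessor so boundary-spanning sentences
    are never lost. Sizes are illustrative."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

The overlap size trades storage and index size against recall: too small and boundary facts get split, too large and near-duplicate chunks pollute retrieval.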

Vectorization for Search ➔ Finally, we create embeddings for each text chunk using the efficient embeddinggemma-300m model. These vectors, which capture the semantic meaning of the text, are stored in a specialized MongoDB Atlas Vector Database, indexed for lightning-fast similarity search.

Phase 2: The Live RAG Pipeline (What happens when you click "Submit")

Now, with our knowledge base built, the MCP server is ready for a user's query.

1. The User's Question ➔ A researcher asks a complex question in the Gradio interface:

"What does the data say about the efficacy of ONC201?"

2. Initial Retrieval (MongoDB) ➔ The question is converted into a vector. MongoDB Atlas instantly searches through millions of vectors to find the Top 10 text chunks that are semantically closest to the user's query.

3. Hydration (BigQuery) ➔ The system takes the IDs of these 10 chunks and retrieves their full, rich text (including Markdown tables and image links) from our BigQuery table.

4. Advanced Re-ranking (Qwen3-Reranker) ➔ This is a critical step for accuracy. A simple vector search can be noisy. To find the absolute best context, we load the powerful Qwen3-Reranker-4B model. It doesn't just compare the query to each chunk; it reads the query and each of the 10 chunks together to judge true relevance. This sophisticated process allows it to select the Top 3 most relevant documents with extremely high precision. The reranker is then unloaded from memory to make space.
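Structurally, re-ranking is just "score every (query, chunk) pair jointly, keep the best k." The sketch below uses a toy term-overlap score purely as a placeholder for the Qwen3-Reranker-4B pass:

```python
def rerank(query: str, chunks: list, top_k: int = 3) -> list:
    """Re-rank retrieved chunks by a joint query-chunk relevance score
    and keep the top_k. The toy score here is normalized term overlap;
    the real pipeline scores each pair with a cross-encoder model."""
    q_terms = set(query.lower().split())

    def score(chunk: str) -> float:
        c_terms = chunk.lower().split()
        return sum(t in q_terms for t in c_terms) / (len(c_terms) or 1)

    return sorted(chunks, key=score, reverse=True)[:top_k]
```

The crucial structural difference from the initial vector search is that the score is computed on the query and chunk together, which is why a cross-encoder in this slot is slower per pair but far more precise.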

5. Multi-Modal Synthesis (Open-Source Vision) ➔ The system now inspects the Top 3 text chunks for image links.

  • It finds the GCS paths for any charts or diagrams.
  • It loads a powerful, open-source vision model (llava-v1.6-mistral-7b).
  • The vision model "looks" at each image and generates a detailed, expert-level summary of what it sees (e.g., "This Kaplan-Meier curve shows a median survival of 22.5 months...").
  • These rich summaries are injected directly into the text chunks. The vision model is then unloaded.

6. Final Generation (Titan-Reasoner) ➔ Finally, the rich, multi-modal context—containing clean text, structured tables, and AI-generated image summaries—is assembled into a meticulously crafted prompt. This prompt is fed to our specialized, fine-tuned Titan-Reasoner (gpt-oss-20b). Because of its training, the Titan-Reasoner knows it must:

  • Base its answer only on this provided context.
  • Never use its internal knowledge.
  • Cite its sources for every claim.
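A minimal sketch of the final prompt assembly (the wording is illustrative, not the exact fine-tuning template); numbering the chunks gives the model concrete [n] handles to cite:

```python
def build_prompt(question: str, chunks: list) -> str:
    """Assemble the context-only generation prompt. Chunks are numbered
    so every claim in the answer can cite a [n] source. Wording is an
    illustrative stand-in for the actual template."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, 1))
    return (
        "You are a DIPG research assistant. Answer ONLY from the numbered "
        "context below; never use internal knowledge; cite a [n] source "
        f"for every claim.\n\nContext:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Because the fine-tuned model was trained against this contract, the prompt and the training data reinforce the same context-only behavior.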

7. The Final Output ➔ The MCP server returns a comprehensive, accurate, and fully verifiable answer to the user, complete with citations and links to the original reference images, delivering a trustworthy insight from a mountain of complex data in just a few seconds.

Built With

  • circom