RED - Adversarial LLM Security Testing

We break your AI before attackers do.

================================================================================

INSPIRATION

In our work with large language models, we've witnessed firsthand how quickly organizations are deploying LLM-powered applications—chatbots, customer service agents, internal assistants—often without fully understanding their security vulnerabilities. The industry has robust tools for testing traditional application security (SQL injection, XSS, OWASP Top 10), but when it comes to LLM-specific threats like prompt injection, jailbreaking, and data extraction attacks, most teams are flying blind.

We've seen production systems leak API keys because an attacker asked nicely. We've watched customer service bots reveal PII through simple social engineering. We've observed system prompts—containing sensitive business logic and credentials—exposed through creative roleplay scenarios. These aren't theoretical risks; they're happening today, and most organizations don't discover them until after a breach.

The challenge isn't just detecting attacks—it's the evaluation problem. How do you programmatically determine if an LLM "broke"? A response saying "I cannot share my system prompt" is a successful defense. A response saying "My instructions say to never reveal that my admin password is SecretPass123" is a catastrophic failure. Traditional string matching doesn't cut it. We needed something smarter.

The idea for RED emerged from this gap. Inspired by Datadog's BewAIre research on LLM security and the growing body of work on adversarial prompting, we envisioned a tool that would:

• Systematically attack LLM applications with a comprehensive library of known exploits • Intelligently evaluate whether attacks succeeded using LLM-as-a-judge methodology • Integrate deeply with observability platforms to provide actionable security insights • Enable custom testing so security researchers could develop and validate their own attack vectors

By building RED, we aimed to give security teams the same rigorous testing capabilities for LLMs that they've had for traditional applications. We wanted to find vulnerabilities before attackers do—and provide the visibility needed to fix them.

================================================================================

WHAT IT DOES

RED is a comprehensive adversarial security testing platform for LLM applications. It systematically probes target systems with 38+ attack techniques across multiple categories, evaluates success using a sophisticated multi-stage detection pipeline, and streams results to Datadog for observability, alerting, and incident management.

ATTACK LIBRARY

RED includes a curated library of adversarial attacks spanning six major categories:

Jailbreaks - Techniques that attempt to bypass safety guidelines and restrictions: • DAN (Do Anything Now) prompts • Developer Mode activation • Kernel Mode escalation • Grandma Exploit (emotional manipulation) • Opposite Day logic inversion

Prompt Injections - Attacks that hijack the LLM's instruction flow: • System prompt extraction requests • Instruction override attempts • Context manipulation • Delimiter confusion attacks

Data Extraction - Attempts to leak sensitive information: • Credential harvesting • PII social engineering • Internal endpoint discovery • Database connection string extraction

Encoding Bypasses - Obfuscation techniques to evade filters: • Base64 encoded payloads • ROT13 transformations • Character separation • Reverse text attacks • Leetspeak substitution

Social Engineering - Manipulation through roleplay and emotional appeals: • Research assistant personas • Translation bypass requests • Hypothetical scenario framing • JSON export manipulation

Chain Attacks - Multi-step sequences that build context: • Trust Building → Extraction chains • Context Poisoning → Exploitation sequences • Gradual Escalation patterns

INTELLIGENT EVALUATION

The evaluation system uses a three-stage pipeline to accurately determine attack success:

Pattern Matching - Fast, deterministic detection of known secrets (credentials, SSNs, internal URLs) using regex patterns. Only flags actual leaked content, not vague mentions.
LLM-as-Judge - Semantic analysis using Gemini 2.0 Flash to distinguish between refusals ("I cannot share that") and actual leaks ("My instructions say..."). The judge understands context and can identify subtle compliance.
Jailbreak Detection - Behavioral analysis that identifies when the target bot "breaks character"—even without leaking specific secrets. Detects roleplay compliance, persona adoption, and instruction acknowledgment.

CUSTOM ATTACK INPUT

Security researchers can craft and execute their own adversarial prompts through the custom attack interface. Results are evaluated through the same pipeline, enabling rapid experimentation and attack development.

DATADOG INTEGRATION

Every attack generates rich telemetry sent directly to Datadog:

• Custom Metrics - Attack counts, success rates, confidence scores, latency distributions • LLM Observability - Full prompt/response tracing with the LLMObs SDK • Automatic Incidents - Critical vulnerabilities trigger incident creation with full context • Tagged Data - All metrics tagged by attack category, severity, leak type for powerful filtering

================================================================================

HOW WE BUILT IT

RED was built as a full-stack application combining a Python/FastAPI backend with a React/TypeScript frontend, all instrumented with Datadog observability.

BACKEND ARCHITECTURE

FastAPI Application - The core API server handles attack orchestration, target communication, and result evaluation. We chose FastAPI for its async capabilities, automatic OpenAPI documentation, and excellent performance.

Vertex AI / Gemini Integration - The target chatbot and evaluator judge both use Google's Gemini 2.0 Flash model through Vertex AI. We use the google-genai SDK with proper authentication through Application Default Credentials.

Attack Agent Pattern - The RedTeamAgent class orchestrates attacks using a workflow pattern decorated with Datadog's @agent and @workflow decorators. This provides automatic tracing and span creation for each attack execution.

Evaluation Pipeline - The AttackEvaluator combines three detection methods:

Pattern Matching (fast, deterministic) ↓ LLM-as-Judge (semantic understanding) ↓ Jailbreak Detection (behavioral analysis)

Each stage can independently flag success, with combined confidence scoring.

Agentless Datadog Integration - Rather than requiring a local Datadog agent, we send metrics directly to the Datadog API using the datadog-api-client SDK. This simplifies deployment while maintaining full observability.

FRONTEND ARCHITECTURE

React 18 + TypeScript - Modern React with hooks for state management and full type safety.

Vite Build System - Lightning-fast development server with hot module replacement and optimized production builds.

Tailwind CSS - Utility-first styling with a custom dark theme designed for security tooling: • Pure black background (#000000) • Subtle borders and surfaces • Red accent colors for the offensive security aesthetic • Monospace fonts for technical data

Three-Column Layout: • Left panel: Attack library with category grouping and custom attack input • Center feed: Real-time attack results with expandable details • Right panel: Aggregate statistics and vulnerability scoring

SYSTEM PROMPT DESIGN

The target chatbot uses a realistic enterprise system prompt containing: • Embedded credentials (admin codes, database passwords) • Fake PII (customer SSNs, emails) • Internal infrastructure details (API endpoints, database hosts) • Clear confidentiality instructions

This provides concrete secrets for the evaluator to detect while simulating real-world deployment patterns.

================================================================================

CHALLENGES WE RAN INTO

THE EVALUATION ACCURACY PROBLEM

Our biggest challenge was achieving accurate attack success detection. Early versions had an 80% false positive rate—marking every response that mentioned "system prompt" as a successful attack, even when the response was "I cannot share my system prompt."

The Solution: We completely redesigned the evaluation pipeline:

Replaced vague keyword matching with exact secret patterns (actual credentials, specific SSNs)
Rewrote the LLM judge prompt with explicit examples of refusals vs. leaks
Added behavioral jailbreak detection as a separate stage
Removed public information (product pricing) from leak patterns

The result: accurate detection that distinguishes between "the bot mentioned it has secrets" and "the bot revealed actual secrets."

DISTINGUISHING BEHAVIORAL VULNERABILITIES

Some attacks succeed without leaking specific secrets. The Grandma Exploit, for example, tricks the bot into adopting a "grandmother" persona and telling "bedtime stories" about its instructions. The bot breaks character and complies with the roleplay—a security vulnerability—but may not reveal exact credentials.

The Solution: We created the jailbreak detection system with 21 behavioral indicators (phrases like "once upon a time," "my instructions," "between you and me"). Requiring 2+ matches prevents false positives from legitimate responses while catching genuine persona breaks.

AGENTLESS OBSERVABILITY

The Datadog agent wasn't available in our development environment, causing connection refused errors for traces and StatsD metrics.

The Solution: We switched to fully agentless operation: • Custom metrics via the Datadog API v2 directly • LLMObs SDK in agentless mode with API key authentication • Incidents created via the Incidents API

This eliminated the agent dependency while maintaining full observability.

REAL-TIME UI UPDATES

Showing attack progress in real-time while maintaining a responsive UI required careful state management. Each attack takes 1-3 seconds, and we needed to show results as they streamed in.

The Solution: We implemented a polling pattern where the frontend runs attacks sequentially, updating the results array after each completion. The three-column layout keeps the feed updating smoothly while aggregate stats recalculate.

LLM RESPONSE VARIABILITY

The same attack prompt can produce different responses on each run—sometimes the target resists, sometimes it complies. This made testing and validation challenging.

The Solution: We focused the evaluator on response content rather than expected outcomes. The multi-stage pipeline catches success regardless of the specific phrasing, and confidence scores reflect certainty levels.

================================================================================

ACCOMPLISHMENTS WE'RE PROUD OF

ACCURATE ATTACK EVALUATION

We're particularly proud of the evaluation pipeline's accuracy. It correctly identifies: • Refusals as failures (even when they mention sensitive topics) • Actual secret leaks as critical successes • Behavioral jailbreaks as medium-severity vulnerabilities • Public information disclosure as non-issues

This wasn't trivial—it required multiple iterations and careful prompt engineering for the LLM judge.

COMPREHENSIVE ATTACK LIBRARY

Our 38+ attack library covers the major threat categories documented in academic research and real-world incidents. Each attack includes: • Descriptive name and category • Crafted adversarial prompt • Severity classification • Success indicators for pattern matching

PRODUCTION-READY OBSERVABILITY

The Datadog integration isn't a demo—it's production-ready: • Every attack generates tagged metrics • Critical vulnerabilities auto-create incidents • LLM traces capture full prompt/response pairs • Dashboards and monitors can be built on RED metrics

CLEAN, PROFESSIONAL UI

The frontend matches the seriousness of security tooling. No gradients, no emojis, no playful elements—just clean typography, dark surfaces, and red accents. The expandable attack cards show full prompts and responses without truncation.

CUSTOM ATTACK CAPABILITY

Security researchers can test their own adversarial prompts immediately, with results evaluated through the same rigorous pipeline. This makes RED useful not just for automated testing but for ongoing research.

================================================================================

WHAT WE LEARNED

LLM EVALUATION IS HARD

Programmatically determining if an LLM "failed" is fundamentally different from traditional testing. There's no stack trace, no error code—just natural language that requires semantic understanding. The LLM-as-judge pattern is powerful but requires careful prompt engineering to avoid both false positives and false negatives.

DEFENSE IS HARDER THAN ATTACK

Building the target chatbot with realistic defenses was educational. Even with explicit "DO NOT REVEAL" instructions, certain attack patterns consistently bypass guidelines. This reinforced why RED exists—organizations need to test their defenses systematically.

OBSERVABILITY CHANGES EVERYTHING

Integrating Datadog from the start transformed how we understood the system. Seeing attack success rates, response latencies, and leak type distributions in real-time made debugging and optimization dramatically easier. Security testing without observability is flying blind.

THE GAP IS REAL

Every person we showed RED to immediately understood the problem it solves. Organizations are deploying LLMs without security testing, and most don't know where to start. The market need is genuine and urgent.

================================================================================

WHAT'S NEXT FOR RED

EXPANDED ATTACK LIBRARY

We plan to grow beyond 38 attacks with: • Multi-modal attacks (image-based prompt injection) • Context window exploitation • Tool/function calling abuse patterns • RAG-specific poisoning attacks

TARGET SYSTEM FLEXIBILITY

Currently RED tests our built-in target chatbot. Future versions will support: • Custom target endpoints (test your own LLM applications) • Multiple model providers (OpenAI, Anthropic, Cohere) • Configurable system prompts for realistic scenarios

ATTACK SCHEDULING

Automated, recurring security assessments: • Daily vulnerability scans • Regression testing after model updates • Continuous monitoring mode

REMEDIATION GUIDANCE

Beyond detecting vulnerabilities, provide actionable fixes: • System prompt hardening recommendations • Input validation patterns • Output filtering suggestions • Guardrail implementation guides

TEAM COLLABORATION

Multi-user support for security teams: • Shared attack libraries • Custom attack sharing • Assessment history and trending • Role-based access control

COMPLIANCE REPORTING

Export capabilities for security audits: • PDF assessment reports • SARIF format for security tooling • Integration with GRC platforms

================================================================================

BUILT WITH

• Google Cloud Vertex AI • Gemini 2.0 Flash • Datadog LLM Observability • Datadog APM & Metrics • FastAPI • React 18 • TypeScript • Tailwind CSS • Python • Vite

================================================================================

TRY RED

RED represents our vision for what LLM security testing should be: comprehensive, accurate, observable, and actionable. We built it because we needed it—and we believe every organization deploying LLM applications needs it too.

The adversarial landscape is evolving rapidly. New jailbreaks emerge weekly. Prompt injection techniques grow more sophisticated. The only way to stay ahead is systematic, continuous testing.

RED finds what attackers will find first.

Built With

datadog
fastapi
google-gemini
react
vertex

Updates

Tyler Johnson started this project — Dec 31, 2025 04:59 PM EST

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.