feat: Design Intelligence Agent + full audit fix cycle #1
Merged
- Fix Pro quota 50→200 in quota.ts, known-state.json, quota.test.ts
- Update known-state.json with current ecosystem (new MCP tools, features, SDK/skills refs)
- Update docs/known-state.md to match
- Fix Pro quota display in PromptBuilder TUI (50→200)
- Fix forge-sync.md and forge-build.md tool references (get_screen→get_screen_code)
- Create 6 SKILL.md files in .claude/skills/ (forge-design, forge-generate, forge-build, forge-preview, forge-research, forge-sync)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- design.ts: Add actual readline confirmation before overwriting DESIGN.md, with --force flag support to skip the prompt
- index.ts: Wire --force option through commander to the design command
- generate.ts: Prompt user to select from multiple projects via readline instead of silently using the first one
- config.ts: Log a warning with error details when config parsing fails instead of silently returning defaults
- updater.ts: Deduplicate research changes by category+description before pushing, and cap detectedChanges array at 50 entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
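The updater.ts change above (deduplicate by category+description, cap at 50) could look roughly like the sketch below. The `DetectedChange` shape, the `mergeChanges` name, and the choice to keep the newest entries when the cap is hit are assumptions for illustration; the real code in updater.ts is not shown in this PR text.

```typescript
// Hypothetical shape; the actual type in updater.ts may differ.
interface DetectedChange {
  category: string;
  description: string;
}

// Cap taken from the commit message.
const MAX_DETECTED_CHANGES = 50;

// Build a composite key so two changes match only when both fields match.
function keyOf(change: DetectedChange): string {
  return `${change.category}\u0000${change.description}`;
}

// Append incoming changes that are not already present, then enforce the cap.
function mergeChanges(
  existing: DetectedChange[],
  incoming: DetectedChange[],
): DetectedChange[] {
  const seen = new Set(existing.map(keyOf));
  const merged = [...existing];
  for (const change of incoming) {
    const key = keyOf(change);
    if (seen.has(key)) continue;
    seen.add(key);
    merged.push(change);
  }
  // Assumption: when over the cap, the oldest entries are dropped.
  return merged.slice(-MAX_DETECTED_CHANGES);
}
```

Keying on both fields means the same description under a different category is still treated as a separate change.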
…() with ESM import
- App.tsx: Replace fake setTimeout research with real crawlSources/diffAgainstKnownState pipeline, display actual results and errors
- PromptBuilder.tsx: Replace stub handleSend with real StitchMcpClient generation (listProjects, generateScreen, getScreenCode, file write)
- PromptBuilder.tsx: Replace require('node:fs') with ESM import { existsSync, writeFileSync, mkdirSync }
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dators
Add exponential backoff with jitter (up to 3 retries) for transient errors (429, 5xx) and 30s request timeouts to the Stitch MCP client. Replace raw HTTP error text with user-friendly messages. Broaden vague prompt detection (fuzzy matching) and multi-screen detection patterns in validators. Add comprehensive tests for both.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
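As a rough illustration of the retry policy described above, the sketch below retries transient statuses (429, 5xx) up to 3 times with exponential backoff and full jitter. The 500 ms base delay, the `withRetry` and `backoffDelay` names, and the injectable `send`/`sleep` parameters are assumptions; the 30s request timeout is left out to keep the sketch small.

```typescript
const MAX_RETRIES = 3; // from the commit message
const BASE_DELAY_MS = 500; // assumed; the real base delay is not stated

// 429 (rate limited) and any 5xx are treated as transient.
function isTransient(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Full jitter: a random delay in [0, BASE_DELAY_MS * 2^attempt).
function backoffDelay(attempt: number, random: () => number = Math.random): number {
  return Math.floor(random() * BASE_DELAY_MS * 2 ** attempt);
}

// Calls `send` until it returns a non-transient status or retries run out.
async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const response = await send();
    if (!isTransient(response.status) || attempt >= MAX_RETRIES) return response;
    await sleep(backoffDelay(attempt));
  }
}
```

Passing `random` and `sleep` as parameters keeps the delay logic testable without real timers.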
…roved DESIGN.md template
Enhance the DESIGN.md template with descriptive color names, anti-slop Do's/Don'ts defaults, richer visual theme guidance, and non-default font suggestions. Add prompt enhancer that scores slop risk and suggests specific UI vocabulary. Add output validator that checks generated HTML for common AI patterns (Inter/Poppins fonts, purple-blue gradients, three-column icon grids, heading hierarchy, accessibility).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Wrap StitchMcpClient constructor in try/catch in generate, build, sync commands to show clean error instead of stack trace when API key missing
- Add postbuild step to copy known-state.json to dist/ (fixes research crash)
- Add empty prompt rejection (P0: whitespace-only prompts were accepted)
- Rewrite vague detection with word-level scoring instead of fragile regex anchors (P0: "make it better please" was bypassing detection)
- Add industry-specific DESIGN.md presets (5 aesthetics: bold, elegant, warm, playful, minimal) with distinct palettes, fonts, and imagery
- Add empty HTML detection to output validator
- Add Tailwind font-sans class detection for AI slop
- Update all skill descriptions with natural language trigger phrases
- Add specificity guardrail (#7) back to forge-generate skill
- Reference edit_screens, generate_variants, apply_design_system MCP tools
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
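The two P0 fixes above (empty prompt rejection and word-level vague detection) can be sketched as follows. The word list, the 0.8 threshold, and the `checkPrompt` name are assumptions chosen for illustration; the validator in this PR may score words differently.

```typescript
// Assumed list of low-information words; the real list is not shown in the PR.
const VAGUE_WORDS = new Set([
  "make", "it", "better", "nice", "good", "cool", "improve", "please",
  "something", "stuff", "look", "looks", "more", "modern", "clean",
  "the", "a", "this", "that", "just", "fix",
]);

type PromptCheck = { ok: true } | { ok: false; reason: "empty" | "vague" };

// Scores the share of vague words instead of matching the whole string,
// so trailing words like "please" cannot bypass detection.
function checkPrompt(prompt: string, threshold = 0.8): PromptCheck {
  // ASCII-only tokenizer to keep the sketch simple.
  const words = prompt.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  if (words.length === 0) return { ok: false, reason: "empty" };
  const vagueCount = words.filter((word) => VAGUE_WORDS.has(word)).length;
  return vagueCount / words.length >= threshold
    ? { ok: false, reason: "vague" }
    : { ok: true };
}
```

With this scoring, "make it better please" is rejected because all four words are on the list, while a prompt naming concrete UI elements passes.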
- Add JSON-RPC 2.0 envelope (jsonrpc, id) to all MCP requests
- Parse MCP content[].text responses correctly
- Handle JSON-RPC error responses with user-friendly messages
- Map resource names (projects/123) to internal IDs and back
- Map internal model names to API model IDs (GEMINI_2_5_FLASH→GEMINI_3_FLASH)
- Add get_screen fallback when get_screen_code proxy unavailable
- Fetch HTML from htmlCode.downloadUrl in native API
- Increase generation timeout to 120s (Stitch generation takes 60-90s)
- Wrap all API calls in generate command with try/catch
- Update test fixtures to match real API response format
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
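A minimal sketch of the envelope and response handling listed above, assuming the standard MCP `tools/call` method. The `buildToolCall` and `parseToolResult` names, the counter-based `id`, and the error wording are illustrative assumptions, not the client's actual code.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params: unknown;
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number;
  result?: { content?: Array<{ type: string; text?: string }> };
  error?: { code: number; message: string };
}

let nextId = 1;

// Wraps a tool invocation in the JSON-RPC 2.0 envelope (jsonrpc + id).
function buildToolCall(name: string, args: Record<string, unknown>): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id: nextId++,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// The payload arrives as a JSON string inside content[].text, not as an object.
function parseToolResult(response: JsonRpcResponse): unknown {
  if (response.error) {
    throw new Error(`Stitch request failed: ${response.error.message}`);
  }
  const text = response.result?.content?.find((item) => item.type === "text")?.text;
  if (text === undefined) {
    throw new Error("Response contained no text content");
  }
  return JSON.parse(text);
}
```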
- generate_screen_from_text returns {projectId, sessionId, outputComponents}
not a screen object — use list_screens before/after to detect new screen
- Fix projectId parameter (API expects without 'projects/' prefix)
- Update mock test to provide 3 fetch responses for the 3-call flow
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
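The projectId fix above comes down to converting between the resource name (`projects/123`) and the bare ID the API parameter expects. A small sketch, with hypothetical helper names:

```typescript
const PROJECT_PREFIX = "projects/";

// Strips the "projects/" prefix if present; bare IDs pass through unchanged.
function toBareProjectId(nameOrId: string): string {
  return nameOrId.startsWith(PROJECT_PREFIX)
    ? nameOrId.slice(PROJECT_PREFIX.length)
    : nameOrId;
}

// Produces the resource-name form, accepting either input form.
function toProjectResourceName(nameOrId: string): string {
  return `${PROJECT_PREFIX}${toBareProjectId(nameOrId)}`;
}
```

Accepting either form in both helpers makes them safe to call more than once on the same value.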
Replace single-dimension aesthetic presets with a 3-axis system:
- Industry axis (14 palettes): retail, fintech, saas, healthcare, etc. Each industry gets psychologically appropriate colors (retail=red/gold, fintech=navy/blue, wellness=sage/terracotta)
- Aesthetic axis (6 modifiers): bold, elegant, warm, playful, minimal, confident. Controls surface tones, typography, border-radius, shadows
- Audience axis: generates context-aware imagery guidelines and industry+audience-specific Do's/Don'ts
Key changes:
- Industry-aware color psychology (a grocery chain gets Retail Red, not the same Deep Navy as a fintech)
- Audience-aware imagery (Mexican families ≠ enterprise CTOs)
- Industry-specific guardrails (retail: show prices prominently, fintech: don't use red for positive metrics)
- Culture-aware rules (Mexican market: Spanish copy, no US/EU stock)
- Theme description uses all 4 brief parameters, not just aesthetic
- SaaS rules excluded from retail industry match (false positive fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Export matchIndustry, matchAesthetic, generateImageryGuidelines, generateDosAndDonts from design-md.ts for synthesizer fallback layer
- Export INDUSTRY_PALETTES, AESTHETIC_MODIFIERS constants
- Export IndustryPalette, AestheticModifier interfaces
- Create src/research/types.ts with shared interfaces for the Design Intelligence Agent pipeline (BusinessBrief, SiteAnalysis, CompetitorAnalysis, AudienceInsight, MarketPosition, BusinessResearchResult, DesignQualityScore, SynthesizedDesign)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation
New pipeline: Research → Synthesize → Validate
Research (business-researcher.ts):
- analyzeSite: fetch + cheerio extraction of colors, fonts, layout patterns
- extractPalette: CSS color parsing with frequency counting and context
- extractTypography: Google Fonts + font-family detection
- detectLayoutPatterns: DOM structure analysis (hero, grid, cards, nav)
- inferAudienceInsights: knowledge base for 5 industries + cultural locale
- inferMarketPosition: keyword-based positioning inference
- researchBusiness: orchestrator with confidence scoring
Synthesis (design-synthesizer.ts):
- synthesizePalette: real brand colors > competitor differentiation > presets
- synthesizeTypography: competitor-aware font selection from curated list
- synthesizeImagery: audience + culture + market-aware guidelines
- synthesizeDosAndDonts: base anti-slop + competitor + cultural rules
- Falls back to static template at confidence < 30
Validation (design-validator.ts):
- scoreSpecificity (0-25): penalizes placeholders, generic terms
- scoreDifferentiation (0-25): hex distance from competitor colors
- scoreCompleteness (0-25): validates all 8 sections with content
- scoreActionability (0-25): checks for unambiguous, testable rules
Integration:
- forge discover CLI command with --url, --competitors, --locale flags
- forge design --research flag redirects to discover
- forge-discover SKILL.md for Claude Code autonomous invocation
- Research cache in .forge-research/ (7-day TTL)
Tests: 48 new tests (106 total), fixtures for retail + competitor sites
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
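One plausible reading of "hex distance from competitor colors" is sketched below: Euclidean distance in RGB space to the nearest competitor color, scaled to the 0-25 range. The distance metric, the scaling, the full score when no competitors are known, and the `differentiationScore` name are all assumptions; design-validator.ts may compute this differently.

```typescript
// Parses "#RGB" or "#RRGGBB" into numeric channels.
function parseHex(hex: string): [number, number, number] {
  const raw = hex.replace("#", "");
  const full = raw.length === 3 ? raw.split("").map((c) => c + c).join("") : raw;
  const value = parseInt(full, 16);
  return [(value >> 16) & 255, (value >> 8) & 255, value & 255];
}

// Euclidean distance between two colors in RGB space.
function colorDistance(a: string, b: string): number {
  const [r1, g1, b1] = parseHex(a);
  const [r2, g2, b2] = parseHex(b);
  return Math.hypot(r1 - r2, g1 - g2, b1 - b2);
}

// Largest possible distance: black to white.
const MAX_DISTANCE = Math.hypot(255, 255, 255);

// Scores 0-25 by how far the primary color sits from its nearest competitor.
function differentiationScore(primary: string, competitorColors: string[]): number {
  // Assumption: with no competitor data there is nothing to collide with.
  if (competitorColors.length === 0) return 25;
  const nearest = Math.min(...competitorColors.map((c) => colorDistance(primary, c)));
  return Math.round((nearest / MAX_DISTANCE) * 25);
}
```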
…dedup rules
- extractPalette: filter ALL near-gray colors (R≈G≈B within 20) as structural,
not just very dark/light grays. This ensures brand colors like 3B Red (#DC0C0C)
are identified as primary instead of layout grays (#333, #666)
- extractTypography: filter icon fonts (dashicons, material icons, fontawesome,
etc.) from font detection — these are never heading/body fonts
- synthesizeDosAndDonts: normalize rules before deduplication to catch
near-duplicate entries ("Show prices prominently in every product display"
vs "Show prices prominently on every product card")
- Fix theme description grammar: "balances a [personality] personality with
the practical needs of [audience]"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
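The near-gray filter described in this commit (R≈G≈B within 20) can be sketched as a channel-spread check. The `isNearGray` and `pickPrimary` names and the input shape are hypothetical; only the tolerance of 20 and the example colors come from the commit message.

```typescript
// Parses "#RGB" or "#RRGGBB" into numeric channels.
function parseHex(hex: string): [number, number, number] {
  const raw = hex.replace("#", "");
  const full = raw.length === 3 ? raw.split("").map((c) => c + c).join("") : raw;
  const value = parseInt(full, 16);
  return [(value >> 16) & 255, (value >> 8) & 255, value & 255];
}

// A color is treated as structural gray when its channels differ by at most `tolerance`.
function isNearGray(hex: string, tolerance = 20): boolean {
  const [r, g, b] = parseHex(hex);
  return Math.max(r, g, b) - Math.min(r, g, b) <= tolerance;
}

// Picks the most frequent color that is not a near-gray, if any.
function pickPrimary(colors: Array<{ hex: string; count: number }>): string | undefined {
  const candidates = colors
    .filter((color) => !isNearGray(color.hex))
    .sort((a, b) => b.count - a.count);
  return candidates[0]?.hex;
}
```

In this sketch a brand color wins over layout grays even when the grays appear far more often in the CSS.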
…score
Change "the design should feel confident" to "maintain a confident tone" to avoid "should" penalty in design validator actionability scoring.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Current state: agent pipeline works E2E but output quality is insufficient. Research extracts real brand colors (#DC0C0C) from tiendas3b.com but the generated DESIGN.md and downstream Stitch output don't reflect the actual business model, positioning, or user needs accurately enough.
Known issues to fix:
- DESIGN.md doesn't capture business model context (3B is physical stores, not e-commerce; the landing page should drive foot traffic, not sales)
- Prompt construction doesn't leverage DESIGN.md deeply enough
- Skills don't guide Claude to produce business-aware prompts
- Output validator can't check for business alignment
- No feedback loop between validation and regeneration
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Core change: The system now understands WHAT a business IS before designing for it. Previously it extracted colors/fonts but generated e-commerce pages for physical-only stores.
Business model inference (business-researcher.ts):
- inferBusinessModel() detects type from site signals (nav items, CTAs)
- 8 business types: physical-retail, e-commerce, saas, marketplace, service, media, nonprofit, other
- Physical retail detection: "Sucursales/Tiendas/Locations" in nav, absence of cart/checkout signals
- CTA text extraction for business signal detection
- Confidence scoring: 80 with site data, 50 from keywords only
DESIGN.md business context (design-synthesizer.ts):
- Section 1 now includes Business Model, Website Purpose, Primary User Goals, Key Page Elements, and Avoid list
- Physical retail: "NOT an e-commerce site", "Store locator primary CTA"
- Business-model-aware Do's/Don'ts per type
Prompt alignment (prompt-enhancer.ts):
- Detects e-commerce terms in prompts when DESIGN.md says "not e-commerce"
- Suggests store locator when DESIGN.md marks it as key feature
- New field: businessAlignmentIssues in EnhancementResult
Output validation (output-validator.ts):
- Flags cart/checkout elements when business is not e-commerce
- Checks for store locator presence when it's a key feature
Skill rewrite (forge-discover SKILL.md):
- Conversational discovery agent with confidence gating
- Phase 1: Understand business (REQUIRED before designing)
- Phase 2: Research (WebSearch/WebFetch)
- Confidence threshold: ≥70 weighted average to proceed
- Core guardrail: "NEVER assume e-commerce"
Tests: 121 total (15 new for business model inference + context)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
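To make the inference rules above concrete, here is a reduced sketch covering only two of the eight business types. The locator terms and the 80/50 confidence values come from the commit message; the cart terms list, the `inferModelSketch` name, the types, and the confidence of 0 for an undetermined result are assumptions.

```typescript
type BusinessModel = "physical-retail" | "e-commerce" | "other";

interface SiteSignals {
  navItems: string[];
  ctaTexts: string[];
}

interface Inference {
  model: BusinessModel;
  confidence: number;
}

const LOCATOR_TERMS = ["sucursales", "tiendas", "locations"];
const CART_TERMS = ["cart", "checkout"]; // assumed signal words

function mentionsAny(texts: string[], terms: string[]): boolean {
  return texts.some((text) => terms.some((term) => text.toLowerCase().includes(term)));
}

// Cart signals win; store-locator terms count only when no cart signal exists.
function classify(texts: string[]): BusinessModel {
  if (mentionsAny(texts, CART_TERMS)) return "e-commerce";
  if (mentionsAny(texts, LOCATOR_TERMS)) return "physical-retail";
  return "other";
}

// Site data yields confidence 80; keyword-only inference yields 50.
function inferModelSketch(signals: SiteSignals | null, keywords: string[] = []): Inference {
  if (signals) {
    const fromSite = classify([...signals.navItems, ...signals.ctaTexts]);
    if (fromSite !== "other") return { model: fromSite, confidence: 80 };
  }
  const fromKeywords = classify(keywords);
  return { model: fromKeywords, confidence: fromKeywords === "other" ? 0 : 50 };
}
```

Checking cart signals first encodes the "absence of cart/checkout signals" condition: a site with both a store locator and a cart is not classified as physical-only.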
Root cause: generate_screen_from_text returns the screen data (id, htmlCode.downloadUrl, screenshot.downloadUrl) inside outputComponents[0].design.screens[0], but the code was ignoring this and trying list_screens which doesn't show new screens immediately. This caused every generation to succeed (step 1) but fail on HTML retrieval (step 2), wasting Stitch API tokens.
Fix:
- extractScreenFromResponse() parses outputComponents to get screen ID, htmlCode URL, and screenshot URL directly from generation response
- getScreenCode() accepts optional htmlCodeUrl parameter for direct download (no additional API calls needed)
- Fallback: retry list_screens with 2s/4s/6s delays if parsing fails
- GenerateScreenResult now includes htmlCodeUrl and screenshotUrl
- generate command passes htmlCodeUrl to getScreenCode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The primary extraction path iterates ALL outputComponents (not just [0]) to find the first one with design.screens. If that fails, a deep search regex scans the JSON for downloadUrl + screen ID patterns. This handles edge cases where the screen data might be in a different outputComponent index or in an unexpected nested structure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
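The primary extraction path described above might be sketched like this. The field paths (`outputComponents[].design.screens[0]`, `htmlCode.downloadUrl`, `screenshot.downloadUrl`) come from the commit messages; the `ScreenRef` type and `extractScreen` name are hypothetical, and the deep-search regex fallback is omitted.

```typescript
interface ScreenRef {
  id: string;
  htmlCodeUrl: string | undefined;
  screenshotUrl: string | undefined;
}

// Scans every output component and returns the first screen found, or null.
function extractScreen(response: unknown): ScreenRef | null {
  const components = (response as { outputComponents?: unknown } | null)?.outputComponents;
  if (!Array.isArray(components)) return null;
  for (const component of components) {
    // Components without design.screens (e.g. plain text output) are skipped.
    const screen = component?.design?.screens?.[0];
    if (screen && typeof screen.id === "string") {
      return {
        id: screen.id,
        htmlCodeUrl: screen.htmlCode?.downloadUrl,
        screenshotUrl: screen.screenshot?.downloadUrl,
      };
    }
  }
  return null;
}
```

Returning null rather than throwing lets the caller decide whether to fall back to the list_screens retry.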
Summary
Complete overhaul of stitch-forge based on a comprehensive audit that found 8 bugs, 6 architectural issues, and fundamental design flaws. Every fix was identified through systematic E2E testing against the real Stitch API, using a real-world business case study to validate output quality.
The most critical discovery: the system didn't understand what businesses actually DO. It generated an e-commerce page for a physical-only store chain. This led to building the Design Intelligence Agent — a research-driven system that autonomously discovers the business model before designing.
What was found and fixed
Phase 1: Audit-driven bug fixes (8 bugs)
These were identified by reading every source file (28 TypeScript files, 3,554 lines) and cross-referencing against the Stitch ecosystem:
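- Pro quota was set to 50 instead of 200 (`quota.ts`)
- DESIGN.md was overwritten without an actual confirmation prompt (`design.ts`)
- Research was faked by `setTimeout(() => 'done', 2000)` with no actual research (`App.tsx`)
- `handleSend` was a stub that never called Stitch (`PromptBuilder.tsx`)
- `require('node:fs')` in ESM module (`PromptBuilder.tsx`)
- The first project was used silently when several existed (`generate.ts`)
- Config parse failures silently returned defaults (`config.ts`)
- Research changes were pushed without deduplication or a cap (`updater.ts`)
Phase 2: Resilience layer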
Found through E2E testing — the MCP client had zero error recovery:
- Replaced `^exact match$` regex with word-level analysis
Phase 3: Anti-AI-slop layer
Built after analyzing common Stitch output patterns:
- Prompt enhancer (`prompt-enhancer.ts`): scores slop risk 0-10, suggests specific UI vocabulary replacements
- Output validator (`output-validator.ts`): scores generated HTML 0-100, detects AI-default fonts, purple-blue gradients, heading hierarchy skips, missing alt attributes, Tailwind font-sans class
Phase 4: Real API integration
Discovered through live Stitch API testing — the client didn't match the real API:
- Requests require `jsonrpc: "2.0"` and `id` fields (was missing both)
- Responses arrive as `result.content[0].text` with JSON string, not direct objects
- Resource names use `projects/123` format, not bare IDs
- `GEMINI_2_5_FLASH` doesn't exist; real IDs are `GEMINI_3_FLASH` and `GEMINI_3_1_PRO`
- `generate_screen_from_text` returns screen data inside `outputComponents[0].design.screens[0]`, not as a standalone object. Fixed by parsing `outputComponents` directly with deep-search fallback.
Phase 5: Design Intelligence Agent
The fundamental breakthrough. Identified when the system generated an e-commerce landing page for a physical-only retail chain that doesn't sell online.
Root cause: the DESIGN.md only encoded visual rules (colors, fonts) but zero business context. Stitch defaults to e-commerce patterns because that's the most common web pattern.
Solution: 3-stage pipeline (Research → Synthesize → Validate):
- Business model inference (`business-researcher.ts`): analyzes website navigation items and CTAs to determine business type (physical-retail, e-commerce, SaaS, service, etc.)
- Business context in DESIGN.md (`design-synthesizer.ts`): Section 1 now includes Business Model, Website Purpose, User Goals, Key Features, and what the site must NOT have.
- Business alignment validation: prompt enhancer flags e-commerce terms when DESIGN.md says "not e-commerce". Output validator flags cart/checkout HTML in non-e-commerce contexts.
- Confidence-gated skill (`forge-discover`): conversational agent that asks questions and researches autonomously. Refuses to generate until business model confidence ≥ 70%.
Phase 6: Skills migration
Migrated 6 slash commands to SKILL.md format with:
- `/forge-discover` skill as primary entry point
E2E test results (physical retail case study)
Stats
Test plan
- `npx tsc --noEmit`: zero errors
- `npx vitest run`: 121/121 tests pass
- `forge discover` detects physical retail from site analysis
🤖 Generated with Claude Code