Skip to content

Improve visual context OCR quality#482

Merged
FuJacob merged 4 commits into
mainfrom
fix/visual-context-quality
Jun 1, 2026
Merged

Improve visual context OCR quality#482
FuJacob merged 4 commits into
mainfrom
fix/visual-context-quality

Conversation

@FuJacob

@FuJacob FuJacob commented May 31, 2026

Copy link
Copy Markdown
Owner

Summary

Improve the screenshot/OCR visual-context pipeline so Cotabby sends richer, sanitized prompt context instead of dropping context when summarization fails. The change keeps Screen Recording required, expands OCR capture quality and budgets, adds deterministic OCR-noise filtering, and replaces the generic summary prompt with an autocomplete-focused context extraction prompt.

Validation

  • plutil -lint Cotabby.xcodeproj/project.pbxproj
    • Cotabby.xcodeproj/project.pbxproj: OK
  • xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS build-for-testing -derivedDataPath build/DerivedData
    • ** TEST BUILD SUCCEEDED **
  • xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination platform=macOS test -only-testing:CotabbyTests/PromptContextSanitizerTests -only-testing:CotabbyTests/ScreenshotContextGeneratorTests -only-testing:CotabbyTests/VisualContextSummaryPromptRendererTests -derivedDataPath build/DerivedData
    • Compiled, then failed to load the app-hosted CotabbyTests.xctest bundle because the test bundle and host app have different Team IDs. This matches the known local signing failure for app-hosted tests.
  • rm -rf build/DerivedData

Linked issues

None.

Risk / rollout notes

  • Visual context intentionally trades a bit more capture/OCR/summarization latency for higher-quality context: larger screenshot/OCR budgets, Vision accurate mode, and a longer summary timeout.
  • Fast Mode still skips screenshot/OCR capture after permissions are granted, but Screen Recording remains required for autocomplete eligibility.
  • No hosted APIs, raw screenshots, or unbounded OCR text are introduced.

Greptile Summary

This PR overhauls the visual-context OCR pipeline to deliver richer, sanitized prompt context for autocomplete instead of silently dropping context when summarization fails. Capture budgets are expanded (larger screenshot dimensions, 5 000-char OCR cap, 6 s timeout, Vision accurate mode), and a new deterministic token-scoring pass in PromptContextSanitizer replaces the previous heuristic with layered rules that keep prose, technical tokens, emails, file paths, and non-Latin scripts while dropping OCR garbage.

  • Graceful OCR fallback: ScreenshotContextGenerator.resolvedContextText now falls back to bounded, sanitized OCR text whenever summarization fails or produces an empty result, rather than propagating the error and losing all visual context.
  • Testability via protocols: WindowScreenshotCapturing and ScreenTextExtracting protocol seams let ScreenshotContextGenerator be exercised with stub implementations, enabling the new ScreenshotContextGeneratorTests to cover every key path without requiring live capture or a GGUF model.
  • Autocomplete-focused prompt: VisualContextSummaryPromptRenderer extracts the summarization prompt into a dedicated, testable type with explicit priority ordering and prompt-injection rejection instructions.

Confidence Score: 5/5

Safe to merge; the pipeline changes are well-contained, all fallback paths return bounded sanitized text rather than empty context, and new protocol seams enable meaningful unit coverage.

The OCR fallback logic and token-scoring rules are the most complex additions, and both are exercised by new unit tests. The summarizer now propagates errors instead of swallowing them, but resolvedContextText catches every error and falls back gracefully, so no regression path was opened. Expanded capture budgets increase latency but are isolated to the visual-context path and explicitly documented.

PromptContextSanitizer.swift — the isLikelyShortMixedCaseNoise heuristic (already flagged in a previous review round) still false-positives on lowercase-initial brand names common on macOS screenshots (iPhone, macOS, iCloud, UIKit). No other files require special attention.

Important Files Changed

Filename Overview
Cotabby/Support/PromptContextSanitizer.swift Extensive new deterministic token-scoring logic for OCR noise filtering; handles non-ASCII, emails, file tokens, repeated-glyph junk, and mixed-case noise, but isLikelyShortMixedCaseNoise false-positives on lowercase-initial brand names (iPhone, macOS, iCloud, UIKit) remain unaddressed.
Cotabby/Services/Visual/ScreenshotContextGenerator.swift Refactored to use protocol seams (WindowScreenshotCapturing, ScreenTextExtracting) and extracts resolvedContextText helper; fallback from failed/empty summaries to sanitized OCR is now robust and well-structured.
Cotabby/Services/Visual/LlamaVisualContextSummarizer.swift Timeout doubled to 6s, prediction tokens doubled to 160, now propagates errors rather than swallowing them, and delegates prompt rendering to VisualContextSummaryPromptRenderer; error propagation up to resolvedContextText is correct.
Cotabby/Support/VisualContextSummaryPromptRenderer.swift New file extracting the summarization prompt into a testable policy type; sanitizeOCR is called on text that was already sanitized by the caller, a harmless but redundant second pass noted in a previous review comment.
Cotabby/Services/Visual/ScreenTextExtractor.swift Adds ScreenTextExtracting protocol test seam; switches Vision recognition to accurate mode and lowers minimumTextHeight from 0.012 to 0.008 for denser text capture.
Cotabby/Services/Visual/WindowScreenshotService.swift Adds WindowScreenshotCapturing protocol; increases horizontal padding (100 to 160pt) and vertical context height (600 to 800pt) to give OCR a wider view of surrounding content.
Cotabby/Models/VisualContextModels.swift Default configuration bumped: snapshotDimension 500 to 700, maxImageDimension 900 to 1600, maxRecognizedCharacters 2000 to 5000, maxSummaryCharacters 900 to 1500.
CotabbyTests/ScreenshotContextGeneratorTests.swift New test file with stub implementations for the new protocol seams; covers summary usage, empty-summary OCR fallback, thrown-error OCR fallback, size capping, and all-noise rejection.
CotabbyTests/PromptContextSanitizerTests.swift Adds 5 new OCR sanitization tests covering mixed-case garbage, technical token preservation, line-level noise dropping, non-Latin script passthrough, and mixed ASCII/non-Latin lines.
CotabbyTests/VisualContextSummaryPromptRendererTests.swift New test verifying the autocomplete-focused prompt structure and prompt-injection rejection instructions.
Cotabby/UI/Settings/Panes/PermissionsPaneView.swift Updated Screen Recording description from Optional to Required even when Fast Mode skips capture, matching the new autocomplete eligibility semantics.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[FocusedInputSnapshot] --> B[captureScreenshot WindowScreenshotCapturing]
    B --> C{OCR extractText}
    C -- noRecognizedText --> D{windowTitle available?}
    D -- no --> ERR1[throw unavailable]
    D -- yes --> E[normalizeRecognizedText windowTitle]
    E --> F{hasMeaningfulSignal?}
    F -- no --> ERR1
    F -- yes --> RETURN1[return VisualContextExcerpt source ocr_fallback]
    C -- success --> G[normalizeRecognizedText sanitizeOCR cap 5000 chars]
    G --> H{hasMeaningfulSignal normalizedText?}
    H -- no --> ERR2[throw unavailable]
    H -- yes --> I[resolvedContextText]
    I --> J{summarizer configured?}
    J -- no --> K[ocrFallback boundedSummaryText]
    J -- yes --> L[summarizer.summarize VisualContextSummaryPromptRenderer 6s timeout 160 tokens]
    L -- throws --> K
    L -- empty result --> K
    L -- success --> M[boundedSummaryText cap 1500 chars]
    M --> N{hasMeaningfulSignal summary?}
    N -- no --> K
    N -- yes --> RETURN2[return VisualContextExcerpt source summary]
    K --> O{hasMeaningfulSignal fallback?}
    O -- no --> ERR2
    O -- yes --> RETURN3[return VisualContextExcerpt source ocr_fallback]
Loading

Comments Outside Diff (2)

  1. Cotabby/Support/PromptContextSanitizer.swift, line 614-633 (link)

    P2 isLikelyShortMixedCaseNoise false-positives for lowercase-initial camelCase brand names

    The function is designed to flag OCR garbage like "gLVWrt" but its final return condition — uppercaseCount >= 2 || !firstCharacterIsUppercase — also catches legitimate tokens where the first letter is lowercase but the token contains any uppercase. Concretely: "iPhone" (letters.first=i, uppercaseCount=1 → !firstCharacterIsUppercase=true → classified as noise), "macOS" (letters.first=m, uppercaseCount=2 → both clauses true → noise), and "iCloud", "iPad", "eBay" all follow the same path. None of these appear in knownWordSignals or preservedTechnicalTokens, so they reach isLikelyShortMixedCaseNoise and are silently dropped. On a macOS screenshot tool, references to Apple product names are common OCR output and their removal degrades autocomplete context.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Codex Fix in Claude Code

  2. Cotabby/Support/VisualContextSummaryPromptRenderer.swift, line 660-661 (link)

    P2 Redundant sanitizeOCR call on already-filtered text

    VisualContextSummaryPromptRenderer.prompt is called from LlamaVisualContextSummarizer.summarize, which receives text that was already processed by normalizeRecognizedText (i.e. PromptContextSanitizer.sanitizeOCR with a 5 000-character cap). The second sanitizeOCR call here without a character limit runs the full line-scoring logic on text that is already clean, making it a no-op in the production path. The extra pass is harmless and idempotent, but it creates a cost asymmetry: in production the sanitizer runs twice while tests that call prompt directly sanitize exactly once. Consider removing the internal call and documenting that callers are expected to pre-sanitize, or keep it but note the redundancy so future callers know the contract.

    Fix in Codex Fix in Claude Code

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

@FuJacob FuJacob marked this pull request as draft May 31, 2026 18:21
FuJacob added 2 commits May 31, 2026 11:35
SwiftLint (--strict) flagged generateContext at cyclomatic complexity 11 (limit 10). Extract the summarizer-vs-OCR fallback branch into resolvedContextText(...), a behavior-preserving helper that drops the function to ~8 and reads more clearly.

XcodeGen drift: the project.pbxproj had hand-written synthetic object IDs (AA1000...) for the three new files instead of being regenerated. Ran 'xcodegen generate' so the committed project matches project.yml (hash-based IDs); diff vs main is now only the three new file references. Sparkle stays pinned exactVersion 2.9.1.
sanitizeOCR's noise filter only treated a token as real when it had an ASCII vowel or matched the English word lists, so CJK, Cyrillic, Greek, Arabic, Hebrew, Thai, and accented-Latin tokens were stripped to nothing. Non-English users would get empty visual context and fully generic suggestions.

assessOCRToken now keeps any token carrying a non-ASCII letter as strong signal, after numbers and repeated-glyph junk are rejected so this is not a backdoor for ASCII garbage. The Latin-only tail moved into assessLatinToken to keep cyclomatic complexity under the limit; ASCII behavior is unchanged (a count<=2 token can never be repeated-glyph junk, so the reordering is a no-op for it). Adds regression tests for CJK/Cyrillic/accented-Latin preservation and for ASCII noise still being dropped on a mixed line.
@FuJacob FuJacob marked this pull request as ready for review June 1, 2026 04:10
…ality

# Conflicts:
#	Cotabby.xcodeproj/project.pbxproj
@FuJacob FuJacob merged commit cc068da into main Jun 1, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant