/review produces high false-positive rate when applied to mature frameworks (Django) #1539

@kazushono

Description

Hi! Thanks for gstack. It has been a core part of my Claude Code workflow
for two consecutive sprints. I wanted to share data on the /review
false-positive rate that might be useful for tuning the prompt.

Context

  • Project: Django + DRF + PostgreSQL full-stack app (PoC for construction
    site management)
  • Sprint 2 (/ship adversarial review): 1+ false positive (Finding #8)
  • Sprint 2.5 (/review): 4 false positives out of 8 findings (50%)

Both runs raised concerns that were resolvable in under 5 minutes by
inspecting the relevant code or model definitions, suggesting the
adversarial review prompt may be raising hypotheses without first applying
basic self-verification.

Specific examples from Sprint 2.5

The four false positives all fit a pattern: "resolvable in <5 minutes by
viewing the actual code or running a simple grep".

| # | Concern raised | Resolution |
|---|---|---|
| FP-1 | dict.get() might be None-unsafe | Django form's cleaned_data is {}-initialized (visible by reading the form code) |
| FP-2 | rental.save() might lose fields | Standard Django ORM INSERT behavior for unsaved instances |
| FP-3 | update_fields might miss updated_at | Field doesn't exist on the model (grep resolves immediately) |
| FP-4 | 3 classes might be asymmetric | Mechanical comparison shows they're symmetric |
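
FP-3 is the kind of concern a mechanical check can settle before it is reported. Below is a minimal sketch of such a check, assuming the model source is available as text. The `Rental` model body, its fields, and the helper `model_declares_field` are hypothetical stand-ins for illustration, not code from my project or from gstack.

```python
import ast

# Hypothetical model source; the real project's fields are not shown in this issue.
MODEL_SOURCE = '''
class Rental(models.Model):
    site = models.ForeignKey("Site", on_delete=models.CASCADE)
    status = models.CharField(max_length=20)
'''


def model_declares_field(source: str, class_name: str, field_name: str) -> bool:
    """Return True if class_name assigns field_name directly in its class body."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for stmt in node.body:
                if isinstance(stmt, ast.Assign):
                    targets = stmt.targets
                elif isinstance(stmt, ast.AnnAssign):
                    targets = [stmt.target]
                else:
                    continue
                if any(isinstance(t, ast.Name) and t.id == field_name for t in targets):
                    return True
    return False


# "update_fields might miss updated_at" is moot if the field is never declared.
print(model_declares_field(MODEL_SOURCE, "Rental", "updated_at"))
print(model_declares_field(MODEL_SOURCE, "Rental", "status"))
```

The sketch only parses the source and never imports Django, so it is cheap to run. It only sees fields declared directly on the class, so fields inherited from an abstract base model would need a second look.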

What was correctly identified (true positives)

The same /review run also correctly identified:

  • F-1: Test calling rental.save() directly bypasses save_model mutation
    testing (real coverage gap)
  • F-3: Django 4.1+ _post_clean() calls validate_constraints() before
    save_model, causing UniqueConstraint to fire before the ServiceLayer
    can resolve the conflict (real bug, surfaced only via browser QA)

These were valuable findings — they required cross-layer reasoning
(test/admin/form/DB) and were not resolvable by simple code inspection.

Suggested improvement

What worked for me was adding a self-check before reporting findings:

Before reporting a finding, ask:

  1. Can I resolve this by view-ing 1-2 files in under 5 minutes?
    • Yes → resolve it, don't report it
    • No → report it with Y-10 evidence
  2. Is this reproducible by existing pytest tests?
    • Yes → likely already covered, re-check before flagging
    • No → likely a real-environment issue, worth flagging
  3. Is the answer in the framework's surface-level documentation?
    • Yes → skip (or just cite the docs)
    • No → genuine internal-behavior concern, worth flagging
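
The three questions above can be sketched as a small decision function. This is only an illustration of the rule, not how gstack's /review prompt works. The `Finding` type, the `gate` function, the action labels, and the order in which the questions are combined are all assumptions of mine.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One candidate finding plus the answers to the three self-check questions."""
    title: str
    resolvable_by_quick_inspection: bool  # question 1
    reproducible_by_existing_tests: bool  # question 2
    answered_by_framework_docs: bool      # question 3


def gate(finding: Finding) -> str:
    """Map the three answers to an action: 'resolve', 'skip', 're-check', or 'report'."""
    if finding.resolvable_by_quick_inspection:
        return "resolve"   # settle it by reading the code; do not report
    if finding.answered_by_framework_docs:
        return "skip"      # or just cite the docs
    if finding.reproducible_by_existing_tests:
        return "re-check"  # likely already covered, verify before flagging
    return "report"


# Illustrative answers only; the values for F-3 beyond question 1 are assumed.
fp3 = Finding("update_fields might miss updated_at", True, False, False)
f3 = Finding("UniqueConstraint fires before save_model", False, False, False)

print(gate(fp3))
print(gate(f3))
```

With these assumed answers, FP-3 is resolved silently and F-3 is reported, which matches how the two were classified in the Sprint 2.5 run.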

I added this as a project-level rule in my CLAUDE.md. Will measure
Sprint 3+ false positive rates to validate.
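
For the measurement itself, a per-run tally along these lines would be enough. This is a sketch only: the helper name `false_positive_rate` is hypothetical, and Sprint 2 is left out because its total finding count is not recorded above.

```python
def false_positive_rate(false_positives: int, total_findings: int) -> float:
    """Fraction of reported findings that turned out to be false positives."""
    if total_findings <= 0:
        raise ValueError("total_findings must be positive")
    if not 0 <= false_positives <= total_findings:
        raise ValueError("false_positives must be between 0 and total_findings")
    return false_positives / total_findings


# (false positives, total findings) per run, taken from the Context section.
runs = {"Sprint 2.5 (/review)": (4, 8)}

for name, (fp, total) in runs.items():
    print(f"{name}: {false_positive_rate(fp, total):.0%}")
```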

Question

Would gstack be open to:

  • (a) Adding a "self-verification gate" to the /review prompt before
    finding generation, or
  • (b) Documenting this pattern in gstack docs (e.g., as a known caveat
    with Django/mature frameworks)?

Happy to contribute either way. Let me know if you'd like to see the full
context (sprint retrospective notes are public in my repo).

Happy hacking 🙏
