
Add first-class DOCX support with Docxodus microservice and WASM renderer#1102

Merged
JSv4 merged 50 commits into main from
claude/explore-docxodus-G49qt
Mar 21, 2026

Conversation

JSv4 (Collaborator) commented Mar 14, 2026

Summary

This PR adds comprehensive DOCX document support to OpenContracts through a parallel ingestion and rendering pipeline. It introduces a Docxodus .NET microservice for backend parsing and a React WASM component for frontend rendering, enabling native DOCX handling with character-offset-based annotations that align perfectly between backend and frontend.

Key Changes

Backend

  • DocxodusServiceParser (opencontractserver/pipeline/parsers/docxodus_parser.py): New REST-based parser that delegates DOCX processing to a containerized .NET microservice. Handles base64 encoding/decoding, timeout/error handling with transient retry logic, and Cloud Run IAM authentication support.
  • DocxThumbnailGenerator (opencontractserver/pipeline/thumbnailers/docx_thumbnailer.py): Two-tier thumbnail generation—first attempts to extract embedded thumbnails from DOCX ZIP archives, falls back to text-based rendering.
  • Docxodus .NET Microservice (docxodus-service/): Minimal ASP.NET Core Web API wrapping Docxodus's OpenContractExporter.Export(). Exposes /parse endpoint accepting base64-encoded DOCX files and returning OpenContractDocExport JSON with structural annotations and character offsets.
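A minimal sketch of the parser's request flow described above: base64-encode the DOCX bytes for the JSON body, then classify failures so retry logic knows whether trying again makes sense. Function names and constants here are illustrative assumptions, not the actual DocxodusServiceParser code:

```python
import base64

TRANSIENT = "transient"   # retry eligible: timeouts, connection errors, 5xx
PERMANENT = "permanent"   # do not retry: 4xx client errors


def encode_docx(docx_bytes: bytes) -> str:
    """Base64-encode DOCX bytes for the JSON request body."""
    return base64.b64encode(docx_bytes).decode("ascii")


def classify_http_status(status_code: int) -> str:
    """5xx server errors are transient; 4xx client errors are permanent."""
    if 500 <= status_code < 600:
        return TRANSIENT
    if 400 <= status_code < 500:
        return PERMANENT
    return "ok"


payload_b64 = encode_docx(b"PK\x03\x04 fake docx bytes")
# Base64 round-trips losslessly, so the microservice recovers the exact bytes.
roundtrip_ok = base64.b64decode(payload_b64).startswith(b"PK")
retry_on_503 = classify_http_status(503)   # transient, retry eligible
fail_on_422 = classify_http_status(422)    # permanent, surface the error
```
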

Frontend

  • DocxAnnotator (frontend/src/components/annotator/renderers/docx/DocxAnnotator.tsx): Core WASM-based DOCX renderer using convertDocxToHtmlWithExternalAnnotations() to project server-side annotations onto native DOCX HTML output. Supports text selection for new annotation creation, search highlighting, and chat source integration.
  • DocxAnnotatorWrapper (frontend/src/components/annotator/components/wrappers/DocxAnnotatorWrapper.tsx): State management wrapper mirroring TxtAnnotatorWrapper pattern to minimize parent component rerenders.
  • Document integration: Updated DocumentKnowledgeBase.tsx to detect DOCX file types and load DOCX bytes via new getDocxBytes() API helper.

Infrastructure

  • Docker Compose: Added docxodus-parser service to local.yml, test.yml, and production.yml with proper dependency ordering.
  • Environment configuration: Added DOCXODUS_PARSER_SERVICE_URL and DOCXODUS_PARSER_TIMEOUT settings.
  • File type utilities: Added isDocxFileType() helper and DOCX_MIME_TYPE constant.
  • Dependencies: Added docxodus package to frontend.

Testing

  • Unit tests (opencontractserver/tests/test_doc_parser_docxodus.py): Comprehensive test coverage for parser success cases, timeout handling, connection errors, and HTTP error responses.
  • Component tests (frontend/tests/DocxAnnotator.ct.tsx): Playwright component tests for renderer initialization and rendering.
  • Test wrapper (frontend/tests/DocxAnnotatorTestWrapper.tsx): Minimal DOCX test fixture with sample annotations and labels.

Implementation Details

  • Offset alignment guarantee: Both backend parser and frontend renderer use the same Docxodus library, ensuring character offsets in SpanAnnotation objects ({start, end}) are identical—no validation or reconciliation needed.
  • Annotation format: Uses existing SpanAnnotation format with character offsets, same as TXT documents, enabling consistent annotation handling across file types.
  • Error handling: Parser distinguishes transient errors (timeout, connection errors) from permanent failures (4xx responses) to enable appropriate retry logic.
  • No chunking: DOCX files are processed whole by the microservice (unlike PDFs which are chunked), simplifying the pipeline.
  • Embedded thumbnails: Thumbnail generator leverages DOCX ZIP structure to extract pre-built thumbnails when available, avoiding unnecessary rendering.
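Because a DOCX file is a ZIP archive, tier one of the thumbnail strategy can be sketched with the standard library alone. Word conventionally stores embedded thumbnails under docProps/ (e.g. thumbnail.jpeg); the helper name is illustrative, not the actual DocxThumbnailGenerator code:

```python
import io
import zipfile
from typing import Optional


def extract_embedded_thumbnail(docx_bytes: bytes) -> Optional[bytes]:
    """Return the embedded thumbnail if the archive contains one.

    Returning None lets the caller fall back to text-based
    rendering (tier two of the strategy).
    """
    try:
        with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
            for name in archive.namelist():
                if name.startswith("docProps/thumbnail"):
                    return archive.read(name)
    except zipfile.BadZipFile:
        return None
    return None


# Demo with an in-memory stand-in archive:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docProps/thumbnail.jpeg", b"\xff\xd8fake-jpeg")
found = extract_embedded_thumbnail(buf.getvalue())
missing = extract_embedded_thumbnail(b"not a zip at all")
```
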

claude and others added 11 commits March 14, 2026 04:13
Comprehensive plan for adding first-class DOCX support with:
- Backend DocxParser using python-docx for text extraction + structural annotations
- Frontend rendering via react-docxodus-viewer (Docxodus WASM)
- Character-offset SpanAnnotation format (same as TXT)
- Hash-based validation for offset alignment between backend and frontend
- DocxThumbnailGenerator for document cards

https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
Drop python-docx parser entirely. Use Docxodus .NET microservice as the
sole DOCX parser, mirroring the Docling pattern for PDFs. Both backend
(microservice) and frontend (WASM) use the same Docxodus library,
guaranteeing character offset alignment with zero reconciliation needed.

https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
Implements a complete parallel document ingestion pipeline and rendering
tree for DOCX files, bringing Word document support alongside existing
PDF and TXT pipelines.

Backend:
- Docxodus .NET 8 microservice (docxodus-service/) wrapping
  OpenContractExporter.Export() for structural annotation extraction
- DocxodusServiceParser REST parser with camelCase→snake_case
  normalization and transient/permanent error classification
- DocxThumbnailGenerator with embedded thumbnail extraction fallback
  to XML-based text preview rendering
- Docker Compose service definitions for local/test/production

Frontend:
- DocxAnnotator component using docxodus WASM for native DOCX→HTML
  conversion with annotation projection via
  convertDocxToHtmlWithExternalAnnotations()
- DocxAnnotatorWrapper mirroring TxtAnnotatorWrapper state management
- DocumentKnowledgeBase DOCX loading flow and renderer dispatch
- isDocxFileType utility, DOCX_MIME_TYPE constant, docxBytesAtom

Tests:
- Backend: parser unit tests (success, timeout, connection error,
  normalization) and thumbnailer tests (text preview, embedded
  thumbnail, invalid DOCX)
- Frontend: DocxAnnotator component test with docScreenshot captures

https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
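The camelCase→snake_case normalization mentioned above reduces to a single regex plus a recursive walk over the response. A generic sketch, not the parser's actual normalization code:

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert an underscore before each interior uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def normalize_keys(obj):
    """Recursively rename dict keys, mirroring the response normalization."""
    if isinstance(obj, dict):
        return {camel_to_snake(k): normalize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [normalize_keys(v) for v in obj]
    return obj


normalized = normalize_keys(
    {"annotationJson": {"parentId": 1}, "annotationType": "span"}
)
```
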
WASM initialization takes longer than standard components. Added
test.setTimeout(30_000) and a third screenshot for the error/loading
state.

https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
…tator

- Remove PLAN.md (AI planning artifact)
- Remove unused react-docxodus-viewer dependency
- Add CSS.escape() for annotation ID interpolation in style block
- Document first-occurrence text selection limitation with TODO
- Remove console.log from getDocxBytes
- Remove no-op identity mappings in _normalize_annotation/_normalize_relationship
- Add env_var for use_cloud_run_iam_auth setting
- Fix redundant exception type (BadZipFile already covered by Exception)
- Add docxodus-parser to production.yml django depends_on
- Fix black formatting and flake8 line-too-long
The docxodus-service .NET microservice has been extracted to its own
repository at Open-Source-Legal/docxodus-service with tests and CI/CD
that publishes to ghcr.io.

- Remove docxodus-service/ directory (now at github.com/Open-Source-Legal/docxodus-service)
- Update local.yml, test.yml, production.yml to use ghcr.io/open-source-legal/docxodus-service:main
Configure Playwright CT to load and run the Docxodus .NET WASM runtime,
enabling real DOCX-to-HTML conversion in component tests instead of
testing only loading/error states.

- Add setupDocxodusWasm() utility that serves WASM files via page.route()
- Set docxodus WASM base path in playwright/index.tsx hooks
- Add test.docx fixture generated with python-docx
- Update DocxAnnotatorTestWrapper to load real DOCX bytes
- Capture 3 screenshots showing actual rendered DOCX content
- Add *.wasm to Vite assetsInclude
Production fix: add annotationType and annotationJson.text fields to
buildExternalAnnotationSet — required by Docxodus WASM to project
annotation highlights onto rendered HTML.

Test fixes:
- Fix docText to match Docxodus extraction (newlines between paragraphs)
- Add second annotation (Important Clause) with correct character offset
- Screenshots now show real annotation highlights (red + teal)
- Add DOCX document mocks and DKB test scaffolding
…enshot

- Fix double-slash in DOCX file route pattern (**/URL → **URL)
- Add GET_CORPUS_VERSIONS mock to createSummaryMocks (was unmocked)
- Add duplicate mocks for refetch queries
- Screenshot shows full DocumentKnowledgeBase with DOCX viewer

codecov bot commented Mar 15, 2026

Codecov Report

❌ Patch coverage is 93.78238% with 12 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...contractserver/pipeline/parsers/docxodus_parser.py | 94.26% | 7 Missing ⚠️
...ctserver/pipeline/thumbnailers/docx_thumbnailer.py | 92.75% | 5 Missing ⚠️


…erage

- Add DOMPurify sanitization for DOCX WASM HTML output before
  dangerouslySetInnerHTML to prevent XSS from crafted documents
- Fix docxBytesAtom memory leak by clearing it when navigating to
  non-DOCX documents (PDF/text load paths)
- Add URL-keyed caching to getDocxBytes() to avoid refetching on
  Apollo query refetches
- Fix textLabels deduplication to key by label ID instead of text,
  preventing collision of labels with identical names
- Rename read_only prop to readOnly for camelCase consistency
- Make unused CRUD props (updateAnnotation, approveAnnotation,
  rejectAnnotation) optional on DocxAnnotatorProps
- Change DOCX viewer container from id=pdf-container to docx-container
- Fix document-only DOCX handler to pass docId/textHash to
  getDocumentRawText, matching the corpus+document handler
- Add DOCXODUS_PARSER_SERVICE_URL to test environment config
- Add test for HTTP 4xx permanent failure path in DocxodusServiceParser
- Update CHANGELOG to reference dompurify instead of removed
  react-docxodus-viewer
JSv4 added 10 commits March 15, 2026 16:57
…ucture for DOCX support

- Resolve ambiguous text selection in DocxAnnotator by using DOM TreeWalker
  to compute approximate character offsets, then picking the closest
  findTextOccurrences match (handles repeated phrases like "Party",
  "Agreement", boilerplate)
- Fix race condition when switching between DOCX documents by clearing
  docxBytesAtom before fetching new bytes
- Add docling-parser to production.yml depends_on (was missing vs local/test)
- Guard zero-length Uint8Array in DocxAnnotatorWrapper to show loading spinner
- Document pdf_file field reuse for DOCX storage in parser and frontend
- Clarify DOMPurify data-* attribute handling in sanitization comment
- Fix pre-existing DocxAnnotator.ct.tsx: replace broken ?url asset import
  with page.route() interceptor, remove test.describe() that broke
  Playwright CT component registration
- Add unit tests for offset resolution algorithm (11 tests)
- Add component test proving second occurrence is selected over first
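The disambiguation described above reduces to: find every occurrence of the selected text, then pick the one nearest the approximate character offset computed from the DOM. A language-agnostic sketch (function names mirror the commit message but are illustrative):

```python
def find_text_occurrences(haystack: str, needle: str) -> list[int]:
    """All start offsets of needle in haystack."""
    offsets, i = [], haystack.find(needle)
    while i != -1:
        offsets.append(i)
        i = haystack.find(needle, i + 1)
    return offsets


def pick_closest_occurrence(offsets: list[int], approx: int) -> int:
    """Choose the match nearest the approximate character offset."""
    return min(offsets, key=lambda start: abs(start - approx))


doc = "This Agreement binds each Party. Each Party shall notify the other."
hits = find_text_occurrences(doc, "Party")
# A selection near the end of the text resolves to the second "Party",
# not the first, because its start offset is closer to the DOM estimate.
chosen = pick_closest_occurrence(hits, approx=40)
```
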
…overlay

Separate expensive DOCX-to-HTML conversion from annotation projection using
the convert-once / project-many pattern. Annotation changes now re-project
in ~56ms instead of re-converting the full document (~892ms), and label
visibility toggling is instant via CSS-only rules.
- Extract getGlobalOffsetFromDomPosition to shared docxOffsetUtils.ts
  to prevent silent divergence between production code and tests
- Pin docxodus-service image to v1.0.0 in production.yml and test.yml
  for reproducible builds
- Add DOCXODUS_PARSER_TIMEOUT to test env for parity with other parsers
- Add HTTP 5xx transient error test for docxodus parser
- Add cache eviction comment to docxBytesCache
- Fix visibleLabels null-coalesce in TxtAnnotatorWrapper for consistency
  with DocxAnnotatorWrapper
…dd schema validation

- Revert visibleLabels ?? [] in TxtAnnotatorWrapper that converted null
  (meaning "show all") to empty array (meaning "show none"), breaking
  TXT annotation display
- Use consistent :main docker tag for docxodus-parser across all compose
  files (v1.0.0 tag doesn't exist on GHCR)
- Add required field validation in docxodus_parser._normalize_response
- Add LRU eviction (3-entry cap) to docxBytesCache
- Extract duplicate DOCX loading logic into loadDocxDocument helper
- Add useEffect cleanup for docxBytesAtom on unmount
- Document DOMPurify threat model and validateBeforeProjection rationale
- Use DOCX_MIME_TYPE constant in SUPPORTED_MIME_TYPES
- Fix CHANGELOG docxodus version (5.4.2 -> 5.5.0)
- Add console.warn for silent annotation creation failures
… fixes

- Render search results and chat sources in DocxAnnotator by projecting
  them as synthetic annotations with dedicated CSS styling
- Register annotation DOM refs via onAnnotationRefChange callback so
  sidebar scroll-into-view works for DOCX documents
- Replace inline disambiguation logic with pickClosestOccurrence (DRY)
- Eliminate loading flash by setting annotatedHtml = baseHtml immediately
  after DOCX conversion, before projection completes
- Move DOCX_CACHE_MAX_ENTRIES from cachedRest.ts to constants.ts
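Projecting search results as synthetic annotations, as this commit describes, amounts to mapping each hit into the same {start, end} span shape real annotations use, tagged so dedicated CSS can style it. A hedged sketch; field names beyond start/end are assumptions:

```python
def to_synthetic_annotations(doc_text: str, query: str) -> list[dict]:
    """Wrap each search hit in the span-annotation shape ({start, end})
    with a synthetic kind that drives styling, not a real label."""
    spans, i = [], doc_text.find(query)
    while i != -1:
        spans.append({
            "id": f"search-{i}",        # synthetic, never persisted
            "start": i,
            "end": i + len(query),
            "kind": "search-result",    # selects the dedicated CSS class
        })
        i = doc_text.find(query, i + 1)
    return spans


synthetic = to_synthetic_annotations("fee due; the fee is payable", "fee")
```
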
…arnings

- Wrap filteredAnnotations/filteredSearchResults in useMemo (DocxAnnotatorWrapper)
- Extract useTestDocxBytes() hook to eliminate duplicate DOCX loading in test wrappers
- Show generic user-facing error message instead of raw error strings (DocxAnnotator)
- Add sanity assertions on pinned offsets in disambiguation test for clearer failures
- Add mutable :main tag warnings to all Docker Compose files for docxodus-parser
- Add comment explaining local xml.etree.ElementTree import in docx_thumbnailer
Playwright CT's babel transform fails to replace component refs with import
refs when non-component exports (sampleDocText) are imported alongside
component exports in the same import statement. The transform only removes
the import when ALL specifiers are used in JSX; mixed imports leave both
the original import and the generated const declarations, causing the
component type to remain as a function instead of an importRef object.

Split the import so component names are on a separate statement from
non-component exports.

claude bot commented Mar 19, 2026

Code Review: First-class DOCX Support

Overall this is a well-architected addition that follows existing patterns cleanly. The convert-once/project-many rendering model is smart, the XSS protection is thoughtful, and the test coverage is thorough. A few important issues to address before merging:


Critical

1. Mutable :main Docker image tag in all three compose files

local.yml, test.yml, and production.yml all use ghcr.io/open-source-legal/docxodus-service:main. The PR itself flags this with a WARNING comment, but the fix should happen before merging — not after. The PR description explicitly ties the microservice version to the frontend WASM (docxodus@5.5.0), so a CI rebuild of :main could silently break character-offset alignment in production.

Pin to a semver tag (e.g., :v5.5.0) before merging, or open a follow-up issue with a hard deadline.


2. production.yml makes docxodus-parser (and docling-parser) hard depends_on for the django service — breaking existing deployments

Docker Compose will refuse to start the django container if either service is absent from the compose file or fails health checks. Existing deployments that don't run these services will break on the next deploy. Consider using Compose profiles so operators can opt in, or set required: false on the dependency (supported in Compose v2.20+).


Important

3. annotationLabel not normalized in _normalize_annotation

ann_mappings in docxodus_parser.py maps annotationJson, parentId, annotationType, and contentModalities to snake_case — but not annotationLabel → annotation_label. The test confirms the field stays camelCase in the output. If OpenContractDocExport's labelled text entries expect annotation_label (matching the snake_case convention used throughout the rest of the Python pipeline), this will silently produce wrong data downstream. Please verify against the OpenContractDocExport TypedDict and add the mapping if needed.


4. loadDocxDocument has an unmount race condition

If the component unmounts while these promises are in flight, all three setters fire on a dead component, and the setDocxBytes(null) cleanup in the unmount effect races against the async set. Add a cancelled flag (same pattern used in DocxAnnotator's effects): check if (cancelled) return; before calling unstable_batchedUpdates, and set cancelled = true in the cleanup function.


5. validateBeforeProjection: false removes a meaningful safety net

The guarantee only holds when both sides are pinned to the same version (see issue 1 above). If the frontend npm package is upgraded independently of the Docker image, offsets will silently diverge — and validation would catch that. Consider keeping it enabled (the ~56ms projection budget can absorb validation) and logging a warning on failure rather than disabling it entirely.


Suggestions

6. docxBytesCache uses FIFO eviction, not LRU

The cache evicts the oldest inserted entry. Users are more likely to re-open recently viewed documents. LRU is a one-liner using Map's insertion-order semantics — delete and re-insert on a cache hit.
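Python dicts preserve insertion order just like JS Map, so the suggested fix translates directly: delete and re-insert on a hit, evict the first key on overflow. A sketch of the technique, not the actual docxBytesCache:

```python
class LruBytesCache:
    """Insertion-ordered dict as an LRU cache, mirroring the Map trick."""

    def __init__(self, max_entries: int = 3):
        self.max_entries = max_entries
        self._store: dict[str, bytes] = {}

    def get(self, url: str):
        if url not in self._store:
            return None
        value = self._store.pop(url)   # delete on hit...
        self._store[url] = value       # ...re-insert, now most recent
        return value

    def put(self, url: str, value: bytes) -> None:
        self._store.pop(url, None)
        self._store[url] = value
        if len(self._store) > self.max_entries:
            # First key is the least recently used entry.
            del self._store[next(iter(self._store))]


cache = LruBytesCache(max_entries=2)
cache.put("a.docx", b"A")
cache.put("b.docx", b"B")
cache.get("a.docx")        # touch a, so b becomes least recent
cache.put("c.docx", b"C")  # evicts b, not a
```
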


7. Named export of DocxAnnotatorWrapper bypasses React.memo

Any consumer importing the named export gets the un-memoized version. Either apply React.memo at definition time, or drop the named export.


8. No file size guard before base64-encoding DOCX and POSTing to microservice

Large DOCX files (e.g. 50MB contracts with embedded images) will be encoded to ~67MB and sent over HTTP. request_timeout handles hung connections but there is no size check. Consider adding a configurable max_file_size_mb setting with a clear error message when exceeded.
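Base64 inflates payloads by 4/3 (every 3 input bytes become 4 output characters), which is how 50MB becomes ~67MB. A pre-encoding guard might look like this; max_file_size_mb and the error text are illustrative, not an existing setting:

```python
import math


def check_docx_size(docx_bytes: bytes, max_file_size_mb: int = 25) -> int:
    """Reject oversized files before paying the base64 + HTTP cost.

    Returns the projected base64 payload size in bytes.
    """
    raw = len(docx_bytes)
    if raw > max_file_size_mb * 1024 * 1024:
        raise ValueError(
            f"DOCX is {raw / 1024 / 1024:.1f} MB; "
            f"limit is {max_file_size_mb} MB"
        )
    # Base64 encodes every 3 input bytes as 4 output characters.
    return math.ceil(raw / 3) * 4


encoded_size = check_docx_size(b"x" * 3_000_000)  # 3 MB raw, 4 MB encoded
```
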


9. docxodus-service/ source not in this diff

The PR description mentions a docxodus-service/ directory but no files from it appear in the diff. If the service lives in a separate repo, please link to it and confirm that the Docxodus library version it packages matches docxodus@5.5.0 pinned in package.json.


10. Minor: duplicate import in DocxAnnotator.ct.tsx

Both DocxAnnotatorTestWrapper imports come from the same module — they can be merged into one import statement.


What's done well

  • 3-effect split (convert-once / project-many / CSS-only visibility toggle) is a clean, correct performance architecture — avoids re-running the expensive WASM conversion on annotation changes.
  • DOMPurify usage is correct: sanitize on first conversion, re-sanitize after projection with an explicit ADD_ATTR allowlist. The inline comment explaining why data-* attributes are safe is appreciated.
  • DOM-based selection disambiguation (getGlobalOffsetFromDomPosition + pickClosestOccurrence) is a thoughtful solution for the repeated-text problem, with solid unit tests including pinned-offset sanity checks in the Playwright test.
  • unstable_batchedUpdates used correctly for all multi-setter state transitions.
  • Cleanup on unmount (setDocxBytes(null)) prevents stale WASM data when navigating between documents.
  • Transient vs permanent error classification in the parser is correct: 4xx → permanent (don't retry), 5xx/network → transient (retry eligible).
  • WASM path override for Playwright CT is well-documented and cleanly isolated in a test utility.

JSv4 added 2 commits March 19, 2026 18:10
- Pin docxodus-service image to v1.0.0 across all compose files
- Make docling-parser and docxodus-parser optional in production.yml
  (required: false) to avoid breaking existing deployments
- Fix unmount race condition in loadDocxDocument with cancellation flag
- Enable validateBeforeProjection as safety net for version drift
- Change docxBytesCache from FIFO to LRU eviction
- Remove named export of DocxAnnotatorWrapper (bypassed React.memo)
- Add configurable max_file_size_mb setting to DocxodusServiceParser
- Merge duplicate import in DocxAnnotator.ct.tsx

claude bot commented Mar 19, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.

Baseline Commit Rules

  1. Always ensure all affected (or new) tests pass. Run the full backend test suite only for good reason, as it takes 30+ minutes.
  2. Always make sure TypeScript compiles and pre-commit hooks pass before committing new code.
  3. Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. This includes Co-Authored-By lines, "Generated by Claude", etc.

Essential Commands

Backend (Django)

# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb

# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope

# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope

# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db

# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb

# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope

# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb

# Apply database migrations
docker compose -f local.yml run django python manage.py migrate

# Create new migration
docker compose -f local.yml run django python manage.py makemigrations

# Django shell
docker compose -f local.yml run django python manage.py shell

# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files

Frontend (React/TypeScript)

cd frontend

# Start development server (proxies to Django on :8000)
yarn start

# Run unit tests (Vitest) - watches by default
yarn test:unit

# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list

# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"

# Run E2E tests
yarn test:e2e

# Coverage report
yarn test:coverage

# Linting and formatting
yarn lint
yarn fix-styles

# Build for production
yarn build

# Preview production build locally
yarn serve

Production Deployment

# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate

# Then start main services
docker compose -f production.yml up

High-Level Architecture

Backend Architecture

Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery

Key Patterns:

  1. GraphQL Schema Organization:

    • config/graphql/graphene_types.py - All GraphQL type definitions
    • config/graphql/queries.py - Query resolvers
    • config/graphql/*_mutations.py - Mutation files (organized by feature)
    • config/graphql/schema.py - Schema composition
  2. Permission System (CRITICAL - see docs/permissioning/consolidated_permissioning_guide.md):

    • Annotations & Relationships: NO individual permissions - inherited from document + corpus
    • Documents & Corpuses: Direct object-level permissions via django-guardian
    • Analyses & Extracts: Hybrid model (own permissions + corpus permissions + document filtering)
    • Formula: Effective Permission = MIN(document_permission, corpus_permission)
    • Structural items are ALWAYS read-only except for superusers
    • Use Model.objects.visible_to_user(user) pattern (NOT resolve_oc_model_queryset - DEPRECATED)
  3. AnnotatePermissionsForReadMixin:

    • Most GraphQL types inherit this mixin (see config/graphql/permissioning/permission_annotator/mixins.py)
    • Adds my_permissions, is_published, object_shared_with fields
    • Requires model to have guardian permission tables ({model}userobjectpermission_set)
    • Notifications use simple ownership model and DON'T use this mixin
  4. Django Signal Handlers:

    • Automatic notification creation on model changes (see opencontractserver/notifications/signals.py)
    • Must be imported in app's apps.py ready() method
    • Use _skip_signals attribute to prevent duplicate notifications in tests
  5. Pluggable Parser Pipeline:

    • Base classes in opencontractserver/pipeline/base/
    • Parsers, embedders, thumbnailers auto-discovered and registered
    • Multiple backends: Docling (ML-based), LlamaParse, Text
    • All convert to unified PAWLs format for frontend
  6. Agent Tool Architecture (see docs/architecture/llms/README.md):

    • CoreTool (framework-agnostic) → PydanticAIToolWrapper (pydantic-ai specific)
    • All production tools MUST be async (a-prefixed in core_tools.py). The wrapper supports sync functions for lightweight helpers/tests but does NOT wrap them in a thread pool — sync ORM calls will raise SynchronousOnlyOperation.
    • Tool fault tolerance (issue Agent Tools Not Fault Tolerant #820): operational exceptions are caught and returned as error strings to the LLM; security exceptions (PermissionError, ToolConfirmationRequired) propagate.
    • Pre-execution checks run on every call (not cached): permission validation, resource ID validation, approval gates.
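The effective-permission formula in pattern 2 (Effective Permission = MIN(document_permission, corpus_permission)) can be sketched over an ordered permission scale. The level names and values here are illustrative, not the actual enum:

```python
# Ordered permission levels; higher value means more access. Illustrative.
LEVELS = {"none": 0, "read": 1, "comment": 2, "edit": 3, "admin": 4}


def effective_permission(document_perm: str, corpus_perm: str) -> str:
    """Effective Permission = MIN(document_permission, corpus_permission)."""
    return min(document_perm, corpus_perm, key=LEVELS.__getitem__)


# A user with edit on the document but only read on the corpus
# is capped at read by the corpus-side permission.
capped = effective_permission("edit", "read")
```
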

Frontend Architecture

Stack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite

Key Patterns:

  1. State Management - Jotai Atoms:

    • Global state via atoms in frontend/src/atoms/ (NOT Redux/Context)
    • Key atoms: selectedCorpusIdAtom, selectedFolderIdAtom, currentThreadIdAtom
    • Derived atoms automatically update when dependencies change
    • Apollo reactive vars in frontend/src/graphql/cache.ts for UI state
    • AuthGate pattern ensures auth completes before rendering
  2. Central Routing System (see docs/frontend/routing_system.md):

    • Single source of truth: frontend/src/routing/CentralRouteManager.tsx
    • URL paths → Entity resolution via GraphQL slug queries
    • URL params ↔ Reactive vars (bidirectional sync)
    • Components consume state via reactive vars, never touch URLs directly
    • Deep linking and canonical redirects handled automatically
  3. PDF Annotation System (see .cursor/rules/pdf-viewer-and-annotator-architecture.mdc):

    • Virtualized rendering: Only visible pages (+overscan) rendered for performance
    • Binary search to find visible page range (O(log n))
    • Height caching per zoom level
    • Two-phase scroll-to-annotation system
    • Dual-layer architecture: Document layer (annotations) + Knowledge layer (summaries)
  4. Unified Filtering Architecture:

    • useVisibleAnnotations hook provides single-source-of-truth filtering for both annotations and relationship-connected annotations
    • Reads from Jotai atoms via useAnnotationDisplay, useAnnotationControls, and useAnnotationSelection (from UISettingsAtom)
    • Ensures consistency across all components
    • Forced visibility for selected items and their relationship connections
  5. Component Testing (see .cursor/rules/test-document-knowledge-base.mdc):

    • ALWAYS mount components through test wrappers (e.g., DocumentKnowledgeBaseTestWrapper)
    • Wrapper provides: MockedProvider + InMemoryCache + Jotai Provider + asset mocking
    • Use --reporter=list flag to prevent hanging
    • Increase timeouts (20s+) for PDF rendering in Chromium
    • GraphQL mocks must match variables EXACTLY (null vs undefined matters)
    • Mock same query multiple times for refetches
    • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
    • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)
  6. Development Server Configuration:

    • Vite dev server on :3000 proxies to Django on :8000
    • WebSocket proxy for /ws → ws://localhost:8000
    • GraphQL proxy for /graphql → http://localhost:8000
    • REST API proxy for /api → http://localhost:8000
    • Auth0 optional via REACT_APP_USE_AUTH0 environment variable

Data Flow Architecture

Document Processing:

  1. Upload → Parser Selection (Docling/LlamaParse/Text)
  2. Parser generates PAWLs JSON (tokens with bounding boxes)
  3. Text layer extracted from PAWLs
  4. Annotations created for structure (headers, sections, etc.)
  5. Relationships detected between elements
  6. Vector embeddings generated for search

GraphQL Permission Flow:

  1. Query resolver filters objects with .visible_to_user(user)
  2. GraphQL types resolve my_permissions via AnnotatePermissionsForReadMixin
  3. Frontend uses permissions to enable/disable UI features
  4. Mutations check permissions and return consistent errors to prevent IDOR

Critical Security Patterns

  1. IDOR Prevention:

    • Query by both ID AND user-owned field: Model.objects.get(pk=pk, recipient=user)
    • Return same error message whether object doesn't exist or belongs to another user
    • Prevents enumeration via timing or different error messages
  2. Permission Checks:

    • NEVER trust frontend - always check server-side
    • Use visible_to_user() manager method for querysets
    • Check user_has_permission_for_obj() for individual objects (in opencontractserver.utils.permissioning)
  3. XSS Prevention:

    • User-generated content in JSON fields must be escaped on frontend
    • GraphQL's GenericScalar handles JSON serialization safely
    • Document this requirement in resolver comments
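The IDOR rule in pattern 1 (query by ID and owning field together, return one uniform error) can be sketched without the ORM. This is a plain-Python stand-in for Model.objects.get(pk=pk, recipient=user), not Django code:

```python
class NotFound(Exception):
    """One error type whether the object is missing or owned by someone
    else, so attackers cannot enumerate IDs by error shape."""


NOTIFICATIONS = {
    1: {"pk": 1, "recipient": "alice"},
    2: {"pk": 2, "recipient": "bob"},
}


def get_notification(pk: int, user: str) -> dict:
    """Filter on BOTH the primary key and the owning field in one lookup."""
    obj = NOTIFICATIONS.get(pk)
    if obj is None or obj["recipient"] != user:
        raise NotFound("Notification matching query does not exist.")
    return obj


mine = get_notification(1, "alice")
```
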

Critical Concepts

  1. No dead code - when deprecating or replacing code, always try to fully replace older code and, once it's no longer in use, delete it and related texts.
  2. DRY - please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code.
  3. Single Responsibility Principle - Generally, ensure that each module/script has a single, cohesive purpose.
  4. No magic numbers - we have constants files in opencontractserver/constants/ (backend) and frontend/src/assets/configurations/constants.ts (frontend). Use them for any hardcoded values.
  5. Don't touch old tests without permission - if pre-existing tests fail after changes, try to identify why and present user with root cause analysis. If the test logic is correct but expectations need updating due to intentional behavior changes, document the change clearly.
  6. Utility functions belong in utility files:
    • Before writing new utilities: Check existing utility files first (frontend/src/utils/, opencontractserver/utils/)
    • When reviewing code: If you find utility functions defined inline in components/views, check if they already exist in utility files. If not, consider whether they should be extracted there for reuse.
    • Frontend utilities: frontend/src/utils/formatters.ts (formatting), frontend/src/utils/files.ts (file operations), etc.
    • Backend utilities: opencontractserver/utils/ contains permissioning, PDF processing, and other shared utilities

Testing Patterns

Manual Test Scripts

Location: docs/test_scripts/

When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under docs/test_scripts/. These scripts will later be used to build automated integration tests.

Format:

# Test: [Brief description]

## Purpose
What this test verifies.

## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")

## Steps
1. Step one with exact command
   ```bash
   docker compose -f local.yml run --rm django python manage.py shell -c "..."
  1. Step two...

Expected Results

  • What should happen after each step
  • Success criteria

Cleanup

Commands to restore original state if needed.


**When to document**:
- Migration testing (rollback, create test data, migrate forward)
- Database constraint validation
- Race condition verification
- Any manual verification requested during code review

### Backend Tests

**Location**: `opencontractserver/tests/`

**Parallel Testing** (pytest-xdist):
- Run with `-n 4` (or `-n auto`) for parallel execution across workers
- Use `--dist loadscope` to keep tests from the same class on the same worker (respects `setUpClass`)
- Each worker gets its own database (test_db_gw0, test_db_gw1, etc.)
- Use `@pytest.mark.serial` to mark tests that cannot run in parallel
- First run or after DB schema changes: add `--create-db` flag

**Patterns**:
- Use `TransactionTestCase` for tests with signals/asynchronous behavior
- Use `TestCase` for faster tests without transaction isolation
- Clear auto-created notifications when testing moderation: `Notification.objects.filter(recipient=user).delete()`
- Use `_skip_signals` attribute on instances to prevent signal handlers during fixtures

### Frontend Component Tests

**Location**: `frontend/tests/`

**Critical Requirements**:
- Mount through test wrappers that provide all required context
- GraphQL mocks must match query variables exactly
- Include mocks for empty-string variants (unexpected boot calls)
- Wait for visible evidence, not just network-idle
- Use `page.mouse` for PDF canvas interactions (NOT `locator.dragTo`)
- Add settle time after drag operations (500ms UI, 1000ms Apollo cache)

**Test Wrapper Pattern**:
```typescript
// ALWAYS use wrappers, never mount components directly
const component = await mount(
  <DocumentKnowledgeBaseTestWrapper corpusId="corpus-1" documentId="doc-1">
    <DocumentKnowledgeBase />
  </DocumentKnowledgeBaseTestWrapper>
);
```

Automated Documentation Screenshots

Location: docs/assets/images/screenshots/auto/ (output) | frontend/tests/utils/docScreenshot.ts (utility)

Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the screenshots.yml CI workflow.

How it works:

  1. Import docScreenshot from ./utils/docScreenshot in any .ct.tsx test file
  2. Call await docScreenshot(page, "area--component--state") after the component reaches the desired visual state
  3. Reference the image in markdown: ![Alt text](../assets/images/screenshots/auto/area--component--state.png)
  4. CI runs tests on every PR, captures screenshots, and auto-commits any changes

Naming convention (-- separates segments, - within words):

| Segment   | Purpose                    | Examples                                         |
| --------- | -------------------------- | ------------------------------------------------ |
| area      | Feature area               | landing, badges, corpus, versioning, annotations |
| component | Specific view or component | hero-section, celebration-modal, list-view       |
| state     | Visual state captured      | anonymous, with-data, empty, auto-award          |

At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens.

Example:

```typescript
import { docScreenshot } from "./utils/docScreenshot";

// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");
```

Rules:

  • Place docScreenshot() calls AFTER assertions that confirm the desired visual state
  • The filename IS the contract between tests and docs — keep names stable
  • Never manually edit files in docs/assets/images/screenshots/auto/ — they are overwritten by CI
  • Manually curated screenshots stay in docs/assets/images/screenshots/ (parent directory)
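
The naming convention is mechanical enough to check in code. A sketch of such a validator (illustrative only; no such helper exists in the repo):

```typescript
// Screenshot names: lowercase alphanumeric words joined by single hyphens,
// segments joined by "--", at least 2 segments (3 recommended).
const SEGMENT = "[a-z0-9]+(?:-[a-z0-9]+)*";
const NAME_RE = new RegExp(`^${SEGMENT}(?:--${SEGMENT})+$`);

function isValidDocScreenshotName(name: string): boolean {
  return NAME_RE.test(name);
}
```

For example, `isValidDocScreenshotName("badges--celebration-modal--auto-award")` passes, while a single-segment name like `landing` does not.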

Release Screenshots (Point-in-Time)

For release notes, use releaseScreenshot to capture screenshots that are locked in amber — they show the UI at a specific release and never change.

Location: docs/assets/images/screenshots/releases/{version}/ (output)

```typescript
import { releaseScreenshot } from "./utils/docScreenshot";

await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });
```

Key differences from docScreenshot:

  • Output: docs/assets/images/screenshots/releases/{version}/{name}.png
  • Write-once: If the file already exists, the function is a no-op (won't overwrite)
  • CI never touches the releases/ directory
  • Name is a simple kebab-case string (no -- segment convention)
  • Version must match v{major}.{minor}.{patch} format (with optional suffix)
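
The version-format rule might look like this as a guard (a sketch; the actual validation lives in the docScreenshot utility and its suffix grammar is an assumption here):

```typescript
// v{major}.{minor}.{patch} with an optional suffix, matching the
// "v3.0.0.b3" example above. Exact suffix rules are assumed.
const VERSION_RE = /^v\d+\.\d+\.\d+(?:[.-][0-9A-Za-z.]+)?$/;

function isValidReleaseVersion(version: string): boolean {
  return VERSION_RE.test(version);
}
```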

When to use which:

  • docScreenshot → README, quickstart, guides (always fresh)
  • releaseScreenshot → Release notes (frozen at release time)

Authenticated Playwright Testing (Live Frontend Debugging)

When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests.

Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to http://localhost:8000/graphql/ (cross-origin from the Vite dev server at localhost:5173). Since fetch defaults to credentials: 'same-origin', browser cookies aren't sent cross-origin. The workaround is to inject the session cookie into request headers via Playwright's route interception.

Step 1: Set a password for the superuser (one-time setup):

```bash
docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"
```

Step 2: Playwright script pattern:

```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Collect console messages for debugging
  const consoleMsgs = [];
  page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));

  // 1. Login to Django admin to get session cookie
  await page.goto('http://localhost:8000/admin/login/');
  await page.fill('#id_username', '<superuser-username>');
  await page.fill('#id_password', 'testpass123');
  await page.click('input[type=submit]');
  await page.waitForTimeout(2000);

  // 2. Extract the session cookie
  const cookies = await context.cookies();
  const sessionCookie = cookies.find(c => c.name === 'sessionid');

  // 3. Intercept GraphQL requests to inject the session cookie
  //    (needed because Apollo sends cross-origin requests to :8000)
  await page.route('**/graphql/**', async (route) => {
    const headers = {
      ...route.request().headers(),
      'Cookie': 'sessionid=' + sessionCookie.value,
    };
    await route.continue({ headers });
  });

  // 4. Navigate to the frontend page under test
  await page.goto('http://localhost:5173/extracts');
  await page.waitForTimeout(5000);

  // 5. Inspect results
  const bodyText = await page.textContent('body');
  console.log(bodyText);

  await browser.close();
})();
```

Run from the frontend directory (where playwright is a dependency):

```bash
cd frontend && node /path/to/script.js
```

Key details:

  • The Django admin login sets a sessionid cookie for localhost
  • page.route('**/graphql/**') intercepts Apollo's requests to localhost:8000/graphql/ and injects the cookie header
  • The AuthGate will still show anonymous state (no Auth0 session), but GraphQL queries execute as the authenticated Django user
  • This is useful for verifying backend query results, permission filtering, and data flow through the real frontend
  • For verifying just the GraphQL response without the full frontend, curl with the session cookie also works:
    ```bash
    curl -s -X POST http://localhost:8000/graphql/ \
      -H "Content-Type: application/json" \
      -H "Cookie: sessionid=<session-key>" \
      -d '{"query":"query { extracts { edges { node { id name } } } }"}' | python3 -m json.tool
    ```

Alternative — create a session programmatically (no admin login needed):

```bash
docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"
```

Then use the printed session key directly in curl or Playwright route interception.

Documentation Locations

  • Permissioning: docs/permissioning/consolidated_permissioning_guide.md
  • Frontend Routing: docs/frontend/routing_system.md
  • Frontend Auth Flow: docs/frontend/auth_flow.md
  • PDF Data Layer: docs/architecture/PDF-data-layer.md
  • PAWLs Format (v1/v2): docs/architecture/pawls-format.md
  • Parser Pipeline: docs/pipelines/pipeline_overview.md
  • LLM Framework: docs/architecture/llms/README.md
  • Collaboration System: docs/commenting_system/README.md
  • Auth Pattern (detailed): frontend/src/docs/AUTHENTICATION_PATTERN.md
  • Documentation Screenshots: docs/development/screenshots.md
  • Query Permission Patterns: docs/architecture/query_permission_patterns.md
  • OS-Legal-Style Migration Guide: docs/frontend/os-legal-style-migration-guide.md

Branch Strategy

This project follows trunk-based development:

  • Work directly on main branch or use short-lived feature branches
  • Feature branches: feature/description-issue-number
  • Merge feature branches quickly (within a day or two)
  • Commit message format: Descriptive with issue references (e.g., "Closes Epic: Notification System #562")

Changelog Maintenance

IMPORTANT: Always update CHANGELOG.md when making significant changes to the codebase.

The changelog follows Keep a Changelog format:

```markdown
## [Unreleased] - YYYY-MM-DD

### Added
- New features

### Fixed
- Bug fixes with file locations and line numbers

### Changed
- Changes to existing functionality

### Technical Details
- Implementation specifics, architectural notes
```

When to update:

  • New features or models added
  • Production code bugs fixed (document file location, line numbers, and impact)
  • Breaking changes to APIs or data models
  • Test suite fixes that reveal production issues
  • Database migrations
  • Architecture changes

What to include:

  • File paths and line numbers for code changes
  • Clear description of the issue and fix
  • Impact on system behavior
  • Migration notes if applicable

Pre-commit Hooks

Automatically run on commit:

  • black (Python formatting)
  • isort (import sorting)
  • flake8 (linting)
  • prettier (frontend formatting)
  • pyupgrade (Python syntax modernization)

Run manually: pre-commit run --all-files

Common Pitfalls

  1. Frontend tests hanging: Always use --reporter=list flag
  2. Permission N+1 queries: Use .visible_to_user() NOT individual permission checks. For Conversation list queries in GraphQL resolvers, use ConversationQueryOptimizer from opencontractserver.conversations.query_optimizer - it provides request-level caching to avoid repeated corpus/document visibility subqueries
  3. Missing GraphQL mocks: Check variables match exactly (null vs undefined matters), add duplicates for refetches
  4. Notification duplication in tests: Moderation methods auto-create ModerationAction records
  5. Structural annotation editing: Always read-only except for superusers
  6. Missing signal imports: Import signal handlers in apps.py ready() method
  7. PDF rendering slow in tests: Increase timeouts to 20s+ for Chromium
  8. Cache serialization crashes: Keep InMemoryCache definition inside wrapper, not test file
  9. Backend Tests Waiting > 10 seconds on Postgres to be Ready: Docker network issue. Fix with: docker compose -f test.yml down && docker kill $(docker ps -q) && docker compose -f test.yml down
  10. Empty lists on direct navigation: AuthGate pattern solves this (don't check auth status, it's always ready)
  11. URL desynchronization: Use CentralRouteManager, don't bypass routing system
  12. Jotai state not updating: Ensure atoms are properly imported and used with useAtom hook
  13. Writing sync agent tools: All agent tools must be async. The PydanticAIToolWrapper accepts sync functions but calls them directly (no thread pool) — sync Django ORM calls will raise SynchronousOnlyOperation. Use the a-prefixed async versions in core_tools.py.
  14. PydanticAI system_prompt silently dropped: When creating PydanticAIAgent, use instructions= NOT system_prompt=. The system_prompt parameter is only included when message_history is None, but OpenContracts' chat() persists a HUMAN message before calling pydantic-ai's run(), so history is always non-empty. This causes the system prompt to be silently dropped. See docs/architecture/llms/README.md for full details.
  15. Apollo cache keyArgs must use field argument names, not variable names: In cache.ts, relayStylePagination(["corpus", "name_Contains"]) uses the GraphQL field argument names (e.g., extracts(corpus: $id, name_Contains: $text)), NOT the variable names ($id, $text). Mismatched keyArgs silently fail to isolate cache entries, causing queries with different filters to share stale cached results.
  16. Corrupted Docker iptables chains (RARE): If you see Chain 'DOCKER-ISOLATION-STAGE-2' does not exist errors, Docker's iptables chains have been corrupted during docker cycling. Run this nuclear fix:
    sudo systemctl stop docker && sudo systemctl stop docker.socket && sudo ip link delete docker0 2>/dev/null || true && sudo iptables -t nat -F && sudo iptables -t nat -X && sudo iptables -t filter -F && sudo iptables -t filter -X 2>/dev/null || true && sudo iptables -t mangle -F && sudo iptables -t mangle -X && sudo iptables -t filter -N INPUT 2>/dev/null || true && sudo iptables -t filter -N FORWARD 2>/dev/null || true && sudo iptables -t filter -N OUTPUT 2>/dev/null || true && sudo iptables -P INPUT ACCEPT && sudo iptables -P FORWARD ACCEPT && sudo iptables -P OUTPUT ACCEPT && sudo systemctl start docker
    This completely resets Docker's networking and iptables state. Docker will recreate all required chains on startup.
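
Pitfall 15 is easiest to see side by side. A hedged sketch of the cache.ts idea, using the field and variable names from the pitfall's own example:

```typescript
import { relayStylePagination } from "@apollo/client/utilities";

// For a field queried as: extracts(corpus: $id, name_Contains: $text)
// keyArgs must name the FIELD arguments, not the variables.
const typePolicies = {
  Query: {
    fields: {
      // Correct: field argument names isolate cache entries per filter.
      extracts: relayStylePagination(["corpus", "name_Contains"]),
      // Wrong (variable names; silently shares cached results):
      // extracts: relayStylePagination(["id", "text"]),
    },
  },
};
```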

@claude

claude bot commented Mar 19, 2026

```python
#!/usr/bin/env python
"""
Test script for Django rate limiting in GraphQL endpoints.

This script tests that rate limiting is properly configured and working
for various GraphQL mutations and queries.
"""

import os
import sys

import django
from django.test import RequestFactory

# Setup Django
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.local")
django.setup()

from config.graphql.ratelimits import (  # noqa: E402
    RateLimitExceeded,
    RateLimits,
    graphql_ratelimit,
)


def test_rate_limit_decorator():
    """Test the rate limit decorator directly."""
    print("Testing rate limit decorator...")

    # Create a mock resolver function
    call_count = 0

    @graphql_ratelimit(rate="3/m", key="test")
    def mock_resolver(root, info, **kwargs):
        nonlocal call_count
        call_count += 1
        return f"Call {call_count}"

    # Create a mock request
    factory = RequestFactory()

    # Create mock info object
    class MockInfo:
        def __init__(self, request):
            self.context = request

    # Test with anonymous user
    request = factory.post("/graphql")
    request.user = None
    info = MockInfo(request)

    # Should allow first 3 calls
    for i in range(3):
        result = mock_resolver(None, info)
        print(f"  Call {i + 1}: {result}")

    # 4th call should be rate limited
    try:
        result = mock_resolver(None, info)
        print(f"  Call 4: {result} - ERROR: Should have been rate limited!")
    except RateLimitExceeded as e:
        print(f"  Call 4: Rate limited as expected - {e}")

    print("✓ Rate limit decorator test passed\n")


def test_rate_limit_configuration():
    """Test that rate limit configurations are properly loaded."""
    print("Testing rate limit configuration...")

    # Check that default rate limits are accessible
    expected_limits = [
        ("AUTH_LOGIN", "5/m"),
        ("READ_MEDIUM", "30/m"),
        ("WRITE_HEAVY", "5/m"),
        ("AI_QUERY", "20/m"),
        ("EXPORT", "5/h"),
    ]

    for attr, expected_value in expected_limits:
        actual_value = getattr(RateLimits, attr)
        print(f"  {attr}: {actual_value}")
        if actual_value != expected_value:
            print(f"    ERROR: Expected {expected_value}, got {actual_value}")
        else:
            print("    ✓ Correct")

    print("✓ Rate limit configuration test passed\n")


def test_graphql_integration():
    """Test that rate limiting is integrated with GraphQL resolvers."""
    print("Testing GraphQL integration...")

    # Import mutations and queries
    from config.graphql.mutations import CreateLabelset
    from config.graphql.queries import Query

    # Check that mutations have rate limiting decorators
    mutations_to_check = [
        ("CreateLabelset.mutate", CreateLabelset.mutate),
    ]

    for name, method in mutations_to_check:
        # Check if the method has been wrapped
        if hasattr(method, "__wrapped__"):
            print(f"  {name}: ✓ Has rate limiting")
        else:
            # The decorator might be applied differently;
            # check for rate limit attributes
            if "__closure__" in dir(method):
                print(f"  {name}: ✓ Likely has rate limiting (closure detected)")
            else:
                print(f"  {name}: ⚠ May not have rate limiting")

    # Check queries
    query_instance = Query()
    queries_to_check = [
        ("resolve_annotations", query_instance.resolve_annotations),
        ("resolve_corpus_stats", query_instance.resolve_corpus_stats),
    ]

    for name, method in queries_to_check:
        if hasattr(method, "__wrapped__"):
            print(f"  {name}: ✓ Has rate limiting")
        elif "__closure__" in dir(method):
            print(f"  {name}: ✓ Likely has rate limiting (closure detected)")
        else:
            print(f"  {name}: ⚠ May not have rate limiting")

    print("✓ GraphQL integration test completed\n")


def main():
    """Run all tests."""
    print("=" * 60)
    print("Django Rate Limiting Test Suite")
    print("=" * 60)
    print()

    try:
        test_rate_limit_configuration()
        test_rate_limit_decorator()
        test_graphql_integration()

        print("=" * 60)
        print("All tests completed successfully! ✓")
        print("=" * 60)

    except Exception as e:
        print(f"\n❌ Test failed with error: {e}")
        import traceback

        traceback.print_exc()
        sys.exit(1)


if __name__ == "__main__":
    main()
```

@claude

claude bot commented Mar 19, 2026

Code Review

This is a well-structured PR that follows existing patterns (mirroring TxtAnnotatorWrapper, reusing the convert-once/project-many split, DOMPurify sanitization). The offset disambiguation via getGlobalOffsetFromDomPosition and pickClosestOccurrence is a solid design and the unit test coverage for those utilities is good. A few things need attention before merge:

Bug: annotationLabel not normalized in _normalize_annotation

In opencontractserver/pipeline/parsers/docxodus_parser.py, the ann_mappings dict converts several camelCase fields to snake_case but is missing annotationLabel to annotation_label. The .NET microservice returns annotationLabel (camelCase), but without this mapping the key passes through unchanged. Downstream code reading OpenContractDocExport typed dicts will silently get annotationLabel instead of annotation_label, leading to either KeyErrors or annotations stored without labels. This was flagged in the prior review and appears to still be unaddressed.

Fix: Add annotationLabel: annotation_label to ann_mappings in _normalize_annotation.
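
The shape of that fix, sketched in TypeScript for illustration (the real parser is Python, and only the annotationLabel entry is confirmed by this review):

```typescript
// camelCase keys from the .NET service mapped to the snake_case keys
// OpenContractDocExport expects. Only annotationLabel is confirmed here;
// the full map lives in docxodus_parser.py.
const ANN_MAPPINGS: Record<string, string> = {
  annotationLabel: "annotation_label", // the missing entry flagged above
};

function normalizeAnnotation(
  ann: Record<string, unknown>
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(ann)) {
    // Unknown keys pass through unchanged, which is exactly how the
    // bug manifests when a mapping entry is missing.
    out[ANN_MAPPINGS[key] ?? key] = value;
  }
  return out;
}
```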

Hard dependency on docxodus-parser in dev/test but not production

local.yml and test.yml wire docxodus-parser as a hard depends_on (condition: service_started) for the django service, meaning the entire dev environment and all test runs including PDF/TXT tests fail if the container is unavailable. Production correctly uses required: false for this service. Since this is an optional microservice only exercised for DOCX documents, dev and test should also use required: false to avoid friction for contributors who do not need DOCX support.

Minor: getDocxBytes has no concurrent request deduplication

Two simultaneous fetches for the same URL will both fire independent axios.get calls. This matches the existing getPawlsLayer pattern so it is not a regression, but given DOCX files can be large (50 MB limit), it is worth a follow-up improvement (e.g. a pending-promise map keyed by URL).
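
A sketch of the suggested pending-promise map (names are illustrative; the real getDocxBytes wraps an axios call):

```typescript
// In-flight request deduplication keyed by URL: concurrent callers share
// one promise; the entry is cleared when the request settles so later
// calls fetch fresh data.
const pending = new Map<string, Promise<ArrayBuffer>>();

function dedupedFetch(
  url: string,
  fetcher: (url: string) => Promise<ArrayBuffer>
): Promise<ArrayBuffer> {
  const existing = pending.get(url);
  if (existing) return existing;
  const p = fetcher(url).finally(() => pending.delete(url));
  pending.set(url, p);
  return p;
}
```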

Minor: PDFContainer reused for DOCX viewer

DocumentKnowledgeBase.tsx wraps the DOCX branch in PDFContainer id=docx-container. Functionally correct, but semantically confusing. If this is a generic scroll container, aliasing or renaming it would clarify intent. Not blocking.

Minor: Enter shortcut hint without a keyboard handler

The annotation creation menu renders a ShortcutHint Enter for Annotate Selection, but the only keydown handler present closes the menu on Escape. Enter does not trigger handleCreateAnnotation. Either wire up Enter or remove the hint.

Tests look solid

  • getGlobalOffset.test.ts covers label-span skipping, cross-paragraph offsets, and the five-occurrence disambiguation scenario well.
  • DocxAnnotator.ct.tsx WASM route interception pattern is clean.
  • Backend parser tests cover timeout, connection error, and normalization cases.

The main blocking item is the annotationLabel normalization bug; the local.yml/test.yml hard dependency issue is worth fixing before merge too.

@claude

claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

This is a well-structured and ambitious PR. The architecture — same library on both sides guaranteeing offset alignment, three-phase render pipeline (convert once / project many / CSS toggle), DOM-based disambiguation for repeated text — is solid. The test coverage is thorough. Here are the issues I found, from critical down to minor.


Critical: Version mismatch breaks the offset alignment guarantee

The core correctness guarantee is that the backend microservice and frontend WASM use the same Docxodus version. But they do not:

  • frontend/package.json: "docxodus": "5.5.2"
  • local.yml / production.yml / test.yml: ghcr.io/open-source-legal/docxodus-service:1.1.0-docxodus5.4.2

The Docker image tag explicitly encodes docxodus5.4.2. If character offsets differ between 5.4.2 and 5.5.2, annotations will be projected at wrong positions. The PR description and code comments both stress this alignment requirement. Either the npm package should pin to 5.4.2 or the Docker image needs a rebuild against 5.5.2.


Important: File size check happens after reading entire file into memory

In docxodus_parser.py, the code reads the full file first, then checks the size limit. For a 200 MB file with max_file_size_mb=50, 200 MB is allocated before raising. Better to check size before reading — default_storage.size() is available for both local and S3 backends.


Important: HTTP requests are not actually cancelled on component unmount

loadDocxDocument in DocumentKnowledgeBase.tsx uses a closure-based cancel flag that prevents state updates after unmount, which is correct — but the underlying axios.get in getDocxBytes continues transferring data to completion, wasting bandwidth for large DOCX files. Consider threading an AbortController.signal through getDocxBytes and calling abort() on cancel.
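
A sketch of that AbortController threading (the fetcher parameter stands in for the axios call, which accepts a signal option in recent versions):

```typescript
// The component owns an AbortController, passes its signal into the
// fetch, and calls abort() in the cleanup path, cancelling the
// underlying transfer rather than just ignoring its result.
function loadWithCancellation(
  url: string,
  fetcher: (url: string, signal: AbortSignal) => Promise<ArrayBuffer>
): { promise: Promise<ArrayBuffer>; cancel: () => void } {
  const controller = new AbortController();
  const promise = fetcher(url, controller.signal);
  return { promise, cancel: () => controller.abort() };
}
```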


Notable: CSS color value from server is interpolated without validation

In DocxAnnotator.tsx, label colors from availableLabels are interpolated directly into a style block. CSS.escape is correctly applied to label IDs, but color is used raw — e.g., the color string is stripped of its leading hash and embedded directly into the CSS rule for border-bottom. Corpus labels are only editable by users with write permission, so the blast radius is self-contained — but validating that color matches ^#?[0-9A-Fa-f]{6}$ before interpolating would eliminate the surface entirely.
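
The suggested guard, sketched (the regex is the one proposed above; the helper name and fallback color are illustrative):

```typescript
// Accept only 6-digit hex colors, with or without a leading "#", before
// interpolating into a generated CSS rule. Anything else gets a safe
// fallback instead of landing raw inside the stylesheet.
const HEX_COLOR_RE = /^#?[0-9A-Fa-f]{6}$/;

function safeCssColor(color: string, fallback = "#999999"): string {
  if (!HEX_COLOR_RE.test(color)) return fallback;
  return color.startsWith("#") ? color : `#${color}`;
}
```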


Notable: annotation_type GraphQL field — root cause not addressed

The new resolve_annotation_type resolver returns a plain string "to tolerate invalid DB values." This is reasonable, but the PR does not say how invalid values got into the DB or how many rows are affected. A data migration to clean those rows would fix the root cause. Consider a follow-up ticket.


Minor: Local import of xml.etree.ElementTree is unnecessary

In docx_thumbnailer.py, import xml.etree.ElementTree as ET is placed inside the function with a comment about avoiding loading the XML parser at module level. xml.etree.ElementTree is a standard library module always loaded early in a Django process — the optimization does not apply. Move it to top-level imports for consistency.


Minor: getSpan silently discards the selection on missing label

In DocxAnnotatorWrapper, if activeSpanLabel is set but not found in spanLabels, getSpan throws, handleCreateAnnotation catches it with console.warn, and the annotation menu closes with no user feedback. A toast or inline message would be more informative.


Things done well

  • Three-phase render pipeline (convert once / project many / CSS toggle) is well-reasoned; the performance comments (~900ms, ~56ms, instant) give future maintainers the right mental model.
  • DOM-based disambiguation (getGlobalOffsetFromDomPosition + pickClosestOccurrence) is correct and well-tested, including the "5 occurrences of Party" contract scenario.
  • LRU cache in cachedRest.ts with Map insertion-order semantics is clean and correct.
  • required: false on the docxodus-parser service in Docker Compose allows existing deployments to boot without the microservice — good progressive-enhancement default.
  • Test coverage is comprehensive: success, timeout, connection error, 4xx, 5xx, normalization, thumbnail extraction, and disambiguation all covered.
  • setDocxBytes(null) called in PDF/TXT load paths and on unmount prevents cross-document state bleed.
  • DOMPurify is correctly applied with a config that preserves docxodus formatting attributes while blocking scripts and event handlers.
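
For reference, the Map insertion-order LRU idea praised above looks roughly like this (a standalone sketch; the actual cachedRest.ts API may differ):

```typescript
// LRU cache exploiting Map's insertion-order iteration: get() re-inserts
// the key to mark it most-recently-used; when capacity is exceeded the
// first key in iteration order (the least-recently-used) is evicted.
class LruCache<K, V> {
  private map = new Map<K, V>();
  constructor(private capacity: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key) as V;
    this.map.delete(key);
    this.map.set(key, value); // move to most-recently-used position
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      // Oldest entry is first in insertion order
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }
}
```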

@claude

claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

This is a well-structured feature addition that follows existing patterns (mirrors TxtAnnotatorWrapper, reuses the parser pipeline, uses Jotai atoms). The architecture of using the same library for both backend parsing and frontend rendering is elegant. A few items warrant attention before merging.

HIGH PRIORITY

  1. Version mismatch breaks the core offset alignment guarantee

The PR's key invariant is that both backend parser and frontend WASM renderer use the same Docxodus library version, ensuring character offsets are identical. But the Docker image tag in all three compose files is 1.1.0-docxodus5.4.2 (backend on 5.4.2) while package.json pins docxodus at 5.5.2. The CHANGELOG also mentions 5.5.0 but the installed package is 5.5.2. Minor version bumps can change how Docxodus walks document nodes, causing silent annotation mis-alignment where validation passes but spans land in the wrong place.

Fix: Bump the microservice image to use docxodus 5.5.2 and update the tag accordingly, or pin the frontend to 5.4.2. The two version strings must agree exactly.

  1. DOMPurify ADD_TAGS includes "style" which enables CSS injection

SANITIZE_CONFIG passes ADD_TAGS: ["style"], meaning a malicious DOCX can inject arbitrary CSS blocks into the page DOM. Even without JavaScript, injected CSS enables clickjacking via overlay elements, hiding annotation controls via display:none, and visual spoofing of UI elements.

Fix: Remove "style" from ADD_TAGS. If style blocks are required for PaginatedDocument to render correctly, scope them via Shadow DOM or strip cross-origin selectors before injection.

MEDIUM PRIORITY

  1. Unused imports in docxodus_parser.py will fail flake8 pre-commit

"from typing import Any" and "from opencontractserver.annotations.models import LABEL_TYPES" are imported but never referenced in the file. These appear to be leftovers from an earlier normalization implementation that was removed.

  1. docxBytesAtom not cleared on document-type switch

The atom is only cleared on component unmount. If DocumentKnowledgeBase stays mounted while the user navigates from a DOCX document to a PDF or TXT document, stale DOCX bytes remain in the atom while the PDF/TXT renderer is active. The atom should also be reset in the type-switching branches of the onCompleted handlers, alongside the existing setDocxBytes(null) call at the start of loadDocxDocument.

  1. RAF-based annotation ref registration may miss virtualized pages

Effect 6 uses requestAnimationFrame after dangerouslySetInnerHTML updates to query annotation elements. If PaginatedDocument only renders visible pages, annotations on off-screen pages will not be in the DOM yet and will be silently skipped, breaking sidebar scroll-to-annotation for those annotations. Worth verifying against the TXT renderer approach.

POSITIVE NOTES

Error classification (transient vs permanent for HTTP 4xx/5xx/network errors) in DocxodusServiceParser matches the pipeline contract well. The LRU cache in cachedRest.ts using Map insertion-order semantics is correct and clean. The cancellation pattern with docxLoadCancelRef prevents stale state from in-flight loads. The required: false on Docker Compose dependencies is the right choice so existing deployments do not break. The offset disambiguation logic (pickClosestOccurrence + getGlobalOffsetFromDomPosition) is well-tested including a real-contract scenario with repeated terms. The annotation_type GraphQL resolver returning a plain string is a pragmatic fix with clear rationale.

SUMMARY: The Docxodus version mismatch and the DOMPurify style-tag injection are the two issues that should be fixed before merging. Unused imports will be caught by pre-commit hooks. The docxBytesAtom stale-state and RAF ref registration gaps are worth addressing in this PR or a fast follow-up.

scrollIntoView didn't work with nested scroll containers. Instead,
walk up from the annotation element to find the nearest scrollable
ancestor (PaginatedDocument's overflow:auto div) and scroll within
that specific container. Added small delay for PaginatedDocument
render timing.
@claude

claude bot commented Mar 21, 2026

PR Review: DOCX Support via Docxodus

This is a well-structured PR with solid architecture — the convert-once/project-many pattern, LRU caching, DOMPurify sanitization, and offset disambiguation utilities are all thoughtfully designed. The test coverage is unusually thorough for a feature this size. A few issues need addressing before merge.

Critical

Version mismatch between backend service and frontend WASM — this directly breaks the PR's core guarantee.

local.yml pulls image tag docxodus-service:1.1.0-docxodus5.4.2, but frontend/package.json specifies docxodus@5.5.2.

The PR's own description states: "Both backend parser and frontend renderer use the same Docxodus library, ensuring character offsets are identical." A 5.4.2 vs 5.5.2 version gap can produce misaligned offsets and silent annotation placement errors. Either bump the service image to docxodus5.5.2 or pin the npm package to 5.4.2.

High

DOMPurify allows style attributes — CSS injection vector

In DocxAnnotator.tsx, SANITIZE_CONFIG includes "style" in ADD_ATTR. Allowing arbitrary style attributes on all elements from DOCX-derived HTML enables CSS injection — for example, position:fixed; top:0; left:0; width:100% to overlay the UI, or background:url() for data exfiltration via crafted DOCX files. Since annotation projection already handles styling via injected style tags separately, consider whether style in ADD_ATTR is necessary, or at minimum restrict it via a DOMPurify afterSanitizeAttributes hook.

New GraphQL annotation_type field bypasses type safety without explanation

In annotation_types.py, resolve_annotation_type returns raw DB values "to tolerate invalid DB values" but doesn't explain how invalid values get there. If this is covering a data migration gap from the DOCX import path, a one-time migration to fix the invalid rows would be cleaner than a permanent workaround in the API layer. If it's intentional forward-compatibility for future label types, document that explicitly in the resolver docstring.

Medium

File size check occurs pre-base64 encoding

In docxodus_parser.py, the max_file_size_mb check measures raw DOCX bytes before base64 encoding. Base64 inflates payload size by ~33%, so a 50 MB DOCX produces a ~66 MB JSON body. The configured limit should account for this overhead, or the check should measure post-encoding size. This matters because reverse proxies (nginx, Cloud Run) often cap request body sizes well below 66 MB.
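The arithmetic behind this concern, as a sketch (function names are illustrative, not from the parser): base64 emits 4 output characters for every 3 input bytes, so the encoded payload is `ceil(n / 3) * 4` bytes, roughly a 33% inflation:

```typescript
// Base64 encodes every 3 input bytes as 4 output characters (ignoring
// optional line breaks), so encoded size is ceil(n / 3) * 4.
function base64EncodedSize(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// A guard that must hold at the proxy should compare the ENCODED size
// against the body limit, not the raw DOCX size.
function fitsRequestLimit(rawBytes: number, limitBytes: number): boolean {
  return base64EncodedSize(rawBytes) <= limitBytes;
}
```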

_normalize_response silently drops unrecognized annotation types

When annotation_type is not in valid_annotation_types, it's silently removed with pop(). A microservice API change introducing a new Docxodus label type would silently lose annotation data. At minimum, add a logger.warning when a type is dropped so schema drift is visible in logs.
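A sketch of the suggested behavior: make the drop visible with a warning instead of a silent `pop()`. The real parser is Python; this TypeScript version, with assumed type and field names, just shows the shape:

```typescript
// Illustrative set of recognized types; the real list lives in the parser.
const VALID_ANNOTATION_TYPES = new Set(["SPAN_LABEL", "TOKEN_LABEL"]);

interface RawAnnotation {
  annotation_type?: string;
  [key: string]: unknown;
}

// Filter out unrecognized annotation types, emitting a warning for each
// one dropped so schema drift from the microservice shows up in logs.
function normalizeAnnotations(
  annotations: RawAnnotation[],
  warn: (msg: string) => void
): RawAnnotation[] {
  return annotations.filter((a) => {
    if (a.annotation_type && !VALID_ANNOTATION_TYPES.has(a.annotation_type)) {
      warn(`Dropping annotation with unrecognized type: ${a.annotation_type}`);
      return false;
    }
    return true;
  });
}
```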

docxLoadCancelRef cleanup effect has confusing dependency array

In DocumentKnowledgeBase.tsx, the unmount cleanup effect has [setDocxBytes] as its dependency. This is functionally correct but looks like a missing dependency lint suppression. A short comment explaining why this is the right dependency (and why AbortController wasn't used) would help future maintainers.

Low / Nits

Hardcoded offset constants in tests: FIRST_THIS_OFFSET = 13 and SECOND_THIS_OFFSET = 57 are pinned to a specific docxodus version. The sanity-check pattern is good, but consider deriving the offsets programmatically from sampleDocText.indexOf("This"), which would survive version bumps automatically.
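A sketch of that derivation, with a hypothetical stand-in fixture for sampleDocText (the real fixture lives in the test file):

```typescript
// Hypothetical stand-in for the real test fixture text.
const sampleDocText = "A short doc. This sentence is first. And then This again.";

// Find the index of the nth (1-based) occurrence of a substring.
function nthIndexOf(haystack: string, needle: string, n: number): number {
  let idx = -1;
  for (let i = 0; i < n; i++) {
    idx = haystack.indexOf(needle, idx + 1);
    if (idx === -1) return -1;
  }
  return idx;
}

// Derived from the fixture, so they track any version-induced text changes.
const FIRST_THIS_OFFSET = nthIndexOf(sampleDocText, "This", 1);
const SECOND_THIS_OFFSET = nthIndexOf(sampleDocText, "This", 2);
```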

Missing data-testid on FileDropZone — upload modal tests changed from asserting on visible text to [data-testid="file-dropzone"], but this attribute doesn't appear to be added to FileDropZone.tsx in this PR. Verify these tests actually pass.

_normalize_response's field_mappings dict is recreated on every call — move to a module-level constant.

production.yml service image — only local.yml was visible in the diff for the service image line; confirm production.yml also uses a matching Docxodus version tag.

Summary

The architecture is sound. The version mismatch between the backend service image and frontend npm package is the blocking issue — the core offset alignment guarantee fails silently without it. The DOMPurify style attribute concern is worth resolving before merge since it affects all DOCX documents users upload. Everything else is low-risk and can be addressed in follow-up.

…oView

The previous attempts failed because PaginatedDocument hadn't finished
rendering when the scroll effect fired. Use the onPaginationComplete
callback to set paginationReady state, then gate the scroll effect on it.
scrollIntoView automatically finds the nearest scrollable ancestor
(PaginatedDocument's overflow:auto container).


claude bot commented Mar 21, 2026

PR Review Part 1 of 3 - DOCX Support via Docxodus Microservice + WASM Renderer

This is a well-structured and ambitious feature. The architecture - same library on both ends guaranteeing offset alignment, convert-once/project-many rendering pattern, and cancellation-safe async loading - is solid. Below is a focused list of items worth addressing before merge.

CRITICAL - Backend microservice and frontend npm package version mismatch: The PR description states that both the backend parser and frontend renderer use the same Docxodus library to guarantee identical character offsets, but the actual versions differ: the Docker images pin docxodus 5.4.2 while frontend/package.json uses 5.5.2. If text-extraction behavior changed between those versions (e.g. different whitespace or paragraph-separator handling), annotations created by the backend will silently appear at wrong positions in the frontend renderer. The docker-compose file even carries a comment warning that the versions must match. Please align them before merging.

SECURITY - ADD_TAGS style in DOMPurify config: In DocxAnnotator.tsx, SANITIZE_CONFIG allows style tags and style attributes through DOMPurify. Allowing injected style blocks means a maliciously crafted DOCX (or compromised microservice response) can inject arbitrary CSS into the page. While not classic XSS, this enables UI redress/clickjacking, information leakage via CSS attribute selectors, and potential data exfiltration - high-risk for a platform where untrusted documents are a primary input. Consider scoping injected styles to a Shadow DOM, or at minimum filtering url(), @import, and content: values that can trigger network requests.

JSv4 added 2 commits March 20, 2026 21:41
1. Search: useTextSearch only handled PDF (token-based search). Add
   span-based search for TXT/DOCX that produces TextSearchSpanResult
   objects with start_index/end_index, matching TxtAnnotator's pattern.

2. Annotation visibility: reset paginationReady when projection produces
   new annotatedHtml, so PaginatedDocument re-paginates and the new
   annotation becomes visible without a page refresh.

3. Labels: EnhancedLabelSelector's filteredLabelChoices didn't handle
   DOCX file type — it only checked isTextFileType (text/*) and
   isPdfFileType, so DOCX fell through to an empty array. Add
   isDocxFileType check to treat DOCX like TXT (SPAN_LABEL type,
   humanSpanLabels list).
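The span-based search described in item 1 can be sketched roughly as follows. Only the start_index/end_index shape comes from the description above; the function name, case-insensitivity, and extra `text` field are illustrative:

```typescript
// A search result as a character-offset span into the document text.
interface TextSearchSpan {
  start_index: number;
  end_index: number; // exclusive
  text: string; // the matched slice, in original casing
}

// Case-insensitive scan producing one span per non-overlapping match.
function searchTextSpans(docText: string, query: string): TextSearchSpan[] {
  if (!query) return [];
  const results: TextSearchSpan[] = [];
  const haystack = docText.toLowerCase();
  const needle = query.toLowerCase();
  let idx = haystack.indexOf(needle);
  while (idx !== -1) {
    results.push({
      start_index: idx,
      end_index: idx + needle.length,
      text: docText.slice(idx, idx + needle.length),
    });
    idx = haystack.indexOf(needle, idx + needle.length);
  }
  return results;
}
```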
DOCX documents use span-based annotations (character offsets) like TXT,
not token-based like PDF. Multiple code paths only checked isTextFileType
(which only matches text/*) and isPdfFileType, causing DOCX to fall
through to wrong defaults.

Add isSpanBasedFileType(fileType) = isTextFileType || isDocxFileType as
a single utility, then update all affected locations:

- UISettingsAtom: active label initialization now picks span labels for DOCX
- AnnotationHooks: create/update annotation uses ServerSpanAnnotation for DOCX
- EnhancedLabelSelector: label type and available labels use span labels for DOCX
- ChatMessage: chat source annotation labels use span labels for DOCX
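A sketch of the proposed utility. The MIME-type internals of the two predicates are assumptions for illustration; the composition is what the commit describes:

```typescript
// Assumed MIME type for DOCX files.
const DOCX_MIME =
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document";

// Assumed predicate internals -- the real ones live in the frontend utils.
const isTextFileType = (fileType: string) => fileType.startsWith("text/");
const isDocxFileType = (fileType: string) => fileType === DOCX_MIME;

/**
 * DOCX uses character-offset (span) annotations like TXT, not PDF tokens.
 * One shared helper instead of repeating the disjunction at every call site.
 */
const isSpanBasedFileType = (fileType: string) =>
  isTextFileType(fileType) || isDocxFileType(fileType);
```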

claude bot commented Mar 21, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.

Baseline Commit Rules

  1. Always ensure all affected (or new) tests pass. The backend test suite takes 30+ minutes, so only run it in its entirety for good reason.
  2. Always make sure TypeScript compiles and pre-commit hooks pass before committing new code.
  3. Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. This includes Co-Authored-By lines, "Generated by Claude", etc.

Essential Commands

Backend (Django)

# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb

# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope

# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope

# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db

# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb

# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope

# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb

# Apply database migrations
docker compose -f local.yml run django python manage.py migrate

# Create new migration
docker compose -f local.yml run django python manage.py makemigrations

# Django shell
docker compose -f local.yml run django python manage.py shell

# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files

Frontend (React/TypeScript)

cd frontend

# Start development server (proxies to Django on :8000)
yarn start

# Run unit tests (Vitest) - watches by default
yarn test:unit

# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list

# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"

# Run E2E tests
yarn test:e2e

# Coverage report
yarn test:coverage

# Linting and formatting
yarn lint
yarn fix-styles

# Build for production
yarn build

# Preview production build locally
yarn serve

Production Deployment

# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate

# Then start main services
docker compose -f production.yml up

High-Level Architecture

Backend Architecture

Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery

Key Patterns:

  1. GraphQL Schema Organization:

    • config/graphql/graphene_types.py - All GraphQL type definitions
    • config/graphql/queries.py - Query resolvers
    • config/graphql/*_mutations.py - Mutation files (organized by feature)
    • config/graphql/schema.py - Schema composition
  2. Permission System (CRITICAL - see docs/permissioning/consolidated_permissioning_guide.md):

    • Annotations & Relationships: NO individual permissions - inherited from document + corpus
    • Documents & Corpuses: Direct object-level permissions via django-guardian
    • Analyses & Extracts: Hybrid model (own permissions + corpus permissions + document filtering)
    • Formula: Effective Permission = MIN(document_permission, corpus_permission)
    • Structural items are ALWAYS read-only except for superusers
    • Use Model.objects.visible_to_user(user) pattern (NOT resolve_oc_model_queryset - DEPRECATED)
  3. AnnotatePermissionsForReadMixin:

    • Most GraphQL types inherit this mixin (see config/graphql/permissioning/permission_annotator/mixins.py)
    • Adds my_permissions, is_published, object_shared_with fields
    • Requires model to have guardian permission tables ({model}userobjectpermission_set)
    • Notifications use simple ownership model and DON'T use this mixin
  4. Django Signal Handlers:

    • Automatic notification creation on model changes (see opencontractserver/notifications/signals.py)
    • Must be imported in app's apps.py ready() method
    • Use _skip_signals attribute to prevent duplicate notifications in tests
  5. Pluggable Parser Pipeline:

    • Base classes in opencontractserver/pipeline/base/
    • Parsers, embedders, thumbnailers auto-discovered and registered
    • Multiple backends: Docling (ML-based), LlamaParse, Text
    • All convert to unified PAWLs format for frontend
  6. Agent Tool Architecture (see docs/architecture/llms/README.md):

    • CoreTool (framework-agnostic) → PydanticAIToolWrapper (pydantic-ai specific)
    • All production tools MUST be async (a-prefixed in core_tools.py). The wrapper supports sync functions for lightweight helpers/tests but does NOT wrap them in a thread pool — sync ORM calls will raise SynchronousOnlyOperation.
    • Tool fault tolerance (issue #820, "Agent Tools Not Fault Tolerant"): operational exceptions are caught and returned as error strings to the LLM; security exceptions (PermissionError, ToolConfirmationRequired) propagate.
    • Pre-execution checks run on every call (not cached): permission validation, resource ID validation, approval gates.

Frontend Architecture

Stack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite

Key Patterns:

  1. State Management - Jotai Atoms:

    • Global state via atoms in frontend/src/atoms/ (NOT Redux/Context)
    • Key atoms: selectedCorpusIdAtom, selectedFolderIdAtom, currentThreadIdAtom
    • Derived atoms automatically update when dependencies change
    • Apollo reactive vars in frontend/src/graphql/cache.ts for UI state
    • AuthGate pattern ensures auth completes before rendering
  2. Central Routing System (see docs/frontend/routing_system.md):

    • Single source of truth: frontend/src/routing/CentralRouteManager.tsx
    • URL paths → Entity resolution via GraphQL slug queries
    • URL params ↔ Reactive vars (bidirectional sync)
    • Components consume state via reactive vars, never touch URLs directly
    • Deep linking and canonical redirects handled automatically
  3. PDF Annotation System (see .cursor/rules/pdf-viewer-and-annotator-architecture.mdc):

    • Virtualized rendering: Only visible pages (+overscan) rendered for performance
    • Binary search to find visible page range (O(log n))
    • Height caching per zoom level
    • Two-phase scroll-to-annotation system
    • Dual-layer architecture: Document layer (annotations) + Knowledge layer (summaries)
  4. Unified Filtering Architecture:

    • useVisibleAnnotations hook provides single-source-of-truth filtering for both annotations and relationship-connected annotations
    • Reads from Jotai atoms via useAnnotationDisplay, useAnnotationControls, and useAnnotationSelection (from UISettingsAtom)
    • Ensures consistency across all components
    • Forced visibility for selected items and their relationship connections
  5. Component Testing (see .cursor/rules/test-document-knowledge-base.mdc):

    • ALWAYS mount components through test wrappers (e.g., DocumentKnowledgeBaseTestWrapper)
    • Wrapper provides: MockedProvider + InMemoryCache + Jotai Provider + asset mocking
    • Use --reporter=list flag to prevent hanging
    • Increase timeouts (20s+) for PDF rendering in Chromium
    • GraphQL mocks must match variables EXACTLY (null vs undefined matters)
    • Mock same query multiple times for refetches
    • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
    • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)
  6. Development Server Configuration:

    • Vite dev server on :3000 proxies to Django on :8000
    • WebSocket proxy for /ws → ws://localhost:8000
    • GraphQL proxy for /graphql → http://localhost:8000
    • REST API proxy for /api → http://localhost:8000
    • Auth0 optional via REACT_APP_USE_AUTH0 environment variable

Data Flow Architecture

Document Processing:

  1. Upload → Parser Selection (Docling/LlamaParse/Text)
  2. Parser generates PAWLs JSON (tokens with bounding boxes)
  3. Text layer extracted from PAWLs
  4. Annotations created for structure (headers, sections, etc.)
  5. Relationships detected between elements
  6. Vector embeddings generated for search

GraphQL Permission Flow:

  1. Query resolver filters objects with .visible_to_user(user)
  2. GraphQL types resolve my_permissions via AnnotatePermissionsForReadMixin
  3. Frontend uses permissions to enable/disable UI features
  4. Mutations check permissions and return consistent errors to prevent IDOR

Critical Security Patterns

  1. IDOR Prevention:

    • Query by both ID AND user-owned field: Model.objects.get(pk=pk, recipient=user)
    • Return same error message whether object doesn't exist or belongs to another user
    • Prevents enumeration via timing or different error messages
  2. Permission Checks:

    • NEVER trust frontend - always check server-side
    • Use visible_to_user() manager method for querysets
    • Check user_has_permission_for_obj() for individual objects (in opencontractserver.utils.permissioning)
  3. XSS Prevention:

    • User-generated content in JSON fields must be escaped on frontend
    • GraphQL's GenericScalar handles JSON serialization safely
    • Document this requirement in resolver comments

Critical Concepts

  1. No dead code - when deprecating or replacing code, always try to fully replace the older code and, once it's no longer in use, delete it along with any related tests and documentation references.
  2. DRY - please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code.
  3. Single Responsibility Principle - Generally, ensure that each module / script has a single purpose or related purpose.
  4. No magic numbers - we have constants files in opencontractserver/constants/ (backend) and frontend/src/assets/configurations/constants.ts (frontend). Use them for any hardcoded values.
  5. Don't touch old tests without permission - if pre-existing tests fail after changes, try to identify why and present user with root cause analysis. If the test logic is correct but expectations need updating due to intentional behavior changes, document the change clearly.
  6. Utility functions belong in utility files:
    • Before writing new utilities: Check existing utility files first (frontend/src/utils/, opencontractserver/utils/)
    • When reviewing code: If you find utility functions defined inline in components/views, check if they already exist in utility files. If not, consider whether they should be extracted there for reuse.
    • Frontend utilities: frontend/src/utils/formatters.ts (formatting), frontend/src/utils/files.ts (file operations), etc.
    • Backend utilities: opencontractserver/utils/ contains permissioning, PDF processing, and other shared utilities

Testing Patterns

Manual Test Scripts

Location: docs/test_scripts/

When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under docs/test_scripts/. These scripts will later be used to build automated integration tests.

Format:

# Test: [Brief description]

## Purpose
What this test verifies.

## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")

## Steps
1. Step one with exact command
   ```bash
   docker compose -f local.yml run --rm django python manage.py shell -c "..."
   ```
2. Step two...

## Expected Results
- What should happen after each step
- Success criteria

## Cleanup
Commands to restore original state if needed.


When to document:

  • Migration testing (rollback, create test data, migrate forward)
  • Database constraint validation
  • Race condition verification
  • Any manual verification requested during code review

Backend Tests

Location: opencontractserver/tests/

Parallel Testing (pytest-xdist):

  • Run with -n 4 (or -n auto) for parallel execution across workers
  • Use --dist loadscope to keep tests from the same class on the same worker (respects setUpClass)
  • Each worker gets its own database (test_db_gw0, test_db_gw1, etc.)
  • Use @pytest.mark.serial to mark tests that cannot run in parallel
  • First run or after DB schema changes: add --create-db flag

Patterns:

  • Use TransactionTestCase for tests with signals/asynchronous behavior
  • Use TestCase for faster tests without transaction isolation
  • Clear auto-created notifications when testing moderation: Notification.objects.filter(recipient=user).delete()
  • Use _skip_signals attribute on instances to prevent signal handlers during fixtures

Frontend Component Tests

Location: frontend/tests/

Critical Requirements:

  • Mount through test wrappers that provide all required context
  • GraphQL mocks must match query variables exactly
  • Include mocks for empty-string variants (unexpected boot calls)
  • Wait for visible evidence, not just network-idle
  • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
  • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)

Test Wrapper Pattern:

// ALWAYS use wrappers, never mount components directly
const component = await mount(
  <DocumentKnowledgeBaseTestWrapper corpusId="corpus-1" documentId="doc-1">
    <DocumentKnowledgeBase />
  </DocumentKnowledgeBaseTestWrapper>
);

Automated Documentation Screenshots

Location: docs/assets/images/screenshots/auto/ (output) | frontend/tests/utils/docScreenshot.ts (utility)

Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the screenshots.yml CI workflow.

How it works:

  1. Import docScreenshot from ./utils/docScreenshot in any .ct.tsx test file
  2. Call await docScreenshot(page, "area--component--state") after the component reaches the desired visual state
  3. Reference the image in markdown: ![Alt text](../assets/images/screenshots/auto/area--component--state.png)
  4. CI runs tests on every PR, captures screenshots, and auto-commits any changes

Naming convention (-- separates segments, - within words):

  • area: feature area (examples: landing, badges, corpus, versioning, annotations)
  • component: specific view or component (examples: hero-section, celebration-modal, list-view)
  • state: visual state captured (examples: anonymous, with-data, empty, auto-award)

At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens.

Example:

import { docScreenshot } from "./utils/docScreenshot";

// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");

Rules:

  • Place docScreenshot() calls AFTER assertions that confirm the desired visual state
  • The filename IS the contract between tests and docs — keep names stable
  • Never manually edit files in docs/assets/images/screenshots/auto/ — they are overwritten by CI
  • Manually curated screenshots stay in docs/assets/images/screenshots/ (parent directory)

Release Screenshots (Point-in-Time)

For release notes, use releaseScreenshot to capture screenshots that are locked in amber — they show the UI at a specific release and never change.

Location: docs/assets/images/screenshots/releases/{version}/ (output)

import { releaseScreenshot } from "./utils/docScreenshot";

await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });

Key differences from docScreenshot:

  • Output: docs/assets/images/screenshots/releases/{version}/{name}.png
  • Write-once: If the file already exists, the function is a no-op (won't overwrite)
  • CI never touches the releases/ directory
  • Name is a simple kebab-case string (no -- segment convention)
  • Version must match v{major}.{minor}.{patch} format (with optional suffix)

When to use which:

  • docScreenshot → README, quickstart, guides (always fresh)
  • releaseScreenshot → Release notes (frozen at release time)

Authenticated Playwright Testing (Live Frontend Debugging)

When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests.

Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to http://localhost:8000/graphql/ (cross-origin from the Vite dev server at localhost:5173). Since fetch defaults to credentials: 'same-origin', browser cookies aren't sent cross-origin. The workaround is to inject the session cookie into request headers via Playwright's route interception.

Step 1: Set a password for the superuser (one-time setup):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"

Step 2: Playwright script pattern:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Collect console messages for debugging
  const consoleMsgs = [];
  page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));

  // 1. Login to Django admin to get session cookie
  await page.goto('http://localhost:8000/admin/login/');
  await page.fill('#id_username', '<superuser-username>');
  await page.fill('#id_password', 'testpass123');
  await page.click('input[type=submit]');
  await page.waitForTimeout(2000);

  // 2. Extract the session cookie
  const cookies = await context.cookies();
  const sessionCookie = cookies.find(c => c.name === 'sessionid');

  // 3. Intercept GraphQL requests to inject the session cookie
  //    (needed because Apollo sends cross-origin requests to :8000)
  await page.route('**/graphql/**', async (route) => {
    const headers = {
      ...route.request().headers(),
      'Cookie': 'sessionid=' + sessionCookie.value,
    };
    await route.continue({ headers });
  });

  // 4. Navigate to the frontend page under test
  await page.goto('http://localhost:5173/extracts');
  await page.waitForTimeout(5000);

  // 5. Inspect results
  const bodyText = await page.textContent('body');
  console.log(bodyText);

  await browser.close();
})();

Run from the frontend directory (where playwright is a dependency):

cd frontend && node /path/to/script.js

Key details:

  • The Django admin login sets a sessionid cookie for localhost
  • page.route('**/graphql/**') intercepts Apollo's requests to localhost:8000/graphql/ and injects the cookie header
  • The AuthGate will still show anonymous state (no Auth0 session), but GraphQL queries execute as the authenticated Django user
  • This is useful for verifying backend query results, permission filtering, and data flow through the real frontend
  • For verifying just the GraphQL response without the full frontend, curl with the session cookie also works:
    curl -s -X POST http://localhost:8000/graphql/ \
      -H "Content-Type: application/json" \
      -H "Cookie: sessionid=<session-key>" \
      -d '{"query":"query { extracts { edges { node { id name } } } }"}' | python3 -m json.tool

Alternative — create a session programmatically (no admin login needed):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"

Then use the printed session key directly in curl or Playwright route interception.

Documentation Locations

  • Permissioning: docs/permissioning/consolidated_permissioning_guide.md
  • Frontend Routing: docs/frontend/routing_system.md
  • Frontend Auth Flow: docs/frontend/auth_flow.md
  • PDF Data Layer: docs/architecture/PDF-data-layer.md
  • PAWLs Format (v1/v2): docs/architecture/pawls-format.md
  • Parser Pipeline: docs/pipelines/pipeline_overview.md
  • LLM Framework: docs/architecture/llms/README.md
  • Collaboration System: docs/commenting_system/README.md
  • Auth Pattern (detailed): frontend/src/docs/AUTHENTICATION_PATTERN.md
  • Documentation Screenshots: docs/development/screenshots.md
  • Query Permission Patterns: docs/architecture/query_permission_patterns.md
  • OS-Legal-Style Migration Guide: docs/frontend/os-legal-style-migration-guide.md

Branch Strategy

This project follows trunk-based development:

  • Work directly on main branch or use short-lived feature branches
  • Feature branches: feature/description-issue-number
  • Merge feature branches quickly (within a day or two)
  • Commit message format: Descriptive with issue references (e.g., "Closes Epic: Notification System #562")

Changelog Maintenance

IMPORTANT: Always update CHANGELOG.md when making significant changes to the codebase.

The changelog follows Keep a Changelog format:

## [Unreleased] - YYYY-MM-DD

### Added
- New features

### Fixed
- Bug fixes with file locations and line numbers

### Changed
- Changes to existing functionality

### Technical Details
- Implementation specifics, architectural notes

When to update:

  • New features or models added
  • Production code bugs fixed (document file location, line numbers, and impact)
  • Breaking changes to APIs or data models
  • Test suite fixes that reveal production issues
  • Database migrations
  • Architecture changes

What to include:

  • File paths and line numbers for code changes
  • Clear description of the issue and fix
  • Impact on system behavior
  • Migration notes if applicable

Pre-commit Hooks

Automatically run on commit:

  • black (Python formatting)
  • isort (import sorting)
  • flake8 (linting)
  • prettier (frontend formatting)
  • pyupgrade (Python syntax modernization)

Run manually: pre-commit run --all-files

Common Pitfalls

  1. Frontend tests hanging: Always use --reporter=list flag
  2. Permission N+1 queries: Use .visible_to_user() NOT individual permission checks. For Conversation list queries in GraphQL resolvers, use ConversationQueryOptimizer from opencontractserver.conversations.query_optimizer - it provides request-level caching to avoid repeated corpus/document visibility subqueries
  3. Missing GraphQL mocks: Check variables match exactly (null vs undefined matters), add duplicates for refetches
  4. Notification duplication in tests: Moderation methods auto-create ModerationAction records
  5. Structural annotation editing: Always read-only except for superusers
  6. Missing signal imports: Import signal handlers in apps.py ready() method
  7. PDF rendering slow in tests: Increase timeouts to 20s+ for Chromium
  8. Cache serialization crashes: Keep InMemoryCache definition inside wrapper, not test file
  9. Backend Tests Waiting > 10 seconds on Postgres to be Ready: Docker network issue. Fix with: docker compose -f test.yml down && docker kill $(docker ps -q) && docker compose -f test.yml down
  10. Empty lists on direct navigation: AuthGate pattern solves this (don't check auth status, it's always ready)
  11. URL desynchronization: Use CentralRouteManager, don't bypass routing system
  12. Jotai state not updating: Ensure atoms are properly imported and used with useAtom hook
  13. Writing sync agent tools: All agent tools must be async. The PydanticAIToolWrapper accepts sync functions but calls them directly (no thread pool) — sync Django ORM calls will raise SynchronousOnlyOperation. Use the a-prefixed async versions in core_tools.py.
  14. PydanticAI system_prompt silently dropped: When creating PydanticAIAgent, use instructions= NOT system_prompt=. The system_prompt parameter is only included when message_history is None, but OpenContracts' chat() persists a HUMAN message before calling pydantic-ai's run(), so history is always non-empty. This causes the system prompt to be silently dropped. See docs/architecture/llms/README.md for full details.
  15. Apollo cache keyArgs must use field argument names, not variable names: In cache.ts, relayStylePagination(["corpus", "name_Contains"]) uses the GraphQL field argument names (e.g., extracts(corpus: $id, name_Contains: $text)), NOT the variable names ($id, $text). Mismatched keyArgs silently fail to isolate cache entries, causing queries with different filters to share stale cached results.
  16. Corrupted Docker iptables chains (RARE): If you see Chain 'DOCKER-ISOLATION-STAGE-2' does not exist errors, Docker's iptables chains have been corrupted during docker cycling. Run this nuclear fix:
    sudo systemctl stop docker && sudo systemctl stop docker.socket && sudo ip link delete docker0 2>/dev/null || true && sudo iptables -t nat -F && sudo iptables -t nat -X && sudo iptables -t filter -F && sudo iptables -t filter -X 2>/dev/null || true && sudo iptables -t mangle -F && sudo iptables -t mangle -X && sudo iptables -t filter -N INPUT 2>/dev/null || true && sudo iptables -t filter -N FORWARD 2>/dev/null || true && sudo iptables -t filter -N OUTPUT 2>/dev/null || true && sudo iptables -P INPUT ACCEPT && sudo iptables -P FORWARD ACCEPT && sudo iptables -P OUTPUT ACCEPT && sudo systemctl start docker
    This completely resets Docker's networking and iptables state. Docker will recreate all required chains on startup.

@claude

claude bot commented Mar 21, 2026

PR Review placeholder - content will follow in separate API call

@claude

claude bot commented Mar 21, 2026

PR Review: First-class DOCX Support (PR 1102)

This is a substantial and well-structured PR. The architecture delegating DOCX parsing to a .NET microservice and using the same Docxodus library in the WASM renderer is clever, and the offset-alignment guarantee is a strong design decision. The code is generally clean, follows existing patterns, and has good test coverage. A few issues worth addressing before merge:


Critical: Version Mismatch Breaks the Core Alignment Guarantee

The PR's central claim is that character offsets are guaranteed to align because both sides use the same Docxodus library version. But the two versions do not match:

  • Docker image tag: ghcr.io/open-source-legal/docxodus-service:1.1.0-docxodus5.4.2 (docxodus 5.4.2)
  • Frontend package.json: docxodus version 5.5.2

The docker-compose comments explicitly warn that the frontend WASM docxodus version and this microservice must match for character offset alignment, yet the PR immediately violates that requirement. If there were any text-extraction behavior changes between 5.4.2 and 5.5.2, annotations from the backend will silently misalign in the frontend renderer.

Action needed: Align to the same version across the image tag and package.json, or use the newer microservice image version.


hexToRgb can throw on invalid colors

In DocxAnnotator.tsx, the customCss memo calls hexToRgb(color) and immediately destructures the result. If any label has an invalid or empty hex color string and hexToRgb returns null or undefined, this crashes. Labels sourced from the DB could have any value. Add a null guard before destructuring the return value.
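A minimal sketch of the suggested guard. The `hexToRgb` implementation, label shape, and fallback color here are assumptions for illustration, not the actual DocxAnnotator.tsx code:

```typescript
// Hypothetical sketch — hexToRgb and the fallback are assumptions,
// not the actual DocxAnnotator.tsx implementation.
interface Rgb { r: number; g: number; b: number; }

function hexToRgb(hex: string): Rgb | null {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex.trim());
  if (!match) return null; // invalid or empty color string
  const value = parseInt(match[1], 16);
  return { r: (value >> 16) & 0xff, g: (value >> 8) & 0xff, b: value & 0xff };
}

const FALLBACK: Rgb = { r: 128, g: 128, b: 128 };

// Guard before destructuring so a bad DB value degrades to a fallback
// color instead of crashing the customCss memo.
function labelRgba(color: string, alpha: number): string {
  const { r, g, b } = hexToRgb(color) ?? FALLBACK;
  return `rgba(${r}, ${g}, ${b}, ${alpha})`;
}
```

The `?? FALLBACK` makes the failure mode a visibly wrong (gray) highlight rather than a component crash.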


Magic number context_length = 64 in useTextSearch

In the new span-based search branch (useTextSearch.tsx) there is a hardcoded const context_length = 64. Per project guidelines, no magic numbers should live inline. This should be a named export in frontend/src/assets/configurations/constants.ts. The PDF branch presumably uses the same value, worth consolidating.
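A sketch of what the extracted constant could look like. The constant name and consumer function here are hypothetical; only the target file (`frontend/src/assets/configurations/constants.ts`) comes from the project guidelines:

```typescript
// Hypothetical names — shows the shape of moving the magic number into
// the shared constants file, not the PR's actual code.
export const SEARCH_CONTEXT_LENGTH = 64; // chars of context around a match

// Consumer sketch: slice context around a hit instead of inlining 64.
export function contextAround(text: string, start: number, end: number): string {
  const from = Math.max(0, start - SEARCH_CONTEXT_LENGTH);
  const to = Math.min(text.length, end + SEARCH_CONTEXT_LENGTH);
  return text.slice(from, to);
}
```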


annotation_type GraphQL workaround hides data quality issues

The resolver returns an empty string for invalid DB values. While pragmatic, this makes it harder to detect corrupted data at the API layer. Consider at minimum logging a warning when an invalid value is encountered so it surfaces in monitoring. A follow-up migration to clean existing invalid values would be more robust.


SANITIZE_CONFIG allows style tags: document the risk

Re-enabling style tags is unusual in DOMPurify configs. The comment explains it is needed for docxodus formatting, but it opens the door to CSS-based data exfiltration (e.g., CSS history sniffing, font-face-based SSRF). DOMPurify FORCE_BODY helps contain styles to the component, but worth a comment explaining the risk assessment and why it is acceptable here (e.g., content is always server-generated, never user-typed HTML).
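One way the requested comment could look. The exact SANITIZE_CONFIG in the PR is unknown; `ADD_TAGS` and `FORCE_BODY` are real DOMPurify options, but this fragment is a sketch of the documentation pattern, not the PR's config:

```typescript
// Sketch only — the PR's actual SANITIZE_CONFIG may differ.
//
// Risk assessment: <style> is allowed ONLY because the HTML fed to DOMPurify
// is produced server-side by docxodus from the DOCX file, never typed by a
// user. CSS-based exfiltration (history sniffing, font-face SSRF) is the
// residual risk; FORCE_BODY keeps injected styles scoped to the rendered body.
export const SANITIZE_CONFIG = {
  ADD_TAGS: ["style"], // needed for docxodus-generated formatting
  FORCE_BODY: true,    // contain styles to the component subtree
};
```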


@ts-expect-error for PaginatedDocument import is tech debt

The tsconfig uses moduleResolution node, which does not support package.json exports subpaths. This works at runtime via Vite but bypasses TypeScript type checking for the PaginatedDocument component, leaving its props untyped. Consider updating tsconfig.json to moduleResolution bundler (Vite-native), which properly resolves exports subpaths. This is the recommended setting for Vite projects and would eliminate the suppression comment.


What is done well

  • Cancellation pattern in loadDocxDocument is correct: the ref-based cancel flag prevents stale state updates on fast navigation.
  • LRU cache in getDocxBytes is implemented correctly using Map insertion-order semantics.
  • DOM-based selection disambiguation (docxOffsetUtils.ts) is well-tested with clear edge cases including the 5 occurrences of Party contract scenario.
  • isSpanBasedFileType() abstraction cleanly extends the existing text-file type guards to cover DOCX without touching every callsite individually.
  • Annotation ref registration in Effect 6 correctly skips synthetic search/chat-source annotations and handles deregistration on unmount.
  • Test coverage is solid: unit tests for the backend parser (success, timeout, connection error, HTTP errors, normalization), Vitest unit tests for the offset utils, and Playwright CT tests for the WASM renderer including the disambiguation smoke test.
  • Upload modal integration (dynamic file type list from backend) is a nice bonus that makes the modal forward-compatible with future parsers.

The version mismatch is the only blocking issue. The others are improvements worth addressing but would not cause data loss.

@claude

claude bot commented Mar 21, 2026

## PR Review: First-class DOCX support with Docxodus microservice and WASM renderer

This is a test to see if HTML entities render as markdown headers.

@claude

claude bot commented Mar 21, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.

Baseline Commit Rules

  1. Always ensure all affected (or new) tests pass. The backend test suite should only be run in its entirety for good reason, as it takes 30+ minutes.
  2. Always make sure typescript compiles and pre-commits pass before committing new code.
  3. Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. This includes Co-Authored-By lines, "Generated by Claude", etc.

Essential Commands

Backend (Django)

# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb

# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope

# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope

# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db

# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb

# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope

# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb

# Apply database migrations
docker compose -f local.yml run django python manage.py migrate

# Create new migration
docker compose -f local.yml run django python manage.py makemigrations

# Django shell
docker compose -f local.yml run django python manage.py shell

# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files

Frontend (React/TypeScript)

cd frontend

# Start development server (proxies to Django on :8000)
yarn start

# Run unit tests (Vitest) - watches by default
yarn test:unit

# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list

# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"

# Run E2E tests
yarn test:e2e

# Coverage report
yarn test:coverage

# Linting and formatting
yarn lint
yarn fix-styles

# Build for production
yarn build

# Preview production build locally
yarn serve

Production Deployment

# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate

# Then start main services
docker compose -f production.yml up

High-Level Architecture

Backend Architecture

Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery

Key Patterns:

  1. GraphQL Schema Organization:

    • config/graphql/graphene_types.py - All GraphQL type definitions
    • config/graphql/queries.py - Query resolvers
    • config/graphql/*_mutations.py - Mutation files (organized by feature)
    • config/graphql/schema.py - Schema composition
  2. Permission System (CRITICAL - see docs/permissioning/consolidated_permissioning_guide.md):

    • Annotations & Relationships: NO individual permissions - inherited from document + corpus
    • Documents & Corpuses: Direct object-level permissions via django-guardian
    • Analyses & Extracts: Hybrid model (own permissions + corpus permissions + document filtering)
    • Formula: Effective Permission = MIN(document_permission, corpus_permission)
    • Structural items are ALWAYS read-only except for superusers
    • Use Model.objects.visible_to_user(user) pattern (NOT resolve_oc_model_queryset - DEPRECATED)
  3. AnnotatePermissionsForReadMixin:

    • Most GraphQL types inherit this mixin (see config/graphql/permissioning/permission_annotator/mixins.py)
    • Adds my_permissions, is_published, object_shared_with fields
    • Requires model to have guardian permission tables ({model}userobjectpermission_set)
    • Notifications use simple ownership model and DON'T use this mixin
  4. Django Signal Handlers:

    • Automatic notification creation on model changes (see opencontractserver/notifications/signals.py)
    • Must be imported in app's apps.py ready() method
    • Use _skip_signals attribute to prevent duplicate notifications in tests
  5. Pluggable Parser Pipeline:

    • Base classes in opencontractserver/pipeline/base/
    • Parsers, embedders, thumbnailers auto-discovered and registered
    • Multiple backends: Docling (ML-based), LlamaParse, Text
    • All convert to unified PAWLs format for frontend
  6. Agent Tool Architecture (see docs/architecture/llms/README.md):

    • CoreTool (framework-agnostic) → PydanticAIToolWrapper (pydantic-ai specific)
    • All production tools MUST be async (a-prefixed in core_tools.py). The wrapper supports sync functions for lightweight helpers/tests but does NOT wrap them in a thread pool — sync ORM calls will raise SynchronousOnlyOperation.
    • Tool fault tolerance (issue Agent Tools Not Fault Tolerant #820): operational exceptions are caught and returned as error strings to the LLM; security exceptions (PermissionError, ToolConfirmationRequired) propagate.
    • Pre-execution checks run on every call (not cached): permission validation, resource ID validation, approval gates.
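The effective-permission formula from the permission system above can be sketched in a few lines. This is an illustrative reduction (the real checks live in the Django permission layer, and the numeric level scale here is an assumption):

```typescript
// Illustrative sketch of Effective Permission = MIN(document, corpus).
// The 0-3 scale is an assumption, not the project's actual enum.
type Level = number; // 0 = none, 1 = read, 2 = write, 3 = admin (assumed)

function effectivePermission(doc: Level, corpus: Level): Level {
  // A user never gets more access to an annotation than the weaker of
  // the document-level and corpus-level grants.
  return Math.min(doc, corpus);
}
```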

Frontend Architecture

Stack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite

Key Patterns:

  1. State Management - Jotai Atoms:

    • Global state via atoms in frontend/src/atoms/ (NOT Redux/Context)
    • Key atoms: selectedCorpusIdAtom, selectedFolderIdAtom, currentThreadIdAtom
    • Derived atoms automatically update when dependencies change
    • Apollo reactive vars in frontend/src/graphql/cache.ts for UI state
    • AuthGate pattern ensures auth completes before rendering
  2. Central Routing System (see docs/frontend/routing_system.md):

    • Single source of truth: frontend/src/routing/CentralRouteManager.tsx
    • URL paths → Entity resolution via GraphQL slug queries
    • URL params ↔ Reactive vars (bidirectional sync)
    • Components consume state via reactive vars, never touch URLs directly
    • Deep linking and canonical redirects handled automatically
  3. PDF Annotation System (see .cursor/rules/pdf-viewer-and-annotator-architecture.mdc):

    • Virtualized rendering: Only visible pages (+overscan) rendered for performance
    • Binary search to find visible page range (O(log n))
    • Height caching per zoom level
    • Two-phase scroll-to-annotation system
    • Dual-layer architecture: Document layer (annotations) + Knowledge layer (summaries)
  4. Unified Filtering Architecture:

    • useVisibleAnnotations hook provides single-source-of-truth filtering for both annotations and relationship-connected annotations
    • Reads from Jotai atoms via useAnnotationDisplay, useAnnotationControls, and useAnnotationSelection (from UISettingsAtom)
    • Ensures consistency across all components
    • Forced visibility for selected items and their relationship connections
  5. Component Testing (see .cursor/rules/test-document-knowledge-base.mdc):

    • ALWAYS mount components through test wrappers (e.g., DocumentKnowledgeBaseTestWrapper)
    • Wrapper provides: MockedProvider + InMemoryCache + Jotai Provider + asset mocking
    • Use --reporter=list flag to prevent hanging
    • Increase timeouts (20s+) for PDF rendering in Chromium
    • GraphQL mocks must match variables EXACTLY (null vs undefined matters)
    • Mock same query multiple times for refetches
    • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
    • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)
  6. Development Server Configuration:

    • Vite dev server on :3000 proxies to Django on :8000
    • WebSocket proxy: /ws → ws://localhost:8000
    • GraphQL proxy: /graphql → http://localhost:8000
    • REST API proxy: /api → http://localhost:8000
    • Auth0 optional via REACT_APP_USE_AUTH0 environment variable

Data Flow Architecture

Document Processing:

  1. Upload → Parser Selection (Docling/LlamaParse/Text)
  2. Parser generates PAWLs JSON (tokens with bounding boxes)
  3. Text layer extracted from PAWLs
  4. Annotations created for structure (headers, sections, etc.)
  5. Relationships detected between elements
  6. Vector embeddings generated for search

GraphQL Permission Flow:

  1. Query resolver filters objects with .visible_to_user(user)
  2. GraphQL types resolve my_permissions via AnnotatePermissionsForReadMixin
  3. Frontend uses permissions to enable/disable UI features
  4. Mutations check permissions and return consistent errors to prevent IDOR

Critical Security Patterns

  1. IDOR Prevention:

    • Query by both ID AND user-owned field: Model.objects.get(pk=pk, recipient=user)
    • Return same error message whether object doesn't exist or belongs to another user
    • Prevents enumeration via timing or different error messages
  2. Permission Checks:

    • NEVER trust frontend - always check server-side
    • Use visible_to_user() manager method for querysets
    • Check user_has_permission_for_obj() for individual objects (in opencontractserver.utils.permissioning)
  3. XSS Prevention:

    • User-generated content in JSON fields must be escaped on frontend
    • GraphQL's GenericScalar handles JSON serialization safely
    • Document this requirement in resolver comments
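The IDOR rule above has one key property worth making concrete: a single error message regardless of whether the object is missing or belongs to someone else. This is a language-agnostic sketch (the real checks are Django-side; the model shape is hypothetical):

```typescript
// Sketch of the IDOR-safe lookup pattern. The Notification shape and
// message text are illustrative, not the project's actual code.
interface Notification { pk: number; recipient: string; }

const NOT_FOUND = "Notification not found"; // one message for both cases

function getOwnNotification(
  db: Notification[],
  pk: number,
  user: string
): Notification {
  // Query by both ID AND the user-owned field in one lookup, so
  // "doesn't exist" and "not yours" are indistinguishable to the caller.
  const obj = db.find((n) => n.pk === pk && n.recipient === user);
  if (!obj) throw new Error(NOT_FOUND);
  return obj;
}
```

Because both failure paths share one branch, response content and timing cannot be used to enumerate other users' objects.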

Critical Concepts

  1. No dead code - when deprecating or replacing code, always try to fully replace older code and, once it's no longer in use, delete it and its related tests.
  2. DRY - please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code.
  3. Single Responsibility Principle - Generally, ensure that each module / script has a single purpose or related purpose.
  4. No magic numbers - we have constants files in opencontractserver/constants/ (backend) and frontend/src/assets/configurations/constants.ts (frontend). Use them for any hardcoded values.
  5. Don't touch old tests without permission - if pre-existing tests fail after changes, try to identify why and present user with root cause analysis. If the test logic is correct but expectations need updating due to intentional behavior changes, document the change clearly.
  6. Utility functions belong in utility files:
    • Before writing new utilities: Check existing utility files first (frontend/src/utils/, opencontractserver/utils/)
    • When reviewing code: If you find utility functions defined inline in components/views, check if they already exist in utility files. If not, consider whether they should be extracted there for reuse.
    • Frontend utilities: frontend/src/utils/formatters.ts (formatting), frontend/src/utils/files.ts (file operations), etc.
    • Backend utilities: opencontractserver/utils/ contains permissioning, PDF processing, and other shared utilities

Testing Patterns

Manual Test Scripts

Location: docs/test_scripts/

When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under docs/test_scripts/. These scripts will later be used to build automated integration tests.

Format:

# Test: [Brief description]

## Purpose
What this test verifies.

## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")

## Steps
1. Step one with exact command
   ```bash
   docker compose -f local.yml run --rm django python manage.py shell -c "..."
   ```
2. Step two...

## Expected Results
- What should happen after each step
- Success criteria

## Cleanup
Commands to restore original state if needed.


**When to document**:
- Migration testing (rollback, create test data, migrate forward)
- Database constraint validation
- Race condition verification
- Any manual verification requested during code review

### Backend Tests

**Location**: `opencontractserver/tests/`

**Parallel Testing** (pytest-xdist):
- Run with `-n 4` (or `-n auto`) for parallel execution across workers
- Use `--dist loadscope` to keep tests from the same class on the same worker (respects `setUpClass`)
- Each worker gets its own database (test_db_gw0, test_db_gw1, etc.)
- Use `@pytest.mark.serial` to mark tests that cannot run in parallel
- First run or after DB schema changes: add `--create-db` flag

**Patterns**:
- Use `TransactionTestCase` for tests with signals/asynchronous behavior
- Use `TestCase` for faster tests without transaction isolation
- Clear auto-created notifications when testing moderation: `Notification.objects.filter(recipient=user).delete()`
- Use `_skip_signals` attribute on instances to prevent signal handlers during fixtures

### Frontend Component Tests

**Location**: `frontend/tests/`

**Critical Requirements**:
- Mount through test wrappers that provide all required context
- GraphQL mocks must match query variables exactly
- Include mocks for empty-string variants (unexpected boot calls)
- Wait for visible evidence, not just network-idle
- Use `page.mouse` for PDF canvas interactions (NOT `locator.dragTo`)
- Add settle time after drag operations (500ms UI, 1000ms Apollo cache)

**Test Wrapper Pattern**:
```typescript
// ALWAYS use wrappers, never mount components directly
const component = await mount(
  <DocumentKnowledgeBaseTestWrapper corpusId="corpus-1" documentId="doc-1">
    <DocumentKnowledgeBase />
  </DocumentKnowledgeBaseTestWrapper>
);
```

Automated Documentation Screenshots

Location: docs/assets/images/screenshots/auto/ (output) | frontend/tests/utils/docScreenshot.ts (utility)

Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the screenshots.yml CI workflow.

How it works:

  1. Import docScreenshot from ./utils/docScreenshot in any .ct.tsx test file
  2. Call await docScreenshot(page, "area--component--state") after the component reaches the desired visual state
  3. Reference the image in markdown: ![Alt text](../assets/images/screenshots/auto/area--component--state.png)
  4. CI runs tests on every PR, captures screenshots, and auto-commits any changes

Naming convention (-- separates segments, - within words):

| Segment | Purpose | Examples |
| --- | --- | --- |
| area | Feature area | landing, badges, corpus, versioning, annotations |
| component | Specific view or component | hero-section, celebration-modal, list-view |
| state | Visual state captured | anonymous, with-data, empty, auto-award |

At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens.
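A validator for these naming rules could look like the following. This is a hedged sketch; the real docScreenshot utility may or may not enforce the convention programmatically:

```typescript
// Sketch of the naming rules: `--` separates segments, `-` joins words
// within a segment, all lowercase alphanumeric, at least 2 segments.
function isValidScreenshotName(name: string): boolean {
  const segments = name.split("--");
  if (segments.length < 2) return false; // at least 2 segments required
  // each segment: lowercase alphanumeric words joined by single hyphens
  return segments.every((s) => /^[a-z0-9]+(-[a-z0-9]+)*$/.test(s));
}
```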

Example:

import { docScreenshot } from "./utils/docScreenshot";

// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");

Rules:

  • Place docScreenshot() calls AFTER assertions that confirm the desired visual state
  • The filename IS the contract between tests and docs — keep names stable
  • Never manually edit files in docs/assets/images/screenshots/auto/ — they are overwritten by CI
  • Manually curated screenshots stay in docs/assets/images/screenshots/ (parent directory)

Release Screenshots (Point-in-Time)

For release notes, use releaseScreenshot to capture screenshots that are locked in amber — they show the UI at a specific release and never change.

Location: docs/assets/images/screenshots/releases/{version}/ (output)

import { releaseScreenshot } from "./utils/docScreenshot";

await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });

Key differences from docScreenshot:

  • Output: docs/assets/images/screenshots/releases/{version}/{name}.png
  • Write-once: If the file already exists, the function is a no-op (won't overwrite)
  • CI never touches the releases/ directory
  • Name is a simple kebab-case string (no -- segment convention)
  • Version must match v{major}.{minor}.{patch} format (with optional suffix)
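The version format rule can be expressed as a regex. The exact pattern used by releaseScreenshot is an assumption here; this sketch just matches the documented `v{major}.{minor}.{patch}` shape with an optional suffix like the `v3.0.0.b3` example above:

```typescript
// Assumed pattern for the release-version check — the real validation
// in releaseScreenshot may differ in detail.
const RELEASE_VERSION = /^v\d+\.\d+\.\d+(\.[a-z0-9]+)?$/;

function isValidReleaseVersion(v: string): boolean {
  return RELEASE_VERSION.test(v);
}
```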

When to use which:

  • docScreenshot → README, quickstart, guides (always fresh)
  • releaseScreenshot → Release notes (frozen at release time)

Authenticated Playwright Testing (Live Frontend Debugging)

When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests.

Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to http://localhost:8000/graphql/ (cross-origin from the Vite dev server at localhost:5173). Since fetch defaults to credentials: 'same-origin', browser cookies aren't sent cross-origin. The workaround is to inject the session cookie into request headers via Playwright's route interception.

Step 1: Set a password for the superuser (one-time setup):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"

Step 2: Playwright script pattern:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Collect console messages for debugging
  const consoleMsgs = [];
  page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));

  // 1. Login to Django admin to get session cookie
  await page.goto('http://localhost:8000/admin/login/');
  await page.fill('#id_username', '<superuser-username>');
  await page.fill('#id_password', 'testpass123');
  await page.click('input[type=submit]');
  await page.waitForTimeout(2000);

  // 2. Extract the session cookie
  const cookies = await context.cookies();
  const sessionCookie = cookies.find(c => c.name === 'sessionid');

  // 3. Intercept GraphQL requests to inject the session cookie
  //    (needed because Apollo sends cross-origin requests to :8000)
  await page.route('**/graphql/**', async (route) => {
    const headers = {
      ...route.request().headers(),
      'Cookie': 'sessionid=' + sessionCookie.value,
    };
    await route.continue({ headers });
  });

  // 4. Navigate to the frontend page under test
  await page.goto('http://localhost:5173/extracts');
  await page.waitForTimeout(5000);

  // 5. Inspect results
  const bodyText = await page.textContent('body');
  console.log(bodyText);

  await browser.close();
})();

Run from the frontend directory (where playwright is a dependency):

cd frontend && node /path/to/script.js

Key details:

  • The Django admin login sets a sessionid cookie for localhost
  • page.route('**/graphql/**') intercepts Apollo's requests to localhost:8000/graphql/ and injects the cookie header
  • The AuthGate will still show anonymous state (no Auth0 session), but GraphQL queries execute as the authenticated Django user
  • This is useful for verifying backend query results, permission filtering, and data flow through the real frontend
  • For verifying just the GraphQL response without the full frontend, curl with the session cookie also works:
    curl -s -X POST http://localhost:8000/graphql/ \
      -H "Content-Type: application/json" \
      -H "Cookie: sessionid=<session-key>" \
      -d '{"query":"query { extracts { edges { node { id name } } } }"}' | python3 -m json.tool

Alternative — create a session programmatically (no admin login needed):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"

Then use the printed session key directly in curl or Playwright route interception.

Documentation Locations

  • Permissioning: docs/permissioning/consolidated_permissioning_guide.md
  • Frontend Routing: docs/frontend/routing_system.md
  • Frontend Auth Flow: docs/frontend/auth_flow.md
  • PDF Data Layer: docs/architecture/PDF-data-layer.md
  • PAWLs Format (v1/v2): docs/architecture/pawls-format.md
  • Parser Pipeline: docs/pipelines/pipeline_overview.md
  • LLM Framework: docs/architecture/llms/README.md
  • Collaboration System: docs/commenting_system/README.md
  • Auth Pattern (detailed): frontend/src/docs/AUTHENTICATION_PATTERN.md
  • Documentation Screenshots: docs/development/screenshots.md
  • Query Permission Patterns: docs/architecture/query_permission_patterns.md
  • OS-Legal-Style Migration Guide: docs/frontend/os-legal-style-migration-guide.md

Branch Strategy

This project follows trunk-based development:

  • Work directly on main branch or use short-lived feature branches
  • Feature branches: feature/description-issue-number
  • Merge feature branches quickly (within a day or two)
  • Commit message format: Descriptive with issue references (e.g., "Closes Epic: Notification System #562")

Changelog Maintenance

IMPORTANT: Always update CHANGELOG.md when making significant changes to the codebase.

The changelog follows Keep a Changelog format:

## [Unreleased] - YYYY-MM-DD

### Added
- New features

### Fixed
- Bug fixes with file locations and line numbers

### Changed
- Changes to existing functionality

### Technical Details
- Implementation specifics, architectural notes

When to update:

  • New features or models added
  • Production code bugs fixed (document file location, line numbers, and impact)
  • Breaking changes to APIs or data models
  • Test suite fixes that reveal production issues
  • Database migrations
  • Architecture changes

What to include:

  • File paths and line numbers for code changes
  • Clear description of the issue and fix
  • Impact on system behavior
  • Migration notes if applicable

Pre-commit Hooks

Automatically run on commit:

  • black (Python formatting)
  • isort (import sorting)
  • flake8 (linting)
  • prettier (frontend formatting)
  • pyupgrade (Python syntax modernization)

Run manually: pre-commit run --all-files

Common Pitfalls

  1. Frontend tests hanging: Always use --reporter=list flag
  2. Permission N+1 queries: Use .visible_to_user() NOT individual permission checks. For Conversation list queries in GraphQL resolvers, use ConversationQueryOptimizer from opencontractserver.conversations.query_optimizer - it provides request-level caching to avoid repeated corpus/document visibility subqueries
  3. Missing GraphQL mocks: Check variables match exactly (null vs undefined matters), add duplicates for refetches
  4. Notification duplication in tests: Moderation methods auto-create ModerationAction records
  5. Structural annotation editing: Always read-only except for superusers
  6. Missing signal imports: Import signal handlers in apps.py ready() method
  7. PDF rendering slow in tests: Increase timeouts to 20s+ for Chromium
  8. Cache serialization crashes: Keep InMemoryCache definition inside wrapper, not test file
  9. Backend Tests Waiting > 10 seconds on Postgres to be Ready: Docker network issue. Fix with: docker compose -f test.yml down && docker kill $(docker ps -q) && docker compose -f test.yml down
  10. Empty lists on direct navigation: AuthGate pattern solves this (don't check auth status, it's always ready)
  11. URL desynchronization: Use CentralRouteManager, don't bypass routing system
  12. Jotai state not updating: Ensure atoms are properly imported and used with useAtom hook
  13. Writing sync agent tools: All agent tools must be async. The PydanticAIToolWrapper accepts sync functions but calls them directly (no thread pool) — sync Django ORM calls will raise SynchronousOnlyOperation. Use the a-prefixed async versions in core_tools.py.
  14. PydanticAI system_prompt silently dropped: When creating PydanticAIAgent, use instructions= NOT system_prompt=. The system_prompt parameter is only included when message_history is None, but OpenContracts' chat() persists a HUMAN message before calling pydantic-ai's run(), so history is always non-empty. This causes the system prompt to be silently dropped. See docs/architecture/llms/README.md for full details.
  15. Apollo cache keyArgs must use field argument names, not variable names: In cache.ts, relayStylePagination(["corpus", "name_Contains"]) uses the GraphQL field argument names (e.g., extracts(corpus: $id, name_Contains: $text)), NOT the variable names ($id, $text). Mismatched keyArgs silently fail to isolate cache entries, causing queries with different filters to share stale cached results.
  16. Corrupted Docker iptables chains (RARE): If you see Chain 'DOCKER-ISOLATION-STAGE-2' does not exist errors, Docker's iptables chains have been corrupted during docker cycling. Run this nuclear fix:
    sudo systemctl stop docker && sudo systemctl stop docker.socket && sudo ip link delete docker0 2>/dev/null || true && sudo iptables -t nat -F && sudo iptables -t nat -X && sudo iptables -t filter -F && sudo iptables -t filter -X 2>/dev/null || true && sudo iptables -t mangle -F && sudo iptables -t mangle -X && sudo iptables -t filter -N INPUT 2>/dev/null || true && sudo iptables -t filter -N FORWARD 2>/dev/null || true && sudo iptables -t filter -N OUTPUT 2>/dev/null || true && sudo iptables -P INPUT ACCEPT && sudo iptables -P FORWARD ACCEPT && sudo iptables -P OUTPUT ACCEPT && sudo systemctl start docker
    This completely resets Docker's networking and iptables state. Docker will recreate all required chains on startup.
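The keyArgs rule in item 15 can be sketched concretely. A hedged illustration of the cache.ts type policy shape (the extracts field and argument names come from the item above; the surrounding structure is illustrative, not the actual file):

```typescript
import { relayStylePagination } from "@apollo/client/utilities";

// Field definition: extracts(corpus: $id, name_Contains: $text)
const typePolicies = {
  Query: {
    fields: {
      // Correct: keyArgs name the FIELD arguments.
      extracts: relayStylePagination(["corpus", "name_Contains"]),
      // Wrong: variable names ($id, $text). This silently fails to isolate
      // cache entries, so differently-filtered queries share stale results:
      // extracts: relayStylePagination(["id", "text"]),
    },
  },
};
```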


claude bot commented Mar 21, 2026

test

5.5.3 adds typesVersions to package.json, fixing subpath type resolution
for moduleResolution:"node". The @ts-expect-error on the docxodus/react
import is no longer needed. Also picks up pagination rendering fixes.

claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

This is a well-structured PR that adds a complete parallel pipeline for DOCX support. The architecture is sound - offset alignment via a shared library on both sides is clever, and the convert-once / project-many pattern for annotations is a good performance approach. Below are observations organized by severity.


Bugs / Correctness Issues

1. Chat source ID format mismatch in DocxAnnotatorWrapper

In DocxAnnotatorWrapper.tsx, chatSourceMatches entries use sourceId: src.id, but selectedChatSourceId is set to a template string of selectedMessageId + "." + selectedSourceIndex. These formats differ - unless src.id always equals that combined string, the active chat source highlight will never be shown as selected. Worth verifying against how TxtAnnotatorWrapper handles this same field.

2. Span-based search fallthrough in useTextSearch.tsx

The new span-based search path is in a bare else block, meaning any non-PDF document type (including unsupported or future types) falls through to span-based search. An explicit isSpanBasedFileType guard would be safer and more consistent with the rest of this PR.
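A hedged sketch of such a guard (the MIME-type strings and strategy names are illustrative; the actual helper in the PR may differ):

```typescript
// Illustrative sketch: choose the search strategy via an explicit allow-list
// instead of a bare else, so unknown/future file types opt out loudly rather
// than silently falling through to span-based search.
const SPAN_BASED_TYPES = new Set([
  "text/plain",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]);

function isSpanBasedFileType(mimeType: string): boolean {
  return SPAN_BASED_TYPES.has(mimeType);
}

function pickSearchStrategy(mimeType: string): "pdf" | "span" | "none" {
  if (mimeType === "application/pdf") return "pdf";
  if (isSpanBasedFileType(mimeType)) return "span";
  return "none"; // explicit: unsupported types get no span search
}
```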

3. Annotation projection re-runs on selectedChatSourceId change

Effect 3 (annotation projection, approximately 56ms) has selectedChatSourceId in its dependency array. This means selecting a chat source triggers a full re-projection. For TXT documents the equivalent is likely CSS-only - consider moving the selected/unselected chat source distinction to the visibility CSS layer (Effect 5) so selecting a chat source becomes a CSS toggle rather than a full re-projection.


Security Issues

4. CSS injection via label color field

In DocxAnnotator.tsx, the customCss memo interpolates label colors directly into a style string. If a label color field contains unexpected characters, this could inject arbitrary CSS rules. Label colors come from corpus admins so the risk is limited, but validating the hex value before interpolation would be more robust.

5. Microservice error response body in exception messages

In docxodus_parser.py, the first 500 chars of the microservice error response are appended to the exception message. If the microservice ever echoes back fragments of the document (e.g. in a validation error), sensitive document content could end up in logs. Logging response_text separately at DEBUG level would be safer.


Performance / Correctness

6. No microservice response validation

After response.raise_for_status(), the response JSON is presumably passed directly to the caller without checking that it conforms to OpenContractDocExport. A malformed-but-valid-JSON response from a version mismatch between the Django service and the .NET microservice could cause confusing downstream errors. A lightweight check for required top-level keys would catch version drift early.
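A minimal sketch of such a check. The key names below are assumptions for illustration; substitute the actual top-level OpenContractDocExport fields:

```python
# Hedged sketch: validate the microservice response shape before handing it
# downstream, so version drift between the Django service and the .NET side
# fails early with a clear message.
REQUIRED_EXPORT_KEYS = frozenset({"content", "labelled_text"})  # hypothetical names

def validate_export_shape(payload):
    """Raise ValueError when the response is not a plausible export object."""
    if not isinstance(payload, dict):
        raise ValueError("Docxodus response is not a JSON object")
    missing = REQUIRED_EXPORT_KEYS - payload.keys()
    if missing:
        raise ValueError(f"Docxodus response missing keys: {sorted(missing)}")
    return payload
```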


Code Quality

7. Fragile offset coupling in DocxAnnotatorTestWrapper.tsx

The sampleAnnotation2 offsets (start: 84, end: 100) are pinned to docxodus version 5.5.0 text extraction. The disambiguation test does this well - it has an explicit sanity check verifying the expected substring is at the pinned offset. Applying the same pattern to sampleAnnotation2 would make failures self-explanatory when the library version changes.

8. annotation_type resolver suggests stale DB data

The new resolve_annotation_type is described as tolerating invalid DB values. If there are known rows with invalid annotation_type values in production, a data migration would be more correct than a permanent resolver-level workaround. The resolver is reasonable as a short-term fix but worth tracking.

9. sampleDocText export from test wrapper

sampleDocText is exported from DocxAnnotatorTestWrapper.tsx and imported in DocxAnnotator.ct.tsx. Moving it to a shared fixtures file alongside the other test mocks would avoid coupling the wrapper implementation to test internals.


Minor / Nits

  • The maxHeight and maxWidth props on DocxAnnotator are accepted but DocxAnnotatorWrapper hardcodes 100% for both, making those props effectively dead from the parent. Either remove them from the interface or pass them through from the wrapper.
  • In DocxAnnotatorWrapper, getSpan throws "Selected label not found" which is caught silently with console.warn. The user gets no feedback. Consider a toast.warn to surface the error.
  • The docxodus package is pinned to exact version 5.5.3 while other packages use caret ranges. If this is intentional due to the offset-alignment guarantee, a comment explaining the pin would help future maintainers resist the urge to loosen it.

Overall this is solid work. The main items worth addressing before merge are the chat source ID mismatch (item 1), the CSS injection path (item 4), and the span-based search fallthrough (item 2). The rest are improvements rather than blockers.

JSv4 added 2 commits March 21, 2026 09:46
…wright CT

PaginatedDocument's pagination engine needs a real viewport to compute
page dimensions. In Playwright CT, text nodes render with zero bounding
rects, making mouse-drag coordinate-based tests impossible. The feature
works correctly in the live app. Mark as test.fixme with TODO to
re-enable when PaginatedDocument supports headless rendering.
When a selection spans page boundaries in PaginatedDocument, the selected
text includes page number artifacts and extra whitespace that don't exist
in docText, causing findTextOccurrences to return no matches and silently
breaking the annotation menu for all subsequent selections.

Fix by resolving both anchor and focus positions via DOM offset
computation (getGlobalOffsetFromDomPosition) which walks text nodes
and counts characters, giving exact start/end offsets in docText
regardless of page boundaries. Fall back to text matching for single-
page selections where DOM resolution fails.

Also:
- Skip page-number elements in the TreeWalker (PaginatedDocument injects
  these as visible text nodes that aren't part of docText)
- Wrap handleMouseUp in try/catch to prevent uncaught errors from
  corrupting component state and breaking subsequent interactions

claude bot commented Mar 21, 2026

test-ping-comment-will-delete


claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

Overall this is a well-structured addition. The three-pass render architecture (convert-once then project-many then CSS-only visibility toggle) is thoughtful, and the error classification (transient vs. permanent) in the parser is exactly right. Below are issues ranging from critical to minor.

CRITICAL: Version mismatch between microservice image and npm package

local.yml pins the microservice to docxodus 5.4.2 (image tag 1.1.0-docxodus5.4.2) but package.json uses docxodus 5.5.3. The entire offset-alignment guarantee rests on both sides using the same version of the Docxodus library. A version skew here will silently produce misaligned annotations in production. The image tag and npm version must match. If 5.5.3 is the intended version, the Docker image needs to be rebuilt and re-tagged, and both the lockfile and image digest should be pinned together so they cannot drift independently.
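One way to keep them pinned together is a small CI guard. A sketch, assuming the image tag keeps its current "<serviceVersion>-docxodus<libVersion>" suffix format:

```typescript
// Hypothetical CI guard: extract the library version embedded in the image
// tag and compare it to the npm docxodus version; fail the build on skew.
function imageDocxodusVersion(imageTag: string): string | null {
  const match = /docxodus(\d+\.\d+\.\d+)$/.exec(imageTag);
  return match ? match[1] : null;
}

function versionsAligned(npmVersion: string, imageTag: string): boolean {
  return imageDocxodusVersion(imageTag) === npmVersion;
}
```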

HIGH: Unsanitized label colors injected into CSS

In DocxAnnotator.tsx, the customCss memo builds CSS rules from label colors without sanitization. The color.replace strip does not sanitize special characters -- a corpus admin could set a label color to a value that breaks out of the CSS declaration via CSS injection. Add a hex validation guard (e.g. /^[0-9a-fA-F]{3,8}$/) before using the value in CSS. Additionally, hexToRgb(color) is called without null-checking the return value; if color is an empty string or non-hex, hexToRgb may throw or return unexpected values -- add a guard before destructuring.
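A sketch of the suggested guard (the fallback color is an illustrative choice):

```typescript
// Sketch: only plain hex values reach the CSS string; anything else gets a
// neutral fallback. (Note: the 3-8 length range is permissive; valid CSS hex
// colors are 3, 4, 6, or 8 digits.)
const HEX_COLOR = /^#?[0-9a-fA-F]{3,8}$/;

function safeCssColor(raw: string, fallback = "#cccccc"): string {
  if (!HEX_COLOR.test(raw)) return fallback;
  return raw.startsWith("#") ? raw : `#${raw}`;
}
```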

HIGH: handleMouseUp silently falls back to first occurrence on DOM offset failure

When the DOM lookup fails (contentEl is null or anchorOffset is null), the annotation is silently created at the first occurrence of the text rather than the one the user selected. Consider bailing out with a user-visible message rather than silently annotating the wrong span.

MEDIUM: max_file_size_mb default is a magic number

Per CLAUDE.md: no magic numbers -- use constants files. The 50 default for max_file_size_mb should be defined in opencontractserver/constants/ rather than inline in the dataclass.

MEDIUM: annotation_type GraphQL resolver returns empty string for invalid data

Returning an empty string for invalid/missing annotation types means clients cannot distinguish valid-empty from corrupted data. Returning None instead (the field is already graphene.String which allows null) is more semantically correct.

MEDIUM: docxBytesCache entries can be arbitrarily large

The LRU cache caps at 3 entries (DOCX_CACHE_MAX_ENTRIES) but places no cap on the size of the stored Uint8Array values. Three 50 MB enterprise contracts would pin 150 MB in browser memory indefinitely. Either add a byte-budget alongside the entry count, or document the intentional trade-off in the constant JSDoc.
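A hedged sketch of the byte-budget variant (class and method names are hypothetical, not the PR's docxBytesCache):

```typescript
// Sketch: an LRU keyed by document ID with BOTH an entry cap and a
// total-byte budget, so a few huge DOCX files cannot pin unbounded memory.
class ByteBudgetLru {
  private entries = new Map<string, Uint8Array>();
  private totalBytes = 0;
  private maxEntries: number;
  private maxBytes: number;

  constructor(maxEntries: number, maxBytes: number) {
    this.maxEntries = maxEntries;
    this.maxBytes = maxBytes;
  }

  get(key: string): Uint8Array | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Re-insert to mark as most recently used (Map preserves order).
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: Uint8Array): void {
    const old = this.entries.get(key);
    if (old !== undefined) {
      this.entries.delete(key);
      this.totalBytes -= old.byteLength;
    }
    this.entries.set(key, value);
    this.totalBytes += value.byteLength;
    // Evict least-recently-used entries until both budgets are satisfied.
    while (this.entries.size > this.maxEntries || this.totalBytes > this.maxBytes) {
      const oldest = this.entries.keys().next().value as string;
      this.totalBytes -= this.entries.get(oldest)!.byteLength;
      this.entries.delete(oldest);
    }
  }
}
```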

MINOR: Test fixture offsets are fragile across library version bumps

In DocxAnnotatorTestWrapper.tsx, sample annotations hardcode character offsets tied to a specific Docxodus text extraction. If the docxodus npm version changes its extraction output, these offsets will silently break. Consider adding runtime assertions that sampleDocText contains the expected text at each hardcoded offset, similar to the FIRST_THIS_OFFSET/SECOND_THIS_OFFSET sanity checks in the .fixme test.
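A sketch of such a runtime assertion; the fixture text and offsets here are illustrative stand-ins, not the real fixture:

```typescript
// Sketch: guard hardcoded fixture offsets so a docxodus version bump fails
// with a clear message instead of silently shifting annotations.
const sampleDocText = "This Agreement is made by the parties.";

function assertFixtureOffset(doc: string, start: number, expected: string): void {
  const actual = doc.slice(start, start + expected.length);
  if (actual !== expected) {
    throw new Error(
      `Fixture drift: expected "${expected}" at offset ${start}, found "${actual}". ` +
        "Did the docxodus text extraction change?"
    );
  }
}

// Sanity-check the pinned offset once at module load:
assertFixtureOffset(sampleDocText, 5, "Agreement");
```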

MINOR: Duplicate setDocxBytes(null) calls

setDocxBytes(null) is called in three separate unstable_batchedUpdates blocks (PDF onCompleted, TXT onCompleted, and the unmount effect). While correct, extracting this into a reset helper would reduce the risk of missing the reset when a future file type is added.

Positive notes

  • The cancellation pattern in loadDocxDocument via docxLoadCancelRef is correct and consistent with the rest of the codebase.
  • DOMPurify integration with a named SANITIZE_CONFIG constant is well-structured.
  • Transient vs. permanent error classification in the parser is exactly the right model for retry logic.
  • isSpanBasedFileType cleanly generalises the TXT-only paths without touching each individual call site.
  • Fetching accepted file types dynamically from the backend registry in the upload modal is cleaner than hardcoding MIME types in the frontend.


claude bot commented Mar 21, 2026

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

OpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models.

Baseline Commit Rules

  1. Always ensure all affected (or new) tests pass - the backend test suite should only be run in its entirety for good reason, as it takes 30+ minutes.
  2. Always make sure TypeScript compiles and pre-commit hooks pass before committing new code.
  3. Never credit Claude or Claude Code in commit messages, PR messages, comments, or any other artifacts. This includes Co-Authored-By lines, "Generated by Claude", etc.

Essential Commands

Backend (Django)

# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb

# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope

# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope

# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db

# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb

# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope

# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb

# Apply database migrations
docker compose -f local.yml run django python manage.py migrate

# Create new migration
docker compose -f local.yml run django python manage.py makemigrations

# Django shell
docker compose -f local.yml run django python manage.py shell

# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-files

Frontend (React/TypeScript)

cd frontend

# Start development server (proxies to Django on :8000)
yarn start

# Run unit tests (Vitest) - watches by default
yarn test:unit

# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list

# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"

# Run E2E tests
yarn test:e2e

# Coverage report
yarn test:coverage

# Linting and formatting
yarn lint
yarn fix-styles

# Build for production
yarn build

# Preview production build locally
yarn serve

Production Deployment

# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate

# Then start main services
docker compose -f production.yml up

High-Level Architecture

Backend Architecture

Stack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery

Key Patterns:

  1. GraphQL Schema Organization:

    • config/graphql/graphene_types.py - All GraphQL type definitions
    • config/graphql/queries.py - Query resolvers
    • config/graphql/*_mutations.py - Mutation files (organized by feature)
    • config/graphql/schema.py - Schema composition
  2. Permission System (CRITICAL - see docs/permissioning/consolidated_permissioning_guide.md):

    • Annotations & Relationships: NO individual permissions - inherited from document + corpus
    • Documents & Corpuses: Direct object-level permissions via django-guardian
    • Analyses & Extracts: Hybrid model (own permissions + corpus permissions + document filtering)
    • Formula: Effective Permission = MIN(document_permission, corpus_permission)
    • Structural items are ALWAYS read-only except for superusers
    • Use Model.objects.visible_to_user(user) pattern (NOT resolve_oc_model_queryset - DEPRECATED)
  3. AnnotatePermissionsForReadMixin:

    • Most GraphQL types inherit this mixin (see config/graphql/permissioning/permission_annotator/mixins.py)
    • Adds my_permissions, is_published, object_shared_with fields
    • Requires model to have guardian permission tables ({model}userobjectpermission_set)
    • Notifications use simple ownership model and DON'T use this mixin
  4. Django Signal Handlers:

    • Automatic notification creation on model changes (see opencontractserver/notifications/signals.py)
    • Must be imported in app's apps.py ready() method
    • Use _skip_signals attribute to prevent duplicate notifications in tests
  5. Pluggable Parser Pipeline:

    • Base classes in opencontractserver/pipeline/base/
    • Parsers, embedders, thumbnailers auto-discovered and registered
    • Multiple backends: Docling (ML-based), LlamaParse, Text
    • All convert to unified PAWLs format for frontend
  6. Agent Tool Architecture (see docs/architecture/llms/README.md):

    • CoreTool (framework-agnostic) → PydanticAIToolWrapper (pydantic-ai specific)
    • All production tools MUST be async (a-prefixed in core_tools.py). The wrapper supports sync functions for lightweight helpers/tests but does NOT wrap them in a thread pool — sync ORM calls will raise SynchronousOnlyOperation.
    • Tool fault tolerance (issue Agent Tools Not Fault Tolerant #820): operational exceptions are caught and returned as error strings to the LLM; security exceptions (PermissionError, ToolConfirmationRequired) propagate.
    • Pre-execution checks run on every call (not cached): permission validation, resource ID validation, approval gates.
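The sync-tool constraint in item 6 can be demonstrated without Django. A hedged sketch (the exception class and tool names are stand-ins, not the real wrapper):

```python
import asyncio

# Illustrative sketch of the failure mode: a sync tool runs directly on the
# event-loop thread, where Django's async-unsafe guard raises; an a-prefixed
# tool moves the blocking work onto a worker thread instead.
class SynchronousOnlyOperation(Exception):
    """Stand-in for django.core.exceptions.SynchronousOnlyOperation."""

def sync_orm_lookup():
    """Mimics the Django ORM: refuses to run where an event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return "row"
    raise SynchronousOnlyOperation("You cannot call this from an async context")

async def atool_lookup():
    # The a-prefixed pattern: push the blocking call onto a worker thread.
    return await asyncio.to_thread(sync_orm_lookup)

async def demo():
    try:
        sync_orm_lookup()  # what a sync tool called from the event loop does
        direct = "no error"
    except SynchronousOnlyOperation:
        direct = "raised"
    return direct, await atool_lookup()
```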

Frontend Architecture

Stack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite

Key Patterns:

  1. State Management - Jotai Atoms:

    • Global state via atoms in frontend/src/atoms/ (NOT Redux/Context)
    • Key atoms: selectedCorpusIdAtom, selectedFolderIdAtom, currentThreadIdAtom
    • Derived atoms automatically update when dependencies change
    • Apollo reactive vars in frontend/src/graphql/cache.ts for UI state
    • AuthGate pattern ensures auth completes before rendering
  2. Central Routing System (see docs/frontend/routing_system.md):

    • Single source of truth: frontend/src/routing/CentralRouteManager.tsx
    • URL paths → Entity resolution via GraphQL slug queries
    • URL params ↔ Reactive vars (bidirectional sync)
    • Components consume state via reactive vars, never touch URLs directly
    • Deep linking and canonical redirects handled automatically
  3. PDF Annotation System (see .cursor/rules/pdf-viewer-and-annotator-architecture.mdc):

    • Virtualized rendering: Only visible pages (+overscan) rendered for performance
    • Binary search to find visible page range (O(log n))
    • Height caching per zoom level
    • Two-phase scroll-to-annotation system
    • Dual-layer architecture: Document layer (annotations) + Knowledge layer (summaries)
  4. Unified Filtering Architecture:

    • useVisibleAnnotations hook provides single-source-of-truth filtering for both annotations and relationship-connected annotations
    • Reads from Jotai atoms via useAnnotationDisplay, useAnnotationControls, and useAnnotationSelection (from UISettingsAtom)
    • Ensures consistency across all components
    • Forced visibility for selected items and their relationship connections
  5. Component Testing (see .cursor/rules/test-document-knowledge-base.mdc):

    • ALWAYS mount components through test wrappers (e.g., DocumentKnowledgeBaseTestWrapper)
    • Wrapper provides: MockedProvider + InMemoryCache + Jotai Provider + asset mocking
    • Use --reporter=list flag to prevent hanging
    • Increase timeouts (20s+) for PDF rendering in Chromium
    • GraphQL mocks must match variables EXACTLY (null vs undefined matters)
    • Mock same query multiple times for refetches
    • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
    • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)
  6. Development Server Configuration:

    • Vite dev server on :3000 proxies to Django on :8000
    • WebSocket proxy: /ws → ws://localhost:8000
    • GraphQL proxy: /graphql → http://localhost:8000
    • REST API proxy: /api → http://localhost:8000
    • Auth0 optional via REACT_APP_USE_AUTH0 environment variable

Data Flow Architecture

Document Processing:

  1. Upload → Parser Selection (Docling/LlamaParse/Text)
  2. Parser generates PAWLs JSON (tokens with bounding boxes)
  3. Text layer extracted from PAWLs
  4. Annotations created for structure (headers, sections, etc.)
  5. Relationships detected between elements
  6. Vector embeddings generated for search

GraphQL Permission Flow:

  1. Query resolver filters objects with .visible_to_user(user)
  2. GraphQL types resolve my_permissions via AnnotatePermissionsForReadMixin
  3. Frontend uses permissions to enable/disable UI features
  4. Mutations check permissions and return consistent errors to prevent IDOR

Critical Security Patterns

  1. IDOR Prevention:

    • Query by both ID AND user-owned field: Model.objects.get(pk=pk, recipient=user)
    • Return same error message whether object doesn't exist or belongs to another user
    • Prevents enumeration via timing or different error messages
  2. Permission Checks:

    • NEVER trust frontend - always check server-side
    • Use visible_to_user() manager method for querysets
    • Check user_has_permission_for_obj() for individual objects (in opencontractserver.utils.permissioning)
  3. XSS Prevention:

    • User-generated content in JSON fields must be escaped on frontend
    • GraphQL's GenericScalar handles JSON serialization safely
    • Document this requirement in resolver comments
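The IDOR-prevention lookup in item 1 can be sketched with an in-memory stand-in for the ORM (all names hypothetical):

```python
# Hedged sketch: query by pk AND ownership, and raise ONE uniform error
# whether the row is missing or belongs to someone else, so attackers
# cannot enumerate IDs via differing error messages.
class NotFound(Exception):
    pass

UNIFORM_ERROR = "Notification matching query does not exist."

def get_owned_notification(rows, pk, user):
    record = rows.get(pk)
    if record is None or record["recipient"] != user:
        raise NotFound(UNIFORM_ERROR)  # identical message in both cases
    return record
```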

Critical Concepts

  1. No dead code - when deprecating or replacing code, always try to fully replace the older code and, once it's no longer in use, delete it along with related tests and docs.
  2. DRY - please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code.
  3. Single Responsibility Principle - ensure each module/script serves a single purpose or a set of closely related purposes.
  4. No magic numbers - we have constants files in opencontractserver/constants/ (backend) and frontend/src/assets/configurations/constants.ts (frontend). Use them for any hardcoded values.
  5. Don't touch old tests without permission - if pre-existing tests fail after changes, try to identify why and present user with root cause analysis. If the test logic is correct but expectations need updating due to intentional behavior changes, document the change clearly.
  6. Utility functions belong in utility files:
    • Before writing new utilities: Check existing utility files first (frontend/src/utils/, opencontractserver/utils/)
    • When reviewing code: If you find utility functions defined inline in components/views, check if they already exist in utility files. If not, consider whether they should be extracted there for reuse.
    • Frontend utilities: frontend/src/utils/formatters.ts (formatting), frontend/src/utils/files.ts (file operations), etc.
    • Backend utilities: opencontractserver/utils/ contains permissioning, PDF processing, and other shared utilities

Testing Patterns

Manual Test Scripts

Location: docs/test_scripts/

When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under docs/test_scripts/. These scripts will later be used to build automated integration tests.

Format:

# Test: [Brief description]

## Purpose
What this test verifies.

## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")

## Steps
1. Step one with exact command
   ```bash
   docker compose -f local.yml run --rm django python manage.py shell -c "..."
   ```
2. Step two...

## Expected Results

- What should happen after each step
- Success criteria

## Cleanup

Commands to restore original state if needed.


When to document:

  • Migration testing (rollback, create test data, migrate forward)
  • Database constraint validation
  • Race condition verification
  • Any manual verification requested during code review

Backend Tests

Location: opencontractserver/tests/

Parallel Testing (pytest-xdist):

  • Run with -n 4 (or -n auto) for parallel execution across workers
  • Use --dist loadscope to keep tests from the same class on the same worker (respects setUpClass)
  • Each worker gets its own database (test_db_gw0, test_db_gw1, etc.)
  • Use @pytest.mark.serial to mark tests that cannot run in parallel
  • First run or after DB schema changes: add the --create-db flag

Patterns:

  • Use TransactionTestCase for tests with signals/asynchronous behavior
  • Use TestCase for faster tests without transaction isolation
  • Clear auto-created notifications when testing moderation: Notification.objects.filter(recipient=user).delete()
  • Use the _skip_signals attribute on instances to prevent signal handlers during fixtures

Frontend Component Tests

Location: frontend/tests/

Critical Requirements:

  • Mount through test wrappers that provide all required context
  • GraphQL mocks must match query variables exactly
  • Include mocks for empty-string variants (unexpected boot calls)
  • Wait for visible evidence, not just network-idle
  • Use page.mouse for PDF canvas interactions (NOT locator.dragTo)
  • Add settle time after drag operations (500ms UI, 1000ms Apollo cache)

Test Wrapper Pattern:

// ALWAYS use wrappers, never mount components directly
const component = await mount(
  <DocumentKnowledgeBaseTestWrapper corpusId="corpus-1" documentId="doc-1">
    <DocumentKnowledgeBase />
  </DocumentKnowledgeBaseTestWrapper>
);

Automated Documentation Screenshots

Location: docs/assets/images/screenshots/auto/ (output) | frontend/tests/utils/docScreenshot.ts (utility)

Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the screenshots.yml CI workflow.

How it works:

  1. Import docScreenshot from ./utils/docScreenshot in any .ct.tsx test file
  2. Call await docScreenshot(page, "area--component--state") after the component reaches the desired visual state
  3. Reference the image in markdown: ![Alt text](../assets/images/screenshots/auto/area--component--state.png)
  4. CI runs tests on every PR, captures screenshots, and auto-commits any changes

Naming convention (-- separates segments, - within words):

Segment   | Purpose                    | Examples
area      | Feature area               | landing, badges, corpus, versioning, annotations
component | Specific view or component | hero-section, celebration-modal, list-view
state     | Visual state captured      | anonymous, with-data, empty, auto-award

At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens.

Example:

import { docScreenshot } from "./utils/docScreenshot";

// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");

Rules:

  • Place docScreenshot() calls AFTER assertions that confirm the desired visual state
  • The filename IS the contract between tests and docs — keep names stable
  • Never manually edit files in docs/assets/images/screenshots/auto/ — they are overwritten by CI
  • Manually curated screenshots stay in docs/assets/images/screenshots/ (parent directory)
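The naming convention above can be enforced with a small validator; a sketch (the regex is an assumption derived from the stated rules, not an existing utility):

```typescript
// Hypothetical validator: 2+ segments separated by "--", each segment
// lowercase alphanumeric with single hyphens between words.
const SEGMENT = /^[a-z0-9]+(-[a-z0-9]+)*$/;

function isValidScreenshotName(name: string): boolean {
  const segments = name.split("--");
  return segments.length >= 2 && segments.every((s) => SEGMENT.test(s));
}
```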

Release Screenshots (Point-in-Time)

For release notes, use releaseScreenshot to capture screenshots that are locked in amber — they show the UI at a specific release and never change.

Location: docs/assets/images/screenshots/releases/{version}/ (output)

import { releaseScreenshot } from "./utils/docScreenshot";

await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });

Key differences from docScreenshot:

  • Output: docs/assets/images/screenshots/releases/{version}/{name}.png
  • Write-once: If the file already exists, the function is a no-op (won't overwrite)
  • CI never touches the releases/ directory
  • Name is a simple kebab-case string (no -- segment convention)
  • Version must match v{major}.{minor}.{patch} format (with optional suffix)

When to use which:

  • docScreenshot → README, quickstart, guides (always fresh)
  • releaseScreenshot → Release notes (frozen at release time)

Authenticated Playwright Testing (Live Frontend Debugging)

When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests.

Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to http://localhost:8000/graphql/ (cross-origin from the Vite dev server at localhost:5173). Since fetch defaults to credentials: 'same-origin', browser cookies aren't sent cross-origin. The workaround is to inject the session cookie into request headers via Playwright's route interception.

Step 1: Set a password for the superuser (one-time setup):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"

Step 2: Playwright script pattern:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  // Collect console messages for debugging
  const consoleMsgs = [];
  page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));

  // 1. Login to Django admin to get session cookie
  await page.goto('http://localhost:8000/admin/login/');
  await page.fill('#id_username', '<superuser-username>');
  await page.fill('#id_password', 'testpass123');
  await page.click('input[type=submit]');
  await page.waitForTimeout(2000);

  // 2. Extract the session cookie
  const cookies = await context.cookies();
  const sessionCookie = cookies.find(c => c.name === 'sessionid');

  // 3. Intercept GraphQL requests to inject the session cookie
  //    (needed because Apollo sends cross-origin requests to :8000)
  await page.route('**/graphql/**', async (route) => {
    const headers = {
      ...route.request().headers(),
      'Cookie': 'sessionid=' + sessionCookie.value,
    };
    await route.continue({ headers });
  });

  // 4. Navigate to the frontend page under test
  await page.goto('http://localhost:5173/extracts');
  await page.waitForTimeout(5000);

  // 5. Inspect results
  const bodyText = await page.textContent('body');
  console.log(bodyText);

  await browser.close();
})();

Run from the frontend directory (where playwright is a dependency):

cd frontend && node /path/to/script.js

Key details:

  • The Django admin login sets a sessionid cookie for localhost
  • page.route('**/graphql/**') intercepts Apollo's requests to localhost:8000/graphql/ and injects the cookie header
  • The AuthGate will still show anonymous state (no Auth0 session), but GraphQL queries execute as the authenticated Django user
  • This is useful for verifying backend query results, permission filtering, and data flow through the real frontend
  • For verifying just the GraphQL response without the full frontend, curl with the session cookie also works:
    curl -s -X POST http://localhost:8000/graphql/ \
      -H "Content-Type: application/json" \
      -H "Cookie: sessionid=<session-key>" \
      -d '{"query":"query { extracts { edges { node { id name } } } }"}' | python3 -m json.tool

Alternative — create a session programmatically (no admin login needed):

docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"

Then use the printed session key directly in curl or Playwright route interception.

Documentation Locations

  • Permissioning: docs/permissioning/consolidated_permissioning_guide.md
  • Frontend Routing: docs/frontend/routing_system.md
  • Frontend Auth Flow: docs/frontend/auth_flow.md
  • PDF Data Layer: docs/architecture/PDF-data-layer.md
  • PAWLs Format (v1/v2): docs/architecture/pawls-format.md
  • Parser Pipeline: docs/pipelines/pipeline_overview.md
  • LLM Framework: docs/architecture/llms/README.md
  • Collaboration System: docs/commenting_system/README.md
  • Auth Pattern (detailed): frontend/src/docs/AUTHENTICATION_PATTERN.md
  • Documentation Screenshots: docs/development/screenshots.md
  • Query Permission Patterns: docs/architecture/query_permission_patterns.md
  • OS-Legal-Style Migration Guide: docs/frontend/os-legal-style-migration-guide.md

Branch Strategy

This project follows trunk-based development:

  • Work directly on main branch or use short-lived feature branches
  • Feature branches: feature/description-issue-number
  • Merge feature branches quickly (within a day or two)
  • Commit message format: Descriptive with issue references (e.g., "Closes Epic: Notification System #562")

Changelog Maintenance

IMPORTANT: Always update CHANGELOG.md when making significant changes to the codebase.

The changelog follows Keep a Changelog format:

## [Unreleased] - YYYY-MM-DD

### Added
- New features

### Fixed
- Bug fixes with file locations and line numbers

### Changed
- Changes to existing functionality

### Technical Details
- Implementation specifics, architectural notes

When to update:

  • New features or models added
  • Production code bugs fixed (document file location, line numbers, and impact)
  • Breaking changes to APIs or data models
  • Test suite fixes that reveal production issues
  • Database migrations
  • Architecture changes

What to include:

  • File paths and line numbers for code changes
  • Clear description of the issue and fix
  • Impact on system behavior
  • Migration notes if applicable

Pre-commit Hooks

Automatically run on commit:

  • black (Python formatting)
  • isort (import sorting)
  • flake8 (linting)
  • prettier (frontend formatting)
  • pyupgrade (Python syntax modernization)

Run manually: pre-commit run --all-files

Common Pitfalls

  1. Frontend tests hanging: Always use --reporter=list flag
  2. Permission N+1 queries: Use .visible_to_user() NOT individual permission checks. For Conversation list queries in GraphQL resolvers, use ConversationQueryOptimizer from opencontractserver.conversations.query_optimizer - it provides request-level caching to avoid repeated corpus/document visibility subqueries
  3. Missing GraphQL mocks: Check variables match exactly (null vs undefined matters), add duplicates for refetches
  4. Notification duplication in tests: Moderation methods auto-create ModerationAction records
  5. Structural annotation editing: Always read-only except for superusers
  6. Missing signal imports: Import signal handlers in apps.py ready() method
  7. PDF rendering slow in tests: Increase timeouts to 20s+ for Chromium
  8. Cache serialization crashes: Keep InMemoryCache definition inside wrapper, not test file
  9. Backend tests waiting more than 10 seconds for Postgres to be ready: Docker network issue. Fix with: docker compose -f test.yml down && docker kill $(docker ps -q) && docker compose -f test.yml down
  10. Empty lists on direct navigation: AuthGate pattern solves this (don't check auth status, it's always ready)
  11. URL desynchronization: Use CentralRouteManager, don't bypass routing system
  12. Jotai state not updating: Ensure atoms are properly imported and used with useAtom hook
  13. Writing sync agent tools: All agent tools must be async. The PydanticAIToolWrapper accepts sync functions but calls them directly (no thread pool) — sync Django ORM calls will raise SynchronousOnlyOperation. Use the a-prefixed async versions in core_tools.py.
  14. PydanticAI system_prompt silently dropped: When creating PydanticAIAgent, use instructions= NOT system_prompt=. The system_prompt parameter is only included when message_history is None, but OpenContracts' chat() persists a HUMAN message before calling pydantic-ai's run(), so history is always non-empty. This causes the system prompt to be silently dropped. See docs/architecture/llms/README.md for full details.
  15. Apollo cache keyArgs must use field argument names, not variable names: In cache.ts, relayStylePagination(["corpus", "name_Contains"]) uses the GraphQL field argument names (e.g., extracts(corpus: $id, name_Contains: $text)), NOT the variable names ($id, $text). Mismatched keyArgs silently fail to isolate cache entries, causing queries with different filters to share stale cached results.
  16. Corrupted Docker iptables chains (RARE): If you see Chain 'DOCKER-ISOLATION-STAGE-2' does not exist errors, Docker's iptables chains have been corrupted during docker cycling. Run this nuclear fix:
    sudo systemctl stop docker && sudo systemctl stop docker.socket && \
      sudo ip link delete docker0 2>/dev/null || true && \
      sudo iptables -t nat -F && sudo iptables -t nat -X && \
      sudo iptables -t filter -F && sudo iptables -t filter -X 2>/dev/null || true && \
      sudo iptables -t mangle -F && sudo iptables -t mangle -X && \
      sudo iptables -t filter -N INPUT 2>/dev/null || true && \
      sudo iptables -t filter -N FORWARD 2>/dev/null || true && \
      sudo iptables -t filter -N OUTPUT 2>/dev/null || true && \
      sudo iptables -P INPUT ACCEPT && sudo iptables -P FORWARD ACCEPT && \
      sudo iptables -P OUTPUT ACCEPT && \
      sudo systemctl start docker
    This completely resets Docker's networking and iptables state. Docker will recreate all required chains on startup.
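
Pitfall 13 can be illustrated with a short sketch. The tool name is hypothetical; asyncio.to_thread keeps the sketch stdlib-only, while in the codebase asgiref's sync_to_async plays this role (and additionally handles thread-sensitive DB connections):

```python
# Sketch for pitfall 13: agent tools must be async, so a blocking Django ORM
# call has to be moved off the event loop. lookup_document_title is a
# hypothetical tool name, not one from core_tools.py.
import asyncio

def _get_title_sync(document_id: int) -> str:
    # Stand-in for a blocking ORM call, e.g.
    # Document.objects.get(pk=document_id).title
    return f"Document {document_id}"

async def lookup_document_title(document_id: int) -> str:
    # Offloading to a worker thread avoids SynchronousOnlyOperation,
    # which Django raises when the ORM is touched from the event loop.
    return await asyncio.to_thread(_get_title_sync, document_id)

print(asyncio.run(lookup_document_title(42)))  # Document 42
```

Calling _get_title_sync directly from inside the coroutine would be the bug the pitfall describes.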

@claude

claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

This is a well-architected feature. The core design of using the same Docxodus library in both the backend microservice and the frontend WASM module for offset alignment is sound, and the implementation is clean with good test coverage. Detailed feedback below.

Security

  • [Important] The DOMPurify SANITIZE_CONFIG allows arbitrary style attributes. In DocxAnnotator.tsx the config includes style in ADD_ATTR. This preserves inline DOCX styles but also allows arbitrary CSS values, including url() references that can load external resources or exfiltrate data; DOMPurify does not strip dangerous url() from inline styles by default. Consider adding a uponSanitizeAttribute hook that strips url( patterns. At minimum, document the trade-off.

Code Quality

  • [Minor] context_length = 64 in useTextSearch.tsx should live in constants.ts (per CLAUDE.md: no magic numbers).
  • [Minor] CHANGELOG says docxodus@5.5.0 but package.json pins 5.5.3; these should match.
  • [Minor] docxodus_parser.py reads the entire DOCX into memory before size-checking; for a 50MB file this is roughly 100MB peak. Consider checking default_storage.size() first.
  • [Minor] _normalize_annotation silently drops annotation_type when it is invalid; add a logger.debug() call to aid version-drift diagnosis.
  • [Minor] resolve_annotation_type says it tolerates invalid DB values without explaining how they got there; a note or issue link would help.
  • [Minor] getSpan in DocxAnnotatorWrapper throws when selectedLabel is not found. The throw is caught by handleCreateAnnotation so there is no crash, but returning null and early-returning would be more idiomatic.

Architecture

  • [Observation] The docxLoadCancelRef pattern works, but a comment explaining its intent would help.
  • [Observation] Three separate style elements could be merged. Low priority.

Test Coverage

Good overall: the getGlobalOffset tests are thorough, the backend transient/permanent error classification is correct, and the disambiguation test is correctly marked fixme.

  • [Minor] DocxAnnotatorTestWrapper says offsets are pinned to @5.5.0 but package.json is 5.5.3; verify that sampleAnnotation2 offsets 84-100 are still valid.

Summary

Three items before merge:

  1. Security: document the DOMPurify style risk or add a url() filter.
  2. Offset verification: confirm annotation offsets against 5.5.3.
  3. Magic number: move context_length to constants.

Everything else is advisory.

…unting

The DOM offset walker counted ALL text in the PaginatedDocument DOM
(including CSS from <style> tags, page numbers, attribute text) producing
offsets far beyond docText.length. This made every selection fail the
"empty text or zero range" check.

Replace with text-matching approach using findTextOccurrences:
1. Clean selected text by stripping page number artifacts (\n1\n → \n)
2. Search cleaned text in docText via findTextOccurrences
3. If no exact match (whitespace differences from cross-page), try
   whitespace-normalized fuzzy matching
4. Use DOM position only for disambiguation when multiple matches exist

Also skip <style> and <title> tags in the TreeWalker (still used for
disambiguation), though the primary path no longer relies on DOM offsets.
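
The matching strategy in this commit message can be sketched in Python (the real implementation is TypeScript; function names mirror the message but the code is illustrative):

```python
# Sketch of the text-matching approach: exact search first, then a
# whitespace-normalized fuzzy fallback. Illustrative only; the real
# findTextOccurrences lives in the TypeScript frontend.
import re

def find_text_occurrences(needle: str, haystack: str) -> list[int]:
    """Start offsets of every exact occurrence of needle in haystack."""
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

def find_fuzzy_occurrences(needle: str, haystack: str) -> list[tuple[int, int]]:
    """Fallback: any run of whitespace in needle matches any run of
    whitespace in haystack (handles cross-page selection differences)."""
    pattern = r"\s+".join(re.escape(tok) for tok in needle.split())
    return [(m.start(), m.end()) for m in re.finditer(pattern, haystack)]

def locate(selected: str, doc_text: str) -> list[tuple[int, int]]:
    # 1. Strip page-number artifacts like "\n1\n" down to "\n"
    cleaned = re.sub(r"\n\d+\n", "\n", selected)
    # 2. Exact match first
    exact = find_text_occurrences(cleaned, doc_text)
    if exact:
        return [(s, s + len(cleaned)) for s in exact]
    # 3. Whitespace-normalized fuzzy match (DOM position would then
    #    disambiguate when more than one span comes back)
    return find_fuzzy_occurrences(cleaned, doc_text)
```

When locate returns multiple spans, the DOM-derived position is only used to pick between them, matching step 4 above.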
@claude

claude bot commented Mar 21, 2026

PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer

This is a well-structured feature addition with thoughtful architecture. The three-layer rendering pipeline (convert-once/project-many/CSS-toggle) is clever, and the approach of sharing the same Docxodus library between backend and frontend to guarantee offset alignment is a strong design choice. That said, there are some issues that need attention before merge.


Critical: Docxodus Version Mismatch

The core guarantee of this PR — that backend and frontend character offsets align because they use the same Docxodus library — is broken by a version mismatch:

  • Docker image (all three compose files): ghcr.io/open-source-legal/docxodus-service:1.1.0-docxodus5.4.2
  • npm package (package.json): "docxodus": "5.5.3"
  • CHANGELOG also references docxodus@5.5.0 in one place

If Docxodus 5.4.2 produces different character offsets than 5.5.3 (due to whitespace normalization changes, paragraph handling, etc.), annotations created in the backend will not highlight the correct text in the frontend. The Docker image needs to be updated to use docxodus 5.5.3, or the npm package pinned to match the backend. The test comment // Offsets are pinned to docxodus@5.5.0 also does not match either version.


Bug: Unhandled JSON Parse Error in Backend Parser

In docxodus_parser.py, after response.raise_for_status() succeeds:

result = response.json()
normalized = self._normalize_response(result)

If the microservice returns HTTP 200 with a non-JSON body (e.g. an HTML error page from a proxy), response.json() throws requests.exceptions.JSONDecodeError which is not caught by any of the existing except blocks — those only guard the requests.post() call. This propagates as an unhandled exception rather than a DocumentParsingError.

Suggestion: wrap the response.json() and _normalize_response() calls in a try/except and raise DocumentParsingError(..., is_transient=True).
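
The suggested fix can be sketched as follows. DocumentParsingError here is a stand-in for the project's exception class, and stdlib json replaces requests' response.json() so the sketch stays self-contained:

```python
# Sketch of the suggested fix: treat a non-JSON 200 body as a transient
# DocumentParsingError instead of letting the decode error propagate.
import json

class DocumentParsingError(Exception):
    """Stand-in for the project's exception; is_transient drives retries."""
    def __init__(self, message: str, is_transient: bool = False):
        super().__init__(message)
        self.is_transient = is_transient

def decode_service_body(body: str) -> dict:
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        # HTTP 200 with a non-JSON body (e.g. a proxy's HTML error page):
        # surface it as transient so the pipeline can retry.
        raise DocumentParsingError(
            f"Docxodus service returned a non-JSON body: {exc}", is_transient=True
        ) from exc
```

The same wrapper would also cover _normalize_response, since malformed-but-valid JSON from a mismatched service version fails at that stage instead.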


Security: Unvalidated Color Value in CSS Template

In DocxAnnotator.tsx, the customCss memo interpolates raw color values into a CSS string:

border-bottom: 2px solid #[color];
background-color: #[color];

A crafted label.color value could inject arbitrary CSS rules. Annotation label colors are set by corpus owners/admins rather than arbitrary end users, so the blast radius is limited — but it is easy to fix: validate that color matches a hex pattern before interpolating, or derive both CSS properties from the hexToRgb() return value (which already sanitizes to numeric RGB components).


Code Quality: Magic Number in useTextSearch.tsx

const context_length = 64;

Per CLAUDE.md conventions, magic numbers belong in the constants file. This should be a named constant like TEXT_SEARCH_CONTEXT_LENGTH in constants.ts alongside DOCX_CACHE_MAX_ENTRIES.


Minor: Missing .docx Extension Guarantee in Parser Payload

document.title may be something like "Service Agreement" with no file extension. If the microservice uses the filename extension for processing hints, this could cause subtle failures. Consider always ensuring .docx is appended.


What's Done Well

  • DOMPurify integration: SANITIZE_CONFIG correctly uses ADD_ATTR/ADD_TAGS to preserve docxodus formatting while blocking scripts and event handlers.
  • Cancellation pattern: The docxLoadCancelRef cancel-on-navigation in DocumentKnowledgeBase.tsx and the cancelled flag in DocxAnnotator.tsx effects correctly prevent stale state updates.
  • LRU cache: The Map-based LRU with promotion-on-hit is correct; the lruKey !== undefined guard handles the empty-map edge case.
  • isSpanBasedFileType abstraction: Clean refactor that avoids scattering DOCX-specific checks throughout AnnotationHooks, EnhancedLabelSelector, and UISettingsAtom.
  • Error classification: The transient/permanent split (4xx = permanent, timeout/connection = transient) enables correct retry logic in the pipeline.
  • Test coverage: Parser unit tests cover success, timeout, connection error, HTTP error, and normalization paths. Thumbnailer tests cover embedded thumbnail extraction and fallback text rendering.
  • Structural annotation handling: Correctly filters structural annotations via showStructuralAnnotations before projection, consistent with PDF/TXT behavior.
  • Cleanup on unmount: setDocxBytes(null) prevents stale WASM data from bleeding into subsequent document navigations.
  • required: false on Docker deps: Optional dependency in compose files means installations without the DOCX microservice still start without errors.

Summary

The version mismatch (critical) directly undermines the PR's core guarantee and should be resolved before merge. The JSON parse error (bug) is a reliability gap. The CSS injection (security) is low-severity but a straightforward fix. The remaining items are style/quality notes per CLAUDE.md conventions.

@JSv4 JSv4 merged commit 3dfc8cb into main Mar 21, 2026
18 checks passed
@JSv4 JSv4 deleted the claude/explore-docxodus-G49qt branch March 21, 2026 17:34