Add first-class DOCX support with Docxodus microservice and WASM renderer#1102
Add first-class DOCX support with Docxodus microservice and WASM renderer#1102
Conversation
Comprehensive plan for adding first-class DOCX support with: - Backend DocxParser using python-docx for text extraction + structural annotations - Frontend rendering via react-docxodus-viewer (Docxodus WASM) - Character-offset SpanAnnotation format (same as TXT) - Hash-based validation for offset alignment between backend and frontend - DocxThumbnailGenerator for document cards https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
Drop python-docx parser entirely. Use Docxodus .NET microservice as the sole DOCX parser, mirroring the Docling pattern for PDFs. Both backend (microservice) and frontend (WASM) use the same Docxodus library, guaranteeing character offset alignment with zero reconciliation needed. https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
Implements a complete parallel document ingestion pipeline and rendering tree for DOCX files, bringing Word document support alongside existing PDF and TXT pipelines. Backend: - Docxodus .NET 8 microservice (docxodus-service/) wrapping OpenContractExporter.Export() for structural annotation extraction - DocxodusServiceParser REST parser with camelCase→snake_case normalization and transient/permanent error classification - DocxThumbnailGenerator with embedded thumbnail extraction fallback to XML-based text preview rendering - Docker Compose service definitions for local/test/production Frontend: - DocxAnnotator component using docxodus WASM for native DOCX→HTML conversion with annotation projection via convertDocxToHtmlWithExternalAnnotations() - DocxAnnotatorWrapper mirroring TxtAnnotatorWrapper state management - DocumentKnowledgeBase DOCX loading flow and renderer dispatch - isDocxFileType utility, DOCX_MIME_TYPE constant, docxBytesAtom Tests: - Backend: parser unit tests (success, timeout, connection error, normalization) and thumbnailer tests (text preview, embedded thumbnail, invalid DOCX) - Frontend: DocxAnnotator component test with docScreenshot captures https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
WASM initialization takes longer than standard components. Added test.setTimeout(30_000) and a third screenshot for the error/loading state. https://claude.ai/code/session_01XHqUsM8crJxpn5pjMJvttF
…tator - Remove PLAN.md (AI planning artifact) - Remove unused react-docxodus-viewer dependency - Add CSS.escape() for annotation ID interpolation in style block - Document first-occurrence text selection limitation with TODO - Remove console.log from getDocxBytes - Remove no-op identity mappings in _normalize_annotation/_normalize_relationship - Add env_var for use_cloud_run_iam_auth setting - Fix redundant exception type (BadZipFile already covered by Exception) - Add docxodus-parser to production.yml django depends_on - Fix black formatting and flake8 line-too-long
The docxodus-service .NET microservice has been extracted to its own repository at Open-Source-Legal/docxodus-service with tests and CI/CD that publishes to ghcr.io. - Remove docxodus-service/ directory (now at github.com/Open-Source-Legal/docxodus-service) - Update local.yml, test.yml, production.yml to use ghcr.io/open-source-legal/docxodus-service:main
Configure Playwright CT to load and run the Docxodus .NET WASM runtime, enabling real DOCX-to-HTML conversion in component tests instead of testing only loading/error states. - Add setupDocxodusWasm() utility that serves WASM files via page.route() - Set docxodus WASM base path in playwright/index.tsx hooks - Add test.docx fixture generated with python-docx - Update DocxAnnotatorTestWrapper to load real DOCX bytes - Capture 3 screenshots showing actual rendered DOCX content - Add *.wasm to Vite assetsInclude
Production fix: add annotationType and annotationJson.text fields to buildExternalAnnotationSet — required by Docxodus WASM to project annotation highlights onto rendered HTML. Test fixes: - Fix docText to match Docxodus extraction (newlines between paragraphs) - Add second annotation (Important Clause) with correct character offset - Screenshots now show real annotation highlights (red + teal) - Add DOCX document mocks and DKB test scaffolding
…enshot - Fix double-slash in DOCX file route pattern (**/URL → **URL) - Add GET_CORPUS_VERSIONS mock to createSummaryMocks (was unmocked) - Add duplicate mocks for refetch queries - Screenshot shows full DocumentKnowledgeBase with DOCX viewer
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…erage - Add DOMPurify sanitization for DOCX WASM HTML output before dangerouslySetInnerHTML to prevent XSS from crafted documents - Fix docxBytesAtom memory leak by clearing it when navigating to non-DOCX documents (PDF/text load paths) - Add URL-keyed caching to getDocxBytes() to avoid refetching on Apollo query refetches - Fix textLabels deduplication to key by label ID instead of text, preventing collision of labels with identical names - Rename read_only prop to readOnly for camelCase consistency - Make unused CRUD props (updateAnnotation, approveAnnotation, rejectAnnotation) optional on DocxAnnotatorProps - Change DOCX viewer container from id=pdf-container to docx-container - Fix document-only DOCX handler to pass docId/textHash to getDocumentRawText, matching the corpus+document handler - Add DOCXODUS_PARSER_SERVICE_URL to test environment config - Add test for HTTP 4xx permanent failure path in DocxodusServiceParser - Update CHANGELOG to reference dompurify instead of removed react-docxodus-viewer
…ucture for DOCX support - Resolve ambiguous text selection in DocxAnnotator by using DOM TreeWalker to compute approximate character offsets, then picking the closest findTextOccurrences match (handles repeated phrases like "Party", "Agreement", boilerplate) - Fix race condition when switching between DOCX documents by clearing docxBytesAtom before fetching new bytes - Add docling-parser to production.yml depends_on (was missing vs local/test) - Guard zero-length Uint8Array in DocxAnnotatorWrapper to show loading spinner - Document pdf_file field reuse for DOCX storage in parser and frontend - Clarify DOMPurify data-* attribute handling in sanitization comment - Fix pre-existing DocxAnnotator.ct.tsx: replace broken ?url asset import with page.route() interceptor, remove test.describe() that broke Playwright CT component registration - Add unit tests for offset resolution algorithm (11 tests) - Add component test proving second occurrence is selected over first
…overlay Separate expensive DOCX-to-HTML conversion from annotation projection using the convert-once / project-many pattern. Annotation changes now re-project in ~56ms instead of re-converting the full document (~892ms), and label visibility toggling is instant via CSS-only rules.
- Extract getGlobalOffsetFromDomPosition to shared docxOffsetUtils.ts to prevent silent divergence between production code and tests - Pin docxodus-service image to v1.0.0 in production.yml and test.yml for reproducible builds - Add DOCXODUS_PARSER_TIMEOUT to test env for parity with other parsers - Add HTTP 5xx transient error test for docxodus parser - Add cache eviction comment to docxBytesCache - Fix visibleLabels null-coalesce in TxtAnnotatorWrapper for consistency with DocxAnnotatorWrapper
…dd schema validation - Revert visibleLabels ?? [] in TxtAnnotatorWrapper that converted null (meaning "show all") to empty array (meaning "show none"), breaking TXT annotation display - Use consistent :main docker tag for docxodus-parser across all compose files (v1.0.0 tag doesn't exist on GHCR) - Add required field validation in docxodus_parser._normalize_response - Add LRU eviction (3-entry cap) to docxBytesCache - Extract duplicate DOCX loading logic into loadDocxDocument helper - Add useEffect cleanup for docxBytesAtom on unmount - Document DOMPurify threat model and validateBeforeProjection rationale - Use DOCX_MIME_TYPE constant in SUPPORTED_MIME_TYPES - Fix CHANGELOG docxodus version (5.4.2 -> 5.5.0) - Add console.warn for silent annotation creation failures
… fixes - Render search results and chat sources in DocxAnnotator by projecting them as synthetic annotations with dedicated CSS styling - Register annotation DOM refs via onAnnotationRefChange callback so sidebar scroll-into-view works for DOCX documents - Replace inline disambiguation logic with pickClosestOccurrence (DRY) - Eliminate loading flash by setting annotatedHtml = baseHtml immediately after DOCX conversion, before projection completes - Move DOCX_CACHE_MAX_ENTRIES from cachedRest.ts to constants.ts
…arnings - Wrap filteredAnnotations/filteredSearchResults in useMemo (DocxAnnotatorWrapper) - Extract useTestDocxBytes() hook to eliminate duplicate DOCX loading in test wrappers - Show generic user-facing error message instead of raw error strings (DocxAnnotator) - Add sanity assertions on pinned offsets in disambiguation test for clearer failures - Add mutable :main tag warnings to all Docker Compose files for docxodus-parser - Add comment explaining local xml.etree.ElementTree import in docx_thumbnailer
Playwright CT's babel transform fails to replace component refs with import refs when non-component exports (sampleDocText) are imported alongside component exports in the same import statement. The transform only removes the import when ALL specifiers are used in JSX; mixed imports leave both the original import and the generated const declarations, causing the component type to remain as a function instead of an importRef object. Split the import so component names are on a separate statement from non-component exports.
Code Review: First-class DOCX SupportOverall this is a well-architected addition that follows existing patterns cleanly. The convert-once/project-many rendering model is smart, the XSS protection is thoughtful, and the test coverage is thorough. A few important issues to address before merging: Critical1. Mutable
Pin to a semver tag (e.g., 2. Docker Compose will refuse to start the django container if either service is absent from the compose file or fails health checks. Existing deployments that don't run these services will break on the next deploy. Consider using Compose Important3.
4. If the component unmounts while these promises are in flight, all three setters fire on a dead component, and the 5. The guarantee only holds when both sides are pinned to the same version (see issue 1 above). If the frontend npm package is upgraded independently of the Docker image, offsets will silently diverge — and validation would catch that. Consider keeping it enabled (the ~56ms projection budget can absorb validation) and logging a warning on failure rather than disabling it entirely. Suggestions6. The cache evicts the oldest inserted entry. Users are more likely to re-open recently viewed documents. LRU is a one-liner using Map's insertion-order semantics — delete and re-insert on a cache hit. 7. Named export of Any consumer importing the named export gets the un-memoized version. Either apply 8. No file size guard before base64-encoding DOCX and POSTing to microservice Large DOCX files (e.g. 50MB contracts with embedded images) will be encoded to ~67MB and sent over HTTP. 9. The PR description mentions a 10. Minor: duplicate import in Both What's done well
|
- Pin docxodus-service image to v1.0.0 across all compose files - Make docling-parser and docxodus-parser optional in production.yml (required: false) to avoid breaking existing deployments - Fix unmount race condition in loadDocxDocument with cancellation flag - Enable validateBeforeProjection as safety net for version drift - Change docxBytesCache from FIFO to LRU eviction - Remove named export of DocxAnnotatorWrapper (bypassed React.memo) - Add configurable max_file_size_mb setting to DocxodusServiceParser - Merge duplicate import in DocxAnnotator.ct.tsx
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewOpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models. Baseline Commit Rules
Essential CommandsBackend (Django)# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-filesFrontend (React/TypeScript)cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage report
yarn test:coverage
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serveProduction Deployment# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml upHigh-Level ArchitectureBackend ArchitectureStack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
Expected Results
CleanupCommands to restore original state if needed. Automated Documentation ScreenshotsLocation: Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
|
#!/usr/bin/env python This script tests that rate limiting is properly configured and working import os import django Setup Djangoos.environ.setdefault("DJANGO_SETTINGS_MODULE", "config.settings.local") from config.graphql.ratelimits import ( # noqa: E402 def test_rate_limit_decorator(): def test_rate_limit_configuration(): def test_graphql_integration(): def main(): if name == "main": |
Code ReviewThis is a well-structured PR that follows existing patterns (mirroring TxtAnnotatorWrapper, reusing the convert-once/project-many split, DOMPurify sanitization). The offset disambiguation via getGlobalOffsetFromDomPosition and pickClosestOccurrence is a solid design and the unit test coverage for those utilities is good. A few things need attention before merge: Bug: annotationLabel not normalized in _normalize_annotation In opencontractserver/pipeline/parsers/docxodus_parser.py, the ann_mappings dict converts several camelCase fields to snake_case but is missing annotationLabel to annotation_label. The .NET microservice returns annotationLabel (camelCase), but without this mapping the key passes through unchanged. Downstream code reading OpenContractDocExport typed dicts will silently get annotationLabel instead of annotation_label, leading to either KeyErrors or annotations stored without labels. This was flagged in the prior review and appears to still be unaddressed. Fix: Add annotationLabel: annotation_label to ann_mappings in _normalize_annotation. Hard dependency on docxodus-parser in dev/test but not production local.yml and test.yml wire docxodus-parser as a hard depends_on (condition: service_started) for the django service, meaning the entire dev environment and all test runs including PDF/TXT tests fail if the container is unavailable. Production correctly uses required: false for this service. Since this is an optional microservice only exercised for DOCX documents, dev and test should also use required: false to avoid friction for contributors who do not need DOCX support. Minor: getDocxBytes has no concurrent request deduplication Two simultaneous fetches for the same URL will both fire independent axios.get calls. This matches the existing getPawlsLayer pattern so it is not a regression, but given DOCX files can be large (50 MB limit), it is worth a follow-up improvement (e.g. a pending-promise map keyed by URL). Minor: PDFContainer reused for DOCX viewer DocumentKnowledgeBase.tsx wraps the DOCX branch in PDFContainer id=docx-container. Functionally correct, but semantically confusing. If this is a generic scroll container, aliasing or renaming it would clarify intent. Not blocking. Minor: Enter shortcut hint without a keyboard handler The annotation creation menu renders a ShortcutHint Enter for Annotate Selection, but the only keydown handler present closes the menu on Escape. Enter does not trigger handleCreateAnnotation. Either wire up Enter or remove the hint. Tests look solid
The main blocking item is the annotationLabel normalization bug; the local.yml/test.yml hard dependency issue is worth fixing before merge too. |
PR Review: Add first-class DOCX support with Docxodus microservice and WASM rendererThis is a well-structured and ambitious PR. The architecture — same library on both sides guaranteeing offset alignment, three-phase render pipeline (convert once / project many / CSS toggle), DOM-based disambiguation for repeated text — is solid. The test coverage is thorough. Here are the issues I found, from critical down to minor. Critical: Version mismatch breaks the offset alignment guaranteeThe core correctness guarantee is that the backend microservice and frontend WASM use the same Docxodus version. But they do not:
The Docker image tag explicitly encodes docxodus5.4.2. If character offsets differ between 5.4.2 and 5.5.2, annotations will be projected at wrong positions. The PR description and code comments both stress this alignment requirement. Either the npm package should pin to 5.4.2 or the Docker image needs a rebuild against 5.5.2. Important: File size check happens after reading entire file into memoryIn docxodus_parser.py, the code reads the full file first, then checks the size limit. For a 200 MB file with max_file_size_mb=50, 200 MB is allocated before raising. Better to check size before reading — default_storage.size() is available for both local and S3 backends. Important: HTTP requests are not actually cancelled on component unmountloadDocxDocument in DocumentKnowledgeBase.tsx uses a closure-based cancel flag that prevents state updates after unmount, which is correct — but the underlying axios.get in getDocxBytes continues transferring data to completion, wasting bandwidth for large DOCX files. Consider threading an AbortController.signal through getDocxBytes and calling abort() on cancel. Notable: CSS color value from server is interpolated without validationIn DocxAnnotator.tsx, label colors from availableLabels are interpolated directly into a style block. CSS.escape is correctly applied to label IDs, but color is used raw — e.g., the color string is stripped of its leading hash and embedded directly into the CSS rule for border-bottom. Corpus labels are only editable by users with write permission, so the blast radius is self-contained — but validating that color matches Notable: annotation_type GraphQL field — root cause not addressedThe new resolve_annotation_type resolver returns a plain string "to tolerate invalid DB values." This is reasonable, but the PR does not say how invalid values got into the DB or how many rows are affected. A data migration to clean those rows would fix the root cause. Consider a follow-up ticket. Minor: Local import of xml.etree.ElementTree is unnecessaryIn docx_thumbnailer.py, import xml.etree.ElementTree as ET is placed inside the function with a comment about avoiding loading the XML parser at module level. xml.etree.ElementTree is a standard library module always loaded early in a Django process — the optimization does not apply. Move it to top-level imports for consistency. Minor: getSpan silently discards the selection on missing labelIn DocxAnnotatorWrapper, if activeSpanLabel is set but not found in spanLabels, getSpan throws, handleCreateAnnotation catches it with console.warn, and the annotation menu closes with no user feedback. A toast or inline message would be more informative. Things done well
|
|
PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer This is a well-structured feature addition that follows existing patterns (mirrors TxtAnnotatorWrapper, reuses the parser pipeline, uses Jotai atoms). The architecture of using the same library for both backend parsing and frontend rendering is elegant. A few items warrant attention before merging. HIGH PRIORITY
The PR's key invariant is that both backend parser and frontend WASM renderer use the same Docxodus library version, ensuring character offsets are identical. But the Docker image tag in all three compose files is 1.1.0-docxodus5.4.2 (backend on 5.4.2) while package.json pins docxodus at 5.5.2. The CHANGELOG also mentions 5.5.0 but the installed package is 5.5.2. Minor version bumps can change how Docxodus walks document nodes, causing silent annotation mis-alignment where validation passes but spans land in the wrong place. Fix: Bump the microservice image to use docxodus 5.5.2 and update the tag accordingly, or pin the frontend to 5.4.2. The two version strings must agree exactly.
SANITIZE_CONFIG passes ADD_TAGS: ["style"], meaning a malicious DOCX can inject arbitrary CSS blocks into the page DOM. Even without JavaScript, injected CSS enables clickjacking via overlay elements, hiding annotation controls via display:none, and visual spoofing of UI elements. Fix: Remove "style" from ADD_TAGS. If style blocks are required for PaginatedDocument to render correctly, scope them via Shadow DOM or strip cross-origin selectors before injection. MEDIUM PRIORITY
"from typing import Any" and "from opencontractserver.annotations.models import LABEL_TYPES" are imported but never referenced in the file. These appear to be leftovers from an earlier normalization implementation that was removed.
The atom is only cleared on component unmount. If DocumentKnowledgeBase stays mounted while the user navigates from a DOCX document to a PDF or TXT document, stale DOCX bytes remain in the atom while the PDF/TXT renderer is active. The atom should also be reset in the type-switching branches of the onCompleted handlers, alongside the existing setDocxBytes(null) call at the start of loadDocxDocument.
Effect 6 uses requestAnimationFrame after dangerouslySetInnerHTML updates to query annotation elements. If PaginatedDocument only renders visible pages, annotations on off-screen pages will not be in the DOM yet and will be silently skipped, breaking sidebar scroll-to-annotation for those annotations. Worth verifying against the TXT renderer approach. POSITIVE NOTES Error classification (transient vs permanent for HTTP 4xx/5xx/network errors) in DocxodusServiceParser matches the pipeline contract well. The LRU cache in cachedRest.ts using Map insertion-order semantics is correct and clean. The cancellation pattern with docxLoadCancelRef prevents stale state from in-flight loads. The required: false on Docker Compose dependencies is the right choice so existing deployments do not break. The offset disambiguation logic (pickClosestOccurrence + getGlobalOffsetFromDomPosition) is well-tested including a real-contract scenario with repeated terms. The annotation_type GraphQL resolver returning a plain string is a pragmatic fix with clear rationale. SUMMARY: The Docxodus version mismatch and the DOMPurify style-tag injection are the two issues that should be fixed before merging. Unused imports will be caught by pre-commit hooks. The docxBytesAtom stale-state and RAF ref registration gaps are worth addressing in this PR or a fast follow-up. |
scrollIntoView didn't work with nested scroll containers. Instead, walk up from the annotation element to find the nearest scrollable ancestor (PaginatedDocument's overflow:auto div) and scroll within that specific container. Added small delay for PaginatedDocument render timing.
PR Review: DOCX Support via DocxodusThis is a well-structured PR with solid architecture — the convert-once/project-many pattern, LRU caching, DOMPurify sanitization, and offset disambiguation utilities are all thoughtfully designed. The test coverage is unusually thorough for a feature this size. A few issues need addressing before merge. CriticalVersion mismatch between backend service and frontend WASM — this directly breaks the PR's core guarantee.
The PR's own description states: "Both backend parser and frontend renderer use the same Docxodus library, ensuring character offsets are identical." A 5.4.2 vs 5.5.2 version gap can produce misaligned offsets and silent annotation placement errors. Either bump the service image to HighDOMPurify allows In New GraphQL In MediumFile size check occurs pre-base64 encoding In
When
In Low / NitsHardcoded offset constants in tests — Missing
SummaryThe architecture is sound. The version mismatch between the backend service image and frontend npm package is the blocking issue — the core offset alignment guarantee fails silently without it. The DOMPurify |
…oView The previous attempts failed because PaginatedDocument hadn't finished rendering when the scroll effect fired. Use the onPaginationComplete callback to set paginationReady state, then gate the scroll effect on it. scrollIntoView automatically finds the nearest scrollable ancestor (PaginatedDocument's overflow:auto container).
|
test comment please ignore |
|
Deleting test comment above, actual review incoming. |
|
PR Review Part 1 of 3 - DOCX Support via Docxodus Microservice + WASM Renderer This is a well-structured and ambitious feature. The architecture - same library on both ends guaranteeing offset alignment, convert-once/project-many rendering pattern, and cancellation-safe async loading - is solid. Below is a focused list of items worth addressing before merge. CRITICAL - Backend microservice and frontend npm package version mismatch: The PR description states both the backend parser and frontend renderer use the same Docxodus library to guarantee identical character offsets, but the actual versions differ: Docker images pin docxodus 5.4.2 while frontend/package.json uses 5.5.2. If text extraction changed between those versions (e.g. different whitespace or paragraph separator handling), annotations created by the backend will silently appear at wrong positions in the frontend renderer. The docker-compose comment even flags this warning. Please align these versions before merging. SECURITY - ADD_TAGS style in DOMPurify config: In DocxAnnotator.tsx, SANITIZE_CONFIG allows style tags and style attributes through DOMPurify. Allowing injected style blocks means a maliciously crafted DOCX (or compromised microservice response) can inject arbitrary CSS into the page. While not classic XSS, this enables UI redress/clickjacking, information leakage via CSS attribute selectors, and potential data exfiltration - high-risk for a platform where untrusted documents are a primary input. Consider scoping injected styles to a Shadow DOM, or at minimum filtering url(), @import, and content: values that can trigger network requests. |
1. Search: useTextSearch only handled PDF (token-based search). Add span-based search for TXT/DOCX that produces TextSearchSpanResult objects with start_index/end_index, matching TxtAnnotator's pattern. 2. Annotation visibility: reset paginationReady when projection produces new annotatedHtml, so PaginatedDocument re-paginates and the new annotation becomes visible without a page refresh. 3. Labels: EnhancedLabelSelector's filteredLabelChoices didn't handle DOCX file type — it only checked isTextFileType (text/*) and isPdfFileType, so DOCX fell through to an empty array. Add isDocxFileType check to treat DOCX like TXT (SPAN_LABEL type, humanSpanLabels list).
DOCX documents use span-based annotations (character offsets) like TXT, not token-based like PDF. Multiple code paths only checked isTextFileType (which only matches text/*) and isPdfFileType, causing DOCX to fall through to wrong defaults. Add isSpanBasedFileType(fileType) = isTextFileType || isDocxFileType as a single utility, then update all affected locations: - UISettingsAtom: active label initialization now picks span labels for DOCX - AnnotationHooks: create/update annotation uses ServerSpanAnnotation for DOCX - EnhancedLabelSelector: label type and available labels use span labels for DOCX - ChatMessage: chat source annotation labels use span labels for DOCX
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewOpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models. Baseline Commit Rules
Essential CommandsBackend (Django)# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-filesFrontend (React/TypeScript)cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage report
yarn test:coverage
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serveProduction Deployment# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml upHigh-Level ArchitectureBackend ArchitectureStack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
Expected Results
CleanupCommands to restore original state if needed. Automated Documentation ScreenshotsLocation: Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
|
PR Review placeholder - content will follow in separate API call |
PR Review: First-class DOCX Support (PR 1102)This is a substantial and well-structured PR. The architecture delegating DOCX parsing to a .NET microservice and using the same Docxodus library in the WASM renderer is clever, and the offset-alignment guarantee is a strong design decision. The code is generally clean, follows existing patterns, and has good test coverage. A few issues worth addressing before merge: Critical: Version Mismatch Breaks the Core Alignment GuaranteeThe PR central claim is that character offsets are guaranteed to align because both sides use the same Docxodus library version. But they do not:
The docker-compose comments explicitly warn that the frontend WASM docxodus version and this microservice must match for character offset alignment, and then immediately violate it. If there were any text-extraction behavior changes between 5.4.2 and 5.5.2, annotations from the backend will silently misalign in the frontend renderer. Action needed: Align to the same version across the image tag and package.json, or use the newer microservice image version. hexToRgb can throw on invalid colorsIn DocxAnnotator.tsx, the customCss memo calls hexToRgb(color) and immediately destructures the result. If any label has an invalid or empty hex color string and hexToRgb returns null or undefined, this crashes. Labels sourced from the DB could have any value. Add a null guard before destructuring the return value. Magic number context_length = 64 in useTextSearchIn the new span-based search branch (useTextSearch.tsx) there is a hardcoded const context_length = 64. Per project guidelines, no magic numbers should live inline. This should be a named export in frontend/src/assets/configurations/constants.ts. The PDF branch presumably uses the same value, worth consolidating. annotation_type GraphQL workaround hides data quality issuesThe resolver returns an empty string for invalid DB values. While pragmatic, this makes it harder to detect corrupted data at the API layer. Consider at minimum logging a warning when an invalid value is encountered so it surfaces in monitoring. A follow-up migration to clean existing invalid values would be more robust. SANITIZE_CONFIG allows style tags: document the riskRe-enabling style tags is unusual in DOMPurify configs. The comment explains it is needed for docxodus formatting, but it opens the door to CSS-based data exfiltration (e.g., CSS history sniffing, font-face-based SSRF). DOMPurify FORCE_BODY helps contain styles to the component, but worth a comment explaining the risk assessment and why it is acceptable here (e.g., content is always server-generated, never user-typed HTML). @ts-expect-error for PaginatedDocument import is tech debtThe tsconfig uses moduleResolution node which does not support package.json exports subpaths. This works at runtime via Vite but bypasses TypeScript type checking for the PaginatedDocument component leaving its props untyped. Consider updating tsconfig.json to moduleResolution bundler (Vite-native) which properly resolves exports subpaths. This is the recommended setting for Vite projects and would eliminate the suppression comment. What is done well
The version mismatch is the only blocking issue. The others are improvements worth addressing but would not cause data loss. |
|
## PR Review: First-class DOCX support with Docxodus microservice and WASM renderer This is a test to see if HTML entities render as markdown headers. |
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewOpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models. Baseline Commit Rules
Essential CommandsBackend (Django)# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-filesFrontend (React/TypeScript)cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage report
yarn test:coverage
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serveProduction Deployment# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml upHigh-Level ArchitectureBackend ArchitectureStack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
Expected Results
CleanupCommands to restore original state if needed. Automated Documentation ScreenshotsLocation: Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
|
test |
5.5.3 adds typesVersions to package.json, fixing subpath type resolution for moduleResolution:"node". The @ts-expect-error on the docxodus/react import is no longer needed. Also picks up pagination rendering fixes.
PR Review: Add first-class DOCX support with Docxodus microservice and WASM rendererThis is a well-structured PR that adds a complete parallel pipeline for DOCX support. The architecture is sound - offset alignment via a shared library on both sides is clever, and the convert-once / project-many pattern for annotations is a good performance approach. Below are observations organized by severity. Bugs / Correctness Issues1. Chat source ID format mismatch in DocxAnnotatorWrapper In 2. Span-based search fallthrough in useTextSearch.tsx The new span-based search path is in a bare 3. Annotation projection re-runs on selectedChatSourceId change Effect 3 (annotation projection, approximately 56ms) has Security Issues4. CSS injection via label color field In 5. Microservice error response body in exception messages In Performance / Correctness6. No microservice response validation After Code Quality7. Fragile offset coupling in DocxAnnotatorTestWrapper.tsx The 8. annotation_type resolver suggests stale DB data The new 9. sampleDocText export from test wrapper
Minor / Nits
Overall this is solid work. The main items worth addressing before merge are the chat source ID mismatch (item 1), the CSS injection path (item 4), and the span-based search fallthrough (item 2). The rest are improvements rather than blockers. |
…wright CT PaginatedDocument's pagination engine needs a real viewport to compute page dimensions. In Playwright CT, text nodes render with zero bounding rects, making mouse-drag coordinate-based tests impossible. The feature works correctly in the live app. Mark as test.fixme with TODO to re-enable when PaginatedDocument supports headless rendering.
When a selection spans page boundaries in PaginatedDocument, the selected text includes page number artifacts and extra whitespace that don't exist in docText, causing findTextOccurrences to return no matches and silently breaking the annotation menu for all subsequent selections. Fix by resolving both anchor and focus positions via DOM offset computation (getGlobalOffsetFromDomPosition) which walks text nodes and counts characters, giving exact start/end offsets in docText regardless of page boundaries. Fall back to text matching for single- page selections where DOM resolution fails. Also: - Skip page-number elements in the TreeWalker (PaginatedDocument injects these as visible text nodes that aren't part of docText) - Wrap handleMouseUp in try/catch to prevent uncaught errors from corrupting component state and breaking subsequent interactions
|
test-ping-comment-will-delete |
|
PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer Overall this is a well-structured addition. The three-pass render architecture (convert-once then project-many then CSS-only visibility toggle) is thoughtful, and the error classification (transient vs. permanent) in the parser is exactly right. Below are issues ranging from critical to minor. CRITICAL: Version mismatch between microservice image and npm package local.yml pins the microservice to docxodus 5.4.2 (image tag 1.1.0-docxodus5.4.2) but package.json uses docxodus 5.5.3. The entire offset-alignment guarantee rests on both sides using the same version of the Docxodus library. A version skew here will silently produce misaligned annotations in production. The image tag and npm version must match. If 5.5.3 is the intended version, the Docker image needs to be rebuilt and re-tagged, and both the lockfile and image digest should be pinned together so they cannot drift independently. HIGH: Unsanitized label colors injected into CSS In DocxAnnotator.tsx, the customCss memo builds CSS rules from label colors without sanitization. The color.replace strip does not sanitize special characters -- a corpus admin could set a label color to a value that breaks out of the CSS declaration via CSS injection. Add a hex validation guard (e.g. /^[0-9a-fA-F]{3,8}$/) before using the value in CSS. Additionally, hexToRgb(color) is called without null-checking the return value; if color is an empty string or non-hex, hexToRgb may throw or return unexpected values -- add a guard before destructuring. HIGH: handleMouseUp silently falls back to first occurrence on DOM offset failure When the DOM lookup fails (contentEl is null or anchorOffset is null), the annotation is silently created at the first occurrence of the text rather than the one the user selected. Consider bailing out with a user-visible message rather than silently annotating the wrong span. MEDIUM: max_file_size_mb default is a magic number Per CLAUDE.md: no magic numbers -- use constants files. The 50 default for max_file_size_mb should be defined in opencontractserver/constants/ rather than inline in the dataclass. MEDIUM: annotation_type GraphQL resolver returns empty string for invalid data Returning an empty string for invalid/missing annotation types means clients cannot distinguish valid-empty from corrupted data. Returning None instead (the field is already graphene.String which allows null) is more semantically correct. MEDIUM: docxBytesCache entries can be arbitrarily large The LRU cache caps at 3 entries (DOCX_CACHE_MAX_ENTRIES) but places no cap on the size of the stored Uint8Array values. Three 50 MB enterprise contracts would pin 150 MB in browser memory indefinitely. Either add a byte-budget alongside the entry count, or document the intentional trade-off in the constant JSDoc. MINOR: Test fixture offsets are fragile across library version bumps In DocxAnnotatorTestWrapper.tsx, sample annotations hardcode character offsets tied to a specific Docxodus text extraction. If the docxodus npm version changes its extraction output, these offsets will silently break. Consider adding runtime assertions that sampleDocText contains the expected text at each hardcoded offset, similar to the FIRST_THIS_OFFSET/SECOND_THIS_OFFSET sanity checks in the .fixme test. MINOR: Duplicate setDocxBytes(null) calls setDocxBytes(null) is called in three separate unstable_batchedUpdates blocks (PDF onCompleted, TXT onCompleted, and the unmount effect). While correct, extracting this into a reset helper would reduce the risk of missing the reset when a future file type is added. Positive notes
|
CLAUDE.mdThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. Project OverviewOpenContracts is an AGPL-3.0 enterprise document analytics platform for PDFs and text-based formats. It features a Django/GraphQL backend with PostgreSQL + pgvector, a React/TypeScript frontend with Jotai state management, and pluggable document processing pipelines powered by machine learning models. Baseline Commit Rules
Essential CommandsBackend (Django)# Run backend tests (sequential, use --keepdb to speed up subsequent runs)
docker compose -f test.yml run django python manage.py test --keepdb
# Run backend tests in PARALLEL (recommended - ~4x faster)
# Uses pytest-xdist with 4 workers, --dist loadscope keeps class tests together
docker compose -f test.yml run django pytest -n 4 --dist loadscope
# Run parallel tests with auto-detected worker count (uses all CPU cores)
docker compose -f test.yml run django pytest -n auto --dist loadscope
# Run parallel tests with fresh databases (first run or after schema changes)
docker compose -f test.yml run django pytest -n 4 --dist loadscope --create-db
# Run specific test file
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications --keepdb
# Run specific test file in parallel
docker compose -f test.yml run django pytest opencontractserver/tests/test_notifications.py -n 4 --dist loadscope
# Run specific test class/method
docker compose -f test.yml run django python manage.py test opencontractserver.tests.test_notifications.TestNotificationModel.test_create_notification --keepdb
# Apply database migrations
docker compose -f local.yml run django python manage.py migrate
# Create new migration
docker compose -f local.yml run django python manage.py makemigrations
# Django shell
docker compose -f local.yml run django python manage.py shell
# Code quality (runs automatically via pre-commit hooks)
pre-commit run --all-filesFrontend (React/TypeScript)cd frontend
# Start development server (proxies to Django on :8000)
yarn start
# Run unit tests (Vitest) - watches by default
yarn test:unit
# Run component tests (Playwright) - CRITICAL: Use --reporter=list to prevent hanging
yarn test:ct --reporter=list
# Run component tests with grep filter
yarn test:ct --reporter=list -g "test name pattern"
# Run E2E tests
yarn test:e2e
# Coverage report
yarn test:coverage
# Linting and formatting
yarn lint
yarn fix-styles
# Build for production
yarn build
# Preview production build locally
yarn serveProduction Deployment# CRITICAL: Always run migrations FIRST in production
docker compose -f production.yml --profile migrate up migrate
# Then start main services
docker compose -f production.yml upHigh-Level ArchitectureBackend ArchitectureStack: Django 4.x + GraphQL (Graphene) + PostgreSQL + pgvector + Celery Key Patterns:
Frontend ArchitectureStack: React 18 + TypeScript + Apollo Client + Jotai (atoms) + PDF.js + Vite Key Patterns:
Data Flow ArchitectureDocument Processing:
GraphQL Permission Flow:
Critical Security Patterns
Critical Concepts
Testing PatternsManual Test ScriptsLocation: When performing manual testing (e.g., testing migrations, verifying database state, testing API endpoints interactively), always document the test steps in a markdown file under Format: # Test: [Brief description]
## Purpose
What this test verifies.
## Prerequisites
- Required state (e.g., "migration at 0058")
- Required data (e.g., "at least one document exists")
## Steps
1. Step one with exact command
```bash
docker compose -f local.yml run --rm django python manage.py shell -c "..."
Expected Results
CleanupCommands to restore original state if needed. Automated Documentation ScreenshotsLocation: Screenshots for documentation are automatically captured during Playwright component tests and committed back to the PR branch by the How it works:
Naming convention (
At least 2 segments required, 3 recommended. All lowercase alphanumeric with single hyphens. Example: import { docScreenshot } from "./utils/docScreenshot";
// After the component renders and assertions pass:
await docScreenshot(page, "badges--celebration-modal--auto-award");Rules:
Release Screenshots (Point-in-Time)For release notes, use Location: import { releaseScreenshot } from "./utils/docScreenshot";
await releaseScreenshot(page, "v3.0.0.b3", "landing-page", { fullPage: true });Key differences from
When to use which:
Authenticated Playwright Testing (Live Frontend Debugging)When you need to interact with the running frontend as an authenticated user (e.g., debugging why a query returns empty results), use Django admin session cookies to authenticate GraphQL requests. Architecture context: The frontend uses Auth0 for authentication, but the Django backend also accepts session cookie auth. Apollo Client sends GraphQL requests directly to Step 1: Set a password for the superuser (one-time setup): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.auth import get_user_model
User = get_user_model()
u = User.objects.filter(is_superuser=True).first()
u.set_password('testpass123')
u.save()
print(f'Password set for {u.username}')
"Step 2: Playwright script pattern: const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch({ headless: true });
const context = await browser.newContext();
const page = await context.newPage();
// Collect console messages for debugging
const consoleMsgs = [];
page.on('console', msg => consoleMsgs.push('[' + msg.type() + '] ' + msg.text()));
// 1. Login to Django admin to get session cookie
await page.goto('http://localhost:8000/admin/login/');
await page.fill('#id_username', '<superuser-username>');
await page.fill('#id_password', 'testpass123');
await page.click('input[type=submit]');
await page.waitForTimeout(2000);
// 2. Extract the session cookie
const cookies = await context.cookies();
const sessionCookie = cookies.find(c => c.name === 'sessionid');
// 3. Intercept GraphQL requests to inject the session cookie
// (needed because Apollo sends cross-origin requests to :8000)
await page.route('**/graphql/**', async (route) => {
const headers = {
...route.request().headers(),
'Cookie': 'sessionid=' + sessionCookie.value,
};
await route.continue({ headers });
});
// 4. Navigate to the frontend page under test
await page.goto('http://localhost:5173/extracts');
await page.waitForTimeout(5000);
// 5. Inspect results
const bodyText = await page.textContent('body');
console.log(bodyText);
await browser.close();
})();Run from the frontend directory (where cd frontend && node /path/to/script.jsKey details:
Alternative — create a session programmatically (no admin login needed): docker compose -f local.yml exec django python manage.py shell -c "
from django.contrib.sessions.backends.db import SessionStore
from django.contrib.auth import get_user_model
User = get_user_model()
user = User.objects.filter(is_superuser=True).first()
session = SessionStore()
session['_auth_user_id'] = str(user.pk)
session['_auth_user_backend'] = 'django.contrib.auth.backends.ModelBackend'
session['_auth_user_hash'] = user.get_session_auth_hash()
session.save()
print(f'Session key: {session.session_key}')
"Then use the printed session key directly in curl or Playwright route interception. Documentation Locations
Branch StrategyThis project follows trunk-based development:
Changelog MaintenanceIMPORTANT: Always update The changelog follows Keep a Changelog format: ## [Unreleased] - YYYY-MM-DD
### Added
- New features
### Fixed
- Bug fixes with file locations and line numbers
### Changed
- Changes to existing functionality
### Technical Details
- Implementation specifics, architectural notesWhen to update:
What to include:
Pre-commit HooksAutomatically run on commit:
Run manually: Common Pitfalls
|
|
Test comment - please ignore |
|
PR Review: Add first-class DOCX support with Docxodus microservice and WASM renderer. This is a well-architected feature. Core design of using the same Docxodus library on both backend microservice and frontend WASM for offset alignment is sound. Implementation is clean with good test coverage. See below for detailed feedback. SECURITY: [Important] DOMPurify SANITIZE_CONFIG allows arbitrary style attributes. In DocxAnnotator.tsx the config includes style in ADD_ATTR. This preserves inline DOCX styles but also allows arbitrary CSS values including url() references that can load external resources or exfiltrate data. DOMPurify does not strip dangerous url() from inline styles by default. Consider adding a uponSanitizeAttribute hook that strips url( patterns. At minimum document the trade-off. CODE QUALITY: [Minor] context_length = 64 in useTextSearch.tsx should be in constants.ts (per CLAUDE.md no magic numbers). [Minor] CHANGELOG says docxodus@5.5.0 but package.json pins 5.5.3 - these should match. [Minor] docxodus_parser.py reads the entire DOCX into memory before size-checking - for a 50MB file this is ~100MB peak. Consider checking default_storage.size() first. [Minor] _normalize_annotation silently drops annotation_type when invalid - add logger.debug() to aid version drift diagnosis. [Minor] resolve_annotation_type says tolerate invalid DB values without explaining how they got there - a note or issue link would help. [Minor] getSpan in DocxAnnotatorWrapper throws when selectedLabel not found - caught by handleCreateAnnotation so no crash, but returning null and early-returning would be more idiomatic. ARCHITECTURE: [Observation] docxLoadCancelRef pattern works but a comment on intent would help. [Observation] Three separate style elements could be merged. Low priority. TEST COVERAGE: Good overall. getGlobalOffset tests thorough. Backend transient/permanent error classification correct. Disambiguation test correctly marked fixme. [Minor] DocxAnnotatorTestWrapper says offsets pinned to @5.5.0 but package.json is 5.5.3 - verify sampleAnnotation2 offsets 84-100 are still valid. SUMMARY - Three items before merge: 1. Security: DOMPurify style handling - document risk or add url() filter. 2. Offset verification: confirm offsets against 5.5.3. 3. Magic number: context_length to constants. Everything else advisory. |
…unting The DOM offset walker counted ALL text in the PaginatedDocument DOM (including CSS from <style> tags, page numbers, attribute text) producing offsets far beyond docText.length. This made every selection fail the "empty text or zero range" check. Replace with text-matching approach using findTextOccurrences: 1. Clean selected text by stripping page number artifacts (\n1\n → \n) 2. Search cleaned text in docText via findTextOccurrences 3. If no exact match (whitespace differences from cross-page), try whitespace-normalized fuzzy matching 4. Use DOM position only for disambiguation when multiple matches exist Also skip <style> and <title> tags in the TreeWalker (still used for disambiguation), though the primary path no longer relies on DOM offsets.
|
PR Review placeholder - testing gh CLI access |
PR Review: Add first-class DOCX support with Docxodus microservice and WASM rendererThis is a well-structured feature addition with thoughtful architecture. The three-layer rendering pipeline (convert-once/project-many/CSS-toggle) is clever, and the approach of sharing the same Docxodus library between backend and frontend to guarantee offset alignment is a strong design choice. That said, there are some issues that need attention before merge. Critical: Docxodus Version MismatchThe core guarantee of this PR — that backend and frontend character offsets align because they use the same Docxodus library — is broken by a version mismatch:
If Docxodus 5.4.2 produces different character offsets than 5.5.3 (due to whitespace normalization changes, paragraph handling, etc.), annotations created in the backend will not highlight the correct text in the frontend. The Docker image needs to be updated to use docxodus 5.5.3, or the npm package pinned to match the backend. The test comment Bug: Unhandled JSON Parse Error in Backend ParserIn result = response.json()
normalized = self._normalize_response(result)If the microservice returns HTTP 200 with a non-JSON body (e.g. an HTML error page from a proxy), Suggestion: wrap the Security: Unvalidated Color Value in CSS TemplateIn A crafted Code Quality: Magic Number in useTextSearch.tsxconst context_length = 64;Per CLAUDE.md conventions, magic numbers belong in the constants file. This should be a named constant like Minor: Missing .docx Extension Guarantee in Parser Payload
What's Done Well
SummaryThe version mismatch (critical) directly undermines the PR's core guarantee and should be resolved before merge. The JSON parse error (bug) is a reliability gap. The CSS injection (security) is low-severity but a straightforward fix. The remaining items are style/quality notes per CLAUDE.md conventions. |
Summary
This PR adds comprehensive DOCX document support to OpenContracts through a parallel ingestion and rendering pipeline. It introduces a Docxodus .NET microservice for backend parsing and a React WASM component for frontend rendering, enabling native DOCX handling with character-offset-based annotations that align perfectly between backend and frontend.
Key Changes
Backend
opencontractserver/pipeline/parsers/docxodus_parser.py): New REST-based parser that delegates DOCX processing to a containerized .NET microservice. Handles base64 encoding/decoding, timeout/error handling with transient retry logic, and Cloud Run IAM authentication support.opencontractserver/pipeline/thumbnailers/docx_thumbnailer.py): Two-tier thumbnail generation—first attempts to extract embedded thumbnails from DOCX ZIP archives, falls back to text-based rendering.docxodus-service/): Minimal ASP.NET Core Web API wrapping Docxodus'sOpenContractExporter.Export(). Exposes/parseendpoint accepting base64-encoded DOCX files and returningOpenContractDocExportJSON with structural annotations and character offsets.Frontend
frontend/src/components/annotator/renderers/docx/DocxAnnotator.tsx): Core WASM-based DOCX renderer usingconvertDocxToHtmlWithExternalAnnotations()to project server-side annotations onto native DOCX HTML output. Supports text selection for new annotation creation, search highlighting, and chat source integration.frontend/src/components/annotator/components/wrappers/DocxAnnotatorWrapper.tsx): State management wrapper mirroring TxtAnnotatorWrapper pattern to minimize parent component rerenders.DocumentKnowledgeBase.tsxto detect DOCX file types and load DOCX bytes via newgetDocxBytes()API helper.Infrastructure
docxodus-parserservice tolocal.yml,test.yml, andproduction.ymlwith proper dependency ordering.DOCXODUS_PARSER_SERVICE_URLandDOCXODUS_PARSER_TIMEOUTsettings.isDocxFileType()helper andDOCX_MIME_TYPEconstant.docxoduspackage to frontend.Testing
opencontractserver/tests/test_doc_parser_docxodus.py): Comprehensive test coverage for parser success cases, timeout handling, connection errors, and HTTP error responses.frontend/tests/DocxAnnotator.ct.tsx): Playwright component tests for renderer initialization and rendering.frontend/tests/DocxAnnotatorTestWrapper.tsx): Minimal DOCX test fixture with sample annotations and labels.Implementation Details
SpanAnnotationobjects ({start, end}) are identical—no validation or reconciliation needed.SpanAnnotationformat with character offsets, same as TXT documents, enabling consistent annotation handling across file types.