feat: External annotation storage and projection system #57

@JSv4

Description

Summary

Create a system for storing annotations OUTSIDE of the DOCX file, enabling:

  • Annotations to be persisted independently of the document
  • Multiple annotation sets per document
  • Dynamic loading and projection of annotations onto rendered HTML
  • Serialization/deserialization for storage in databases or JSON files

Decoupling annotations from the document enables collaborative annotation, version control of annotations, and combining annotation sets from different sources.

Requirements

External Annotation Format

Define a portable annotation format that can be:

  1. Stored in JSON/database outside the DOCX
  2. Loaded and projected onto rendered HTML
  3. Created from user interactions and persisted
  4. Mapped to/from the in-DOCX bookmark-based format

interface ExternalAnnotationSet {
  // Identifies the source document
  documentId: string;  // Hash, filename, or unique ID
  documentHash?: string;  // SHA256 of document for integrity
  
  // Metadata
  createdAt: string;  // ISO timestamp
  updatedAt: string;
  version: string;  // Format version for migrations
  
  // Label definitions (annotation types/categories)
  labels: AnnotationLabel[];
  
  // The annotations themselves
  annotations: ExternalAnnotation[];
  
  // Optional relationships between annotations
  relationships?: AnnotationRelationship[];
}

interface ExternalAnnotation {
  id: string;
  labelId: string;
  
  // Text targeting - multiple strategies
  target: AnnotationTarget;
  
  // Cached/extracted data
  rawText: string;  // The annotated text
  
  // Metadata
  author?: string;
  createdAt?: string;
  metadata?: Record<string, unknown>;
}

interface AnnotationTarget {
  // Strategy 1: Character offsets (most portable)
  startOffset?: number;
  endOffset?: number;
  
  // Strategy 2: Element-based (uses DocumentStructure IDs)
  elementId?: string;  // e.g., "doc/p-5" or "doc/tbl-0/tr-1/tc-2"
  
  // Strategy 3: Text search (fallback)
  searchText?: string;
  occurrence?: number;
  
  // Strategy 4: Page-based with bounding box (for layout)
  page?: number;
  boundingBox?: BoundingBox;
  
  // Strategy 5: XPath (for advanced users)
  xpath?: string;
}
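
To make the format concrete, here is a rough sketch of one annotation set and a JSON round trip. It rests on several assumptions that the proposal does not settle: the fields of `AnnotationLabel` are hypothetical, `endOffset` is treated as exclusive, `occurrence` is treated as 0-based, and the sample document name and offsets are invented for illustration.

```typescript
// Hypothetical shape: the issue references AnnotationLabel but does not define it.
interface AnnotationLabel {
  id: string;
  name: string;
  color?: string;
}

// Subset of the proposed AnnotationTarget (page/boundingBox/xpath omitted).
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number; // assumed exclusive
  elementId?: string;
  searchText?: string;
  occurrence?: number; // assumed 0-based
}

interface ExternalAnnotation {
  id: string;
  labelId: string;
  target: AnnotationTarget;
  rawText: string;
  author?: string;
  createdAt?: string;
  metadata?: Record<string, unknown>;
}

interface ExternalAnnotationSet {
  documentId: string;
  documentHash?: string;
  createdAt: string;
  updatedAt: string;
  version: string;
  labels: AnnotationLabel[];
  annotations: ExternalAnnotation[];
}

// Invented sample data, for illustration only.
const sampleSet: ExternalAnnotationSet = {
  documentId: "contract-2024.docx",
  createdAt: "2024-01-15T10:00:00Z",
  updatedAt: "2024-01-15T10:00:00Z",
  version: "1.0",
  labels: [{ id: "party", name: "Party", color: "#ffcc00" }],
  annotations: [
    {
      id: "ann-1",
      labelId: "party",
      // Offsets plus a text-search fallback describing the same span.
      target: { startOffset: 27, endOffset: 36, searchText: "Acme Corp", occurrence: 0 },
      rawText: "Acme Corp",
    },
  ],
};

// Plain JSON is enough for storage because every field is JSON-serializable.
const json: string = JSON.stringify(sampleSet, null, 2);
const restored = JSON.parse(json) as ExternalAnnotationSet;
```

Carrying both offsets and `searchText` on the same target is what would let a consumer fall back to text search if the offsets go stale.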

API Design

C# API

// Serialization/Deserialization
public class ExternalAnnotationSerializer
{
    public static string Serialize(ExternalAnnotationSet set);
    public static ExternalAnnotationSet Deserialize(string json);
    
    // Import from DOCX embedded annotations
    public static ExternalAnnotationSet ImportFromDocument(WmlDocument doc);
    
    // Export to DOCX embedded annotations
    public static WmlDocument ExportToDocument(
        WmlDocument doc, 
        ExternalAnnotationSet annotations
    );
}

// Projection onto HTML
public class AnnotationProjector
{
    // Project external annotations onto HTML during conversion
    public static XElement ConvertWithAnnotations(
        WmlDocument doc,
        ExternalAnnotationSet annotations,
        WmlToHtmlConverterSettings settings
    );
}

TypeScript API

// Load and project annotations
export function renderWithExternalAnnotations(
  document: File | Uint8Array,
  annotations: ExternalAnnotationSet,
  options?: ConversionOptions
): Promise<string>;

// Create annotation from user selection
export function createAnnotationFromSelection(
  selection: Selection,
  label: AnnotationLabel,
  documentStructure: DocumentStructure
): ExternalAnnotation;

// Validate annotations against document
export function validateAnnotations(
  document: File | Uint8Array,
  annotations: ExternalAnnotationSet
): Promise<ValidationResult>;

// Types
export interface ExternalAnnotationSet { ... }
export interface ExternalAnnotation { ... }
export interface AnnotationTarget { ... }
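
As an illustration of what projection by character offsets could look like, the sketch below wraps annotated ranges of plain text in `<span>` elements. It is deliberately simplified: a real `renderWithExternalAnnotations` would have to map offsets onto the converter's HTML output rather than a flat string. The names `OffsetAnnotation` and `projectOntoText`, the CSS class, and the `data-*` attributes are all hypothetical, and `endOffset` is assumed to be exclusive.

```typescript
// Hypothetical flattened annotation carrying only offset targeting.
interface OffsetAnnotation {
  id: string;
  labelId: string;
  startOffset: number; // inclusive
  endOffset: number; // exclusive (assumption)
}

function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wraps each annotated range in a <span>. Overlapping, empty, or
// out-of-range annotations are skipped in this sketch.
function projectOntoText(text: string, annotations: OffsetAnnotation[]): string {
  const sorted = [...annotations].sort((a, b) => a.startOffset - b.startOffset);
  let html = "";
  let cursor = 0;
  for (const ann of sorted) {
    const usable =
      ann.startOffset >= cursor &&
      ann.startOffset < ann.endOffset &&
      ann.endOffset <= text.length;
    if (!usable) continue;
    // Emit the unannotated text before this range, then the wrapped range.
    html += escapeHtml(text.slice(cursor, ann.startOffset));
    html +=
      `<span class="annotation" data-annotation-id="${escapeHtml(ann.id)}"` +
      ` data-label-id="${escapeHtml(ann.labelId)}">`;
    html += escapeHtml(text.slice(ann.startOffset, ann.endOffset));
    html += "</span>";
    cursor = ann.endOffset;
  }
  html += escapeHtml(text.slice(cursor));
  return html;
}

const projected = projectOntoText("Payment is due in 30 days.", [
  { id: "ann-1", labelId: "deadline", startOffset: 18, endOffset: 25 },
]);
// projected wraps "30 days" in a span tagged with ann-1 / deadline.
```

Skipping overlaps keeps the sketch short; supporting nested or overlapping annotations would need a different output strategy, which the proposal leaves open.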

React Hooks

// Hook for managing external annotations
export function useExternalAnnotations(
  document: File | Uint8Array | null,
  initialAnnotations?: ExternalAnnotationSet
): {
  annotations: ExternalAnnotationSet | null;
  addAnnotation: (annotation: ExternalAnnotation) => void;
  removeAnnotation: (id: string) => void;
  updateAnnotation: (id: string, updates: Partial<ExternalAnnotation>) => void;
  exportAnnotations: () => string;  // JSON
  importAnnotations: (json: string) => void;
  validateAnnotations: () => Promise<ValidationResult>;
};

// Hook for annotation-aware rendering
export function useAnnotatedDocument(
  document: File | Uint8Array | null,
  annotations: ExternalAnnotationSet | null,
  options?: ConversionOptions
): {
  html: string | null;
  isLoading: boolean;
  error: Error | null;
};
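
One possible way to back `addAnnotation`, `removeAnnotation`, and `updateAnnotation` is a set of pure functions that return a new annotation set on every change, which the hook could then store in React state. The sketch below shows only that state logic, with no React dependency. The trimmed-down types, the `touch` helper, the explicit `now` parameter, and the duplicate-id check are assumptions made for illustration.

```typescript
// Trimmed-down versions of the proposed types, enough for state updates.
interface ExternalAnnotation {
  id: string;
  labelId: string;
  rawText: string;
  author?: string;
}

interface ExternalAnnotationSet {
  documentId: string;
  updatedAt: string;
  annotations: ExternalAnnotation[];
}

// Hypothetical helper: returns a new set with fresh annotations and timestamp.
function touch(
  set: ExternalAnnotationSet,
  annotations: ExternalAnnotation[],
  now: string
): ExternalAnnotationSet {
  return { ...set, annotations, updatedAt: now };
}

function addAnnotation(
  set: ExternalAnnotationSet,
  annotation: ExternalAnnotation,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  if (set.annotations.some((a) => a.id === annotation.id)) {
    throw new Error(`Duplicate annotation id: ${annotation.id}`);
  }
  return touch(set, [...set.annotations, annotation], now);
}

function removeAnnotation(
  set: ExternalAnnotationSet,
  id: string,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  return touch(set, set.annotations.filter((a) => a.id !== id), now);
}

function updateAnnotation(
  set: ExternalAnnotationSet,
  id: string,
  updates: Partial<ExternalAnnotation>,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  // The id is kept fixed so an update cannot silently re-key an annotation.
  const next = set.annotations.map((a) =>
    a.id === id ? { ...a, ...updates, id: a.id } : a
  );
  return touch(set, next, now);
}

const emptySet: ExternalAnnotationSet = {
  documentId: "doc-1",
  updatedAt: "2024-01-01T00:00:00Z",
  annotations: [],
};
const withOne = addAnnotation(
  emptySet,
  { id: "ann-1", labelId: "party", rawText: "Acme Corp" },
  "2024-01-02T00:00:00Z"
);
const relabeled = updateAnnotation(withOne, "ann-1", { labelId: "vendor" }, "2024-01-03T00:00:00Z");
const cleared = removeAnnotation(relabeled, "ann-1", "2024-01-04T00:00:00Z");
```

Returning new objects instead of mutating in place means a state setter sees a changed reference on every edit, and `exportAnnotations` can serialize whatever the current set is.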

Targeting Strategies

Support multiple targeting strategies with fallback:

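  1. Character Offsets (Primary)

    • Most portable across document versions
    • startOffset / endOffset into the full document text
    • Requires text extraction for mapping
  2. Element IDs (Structural)

    • Targets an element by its DocumentStructure ID, e.g. "doc/p-5" or "doc/tbl-0/tr-1/tc-2"
  3. Text Search (Fallback)

    • Locates searchText by occurrence number
    • Works even if document structure changes
    • May fail if the annotated text itself is edited
  4. Page and Bounding Box (Layout)

    • Targets a region by page number and boundingBox
  5. XPath (Advanced)

    • Direct XML path targeting
    • Most precise but fragile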
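
The fallback order could be resolved along the following lines. This sketch covers only character offsets and text search against extracted plain text; element ID, page, and XPath targeting need the document structure and are left out. `resolveTarget` and `ResolvedRange` are hypothetical names. It also assumes that `occurrence` is 0-based, that occurrences do not overlap, and that an offset match is trusted only when it reproduces the cached `rawText`.

```typescript
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number; // assumed exclusive
  searchText?: string;
  occurrence?: number; // assumed 0-based
}

// Hypothetical result type: where the annotation landed and how it was found.
interface ResolvedRange {
  start: number;
  end: number;
  strategy: "offset" | "search";
}

function resolveTarget(
  text: string,
  target: AnnotationTarget,
  rawText: string
): ResolvedRange | null {
  const { startOffset, endOffset, searchText, occurrence = 0 } = target;

  // Strategy 1: offsets, accepted only if they still select the cached text.
  if (
    startOffset !== undefined &&
    endOffset !== undefined &&
    text.slice(startOffset, endOffset) === rawText
  ) {
    return { start: startOffset, end: endOffset, strategy: "offset" };
  }

  // Strategy 3: find the n-th non-overlapping occurrence of searchText.
  if (searchText !== undefined && searchText.length > 0) {
    let index = -1;
    let from = 0;
    for (let i = 0; i <= occurrence; i++) {
      index = text.indexOf(searchText, from);
      if (index === -1) return null; // fewer occurrences than requested
      from = index + searchText.length;
    }
    return { start: index, end: index + searchText.length, strategy: "search" };
  }

  return null; // no usable strategy on this target
}

const documentText = "The Buyer pays. The Buyer signs.";

// Offsets are still valid, so strategy 1 wins.
const fresh = resolveTarget(documentText, { startOffset: 4, endOffset: 9 }, "Buyer");

// Offsets are stale, so resolution falls back to the second occurrence.
const stale = resolveTarget(
  documentText,
  { startOffset: 0, endOffset: 5, searchText: "Buyer", occurrence: 1 },
  "Buyer"
);
```

Checking the offset slice against `rawText` before trusting it is one way to decide when a fallback is needed; without that check, stale offsets would silently highlight the wrong text.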

Validation & Repair

interface ValidationResult {
  valid: boolean;
  warnings: ValidationWarning[];
  errors: ValidationError[];
  repairable: boolean;
}

interface ValidationWarning {
  annotationId: string;
  type: 'text_mismatch' | 'offset_drift' | 'element_not_found';
  message: string;
  suggestedFix?: Partial<AnnotationTarget>;
}
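
A minimal sketch of how offset validation and repair suggestions might work against extracted document text follows. It emits `offset_drift` with a `suggestedFix` when the cached `rawText` can be found elsewhere, and `text_mismatch` when it cannot. The names `StoredAnnotation`, `findNearest`, and `checkOffsets`, the nearest-match heuristic, and the message wording are assumptions, not part of the proposal.

```typescript
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number; // assumed exclusive
}

// Hypothetical minimal annotation shape for validation.
interface StoredAnnotation {
  id: string;
  rawText: string;
  target: AnnotationTarget;
}

interface ValidationWarning {
  annotationId: string;
  type: "text_mismatch" | "offset_drift" | "element_not_found";
  message: string;
  suggestedFix?: Partial<AnnotationTarget>;
}

// Returns the index of the occurrence of `needle` closest to `near`, or -1.
function findNearest(text: string, needle: string, near: number): number {
  if (needle.length === 0) return -1;
  let best = -1;
  let index = text.indexOf(needle);
  while (index !== -1) {
    if (best === -1 || Math.abs(index - near) < Math.abs(best - near)) {
      best = index;
    }
    index = text.indexOf(needle, index + 1);
  }
  return best;
}

function checkOffsets(text: string, annotations: StoredAnnotation[]): ValidationWarning[] {
  const warnings: ValidationWarning[] = [];
  for (const ann of annotations) {
    const { startOffset, endOffset } = ann.target;
    if (startOffset === undefined || endOffset === undefined) continue; // not offset-targeted
    if (text.slice(startOffset, endOffset) === ann.rawText) continue; // still valid

    const found = findNearest(text, ann.rawText, startOffset);
    if (found === -1) {
      warnings.push({
        annotationId: ann.id,
        type: "text_mismatch",
        message: `Annotated text "${ann.rawText}" no longer appears in the document.`,
      });
    } else {
      warnings.push({
        annotationId: ann.id,
        type: "offset_drift",
        message: `Annotated text moved from offset ${startOffset} to ${found}.`,
        suggestedFix: { startOffset: found, endOffset: found + ann.rawText.length },
      });
    }
  }
  return warnings;
}

// "Hello. " was inserted before the original text, shifting "Buyer" by 7.
const editedText = "Hello. The Buyer pays.";
const warnings = checkOffsets(editedText, [
  { id: "ann-1", rawText: "Buyer", target: { startOffset: 4, endOffset: 9 } },
  { id: "ann-2", rawText: "Seller", target: { startOffset: 0, endOffset: 6 } },
]);
```

Picking the occurrence nearest to the old offset is only a heuristic; when the same text repeats close together, a repair step would probably want user confirmation before applying the fix.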

Use Cases

  1. Database Storage: Store annotations in PostgreSQL/MongoDB and load them on demand
  2. Collaborative Annotation: Multiple users annotate the same document independently
  3. Annotation Versioning: Track annotation changes over time
  4. ML Pipeline: Export annotations for training, import predictions
  5. Multi-document Sets: Apply the same annotation schema across a document corpus

Testing Requirements

Serialization Tests

  • Round-trip: serialize → deserialize preserves all data
  • Import from DOCX embedded → external format
  • Export to DOCX embedded from external format

Projection Tests

  • Character offset targeting works
  • Element ID targeting works
  • Text search fallback works
  • Multiple targeting strategies in same set

Validation Tests

  • Detect missing elements
  • Detect text mismatches
  • Suggest repairs for drift
  • Handle document modifications gracefully

Integration Tests

Dependencies

Acceptance Criteria

  1. External annotation format is fully specified and documented
  2. Annotations can be stored as JSON outside DOCX
  3. Annotations can be projected onto rendered HTML
  4. Annotations can be created from user selections
  5. Validation detects and suggests fixes for stale annotations
  6. Works with existing in-DOCX annotation system
  7. TypeScript types exported for consumers
