Summary
Create a system for storing annotations OUTSIDE of the DOCX file, enabling:
- Annotations to be persisted independently of the document
- Multiple annotation sets per document
- Dynamic loading and projection of annotations onto rendered HTML
- Serialization/deserialization for storage in databases or JSON files
This decouples annotations from the document, enabling collaborative annotation, version control of annotations, and annotation sets from different sources.
Requirements
External Annotation Format
Define a portable annotation format that can be:
- Stored in JSON/database outside the DOCX
- Loaded and projected onto rendered HTML
- Created from user interactions and persisted
- Mapped to/from the in-DOCX bookmark-based format
interface ExternalAnnotationSet {
  // Identifies the source document
  documentId: string; // Hash, filename, or unique ID
  documentHash?: string; // SHA256 of document for integrity

  // Metadata
  createdAt: string; // ISO timestamp
  updatedAt: string;
  version: string; // Format version for migrations

  // Label definitions (annotation types/categories)
  labels: AnnotationLabel[];

  // The annotations themselves
  annotations: ExternalAnnotation[];

  // Optional relationships between annotations
  relationships?: AnnotationRelationship[];
}

interface ExternalAnnotation {
  id: string;
  labelId: string;

  // Text targeting - multiple strategies
  target: AnnotationTarget;

  // Cached/extracted data
  rawText: string; // The annotated text

  // Metadata
  author?: string;
  createdAt?: string;
  metadata?: Record<string, unknown>;
}

interface AnnotationTarget {
  // Strategy 1: Character offsets (most portable)
  startOffset?: number;
  endOffset?: number;

  // Strategy 2: Element-based (uses DocumentStructure IDs)
  elementId?: string; // e.g., "doc/p-5" or "doc/tbl-0/tr-1/tc-2"

  // Strategy 3: Text search (fallback)
  searchText?: string;
  occurrence?: number;

  // Strategy 4: Page-based with bounding box (for layout)
  page?: number;
  boundingBox?: BoundingBox;

  // Strategy 5: XPath (for advanced users)
  xpath?: string;
}
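To make the format concrete, here is a minimal sketch of one annotation set and its JSON round trip. The interfaces are trimmed copies of the ones above. `AnnotationLabel` is not defined in this issue, so its `id` and `name` fields are assumptions, as are all the example values and the 0-based reading of `occurrence`.

```typescript
// Trimmed copies of the interfaces above so the sketch compiles on its own.
// AnnotationLabel is not defined in this issue; its fields here are assumed.
interface AnnotationLabel {
  id: string;
  name: string;
}

interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number;
  searchText?: string;
  occurrence?: number; // assumed 0-based; the spec does not say
}

interface ExternalAnnotation {
  id: string;
  labelId: string;
  target: AnnotationTarget;
  rawText: string;
  author?: string;
}

interface ExternalAnnotationSet {
  documentId: string;
  createdAt: string;
  updatedAt: string;
  version: string;
  labels: AnnotationLabel[];
  annotations: ExternalAnnotation[];
}

// Hypothetical example data; IDs, offsets, and text are made up.
const exampleSet: ExternalAnnotationSet = {
  documentId: "contract-2024.docx",
  createdAt: "2024-01-15T10:00:00Z",
  updatedAt: "2024-01-15T10:00:00Z",
  version: "1.0",
  labels: [{ id: "party", name: "Contracting Party" }],
  annotations: [
    {
      id: "ann-1",
      labelId: "party",
      // Offsets plus search text, so a consumer can fall back if offsets drift.
      target: { startOffset: 120, endOffset: 129, searchText: "Acme Corp", occurrence: 0 },
      rawText: "Acme Corp",
      author: "reviewer-1",
    },
  ],
};

// Plain JSON round trip: the set contains only JSON-safe values.
const json = JSON.stringify(exampleSet, null, 2);
const restored = JSON.parse(json) as ExternalAnnotationSet;
```

Storing both offsets and search text in one target is a design choice in this example, not a requirement of the format; it gives a consumer a second strategy to try when the first no longer matches.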
API Design
C# API
// Serialization/Deserialization
public class ExternalAnnotationSerializer
{
    public static string Serialize(ExternalAnnotationSet set);
    public static ExternalAnnotationSet Deserialize(string json);

    // Import from DOCX embedded annotations
    public static ExternalAnnotationSet ImportFromDocument(WmlDocument doc);

    // Export to DOCX embedded annotations
    public static WmlDocument ExportToDocument(
        WmlDocument doc,
        ExternalAnnotationSet annotations
    );
}

// Projection onto HTML
public class AnnotationProjector
{
    // Project external annotations onto HTML during conversion
    public static XElement ConvertWithAnnotations(
        WmlDocument doc,
        ExternalAnnotationSet annotations,
        WmlToHtmlConverterSettings settings
    );
}
TypeScript API
// Load and project annotations
export function renderWithExternalAnnotations(
  document: File | Uint8Array,
  annotations: ExternalAnnotationSet,
  options?: ConversionOptions
): Promise<string>;

// Create annotation from user selection
export function createAnnotationFromSelection(
  selection: Selection,
  label: AnnotationLabel,
  documentStructure: DocumentStructure
): ExternalAnnotation;

// Validate annotations against document
export function validateAnnotations(
  document: File | Uint8Array,
  annotations: ExternalAnnotationSet
): Promise<ValidationResult>;

// Types
export interface ExternalAnnotationSet { ... }
export interface ExternalAnnotation { ... }
export interface AnnotationTarget { ... }
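One possible core for `createAnnotationFromSelection` is sketched below. It assumes the DOM `Selection` has already been mapped to character offsets in the extracted document text, which is the harder part and is not shown. The helper name `buildAnnotation`, its parameters, and the 0-based `occurrence` convention are hypothetical.

```typescript
// Trimmed types so the sketch stands alone.
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number;
  searchText?: string;
  occurrence?: number; // assumed 0-based
}

interface ExternalAnnotation {
  id: string;
  labelId: string;
  target: AnnotationTarget;
  rawText: string;
  createdAt?: string;
}

// Hypothetical helper: build an annotation from a resolved character range.
// Records offsets and a text-search fallback in the same target.
function buildAnnotation(
  fullText: string,
  start: number,
  end: number,
  labelId: string,
  id: string
): ExternalAnnotation {
  if (start < 0 || end > fullText.length || start >= end) {
    throw new RangeError(`Invalid range ${start}..${end}`);
  }
  const rawText = fullText.slice(start, end);

  // Count how many times the same text appears before the selected range.
  let occurrence = 0;
  let i = fullText.indexOf(rawText);
  while (i !== -1 && i < start) {
    occurrence++;
    i = fullText.indexOf(rawText, i + 1);
  }

  return {
    id,
    labelId,
    target: { startOffset: start, endOffset: end, searchText: rawText, occurrence },
    rawText,
    createdAt: new Date().toISOString(),
  };
}
```

For example, `buildAnnotation("The cat sat. The cat ran.", 17, 20, "animal", "ann-1")` selects the second "cat" and records `occurrence: 1` alongside the offsets.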
React Hooks
// Hook for managing external annotations
export function useExternalAnnotations(
  document: File | Uint8Array | null,
  initialAnnotations?: ExternalAnnotationSet
): {
  annotations: ExternalAnnotationSet | null;
  addAnnotation: (annotation: ExternalAnnotation) => void;
  removeAnnotation: (id: string) => void;
  updateAnnotation: (id: string, updates: Partial<ExternalAnnotation>) => void;
  exportAnnotations: () => string; // JSON
  importAnnotations: (json: string) => void;
  validateAnnotations: () => Promise<ValidationResult>;
};

// Hook for annotation-aware rendering
export function useAnnotatedDocument(
  document: File | Uint8Array | null,
  annotations: ExternalAnnotationSet | null,
  options?: ConversionOptions
): {
  html: string | null;
  isLoading: boolean;
  error: Error | null;
};
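The mutators exposed by `useExternalAnnotations` could be backed by pure update functions like the ones sketched here, which a React state setter can call directly. The function names mirror the hook members, but the immutable-update approach, the `updatedAt` bump, and the extra `now` parameter are assumptions, not part of the proposed API.

```typescript
// Trimmed types so the sketch stands alone; see the full interfaces above.
interface ExternalAnnotation {
  id: string;
  labelId: string;
  rawText: string;
}

interface ExternalAnnotationSet {
  updatedAt: string;
  annotations: ExternalAnnotation[];
}

// Each function returns a new set instead of mutating, which suits React state.
// The `now` parameter is an assumption added to keep the functions testable.
function addAnnotation(
  set: ExternalAnnotationSet,
  annotation: ExternalAnnotation,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  return { ...set, updatedAt: now, annotations: [...set.annotations, annotation] };
}

function removeAnnotation(
  set: ExternalAnnotationSet,
  id: string,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  return { ...set, updatedAt: now, annotations: set.annotations.filter((a) => a.id !== id) };
}

function updateAnnotation(
  set: ExternalAnnotationSet,
  id: string,
  updates: Partial<ExternalAnnotation>,
  now: string = new Date().toISOString()
): ExternalAnnotationSet {
  return {
    ...set,
    updatedAt: now,
    // Keep the original id even if `updates` tries to change it.
    annotations: set.annotations.map((a) => (a.id === id ? { ...a, ...updates, id: a.id } : a)),
  };
}
```

Inside the hook, each mutator would then reduce to something like `setAnnotations((prev) => prev && addAnnotation(prev, annotation))`.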
Targeting Strategies
Support multiple targeting strategies with fallback:
- Character Offsets (Primary)
  - Most portable across document versions
  - startOffset / endOffset into full document text
  - Requires text extraction for mapping
- Element IDs (Structural)
  - DocumentStructure element IDs (doc/p-5, etc.)
- Text Search (Fallback)
  - Find text and occurrence number
  - Works even if document structure changes
  - May fail if text is edited
- XPath (Advanced)
  - Direct XML path targeting
  - Most precise but fragile
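The fallback order can be sketched as a resolver that tries character offsets first and drops to text search when the offsets no longer point at the annotated text. This covers only two of the strategies. The helper name `resolveTarget`, the `ResolvedRange` type, and the 0-based `occurrence` convention are assumptions for illustration.

```typescript
// Subset of AnnotationTarget covering the two strategies this sketch handles.
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number;
  searchText?: string;
  occurrence?: number; // assumed 0-based here; the spec does not say
}

interface ResolvedRange {
  start: number;
  end: number;
  strategy: "offsets" | "search";
}

// Hypothetical helper: resolve a target against the extracted document text.
// Tries character offsets first, then falls back to text search.
function resolveTarget(
  fullText: string,
  target: AnnotationTarget,
  rawText: string
): ResolvedRange | null {
  const { startOffset, endOffset } = target;
  if (
    startOffset !== undefined &&
    endOffset !== undefined &&
    startOffset >= 0 &&
    startOffset < endOffset &&
    endOffset <= fullText.length &&
    fullText.slice(startOffset, endOffset) === rawText
  ) {
    return { start: startOffset, end: endOffset, strategy: "offsets" };
  }

  const needle = target.searchText ?? rawText;
  if (needle.length === 0) return null;

  // Walk forward to the requested occurrence of the search text.
  let index = -1;
  const wanted = target.occurrence ?? 0;
  for (let i = 0; i <= wanted; i++) {
    index = fullText.indexOf(needle, index + 1);
    if (index === -1) return null;
  }
  return { start: index, end: index + needle.length, strategy: "search" };
}
```

Comparing the offset slice against `rawText` before trusting it is what makes the fallback safe: stale offsets that still land inside the document are rejected rather than silently highlighting the wrong text.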
Validation & Repair
interface ValidationResult {
  valid: boolean;
  warnings: ValidationWarning[];
  errors: ValidationError[];
  repairable: boolean;
}

interface ValidationWarning {
  annotationId: string;
  type: 'text_mismatch' | 'offset_drift' | 'element_not_found';
  message: string;
  suggestedFix?: Partial<AnnotationTarget>;
}
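As a rough illustration of how a warning with a `suggestedFix` might be produced, the sketch below checks a single offset-targeted annotation against the extracted text. The helper `checkOffsets` and the `OffsetAnnotation` type are hypothetical. Reading `offset_drift` as "text still present but moved" and `text_mismatch` as "text no longer found" is an interpretation, since the issue does not define the warning types.

```typescript
interface AnnotationTarget {
  startOffset?: number;
  endOffset?: number;
}

interface ValidationWarning {
  annotationId: string;
  type: "text_mismatch" | "offset_drift" | "element_not_found";
  message: string;
  suggestedFix?: Partial<AnnotationTarget>;
}

// Minimal annotation shape needed for this check (hypothetical).
interface OffsetAnnotation {
  id: string;
  rawText: string;
  target: AnnotationTarget;
}

// Returns null when the offsets still point at rawText, otherwise a warning.
function checkOffsets(fullText: string, ann: OffsetAnnotation): ValidationWarning | null {
  const { startOffset, endOffset } = ann.target;
  if (startOffset === undefined || endOffset === undefined) return null;
  if (fullText.slice(startOffset, endOffset) === ann.rawText) return null;

  // Look for the occurrence of rawText closest to the original start offset.
  let best = -1;
  if (ann.rawText.length > 0) {
    for (
      let i = fullText.indexOf(ann.rawText);
      i !== -1;
      i = fullText.indexOf(ann.rawText, i + 1)
    ) {
      if (best === -1 || Math.abs(i - startOffset) < Math.abs(best - startOffset)) best = i;
    }
  }

  if (best !== -1) {
    return {
      annotationId: ann.id,
      type: "offset_drift",
      message: `Text moved from offset ${startOffset} to ${best}`,
      suggestedFix: { startOffset: best, endOffset: best + ann.rawText.length },
    };
  }
  return {
    annotationId: ann.id,
    type: "text_mismatch",
    message: "Annotated text no longer appears in the document",
  };
}
```

Under this reading, a `ValidationResult` could set `repairable` to true when every warning carries a `suggestedFix`, though that rule is also an assumption.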
Use Cases
- Database Storage: Store annotations in PostgreSQL/MongoDB, load on demand
- Collaborative Annotation: Multiple users annotate same document independently
- Annotation Versioning: Track annotation changes over time
- ML Pipeline: Export annotations for training, import predictions
- Multi-document Sets: Same annotation schema across document corpus
Testing Requirements
Serialization Tests
Projection Tests
Validation Tests
Integration Tests
Dependencies
Acceptance Criteria
- External annotation format is fully specified and documented
- Annotations can be stored as JSON outside DOCX
- Annotations can be projected onto rendered HTML
- Annotations can be created from user selections
- Validation detects and suggests fixes for stale annotations
- Works with existing in-DOCX annotation system
- TypeScript types exported for consumers