[FEATURE][MCP-SERVER]: Python sample - docx-server #1045
Labels
- enhancement: New feature or request
- mcp-servers: MCP Server Samples
- oic: Open Innovation Community Contributions
- python: Python / backend development (FastAPI)
Description
Overview
Create an MCP Server in Python that provides comprehensive DOCX document manipulation, analysis, and generation capabilities, demonstrating advanced document processing patterns.
Server Specifications
Server Details
- Name: docx-server
- Language: Python 3.11+
- Location: mcp-servers/python/docx_server/
- Complexity: ⭐⭐⭐ Intermediate to Advanced
- Purpose: Demonstrate document processing, generation, and analysis via MCP
Core Features
- Create, read, and modify DOCX documents
- Extract and analyze document content
- Convert between formats
- Apply templates and styles
- Generate reports and documents from data
- Document comparison and merging
Tools Provided
1. read_docx
Extract content and metadata from DOCX files
from dataclasses import dataclass

@dataclass
class ReadDocxRequest:
    file_path: str
    extract_mode: str = "all"  # all, text, metadata, structure, styles
    include_images: bool = True
    include_tables: bool = True
    include_headers_footers: bool = True
    preserve_formatting: bool = False
2. create_docx
Create new DOCX documents from content
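from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class CreateDocxRequest:
    # Required fields must precede fields that have defaults.
    output_path: str
    content: List[Dict[str, Any]]  # Paragraphs, tables, images
    title: str = ""
    template: str = ""  # Template file path (optional)
    # Mutable defaults need default_factory; a bare {} raises ValueError.
    styles: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, str] = field(default_factory=dict)
3. modify_docx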
Edit existing DOCX documents
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class ModifyDocxRequest:
    file_path: str
    operations: List[Dict[str, Any]]  # Add, replace, delete operations
    backup_original: bool = True
    preserve_styles: bool = True
    track_changes: bool = False
4. analyze_docx
Perform document analysis and extraction
from dataclasses import dataclass

@dataclass
class AnalyzeDocxRequest:
    file_path: str
    analysis_type: str = "full"  # full, statistics, readability, structure
    extract_entities: bool = False  # Named entities extraction
    extract_keywords: bool = True
    language: str = "en"
5. convert_docx
Convert DOCX to other formats
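from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ConvertDocxRequest:
    input_path: str
    output_format: str  # pdf, html, markdown, txt, rtf
    output_path: str = ""
    # Mutable defaults need default_factory; a bare {} raises ValueError.
    options: Dict[str, Any] = field(default_factory=dict)
    preserve_layout: bool = True
6. merge_docx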
Merge multiple DOCX documents
from dataclasses import dataclass
from typing import List

@dataclass
class MergeDocxRequest:
    source_files: List[str]
    output_path: str
    merge_mode: str = "sequential"  # sequential, alternate, custom
    preserve_styles: bool = True
    add_page_breaks: bool = True
7. apply_template
Apply data to DOCX templates
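from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ApplyTemplateRequest:
    template_path: str
    output_path: str
    data: Dict[str, Any]  # Template variables
    # Mutable defaults need default_factory; a bare [] raises ValueError.
    repeat_sections: List[Dict[str, Any]] = field(default_factory=list)  # For repeated content
    format_dates: bool = True
    format_numbers: bool = True
8. compare_docx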
Compare two DOCX documents
from dataclasses import dataclass

@dataclass
class CompareDocxRequest:
    file1_path: str
    file2_path: str
    comparison_mode: str = "detailed"  # detailed, summary, track_changes
    ignore_formatting: bool = False
    highlight_changes: bool = True
Implementation Requirements
Directory Structure
mcp-servers/python/docx_server/
├── src/
│   └── docx_server/
│       ├── __init__.py
│       ├── server.py
│       ├── tools/
│       │   ├── __init__.py
│       │   ├── reader.py
│       │   ├── creator.py
│       │   ├── modifier.py
│       │   ├── analyzer.py
│       │   ├── converter.py
│       │   ├── merger.py
│       │   ├── templater.py
│       │   └── comparator.py
│       ├── processors/
│       │   ├── __init__.py
│       │   ├── text_processor.py
│       │   ├── table_processor.py
│       │   ├── image_processor.py
│       │   ├── style_processor.py
│       │   └── metadata_processor.py
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── document_utils.py
│       │   ├── format_converter.py
│       │   └── validation.py
│       └── config.py
├── tests/
│   ├── __init__.py
│   ├── test_tools.py
│   ├── test_processors.py
│   └── fixtures/
│       ├── sample_documents/
│       └── templates/
├── requirements.txt
├── requirements-dev.txt
├── README.md
├── examples/
│   ├── basic_operations.py
│   ├── template_generation.py
│   ├── document_analysis.py
│   └── batch_processing.py
└── .env.example
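One way `server.py` might tie the `tools/` modules together is a small name-to-handler registry. The sketch below is deliberately SDK-agnostic: `TOOLS`, `tool`, and `handle_call` are hypothetical names, the `read_docx` handler is a stub, and a real server would register its handlers through the `mcp` package instead.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to handler callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {}


def tool(name: str):
    """Decorator that registers a handler under the given tool name."""
    def register(func):
        TOOLS[name] = func
        return func
    return register


@tool("read_docx")
def read_docx(file_path: str, extract_mode: str = "all", **_: Any) -> Dict[str, Any]:
    # Stub: a real handler would delegate to tools/reader.py.
    return {"file_path": file_path, "extract_mode": extract_mode}


def handle_call(request: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a `tools/call`-shaped request to the registered handler."""
    params = request.get("params", {})
    name = params.get("name")
    handler = TOOLS.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": handler(**params.get("arguments", {}))}
    except TypeError as exc:
        # Raised, for example, when a required argument is missing.
        return {"error": str(exc)}
```

Keeping dispatch in one place means each module under `tools/` only has to expose a plain function, which also makes the handlers easy to unit test without a running server.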
Dependencies
# requirements.txt
mcp>=1.0.0
python-docx>=1.0.0
pypandoc>=1.11
pdfkit>=1.0.0
markdown>=3.5.0
jinja2>=3.1.0
pillow>=10.0.0
pydantic>=2.5.0
python-dotenv>=1.0.0
textstat>=0.7.3 # Readability analysis
spacy>=3.7.0 # NLP analysis
openpyxl>=3.1.0 # Excel integration
reportlab>=4.0.0 # PDF generation
Configuration
# config.yaml
document_settings:
  max_file_size_mb: 50
  supported_formats:
    input: [docx, doc, rtf, txt, html, markdown]
    output: [docx, pdf, html, markdown, txt, rtf]
  default_encoding: utf-8
processing:
  enable_ocr: false
  ocr_language: "eng"
  enable_nlp: true
  nlp_model: "en_core_web_sm"
templates:
  directory: "./templates"
  variables_prefix: "{{"
  variables_suffix: "}}"
conversion:
  pdf_engine: "pdfkit"  # pdfkit, reportlab, pypandoc
  preserve_images: true
  preserve_tables: true
security:
  scan_for_macros: true
  remove_personal_info: false
  max_processing_time: 300  # seconds
Environment Variables
# .env.example
# Server configuration
DOCX_SERVER_PORT=8001
DOCX_SERVER_HOST=localhost
# File handling
MAX_FILE_SIZE_MB=50
TEMP_DIR=/tmp/docx_server
OUTPUT_DIR=./output
# NLP features (optional)
ENABLE_NLP_ANALYSIS=true
SPACY_MODEL=en_core_web_sm
# Conversion features
ENABLE_PDF_CONVERSION=true
WKHTMLTOPDF_PATH=/usr/local/bin/wkhtmltopdf
# Security
SCAN_FOR_MACROS=true
SANDBOX_MODE=false
Usage Examples
Reading Documents
# Extract all content from DOCX
echo '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "read_docx",
    "arguments": {
      "file_path": "/path/to/document.docx",
      "extract_mode": "all",
      "preserve_formatting": true
    }
  }
}' | docx-server
Creating Documents
# Create new document with content
echo '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "create_docx",
    "arguments": {
      "output_path": "/path/to/output.docx",
      "title": "Report Title",
      "content": [
        {"type": "heading", "text": "Chapter 1", "level": 1},
        {"type": "paragraph", "text": "This is the content..."},
        {"type": "table", "data": [["Header1", "Header2"], ["Cell1", "Cell2"]]}
      ]
    }
  }
}' | docx-server
Template Processing
# Apply data to template
echo '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "apply_template",
    "arguments": {
      "template_path": "/templates/invoice.docx",
      "output_path": "/output/invoice_001.docx",
      "data": {
        "invoice_number": "INV-001",
        "customer_name": "John Doe",
        "amount": 1500.00,
        "items": [
          {"description": "Service A", "quantity": 2, "price": 500},
          {"description": "Service B", "quantity": 1, "price": 500}
        ]
      }
    }
  }
}' | docx-server
Document Analysis
# Analyze document statistics and readability
echo '{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "analyze_docx",
    "arguments": {
      "file_path": "/path/to/document.docx",
      "analysis_type": "full",
      "extract_entities": true,
      "extract_keywords": true
    }
  }
}' | docx-server
Advanced Features
- Smart Content Extraction: Extract specific sections, tables, or images
- Template Engine: Jinja2-based templating with loops and conditionals
- NLP Integration: Entity extraction, sentiment analysis, summarization
- Batch Processing: Process multiple documents in parallel
- Version Control: Track document changes and revisions
- Format Preservation: Maintain styles, formatting, and layout
- Content Validation: Check for required sections and formatting
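The batch-processing feature above could be approached with the standard library's thread pool. This is a minimal sketch under stated assumptions: `process_batch` is a hypothetical helper, and `worker` stands in for whatever per-document operation (read, analyze, convert) the server ends up exposing.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable


def process_batch(
    paths: Iterable[str],
    worker: Callable[[str], Any],
    max_workers: int = 4,
) -> Dict[str, Any]:
    """Run `worker` over every path concurrently.

    Returns a dict mapping each path to the worker's result, or to the
    exception it raised, so one bad document does not abort the batch.
    """
    results: Dict[str, Any] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the documents are processed in parallel.
        futures = {path: pool.submit(worker, path) for path in list(paths)}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                results[path] = exc
    return results
```

Threads suit I/O-bound work such as reading files. For CPU-heavy steps like NLP analysis, swapping in `ProcessPoolExecutor` may be a better fit, provided the worker is picklable.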
Response Format Examples
{
  "document_info": {
    "path": "/path/to/document.docx",
    "title": "Document Title",
    "author": "John Doe",
    "created": "2024-01-15T10:00:00Z",
    "modified": "2024-01-15T14:30:00Z",
    "pages": 15,
    "word_count": 3500,
    "paragraph_count": 42
  },
  "content": {
    "headings": ["Introduction", "Chapter 1", "Conclusion"],
    "tables": 3,
    "images": 5,
    "lists": 8
  },
  "analysis": {
    "readability_score": 65.2,
    "reading_level": "College",
    "keywords": ["analysis", "report", "data"],
    "entities": ["John Doe", "Company Inc", "New York"]
  }
}
Testing Requirements
- Unit tests for all document operations
- Integration tests with sample documents
- Performance tests for large documents
- Template rendering tests
- Format conversion accuracy tests
- Error handling for corrupted files
Acceptance Criteria
- Python MCP server with 8+ document tools
- Full DOCX creation, reading, and modification
- Template engine with variable substitution
- Document analysis and statistics
- Format conversion capabilities
- Document merging and comparison
- NLP-based content analysis
- Batch processing support
- Error handling for various file formats
- Comprehensive test suite (>90% coverage)
- Complete documentation with examples
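To make the variable-substitution criterion concrete, the sketch below fills `{{name}}` placeholders, matching the `{{` and `}}` delimiters from the configuration. It uses a plain regular expression rather than Jinja2, purely to show the expected behavior on a text string. `substitute` is a hypothetical name, and loops, conditionals, and DOCX run handling are out of scope here.

```python
import re
from typing import Any, Mapping

# Matches {{ name }} with optional whitespace inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def substitute(text: str, data: Mapping[str, Any]) -> str:
    """Replace each {{name}} with str(data[name]); unknown names are left as-is."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        return str(data[key]) if key in data else match.group(0)
    return _PLACEHOLDER.sub(replace, text)
```

For example, `substitute("Invoice {{invoice_number}} for {{ customer_name }}", {"invoice_number": "INV-001", "customer_name": "John Doe"})` returns `"Invoice INV-001 for John Doe"`. Leaving unknown placeholders untouched makes missing data visible in the output instead of silently dropping it.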
Priority
Medium - Essential for document automation workflows
Use Cases
- Automated report generation
- Document template processing
- Contract and invoice generation
- Document migration and conversion
- Content extraction and analysis
- Compliance document processing
- Academic paper formatting
- Business correspondence automation