Skip to content

fix(skills): harden XML parsing and blob writes in powerpoint extract#1053

Merged
WilliamBerryiii merged 7 commits intomainfrom
fix/pptx-security-xxe-blob-validation-1014-1016
Mar 19, 2026
Merged

fix(skills): harden XML parsing and blob writes in powerpoint extract#1053
WilliamBerryiii merged 7 commits intomainfrom
fix/pptx-security-xxe-blob-validation-1014-1016

Conversation

@WilliamBerryiii
Copy link
Copy Markdown
Member

@WilliamBerryiii WilliamBerryiii commented Mar 15, 2026

Description

Hardens the PowerPoint content extraction skill against two classes of security vulnerability:

  1. XXE entity expansion (security(powerpoint): lxml XXE vector in extract_content.py theme color resolution #1014) — _resolve_theme_colors() now creates an XMLParser with resolve_entities=False and no_network=True, preventing external entity injection and network-based SSRF. Malformed XML is caught and logged rather than raising.

  2. Unrestricted blob writes (security(powerpoint): untrusted blob writes in extract_content.py image extraction #1016) — _save_image_blob() is rewritten with defense-in-depth:

    • Content-type allowlist limits output to bmp, gif, jpeg, png, tiff, and x-wmf
    • 100 MB size cap prevents decompression-bomb writes
    • Path containment via os.path.realpath blocks directory-traversal attacks
    • WMF magic-byte validation rejects blobs that claim image/x-wmf but lack a valid Aldus placeable or standard WMF header prefix (historical CVE-2005-4560 mitigation)

Both fixes follow a fail-closed design: invalid inputs are rejected or logged, never silently passed through.

Related Issue(s)

Closes #1014
Closes #1016

Type of Change

Select all that apply:

Code & Documentation:

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature causing existing functionality to change)
  • Documentation update

Infrastructure & Configuration:

  • GitHub Actions workflow
  • Linting configuration (markdown, PowerShell, etc.)
  • Security configuration
  • DevContainer configuration
  • Dependency update

AI Artifacts:

  • Reviewed contribution with prompt-builder agent and addressed all feedback
  • Copilot instructions (.github/instructions/*.instructions.md)
  • Copilot prompt (.github/prompts/*.prompt.md)
  • Copilot agent (.github/agents/*.agent.md)
  • Copilot skill (.github/skills/*/SKILL.md)

Other:

  • Script/automation (.ps1, .sh, .py)
  • Other (please describe):

Testing

  • 9 new regression tests added to test_extract_content.py covering every hardening layer:
    • Delegation: extract_image correctly delegates to _save_image_blob
    • Content-type allowlist: parametrized across 5 allowed and 5 rejected MIME types
    • Size limit: boundary tests at exactly 100 MB and 100 MB + 1
    • Path containment: traversal attempt (../escape.png) is rejected
    • WMF magic bytes: valid Aldus placeable, valid standard prefix, invalid magic, and too-short blob
    • XXE prevention: entity expansion attempt raises no error and produces no expansion
    • Predefined entities: &, <, etc. are preserved correctly
  • Full test suite: 591 tests passing

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links (12 pre-existing broken links in unrelated files; 0 in changed files)
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate (no plugin changes in this PR)

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

Additional Notes

  • The no_network=True flag on XMLParser is an additional hardening measure that blocks SSRF attempts via DTD fetches, complementing the resolve_entities=False defense.
  • WMF validation addresses the historical CVE-2005-4560 attack vector where crafted WMF files could trigger arbitrary code execution.
  • All guards are fail-closed: unrecognized content types, oversized blobs, path escapes, and invalid WMF headers all result in ValueError with descriptive messages.

@WilliamBerryiii WilliamBerryiii requested a review from a team as a code owner March 15, 2026 21:17
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 15, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

- add XMLParser(resolve_entities=False) to _resolve_theme_colors for XXE prevention
- rewrite _save_image_blob with content-type allowlist, size limit, path containment
- refactor extract_image to delegate blob writes through secured helper
- add 9 regression tests covering XXE and blob-write vectors
- gitignore .hypothesis test artifact directories

🔒 - Generated by Copilot
@WilliamBerryiii WilliamBerryiii force-pushed the fix/pptx-security-xxe-blob-validation-1014-1016 branch from 7dc2faa to efd34b0 Compare March 15, 2026 21:36
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.12%. Comparing base (f85fe02) to head (b13b881).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1053      +/-   ##
==========================================
+ Coverage   88.01%   88.12%   +0.10%     
==========================================
  Files          45       45              
  Lines        7886     7931      +45     
==========================================
+ Hits         6941     6989      +48     
+ Misses        945      942       -3     
Flag Coverage Δ
pester 86.91% <ø> (ø)
pytest 90.72% <100.00%> (+0.28%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...experimental/powerpoint/scripts/extract_content.py 83.88% <100.00%> (+1.74%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@katriendg katriendg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The XXE and path-traversal hardening is solid, and the test coverage is thorough. One blocker needs to be resolved before merge, plus two smaller items below.


🔴 Blocker — EMF images missing from allowlist (extract_content.py lines ~42–51)

_CONTENT_TYPE_TO_EXT omits image/emf and image/x-emf (Enhanced Metafile Format). This is a hard regression for real-world PPTX files. python-pptx's own source explicitly notes:

"Images produced by PowerPoint itself are generally EMF"

Charts, SmartArt, and diagrams exported from PowerPoint commonly embed EMF blobs. With this PR, any such file will crash the entire extraction run with:

ValueError: Unsupported image content type: image/x-emf

The error propagates entirely uncaught: _save_image_blobextract_image_extract_shape_by_typeextract_slide (catches only AttributeError, TypeError) → main(). A single EMF shape anywhere in a presentation aborts all output.

Suggested fix — add both EMF MIME types to the allowlist, with optional magic-byte validation (EMF header: \x01\x00\x00\x00 at offset 0, b" EMF" at offset 40):

_CONTENT_TYPE_TO_EXT: dict[str, str] = {
    "image/bmp": "bmp",
    "image/emf": "emf",       # add
    "image/gif": "gif",
    "image/jpeg": "jpg",
    "image/png": "png",
    "image/tiff": "tiff",
    "image/x-emf": "emf",    # add — common python-pptx MIME type for EMF
    "image/x-wmf": "wmf",
}

The rejected-type test already includes "image/x-emf" as an expected failure, which confirms the omission — that test entry should be removed once EMF is added to the allowlist.


⚠️ Medium — Unguarded ValueError crashes full extraction on any validation failure (extract_content.py lines ~114–129)

All four guard clauses in _save_image_blob raise ValueError with no upstream handler. For the content-type and size cases this means a single malformed-but-benign image (e.g., an oversized embedded asset or an unusual MIME type) aborts the whole extraction. The fail-closed design is correct for the path-traversal and WMF-magic cases (active security threats), but for the size and type cases a log-and-skip pattern would be more resilient without weakening the security boundary.

Consider catching ValueError in _extract_shape_by_type (or extract_image) and returning None with a logged warning for the non-security cases, or documenting the hard-crash behaviour as intentional in the function docstring so callers know they must handle it.


💡 Low — Duplicate XXE test coverage (test_extract_content.py lines ~1073 and ~1099)

test_xxe_entity_not_resolved and test_resolve_theme_colors_xxe_payload_blocked exercise the exact same code path (_resolve_theme_colors with a SYSTEM entity), differing only in the target path (/dev/null vs /etc/passwd). One of the two is redundant. Consolidating them into a single parametrized test would sharpen intent without losing coverage.


The branch is also behind main and will need a rebase before merge.

@katriendg
Copy link
Copy Markdown
Contributor

@WilliamBerryiii I did a full automated review here, including posting the comments :) YOLO - I believe the 'blocker' one is accurate.

@WilliamBerryiii
Copy link
Copy Markdown
Member Author

@WilliamBerryiii I did a full automated review here, including posting the comments :) YOLO - I believe the 'blocker' one is accurate.

Ah ha ha ha ha ... this is solid ... I was testing with one of the decks made by the tool which was using wmf and I neglected to test with a prebuilt standard PPTX deck where the blocker would have surfaced. Addressing now.

- add image/emf and image/x-emf to allowlist with magic-byte validation
- replace bare ValueError with _ImageSecurityError subclass
- consolidate duplicate XXE tests into parametrized coverage
- add SVG sanitization via hardened XMLParser with DTD rejection
- add cairosvg SVG-to-PNG conversion with 7 new tests

🔒 - Generated by Copilot
@WilliamBerryiii
Copy link
Copy Markdown
Member Author

@katriendg All three items from your review have been addressed in 476bca6. Here's the breakdown:


🔴 Blocker — EMF images missing from allowlist

Resolved. Added image/emf and image/x-emf to _CONTENT_TYPE_TO_EXT with a new _validate_emf_magic_bytes() guard that verifies the EMF header (\x01\x00\x00\x00 at offset 0) before accepting the blob. Removed the image/x-emf entry from the rejected-type test since it's now in the allowlist.


⚠️ Medium — Unguarded ValueError crashes full extraction

Resolved. Introduced _ImageSecurityError(ValueError) as a dedicated subclass for security-critical validation failures (path traversal, WMF magic-byte mismatch). extract_image now catches plain ValueError with log-and-skip behavior for benign cases (unsupported type, oversize blob) while letting _ImageSecurityError propagate uncaught to callers. This preserves fail-closed semantics for the security boundary without aborting extraction on non-threatening validation failures.


💡 Low — Duplicate XXE test coverage

Resolved. Consolidated test_xxe_entity_not_resolved and test_resolve_theme_colors_xxe_payload_blocked into a single parametrized test_xxe_payload_blocked covering multiple entity-injection paths. Removed the redundant test.


Also included in this push: SVG-to-PNG extraction support via cairosvg with XML sanitization through the hardened XMLParser and explicit DTD-declaration rejection. 7 new tests added. 608 total tests passing, lint clean.

Copy link
Copy Markdown
Contributor

@katriendg katriendg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, also for the notes!

WilliamBerryiii added a commit that referenced this pull request Mar 20, 2026
🤖 I have created a release *beep* *boop*
---


##
[3.2.0](hve-core-v3.1.46...hve-core-v3.2.0)
(2026-03-20)


### ✨ Features

* add -OutputPath parameter to Validate-MarkdownFrontmatter.ps1
([#1134](#1134))
([fdf1bcf](fdf1bcf)),
closes [#1006](#1006)
* add action version consistency scan workflow
([#1127](#1127))
([4229df1](4229df1))
* **agent:** MVE Experiment Designer
([#976](#976))
([70f86ca](70f86ca))
* **agents:** add ADO Backlog Manager orchestrator agent
([#800](#800))
([fae3987](fae3987))
* **agents:** add meeting analyst agent for transcript analysis using
work-iq ([#502](#502))
([5345b5b](5345b5b))
* **agents:** add quick-reference line to RPI Phase 5 suggestions
([#897](#897))
([9a90f39](9a90f39))
* **agents:** add RAI Planner, enhance SSSC Planner, and redesign
Security Planner
([#979](#979))
([06f826c](06f826c))
* **agents:** add symmetric cross-system handoff to GitHub Backlog
Manager ([#952](#952))
([ba34a35](ba34a35))
* **agents:** Functional Code Review Agent — pre-PR functional
correctness reviewer
([#733](#733))
([9cf63b7](9cf63b7))
* **build:** add Python extensions and uv 0.10.8 to devcontainer
([#920](#920))
([9ca0579](9ca0579))
* **build:** add uv ecosystem to Dependabot configuration
([#913](#913))
([2a4bd39](2a4bd39))
* **build:** enable npm pinning enforcement in dependency scan
([#838](#838))
([4e9e31f](4e9e31f))
* **build:** migrate attestation actions to v4.1.0 and add SBOM
verification docs
([#841](#841))
([ca1e65b](ca1e65b))
* **collections:** add four new validator checks (orphan, duplicate,
companion, coverage)
([#869](#869))
([1a96b73](1a96b73))
* **devcontainer,security:** add enterprise artifact hub configuration
([#1032](#1032))
([1d56d25](1d56d25))
* **docs:** add Rust coding standards and guidelines
([#809](#809))
([d4c4899](d4c4899))
* **extension:** add Microsoft logo icon to VS Code Marketplace listings
([#906](#906))
([82aca41](82aca41))
* **github:** add declarative label management
([#953](#953))
([a1a6845](a1a6845))
* **instructions:** add ADO backlog shared infrastructure
([#786](#786))
([1914078](1914078))
* **instructions:** add ADO backlog sprint planning and capacity
tracking ([#788](#788))
([d6fb77d](d6fb77d))
* **instructions:** add ADO triage workflow and prompt
([#787](#787))
([cde0190](cde0190))
* **instructions:** add shared story quality conventions and sprint
planning ([#803](#803))
([a2f18e3](a2f18e3))
* **prompts:** add ADO discovery and work item prompts with agent
routing ([#790](#790))
([7e74523](7e74523))
* **prompts:** add security review prompts
([#1118](#1118))
([ad30967](ad30967))
* **scripts:** add dynamic Python skill discovery for lint/test
([#957](#957))
([0a90f57](0a90f57))
* **scripts:** add Get-StandardTimestamp utility to CIHelpers module
([#1126](#1126))
([b273a4b](b273a4b))
* **scripts:** add Python copyright header validation
([#905](#905))
([67df902](67df902))
* **scripts:** add Python skill support to Validate-SkillStructure
([#903](#903))
([68479d9](68479d9))
* **scripts:** add workflow npm command scanning to dependency pinning
([#837](#837))
([6b5ae06](6b5ae06))
* **security:** add basic security reviewer agent with owasp skills
([#1008](#1008))
([cb1fd05](cb1fd05))
* **security:** add sigstore attestation bundles and fix
component-detection action
([#1148](#1148))
([f79c272](f79c272))
* **skills:** add Atheris fuzz harness with CI workflow integration
([#1102](#1102))
([d337e1d](d337e1d))
* **skills:** add PowerPoint automation skill with YAML-driven deck
generation ([#868](#868))
([00465cd](00465cd))
* **skills:** convert hve-core-installer agent to self-contained skill
([#846](#846))
([1d821fb](1d821fb))
* **skills:** enhance pr-reference skill with flexible filtering and
base branch detection
([#1095](#1095))
([26a32ea](26a32ea))
* **workflows:** add devcontainer infrastructure change log workflow
([#899](#899))
([8aca446](8aca446))
* **workflows:** add milestone auto-close on stable and pre-release
publishes ([#834](#834))
([79362b1](79362b1))
* **workflows:** add ms.date documentation freshness checking
([#969](#969))
([3ed441c](3ed441c))
* **workflows:** add Python linting CI workflow with Ruff
([#951](#951))
([f89f0eb](f89f0eb))
* **workflows:** add Python testing CI workflow with pytest and Codecov
([#934](#934))
([5e8306f](5e8306f))
* **workflows:** add uv and Python package sync to copilot-setup-steps
([#921](#921))
([45d517d](45d517d))


### 🐛 Bug Fixes

* **build:** override Linguist vendored flag for Python skill files
([#1155](#1155))
([0eee5b6](0eee5b6))
* **build:** override serialize-javascript to &gt;=7.0.3 for RCE fix
([#876](#876))
([e49039a](e49039a))
* **build:** resolve Pinned-Dependencies alerts for vsce npm commands in
extension workflows
([#782](#782))
([89dad9d](89dad9d))
* **build:** update undici and yauzl overrides for security audit
([#1030](#1030))
([2c2f92f](2c2f92f))
* **docs:** add CLI Plugins to install.md navigation surfaces
([#902](#902))
([79d6595](79d6595))
* **docs:** add sidebar ordering for Design Thinking documentation
([#832](#832))
([551fddc](551fddc)),
closes [#830](#830)
* **docs:** graduate design-thinking to preview and correct stale
collection references
([#831](#831))
([5110e35](5110e35))
* **docs:** include project-planning in UX Designer install guidance
([#908](#908))
([e7aa9bc](e7aa9bc))
* **docs:** remediate writing-style convention violations
([#865](#865))
([68b04bc](68b04bc))
* **docs:** remove draft content announcement banner
([#825](#825))
([b45de80](b45de80))
* **docs:** remove unbounded path-to-regexp override breaking SSG
([#1153](#1153))
([d810018](d810018))
* **docs:** use actual clone paths instead of folder display names in
multi-root workspace settings
([#984](#984))
([5dbab82](5dbab82))
* **instructions:** replace black with ruff in uv-projects
([#898](#898))
([b0c06d9](b0c06d9))
* **scripts:** cover .github/ skill files in copyright header validation
([#1055](#1055))
([#1098](#1098))
([27fbd33](27fbd33))
* **scripts:** eliminate phantom git changes from plugin generation
([#1035](#1035))
([e49a1b5](e49a1b5))
* **scripts:** enable JSON log output for lint:version-consistency
([#1033](#1033))
([52b0885](52b0885))
* **security:** calculate compliance score from total scanned
dependencies ([#930](#930))
([c112c3d](c112c3d))
* **skills:** add AST validation and namespace restriction for
content-extra.py
([#1027](#1027))
([c50c7a3](c50c7a3))
* **skills:** add depth limits to recursive PowerPoint processing
functions ([#1028](#1028))
([bf08994](bf08994))
* **skills:** harden XML parsing and blob writes in powerpoint extract
([#1053](#1053))
([89d24b1](89d24b1))
* **skills:** resolve ruff lint and format violations in powerpoint
skill ([#1048](#1048))
([17bbe7a](17bbe7a))
* **workflows:** add uv.lock dependencies submission have fork-skip
condition ([#1109](#1109))
([dec56ac](dec56ac))
* **workflows:** automate weekly SHA staleness check with issue creation
([#975](#975))
([1ea4caa](1ea4caa))
* **workflows:** close Codecov integration gaps for Pester and pytest
flags ([#1106](#1106))
([cca29b7](cca29b7))
* **workflows:** propagate uv sync errors in copilot-setup-steps
([#961](#961))
([df88d7c](df88d7c))
* **workflows:** resolve release-please skip cascade and Python project
discovery ([#1043](#1043))
([79993e2](79993e2))
* **workflows:** scan only commit subjects for breaking change detection
([#1157](#1157))
([a38a657](a38a657))


### 📚 Documentation

* clarify HVE Core Extension vs Installer messaging across documentation
([#965](#965))
([0fceb8f](0fceb8f))
* **docs:** add ADO integration user documentation
([#935](#935))
([ec89302](ec89302))
* **docs:** add Project Planning agent documentation
([#936](#936))
([3a3a0fd](3a3a0fd))
* **onboarding:** overhaul marketplace onboarding and documentation site
([#982](#982))
([4309e10](4309e10))


### ♻️ Refactoring

* **build:** merge code-review collection into coding-standards
([#863](#863))
([8027e7b](8027e7b))
* **workflows:** rename release pipeline workflows and add marketplace
automation triggers
([#829](#829))
([b6397f4](b6397f4))


### 🔧 Maintenance

* **build:** add clean:logs npm script
([#1122](#1122))
([f85fe02](f85fe02)),
closes [#988](#988)
* **build:** add JSON reporter for cspell
([#1123](#1123))
([6d59f67](6d59f67))
* **ci:** add multi-arch support to copilot-setup-steps binary downloads
([#955](#955))
([8d0c706](8d0c706))
* **deps-dev:** bump cspell from 9.6.4 to 9.7.0 in the npm-dependencies
group ([#839](#839))
([3fa16ff](3fa16ff))
* **deps:** bump actions/dependency-review-action from 4.8.3 to 4.9.0 in
the github-actions group across 1 directory
([#942](#942))
([1a9b858](1a9b858))
* **deps:** bump cairosvg from 2.8.2 to 2.9.0 in
/.github/skills/experimental/powerpoint
([#1025](#1025))
([f4deda7](f4deda7))
* **deps:** bump dompurify from 3.3.1 to 3.3.2 in /docs/docusaurus
([#924](#924))
([d2060d6](d2060d6))
* **deps:** bump svgo from 3.3.2 to 3.3.3 in /docs/docusaurus
([#880](#880))
([6dc2406](6dc2406))
* **deps:** bump the github-actions group across 1 directory with 4
updates ([#1100](#1100))
([2290dc0](2290dc0))
* **deps:** bump the github-actions group with 6 updates
([#840](#840))
([f57bc01](f57bc01))
* **docs:** correct New-MsDateReport table rendering and refresh stale
docs ([#1114](#1114))
([c2b806f](c2b806f))
* **settings:** remove orphaned Checkov config and stale gitignore
entries ([#870](#870))
([98fcd74](98fcd74))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: hve-core-release-please[bot] <254602402+hve-core-release-please[bot]@users.noreply.github.com>
Co-authored-by: Bill Berry <wberry@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants