fix(skills): harden XML parsing and blob writes in powerpoint extract#1053
Conversation
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
- add XMLParser(resolve_entities=False) to _resolve_theme_colors for XXE prevention - rewrite _save_image_blob with content-type allowlist, size limit, path containment - refactor extract_image to delegate blob writes through secured helper - add 9 regression tests covering XXE and blob-write vectors - gitignore .hypothesis test artifact directories 🔒 - Generated by Copilot
7dc2faa to
efd34b0
Compare
…-blob-validation-1014-1016
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1053 +/- ##
==========================================
+ Coverage 88.01% 88.12% +0.10%
==========================================
Files 45 45
Lines 7886 7931 +45
==========================================
+ Hits 6941 6989 +48
+ Misses 945 942 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
katriendg
left a comment
There was a problem hiding this comment.
The XXE and path-traversal hardening is solid, and the test coverage is thorough. One blocker needs to be resolved before merge, plus two smaller items below.
🔴 Blocker — EMF images missing from allowlist (extract_content.py lines ~42–51)
_CONTENT_TYPE_TO_EXT omits image/emf and image/x-emf (Enhanced Metafile Format). This is a hard regression for real-world PPTX files. python-pptx's own source explicitly notes:
"Images produced by PowerPoint itself are generally EMF"
Charts, SmartArt, and diagrams exported from PowerPoint commonly embed EMF blobs. With this PR, any such file will crash the entire extraction run with:
ValueError: Unsupported image content type: image/x-emf
The error propagates entirely uncaught: _save_image_blob → extract_image → _extract_shape_by_type → extract_slide (catches only AttributeError, TypeError) → main(). A single EMF shape anywhere in a presentation aborts all output.
Suggested fix — add both EMF MIME types to the allowlist, with optional magic-byte validation (EMF header: \x01\x00\x00\x00 at offset 0, b" EMF" at offset 40):
_CONTENT_TYPE_TO_EXT: dict[str, str] = {
"image/bmp": "bmp",
"image/emf": "emf", # add
"image/gif": "gif",
"image/jpeg": "jpg",
"image/png": "png",
"image/tiff": "tiff",
"image/x-emf": "emf", # add — common python-pptx MIME type for EMF
"image/x-wmf": "wmf",
}The rejected-type test already includes "image/x-emf" as an expected failure, which confirms the omission — that test entry should be removed once EMF is added to the allowlist.
⚠️ Medium — Unguarded ValueError crashes full extraction on any validation failure (extract_content.py lines ~114–129)
All four guard clauses in _save_image_blob raise ValueError with no upstream handler. For the content-type and size cases this means a single malformed-but-benign image (e.g., an oversized embedded asset or an unusual MIME type) aborts the whole extraction. The fail-closed design is correct for the path-traversal and WMF-magic cases (active security threats), but for the size and type cases a log-and-skip pattern would be more resilient without weakening the security boundary.
Consider catching ValueError in _extract_shape_by_type (or extract_image) and returning None with a logged warning for the non-security cases, or documenting the hard-crash behaviour as intentional in the function docstring so callers know they must handle it.
💡 Low — Duplicate XXE test coverage (test_extract_content.py lines ~1073 and ~1099)
test_xxe_entity_not_resolved and test_resolve_theme_colors_xxe_payload_blocked exercise the exact same code path (_resolve_theme_colors with a SYSTEM entity), differing only in the target path (/dev/null vs /etc/passwd). One of the two is redundant. Consolidating them into a single parametrized test would sharpen intent without losing coverage.
The branch is also behind main and will need a rebase before merge.
|
@WilliamBerryiii I did a full automated review here, including posting the comments :) YOLO - I believe the 'blocker' one is accurate. |
Ah ha ha ha ha ... this is solid ... I was testing with one of the decks made by the tool which was using wmf and I neglected to test with a prebuilt standard PPTX deck where the blocker would have surfaced. Addressing now. |
- add image/emf and image/x-emf to allowlist with magic-byte validation - replace bare ValueError with _ImageSecurityError subclass - consolidate duplicate XXE tests into parametrized coverage - add SVG sanitization via hardened XMLParser with DTD rejection - add cairosvg SVG-to-PNG conversion with 7 new tests 🔒 - Generated by Copilot
|
@katriendg All three items from your review have been addressed in 🔴 Blocker — EMF images missing from allowlistResolved. Added
|
🔧 - Generated by Copilot
- Add missing blank line after _ImageSecurityError class definition - Unwrap single-line raise statement in _validate_emf_magic_bytes
katriendg
left a comment
There was a problem hiding this comment.
Thanks, also for the notes!
🤖 I have created a release *beep* *boop* --- ## [3.2.0](hve-core-v3.1.46...hve-core-v3.2.0) (2026-03-20) ### ✨ Features * add -OutputPath parameter to Validate-MarkdownFrontmatter.ps1 ([#1134](#1134)) ([fdf1bcf](fdf1bcf)), closes [#1006](#1006) * add action version consistency scan workflow ([#1127](#1127)) ([4229df1](4229df1)) * **agent:** MVE Experiment Designer ([#976](#976)) ([70f86ca](70f86ca)) * **agents:** add ADO Backlog Manager orchestrator agent ([#800](#800)) ([fae3987](fae3987)) * **agents:** add meeting analyst agent for transcript analysis using work-iq ([#502](#502)) ([5345b5b](5345b5b)) * **agents:** add quick-reference line to RPI Phase 5 suggestions ([#897](#897)) ([9a90f39](9a90f39)) * **agents:** add RAI Planner, enhance SSSC Planner, and redesign Security Planner ([#979](#979)) ([06f826c](06f826c)) * **agents:** add symmetric cross-system handoff to GitHub Backlog Manager ([#952](#952)) ([ba34a35](ba34a35)) * **agents:** Functional Code Review Agent — pre-PR functional correctness reviewer ([#733](#733)) ([9cf63b7](9cf63b7)) * **build:** add Python extensions and uv 0.10.8 to devcontainer ([#920](#920)) ([9ca0579](9ca0579)) * **build:** add uv ecosystem to Dependabot configuration ([#913](#913)) ([2a4bd39](2a4bd39)) * **build:** enable npm pinning enforcement in dependency scan ([#838](#838)) ([4e9e31f](4e9e31f)) * **build:** migrate attestation actions to v4.1.0 and add SBOM verification docs ([#841](#841)) ([ca1e65b](ca1e65b)) * **collections:** add four new validator checks (orphan, duplicate, companion, coverage) ([#869](#869)) ([1a96b73](1a96b73)) * **devcontainer,security:** add enterprise artifact hub configuration ([#1032](#1032)) ([1d56d25](1d56d25)) * **docs:** add Rust coding standards and guidelines ([#809](#809)) ([d4c4899](d4c4899)) * **extension:** add Microsoft logo icon to VS Code Marketplace listings ([#906](#906)) ([82aca41](82aca41)) * **github:** add declarative label management ([#953](#953)) ([a1a6845](a1a6845)) * **instructions:** add ADO backlog shared infrastructure ([#786](#786)) ([1914078](1914078)) * **instructions:** add ADO backlog sprint planning and capacity tracking ([#788](#788)) ([d6fb77d](d6fb77d)) * **instructions:** add ADO triage workflow and prompt ([#787](#787)) ([cde0190](cde0190)) * **instructions:** add shared story quality conventions and sprint planning ([#803](#803)) ([a2f18e3](a2f18e3)) * **prompts:** add ADO discovery and work item prompts with agent routing ([#790](#790)) ([7e74523](7e74523)) * **prompts:** add security review prompts ([#1118](#1118)) ([ad30967](ad30967)) * **scripts:** add dynamic Python skill discovery for lint/test ([#957](#957)) ([0a90f57](0a90f57)) * **scripts:** add Get-StandardTimestamp utility to CIHelpers module ([#1126](#1126)) ([b273a4b](b273a4b)) * **scripts:** add Python copyright header validation ([#905](#905)) ([67df902](67df902)) * **scripts:** add Python skill support to Validate-SkillStructure ([#903](#903)) ([68479d9](68479d9)) * **scripts:** add workflow npm command scanning to dependency pinning ([#837](#837)) ([6b5ae06](6b5ae06)) * **security:** add basic security reviewer agent with owasp skills ([#1008](#1008)) ([cb1fd05](cb1fd05)) * **security:** add sigstore attestation bundles and fix component-detection action ([#1148](#1148)) ([f79c272](f79c272)) * **skills:** add Atheris fuzz harness with CI workflow integration ([#1102](#1102)) ([d337e1d](d337e1d)) * **skills:** add PowerPoint automation skill with YAML-driven deck generation ([#868](#868)) ([00465cd](00465cd)) * **skills:** convert hve-core-installer agent to self-contained skill ([#846](#846)) ([1d821fb](1d821fb)) * **skills:** enhance pr-reference skill with flexible filtering and base branch detection ([#1095](#1095)) ([26a32ea](26a32ea)) * **workflows:** add devcontainer infrastructure change log workflow ([#899](#899)) ([8aca446](8aca446)) * **workflows:** add milestone auto-close on stable and pre-release publishes ([#834](#834)) ([79362b1](79362b1)) * **workflows:** add ms.date documentation freshness checking ([#969](#969)) ([3ed441c](3ed441c)) * **workflows:** add Python linting CI workflow with Ruff ([#951](#951)) ([f89f0eb](f89f0eb)) * **workflows:** add Python testing CI workflow with pytest and Codecov ([#934](#934)) ([5e8306f](5e8306f)) * **workflows:** add uv and Python package sync to copilot-setup-steps ([#921](#921)) ([45d517d](45d517d)) ### 🐛 Bug Fixes * **build:** override Linguist vendored flag for Python skill files ([#1155](#1155)) ([0eee5b6](0eee5b6)) * **build:** override serialize-javascript to >=7.0.3 for RCE fix ([#876](#876)) ([e49039a](e49039a)) * **build:** resolve Pinned-Dependencies alerts for vsce npm commands in extension workflows ([#782](#782)) ([89dad9d](89dad9d)) * **build:** update undici and yauzl overrides for security audit ([#1030](#1030)) ([2c2f92f](2c2f92f)) * **docs:** add CLI Plugins to install.md navigation surfaces ([#902](#902)) ([79d6595](79d6595)) * **docs:** add sidebar ordering for Design Thinking documentation ([#832](#832)) ([551fddc](551fddc)), closes [#830](#830) * **docs:** graduate design-thinking to preview and correct stale collection references ([#831](#831)) ([5110e35](5110e35)) * **docs:** include project-planning in UX Designer install guidance ([#908](#908)) ([e7aa9bc](e7aa9bc)) * **docs:** remediate writing-style convention violations ([#865](#865)) ([68b04bc](68b04bc)) * **docs:** remove draft content announcement banner ([#825](#825)) ([b45de80](b45de80)) * **docs:** remove unbounded path-to-regexp override breaking SSG ([#1153](#1153)) ([d810018](d810018)) * **docs:** use actual clone paths instead of folder display names in multi-root workspace settings ([#984](#984)) ([5dbab82](5dbab82)) * **instructions:** replace black with ruff in uv-projects ([#898](#898)) ([b0c06d9](b0c06d9)) * **scripts:** cover .github/ skill files in copyright header validation ([#1055](#1055)) ([#1098](#1098)) ([27fbd33](27fbd33)) * **scripts:** eliminate phantom git changes from plugin generation ([#1035](#1035)) ([e49a1b5](e49a1b5)) * **scripts:** enable JSON log output for lint:version-consistency ([#1033](#1033)) ([52b0885](52b0885)) * **security:** calculate compliance score from total scanned dependencies ([#930](#930)) ([c112c3d](c112c3d)) * **skills:** add AST validation and namespace restriction for content-extra.py ([#1027](#1027)) ([c50c7a3](c50c7a3)) * **skills:** add depth limits to recursive PowerPoint processing functions ([#1028](#1028)) ([bf08994](bf08994)) * **skills:** harden XML parsing and blob writes in powerpoint extract ([#1053](#1053)) ([89d24b1](89d24b1)) * **skills:** resolve ruff lint and format violations in powerpoint skill ([#1048](#1048)) ([17bbe7a](17bbe7a)) * **workflows:** add uv.lock dependencies submission have fork-skip condition ([#1109](#1109)) ([dec56ac](dec56ac)) * **workflows:** automate weekly SHA staleness check with issue creation ([#975](#975)) ([1ea4caa](1ea4caa)) * **workflows:** close Codecov integration gaps for Pester and pytest flags ([#1106](#1106)) ([cca29b7](cca29b7)) * **workflows:** propagate uv sync errors in copilot-setup-steps ([#961](#961)) ([df88d7c](df88d7c)) * **workflows:** resolve release-please skip cascade and Python project discovery ([#1043](#1043)) ([79993e2](79993e2)) * **workflows:** scan only commit subjects for breaking change detection ([#1157](#1157)) ([a38a657](a38a657)) ### 📚 Documentation * clarify HVE Core Extension vs Installer messaging across documentation ([#965](#965)) ([0fceb8f](0fceb8f)) * **docs:** add ADO integration user documentation ([#935](#935)) ([ec89302](ec89302)) * **docs:** add Project Planning agent documentation ([#936](#936)) ([3a3a0fd](3a3a0fd)) * **onboarding:** overhaul marketplace onboarding and documentation site ([#982](#982)) ([4309e10](4309e10)) ### ♻️ Refactoring * **build:** merge code-review collection into coding-standards ([#863](#863)) ([8027e7b](8027e7b)) * **workflows:** rename release pipeline workflows and add marketplace automation triggers ([#829](#829)) ([b6397f4](b6397f4)) ### 🔧 Maintenance * **build:** add clean:logs npm script ([#1122](#1122)) ([f85fe02](f85fe02)), closes [#988](#988) * **build:** add JSON reporter for cspell ([#1123](#1123)) ([6d59f67](6d59f67)) * **ci:** add multi-arch support to copilot-setup-steps binary downloads ([#955](#955)) ([8d0c706](8d0c706)) * **deps-dev:** bump cspell from 9.6.4 to 9.7.0 in the npm-dependencies group ([#839](#839)) ([3fa16ff](3fa16ff)) * **deps:** bump actions/dependency-review-action from 4.8.3 to 4.9.0 in the github-actions group across 1 directory ([#942](#942)) ([1a9b858](1a9b858)) * **deps:** bump cairosvg from 2.8.2 to 2.9.0 in /.github/skills/experimental/powerpoint ([#1025](#1025)) ([f4deda7](f4deda7)) * **deps:** bump dompurify from 3.3.1 to 3.3.2 in /docs/docusaurus ([#924](#924)) ([d2060d6](d2060d6)) * **deps:** bump svgo from 3.3.2 to 3.3.3 in /docs/docusaurus ([#880](#880)) ([6dc2406](6dc2406)) * **deps:** bump the github-actions group across 1 directory with 4 updates ([#1100](#1100)) ([2290dc0](2290dc0)) * **deps:** bump the github-actions group with 6 updates ([#840](#840)) ([f57bc01](f57bc01)) * **docs:** correct New-MsDateReport table rendering and refresh stale docs ([#1114](#1114)) ([c2b806f](c2b806f)) * **settings:** remove orphaned Checkov config and stale gitignore entries ([#870](#870)) ([98fcd74](98fcd74)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: hve-core-release-please[bot] <254602402+hve-core-release-please[bot]@users.noreply.github.com> Co-authored-by: Bill Berry <wberry@microsoft.com>
Description
Hardens the PowerPoint content extraction skill against two classes of security vulnerability:
XXE entity expansion (security(powerpoint): lxml XXE vector in extract_content.py theme color resolution #1014) —
_resolve_theme_colors()now creates anXMLParserwithresolve_entities=Falseandno_network=True, preventing external entity injection and network-based SSRF. Malformed XML is caught and logged rather than raising.Unrestricted blob writes (security(powerpoint): untrusted blob writes in extract_content.py image extraction #1016) —
_save_image_blob()is rewritten with defense-in-depth:os.path.realpathblocks directory-traversal attacksimage/x-wmfbut lack a valid Aldus placeable or standard WMF header prefix (historical CVE-2005-4560 mitigation)Both fixes follow a fail-closed design: invalid inputs are rejected or logged, never silently passed through.
Related Issue(s)
Closes #1014
Closes #1016
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
AI Artifacts:
prompt-builderagent and addressed all feedback.github/instructions/*.instructions.md).github/prompts/*.prompt.md).github/agents/*.agent.md).github/skills/*/SKILL.md)Other:
.ps1,.sh,.py)Testing
test_extract_content.pycovering every hardening layer:extract_imagecorrectly delegates to_save_image_blob../escape.png) is rejected&,<, etc. are preserved correctlyChecklist
Required Checks
Required Automated Checks
The following validation commands must pass before merging:
npm run lint:mdnpm run spell-checknpm run lint:frontmatternpm run validate:skillsnpm run lint:md-links(12 pre-existing broken links in unrelated files; 0 in changed files)npm run lint:psnpm run plugin:generate(no plugin changes in this PR)Security Considerations
Additional Notes
no_network=Trueflag onXMLParseris an additional hardening measure that blocks SSRF attempts via DTD fetches, complementing theresolve_entities=Falsedefense.ValueErrorwith descriptive messages.