Skip to content

security(powerpoint): lxml XXE vector in extract_content.py theme color resolution #1014

@WilliamBerryiii

Description

@WilliamBerryiii

Summary

_resolve_theme_colors() in extract_content.py calls lxml.etree.fromstring() on raw XML blobs extracted from PPTX relationship parts without explicit parser configuration. The default lxml parser may resolve external entities, creating a potential XML External Entity (XXE) injection vector.

Location

  • File: .github/skills/experimental/powerpoint/scripts/extract_content.py
  • Function: _resolve_theme_colors()lxml.etree.fromstring() call on PPTX relationship part blobs

Risk Assessment

  • Severity: HIGH
  • Attack Vector: A crafted PPTX file containing a malicious theme XML blob with external entity declarations could trigger XXE when the file is processed
  • Impact: Information disclosure (local file read), server-side request forgery (SSRF), or denial of service via recursive entity expansion (billion laughs)
  • CVSS Category: CWE-611 (Improper Restriction of XML External Entity Reference)

Current Behavior

tree = etree.fromstring(blob)  # Parses XML from PPTX relationship part

The XML blob comes from PPTX internal parts (relationship.target_part.blob). While PPTX files are ZIP archives with XML content, a deliberately crafted PPTX can embed malicious XML with DTD declarations containing external entity references.

Expected Behavior

XML parsing should use a hardened parser configuration that disables external entity resolution:

parser = etree.XMLParser(resolve_entities=False, no_network=True)
tree = etree.fromstring(blob, parser=parser)

Or use defusedxml for defense-in-depth:

from defusedxml.lxml import fromstring
tree = fromstring(blob)

RPI Framework

task-researcher

  • Identify all etree.fromstring() call sites in the codebase
  • Determine which calls process external/untrusted input vs internal data
  • Check if defusedxml is available as a dependency or if parser configuration is sufficient
  • Review lxml's default parser behavior for the installed version

task-planner

  • Decide between explicit parser configuration vs defusedxml wrapper
  • Plan consistent XML parsing hardening across all call sites
  • Evaluate whether defusedxml should be added as a dependency

task-implementor

  • Configure all etree.fromstring() calls with resolve_entities=False, no_network=True
  • Alternatively, replace with defusedxml.lxml.fromstring()
  • Add tests with XML payloads containing entity declarations to verify they are blocked
  • Verify CodeQL XXE query (py/xxe) passes after remediation

Acceptance Criteria

  • All lxml.etree.fromstring() calls processing external input use a hardened parser configuration
  • External entity resolution is disabled (resolve_entities=False) or defusedxml is used
  • Network access during XML parsing is disabled (no_network=True)
  • Tests verify that XML with external entity declarations does not trigger entity resolution
  • CodeQL py/xxe query does not flag any call sites after remediation

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingsecuritySecurity-related changes or concernsskillsCopilot skill packages (SKILL.md)

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions