security(powerpoint): untrusted blob writes in extract_content.py image extraction #1016

@WilliamBerryiii

Summary

extract_image() in extract_content.py writes img.blob bytes from PPTX image parts directly to disk without content validation, size limits, or path sanitization. A crafted PPTX file could deliver oversized payloads or exploit downstream consumers of the extracted images.

Location

  • File: .github/skills/experimental/powerpoint/scripts/extract_content.py
  • Function: extract_image(), at the point where img.blob is written to disk

Risk Assessment

  • Severity: HIGH
  • Attack Vector: A crafted PPTX file containing malicious image blobs (oversized, polyglot, or path-traversal filenames) could abuse the extraction process
  • Impact: Disk exhaustion via oversized blobs, path traversal if filenames are derived from PPTX metadata, or delivery of malicious content disguised as images
  • CWE Category: CWE-434 (Unrestricted Upload of File with Dangerous Type) / CWE-22 (Path Traversal)

Current Behavior

```python
# Writes raw blob bytes from PPTX image part directly to output path
with open(output_path, 'wb') as f:
    f.write(img.blob)
```

The image blob is written without:

  • Validating that the content is actually an image (magic bytes / content-type verification)
  • Enforcing maximum file size limits
  • Sanitizing the output path against directory traversal
  • Checking for polyglot files (files valid as both image and executable formats)

Expected Behavior

Image extraction should include defensive checks:

  1. Size limit: Reject blobs exceeding a reasonable maximum (e.g., 50 MB)
  2. Content validation: Verify magic bytes match expected image formats (PNG, JPEG, EMF, WMF, SVG)
  3. Path sanitization: Ensure output paths are confined to the expected output directory (no ../ traversal)
  4. Filename sanitization: Strip or reject filenames containing path separators or null bytes
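The size and content checks in items 1 and 2 could look roughly like the sketch below. It is only an illustration: `MAX_BLOB_BYTES`, `sniff_image_format`, and `validate_blob` are hypothetical names that do not exist in extract_content.py, the 50 MB figure is taken from the example above, and the signature table covers PNG, JPEG, placeable WMF, and EMF only. SVG is text-based and would need a separate check, so it is deliberately left out here.

```python
from typing import Optional

# Assumed default, taken from the "e.g., 50 MB" suggestion; should be configurable.
MAX_BLOB_BYTES = 50 * 1024 * 1024

# Leading-byte signatures for a few binary formats (hypothetical table, not exhaustive).
_MAGIC = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
    "wmf": b"\xd7\xcd\xc6\x9a",  # placeable WMF header key
}


def sniff_image_format(blob: bytes) -> Optional[str]:
    """Return a format name if the blob starts with a known signature, else None."""
    for name, signature in _MAGIC.items():
        if blob.startswith(signature):
            return name
    # EMF: header record type 1 at offset 0 and the " EMF" signature at offset 40.
    if len(blob) >= 44 and blob[:4] == b"\x01\x00\x00\x00" and blob[40:44] == b" EMF":
        return "emf"
    return None


def validate_blob(blob: bytes, max_bytes: int = MAX_BLOB_BYTES) -> str:
    """Raise ValueError for oversized or unrecognized blobs; return the sniffed format."""
    if len(blob) > max_bytes:
        raise ValueError(f"image blob is {len(blob)} bytes, limit is {max_bytes}")
    fmt = sniff_image_format(blob)
    if fmt is None:
        raise ValueError("image blob does not match a known image signature")
    return fmt
```

Note that a length check on a blob already held in memory only bounds what reaches disk; it does not limit memory use while the PPTX is being parsed. A signature check also does not rule out polyglot files, since a blob can start with valid image bytes and still carry other content.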

RPI Framework

task-researcher

  • Identify all locations where PPTX blob data is written to disk
  • Determine what PPTX metadata controls the output filename/path
  • Catalog the image formats that python-pptx can extract (PNG, JPEG, EMF, WMF, TIFF, SVG, etc.)
  • Assess maximum reasonable image sizes for presentation content

task-planner

  • Define size limits and content validation strategy
  • Plan path sanitization approach (os.path.realpath containment check)
  • Determine if content-type verification is sufficient or if magic byte checking is needed
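As a starting point for the planning discussion, an os.path.realpath containment check might look like the following. The function name `resolve_within` and its parameters are hypothetical and not part of the existing script; the sketch assumes the output directory itself is trusted.

```python
import os


def resolve_within(base_dir: str, candidate: str) -> str:
    """Resolve candidate against base_dir and reject anything that escapes it."""
    base = os.path.realpath(base_dir)
    # os.path.join discards base when candidate is absolute; the check below catches that.
    target = os.path.realpath(os.path.join(base, candidate))
    # commonpath equals base only when target is base itself or lies beneath it.
    if target == base or os.path.commonpath([base, target]) != base:
        raise ValueError(f"output path escapes the output directory: {candidate!r}")
    return target
```

Because realpath resolves symlinks before the comparison, a symlink inside the output directory that points elsewhere is rejected as well. On Windows, os.path.commonpath raises ValueError for paths on different drives, which this sketch treats the same as a rejection.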

task-implementor

  • Add blob size validation before write
  • Add output path containment check (resolved path must be under expected directory)
  • Add content-type / magic byte verification for known image formats
  • Add tests with oversized blobs, path traversal filenames, and non-image content
  • Sanitize any filename derived from PPTX metadata
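For the filename sanitization item, one possible shape is sketched below. `sanitize_filename` and its allow-list of characters are assumptions made for illustration; the actual policy (for example, whether to reject rather than rewrite suspicious names) is still to be decided by the planner task.

```python
import re

# Hypothetical allow-list: ASCII letters, digits, dot, underscore, hyphen.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9._-]")


def sanitize_filename(name: str, default: str = "image") -> str:
    """Reduce a metadata-derived name to a single safe path component."""
    if "\x00" in name:
        raise ValueError("filename contains a null byte")
    # Treat both separator styles as separators and keep only the last component.
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace anything outside the allow-list, then drop leading dots
    # so that "." and ".." (and hidden files) cannot be produced.
    cleaned = _UNSAFE_CHARS.sub("_", base).lstrip(".")
    return cleaned or default
```

For example, `sanitize_filename("../../etc/passwd")` yields `passwd`, and a name consisting only of dots falls back to the default. Sanitizing the name does not replace the containment check on the final resolved path; the two are meant to be used together.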

Acceptance Criteria

  • Blob size is validated against a configurable maximum before writing to disk
  • Output paths are verified to be within the expected output directory (no path traversal)
  • Content validation confirms blob data matches expected image formats
  • Filenames derived from PPTX metadata are sanitized (no path separators, null bytes, or special characters)
  • Tests verify rejection of oversized blobs, path traversal attempts, and non-image content
  • No raw blob data is written to arbitrary locations without validation

Metadata

Labels

bug, security, skills
