security(powerpoint): untrusted blob writes in extract_content.py image extraction

## Summary

`extract_image()` in `extract_content.py` writes `img.blob` bytes from PPTX image parts directly to disk without content validation, size limits, or path sanitization. A crafted PPTX file could deliver oversized payloads or exploit downstream consumers of the extracted images.

## Location

- **File**: `.github/skills/experimental/powerpoint/scripts/extract_content.py`
- **Function**: `extract_image()` — `img.blob` write to disk

## Risk Assessment

- **Severity**: HIGH
- **Attack Vector**: A crafted PPTX file containing malicious image blobs (oversized, polyglot, or path-traversal filenames) could abuse the extraction process
- **Impact**: Disk exhaustion via oversized blobs, path traversal if filenames are derived from PPTX metadata, or delivery of malicious content disguised as images
- **CWE Category**: CWE-434 (Unrestricted Upload of File with Dangerous Type) / CWE-22 (Path Traversal)

## Current Behavior

```python
# Writes raw blob bytes from PPTX image part directly to output path
with open(output_path, 'wb') as f:
    f.write(img.blob)
```

The image blob is written without:
- Validating that the content is actually an image (magic bytes / content-type verification)
- Enforcing maximum file size limits
- Sanitizing the output path against directory traversal
- Checking for polyglot files (files valid as both image and executable formats)

## Expected Behavior

Image extraction should include defensive checks:

1. **Size limit**: Reject blobs exceeding a reasonable maximum (e.g., 50 MB)
2. **Content validation**: Verify that the leading bytes match an expected image format (PNG, JPEG, EMF, WMF); SVG is XML text with no fixed signature and needs a separate check
3. **Path sanitization**: Ensure output paths are confined to the expected output directory (no `../` traversal)
4. **Filename sanitization**: Strip or reject filenames containing path separators or null bytes
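
The first, second, and fourth checks could be sketched roughly as follows. This is a minimal illustration and not the project's implementation: the names `MAX_BLOB_BYTES`, `MAGIC_PREFIXES`, `sniff_image_format`, `check_blob`, and `sanitize_filename` are hypothetical, the 50 MB default is taken from the example figure above, and the signature table covers only a few raster formats. EMF, WMF, and SVG would need their own handling.

```python
import re

# Assumed default, taken from the "e.g., 50 MB" figure in this issue.
MAX_BLOB_BYTES = 50 * 1024 * 1024

# Leading-byte signatures for a few raster formats (illustrative, not exhaustive).
MAGIC_PREFIXES = {
    "png": (b"\x89PNG\r\n\x1a\n",),
    "jpeg": (b"\xff\xd8\xff",),
    "gif": (b"GIF87a", b"GIF89a"),
    "tiff": (b"II*\x00", b"MM\x00*"),
}


def sniff_image_format(blob: bytes):
    """Return a format name if the blob starts with a known signature, else None."""
    for fmt, prefixes in MAGIC_PREFIXES.items():
        if blob.startswith(prefixes):
            return fmt
    return None


def check_blob(blob: bytes, max_bytes: int = MAX_BLOB_BYTES) -> str:
    """Raise ValueError for oversized or unrecognized blobs; return the format."""
    if len(blob) > max_bytes:
        raise ValueError(f"blob is {len(blob)} bytes, limit is {max_bytes}")
    fmt = sniff_image_format(blob)
    if fmt is None:
        raise ValueError("blob does not start with a known image signature")
    return fmt


def sanitize_filename(name: str) -> str:
    """Reject null bytes, drop directory components, and whitelist characters."""
    if "\x00" in name:
        raise ValueError("filename contains a null byte")
    # Treat both separator styles as directory boundaries and keep the last part.
    base = name.replace("\\", "/").split("/")[-1]
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)
    if base in ("", ".", ".."):
        raise ValueError("filename is empty after sanitization")
    return base
```

A signature check only confirms how the blob begins, so it does not rule out polyglot files. Re-encoding the image would be a stronger but more expensive option.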

## RPI Framework

### task-researcher
- Identify all locations where PPTX blob data is written to disk
- Determine what PPTX metadata controls the output filename/path
- Catalog the image formats that python-pptx can extract (PNG, JPEG, EMF, WMF, TIFF, SVG, etc.)
- Assess maximum reasonable image sizes for presentation content
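
As a possible starting point for the cataloguing and sizing questions, the sketch below inspects a PPTX as a ZIP archive using only the standard library. It assumes media parts live under `ppt/media/`, which is the conventional layout but is not guaranteed for every file. The function name `summarize_media` is hypothetical. The sizes reported are the uncompressed sizes declared in the archive headers, which a crafted file could misstate, so they are useful for surveying trusted sample decks rather than as a security control.

```python
import os
import zipfile
from collections import Counter


def summarize_media(pptx):
    """Return (extension counts, largest entry) for parts under ppt/media/.

    `pptx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(pptx) as zf:
        entries = [
            (info.filename, info.file_size)  # declared uncompressed size
            for info in zf.infolist()
            if info.filename.startswith("ppt/media/") and not info.is_dir()
        ]
    ext_counts = Counter(os.path.splitext(name)[1].lower() for name, _ in entries)
    largest = max(entries, key=lambda entry: entry[1], default=None)
    return ext_counts, largest
```

Running this over a representative set of presentations would give an empirical basis for choosing the size limit.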

### task-planner
- Define size limits and content validation strategy
- Plan path sanitization approach (`os.path.realpath` containment check)
- Determine if content-type verification is sufficient or if magic byte checking is needed
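
One way the `os.path.realpath` containment check could look is sketched below. The helper name `resolve_inside` is hypothetical, and the sketch assumes the output directory itself is trusted. It resolves symlinks and `..` segments on both paths and then compares them with `os.path.commonpath`, which avoids the false positives of a plain string-prefix test (for example `/out` versus `/out-evil`).

```python
import os


def resolve_inside(base_dir: str, filename: str) -> str:
    """Return the resolved output path, or raise ValueError if it escapes base_dir."""
    base = os.path.realpath(base_dir)
    candidate = os.path.realpath(os.path.join(base, filename))
    # commonpath compares whole path components, not raw string prefixes.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"{filename!r} resolves outside {base_dir!r}")
    return candidate
```

An absolute `filename` replaces `base` inside `os.path.join`, so it is rejected by the same comparison.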

### task-implementor
- Add blob size validation before write
- Add output path containment check (resolved path must be under expected directory)
- Add content-type / magic byte verification for known image formats
- Add tests with oversized blobs, path traversal filenames, and non-image content
- Sanitize any filename derived from PPTX metadata

## Acceptance Criteria

- [ ] Blob size is validated against a configurable maximum before writing to disk
- [ ] Output paths are verified to be within the expected output directory (no path traversal)
- [ ] Content validation confirms blob data matches expected image formats
- [ ] Filenames derived from PPTX metadata are sanitized (no path separators, null bytes, or special characters)
- [ ] Tests verify rejection of oversized blobs, path traversal attempts, and non-image content
- [ ] No raw blob data is written to arbitrary locations without validation
