fix: treat only .zip as archive; avoid unzipping ZIP-based container …#410

Merged
qin-ctx merged 1 commit into volcengine:main from sponge225:fix/docx-zip-detection
Mar 4, 2026
Conversation

@sponge225
Contributor

fix: treat only .zip as archive; avoid unzipping ZIP-based container formats (OOXML)

Description

English:
This PR fixes an issue where ZIP-based container formats (such as .docx, .xlsx, .pptx) were incorrectly identified as generic ZIP archives and unzipped, bypassing their dedicated parsers. The fix introduces a specific check for the .zip file extension before attempting to unzip, ensuring that only true ZIP archives are treated as compressed directories. This allows Office documents (which are technically ZIP files) to fall through to their specialized parsers (e.g., via registry.get_parser_for_file).
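To illustrate why the content check alone is not enough, here is a minimal sketch using only the standard library. The file names and the `make_zip_container` helper are hypothetical and not part of the PR; the stand-in `.docx` is just a ZIP with one entry, not a valid Word document.

```python
import tempfile
import zipfile
from pathlib import Path


def make_zip_container(path: Path) -> Path:
    """Write a minimal ZIP container at `path` (hypothetical stand-in for a real file)."""
    with zipfile.ZipFile(path, "w") as zf:
        # OOXML packages carry a [Content_Types].xml entry; the body here is a placeholder.
        zf.writestr("[Content_Types].xml", "<Types/>")
    return path


with tempfile.TemporaryDirectory() as tmp:
    fake_docx = make_zip_container(Path(tmp) / "report.docx")
    real_zip = make_zip_container(Path(tmp) / "bundle.zip")

    # is_zipfile inspects the file's contents, not its name, so both match.
    docx_detected = zipfile.is_zipfile(fake_docx)
    zip_detected = zipfile.is_zipfile(real_zip)
```

Both flags come out true, which is why a bare `zipfile.is_zipfile` check would route the `.docx` into the unzip path.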

Related Issue

Fixes #407

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

English:

  • Added a check for .zip extension combined with zipfile.is_zipfile in openviking/utils/media_processor.py.
  • Prioritized explicit .zip handling to prevent generic ZIP detection from intercepting OOXML files.
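The combined check described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `openviking/utils/media_processor.py`: the function name `should_unzip` is hypothetical, and comparing the extension case-insensitively via `Path.suffix.lower()` is an assumption about how `ext` is derived.

```python
import tempfile
import zipfile
from pathlib import Path


def should_unzip(file_path: Path) -> bool:
    """Return True only for real ZIP archives that also carry a .zip extension."""
    # Assumption: the extension is compared case-insensitively.
    ext = file_path.suffix.lower()
    # The extension test runs first, so OOXML containers are rejected
    # before is_zipfile ever opens the file.
    return ext == ".zip" and zipfile.is_zipfile(file_path)


with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for name in ("bundle.zip", "report.docx", "sheet.xlsx"):
        path = Path(tmp) / name
        with zipfile.ZipFile(path, "w") as zf:
            zf.writestr("placeholder.txt", "demo")
        results[name] = should_unzip(path)

    # A text file renamed to .zip passes the extension test but fails the content check.
    fake = Path(tmp) / "notes.zip"
    fake.write_text("not an archive")
    results[fake.name] = should_unzip(fake)
```

Only `bundle.zip` is selected for unzipping; in the PR's design, the other files would fall through to whatever parser the registry returns for them.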

Testing

  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

resource_name=file_path.stem,
)
# Check if it's a zip file
if zipfile.is_zipfile(file_path):
Collaborator

The old code hasn't been cleaned up. Or would changing just this one line be enough?
if ext == ".zip" and zipfile.is_zipfile(file_path):

@sponge225 sponge225 force-pushed the fix/docx-zip-detection branch 3 times, most recently from 53968ca to 5b7f542 Compare March 4, 2026 09:21
@sponge225 sponge225 force-pushed the fix/docx-zip-detection branch from 5b7f542 to 23c96aa Compare March 4, 2026 09:22
@qin-ctx qin-ctx merged commit f546656 into volcengine:main Mar 4, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 4, 2026

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Bug]: docx and other ZIP-container-based files are incorrectly parsed as ZIP archives instead of using their dedicated parsers

3 participants