fix(parse): preserve original filename and extension when importing from URL by mvanhorn · Pull Request #619 · volcengine/OpenViking

mvanhorn · 2026-03-15T03:19:55Z

Summary

Fix URL resource import to preserve original filenames and extensions instead of using temp file names and converting everything to .md.

Why this matters

When importing resources via URL (e.g., COS paths), three bugs compound:

Temp filename (like tmptzedil33) leaks into the final URI as a directory name
Code files (.py, .js, etc.) are converted to .md by the markdown parser
URL-encoded characters in filenames are not decoded

#251 - Original report with curl repro
@ZaynJarvis (COLLABORATOR) confirmed: "good catch. I can reproduce the issue"
@MaojiaSheng (COLLABORATOR) confirmed: "The path looks like a url indicated code file, and it hits wrong parser"

Expected: viking://resources/dir/schemas.py
Actual: viking://resources/dir/schemas/schemas.md

Changes

In openviking/parse/parsers/html.py:

Added CODE_EXTENSIONS set (~40 common code/text extensions) to URLTypeDetector so code files route to DOWNLOAD_TXT instead of WEBPAGE parsing
Added _extract_filename_from_url() to extract and URL-decode the original filename from URLs
Added _save_downloaded_text() to save downloaded text/code files with their original filename and extension, bypassing the MarkdownParser
Applied urllib.parse.unquote() for proper handling of URL-encoded characters
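The filename handling described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names, the `default` fallback, and the `dest_dir` parameter are hypothetical, and the real `_extract_filename_from_url()` and `_save_downloaded_text()` helpers may behave differently.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlparse


def extract_filename_from_url(url: str, default: str = "download") -> str:
    """Return the URL-decoded last path segment, ignoring query and fragment.

    Hypothetical stand-in for the PR's helper; `default` is an assumption.
    """
    path = urlparse(url).path
    # Take the last segment before decoding so that an encoded slash (%2F)
    # cannot introduce an extra path component.
    segment = path.rstrip("/").rsplit("/", 1)[-1]
    name = unquote(segment).replace("/", "_")
    if name in ("", ".", ".."):
        return default
    return name


def save_downloaded_text(text: str, url: str, dest_dir: Path) -> Path:
    """Write text under dest_dir using the filename taken from the URL."""
    target = dest_dir / extract_filename_from_url(url)
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_downloaded_text(
            "print('hi')\n",
            "https://example.com/dir/my%20schemas.py?sign=abc",
            Path(tmp),
        )
        print(saved.name)  # my schemas.py
```

Decoding only the final segment keeps the original extension intact and avoids using the temp file name anywhere in the saved path.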

Testing

Added tests/parse/test_url_filename_preservation.py with tests for:

Filename extraction from various URL formats
URL decoding of encoded characters
Code extension detection

Fixes #251

This contribution was developed with AI assistance (Claude Code).

MaojiaSheng · 2026-03-15T03:43:00Z

openviking/parse/parsers/html.py

    """

+    # Common code/text file extensions that should be downloaded, not parsed as web pages
+    CODE_EXTENSIONS = {


please reuse CODE_EXTENSIONS in openviking/parse/parsers/constants.py

Done in 15eb0dd - imported CODE_EXTENSIONS from constants.py and removed the duplicate set. Put the spread first in EXTENSION_MAP so .html/.htm entries still correctly map to DOWNLOAD_HTML.
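Why the spread has to come first can be shown with a small sketch. In a Python dict literal, a later key overrides an earlier one, so explicit entries placed after the unpacked mapping win. The `URLType` enum name, the dotted-key format, and the contents of `CODE_EXTENSIONS` below are assumptions for illustration. In particular, `.html` is included in the stand-in set only to demonstrate the collision.

```python
from enum import Enum
from pathlib import PurePosixPath
from urllib.parse import urlparse


class URLType(Enum):
    # Member names follow the PR text; the enum itself is a hypothetical stand-in.
    WEBPAGE = "webpage"
    DOWNLOAD_HTML = "download_html"
    DOWNLOAD_TXT = "download_txt"


# Stand-in for the shared set in constants.py (actual contents not shown in the PR).
CODE_EXTENSIONS = {".py", ".js", ".html"}

EXTENSION_MAP = {
    # Spread first: every code extension defaults to a plain text download.
    **{ext: URLType.DOWNLOAD_TXT for ext in CODE_EXTENSIONS},
    # Explicit entries come later, so they override the spread for these keys.
    ".html": URLType.DOWNLOAD_HTML,
    ".htm": URLType.DOWNLOAD_HTML,
}


def detect_by_extension(url: str) -> URLType:
    """Classify a URL by the suffix of its path, defaulting to WEBPAGE."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_MAP.get(suffix, URLType.WEBPAGE)
```

With this ordering, a URL ending in `schemas.py` maps to `DOWNLOAD_TXT`, while one ending in `index.html` still maps to `DOWNLOAD_HTML`. Reversing the order would let the spread overwrite the explicit `.html` entry.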

…rom URL

When importing resources via URL (e.g., COS/HTTP paths to code files), three bugs occurred: an extra directory was created using the temp file name, the file extension was changed from .py to .md, and URL-encoded characters in filenames were not decoded.

Root cause: code file extensions (.py, .js, etc.) were not recognized by URLTypeDetector, causing them to be parsed as web pages through the HTML-to-markdown pipeline. This lost the original filename and always produced .md output.

Changes:
- Add common code/text file extensions to URLTypeDetector so they route to download instead of webpage parsing
- Extract and URL-decode the original filename from the URL path
- Save downloaded text/code files with their original name and extension instead of routing through MarkdownParser
- Pass original filename to markdown and PDF parsers as source_path / resource_name so temp file names don't leak into the final URI

Fixes volcengine#251

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Import CODE_EXTENSIONS from openviking.parse.parsers.constants instead of duplicating the set in URLTypeDetector. Spread comes first in EXTENSION_MAP so explicit entries (.html, .htm) correctly override to DOWNLOAD_HTML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

codeCraft-Ritik

Excellent use of Python features to keep the implementation concise

github-project-automation bot moved this to Backlog in OpenViking project Mar 15, 2026

github-project-automation bot added this to OpenViking project Mar 15, 2026

MaojiaSheng reviewed Mar 15, 2026


mvanhorn and others added 2 commits March 14, 2026 21:40

mvanhorn force-pushed the osc/251-url-resource-filename-preservation branch from f2c833c to 15eb0dd Compare March 15, 2026 04:40

codeCraft-Ritik reviewed Mar 15, 2026


MaojiaSheng approved these changes Mar 15, 2026


MaojiaSheng merged commit 45915ae into volcengine:main Mar 15, 2026
6 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 15, 2026

MaojiaSheng merged 2 commits into volcengine:main from mvanhorn:osc/251-url-resource-filename-preservation
