fix(parse): preserve original filename and extension when importing from URL #619
Merged
MaojiaSheng merged 2 commits into volcengine:main on Mar 15, 2026
MaojiaSheng
reviewed
Mar 15, 2026
openviking/parse/parsers/html.py
Outdated
# Common code/text file extensions that should be downloaded, not parsed as web pages
CODE_EXTENSIONS = {
Collaborator
please reuse CODE_EXTENSIONS in openviking/parse/parsers/constants.py
Contributor
Author
Done in 15eb0dd - imported CODE_EXTENSIONS from constants.py and removed the duplicate set. Put the spread first in EXTENSION_MAP so .html/.htm entries still correctly map to DOWNLOAD_HTML.
…rom URL

When importing resources via URL (e.g., COS/HTTP paths to code files), three bugs occurred: an extra directory was created using the temp file name, the file extension was changed from .py to .md, and URL-encoded characters in filenames were not decoded.

Root cause: code file extensions (.py, .js, etc.) were not recognized by URLTypeDetector, causing them to be parsed as web pages through the HTML-to-markdown pipeline. This lost the original filename and always produced .md output.

Changes:
- Add common code/text file extensions to URLTypeDetector so they route to download instead of webpage parsing
- Extract and URL-decode the original filename from the URL path
- Save downloaded text/code files with their original name and extension instead of routing through MarkdownParser
- Pass original filename to markdown and PDF parsers as source_path / resource_name so temp file names don't leak into the final URI

Fixes volcengine#251

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
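The root cause described in this commit message (code extensions not being recognized, so the URL falls through to webpage parsing) can be illustrated with a minimal sketch. This is not the project's URLTypeDetector: the `URLType` enum, the `detect_url_type` function, and the five-entry extension set are hypothetical stand-ins chosen for illustration.

```python
from enum import Enum
from pathlib import PurePosixPath
from urllib.parse import urlparse


class URLType(Enum):
    # Hypothetical stand-in for the detector's result types.
    WEBPAGE = "webpage"
    DOWNLOAD_TXT = "download_txt"


# Illustrative subset only; the real set lives in the project's constants.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs"}


def detect_url_type(url: str) -> URLType:
    """Route URLs whose path ends in a known code extension to download."""
    # Use only the path component so query strings do not hide the extension.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in CODE_EXTENSIONS:
        return URLType.DOWNLOAD_TXT
    # Anything unrecognized is treated as a web page in this sketch.
    return URLType.WEBPAGE
```

Without the extension check, a URL ending in `schemas.py` would take the final `WEBPAGE` branch, which is the behavior the commit message identifies as the source of the `.md` output.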
Import CODE_EXTENSIONS from openviking.parse.parsers.constants instead of duplicating the set in URLTypeDetector. Spread comes first in EXTENSION_MAP so explicit entries (.html, .htm) correctly override to DOWNLOAD_HTML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
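The ordering point in this commit message relies on a general Python rule: in a dict literal, later keys overwrite earlier ones. A small sketch of that rule follows. The string values and the three-entry set are placeholders, and it assumes (as the commit message implies) that the shared set overlaps with the explicit HTML entries.

```python
# Illustrative set; assumes ".html" also appears in the shared constants.
CODE_EXTENSIONS = {".py", ".js", ".html"}

# Spread first: the explicit entries that follow win on key collisions.
EXTENSION_MAP = {
    **{ext: "DOWNLOAD_TXT" for ext in CODE_EXTENSIONS},
    ".html": "DOWNLOAD_HTML",
    ".htm": "DOWNLOAD_HTML",
}

# Spread last: the generic mapping would overwrite the explicit ".html" entry.
REVERSED_MAP = {
    ".html": "DOWNLOAD_HTML",
    ".htm": "DOWNLOAD_HTML",
    **{ext: "DOWNLOAD_TXT" for ext in CODE_EXTENSIONS},
}
```

In this sketch `EXTENSION_MAP[".html"]` stays `"DOWNLOAD_HTML"`, while `REVERSED_MAP[".html"]` becomes `"DOWNLOAD_TXT"`.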
f2c833c to
15eb0dd
codeCraft-Ritik
left a comment
Excellent use of Python features to keep the implementation concise
MaojiaSheng
approved these changes
Mar 15, 2026
Summary
Fix URL resource import to preserve original filenames and extensions instead of using temp file names and converting everything to .md.

Why this matters
When importing resources via URL (e.g., COS paths), three bugs compound:
- The temp file name (tmptzedil33) leaks into the final URI as a directory name
- File extensions (.py, .js, etc.) are converted to .md by the markdown parser
- URL-encoded characters in filenames are not decoded

Expected: viking://resources/dir/schemas.py
Actual: viking://resources/dir/schemas/schemas.md

Changes
In openviking/parse/parsers/html.py:
- Add a CODE_EXTENSIONS set (~40 common code/text extensions) to URLTypeDetector so code files route to DOWNLOAD_TXT instead of WEBPAGE parsing
- Add _extract_filename_from_url() to extract and URL-decode the original filename from URLs
- Add _save_downloaded_text() to save downloaded text/code files with their original filename and extension, bypassing the MarkdownParser
- Use urllib.parse.unquote() for proper handling of URL-encoded characters

Testing
Added tests/parse/test_url_filename_preservation.py with tests for:

Fixes #251

This contribution was developed with AI assistance (Claude Code).
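The filename handling listed under Changes can be sketched roughly as follows. These are not the PR's `_extract_filename_from_url()` or `_save_downloaded_text()`: the signatures, the `dest_dir` and `fallback` parameters, and the path-traversal guard are assumptions made for illustration. Only `urllib.parse.unquote()` is taken from the description.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def extract_filename_from_url(url: str) -> str:
    """Return the URL-decoded last path segment, or "" if there is none."""
    # Take the last segment before decoding so an encoded "/" (%2F)
    # cannot change which segment is treated as the filename.
    raw_name = PurePosixPath(urlparse(url).path).name
    decoded = unquote(raw_name)
    # Assumed guard: drop any directory part that decoding introduced.
    safe = PurePosixPath(decoded).name
    return "" if safe in {".", ".."} else safe


def save_downloaded_text(
    content: str, url: str, dest_dir: Path, fallback: str = "download.txt"
) -> Path:
    """Write content under its original name and extension, not a temp name."""
    name = extract_filename_from_url(url) or fallback
    target = dest_dir / name
    target.write_text(content, encoding="utf-8")
    return target
```

Under these assumptions, a URL ending in `/dir/my%20file.py` is saved as `my file.py` in `dest_dir`, so the extension survives and no temp file name appears in the result.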