
fix(parse): preserve original filename and extension when importing from URL #619

Merged
MaojiaSheng merged 2 commits into volcengine:main from mvanhorn:osc/251-url-resource-filename-preservation
Mar 15, 2026

@mvanhorn
Contributor

Summary

Fix URL resource import to preserve original filenames and extensions instead of using temp file names and converting everything to .md.

Why this matters

When importing resources via URL (e.g., COS paths), three bugs compound:

  1. Temp filename (like tmptzedil33) leaks into the final URI as a directory name
  2. Code files (.py, .js, etc.) are converted to .md by the markdown parser
  3. URL-encoded characters in filenames are not decoded

  • #251: original report with curl repro
  • @ZaynJarvis (COLLABORATOR) confirmed: "good catch. I can reproduce the issue"
  • @MaojiaSheng (COLLABORATOR) confirmed: "The path looks like a url indicated code file, and it hits wrong parser"

Expected: viking://resources/dir/schemas.py
Actual: viking://resources/dir/schemas/schemas.md

Changes

In openviking/parse/parsers/html.py:

  • Added CODE_EXTENSIONS set (~40 common code/text extensions) to URLTypeDetector so code files route to DOWNLOAD_TXT instead of WEBPAGE parsing
  • Added _extract_filename_from_url() to extract and URL-decode the original filename from URLs
  • Added _save_downloaded_text() to save downloaded text/code files with their original filename and extension, bypassing the MarkdownParser
  • Applied urllib.parse.unquote() for proper handling of URL-encoded characters
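
For readers who want to see the filename extraction and decoding step concretely, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name `extract_filename_from_url`, the `default` fallback, and the example URLs are assumptions made for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def extract_filename_from_url(url: str, default: str = "download") -> str:
    """Return the URL-decoded last path segment of `url` (hypothetical helper)."""
    # urlparse separates the path from the query string and fragment,
    # so suffixes like "?token=..." never end up in the filename.
    path = urlparse(url).path
    # Decode percent-escapes such as %20, then keep only the basename.
    name = PurePosixPath(unquote(path)).name
    # Fall back to a placeholder when the URL has no usable last segment.
    return name or default


print(extract_filename_from_url("https://example.com/dir/schemas.py?token=abc"))
print(extract_filename_from_url("https://example.com/docs/my%20notes.txt"))
```

Decoding before taking the basename means an encoded slash (`%2F`) cannot smuggle extra directory components into the saved name.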

Testing

Added tests/parse/test_url_filename_preservation.py with tests for:

  • Filename extraction from various URL formats
  • URL decoding of encoded characters
  • Code extension detection
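
As a rough illustration of what the code extension detection cases check, the sketch below classifies a URL by the suffix of its decoded path. The `is_code_url` helper and the small extension set are hypothetical stand-ins, not the project's `URLTypeDetector` or its real `CODE_EXTENSIONS`.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# Illustrative subset only (assumption); the project's real set is larger.
CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".txt", ".json"}


def is_code_url(url: str) -> bool:
    """True when the URL path ends in a known code/text extension."""
    # Look at the path only, so query strings do not affect the suffix.
    suffix = PurePosixPath(unquote(urlparse(url).path)).suffix.lower()
    return suffix in CODE_EXTENSIONS


cases = {
    "https://example.com/dir/schemas.py": True,
    "https://example.com/dir/SCHEMAS.PY?sign=abc": True,
    "https://example.com/blog/article": False,
    "https://example.com/page.html": False,
}
for url, expected in cases.items():
    assert is_code_url(url) is expected, url
```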

Fixes #251

This contribution was developed with AI assistance (Claude Code).

"""

# Common code/text file extensions that should be downloaded, not parsed as web pages
CODE_EXTENSIONS = {
Collaborator

Please reuse CODE_EXTENSIONS from openviking/parse/parsers/constants.py.

Contributor Author

Done in 15eb0dd - imported CODE_EXTENSIONS from constants.py and removed the duplicate set. Put the spread first in EXTENSION_MAP so .html/.htm entries still correctly map to DOWNLOAD_HTML.
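
The ordering point relies on how Python dict displays resolve duplicate keys: later entries win. A minimal sketch, assuming for illustration that the shared set contains `.html` and using made-up string values in place of the real routing constants:

```python
# Placeholder values (assumption); the real constants live in the project.
DOWNLOAD_TXT = "download_txt"
DOWNLOAD_HTML = "download_html"

# Hypothetical shared set that happens to include .html.
CODE_EXTENSIONS = {".py", ".js", ".html"}

EXTENSION_MAP = {
    # Spread first: every shared extension defaults to DOWNLOAD_TXT.
    **{ext: DOWNLOAD_TXT for ext in CODE_EXTENSIONS},
    # Explicit entries come later, so they override the spread values.
    ".html": DOWNLOAD_HTML,
    ".htm": DOWNLOAD_HTML,
}

assert EXTENSION_MAP[".py"] == DOWNLOAD_TXT
assert EXTENSION_MAP[".html"] == DOWNLOAD_HTML
```

Had the spread come last, `.html` would have been overwritten with the text-download value.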

mvanhorn and others added 2 commits March 14, 2026 21:40
…rom URL

When importing resources via URL (e.g., COS/HTTP paths to code files),
three bugs occurred: an extra directory was created using the temp file
name, the file extension was changed from .py to .md, and URL-encoded
characters in filenames were not decoded.

Root cause: code file extensions (.py, .js, etc.) were not recognized
by URLTypeDetector, causing them to be parsed as web pages through the
HTML-to-markdown pipeline. This lost the original filename and always
produced .md output.

Changes:
- Add common code/text file extensions to URLTypeDetector so they route
  to download instead of webpage parsing
- Extract and URL-decode the original filename from the URL path
- Save downloaded text/code files with their original name and extension
  instead of routing through MarkdownParser
- Pass original filename to markdown and PDF parsers as source_path /
  resource_name so temp file names don't leak into the final URI

Fixes volcengine#251

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Import CODE_EXTENSIONS from openviking.parse.parsers.constants instead of
duplicating the set in URLTypeDetector. Spread comes first in EXTENSION_MAP
so explicit entries (.html, .htm) correctly override to DOWNLOAD_HTML.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mvanhorn mvanhorn force-pushed the osc/251-url-resource-filename-preservation branch from f2c833c to 15eb0dd Compare March 15, 2026 04:40

@codeCraft-Ritik codeCraft-Ritik left a comment

Excellent use of Python features to keep the implementation concise

@MaojiaSheng MaojiaSheng merged commit 45915ae into volcengine:main Mar 15, 2026
6 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 15, 2026