@pirate pirate commented Dec 29, 2025

Implements a new plugin to download 3D and CAD assets from web pages.

Features

  • Detects and downloads files with 3D/CAD extensions (.blend, .stl, .obj, .gltf, .glb, .fbx, .vrm, .usdz, etc.)
  • Parses HTML from singlefile/dom extractors to find asset URLs
  • Configurable timeout, max file size, SSL verification, and user agent
  • Uses curl for downloads (already available in most systems)
  • Isolated plugin that doesn't depend on ArchiveBox core
  • Includes tests and UI templates

Fixes #668

Generated with Claude Code


Summary by cubic

Adds a new caddl plugin to detect and download 3D/CAD assets from captured pages using Chrome + Puppeteer via the existing browser session. Addresses issue #668 by enabling capture of common 3D formats with simple UI views and configurable limits, including support for login/captcha sites.

  • New Features

    • Finds asset links in the DOM (href/src and page text); resolves relative URLs.
    • Supports .blend, .stl, .obj, .gltf/.glb, .fbx, .vrm, .usdz, .step/.stp, and more.
    • Downloads through Chrome CDP with timeout and max size; no curl or extra SSL/cookie config needed.
    • Emits ArchiveResult per file; includes embed, thumbnail, icon, and fullscreen templates.
    • Isolated plugin with tests and filename sanitization; no ArchiveBox core dependency.
  • Bug Fixes

    • Fixes thumbnail layout by adding display:flex so the emoji and label stack and center correctly.

Written for commit e937688. Summary will update automatically on new commits.


Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


3 issues found across 8 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="archivebox/plugins/caddl/templates/thumbnail.html">

<violation number="1" location="archivebox/plugins/caddl/templates/thumbnail.html:3">
P2: Missing `display: flex;` on the inner div. The `flex-direction: column` property has no effect without it, so the emoji and '3D Asset' text won't stack vertically as intended.</violation>
</file>

<file name="archivebox/plugins/caddl/config.json">

<violation number="1" location="archivebox/plugins/caddl/config.json:6">
P2: Missing `CADDL_BINARY` configuration for curl path. All other plugins that use external binaries (wget, ytdlp, gallerydl, singlefile) define a `*_BINARY` property to allow configuring the binary path. For consistency, consider adding:
```json
"CADDL_BINARY": {
  "type": "string",
  "default": "curl",
  "description": "Path to curl binary"
}
```</violation>
</file>

<file name="archivebox/plugins/caddl/on_Snapshot__65_caddl.bg.py">

<violation number="1" location="archivebox/plugins/caddl/on_Snapshot__65_caddl.bg.py:170">
P1: Path traversal vulnerability: filenames consisting only of dots (e.g., `..`) pass through sanitization and can escape the output directory. Add validation to reject or replace filenames that are `.` or `..`, or use `resolve()` to verify the final path stays within `output_dir`.</violation>
</file>
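
The two mitigations suggested in the path traversal finding can be sketched as follows. This is a minimal illustration in JavaScript (Node.js), not the plugin's own code: `sanitizeFilename` and `safeJoin` are hypothetical helpers, and the allow-list of characters and the `download` fallback name are assumptions.

```javascript
const path = require('node:path');

// Hypothetical helper: reduce an untrusted name to a safe single path component.
function sanitizeFilename(name) {
  // Keep only the last component, treating both / and \ as separators.
  const base = String(name).split(/[\\/]/).pop();
  // Replace anything outside a conservative allow-list (an assumption).
  const cleaned = base.replace(/[^A-Za-z0-9._-]/g, '_');
  // Reject names that are empty or made up only of dots ('.', '..', '...').
  if (/^\.*$/.test(cleaned)) return 'download';
  return cleaned;
}

// Hypothetical helper: resolve the final path and verify it stays in outputDir.
function safeJoin(outputDir, name) {
  const root = path.resolve(outputDir);
  const target = path.resolve(root, sanitizeFilename(name));
  if (!target.startsWith(root + path.sep)) {
    throw new Error(`refusing to write outside ${root}`);
  }
  return target;
}
```

Doing both checks is deliberate: the sanitizer handles the known bad names, and the resolved-path check catches anything the sanitizer might miss.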



pirate commented Dec 29, 2025

@claude unfortunately I think this basic approach won't work that well in practice, most STL sites like thangs/thingiverse/etc. require solving a captcha to download STLs.

at a minimum I think we need to convert this to use the chrome tab and puppeteer for everything instead of singlefile and curl so it uses the logged in user session

see infiniscroll plugin for an example


claude bot commented Dec 29, 2025

Claude finished @pirate's task in 3m 36s


Converting caddl plugin to use Chrome + Puppeteer ✓

Understanding the request: Most STL sites require captcha solving, so we need to use Chrome + puppeteer with the logged-in user session instead of singlefile + curl.

Todo List

  • Read current caddl plugin implementation
  • Read infiniscroll plugin as reference implementation
  • Understand Chrome/puppeteer pattern in ArchiveBox plugins
  • Rewrite caddl plugin to use Chrome/puppeteer
  • Update plugin configuration
  • Update tests
  • Commit and push changes

Changes Made

Converted from Python + curl to JavaScript + Puppeteer:

  1. New implementation (on_Snapshot__65_caddl.bg.js):

    • Uses puppeteer-core to connect to existing Chrome session via CDP
    • Leverages authenticated user session (handles captchas, login states, cookies automatically)
    • Finds CAD/3D asset URLs by scanning page DOM and text content
    • Downloads files using Chrome's download mechanism (preserves session state)
    • Validates file sizes and sanitizes filenames (including path traversal protection)
  2. Configuration updates (config.json):

    • Added "required_plugins": ["chrome"] dependency
    • Removed obsolete options: CADDL_CHECK_SSL_VALIDITY, CADDL_USER_AGENT, CADDL_COOKIES_FILE
    • These are now handled automatically by the Chrome session
  3. Test updates (tests/test_caddl.py):

    • Updated to test JavaScript implementation
    • Added tests for size limit parsing and filename sanitization
    • Tests verify proper error handling when Chrome session is unavailable
  4. Removed files:

    • on_Snapshot__65_caddl.bg.py (old Python implementation)
    • binaries.jsonl (no longer needs curl binary)
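
As a rough illustration of the asset-detection step described above, the sketch below resolves candidate links against the page URL and keeps those whose path ends in a known 3D/CAD extension. It is an assumption-laden sketch rather than the plugin's code: `findAssetUrls` and `CAD_EXTENSIONS` are hypothetical names, the extension list is only the subset named in this PR, and the candidates are passed in as plain strings (in the plugin they would come from the page's href/src attributes and text) so the example runs without Chrome.

```javascript
// Assumed subset of extensions, taken from the lists in this PR.
const CAD_EXTENSIONS = [
  '.blend', '.stl', '.obj', '.gltf', '.glb',
  '.fbx', '.vrm', '.usdz', '.step', '.stp',
];

// Hypothetical helper: return de-duplicated absolute URLs of likely assets.
function findAssetUrls(candidates, pageUrl) {
  const found = new Set();
  for (const candidate of candidates) {
    let url;
    try {
      url = new URL(candidate, pageUrl); // resolves relative URLs
    } catch {
      continue; // skip strings that cannot be parsed as URLs
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:') continue;
    // Match on the path only, so query strings such as ?dl=1 are ignored.
    const pathname = url.pathname.toLowerCase();
    if (CAD_EXTENSIONS.some((ext) => pathname.endsWith(ext))) {
      found.add(url.href);
    }
  }
  return [...found];
}
```

Matching on `url.pathname` instead of the raw string keeps download links with query parameters from being missed.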

Why This Approach Is Better

The Chrome + Puppeteer approach solves the core issue you identified:

  • Authenticated sessions: Uses the logged-in Chrome session, so downloads work on sites requiring login (Thingiverse, Thangs, etc.)
  • Captcha handling: Any captchas would be solved in the main Chrome browser before the plugin runs
  • Cookie preservation: Session cookies are automatically used
  • Real browser context: Downloads happen in the same context as the page viewing, avoiding bot detection

This matches the infiniscroll plugin pattern and should work much more reliably in practice.

Note on the Cubic Review Issues

Two of the three issues identified by cubic-dev-ai in the review are now resolved:

  1. ✓ Path traversal vulnerability - Fixed with explicit . and .. validation in sanitizeFilename()
  2. ✓ Missing CADDL_BINARY config - No longer needed (curl dependency removed)
  3. ✗ Missing display: flex - Not yet fixed; thumbnail.html was not modified in this conversion

Would you like me to fix the CSS in thumbnail.html as well?

Branch: claude/issue-668-20251229-2145

- Rewrite implementation in JavaScript using puppeteer-core
- Connect to existing Chrome session via CDP for authenticated downloads
- Removes need for curl binary and cookie/SSL configuration
- Fixes issues with sites requiring captcha/login (Thingiverse, Thangs, etc)
- Updates tests to reflect JavaScript implementation
- Adds path traversal protection for filenames

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 5 files (changes from recent commits).

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="archivebox/plugins/caddl/tests/test_caddl.py">

<violation number="1" location="archivebox/plugins/caddl/tests/test_caddl.py:69">
P2: These tests duplicate the function implementations inline rather than testing the actual script's functions. If `parseSizeLimit` or `sanitizeFilename` in `on_Snapshot__65_caddl.bg.js` is modified, these tests will still pass because they're testing their own inline copy. Consider importing and testing the actual functions from the script, or using a separate test module that the script can also import.</violation>
</file>
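
One way to make such helpers testable against the real implementation is to export them from the script. The sketch below is hypothetical: the body of `parseSizeLimit` and the size formats it accepts (`750m`, `2g`, plain byte counts) are assumptions, since the actual function is not shown in this conversation.

```javascript
// Hypothetical size-limit parser; the accepted formats are an assumption.
function parseSizeLimit(value) {
  const match = /^(\d+(?:\.\d+)?)\s*([kmg]?)b?$/i.exec(String(value).trim());
  if (!match) return null; // empty or invalid input
  const multipliers = { '': 1, k: 1024, m: 1024 ** 2, g: 1024 ** 3 };
  return Math.floor(Number(match[1]) * multipliers[match[2].toLowerCase()]);
}

// Exporting the helper lets a test file require() the real implementation
// instead of maintaining its own copy.
if (typeof module !== 'undefined') {
  module.exports = { parseSizeLimit };
}
```

With this shape, a change to the parser is exercised by the tests directly, which is the property the review comment asks for.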

<file name="archivebox/plugins/caddl/on_Snapshot__65_caddl.bg.js">

<violation number="1" location="archivebox/plugins/caddl/on_Snapshot__65_caddl.bg.js:216">
P1: Throwing an error inside an event listener callback won't propagate to the outer try/catch - the max size check during download is effectively broken. The error will be unhandled while the download continues. Consider using a flag variable and checking it after the navigation, or calling `downloadPage.close()` to abort.</violation>
</file>
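
The flag-based alternative suggested above can be sketched as follows. This is an illustration only: it uses a plain Node.js `EventEmitter` as a stand-in for the browser's download events, and `watchDownload`, the `chunk` event name, and the `abort` callback are all hypothetical.

```javascript
const { EventEmitter } = require('node:events');

// Hypothetical stand-in for a download that reports progress through events.
function watchDownload(source, maxBytes, abort) {
  let received = 0;
  let sizeExceeded = false; // set by the listener, read by the caller later

  source.on('chunk', (bytes) => {
    received += bytes;
    if (received > maxBytes && !sizeExceeded) {
      // Throwing here would surface wherever the event is emitted, not in a
      // try/catch around the code that registered this listener, so the
      // listener only records the condition and requests an abort.
      sizeExceeded = true;
      abort();
    }
  });

  return {
    // Call this after the download has settled, inside the outer try/catch.
    assertWithinLimit() {
      if (sizeExceeded) {
        throw new Error(`download exceeded ${maxBytes} bytes`);
      }
    },
  };
}
```

The error is raised from code the caller invokes directly, so an enclosing try/catch sees it.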


cubic-dev-ai[bot] and others added 3 commits December 29, 2025 22:38
## Summary

Fixes a CSS issue in the CAD/3D asset thumbnail template where flex
properties were being applied without `display: flex`.

## Problem

The inner div in `archivebox/plugins/caddl/templates/thumbnail.html` had
`flex-direction: column` and `align-items: center` styles, but was
missing `display: flex`. Without the display property set to flex, these
flex properties have no effect, so the emoji (🧊) and "3D Asset" text
weren't stacking vertically as intended.

## Solution

Added `display: flex;` to the inner div's style attribute so the flexbox
layout takes effect and the content stacks correctly.

## Changes

- `archivebox/plugins/caddl/templates/thumbnail.html`: Added `display: flex;` to line 3

---
## Summary by cubic
Fixed the CAD/3D asset thumbnail layout by adding display: flex to the
inner div so flexbox applies and the 🧊 emoji and "3D Asset" label stack
vertically and stay centered.

<sup>Written for commit a01bdd4.
Summary will update automatically on new commits.</sup>

- Tests now extract and execute actual parseSizeLimit and sanitizeFilename functions from the JS file
- Added more edge case tests (empty strings, invalid input, path traversal)
- Fixed P1 bug where throwing in event listener wouldn't propagate to try/catch
- Use flag variables to track size exceeded state and check after navigation

Co-authored-by: Nick Sweeting <pirate@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

New Extractor Idea: Find/write a "cad-dl" to save 3d assets, gltf files, CAD files, shapefiles, STLs, VR views, etc.
