@untitaker can I get a review from you for the html5gum extractor changes? The extractor input is now a As a consequence, I converted some methods to return |
|
mupdf has an external dependency, and I don't think that's light given the amount of code you need to add... it also doesn't seem extensible if more formats are to be supported in the future. The way ripgrep-all deals with binary text formats is to use external binaries: The good part of that design is that the adapter is extensible. |
I haven't considered it. I think it would be a good idea regardless, but I am not sure if it solves your problem. if throwing an error in the emitter is something you want, do you really accept invalid utf8 at all? and if not, why not validate the input for utf8 and keep the unsafe/unchecked conversions? |
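To make the suggestion above concrete, here is a minimal sketch of "validate once, then use unchecked conversions". `ValidatedInput` is a hypothetical name for illustration, not a type from lychee or html5gum; only `std::str::from_utf8` and `std::str::from_utf8_unchecked` are real standard library calls.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper (not part of lychee or html5gum): bytes that
/// have been checked for UTF-8 validity exactly once, at construction.
pub struct ValidatedInput<'a> {
    bytes: &'a [u8],
}

impl<'a> ValidatedInput<'a> {
    /// Rejects invalid UTF-8 up front, so nothing downstream has to fail.
    pub fn new(bytes: &'a [u8]) -> Result<Self, Utf8Error> {
        std::str::from_utf8(bytes)?;
        Ok(Self { bytes })
    }

    /// Skips the second validation pass.
    pub fn as_str(&self) -> &'a str {
        // SAFETY: `new` validated these bytes, and the shared borrow
        // guarantees they have not been modified since.
        unsafe { std::str::from_utf8_unchecked(self.bytes) }
    }
}
```

In this simplified form one could just store the `&str` returned by `from_utf8`; the unchecked call only pays off when the bytes are later sliced or handed around in places where re-validating would be wasteful.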
The amount of code I personally added is minimal; just one file. On Ubuntu:
On mac:
(I don't know where the big discrepancy between those two OSes comes from. Can't see any big linking differences right now.) No external packages need to be installed for mupdf support; I tested it in a vanilla In the worst case users can just disable the feature and build their own binary at no additional cost.
MuPDF apparently supports PDF, XPS, EPUB, XHTML, and CBZ.
I'm not against using a similar pattern in the future, but I'd use external tools as a last resort. Even though it's unlikely for pandoc, in general CLI interfaces might change over time. Bundling external tools brings its own set of challenges: bundling, licensing, availability on different operating systems etc. The way forward could be to have an Curious about your thoughts @lebensterben. |
|
Regarding the big binary on ubuntu: Results in 14M binary. |
|
mupdf is licensed under the GNU Affero General Public License... lychee will need to be relicensed. |
|
Meanwhile, calling an external binary is perfectly okay. You don't need to bundle any external binaries. lychee as a user-facing CLI tool relies on the user to install any third-party binaries, provided that the user needs that particular functionality. lychee as a CI workflow may add a few steps to grab the binary from its release page. This won't break any copyright. lychee as a service is free to call any binary in its hosting environment, and since we're not delivering the binary as the product, we don't need to worry about most licenses. AGPL may be an exception (I'm not very sure) |
|
That's a good point. I don't think AGPL would work for us. |
|
Even though it's sad, it's clear that this PR can't be merged in its current form because of licensing issues. What @lebensterben mentioned is the way to go. We'll work on a dedicated Trait and implementation for calling external binaries (based on the work in ripgrep), but in a separate pull request. Thanks for the input. |
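A rough sketch of what such a trait could look like, under stated assumptions: `ExternalAdapter` and `Pdftotext` are hypothetical names, not lychee's or ripgrep-all's actual API, and the example assumes the user has poppler's `pdftotext` on their `PATH` (where `pdftotext <file> -` writes the text to stdout).

```rust
use std::ffi::OsString;
use std::io;
use std::path::Path;
use std::process::Command;

/// Hypothetical trait: one implementation per external conversion tool.
pub trait ExternalAdapter {
    /// Name of the binary, looked up on the user's PATH.
    fn program(&self) -> &str;

    /// Arguments that make the tool print plain text to stdout.
    fn args(&self, input: &Path) -> Vec<OsString>;

    /// Runs the tool and returns whatever it wrote to stdout.
    fn to_text(&self, input: &Path) -> io::Result<String> {
        let output = Command::new(self.program())
            .args(self.args(input))
            .output()?; // returns an error if the binary cannot be started
        if !output.status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("{} exited with {}", self.program(), output.status),
            ));
        }
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    }
}

/// Example adapter; assumes poppler's `pdftotext` is installed.
pub struct Pdftotext;

impl ExternalAdapter for Pdftotext {
    fn program(&self) -> &str {
        "pdftotext"
    }

    fn args(&self, input: &Path) -> Vec<OsString> {
        // A trailing "-" asks pdftotext to write to stdout.
        vec![input.as_os_str().to_owned(), OsString::from("-")]
    }
}
```

Supporting another format would then only need a new implementation of `program` and `args`. A missing binary surfaces as an ordinary `io::Error`, which fits the "user installs what they need" model described above.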
This adds preliminary PDF and EPUB support using mupdf.
As a side-effect, we also accept raw binary (non-UTF8) input now.
This required some refactoring in the existing extractors.
Overall, we just defer the UTF8 conversion to the extractors that
require it; extractors that support binary input
can skip that conversion.
Extractors are fallible now. This way we can support cases where the
input cannot be converted to the required internal formats.
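
As an illustration of the two points above, here is a minimal sketch of a fallible extractor interface that takes raw bytes. All names (`Extractor`, `ExtractError`, `PlaintextExtractor`, `RawBytesExtractor`) are hypothetical and much simpler than the real lychee extractors; the whitespace-based link detection is only a stand-in.

```rust
/// Hypothetical error type for extractors that cannot handle their input.
#[derive(Debug)]
pub enum ExtractError {
    InvalidUtf8(std::str::Utf8Error),
}

/// Hypothetical extractor interface: raw bytes in, links out, may fail.
pub trait Extractor {
    fn extract(&self, input: &[u8]) -> Result<Vec<String>, ExtractError>;
}

fn is_link(word: &[u8]) -> bool {
    word.starts_with(b"http://") || word.starts_with(b"https://")
}

/// Needs text, so it performs (and may fail) the UTF8 conversion itself.
pub struct PlaintextExtractor;

impl Extractor for PlaintextExtractor {
    fn extract(&self, input: &[u8]) -> Result<Vec<String>, ExtractError> {
        let text = std::str::from_utf8(input).map_err(ExtractError::InvalidUtf8)?;
        Ok(text
            .split_whitespace()
            .filter(|word| is_link(word.as_bytes()))
            .map(str::to_owned)
            .collect())
    }
}

/// Works on bytes directly, so it skips the up-front UTF8 conversion.
pub struct RawBytesExtractor;

impl Extractor for RawBytesExtractor {
    fn extract(&self, input: &[u8]) -> Result<Vec<String>, ExtractError> {
        Ok(input
            // Split on anything that is not a printable ASCII character.
            .split(|byte| !byte.is_ascii_graphic())
            .filter(|chunk| is_link(chunk))
            // Each chunk is pure ASCII here, so this conversion is lossless.
            .map(|chunk| String::from_utf8_lossy(chunk).into_owned())
            .collect())
    }
}
```

With input containing invalid UTF8, the plaintext extractor returns an error while the byte-oriented one still finds the links.
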
Since PDF support depends on system bindings within mupdf-rs,
we put it behind a feature flag (which is enabled by default).
The alternative was to use pandoc or a similar conversion tool for PDF support,
but mupdf is a lightweight alternative which provides all the functionality we need.