Skip to content

Add PDF/EPUB support#740

Closed
mre wants to merge 2 commits into
masterfrom
mupdf
Closed

Add PDF/EPUB support#740
mre wants to merge 2 commits into
masterfrom
mupdf

Conversation

@mre

@mre mre commented Aug 17, 2022

Copy link
Copy Markdown
Member

This adds preliminary PDF and EPUB support using mupdf.
As a side-effect, we also accept raw binary (non-UTF8) input now.
This required some refactoring in the existing extractors.
Overall we just defer the UTF8 conversions to the extractors, if
required, though. Ones that support binary input
can skip that conversion.

Extractors are fallible now. This way we can support cases where the
input cannot be converted to the required internal formats.

Since PDF support depends on system bindings within mupdf-rs,
we put it behind a feature flag (which is enabled by default).
The alternative was to use pandoc or a similar conversion tool for PDF support,
but mupdf is a a lightweight alternative which provides all the functionality we need.

@mre

mre commented Aug 17, 2022

Copy link
Copy Markdown
Member Author

@untitaker can I get a review from you for the html5gum extractor changes?

The extractor input is now a T: AsRef<[u8]> instead of UTF-8.
I know that html5gum supports that. However we used a few unsafe calls before which relied on the fact that the input was valid UTF-8, which is no longer guaranteed to be the case.

As a consequence, I converted some methods to return Result instead, so when attempting to convert to UTF-8 it would return an error, but html5gum::Emitter is not fallible, so I had to add a few .expect() calls, which I'd love to avoid. Have you considered making the Emitter trait methods return Result to handle such cases? Is there a better approach that I might be missing?

@mre mre changed the title Add PDF support Add PDF/EPUB support Aug 17, 2022
@lebensterben

Copy link
Copy Markdown
Member

mupdf has external dependency and I don't think that's light given the amount of code you need to add... and that doesn't seem to be extensible if more format is to be supported in the future.


the way ripgrep-all deal with binary text format is to use external binaries:

https://github.com/phiresky/ripgrep-all/blob/2d63efd3156a2eca633b1b43d67931fe1cb0df6e/src/adapters/custom.rs#L80-L112

The good part of design is that adaptor is extensible.

@untitaker

Copy link
Copy Markdown
Collaborator

As a consequence, I converted some methods to return Result instead, so when attempting to convert to UTF-8 it would return an error, but html5gum::Emitter is not fallible, so I had to add a few .expect() calls, which I'd love to avoid. Have you considered making the Emitter trait methods return Result to handle such cases? Is there a better approach that I might be missing?

I haven't considered it. I think it would be a good idea regardless, but I am not sure if it solves your problem. if throwing an error in the emitter is something you want, do you really accept invalid utf8 at all? and if not, why not validate the input for utf8 and keep the unsafe/unchecked conversions?

@mre

mre commented Aug 18, 2022

Copy link
Copy Markdown
Member Author

mupdf has external dependency and I don't think that's light given the amount of code you need to add...

The amount of code I personally added is minimal; just one file.
In terms of the external dependency, it's relatively manageable as well:

On Ubuntu:

  • lychee release build master: 271M
  • lychee release build mupdf: 325M

On mac:

  • lychee release build master: 20M
  • lychee release build mupdf: 27M

(I don't know where the big discrepancy between those two OSes comes from. Can't see any big linking differences right now.)

No external packages need to be installed for mupdf support; I tested it in a vanilla rust Docker container (based on Ubuntu) and macOS.

In the worst case users can just disable the feature and build their own binary at no additional cost.

and that doesn't seem to be extensible if more format is to be supported in the future.

MuPDF apparently supports PDF, XPS, EPUB, XHTML, and CBZ.
It has a limited scope and I'd say that's fine. Better to have a tool which works well for a few use-cases instead of one with mediocre support for many use-cases.

the way ripgrep-all deal with binary text format is to use external binaries:

I'm not against using a similar pattern in the future, but I'd use external tools as a last resort. Even though it's unlikely for pandoc, in general CLI interfaces might change over time. Bundling external tools brings its own set of challenges: bundling, licensing, availability on different operating systems etc.

The way forward could be to have an Extractor trait and implement it for different formats. Every extractor implementation would define which formats it can handle. There can be more than one extractor per file format.
The Extractor trait would be quite similar to the adapter trait you mentioned.

Curious about your thoughts @lebensterben.

@mre

mre commented Aug 18, 2022

Copy link
Copy Markdown
Member Author

Regarding the big binary on ubuntu:

strip target/release/lychee

Results in 14M binary.

@mre

mre commented Aug 18, 2022

Copy link
Copy Markdown
Member Author

@lebensterben

Copy link
Copy Markdown
Member

mupdf has GNU Affero General Public License...

lychee will need to be relicensed.

@lebensterben

Copy link
Copy Markdown
Member

meanwhile, calling an external binary is perfectly okay. You don't need to bundle any external binaries.

lychee as a user-facing CLI tool relies on the user to install any third party binaries, provided that the user needs that particular functionality.

lychee as a CI workflow may add a few step to grab the binary from their release page. This won't break any copyright.

lychee as a service is free to call any binary in its hosting environment, and since we're not delivering the binary as the product, we don't need to worry about most licenses. AGPL may be an exception (I'm not very sude)

@mre

mre commented Aug 18, 2022

Copy link
Copy Markdown
Member Author

That's a good point. I don't think AGPL would work for us.
Should have checked the MuPDF license before. 😞

@mre

mre commented Dec 22, 2022

Copy link
Copy Markdown
Member Author

Even though it's sad, it's clear that this PR can't be merged in its current form because of licensing issues. What @lebensterben mentioned is the way to go. We'll work on a dedicated Trait and implementation for calling external binaries (based on the work in ripgrep), but in a separate pull request. Thanks for the input.

@mre mre closed this Dec 22, 2022
@mre mre mentioned this pull request Dec 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants