Add PDF/EPUB support by mre · Pull Request #740 · lycheeverse/lychee

mre · 2022-08-17T13:46:09Z

This adds preliminary PDF and EPUB support using mupdf.
As a side effect, we now also accept raw binary (non-UTF-8) input.
This required some refactoring of the existing extractors.
Overall, we simply defer the UTF-8 conversion to the extractors that
require it; extractors that support binary input can skip it.

Extractors are fallible now. This way we can support cases where the
input cannot be converted to the required internal formats.
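
To illustrate the idea of deferring the conversion, here is a minimal sketch; it is not the code from this PR. The names `ExtractError`, `extract_text`, and `extract_bytes` are hypothetical. The text-based extractor converts to UTF-8 itself and can fail, while the byte-based one never needs the conversion.

```rust
use std::str::Utf8Error;

/// Hypothetical error type for this sketch (not lychee's real error type).
#[derive(Debug)]
pub enum ExtractError {
    InvalidUtf8(Utf8Error),
}

/// A text-based extractor: it needs valid UTF-8, so it performs the
/// conversion itself and reports a failure instead of panicking.
pub fn extract_text<T: AsRef<[u8]>>(input: T) -> Result<Vec<String>, ExtractError> {
    let text = std::str::from_utf8(input.as_ref()).map_err(ExtractError::InvalidUtf8)?;
    Ok(text
        .split_whitespace()
        .filter(|w| w.starts_with("http://") || w.starts_with("https://"))
        .map(String::from)
        .collect())
}

/// A binary-capable extractor: it scans the raw bytes directly and only
/// converts the matching tokens (lossily), so invalid UTF-8 elsewhere in
/// the input is not an error.
pub fn extract_bytes<T: AsRef<[u8]>>(input: T) -> Result<Vec<String>, ExtractError> {
    let links = input
        .as_ref()
        .split(|b| b.is_ascii_whitespace())
        .filter(|tok| tok.starts_with(b"http://") || tok.starts_with(b"https://"))
        .map(|tok| String::from_utf8_lossy(tok).into_owned())
        .collect();
    Ok(links)
}
```

Both functions accept `&str`, `String`, `&[u8]`, or `Vec<u8>`, since all of these implement `AsRef<[u8]>`.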

Since PDF support depends on system bindings within mupdf-rs,
we put it behind a feature flag (which is enabled by default).
The alternative was to use pandoc or a similar conversion tool for PDF support,
but mupdf is a lightweight alternative which provides all the functionality we need.

mre · 2022-08-17T14:23:12Z

@untitaker can I get a review from you for the html5gum extractor changes?

The extractor input is now a T: AsRef<[u8]> instead of a UTF-8 string.
I know that html5gum supports that. However, we previously used a few unsafe calls that relied on the input being valid UTF-8, which is no longer guaranteed.

As a consequence, I converted some methods to return Result instead, so when attempting to convert to UTF-8 it would return an error, but html5gum::Emitter is not fallible, so I had to add a few .expect() calls, which I'd love to avoid. Have you considered making the Emitter trait methods return Result to handle such cases? Is there a better approach that I might be missing?

lebensterben · 2022-08-17T17:00:11Z

mupdf has an external dependency and I don't think that's light given the amount of code you need to add... and that doesn't seem to be extensible if more formats are to be supported in the future.

The way ripgrep-all deals with binary text formats is to use external binaries:

https://github.com/phiresky/ripgrep-all/blob/2d63efd3156a2eca633b1b43d67931fe1cb0df6e/src/adapters/custom.rs#L80-L112

The good part of that design is that the adapter is extensible.

untitaker · 2022-08-17T18:49:37Z

> As a consequence, I converted some methods to return Result instead, so when attempting to convert to UTF-8 it would return an error, but html5gum::Emitter is not fallible, so I had to add a few .expect() calls, which I'd love to avoid. Have you considered making the Emitter trait methods return Result to handle such cases? Is there a better approach that I might be missing?

I haven't considered it. I think it would be a good idea regardless, but I am not sure it solves your problem. If throwing an error in the emitter is something you want, do you really accept invalid UTF-8 at all? And if not, why not validate the input as UTF-8 up front and keep the unsafe/unchecked conversions?
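
A minimal sketch of that suggestion, assuming a hypothetical wrapper type `ValidatedInput` (not part of html5gum or lychee): the bytes are checked once at the boundary, after which the unchecked conversion is sound.

```rust
use std::str::Utf8Error;

/// Hypothetical wrapper: holding a value of this type is proof that the
/// underlying bytes were validated as UTF-8 exactly once.
pub struct ValidatedInput<'a>(&'a [u8]);

impl<'a> ValidatedInput<'a> {
    /// Validate up front; this is the only fallible step.
    pub fn new(bytes: &'a [u8]) -> Result<Self, Utf8Error> {
        std::str::from_utf8(bytes)?;
        Ok(Self(bytes))
    }

    /// Infallible view of the input as a string slice.
    pub fn as_str(&self) -> &'a str {
        // SAFETY: `new` verified that the whole buffer is valid UTF-8,
        // and the buffer is immutably borrowed for the lifetime 'a.
        unsafe { std::str::from_utf8_unchecked(self.0) }
    }
}
```

The trade-off is that input which is not valid UTF-8 is rejected before it reaches the tokenizer, so this only fits extractors that genuinely require text.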

mre · 2022-08-18T09:42:28Z

> mupdf has an external dependency and I don't think that's light given the amount of code you need to add...

The amount of code I personally added is minimal; just one file.
In terms of the external dependency, it's relatively manageable as well:

On Ubuntu:

lychee release build master: 271M
lychee release build mupdf: 325M

On mac:

lychee release build master: 20M
lychee release build mupdf: 27M

(I don't know where the big discrepancy between those two OSes comes from. Can't see any big linking differences right now.)

No external packages need to be installed for mupdf support; I tested it in a vanilla rust Docker container (based on Ubuntu) and macOS.

In the worst case users can just disable the feature and build their own binary at no additional cost.

> and that doesn't seem to be extensible if more formats are to be supported in the future.

MuPDF apparently supports PDF, XPS, EPUB, XHTML, and CBZ.
It has a limited scope and I'd say that's fine. Better to have a tool which works well for a few use-cases instead of one with mediocre support for many use-cases.

> The way ripgrep-all deals with binary text formats is to use external binaries:

I'm not against using a similar pattern in the future, but I'd use external tools as a last resort. Even though it's unlikely for pandoc, CLI interfaces in general might change over time. Relying on external tools also brings its own set of challenges: bundling, licensing, availability on different operating systems, etc.

The way forward could be to have an Extractor trait and implement it for different formats. Every extractor implementation would define which formats it can handle. There can be more than one extractor per file format.
The Extractor trait would be quite similar to the adapter trait you mentioned.
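
A rough sketch of what such a trait could look like. All names here (`FileType`, `ExtractError`, `Extractor`, `PlaintextExtractor`, `extract_with`) are hypothetical and only illustrate the shape of the idea. Extractors that claim a format are tried in registration order until one succeeds.

```rust
/// Hypothetical set of formats for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Html,
    Plaintext,
    Pdf,
}

/// Hypothetical error type carrying a human-readable message.
#[derive(Debug)]
pub struct ExtractError(pub String);

pub trait Extractor {
    /// Each implementation declares which formats it can handle.
    fn supports(&self, file_type: FileType) -> bool;
    /// Extraction is fallible, e.g. when the input is not valid UTF-8.
    fn extract(&self, input: &[u8]) -> Result<Vec<String>, ExtractError>;
}

pub struct PlaintextExtractor;

impl Extractor for PlaintextExtractor {
    fn supports(&self, file_type: FileType) -> bool {
        file_type == FileType::Plaintext
    }

    fn extract(&self, input: &[u8]) -> Result<Vec<String>, ExtractError> {
        let text = std::str::from_utf8(input).map_err(|e| ExtractError(e.to_string()))?;
        Ok(text
            .split_whitespace()
            .filter(|w| w.starts_with("http://") || w.starts_with("https://"))
            .map(String::from)
            .collect())
    }
}

/// Try every extractor registered for `file_type`, in order, and return
/// the first successful result; otherwise return the last error seen.
pub fn extract_with(
    extractors: &[Box<dyn Extractor>],
    file_type: FileType,
    input: &[u8],
) -> Result<Vec<String>, ExtractError> {
    let mut last_err = ExtractError(format!("no extractor registered for {:?}", file_type));
    for extractor in extractors.iter().filter(|e| e.supports(file_type)) {
        match extractor.extract(input) {
            Ok(links) => return Ok(links),
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```

With this shape, an extractor backed by an external tool and a native one could both register for the same format, with the order of registration acting as the preference.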

Curious about your thoughts @lebensterben.

mre · 2022-08-18T10:27:56Z

Regarding the big binary on ubuntu:

strip target/release/lychee

Results in a 14M binary.

mre · 2022-08-18T10:36:03Z

https://users.rust-lang.org/t/binary-is-way-bigger-on-linux-than-on-macos/14814/4

lebensterben · 2022-08-18T11:08:21Z

mupdf is licensed under the GNU Affero General Public License...

lychee will need to be relicensed.

lebensterben · 2022-08-18T11:17:11Z

Meanwhile, calling an external binary is perfectly okay. You don't need to bundle any external binaries.

lychee as a user-facing CLI tool relies on the user to install any third party binaries, provided that the user needs that particular functionality.

lychee as a CI workflow may add a few steps to grab the binary from its release page. This won't infringe any copyright.

lychee as a service is free to call any binary in its hosting environment, and since we're not delivering the binary as the product, we don't need to worry about most licenses. AGPL may be an exception (I'm not very sure).
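
For illustration, calling a user-installed converter could look roughly like the sketch below. `ExternalAdapter` and `to_text` are hypothetical names, and the sketch assumes the external program reads the document on stdin and writes plain text to stdout; it is not the ripgrep-all implementation.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Hypothetical adapter description: which installed program converts a
/// document to plain text, and with which arguments.
pub struct ExternalAdapter {
    pub program: String,
    pub args: Vec<String>,
}

impl ExternalAdapter {
    /// Pipe `input` through the external program and return its stdout.
    /// Fails if the program is not installed or exits with an error.
    pub fn to_text(&self, input: &[u8]) -> io::Result<String> {
        let mut child = Command::new(&self.program)
            .args(&self.args)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .stderr(Stdio::null())
            .spawn()?;

        let mut stdin = child.stdin.take().expect("stdin was configured as piped");
        let output = std::thread::scope(|scope| {
            // Feed stdin from a separate thread so that a child producing a
            // lot of output cannot deadlock against our pending write.
            // Dropping `stdin` at the end of the closure signals EOF.
            scope.spawn(move || {
                let _ = stdin.write_all(input);
            });
            child.wait_with_output()
        })?;

        if !output.status.success() {
            return Err(io::Error::new(
                io::ErrorKind::Other,
                format!("{} exited with {}", self.program, output.status),
            ));
        }
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    }
}
```

A missing binary surfaces as an ordinary `io::Error` from `spawn`, so the caller can report that the optional tool is not installed instead of failing the whole run.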

mre · 2022-08-18T12:09:32Z

That's a good point. I don't think AGPL would work for us.
Should have checked the MuPDF license before. 😞

mre · 2022-12-22T14:32:17Z

Even though it's sad, it's clear that this PR can't be merged in its current form because of licensing issues. What @lebensterben mentioned is the way to go. We'll work on a dedicated trait and implementation for calling external binaries (based on the work in ripgrep-all), but in a separate pull request. Thanks for the input.

mre force-pushed the mupdf branch from 801f3f7 to 28077df Compare August 17, 2022 14:14

mre force-pushed the mupdf branch from 28077df to 2ad2919 Compare August 17, 2022 14:56

Twitter quirk fixed; adjust test (#741)

2fff479

mre force-pushed the mupdf branch from 2ad2919 to 2fff479 Compare August 17, 2022 14:59

Add epub support

3641c14

mre changed the title ~~Add PDF support~~ Add PDF/EPUB support Aug 17, 2022

mre closed this Dec 22, 2022

mre mentioned this pull request Dec 14, 2024

PDF Support #1583

Closed

mre wants to merge 2 commits into master from mupdf

3 participants
