File format preprocessing

lychee can currently extract and check links from plain text file formats. In addition to "simple" plaintext link extraction, lychee internally tries to detect HTML and Markdown so that it can parse them with a more specific parser.

This way, link extraction on websites (HTML) and local plaintext file formats works quite well most of the time. But some plain text files do not work well, such as CSV files without whitespace and quotes (https://github.com/lycheeverse/lychee/issues/1299). This is because there is no way of knowing where a URL ends: subsequent CSV columns are separated with `,`, which could also be part of the path in the URL. Additionally, it is impossible to use lychee on non-plaintext files. It was previously suggested that lychee should add support for the PDF (https://github.com/lycheeverse/lychee/issues/1583), EPUB (https://github.com/lycheeverse/lychee/issues/202) and AsciiDoc (https://github.com/lycheeverse/lychee/issues/291) file formats.
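To make the CSV ambiguity concrete, here is a minimal sketch. It assumes a simplified extractor that splits text on whitespace, which is not lychee's actual implementation, and the sample row is made up for illustration. Because `,` is a legal character in a URL path, only a CSV-aware reader can tell where one column ends.

```python
import csv
import io

# Hypothetical CSV row: two URL columns, no whitespace, no quotes.
row = "https://example.com/a,https://example.com/b"

# A whitespace-based tokenizer (a simplification, not lychee's real
# extractor) sees a single token and cannot tell where the first URL ends.
naive_tokens = row.split()

# A CSV-aware reader knows that ',' is the column separator.
csv_columns = next(csv.reader(io.StringIO(row)))

print(naive_tokens)
print(csv_columns)
```

The first result is one token that looks like a single URL with a comma in its path, while the second result is the two separate URLs.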

I think that officially supporting the above examples in lychee for every user would be the wrong approach. It should not be lychee's responsibility to understand every single file format and every edge case of each format (as with the CSV example). Attempting to do so would bloat lychee and make it hard to maintain.

Instead, it should be the user's responsibility to preprocess files in such a way that lychee can understand them. To help users do this, we should provide documentation on how this is done, and we should probably introduce a new command line flag called `--preprocess`:

```
lychee --preprocess "csv: cat {} | csvq 'select *'" --preprocess "pdf: pdftotext {} -" my-project-to-check/
```

This flag is comparable to the `-exec` option of find and the `--exec` option of fd: `{}` is replaced with the path of the file being processed.
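As a rough illustration of the semantics proposed above, the following sketch shows how a `<ext>: <command>` spec could be parsed and applied. This is not lychee code: the function names `parse_spec`, `build_command` and `preprocess` are hypothetical, and details such as matching on the file extension and running the command through a shell are assumptions about how the flag might behave.

```python
from __future__ import annotations

import shlex
import subprocess
from pathlib import Path


def parse_spec(spec: str) -> tuple[str, str]:
    """Split a hypothetical '<ext>: <command>' spec into (extension, template)."""
    ext, sep, template = spec.partition(":")
    if not sep or not ext.strip() or not template.strip():
        raise ValueError(f"expected '<ext>: <command>', got {spec!r}")
    return ext.strip().lower().lstrip("."), template.strip()


def build_command(template: str, path: Path) -> str:
    """Replace every '{}' in the template with the shell-quoted file path."""
    return template.replace("{}", shlex.quote(str(path)))


def preprocess(path: Path, specs: list[str]) -> str:
    """Return the text that would be handed to link extraction for `path`."""
    rules = dict(parse_spec(spec) for spec in specs)
    template = rules.get(path.suffix.lstrip(".").lower())
    if template is None:
        # Assumption: files without a matching rule are read as plain text.
        return path.read_text(encoding="utf-8", errors="replace")
    # Assumption: the command runs in a shell so that pipes work,
    # and its standard output becomes the input for link extraction.
    result = subprocess.run(
        build_command(template, path),
        shell=True,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the specs from the example invocation, `preprocess(Path("report.pdf"), ["pdf: pdftotext {} -"])` would run `pdftotext report.pdf -` and return its output, provided `pdftotext` is installed. Quoting the path with `shlex.quote` keeps file names with spaces from being split by the shell.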

Additionally, there were discussions about depending on pandoc to support more file formats (https://github.com/lycheeverse/lychee/issues/202#issuecomment-1080771637). However, I think this should not be necessary once we introduce this new flag and provide enough user documentation. Depending on pandoc would be a very opinionated choice, and for certain file formats or specific use cases pandoc might not be suitable. In the EPUB example, a user suggested another program that was probably better suited to their needs (https://github.com/lycheeverse/lychee/issues/202#issuecomment-879634519). If the demand for an "all file type knowing" link checker really turns out to be that big, we could still create a new repository called "lychee-all", akin to [ripgrep-all](https://github.com/phiresky/ripgrep-all).
