Many glyphs are not searchable in the PDF #479

@ghost

Description

Example input:

The

In the exported PDF, Th is replaced by the ligature glyph, and the word can then only be found by searching for e. I think the ToUnicode entry is currently not written to the PDF font object with the appropriate content.
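For illustration, this is roughly what the missing mapping would look like. A ToUnicode CMap maps glyph codes to UTF-16BE text in `bfchar` lines, so a Th ligature needs an entry whose destination is the two characters T and h. The sketch below only formats such a line; the helper name `bfchar_entry`, the 2-byte glyph code, and the glyph id 3 are assumptions made for the example, not the exporter's actual code.

```rust
/// Format one `bfchar` line of a ToUnicode CMap.
/// Assumes 2-byte glyph codes; `glyph_id` values used with it are hypothetical.
fn bfchar_entry(glyph_id: u16, text: &str) -> String {
    // The destination string is UTF-16BE, written as hexadecimal digits.
    let hex: String = text
        .encode_utf16()
        .map(|unit| format!("{:04X}", unit))
        .collect();
    format!("<{:04X}> <{}>", glyph_id, hex)
}
```

With this helper, `bfchar_entry(3, "Th")` yields `<0003> <00540068>`, which is the kind of entry that lets a viewer match a search for "The" against the ligature glyph.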

When we use the font LinLibertine_R.ttf and compile the following with LuaTeX, we can search it as expected.

\documentclass[a4paper]{article}

\usepackage{fontspec}

\setmainfont{LinLibertine_R.ttf}

\begin{document}

The

\end{document}

Edit: default ligatures seem to work (fi, ffi, ...)
The way to solve this is probably to look for the optional but widely used underscore naming scheme for ligature glyphs, such as T_h, t_z or a_b_c_d, then remove the underscores and use the result as the ToUnicode data. If a component between underscores is itself a name with a special meaning, it should be resolved too: longs_c_h should become ſch, not longsch.
Additionally, the dot-suffix syntax should be handled: a.end, a.alt and a.suffix should all map to a, and longs.alt should map to ſ.
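The naming-scheme idea above could be sketched as follows. This is a minimal sketch, not a full glyph-name resolver: the function names are made up for the example, and the table of named components contains only `longs`, where a real implementation would need a complete glyph list. Unknown names return `None` so the caller can fall back to another strategy.

```rust
/// Resolve one component of a glyph name (the part between underscores).
/// The table of named components is a tiny illustrative subset.
fn component_to_text(component: &str) -> Option<String> {
    // Named components with a special meaning (hypothetical, incomplete table).
    if component == "longs" {
        return Some("ſ".to_string());
    }
    // A single character stands for itself, e.g. "T" or "h".
    let mut chars = component.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c.to_string()),
        _ => None,
    }
}

/// Guess the text for a glyph name such as "T_h", "longs_c_h" or "a.alt".
/// Returns None when any component is unknown.
fn glyph_name_to_text(name: &str) -> Option<String> {
    // Drop the dot suffix: "a.alt" -> "a". A name like ".notdef" has an empty base.
    let base = name.split('.').next().unwrap_or("");
    if base.is_empty() {
        return None;
    }
    // Resolve each underscore-separated component and concatenate the results.
    base.split('_').map(component_to_text).collect()
}
```

Here `glyph_name_to_text("longs_c_h")` gives `Some("ſch")` and `glyph_name_to_text("a.end")` gives `Some("a")`. Since the naming scheme is optional, fonts that do not follow it would simply produce `None` for most glyphs.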

Edit 2: It seems to affect all non-standard glyphs, not just non-standard ligatures.

Edit 3: It is also possible to collect the output from rustybuzz::shape, because then we know which input text produced each glyph.
The only problem is that, with some fonts, several different inputs can produce the same glyph. For example, if the font has two ligature rules aaaa -> Glyph(aa) and ääää -> Glyph(aa), we would collect both aaaa and ääää for Glyph(aa), and it would be unclear which one to choose. This situation is probably very rare, though, and it could be resolved by counting which of the candidates occurs most often in the document.
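The counting idea could look roughly like this. It is a sketch under assumptions: the type `GlyphSources` and its methods are invented for the example, glyph ids are taken to be `u16`, and the step that extracts (glyph id, source text) pairs from the shaper output is assumed to exist elsewhere and is not shown.

```rust
use std::collections::HashMap;

/// Counts, per glyph id, how often each source text produced that glyph.
#[derive(Default)]
struct GlyphSources {
    counts: HashMap<u16, HashMap<String, usize>>,
}

impl GlyphSources {
    /// Record that `glyph_id` was produced from `source` during shaping.
    fn record(&mut self, glyph_id: u16, source: &str) {
        *self
            .counts
            .entry(glyph_id)
            .or_default()
            .entry(source.to_string())
            .or_insert(0) += 1;
    }

    /// Pick the most frequent source text for `glyph_id`.
    /// Ties go to the lexicographically smaller string, so the result is deterministic.
    fn most_common(&self, glyph_id: u16) -> Option<&str> {
        self.counts
            .get(&glyph_id)?
            .iter()
            .max_by(|a, b| a.1.cmp(b.1).then_with(|| b.0.cmp(a.0)))
            .map(|(text, _)| text.as_str())
    }
}
```

After recording aaaa twice and ääää once for the same glyph, `most_common` returns aaaa. The explicit tie-break matters because `HashMap` iteration order is not stable, and without it two exports of the same document could pick different texts.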

Labels: bug (Something isn't working), pdf (Related to PDF export or PDF embedding)
