
PDF text extraction can fail in complex shaping scenarios #4225

@mm-tea

Description

Compiling a PDF that contains non-Latin text (in this case, Devanagari) can sometimes result in a strange text encoding. The text is then not properly selectable: the portion of the text that can be selected differs across PDF viewers, and the selection sometimes contains characters that do not exist in the source.

Minimal Working Example:

#set text(font: "Siddhanta")

आ रु॒क्मैरा यु॒धा नर॑ ऋ॒ष्वा ऋ॒ष्टीर॑सृक्षत ।

अन्वे॑नाँ॒ अह॑ वि॒द्युतो॑ म॒रुतो॒ जज्झ॑तीरिव भा॒नुर॑र्त॒ त्मना॑ दि॒वः ॥

(The font is set explicitly for reproducibility.)
typst output

This results in the following chunk of selectable text in Ubuntu's Document Viewer:
अा
ैरा युध
ा नर॑ ऋ॒ ष्वा ऋ॒ ीर॑ सृक्षत ꠰
अन्वे॑नाँ॒ अह॑ व॒ ुताे॑ म॒ ताे॒ जज्झ॑ती रव भा॒नुर॑त॒ त्मना॑ द॒वः ꠱

Other PDF viewers may give different output. Firefox, for example, gives:
अा ˳॒ƨैरा यु॒धा नर॑ ऋ॒ष्वा ऋ॒ʆीर॑सृक्षत ꠰
अन्वे॑ नाँ॒ अह॑ Vव॒ȭुताे॑ म॒˳ताे॒ जज्झ॑तीRरव भा॒नुर॑तA॒ त् मना॑ Tद॒वः ꠱
(Note the introduced Latin characters.)
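
For anyone who wants to check a viewer's copied text without comparing it by eye, here is a rough Python sketch that lists every character falling outside the Unicode blocks this sample is expected to use. The function name `unexpected_chars` and the choice of expected ranges are my own assumptions for this particular text, not anything Typst or a PDF viewer provides.

```python
import unicodedata

# Assumption: the sample is Vedic Devanagari, so only these blocks are expected.
EXPECTED_RANGES = [
    (0x0900, 0x097F),  # Devanagari
    (0x1CD0, 0x1CFF),  # Vedic Extensions
    (0xA8E0, 0xA8FF),  # Devanagari Extended
]


def unexpected_chars(extracted):
    """Return (index, char, name) for each non-space character outside EXPECTED_RANGES."""
    found = []
    for i, ch in enumerate(extracted):
        if ch.isspace():
            continue
        cp = ord(ch)
        if not any(lo <= cp <= hi for lo, hi in EXPECTED_RANGES):
            found.append((i, ch, unicodedata.name(ch, "UNKNOWN")))
    return found


# A Latin "V" followed by DEVANAGARI LETTER VA (U+0935): only the "V" is reported.
print(unexpected_chars("V\u0935"))
```

Pasting the text copied from a viewer into `unexpected_chars` gives a list of the stray characters and their positions, which makes it easier to compare viewers against each other.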

This does not seem to be an inherent limitation of PDF itself: LuaLaTeX can generate PDFs in which text selection works properly. A minimal working example for LaTeX is provided as well:

\documentclass[a4paper]{article}

\usepackage{fontspec}
\setromanfont{Siddhanta}

\begin{document}

आ रु॒क्मैरा यु॒धा नर॑ ऋ॒ष्वा ऋ॒ष्टीर॑सृक्षत ।

अन्वे॑नाँ॒ अह॑ वि॒द्युतो॑ म॒रुतो॒ जज्झ॑तीरिव भा॒नुर॑र्त॒ त्मना॑ दि॒वः ॥

\end{document}

LaTeX output

This suggests that it is possible to encode the text in a PDF in a way that keeps it properly selectable.

Reproduction URL

No response

Operating system

Linux

Typst version

  • I am using the latest version of Typst

Metadata

Labels

bug (Something isn't working), pdf (Related to PDF export or PDF embedding), text (Related to the text category, which is all about text handling, shaping, etc.)