[Bug]: Incorrect charSpacing handling causes drifting bounding boxes in getTextContent. #20930

@Kalinovcic

Attach (recommended) or Link to PDF file

Example PDF: vertical2.pdf

The text content items returned by getTextContent for text in a vertical writing mode have incorrect bounding boxes, which typically drift either away from or towards the beginning of the line. The discrepancy occurs because charSpacing is subtracted from the text matrix translation instead of being added to it.

From the PDF specification, section 9.3.2:

When the glyph for each character in the string is rendered, Tc
shall be *added* to the horizontal or vertical component of the
glyph’s displacement, depending on the writing mode.
  ...
In the default coordinate system, horizontal coordinates increase
from left to right and vertical coordinates from bottom to top.
Therefore, for horizontal writing, a positive value of Tc has the
effect of expanding the distance between glyphs (see Figure 41),
whereas for vertical writing, a negative value of Tc has this
effect.

For vertical writing in a valid PDF file, a Tc value that is meant to widen the spacing is therefore already negative, so negating it again produces the opposite effect.
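To illustrate how the sign error accumulates along a line, here is a minimal sketch of the vertical glyph displacement from section 9.4.4 of the PDF specification, ty = (w1 - Tj/1000) * Tfs + Tc + Tw, with Tj and Tw left at zero. This is a simplified model and not PDF.js code: the function names are hypothetical, and w1 = -1 is an assumed vertical glyph displacement chosen for illustration.

```javascript
// Simplified model of vertical pen movement (assumes Tj = 0 and Tw = 0).
// Function names are hypothetical; this is not the PDF.js implementation.

// Displacement per the PDF specification: Tc is ADDED.
function verticalAdvance(w1, fontSize, charSpacing) {
  return w1 * fontSize + charSpacing;
}

// Same displacement with the sign error described in this issue.
function negatedVerticalAdvance(w1, fontSize, charSpacing) {
  return w1 * fontSize - charSpacing;
}

// Pen y-position at the start of each glyph, beginning at y = 0.
function penPositions(advanceFn, glyphCount, w1, fontSize, charSpacing) {
  const positions = [];
  let y = 0;
  for (let i = 0; i < glyphCount; i++) {
    positions.push(y);
    y += advanceFn(w1, fontSize, charSpacing);
  }
  return positions;
}

// Assumed values: w1 = -1 (glyphs advance downwards), 12pt font, Tc = -2.
const correct = penPositions(verticalAdvance, 4, -1, 12, -2);
const negated = penPositions(negatedVerticalAdvance, 4, -1, 12, -2);
const drift = negated.map((y, i) => y - correct[i]);

console.log(correct); // [0, -14, -28, -42]
console.log(negated); // [0, -10, -20, -30]
console.log(drift);   // [0, 4, 8, 12]
```

Under these assumptions the error grows by 2 * |Tc| per glyph, which matches the drifting boxes described above.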

Web browser and its version

Firefox (but issue appears in any browser)

Operating system and its version

Linux (but issue appears on any OS)

PDF.js version

5.5.207

Is the bug present in the latest PDF.js version?

Yes

Is a browser extension

No

Steps to reproduce the problem

  1. Open a PDF with text written in vertical writing mode.
  2. Inspect the bounding boxes of text content items, or select text to see where the bounding boxes are.
  3. Notice that the bounding boxes do not align with the text.

What is the expected behavior?

Here is a rendering of the character bounding boxes as they are in the current version of PDF.js:

Image

And here is a rendering of the correct character bounding boxes after fixing the error:

Image

You can also see the incorrect bounding boxes without rendering them, by selecting text or inspecting elements:

Image

NOTE: There seems to be another issue in the viewer that causes the bounding boxes to be misaligned horizontally. I have not investigated it, since I do not use the viewer.

What went wrong?

The issue is the following line which appears three times in core/evaluator.js:

textState.translateTextMatrix(0, -charSpacing);

This negation is incorrect and should simply be removed like so:

textState.translateTextMatrix(0, charSpacing);
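As a rough illustration of what the sign change does to the text matrix, the sketch below applies a text-space translation to a six-element PDF matrix [a, b, c, d, e, f]. The helper translateMatrix is a hypothetical stand-in written for this example, not the PDF.js translateTextMatrix source, and the starting matrix and Tc value are assumed.

```javascript
// Hypothetical stand-in for a text-matrix translation; not the PDF.js source.
// A PDF matrix [a, b, c, d, e, f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f),
// so translating by (x, y) in text space only changes e and f.
function translateMatrix(m, x, y) {
  const [a, b, c, d, e, f] = m;
  return [a, b, c, d, a * x + c * y + e, b * x + d * y + f];
}

// Assumed starting matrix: scale 2, pen at (100, 700).
const start = [2, 0, 0, 2, 100, 700];
// Assumed Tc: negative, which widens spacing in vertical writing.
const charSpacing = -2;

const withNegation = translateMatrix(start, 0, -charSpacing);
const withoutNegation = translateMatrix(start, 0, charSpacing);

console.log(withNegation[5]);    // 704 (pen moves back up the page)
console.log(withoutNegation[5]); // 696 (pen moves further down the page)
```

With these values the two variants differ by 8 units in the f component after a single glyph, and the gap widens with every further glyph in the run.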

Link to a viewer

No response

Additional context

Let me know if you would prefer a pull request for this.
