Description
Attach (recommended) or Link to PDF file
Example PDF: vertical2.pdf
The text content items returned by getTextContent for vertical writing modes have incorrect bounding boxes, which typically drift either away from or towards the beginning of the line. The discrepancy occurs because the charSpacing variable is subtracted from the transformation matrix instead of added.
From the PDF specification, section 9.3.2:
When the glyph for each character in the string is rendered, Tc
shall be *added* to the horizontal or vertical component of the
glyph’s displacement, depending on the writing mode.
...
In the default coordinate system, horizontal coordinates increase
from left to right and vertical coordinates from bottom to top.
Therefore, for horizontal writing, a positive value of Tc has the
effect of expanding the distance between glyphs (see Figure 41),
whereas for vertical writing, a negative value of Tc has this
effect.
For vertical writing, a Tc value that expands the spacing is therefore already stored as a negative number in a valid PDF file, so negating it again produces the opposite effect.
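To make the drift concrete, here is a minimal sketch of the vertical glyph displacement from section 9.4.4 of the PDF specification, ty = w1 × Tfs + Tc, with word spacing and TJ adjustments left out. The function name, the `sign` parameter, and the numbers are all hypothetical and chosen for illustration; none of this is PDF.js code.

```javascript
// Sketch only: models vertical glyph placement, not the PDF.js implementation.

/**
 * Returns the y origin of each glyph on one vertical line.
 * @param {number} count       Number of glyphs on the line.
 * @param {number} w1          Vertical displacement per unit of font size
 *                             (negative for top-to-bottom writing).
 * @param {number} fontSize    Tfs, the text font size.
 * @param {number} charSpacing Tc, the character spacing.
 * @param {number} sign        +1 adds Tc (as the spec requires), -1 subtracts it.
 */
function glyphOriginsY(count, w1, fontSize, charSpacing, sign) {
  const origins = [];
  let y = 0;
  for (let i = 0; i < count; i++) {
    origins.push(y);
    y += w1 * fontSize + sign * charSpacing;
  }
  return origins;
}

// Hypothetical values: font size 10, Tc = -2 (expands spacing in vertical mode).
const added = glyphOriginsY(4, -1, 10, -2, +1);
const subtracted = glyphOriginsY(4, -1, 10, -2, -1);

console.log(added);      // [ 0, -12, -24, -36 ]
console.log(subtracted); // [ 0, -8, -16, -24 ]
```

With these values the subtracted variant falls behind by 2 × |Tc| = 4 units per glyph, which matches the boxes drifting progressively along the line.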
Web browser and its version
Firefox (but issue appears in any browser)
Operating system and its version
Linux (but issue appears on any OS)
PDF.js version
5.5.207
Is the bug present in the latest PDF.js version?
Yes
Is a browser extension
No
Steps to reproduce the problem
- Open a PDF with text written in vertical writing mode.
- Inspect the bounding boxes of text content items, or select text to see where the bounding boxes are.
- Notice that the bounding boxes do not align with text.
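For the inspection step, a small helper along these lines may be convenient. It is a sketch that assumes `textContent` is the object resolved by `page.getTextContent()`, with its `items` and `styles` fields; the helper name `verticalItems` is made up for this example. It reports only the raw origin, width, and height of each item and does not attempt to reconstruct the bounding box.

```javascript
// Sketch only: `verticalItems` is a hypothetical helper, not part of PDF.js.

/**
 * Lists the text items that are set in a vertical font.
 * @param {{items: Array<object>, styles: object}} textContent
 *   The value resolved by `page.getTextContent()`.
 * @returns {Array<{str: string, x: number, y: number, width: number, height: number}>}
 */
function verticalItems(textContent) {
  const { items, styles } = textContent;
  return items
    // Skip marked-content entries (they have no `str`) and horizontal fonts.
    .filter(item => item.str !== undefined && styles[item.fontName]?.vertical)
    .map(item => ({
      str: item.str,
      x: item.transform[4], // translation components of the item transform
      y: item.transform[5],
      width: item.width,
      height: item.height,
    }));
}
```

Inside an async function, something like `console.table(verticalItems(await page.getTextContent()))` would then print the values to compare against the rendered glyph positions.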
What is the expected behavior?
Here is a rendering of the character bounding boxes as they are in the current version of PDF.js:
And here is a rendering of the correct character bounding boxes after fixing the error.
You can also notice the incorrect bounding box without rendering them out by selecting text or inspecting elements:
NOTE: There seems to be a separate issue in the viewer that misaligns the bounding boxes horizontally. I have not investigated it, since I haven't been using the viewer.
What went wrong?
The issue is the following line which appears three times in core/evaluator.js:
Line 2955 in ff1af5a
textState.translateTextMatrix(0, -charSpacing);
Line 3003 in ff1af5a
textState.translateTextMatrix(0, -charSpacing);
Line 3078 in ff1af5a
textState.translateTextMatrix(0, -charSpacing);
This negation is incorrect and should simply be removed like so:
textState.translateTextMatrix(0, charSpacing);
Link to a viewer
No response
Additional context
Let me know if you would prefer a pull request for this.