Add ability to read math in PDF documents by NSoiffer · Pull Request #17276 · nvaccess/nvda

NSoiffer · 2024-10-10T22:43:37Z

Link to issue number:

Summary of the issue:

PDF 2.0 added Associated Files (AF). It also describes a method for Formula tags to make use of AF that contain MathML. The LaTeX Project (the group that maintains LaTeX) has released an update to LaTeX that uses this technique. Hence, there will soon be a large body of PDF documents generated from LaTeX (pdflatex and lualatex) that contain MathML.

Together with Foxit, and with informal agreement from someone at Adobe, we settled on a method to expose the MathML in an AF without a change to the PDF accessibility interface: the Formula tag gets role=Math (on Windows, ROLE_SYSTEM_EQUATION) and the content of the tag is the MathML.
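As a rough illustration of how a client could consume this convention, the sketch below assumes the client already has a node's MSAA role and its text content. `ROLE_SYSTEM_EQUATION` is the standard MSAA constant; the function name `mathml_from_formula` and its parameters are hypothetical and are not NVDA's API.

```python
import xml.etree.ElementTree as ET
from typing import Optional

ROLE_SYSTEM_EQUATION = 0x37  # MSAA role value defined in oleacc.h


def mathml_from_formula(role: int, content: str) -> Optional[str]:
    """Return content if the node is an equation whose content parses as MathML.

    Hypothetical helper for illustration only.
    """
    if role != ROLE_SYSTEM_EQUATION:
        return None
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Not well-formed XML, e.g. plain alt text or LaTeX source.
        return None
    # Drop an optional namespace prefix such as {http://www.w3.org/1998/Math/MathML}.
    local_name = root.tag.rsplit("}", 1)[-1]
    return content if local_name == "math" else None
```

Content that is not well-formed XML (plain alt text, LaTeX source) is rejected here, which leaves room for a separate alt text fallback.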

Note: this does not change the legality of the previous method of fully tagging the PDF math with child elements pointing to subexpressions in the PDF. However, that method has proved difficult for PDF generators to implement. This method seems to be much simpler and hence will be used.

The latest release of Foxit contains support for AF with MathML. So far, Adobe has not made a change, but with Foxit and NVDA supporting this, there will be more of an impetus to do so. According to the Foxit implementer, it took only 1-2 days to implement.

Description of user facing changes

The math in documents will be spoken and brailled just as it is done for HTML documents. It will also be navigable. This should work with any of the MathML add-ons.

Description of development approach

Support required only about 3 lines to be added to the adobeAcrobat.py file. I changed a few more lines to add debug warnings when various COM interfaces are not found.

There was a commit in January 2024 that wiped out the MathML support in PDF in favor of alt text. This was in the .cpp file that is part of this PR. This PR mostly reverts that change. Alt text is still supported via the creation of a MathML <mtext> element. Potentially, this is a better solution because sometimes the alt text is LaTeX and LaTeX contains lots of punctuation characters that are not spoken by NVDA by default. Pushing this to the Math handler gives them the ability to override this behavior and speak all the characters. Currently MathCAT just passes the mtext content directly to NVDA, but I will look into making it smarter about that.

Because Adobe Reader currently does not handle AFs, the alt text will get read if a formula has both an AF and alt text.
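The alt text fallback described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the function name `alt_text_to_mathml` is hypothetical, and the escaping step is an assumption added here because unescaped `<` and `&` would make the resulting MathML ill-formed (an issue referenced later in this thread).

```python
from xml.sax.saxutils import escape


def alt_text_to_mathml(alt_text: str) -> str:
    """Wrap plain alt text in a single MathML <mtext> element.

    Hypothetical helper; escape() converts &, < and > to XML entities.
    """
    return f"<math><mtext>{escape(alt_text)}</mtext></math>"
```

The math handler then receives well-formed MathML either way and can decide how to speak the text, including punctuation-heavy LaTeX.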

Testing strategy:

Here are two PDF files for testing:

Known issues with pull request:

None

Code Review Checklist:

I need some guidance on what to do, if anything, about "change log entry" and unit or system tests.

Documentation:
- Change log entry
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English
API is compatible with existing add-ons.
Security precautions taken.

@coderabbitai summary

…e Reader: support alternative text on formulas in PDFs (nvaccess#16067)") This change doesn't allow MathML in formula tags. That was ok before, but now pdftex puts out MathML in an associated file and FoxIt reader makes it the content of a formula tag. Adobe is generally on board with this idea but hasn't implemented handling of associated files yet. Currently Adobe Reader does not know about Associated Files. So the contents are the alternative text. This is handled by making it an `mtext` element. Note: if the alt text is LaTeX, then it has lots of "punctuation" such as `{` and `\`. This typically would _not_ be read by NVDA. MathCAT currently passes this along unchanged to NVDA, but I have marked this as a bug and will have MathCAT produce speech that won't be discarded, so the experience will be an improved one over just speaking the alt text.

ABuffEr · 2024-10-11T07:28:15Z

This sounds so beautiful!
@NSoiffer, thanks a lot for following these developments. I finished my studies in 2019 and never had MathML materials, but it's very comforting to know that, in the future, I could find accessible PDFs almost as easily as a sighted person.

codeofdusk · 2024-10-11T22:34:50Z

@NSoiffer Just curious, is this new support part of the base amsmath package or does it require an inclusion of an additional package at build time?

codeofdusk · 2024-10-11T22:37:31Z

As for "change log entry", add a new item in user_docs/en/changes.md (similar to the others). Include the underlying issue number and your username as an at mention.

It looks like NV Access might need to manually start CI for this PR. Be sure that existing unit and system tests pass.

NSoiffer · 2024-10-12T00:20:01Z

> @NSoiffer Just curious, is this new support part of the base amsmath package or does it require an inclusion of an additional package at build time?

This is separate, although it does include support for the amsmath package. It is a project funded by Adobe that has been worked on for 4(?) years and is nearing completion of phase V (their last listed phase). Here is a relatively recent paper on their work. It (currently) requires adding some metadata at the start of the file:

\DocumentMetadata{
  lang        = en,
  pdfversion  = 2.0,
  pdfstandard = ua-2,
  pdfstandard = a-4f, %or a-4
  testphase   = 
   {phase-III,
    title,
    table,
    math,
    firstaid}  
}

The first part of the metadata will be required as it states some important things that go into PDF's metadata. I suspect the "testphase" part will not be required at some point in the future.

More info about the project is on their web page and also in their repo (it is a little out of date -- a month ago or so they started automatically generating MathML from TeX).

NSoiffer · 2024-10-12T03:36:43Z

@codeofdusk: thanks. I've added something to the userDoc file.

As for testing, the only code that I changed is PDF related. As far as I can find, there are no tests for PDF files. runsystemtests.bat -i "xxx" where "xxx" is "PDF", "Adobe", and "Acrobat" all come back with "Suite 'Robot' contains no tests matching tags". So no tests should have been broken.

Because the code makes COM calls, I don't see a way to do unit tests. I don't know how to do PDF system tests because one needs a PDF processor running (Adobe Reader, Foxit Reader, ...). The need for a PDF processor probably makes system testing hard (impossible?). If someone can tell me how to do some tests, I'll add some. Otherwise, I think the checklist is complete.

NSoiffer · 2024-10-14T20:12:29Z

Having not heard any suggestions about how I could do testing, I've checked the box off. That means there are no tasks left. I think this is ready for review.

seanbudd

Thanks @NSoiffer this will be a much appreciated fix

AppVeyorBot · 2024-10-17T08:47:10Z

Build (for testing PR): https://ci.appveyor.com/api/buildjobs/0e8xt8ak3i7n12qm/artifacts/output/
CI timing (mins):
INIT 0.0,
INSTALL_START 1.4,
INSTALL_END 0.9,
BUILD_START 0.0,
FINISH_END 15.2

See test results for failed build of commit b6a64ad505

SaschaCowley

Thanks @NSoiffer. Just a few small documentation requests.

cary-rowen · 2024-10-17T23:44:51Z

To be honest, I wouldn't be opposed to documenting some technical details on how to use this feature in the user guide, even if they look somewhat specialized.
Overall, those who wish to use a screen reader to read mathematical content will probably always need some technical knowledge.

NSoiffer · 2024-10-24T05:29:05Z

GitHub says that @SaschaCowley requested updates. I must be missing something -- as far as I can see there are no unresolved requests. Can someone clarify for me what is not resolved?

Thanks.

SaschaCowley · 2024-10-25T01:36:17Z

> GitHub says that @SaschaCowley requested updates. I must be missing something -- as far as I can see there are no unresolved requests. Can someone clarify for me what is not resolved?
>
> Thanks.

This is just because I have not yet submitted an approving review. My apologies for the delay, I've been off sick for the last four days.

AppVeyorBot · 2024-10-25T02:33:59Z

PASS: Translation comments check.
PASS: License check.
PASS: Unit tests.
PASS: Lint check.
FAIL: System tests (tags: installer NVDA). See test results for more information.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/u626a9h32hjde57y/artifacts/output/l10nUtil.exe nvda_snapshot_pr17276-34412,b46a1cde.exe
CI timing (mins):
INIT 0.0,
INSTALL_START 1.4,
INSTALL_END 0.9,
BUILD_START 0.0,
BUILD_END 25.0,
TESTSETUP_START 0.0,
TESTSETUP_END 0.4,
TEST_START 0.0,
TEST_END 18.4,
FINISH_END 0.2

See test results for failed build of commit b46a1cde7a

Blocked by #17984

### Summary of the issue:

In #17276, NVDA began treating the value of formula nodes in Adobe Acrobat as MathML without any real validation. For PDF 2.0 documents this is no doubt a reasonable assumption. For PDF 1.7 documents generated by Microsoft Word, however, Word exposes its math speech alternative text as the value of the node, so that text is now processed by MathCAT, resulting in broken or junk navigation. At the same time, Microsoft has introduced a new custom MathML attribute, exposed on formula nodes in PDFs generated from Microsoft Word, that contains real MathML suitable for MathCAT. NVDA should make use of this new custom attribute if it exists.

### Description of user facing changes

In Adobe Acrobat, NVDA can now read and interact with math equations in PDF documents generated by Microsoft Word.

### Description of development approach

The mathml property of the AcrobatNode NVDAObject first tries to fetch Microsoft Office's custom MathML attribute, if it exists. Otherwise it falls back to using the node's value or descendants.
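The lookup order described above might be sketched like this. Everything here is illustrative: `FormulaNode` is a hypothetical stand-in rather than NVDA's AcrobatNode, the attribute key `office-mathml` is a placeholder because the real attribute name is not given in this thread, and the `<math` prefix check is an assumed form of validation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Placeholder key; the actual name of the custom attribute is not stated here.
OFFICE_MATHML_ATTR = "office-mathml"


@dataclass
class FormulaNode:
    """Hypothetical stand-in for a PDF formula node."""

    value: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)


def pick_mathml(node: FormulaNode) -> Optional[str]:
    """Prefer the custom attribute, then a value that looks like MathML."""
    from_attr = node.attributes.get(OFFICE_MATHML_ATTR)
    if from_attr:
        return from_attr
    if node.value and node.value.lstrip().startswith("<math"):
        return node.value
    # Nothing usable: the caller would fall back to descendants or alt text.
    return None
```

With this ordering, a Word-generated node whose value is speech text but whose attribute holds real MathML yields the attribute's MathML rather than the speech text.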

### Link to issue number:

Fixes #18133

### Summary of the issue:

In #17276, support was added for alt text in a formula (a fallback when MathML is not detected) by wrapping the alt text inside MathML's mtext tag. That text might contain `<` and `&` characters, which should be escaped but were not. The PDF support was added to 2025.1 beta.

The fix is trivial: `&` -> `&amp;` and `<` -> `&lt;` (in that order, so that the `&` introduced by `&lt;` is not escaped a second time).

In #18133 I said that the punctuation was not being spoken, but it turns out that MathCAT actually does convert the punctuation in strings, so LaTeX is spoken sensibly (but it is still hard to understand when read quickly).

### Description of user facing changes

In Adobe Acrobat, NVDA now properly reads alt text inside of Formula tags when no MathML is present (in various forms).

### Description of development approach

Simple string `replace()` calls to do the substitutions.

### Testing strategy:

I opened the test file [alt.pdf](https://github.com/user-attachments/files/20325821/alt.pdf), listened to what NVDA said, and checked the output in the speech viewer.

### Known issues with pull request:

None

### Code Review Checklist:

- [ ] Documentation:
  - Change log entry
  - User Documentation
  - Developer / Technical Documentation
  - Context sensitive help for GUI changes
- [x] Testing:
  - Unit tests
  - System (end to end) tests
  - Manual testing
- [x] UX of all users considered:
  - Speech
  - Braille
  - Low Vision
  - Different web browsers
  - Localization in other languages / culture than English
- [x] API is compatible with existing add-ons.
- [x] Security precautions taken.

@coderabbitai summary

---------

Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>
Co-authored-by: Sean Budd <seanbudd123@gmail.com>
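When the escaping is done with sequential `replace()` calls, the order of the calls matters. The sketch below is an illustration with hypothetical function names, not the code from the fix: it contrasts escaping the ampersand first with escaping it last.

```python
def escape_for_mtext(text: str) -> str:
    """Escape '&' first, then '<', so inserted entities are left alone."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def escape_wrong_order(text: str) -> str:
    """Counter-example: escaping '<' first lets '&lt;' be escaped again."""
    return text.replace("<", "&lt;").replace("&", "&amp;")
```

In the second variant, the `&` that belongs to a freshly inserted `&lt;` is itself rewritten, so a single `<` ends up double-escaped.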

NSoiffer and others added 2 commits October 10, 2024 08:19

Pre-commit auto-fix

e923065

NSoiffer and others added 3 commits October 12, 2024 03:18

Add MathML support in PDF change.

be541bd

Merge branch 'SupportMathML' of github.com:NSoiffer/nvda into Support…

9588e13

…MathML

Pre-commit auto-fix

11e2128

NSoiffer marked this pull request as ready for review October 14, 2024 20:12

NSoiffer requested a review from a team as a code owner October 14, 2024 20:12

NSoiffer requested a review from SaschaCowley October 14, 2024 20:12

OzancanKaratas requested a review from michaelDCurran October 14, 2024 20:24

seanbudd approved these changes Oct 16, 2024

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py Outdated

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py Outdated

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py Outdated

Comment thread user_docs/en/changes.md Outdated

seanbudd and others added 2 commits October 17, 2024 10:09

Apply suggestions from code review

1485fdf

Pre-commit auto-fix

b95c078

SaschaCowley requested changes Oct 16, 2024

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py Outdated

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py

Comment thread user_docs/en/changes.md Outdated

NSoiffer added 4 commits October 17, 2024 01:12

Clean up comments and log.debug statements

0a62929

Improve the change comment

e626dae

Added info about PDF processors that support this.

Merge branch 'SupportMathML' of github.com:NSoiffer/nvda into Support…

62926e2

…MathML

Made each sentence be on its own line as requested by @codeofdusk.

909423d

NSoiffer and others added 2 commits October 17, 2024 19:33

Fix multiline debug "fix" -- actually tested the code this time...

c6d85d6

Pre-commit auto-fix

a8c59f2

SaschaCowley requested changes Oct 17, 2024

Comment thread user_docs/en/changes.md

Comment thread user_docs/en/changes.md Outdated

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py

Comment thread source/NVDAObjects/IAccessible/adobeAcrobat.py

Update user_docs/en/changes.md

c2f6529

Add documentation string.

ba3803a

Also, make a comment a bit clearer.

NSoiffer and others added 3 commits October 19, 2024 05:09

Merge branch 'SupportMathML' of github.com:NSoiffer/nvda into Support…

b4a7a41

…MathML

Update user_docs/en/changes.md

de5e89c

Co-authored-by: Sascha Cowley <16543535+SaschaCowley@users.noreply.github.com>

Merge branch 'master' into SupportMathML

188119d

seanbudd approved these changes Oct 20, 2024

Comment thread user_docs/en/changes.md Outdated

seanbudd added the conceptApproved label Oct 21, 2024

Update user_docs/en/changes.md

c77a07e

Co-authored-by: Sean Budd <seanbudd123@gmail.com>

XLTechie reviewed Oct 22, 2024

Comment thread user_docs/en/changes.md Outdated

Update user_docs/en/changes.md

6d92dc4

Co-authored-by: Luke Davis <8139760+XLTechie@users.noreply.github.com>

SaschaCowley reviewed Oct 25, 2024

Comment thread user_docs/en/changes.md Outdated

Update user_docs/en/changes.md

4edc60d

SaschaCowley approved these changes Oct 25, 2024

SaschaCowley merged commit 38e12d3 into nvaccess:master Oct 25, 2024

github-actions Bot added this to the 2025.1 milestone Oct 25, 2024

codeofdusk mentioned this pull request Apr 20, 2025

Use MathML attributes for PDFs read in Adobe Acrobat #17984

Merged

5 tasks

michaelDCurran mentioned this pull request May 5, 2025

AdobeAcrobat: support custom Microsoft Office mathml attribute. #18056

Merged

5 tasks

This was referenced May 20, 2025

Need to escape special characters in HTML when reading alt text for PDF documents #18133

Closed

Fix to case where alt text for math in PDF contains & and < #18135

Merged
