Skip to content

Add MathML support when importing PubMed#9963

Merged
calixtus merged 5 commits into
JabRef:mainfrom
aqurilla:fix-for-issue-4273
Jun 1, 2023
Merged

Add MathML support when importing PubMed#9963
calixtus merged 5 commits into
JabRef:mainfrom
aqurilla:fix-for-issue-4273

Conversation

@aqurilla

@aqurilla aqurilla commented May 30, 2023

Copy link
Copy Markdown
Contributor

This fixes #4273 and fixes #6302 by adding a MathML parser that handles <math> elements in the imported XML file. The parser uses an XLST transformation file to perform the conversion from MathML to LaTeX.

I tried out a couple of different XLST files and the one at https://xsltml.sourceforge.net/ works the best for string output. This library contains a README file which I have included - please let me know if we need to remove it or reorganize its contents elsewhere.

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@aqurilla aqurilla marked this pull request as draft May 30, 2023 03:40
@aqurilla aqurilla marked this pull request as ready for review May 30, 2023 04:30
@Siedlerchr

Copy link
Copy Markdown
Member

Thanks for tackling this issue! I checked the license/readme of the xslt library and when I see this right it is based on MIT license. So yeah, we definitely need to keep this Readme together with the xslt statements.
I'll take a look at your implementaiton later


public class MathMLParser {
private static final Logger LOGGER = LoggerFactory.getLogger(MathMLParser.class);
private static final String XSLT_FILE_PATH = "src/main/resources/xslt/mathml_latex/mmltex.xsl";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could lead to problems when JabRef is packaged, as then the file path is inside the jar.
better:

Suggested change
private static final String XSLT_FILE_PATH = "src/main/resources/xslt/mathml_latex/mmltex.xsl";
private static final String XSLT_FILE_PATH = "/xslt/mathml_latex/mmltex.xsl";

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback! I'll make these changes


// convert to LaTeX using XSLT file
Source xmlSource = new StreamSource(new StringReader(xmlContent));
Source xsltSource = new StreamSource(new File(XSLT_FILE_PATH));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above, when Jabref is packaged as modularized app the resource loading is different, so I would rather use something like this. Important is the second argument; otherwise the file cannot be found.

Suggested change
Source xsltSource = new StreamSource(new File(XSLT_FILE_PATH));
URL xsltResource = MathMLParser.class.getResource(XSLT_FILE_PATH);
xsltSource = new StreamSource(xsltResource.openStream(), xsltResource.toURI().toASCIIString());

}

private static String getXMLCData(XMLStreamReader reader) {
return "<![CDATA[" + reader.getText() + "]]>";

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't understand this class. What is the purpose of building xml tags manually again?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class was added since we are using a StaX parser in the MedlineImporter class, which does not load the full XML data into memory. Instead we have a stream and can progress only in a forward manner traversing through all the tag elements. I was not able to find any inbuilt library method on the StaX parser itself to easily extract the content between two main parent tags (<math> in this case), so this custom logic was required. Following that, the extracted XML string is used for carrying out the transformation. I hope that clarifies things!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah I see, thanks for the explanation. Searched a bit around but seems like the Stax Parser is only for single documents. So this is fine for me then!

Comment thread src/main/java/org/jabref/logic/importer/util/MathMLParser.java
@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Jun 1, 2023
@calixtus calixtus merged commit c7ada34 into JabRef:main Jun 1, 2023
@calixtus

calixtus commented Jun 1, 2023

Copy link
Copy Markdown
Member

Thank you for your contribution. We would be happy to see more contributions from your side. 😍

@koppor koppor mentioned this pull request Sep 7, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Pubmed XML import: html tags in title produce javax.xml.bind.JAXBElement@ Add MathML support when importing PubMed

3 participants