Add funding statement in TEI output #959
|
Thanks ! |
|
Following the discussion in #956, I am updating the PR so that the post-processing is limited to processShort(), instead of changing the general fulltext model decoding. |
|
The changes seem to work fine. On the example from #956 we obtain:
<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0">
<p>This study was supported by the South Asian Clinical Toxicology Research Collaboration, which is funded by The Wellcome Trust/ National Health and Medical Research Council International Collaborative Research Grant GR071669MA. The funding bodies had no role in analyzing or interpreting the data or writing the article.</p>
</div>
</div>
No runtime error on the PMC_sample set and biorxiv_test_2000. |
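For anyone who wants to check the new output programmatically, here is a minimal sketch in Python of how the funding paragraphs could be read back from a fragment like the one above. The helper names (`local_name`, `funding_paragraphs`) are hypothetical and not part of GROBID. The sketch matches elements by local name because, in this fragment, only the inner div carries the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the comment above (outer div without namespace,
# inner div in the TEI namespace).
SAMPLE = """<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0">
<p>This study was supported by the South Asian Clinical Toxicology Research Collaboration, which is funded by The Wellcome Trust/ National Health and Medical Research Council International Collaborative Research Grant GR071669MA. The funding bodies had no role in analyzing or interpreting the data or writing the article.</p>
</div>
</div>"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def funding_paragraphs(xml_text: str) -> list:
    """Return the text of every <p> found under a <div type="funding">.

    Hypothetical helper: it ignores namespaces on purpose, so it works
    whether or not the funding div itself is namespace-qualified.
    """
    root = ET.fromstring(xml_text)
    paragraphs = []
    for element in root.iter():
        if local_name(element.tag) == "div" and element.get("type") == "funding":
            for child in element.iter():
                if local_name(child.tag) == "p":
                    paragraphs.append("".join(child.itertext()).strip())
    return paragraphs


if __name__ == "__main__":
    for text in funding_paragraphs(SAMPLE):
        print(text)
```

An empty list from `funding_paragraphs` would correspond to the "empty funding" cases, which makes this kind of check handy when scanning a batch of TEI results.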
|
I found several cases of empty funding where the paper has the funding statement in the back of the document, the TEI serialisation calls E.g. this document: bj4360053.pdf. Unfortunately, the result does not help, and the hack in processShort() does not work here: because a different label comes first, the hack won't fix the sequence. I'm not sure how I could fix this properly. |
|
We could wait for future versions which will remove the tables and figures from the fulltext model, or apply the same fix as in processShort for every table and figure labels (not just the ones starting the sequence). What do you think? |
|
For the moment it is better to fix it, because as it stands some data are lost. I can apply the same fix as in processShort() to all the figure and table labels. |
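To make the two options concrete, here is a rough Python sketch of the difference between fixing only the labels that start the sequence and fixing every table and figure label. It is an illustration, not the GROBID code. It assumes labels use the "I-<label>" form for the start of a segment and "<label>" for its continuation. It also assumes the fix consists of relabeling as "<paragraph>", which the discussion does not state. The function and constant names are hypothetical.

```python
# Assumption: these are the labels that should not appear in a short text
# segment such as a funding statement.
NON_TEXT_LABELS = {"<figure>", "<table>"}
# Assumption: the replacement label; the actual fix may use something else.
TARGET_LABEL = "<paragraph>"


def base_label(label: str) -> str:
    """Drop the 'I-' segment-start prefix, if any."""
    return label[2:] if label.startswith("I-") else label


def relabel(labels, leading_only=False):
    """Replace figure/table labels by the target label.

    With leading_only=True, only the run of figure/table labels at the very
    start of the sequence is rewritten; otherwise every occurrence is.
    The 'I-' prefix is preserved so segment boundaries stay where they were.
    """
    result = []
    in_leading_run = True
    for label in labels:
        is_non_text = base_label(label) in NON_TEXT_LABELS
        if not is_non_text:
            in_leading_run = False
        if is_non_text and (in_leading_run or not leading_only):
            prefix = "I-" if label.startswith("I-") else ""
            result.append(prefix + TARGET_LABEL)
        else:
            result.append(label)
    return result


if __name__ == "__main__":
    sequence = ["I-<paragraph>", "<paragraph>", "I-<table>", "<table>"]
    print(relabel(sequence, leading_only=True))   # table labels are left untouched
    print(relabel(sequence))                      # table labels are rewritten
```

With `leading_only=True`, a sequence that begins with a different label is left unchanged, which is the failure mode described for bj4360053.pdf. The default mode rewrites the labels wherever they occur.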
|
Actually |
|
Using the biorxiv corpus for end2end evaluation, we get this result: |
|
So the two problematic cases are working fine with the current version, yeah :D I cleaned the previous hack in |
|
Thanks !! |
This PR adds the funding statement (already in training data and models) in the TEI output.
The funding output will be placed in a standardised position in the <back> of the TEI output, as for the availability statement.
e.g. (with sentence segmentation)
This should fix #895 and #652.
As discussed in #957, this does not cover the case where only the header is processed.