Add segmentation + fulltext annotations.#1301

Merged
lfoppiano merged 37 commits into
grobidOrg:masterfrom
haydn-jones:annotations
Jul 28, 2025

Conversation

@haydn-jones

haydn-jones commented Jun 8, 2025

Contributor

Discussed in #1249

This contains only the TEI right now, will add raw info / updated models later on.

@haydn-jones

haydn-jones commented Jun 9, 2025

Contributor Author

@lfoppiano Did a header pass, also found this header which is not split (not one of the files I added). Should I fix it?

https://github.com/haydn-jones/grobid/blob/01a2a55fe3cbe19e19cebef6ee240d1151e00963/grobid-trainer/resources/dataset/fulltext/corpus/tei/1309.7222.training.fulltext.tei.xml#L23

Similarly, there are some instances I found in PR #1254 of <listbibl> applied to body text during segmentation, would it be reasonable to fix those in this PR?

@lfoppiano

Member

> @lfoppiano Did a header pass, also found this header which is not split (not one of the files I added). Should I fix it?
>
> https://github.com/haydn-jones/grobid/blob/01a2a55fe3cbe19e19cebef6ee240d1151e00963/grobid-trainer/resources/dataset/fulltext/corpus/tei/1309.7222.training.fulltext.tei.xml#L23

Yes, please 😄

> Similarly, there are some instances I found in PR #1254 of <listbibl> applied to body text during segmentation, would it be reasonable to fix those in this PR?

I'm not sure I understood. In general, if you find any fixes to the training data, including in previously added files, you can add them to this PR; that's fine with me.

@lfoppiano

Member

@haydn-jones could you please also add the corresponding files .fulltext and .segmentation in the respective raw directories for each model?

@haydn-jones

Contributor Author

@lfoppiano Done. I'd be happy to train the segmentation model tonight if that would be helpful.

@lfoppiano

Member

Yes please, go ahead.

Validating the fulltext training data will take me some more days.

@lfoppiano

Member

@haydn-jones I've completed the validation of the fulltext files. I've also started retraining the models (there was a small correction in the segmentation).

I don't know how much capacity you have for correcting more data; I could add a few more articles to the mix. Let me know.

@haydn-jones

Contributor Author

@lfoppiano Correcting a few more would be fine, I know there are a few issues like #1279 and #1270 that might be fixed with a few added papers. I can find relevant CC-0/BY papers/drug labels for those two and do corrections then push for you to review if you'd like.

If you have some other papers in mind, I'd be happy to look at those too.

@lfoppiano

Member

Sure. However, I would wait until we have the updated models; then we can see which issues are still occurring and add more papers.
I will also dig into the accumulated papers on my side 😄

@lfoppiano

Member

@haydn-jones I have pushed the new re-trained models. Could you try to see whether there are some improvements, and maybe if you could quantify them with your data? I will try to run some benchmarks, in the meantime.

We should double-check that the <head> elements are not getting mixed up with the different structures (see Patrice's comment).

@haydn-jones

haydn-jones commented Jun 18, 2025

Contributor Author

@lfoppiano In general, everything looks much much better, though there are still some minor issues like headings not being split up (i.e. <head>4 Experimental procedures<lb/> 4.1 General experimental procedures<lb/></head>), and parts of IUPAC names being treated as bib refs, but overall much better.
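One way to count how often this "unsplit heading" case occurs in a batch of outputs is a small heuristic check. The sketch below is not part of GROBID; the function names are hypothetical, and it assumes that separate headings begin with a section number such as `4` or `4.1` and are separated by `<lb/>` (or `<lb>`) inside a single `<head>`.

```python
import re

# Hypothetical heuristic, not GROBID code: a heading segment is assumed to
# start with a section number such as "4" or "4.1", followed by text.
NUMBERED = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")
LINE_BREAK = re.compile(r"<lb\s*/?>")


def split_merged_head(head_content: str) -> list[str]:
    """Split the content of one <head> at <lb/> markers and return the
    segments that look like separately numbered headings."""
    segments = [s.strip() for s in LINE_BREAK.split(head_content)]
    return [s for s in segments if s and NUMBERED.match(s)]


def is_merged(head_content: str) -> bool:
    """True when a single <head> holds two or more numbered headings."""
    return len(split_merged_head(head_content)) >= 2
```

Unnumbered headings and headings that merely wrap onto a second line are not flagged, so this undercounts rather than overcounts.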

Would quantification involve creating a held out set and running a benchmark on it? I see e2e eval discussed here but I'm not 100% sure on how to do it https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/.

@lfoppiano

Member

I'm dividing the answer in two.

@haydn-jones did you test using papers other than the ones used for training?

For the header, I need to revise the logic that reconstructs them while still respecting the TEI.
We might start going down a slippery slope here 😅

Now, Grobid allows a set of divs with a single <head>:

<text>
  <body>
    <div>
      <head>...</head>
      <p>...</p>
      <p>...</p>
    </div>
    ....

When there is a head followed by a sub-head, the head is represented by a div with an empty set of paragraphs:

<text>
  <body>
    <div>
      <head>head level 1</head>
    </div>
    <div>
      <head>head level 2</head>
      <p>...</p>
      <p>...</p>
    </div>
    ....

For simplicity, the structure has been kept flat, and we should avoid changing the main structure so as not to break backward compatibility. The clean solution would be nested <div> elements, but that might have been discarded because of the unreliability of the <head> recognition.

I wonder if there is a way that is flexible enough and TEI compliant, like, for example:

<text>
  <body>
    <div>
      <head>...</head>
      <p part="subhead">....</p>
      <p>...</p>
      <p>...</p>
    </div>
    ....

@laurentromary do you have any suggestions/comments? @kermitt2 already gave some info about past attempts here.

@lfoppiano

Member

@haydn-jones Regarding the evaluation: yes, I'm talking about the benchmarks at the link you posted. I've had some problems with my automated Docker image, so I had to re-run them in a separate environment, and that took me a few days.

In general, the fulltext metrics have increased compared to before (particularly for the PLOS benchmark).
How's your timeline? If you have more time now, let's keep the momentum and add a few new files with chemical formulas that are badly recognised.

@lfoppiano lfoppiano linked an issue Jul 5, 2025 that may be closed by this pull request
@lfoppiano

Member

> I would dream of nested div's. Would it be worth a try and make it resilient to the quality of the output?

@laurentromary Yes; however, it is a fairly breaking change. It could be done easily once we find a robust approach for obtaining a hierarchical set of headers with relatively high precision. 😅

@laurentromary it's fairly complicated to reconstruct the hierarchy from a flat list (the labelled sequence) without an overview of the visual characteristics.

One option is to cluster the different heads by font and assume that heads sharing a font are at the same level. I believe @kermitt2 has already tried that without promising results, because the fonts are not always reliable.
I'm looking into using the outline (when available) only to identify the root set of <head>, so that at least the main sections are identified correctly.
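For readers unfamiliar with the font-clustering idea, here is a minimal sketch of it. It is an illustration only, not the approach previously tried in GROBID: the `(text, font, size)` input format and the rule "larger font means higher level" are assumptions, and as noted above, real PDF fonts are often too inconsistent for this to work reliably.

```python
def assign_levels(heads):
    """Assign a level to each heading by its font style.

    heads: list of (text, font_name, font_size) tuples (assumed input format).
    Headings sharing the same (size, font) pair get the same level;
    the largest font size is level 1.
    """
    # Distinct styles, ordered by decreasing size, then by font name for ties.
    styles = sorted(
        {(size, font) for _, font, size in heads},
        key=lambda style: (-style[0], style[1]),
    )
    level_of = {style: index + 1 for index, style in enumerate(styles)}
    return [(text, level_of[(size, font)]) for text, font, size in heads]
```

With one bold 12 pt style for main sections and one italic 10 pt style for subsections, this yields levels 1 and 2; any stray font change in a single heading would introduce a spurious extra level, which is the failure mode described above.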

Currently, as I showed before, a sequence with a double <head> results in a double <div>, but this does not solve the issue. Take a labelled sequence such as:

<head>Materials and Methods</head>
<head>Data Collection</head>
<p ......>
<p .....>
<head>Conclusions</head>
<p .....>

The resulting TEI is now

<div>
  <head>Materials and Methods</head>
</div>
<div>
<head>Data Collection</head>
<p ......>
<p .....>
</div>
<div>
<head>Conclusions</head>
<p .....>
</div>

But it should be

<div>
  <head>Materials and Methods</head>
</div>
<div>
<head>Data Collection</head>
<p ......>
<p .....>
</div>
<div>
<head>Conclusions</head>
</div>
<div>
<p .....>
</div>

Otherwise, anyone who tries to use the hierarchy will get Conclusions inside Materials and Methods.
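The current flat grouping (the "is now" output above, not the proposed one) can be sketched as follows. This is a simplified illustration, not GROBID's actual Java implementation; the `(label, text)` input format and the dictionary output are assumptions made for the example.

```python
def group_into_divs(sequence):
    """Group a flat labelled sequence into flat divs.

    sequence: list of (label, text) pairs, label being "head" or "p"
    (an assumed, simplified form of the labelled sequence).
    Every head opens a new div; paragraphs attach to the most recent div.
    """
    divs = []
    for label, text in sequence:
        # A head always starts a new div; so does a paragraph with no div yet.
        if label == "head" or not divs:
            divs.append({"head": None, "paragraphs": []})
        if label == "head":
            divs[-1]["head"] = text
        else:
            divs[-1]["paragraphs"].append(text)
    return divs
```

Applied to the sequence above, this produces three sibling divs, the first of which ("Materials and Methods") has no paragraphs. Nothing in the output records that "Data Collection" is a child of "Materials and Methods" while "Conclusions" is not, which is why a consumer cannot safely infer the hierarchy from empty divs alone.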

@lfoppiano

Member

@haydn-jones I've corrected your annotations and re-trained the two models. I'm currently looking into how to improve the <head> structure, but this task may take longer, so I will commit it on a separate branch.

@haydn-jones

haydn-jones commented Jul 7, 2025

Contributor Author

@lfoppiano Was on vacation so I stepped away, I'll have your files reviewed by Friday.

Edit: With respect to <head> structure, I've noticed that the full body model still frequently won't break up a heading immediately followed by a subheading (like <head>3 Method <lb>3.1 Assays <lb></head>), any idea why?

@haydn-jones

Contributor Author

@lfoppiano Sorry that took so long, I pushed corrections. I think I'm all out of energy for adding more training data, not sure about you.

In this last batch, I noticed a lot of instances of unknown characters, like If V W,i � 0 here. This also happens in a lot of my papers with characters that I really need (they are usually units for measurements, not having them makes my extractions more or less useless). Not super familiar with the code, but I expect this could be fixed by including more mappings in https://github.com/kermitt2/grobid/blob/master/grobid-home/pdfalto/languages/xpdf-others/symbols.unicodeRemapping, is that right? If so I'd like to hunt down some of these remappings and fix that. Could hijack this PR or open another.
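To see how widespread the unknown characters are across a batch of outputs, one option is to count occurrences of the Unicode replacement character (U+FFFD) per file. The sketch below is a hypothetical helper, not part of GROBID; the `*.tei.xml` glob pattern and the function names are assumptions.

```python
from collections import Counter
from pathlib import Path

# U+FFFD, the character that shows up as "�" in the extracted text.
REPLACEMENT = "\ufffd"


def count_unknown(text: str) -> int:
    """Number of replacement characters in a string."""
    return text.count(REPLACEMENT)


def scan_directory(directory, pattern="*.tei.xml"):
    """Return a Counter mapping file name to replacement-character count,
    listing only files that contain at least one."""
    counts = Counter()
    for path in Path(directory).glob(pattern):
        # errors="replace" turns undecodable bytes into U+FFFD as well,
        # so files with broken encoding are also reported.
        text = path.read_text(encoding="utf-8", errors="replace")
        found = count_unknown(text)
        if found:
            counts[path.name] = found
    return counts
```

Calling `scan_directory(...).most_common(20)` on an output folder would list the worst-affected files first, which may help show whether the problem clusters around particular journals.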

@lfoppiano

lfoppiano commented Jul 17, 2025

Member

@haydn-jones Thanks! I think we're done with annotations. I will train and update the models.

Regarding the invalid characters for the equations, there are two types of problems:

  1. the one you pointed out, which happens in equations and cannot be solved without a visual model, because the mapping of the text stream is simply incorrect. One option could be to take the coordinates of the equations and send them to a VLLM, for example, as a post-processing step (we need to experiment with this, but the amount of data might be relatively small and should not impact the processing time too much)

    image
  2. the lack of symbols extracted due to invalid font mapping (see here), which can be adjusted for systematic problems with certain publishers by tuning the extraction. If you could send me some examples, I can have a look to be more certain.
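As a first step towards the post-processing idea in point 1, the equation coordinates would need to be parsed into bounding boxes before any region can be cropped. The sketch below assumes the `coords` attribute has the form `page,x,y,width,height`, with several boxes separated by `;`. That layout is my reading of GROBID's coordinate output and should be verified against the documentation; the `Box` type and `parse_coords` name are hypothetical. The call to a visual model is deliberately left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One bounding box; field order follows the assumed coords layout."""
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_coords(coords: str) -> list[Box]:
    """Parse a coords string such as "3,100.0,200.5,50.0,12.0;3,..."
    (assumed format) into a list of boxes."""
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ";" or an empty attribute
        page, x, y, width, height = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(width), float(height)))
    return boxes
```

Each resulting box could then be used to crop the corresponding page region for whatever model is chosen in the experiments mentioned above.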

I suggest we discuss these problems in a separate issue, so that we can close this one once the models are working (I have drafted a solution for the headers that still needs to be tested).

@haydn-jones

Contributor Author

Ok sounds good, I'll see if any of the unknown character issues are widespread across certain journals and handle that in a different issue.

With respect to the annotations, I'm a lot happier with the output now! It's not perfect of course, but everything looks so much better on my held-out papers. An example of old vs new for synthesis / NMR information:

image

Would be nice if this merges at some point as I have a large batch of papers I would like to process.

@lfoppiano

Member

@haydn-jones Steady improvements in all metrics related to fulltext. I think we can merge. Will do that next week.

@lfoppiano lfoppiano requested a review from Copilot July 20, 2025 05:51
@lfoppiano lfoppiano added this to the 0.9.0 milestone Jul 20, 2025

Copilot AI left a comment

Contributor

Pull Request Overview

This pull request adds segmentation and fulltext annotations to the GROBID system, focusing on TEI output format with improvements to model performance across multiple evaluation datasets.

  • Updated benchmarking results across three major datasets (PLOS, eLife, and bioRxiv) showing general performance improvements in fulltext structure recognition
  • Enhanced performance metrics for reference extraction, citation context resolution, and document structure parsing
  • Improved evaluation time measurements and formatting consistency across benchmark reports

Reviewed Changes

Copilot reviewed 3 out of 199 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| doc/Benchmarking-plos.md | Updated benchmark metrics showing improvements in reference citation extraction, figure/table reference accuracy, and fulltext structure recognition |
| doc/Benchmarking-elife.md | Refreshed performance metrics with enhanced citation context resolution and improved fulltext parsing results |
| doc/Benchmarking-biorxiv.md | Updated evaluation results demonstrating better reference extraction and document structure analysis capabilities |
