Add segmentation + fulltext annotations. by haydn-jones · Pull Request #1301 · grobidOrg/grobid

haydn-jones · 2025-06-08T17:12:07Z

Discussed in #1249

This contains only the TEI right now, will add raw info / updated models later on.

haydn-jones · 2025-06-09T17:32:54Z

@lfoppiano Did a header pass, also found this header which is not split (not one of the files I added). Should I fix it?

https://github.com/haydn-jones/grobid/blob/01a2a55fe3cbe19e19cebef6ee240d1151e00963/grobid-trainer/resources/dataset/fulltext/corpus/tei/1309.7222.training.fulltext.tei.xml#L23

Similarly, there are some instances I found in PR #1254 of <listbibl> applied to body text during segmentation, would it be reasonable to fix those in this PR?

lfoppiano · 2025-06-09T18:25:36Z

> @lfoppiano Did a header pass, also found this header which is not split (not one of the files I added). Should I fix it?
>
> https://github.com/haydn-jones/grobid/blob/01a2a55fe3cbe19e19cebef6ee240d1151e00963/grobid-trainer/resources/dataset/fulltext/corpus/tei/1309.7222.training.fulltext.tei.xml#L23

Yes, please 😄

> Similarly, there are some instances I found in PR #1254 of <listbibl> applied to body text during segmentation, would it be reasonable to fix those in this PR?

I'm not sure I understood. In general, if you find fixes to the training data, including in previously added files, you can add them to this PR; that's fine with me.

lfoppiano · 2025-06-10T07:20:31Z

@haydn-jones could you please also add the corresponding files .fulltext and .segmentation in the respective raw directories for each model?

haydn-jones · 2025-06-10T23:15:21Z

@lfoppiano Done. I'd be happy to train the segmentation model tonight if that would be helpful.

lfoppiano · 2025-06-11T03:52:48Z

Yes please, go ahead.

Validating the fulltext training data will take me some more days.

lfoppiano · 2025-06-13T13:15:16Z

@haydn-jones I've completed the validation of the fulltext files. I've also started retraining the models (there was a small correction in the segmentation).

I don't know how much capacity you have for correcting more data; I could add a few more articles to the mix. Let me know.

haydn-jones · 2025-06-13T14:45:33Z

@lfoppiano Correcting a few more would be fine, I know there are a few issues like #1279 and #1270 that might be fixed with a few added papers. I can find relevant CC-0/BY papers/drug labels for those two and do corrections then push for you to review if you'd like.

If you have some other papers in mind, I'd be happy to look at those too.

lfoppiano · 2025-06-13T15:05:05Z

Sure, however I would wait until we have the updated models; then we can see which issues are still occurring and add more papers.
I will also dig into the accumulated papers on my side 😄

lfoppiano · 2025-06-16T17:13:28Z

@haydn-jones I have pushed the new re-trained models. Could you try to see whether there are some improvements, and maybe if you could quantify them with your data? I will try to run some benchmarks, in the meantime.

We should double-check that the <head> elements are not getting mixed up with the other structures (see Patrice's comment).

haydn-jones · 2025-06-18T15:21:05Z

@lfoppiano In general, everything looks much much better, though there are still some minor issues like headings not being split up (i.e. <head>4 Experimental procedures<lb/> 4.1 General experimental procedures<lb/></head>), and parts of IUPAC names being treated as bib refs, but overall much better.
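One way to flag such unsplit headings while reviewing annotations is a small script like the sketch below. It is not part of GROBID: the names `find_merged_heads` and `split_head` and the regular expressions are assumptions made for illustration, and it only looks for an `<lb>` or `<lb/>` that is followed by something shaped like a section number.

```python
import re

# An <lb> or <lb/> marker followed by what looks like a section number and a title,
# e.g. "<lb/> 4.1 General" or "<lb>3.1 Assays".
MERGED_BREAK = re.compile(r"<lb\s*/?>\s*(?=\d+(?:\.\d+)*\s+\S)")
# A <head> element, with or without attributes (but not e.g. <header>).
HEAD = re.compile(r"<head(?:\s[^>]*)?>(.*?)</head>", re.DOTALL)
ANY_LB = re.compile(r"<lb\s*/?>")


def find_merged_heads(tei_text):
    """Return the content of every <head> that seems to hold more than one numbered title."""
    return [m.group(1) for m in HEAD.finditer(tei_text) if MERGED_BREAK.search(m.group(1))]


def split_head(content):
    """Split a merged heading at each line break that precedes a section number."""
    parts = MERGED_BREAK.split(content)
    return [ANY_LB.sub("", part).strip() for part in parts if part.strip()]


sample = "<head>4 Experimental procedures<lb/> 4.1 General experimental procedures<lb/></head>"
for merged in find_merged_heads(sample):
    print(split_head(merged))
```

Unnumbered sub-headings would not be caught by this heuristic, so it can only help with review, not replace it.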

Would quantification involve creating a held out set and running a benchmark on it? I see e2e eval discussed here but I'm not 100% sure on how to do it https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/.

lfoppiano · 2025-06-19T09:21:12Z

I'm dividing the answer in two.

@haydn-jones did you test using papers, other than the one used for training?

For the header, I need to revise the logic that reconstructs them while still respecting the TEI.
We might start going down a slippery slope here 😅

Now, Grobid allows a set of divs, each with a single <head>:

<text>
<body>
<div>
<head>...</head>
<p>...</p>
<p>...</p>
</div>
....

When there is a head followed by a sub-head, this was represented by giving the first <div> an empty set of paragraphs:

<text>
<body>
<div>
<head>head level 1</head>
</div>
<div>
<head>head level 2</head>
<p>...</p>
<p>...</p>
</div>
....

The structure has been kept deliberately simple, and we should avoid changing the main structure so as not to break backward compatibility. The clean solution would be nested <div> elements, but that may have been discarded because of the unreliability of the <head> recognition.
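To make the nested alternative concrete, here is a minimal sketch that builds nested <div> elements, assuming the level of each <head> were already known (which is exactly the part that is hard to obtain reliably). The function `build_nested_body` and its `(level, head, paragraphs)` input format are hypothetical, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_nested_body(sections):
    """Build a <body> with nested <div> elements.

    sections: list of (level, head_text, paragraphs), where level 1 is a top-level section.
    """
    body = ET.Element("body")
    stack = [(0, body)]  # (level, element) for every container that is still open
    for level, head_text, paragraphs in sections:
        # Close every open div at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        ET.SubElement(div, "head").text = head_text
        for paragraph in paragraphs:
            ET.SubElement(div, "p").text = paragraph
        stack.append((level, div))
    return body


body = build_nested_body([
    (1, "head level 1", []),
    (2, "head level 2", ["...", "..."]),
])
print(ET.tostring(body, encoding="unicode"))
```

A wrong level on a single head would move every following section under the wrong parent, which is why the reliability of the <head> recognition matters so much here.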

I wonder if there is a way that is flexible enough and TEI compliant, like, for example:

<text>
<body>
<div>
<head>...</head>
<p part="subhead">....</p>
<p>...</p>
<p>...</p>
</div>
....

@laurentromary do you have any suggestions/comments? @kermitt2 already gave some info about past attempts here.

lfoppiano · 2025-06-19T09:23:33Z

@haydn-jones Regarding the evaluation, yes I'm talking about the link to the benchmarks that you posted. I've had some problems with my automated docker image so I had to re-run them in a separate environment and that took me a few days.

In general the fulltext metrics have increased compared with before (particularly for the PLOS benchmark).
How's your timeline? If you have more time now, let's keep the momentum and add a few new files with chemical formulas that are badly recognised.

lfoppiano · 2025-07-05T16:48:30Z

> I would dream of nested div's. Would it be worth a try and make it resilient to the quality of the output?

@laurentromary Yes, however it is a fairly breaking change. It could be done easily once we find a robust approach for obtaining a hierarchical set of headers with relatively high precision. 😅

@laurentromary it's fairly complicated to reconstruct the hierarchy from a flat list (the labelled sequence) without an overview of the visual characteristics.

One option is to cluster the different heads by font and assume that heads with the same font are at the same level. I believe @kermitt2 already tried that without promising results, because the fonts are not always reliable.
I'm looking into using the outline (when available) only to identify the root set of <head> elements, to at least try to identify the main sections correctly.
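The font-clustering idea can be sketched in a few lines. This is only an illustration, not the earlier attempt mentioned above: it assumes each head comes with a font size, uses size as the only feature, and the name `levels_by_font_size` is hypothetical.

```python
def levels_by_font_size(heads):
    """Assign a level to each head: the largest font size is level 1, the next is level 2, etc.

    heads: list of (text, font_size) in document order.
    """
    sizes = sorted({size for _, size in heads}, reverse=True)
    level_of = {size: index + 1 for index, size in enumerate(sizes)}
    return [(text, level_of[size]) for text, size in heads]


heads = [
    ("Materials and Methods", 14.0),
    ("Data Collection", 12.0),
    ("Conclusions", 14.0),
]
print(levels_by_font_size(heads))
```

If a publisher sets two levels in the same size, or the extracted font information is wrong, every head in that cluster ends up at the same level, which matches the reliability concern raised above.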

Currently, a sequence with two consecutive <head> elements results, as I showed before, in two <div> elements, but this does not solve the issue. Given the sequence:

<head>Materials and Methods</head>
<head>Data Collection</head>
<p ......>
<p .....>
<head>Conclusions</head>
<p .....>

The resulting TEI is now:

<div>
  <head>Materials and Methods</head>
</div>
<div>
<head>Data Collection</head>
<p ......>
<p .....>
</div>
<div>
<head>Conclusions</head>
<p .....>
</div>

But it should be

<div>
  <head>Materials and Methods</head>
</div>
<div>
<head>Data Collection</head>
<p ......>
<p .....>
</div>
<div>
<head>Conclusions</head>
</div>
<div>
<p .....>
</div>

Otherwise, anyone who tries to use the hierarchy will get Conclusions inside Materials and Methods.
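The ambiguity can be reproduced with a small sketch. It assumes a flat labelled sequence of `("head", text)` and `("p", text)` items and a naive consumer that treats a head-only div as the parent of the divs that follow; both `group_into_divs` and `infer_parents` are hypothetical helpers, not GROBID code.

```python
def group_into_divs(sequence):
    """Group a flat labelled sequence into divs: every head opens a new div."""
    divs = []
    for label, text in sequence:
        if label == "head":
            divs.append({"head": text, "paragraphs": []})
        elif label == "p":
            if not divs:
                divs.append({"head": None, "paragraphs": []})
            divs[-1]["paragraphs"].append(text)
    return divs


def infer_parents(divs):
    """Naive reading: a div without paragraphs is the parent of the divs that follow it."""
    parent = None
    result = []
    for div in divs:
        if not div["paragraphs"]:
            parent = div["head"]
            result.append((div["head"], None))
        else:
            result.append((div["head"], parent))
    return result


sequence = [
    ("head", "Materials and Methods"),
    ("head", "Data Collection"),
    ("p", "..."),
    ("p", "..."),
    ("head", "Conclusions"),
    ("p", "..."),
]
for head, parent in infer_parents(group_into_divs(sequence)):
    print(head, "->", parent)
```

With the current flat output, this reading attaches Conclusions to Materials and Methods, because nothing in the structure marks where the first section ends.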

lfoppiano · 2025-07-05T17:56:21Z

@haydn-jones I've corrected your annotations and re-trained the two models. I'm currently looking into how to improve the <head> structure, but this task may take longer, so I will commit it to a separate branch.

haydn-jones · 2025-07-07T14:44:11Z

@lfoppiano Was on vacation so I stepped away, I'll have your files reviewed by Friday.

Edit: With respect to <head> structure, I've noticed that the full body model still frequently won't break up a heading immediately followed by a subheading (like <head>3 Method <lb>3.1 Assays <lb></head>), any idea why?

haydn-jones · 2025-07-16T19:07:12Z

@lfoppiano Sorry that took so long, I pushed corrections. I think I'm all out of energy for adding more training data, not sure about you.

In this last batch, I noticed a lot of instances of unknown characters, like If V W,i � 0 here. This also happens in a lot of my papers with characters that I really need (they are usually units for measurements, not having them makes my extractions more or less useless). Not super familiar with the code, but I expect this could be fixed by including more mappings in https://github.com/kermitt2/grobid/blob/master/grobid-home/pdfalto/languages/xpdf-others/symbols.unicodeRemapping, is that right? If so I'd like to hunt down some of these remappings and fix that. Could hijack this PR or open another.

lfoppiano · 2025-07-17T05:57:44Z

@haydn-jones Thanks! I think we're done with annotations. I will train and update the models.

Regarding the invalid characters for the equations, there are two types of problems:

- The one you pointed out, which happens in equations, cannot be solved without a visual model because the mapping of the text stream is simply incorrect. One option could be to take the coordinates of the equations and send them to a VLLM, for example, as a post-processing step (we need to experiment with this, but the amount of data might be relatively small and should not impact the processing time too much).
- The lack of symbols extracted due to invalid font mapping (see here) can be adjusted for systematic problems with certain publishers by tuning the extraction. If you could send me some examples, I can have a look to be more certain.

I suggest we discuss these problems in a separate issue so that we can close this one once the models are working (I have drafted a solution for the headers that still needs to be tested).

haydn-jones · 2025-07-18T17:12:56Z

Ok sounds good, I'll see if any of the unknown character issues are widespread across certain journals and handle that in a different issue.
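A rough way to check how widespread the problem is: count the Unicode replacement character (U+FFFD, the � seen in the example above) in the extracted text, grouped by journal. This sketch assumes the unknown characters show up as U+FFFD and that a journal label is available for each document; `count_unknown_chars` is a hypothetical helper.

```python
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"  # rendered as the "unknown character" glyph


def count_unknown_chars(documents):
    """Count U+FFFD occurrences per journal.

    documents: iterable of (journal, extracted_text) pairs.
    Journals with no replacement characters are left out of the result.
    """
    counts = Counter()
    for journal, text in documents:
        occurrences = text.count(REPLACEMENT_CHAR)
        if occurrences:
            counts[journal] += occurrences
    return counts


documents = [
    ("journal-a", "If V W,i \ufffd 0"),
    ("journal-b", "no unknown characters here"),
    ("journal-a", "\ufffd\ufffd"),
]
print(count_unknown_chars(documents).most_common())
```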

With respect to the annotations, I'm a lot happier with the output now! It's not perfect of course, but everything looks so much better on my held-out papers. An example of old vs new for synthesis / NMR information:

Would be nice if this merges at some point as I have a large batch of papers I would like to process.

lfoppiano · 2025-07-19T19:41:23Z

@haydn-jones Steady improvements in all metrics related to fulltext. I think we can merge. Will do that next week.

Copilot

Pull Request Overview

This pull request adds segmentation and fulltext annotations to the GROBID system, focusing on TEI output format with improvements to model performance across multiple evaluation datasets.

Updated benchmarking results across three major datasets (PLOS, eLife, and bioRxiv) showing general performance improvements in fulltext structure recognition
Enhanced performance metrics for reference extraction, citation context resolution, and document structure parsing
Improved evaluation time measurements and formatting consistency across benchmark reports

Reviewed Changes

Copilot reviewed 3 out of 199 changed files in this pull request and generated no comments.

File	Description
doc/Benchmarking-plos.md	Updated benchmark metrics showing improvements in reference citation extraction, figure/table reference accuracy, and fulltext structure recognition
doc/Benchmarking-elife.md	Refreshed performance metrics with enhanced citation context resolution and improved fulltext parsing results
doc/Benchmarking-biorxiv.md	Updated evaluation results demonstrating better reference extraction and document structure analysis capabilities

Add TEI

7188a85

haydn-jones mentioned this pull request Jun 8, 2025

Add segmentation + fulltext annotations. #1300

Closed

lfoppiano and others added 4 commits June 8, 2025 22:12

revised segmentation files

41d8352

revised 2 fulltext files

63bc07f

Header pass

8c32716

revised 2 more fulltexts

01a2a55

correct 3 seg files with incorrect listBibl, 1 fulltext heading split

91fbb66

lfoppiano and others added 3 commits June 10, 2025 10:03

adjust formulas

dec6580

Add raw data

8ccf2a5

Merge branch 'annotations' of github.com:haydn-jones/grobid into annotations

0b16307

haydn-jones and others added 3 commits June 12, 2025 09:45

Update Segmentaion Model

27d0be0

Fix various issues with figures (figures not split, body text as figure, etc).

146e397

correct all remaining fulltexts. Fix one old fulltext file and one segmentation from this batch

eaa95b5

haydn-jones and others added 3 commits June 14, 2025 09:02

minor ref fixes

e1f3998

type="ref" -> type="biblio"

3c3c1ad

Add trained models

00ca2e5

lfoppiano mentioned this pull request Jun 18, 2025

Extra spaces in the parsed text #1303

Closed

update evaluation

a5e1167

lfoppiano added 4 commits July 3, 2025 16:18

revise segmentation files

1f314e4

revise fulltext files

2939b6f

fix typos

bb8c894

updated models for segmentation and fulltext

9d8c8f2

lfoppiano linked an issue Jul 5, 2025 that may be closed by this pull request

Incorrectly extracted chemistry information. #1249

Closed

Corrections

53d70ff

lfoppiano and others added 2 commits July 18, 2025 12:04

new models

d6dc9b8

Merge branch 'kermitt2:master' into annotations

5b1faa4

add evaluations

c7853e9

lfoppiano requested a review from Copilot July 20, 2025 05:51

lfoppiano added this to the 0.9.0 milestone Jul 20, 2025

Copilot AI reviewed Jul 20, 2025

lfoppiano added models:fulltext models:segmentation labels Jul 20, 2025

lfoppiano merged commit 1ad5901 into grobidOrg:master Jul 28, 2025
1 of 2 checks passed

This was linked to issues Jul 28, 2025

DAS extraction issues with new page #1221

Closed

Missing last line in a page but it's due to the segmentation model #1213

Closed

Headnote missing and/or having wrong labels #1208

Closed

This was referenced Jul 28, 2025

Headnote missing and/or having wrong labels #1208

Closed

general paragraph text wrongly recognized as "figDesc/div/p" #1077

Closed

This was linked to issues Jul 28, 2025

general paragraph text wrongly recognized as "figDesc/div/p" #1077

Closed

Wrong head of sections extracted #865

Closed

Extra spaces in the parsed text #1303

Closed


4 participants
