Add segmentation + fulltext annotations. #1301
|
@lfoppiano Did a header pass, also found this header which is not split (not one of the files I added). Should I fix it? Similarly, there are some instances I found in PR #1254 of |
Yes, please 😄
I'm not sure I understood. Generally, if you find any fix for the training data, including in previous files, you can add it to this PR; that's fine with me. |
|
@haydn-jones could you please also add the corresponding files .fulltext and .segmentation in the respective |
|
@lfoppiano Done. I'd be happy to train the segmentation model tonight if that would be helpful. |
|
Yes please, go ahead. Validating the fulltext training data will take me some more days. |
|
@haydn-jones I've completed the validation of the fulltext files. I've also started retraining the models (there was a small correction in the segmentation). I don't know how much capacity you have for correcting more data; I could add a few more articles to the mix. Let me know. |
|
@lfoppiano Correcting a few more would be fine. I know there are a few issues, like #1279 and #1270, that might be fixed with a few added papers. I can find relevant CC-0/BY papers/drug labels for those two, do the corrections, then push for you to review if you'd like. If you have some other papers in mind, I'd be happy to look at those too. |
|
Sure, however I would wait until we have the updated models; then we can see which issues are still occurring and add more papers. |
|
@haydn-jones I have pushed the new re-trained models. Could you try to see whether there are some improvements, and maybe if you could quantify them with your data? I will try to run some benchmarks, in the meantime. We should double check that the |
|
@lfoppiano In general, everything looks much, much better, though there are still some minor issues like headings not being split up (i.e. Would quantification involve creating a held-out set and running a benchmark on it? I see e2e eval discussed here, but I'm not 100% sure how to do it: https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/. |
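One rough way to quantify the heading issue on a held-out set, separate from the end-to-end evaluation linked above, is to compare extracted headings against manually corrected ones. This is a minimal sketch under stated assumptions: the function name `heading_prf`, the exact-match-after-normalization rule, and the sample headings are all hypothetical, and this is not how the GROBID benchmarks compute their scores.

```python
from collections import Counter


def heading_prf(gold, predicted):
    """Precision, recall and F1 over headings, matched as normalized strings.

    Assumption: two headings match when they are equal after lowercasing
    and collapsing whitespace. A merged heading therefore counts as a miss.
    """
    def norm(text):
        return " ".join(text.lower().split())

    gold_counts = Counter(norm(h) for h in gold)
    pred_counts = Counter(norm(h) for h in predicted)
    # Multiset intersection: each gold heading can be matched at most once.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / sum(pred_counts.values()) if pred_counts else 0.0
    recall = matched / sum(gold_counts.values()) if gold_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical held-out document: the second prediction merges two headings.
gold = ["1. Introduction", "2. Methods", "2.1 Data", "3. Results"]
predicted = ["1. introduction", "2. Methods 2.1 Data", "3. Results"]
print(heading_prf(gold, predicted))
```

Running the same comparison with the old and the new models on papers that were not used for training would give a before/after number for this specific problem.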
|
I'm dividing the answer in two. @haydn-jones did you test on papers other than the ones used for training? For the headers, I need to revise the logic that reconstructs them while still respecting the TEI. Currently, Grobid allows a set of divs with a single : When there is a head and a sub-head, it was identified using an empty set of paragraphs. The structure has been kept simple, and we should avoid changing the main structure so as not to break backward compatibility. However, the clean solution would be to have nested I wonder if there is a way that is flexible enough and TEI compliant, like, for example: @laurentromary do you have any suggestions/comments? @kermitt2 already gave some info about past attempts here. |
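To make the nested alternative concrete, here is a minimal sketch that turns a flat list of sections into nested divs. It assumes each head already carries a level, which the flat labelled sequence does not provide. The function name `nest_divs`, the `(level, title, paragraphs)` tuples, and the sample titles are hypothetical; the elements use plain names without the TEI namespace, and this is not Grobid's implementation.

```python
import xml.etree.ElementTree as ET


def nest_divs(sections):
    """Build nested <div> elements from a flat, ordered list of sections.

    sections: iterable of (level, title, paragraphs), with level >= 1,
    where level 1 is a top-level section (assumed to be known already).
    """
    body = ET.Element("body")
    stack = [(0, body)]  # (level, element) for each currently open container
    for level, title, paragraphs in sections:
        # Close every open div at the same level or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        div = ET.SubElement(stack[-1][1], "div")
        ET.SubElement(div, "head").text = title
        for text in paragraphs:
            ET.SubElement(div, "p").text = text
        stack.append((level, div))
    return body


# Hypothetical input: a head directly followed by a sub-head.
flat = [
    (1, "Methods", []),
    (2, "Data", ["We used ..."]),
    (1, "Conclusions", ["..."]),
]
print(ET.tostring(nest_divs(flat), encoding="unicode"))
```

With nesting, "Data" ends up inside the "Methods" div instead of "Methods" being an empty div followed by a sibling. Consumers that expect the current flat list of divs would break, which is the backward-compatibility concern raised above.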
|
@haydn-jones Regarding the evaluation, yes, I'm talking about the link to the benchmarks that you posted. I've had some problems with my automated Docker image, so I had to re-run them in a separate environment, and that took me a few days. In general, the fulltext metrics have improved compared to before (particularly for the PLOS benchmark). |
@laurentromary it's fairly complicated to reconstruct the hierarchy from a flat list (the labelled sequence) without an overview of the visual characteristics. One option is to cluster the different heads by font and assume that heads in the same cluster are at the same level. I believe @kermitt2 already tried that without promising results, because the fonts are not always reliable. Currently a sequence with a double The result in TEI is now But it should be otherwise anyone who tries to use the hierarchy will get conclusions in |
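For illustration only, the font-clustering option could look like the sketch below. It assumes a font size is available for every head and that larger means higher in the hierarchy, which is exactly the assumption described above as unreliable. The function name `levels_from_font_sizes`, the `tolerance` parameter, and the sample sizes are hypothetical.

```python
def levels_from_font_sizes(heads, tolerance=0.5):
    """Assign a level to each head by grouping similar font sizes.

    heads: list of (title, font_size) in document order.
    Sizes within `tolerance` of the largest size in a group share a level;
    the largest group is level 1. Returns a list of (level, title).
    """
    sizes = sorted({size for _, size in heads}, reverse=True)
    level_of = {}
    level, anchor = 0, None
    for size in sizes:
        # Start a new level when the size drops too far below the group anchor.
        if anchor is None or anchor - size > tolerance:
            level += 1
            anchor = size
        level_of[size] = level
    return [(level_of[size], title) for title, size in heads]


# Hypothetical heads: "Conclusions" is rendered slightly smaller than "Methods".
heads = [("Methods", 14.0), ("Data", 12.0), ("Conclusions", 13.8)]
print(levels_from_font_sizes(heads))
```

The sketch breaks down as soon as a publisher distinguishes levels by weight, italics, or numbering rather than size, which is consistent with the earlier attempts not giving promising results.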
|
@haydn-jones I've corrected your annotations and re-trained the two models. I'm currently looking into how to improve the |
|
@lfoppiano Was on vacation so I stepped away, I'll have your files reviewed by Friday. Edit: With respect to |
|
@lfoppiano Sorry that took so long, I pushed corrections. I think I'm all out of energy for adding more training data, not sure about you. In this last batch, I noticed a lot of instances of unknown characters, like |
|
@haydn-jones Thanks! I think we're done with annotations. I will train and update the models. Regarding the invalid characters for the equations, there are two types of problems:
I suggest we discuss these problems in a separate issue so that we can close this one once the models are working (I have drafted a solution for the headers that still needs to be tested). |
|
@haydn-jones Steady improvements in all metrics related to fulltext. I think we can merge. Will do that next week. |
Pull Request Overview
This pull request adds segmentation and fulltext annotations to the GROBID system, focusing on TEI output format with improvements to model performance across multiple evaluation datasets.
- Updated benchmarking results across three major datasets (PLOS, eLife, and bioRxiv) showing general performance improvements in fulltext structure recognition
- Enhanced performance metrics for reference extraction, citation context resolution, and document structure parsing
- Improved evaluation time measurements and formatting consistency across benchmark reports
Reviewed Changes
Copilot reviewed 3 out of 199 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| doc/Benchmarking-plos.md | Updated benchmark metrics showing improvements in reference citation extraction, figure/table reference accuracy, and fulltext structure recognition |
| doc/Benchmarking-elife.md | Refreshed performance metrics with enhanced citation context resolution and improved fulltext parsing results |
| doc/Benchmarking-biorxiv.md | Updated evaluation results demonstrating better reference extraction and document structure analysis capabilities |


Discussed in #1249
This contains only the TEI right now, will add raw info / updated models later on.