Fix some segmentation training data. #1254

Closed
haydn-jones wants to merge 7 commits into grobidOrg:master from haydn-jones:master
Conversation

@haydn-jones

Contributor

I found some training data that incorrectly assigns listBibl to body elements, which might be contributing to the issues I'm experiencing.

@lfoppiano

lfoppiano commented Feb 25, 2025

Member

Hi @haydn-jones, thanks for the PR.

These training data are for the lightweight model, article-light. The references are indeed supposed to be considered <body> there, but that is done in the XML parser 😅, so there is actually no need to fix this training data.

See https://github.com/kermitt2/grobid/blob/8b9d113d665bc1bd64c3c38e4ca93a19a7426cb9/grobid-trainer/src/main/java/org/grobid/trainer/sax/TEISegmentationArticleLightSaxParser.java#L116
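To illustrate the idea of remapping at parse time, here is a minimal Python sketch. It is not the GROBID Java parser linked above: the `LIGHT_LABELS` table, the `LightLabelHandler` class, and the `label_tokens` helper are all hypothetical names, and the element-to-label mapping is an assumption made for the example. It only shows how a SAX handler can emit `<body>` for text inside `listBibl` without the training files themselves being changed.

```python
import xml.sax

# Hypothetical element-to-label table for a "light" segmentation model.
# Assumption for illustration: reference lists are folded into <body>.
LIGHT_LABELS = {
    "front": "<header>",
    "body": "<body>",
    "listBibl": "<body>",
}


class LightLabelHandler(xml.sax.ContentHandler):
    """Collects (token, label) pairs, remapping elements while parsing."""

    def __init__(self):
        super().__init__()
        self.stack = []      # label in effect for each open element
        self.buffer = []     # text chunks seen since the last element boundary
        self.labelled = []   # output: list of (token, label)

    def _flush(self):
        # Emit buffered text with the label of the innermost open element.
        text = "".join(self.buffer)
        self.buffer.clear()
        label = self.stack[-1] if self.stack else None
        if label is not None:
            for token in text.split():
                self.labelled.append((token, label))

    def startElement(self, name, attrs):
        self._flush()
        inherited = self.stack[-1] if self.stack else None
        # Unknown elements inherit the label of their parent.
        self.stack.append(LIGHT_LABELS.get(name, inherited))

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        self._flush()
        self.stack.pop()


def label_tokens(tei_xml):
    """Parse a TEI-like string and return its (token, label) pairs."""
    handler = LightLabelHandler()
    xml.sax.parseString(tei_xml.encode("utf-8"), handler)
    return handler.labelled
```

With this sketch, `label_tokens("<text><body>Intro</body><listBibl>Ref</listBibl></text>")` labels both `Intro` and `Ref` as `<body>`, even though the annotation keeps `listBibl` as a separate element.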

@haydn-jones

Contributor Author

Ah I see. Half of the files are in the standard segmentation training set though, right?

@lfoppiano

Member

Yes, I prefer to annotate them with the full segmentation approach, especially where headnotes and footnotes are concerned, because we might change the lightweight model in the future. I suggest you focus only on the data in grobid-trainer/resources/dataset/segmentation/corpus

@haydn-jones

Contributor Author

@lfoppiano Sounds good, I reverted the changes for the lightweight models.

@haydn-jones haydn-jones closed this by deleting the head repository May 17, 2025
2 participants