Fix affiliation missing when using DL affiliation-address model #1166

Merged
lfoppiano merged 4 commits into master from fix-affiliation-dl on Sep 18, 2024
@lfoppiano
Member

@lfoppiano lfoppiano commented Sep 17, 2024

This PR proposes a fix for affiliations that are lost when they are processed with a DL model.

The issue seems to be in the method getAffiliationBlocksFromSegments(), where new \n are added (in general they should be added only when there is a misalignment, but one is always added at the beginning).

https://github.com/kermitt2/grobid/blob/a95d2533f1019e900b49ea5c39a5afe355dbb4a3/grobid-core/src/main/java/org/grobid/core/engines/AffiliationAddressParser.java#L81

I patched it quickly by checking that end is not zero. However, this \n does not work well with the DL models, unlike the CRF models, which ignore it.
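To make the guard concrete, here is a minimal, self-contained sketch of the idea. It is not the actual GROBID code: the `Token` class, the `toBlocks` method and the `maxGap` parameter are hypothetical names introduced only for illustration. It inserts a "\n" between segments whose offsets are not contiguous, and the `end > 0` check keeps that separator from being emitted before the very first segment.

```java
import java.util.ArrayList;
import java.util.List;

class AffiliationBlockSketch {

    // Hypothetical stand-in for a layout token: its text plus its character offset.
    static final class Token {
        final String text;
        final int offset;

        Token(String text, int offset) {
            this.text = text;
            this.offset = offset;
        }
    }

    // Flattens token segments into one list of strings, adding "\n" between
    // segments whose offsets are further apart than maxGap (an assumed threshold).
    static List<String> toBlocks(List<List<Token>> segments, int maxGap) {
        List<String> blocks = new ArrayList<>();
        int end = 0; // offset just past the last token seen; 0 means nothing seen yet
        for (List<Token> segment : segments) {
            if (segment.isEmpty()) {
                continue;
            }
            int start = segment.get(0).offset;
            // Without the end > 0 check, the first segment would always get a leading "\n".
            if (end > 0 && start - end > maxGap) {
                blocks.add("\n");
            }
            for (Token token : segment) {
                blocks.add(token.text);
                end = token.offset + token.text.length();
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<List<Token>> segments = List.of(
            List.of(new Token("MIT", 120), new Token("USA", 124)),
            List.of(new Token("CNRS", 300)));
        // One separator between the two segments, none at the start.
        System.out.println(toBlocks(segments, 2).size());
    }
}
```

With the two segments in `main`, the separator appears only between "USA" and "CNRS", because the first segment is reached while `end` is still zero.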

I've left two tests which are showing the problem from both CRF and DL: https://github.com/kermitt2/grobid/blob/bd93a61f4542f218299e2c34a82c37b75bc727ef/grobid-core/src/test/java/org/grobid/core/engines/AffiliationAddressParserTest.java#L262

The DL test is still failing, as I'm not really sure where to fix the issue.

After this is fixed, we will need to rebuild the grobid-full image.

@lfoppiano
Member Author

After a few iterations, I think I understand the principle, which is to separate blocks of affiliations that have different offset differences. My fix just avoids adding \n at the beginning. The \n helps to separate the blocks and, with the DL models, to process the blocks in parallel, among other things.
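As an illustration of why a leading separator can matter for block-wise processing, here is a deliberately naive sketch. It is an assumption made for illustration and not how GROBID or its DL backend actually splits the input; `splitBlocks` is a hypothetical helper. It cuts a flat token list at each "\n" so that each block could be handled independently.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSplitSketch {

    // Splits a flat token list into blocks at each "\n" separator.
    // Naive on purpose: empty blocks are kept, so a leading separator stays visible.
    static List<List<String>> splitBlocks(List<String> tokens) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if ("\n".equals(token)) {
                blocks.add(current);
                current = new ArrayList<>();
            } else {
                current.add(token);
            }
        }
        blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        // Separator only between the affiliations: two blocks.
        System.out.println(splitBlocks(List.of("MIT", "USA", "\n", "CNRS")).size());
        // Separator also at the start: three blocks, the first one empty.
        System.out.println(splitBlocks(List.of("\n", "MIT", "USA", "\n", "CNRS")).size());
    }
}
```

In this sketch a leading "\n" produces an empty first block. Whether something similar is what disturbs the DL models is not established here; it only shows one way in which a consumer that splits on the separator can be more sensitive to it than one that ignores it.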

@lfoppiano
Member Author

@kermitt2 I tried to fix this in a bit of a rush, at least to mitigate the issue in the docker image. I'm sorry, I might need a quick review from you.

I've pushed this fix to the branch 0.8.1-fixes (branched from the tag 0.8.1) and pushed an updated docker image, lfoppiano/grobid:0.8.1-full, which should at least mitigate this issue. It's deployed here.

@lfoppiano lfoppiano requested a review from kermitt2 September 17, 2024 15:28
@lfoppiano lfoppiano changed the title from "Fix affiliation missing for DL models" to "Fix affiliation missing when using DL affiliation-address model" Sep 17, 2024
@kermitt2
Collaborator

Hi @lfoppiano, the fix works fine, no problem. It is surprising that the starting "\n" has such an effect on the DL processing. There's nothing else to change; the segmentation then proceeds normally, including parallel processing. I changed this part last December and it seems I only tested with the CRF model :)
Unfortunately, the end-to-end benchmarks do not cover affiliations. The docker image and the Hugging Face demo are also updated for the grobid account.

@lfoppiano lfoppiano merged commit f501033 into master Sep 18, 2024
@lfoppiano
Member Author

Thanks!

@lfoppiano lfoppiano deleted the fix-affiliation-dl branch September 18, 2024 18:33