fixed fulltext block start by de-code · Pull Request #714 · grobidOrg/grobid

de-code · 2021-02-15T18:40:58Z

resolves #712
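
For readers without #712 to hand: the issue is that the BLOCKSTART layout feature goes missing when the very first token of a block is a line feed. The snippet below is only a minimal, self-contained sketch of the intended behaviour, not the actual GROBID code. The class `BlockStatusSketch`, the helpers `isLayoutOnly` and `blockStatuses`, and the use of plain strings as tokens are all assumptions made for illustration. It also assumes a block with a single visible token is reported as BLOCKSTART.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of GROBID.
public class BlockStatusSketch {

    /** True for tokens that carry no visible text (space, tab, line feed). */
    static boolean isLayoutOnly(String token) {
        return token.trim().isEmpty();
    }

    /**
     * Returns one block status per visible token of a single block.
     * The first visible token gets BLOCKSTART even when the raw token
     * list of the block begins with a line feed or other whitespace.
     */
    static List<String> blockStatuses(List<String> blockTokens) {
        // Find the last visible token so that BLOCKEND can be assigned.
        int lastVisible = -1;
        for (int i = 0; i < blockTokens.size(); i++) {
            if (!isLayoutOnly(blockTokens.get(i))) {
                lastVisible = i;
            }
        }

        List<String> statuses = new ArrayList<>();
        // Track whether BLOCKSTART was emitted instead of testing i == 0,
        // because index 0 may hold a line feed that produces no feature row.
        boolean startEmitted = false;
        for (int i = 0; i < blockTokens.size(); i++) {
            if (isLayoutOnly(blockTokens.get(i))) {
                continue; // whitespace tokens produce no feature row
            }
            if (!startEmitted) {
                statuses.add("BLOCKSTART");
                startEmitted = true;
            } else if (i == lastVisible) {
                statuses.add("BLOCKEND");
            } else {
                statuses.add("BLOCKIN");
            }
        }
        return statuses;
    }

    public static void main(String[] args) {
        // The block starts with a line feed, as in the case reported in #712.
        List<String> block = Arrays.asList("\n", "Hello", " ", "world", " ", "again");
        System.out.println(blockStatuses(block)); // prints [BLOCKSTART, BLOCKIN, BLOCKEND]
    }
}
```

The point of the sketch is that the start flag is tied to the first token that actually produces a feature row, rather than to the position of the first raw token.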

de-code · 2021-02-15T18:42:38Z

I am currently running into the issue described in #712 (comment)
I guess the way I am creating the test data isn't quite correct (using createFromText and getDocumentPartsForLayoutTokens, the latter extracted from processShortNew).

coveralls · 2021-02-15T18:57:26Z

Coverage increased (+0.03%) to 38.19% when pulling 6911ef2 on elifesciences:fix-fulltext-block-start into bfc10f7 on kermitt2:master.

de-code · 2021-02-15T19:55:49Z

I am currently running into the issue described in #712 (comment)
I guess the way I am creating the test data isn't quite correct (using createFromText and getDocumentPartsForLayoutTokens, the latter extracted from processShortNew).

I got around the problem by manually creating DocumentPiece for the whole document. I therefore reverted getDocumentPartsForLayoutTokens as it wouldn't be used by the test and could be refactored separately.
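
As a rough illustration of what "a piece covering the whole document" means, here is a self-contained sketch using stand-in types. `Pointer`, `Piece` and `wholeDocumentPiece` are hypothetical names for this example only and are not GROBID's `DocumentPointer` / `DocumentPiece` API. The document is assumed to be a non-empty list of blocks, each a list of tokens.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types, not the GROBID classes.
public class WholeDocumentPieceSketch {

    /** A position in the document: block index plus token index within that block. */
    static final class Pointer {
        final int blockIndex;
        final int tokenIndex;

        Pointer(int blockIndex, int tokenIndex) {
            this.blockIndex = blockIndex;
            this.tokenIndex = tokenIndex;
        }
    }

    /** An inclusive range between two positions. */
    static final class Piece {
        final Pointer start;
        final Pointer end;

        Piece(Pointer start, Pointer end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Builds a piece from the first token of the first block to the last token of the last block. */
    static Piece wholeDocumentPiece(List<List<String>> blocks) {
        if (blocks.isEmpty() || blocks.get(blocks.size() - 1).isEmpty()) {
            throw new IllegalArgumentException("expected at least one block and a non-empty last block");
        }
        int lastBlock = blocks.size() - 1;
        int lastToken = blocks.get(lastBlock).size() - 1;
        return new Piece(new Pointer(0, 0), new Pointer(lastBlock, lastToken));
    }

    public static void main(String[] args) {
        List<List<String>> blocks = Arrays.asList(
            Arrays.asList("\n", "Title"),
            Arrays.asList("Body", " ", "text"));
        Piece piece = wholeDocumentPiece(blocks);
        System.out.println(piece.end.blockIndex + ":" + piece.end.tokenIndex); // prints 1:2
    }
}
```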

de-code · 2021-02-15T20:01:38Z

grobid-core/src/test/java/org/grobid/core/engines/FullTextParserTest.java

 package org.grobid.core.engines;

+import org.apache.commons.lang3.tuple.Pair;
 import org.grobid.core.analyzers.GrobidAnalyzer;


There are quite a few unused imports. It might be good to tidy them up (and remove the commented-out code, perhaps marking the tests as ignored instead).

de-code · 2021-02-15T20:02:25Z

grobid-core/src/test/java/org/grobid/core/engines/FullTextParserTest.java

        GrobidFactory.reset();
    }

+    public DocumentPiece getWholeDocumentPiece(Document doc) {


Maybe getWholeDocumentPiece and getWholeDocumentParts could be moved to a central place, if one doesn't exist already.

de-code · 2021-02-15T20:24:36Z

Other parsers are potentially affected as well.

There appears to be a lot of duplication, which could probably be refactored.

de-code · 2021-02-16T17:52:39Z

Raised suggestion for refactoring the features: #718

lfoppiano

I'm having trouble understanding the whole picture here, so I'm not sure I can really review this part.

de-code · 2021-02-19T00:14:26Z

I'm having trouble understanding the whole picture here, so I'm not sure I can really review this part.

There is an example in issue #712. Happy to add more information if that would help.

kermitt2 · 2021-06-10T13:24:56Z

I have regenerated the fulltext model feature files with the fix, but there is actually no difference from before with respect to the block start feature. I also retrained the model (before checking the difference in features), in branch https://github.com/kermitt2/grobid/tree/fix-712
So apparently the problem in #712 never occurs in practice, at least in the training data.

It might appear in some new files, as in your example, but the benchmark on the PMC set only changes at the second decimal place, so the effect should be minor.

So we could simply merge the fix, but there is no need to update the training files and model for this.

de-code added 2 commits February 15, 2021 14:35

extracted getDocumentPartsForLayoutTokens

c8a7944

added testShouldOutputBlockStartForRegularBlock

398eb44

de-code requested review from kermitt2 and lfoppiano February 15, 2021 18:40

de-code self-assigned this Feb 15, 2021

de-code mentioned this pull request Feb 15, 2021

Full text model layout features: BLOCKSTART missing, if very first block token is a new line #712

Closed

de-code added 2 commits February 15, 2021 19:51

added testShouldOutputBlockStartForBlockStartingWithLineFeed

3c22497

Revert "extracted getDocumentPartsForLayoutTokens"

fe6e454

This reverts commit c8a7944.

fixed BLOCKSTART with block starting with lf

d79baaa

de-code marked this pull request as ready for review February 15, 2021 20:00

de-code commented Feb 15, 2021

added check, that the first block token is lf

6911ef2

lfoppiano reviewed Feb 18, 2021

de-code mentioned this pull request Feb 22, 2021

fixed fulltext BLOCKSTART with first block token being whitespace elifesciences/grobid#33

Merged

kermitt2 added this to the 0.7.0 milestone Mar 20, 2021

kermitt2 added the bug label Apr 18, 2021

kermitt2 modified the milestones: 0.7.0, 0.7.1 Jul 9, 2021

kermitt2 modified the milestones: 0.7.1, 0.7.2 Sep 25, 2022

kermitt2 modified the milestones: 0.7.2, 0.7.3 Oct 28, 2022

kermitt2 modified the milestones: 0.7.3, 0.8.0 May 6, 2023

lfoppiano removed this from the 0.8.0 milestone Jun 9, 2024

lfoppiano mentioned this pull request Nov 21, 2024

Fix fulltext block start #1203

Merged

lfoppiano closed this Nov 21, 2024

de-code wants to merge 6 commits into grobidOrg:master from elifesciences:fix-fulltext-block-start

4 participants