override normalised date only when is empty by lfoppiano · Pull Request #761 · grobidOrg/grobid

lfoppiano · 2021-05-27T07:58:51Z

I've implemented a fix for #760. I leave the PR as a draft as it might require a couple of iterations...

Right now the normalised date is updated only if it's empty. We could add some additional checks from the date parser to check that the date is actually within the minimal requirement to be valid. Is there already code for doing that?

The output is:

<date type="published" when="2008-05-28">2775 2773 2773 0 2 3 2968 2947 2947</date>

@kermitt2 let me know if you have any remark.

coveralls · 2021-05-27T08:11:05Z

Coverage increased (+0.1%) to 39.213% when pulling dfe480c on bugfix/fix-date-normalisation into d80b464 on master.

kermitt2 · 2021-05-27T10:10:29Z

Right now, the only way to have the normalised date not empty (and so a date that should not be normalized) is after successful consolidation. So in this case, I think it's better to also update non-normalized date during consolidation (via the method Utilities.dateToString() for example), to have something like that:

<date type="published" when="2008-05-28">2008-05-28</date>

About the case where we have a list of normalized dates returned from a raw extracted date field, I think the problem is not that much impossible dates (I don't remember really having seen some), but rather several incomplete dates that could be combined in one. The way to study that could be to process some thousand of PDF, get all the extracted dates and see where are the problems - but it's quite a lot of time and we probably don't want to do that now with this fix?

lfoppiano · 2021-05-28T00:28:03Z

Right now, the only way to have the normalised date not empty (and so a date that should not be normalized) is after successful consolidation. So in this case, I think it's better to also update non-normalized date during consolidation (via the method Utilities.dateToString() for example), to have something like that:
[..]

That method requires a java.util.Date, while we have org.grobid.data.Date. I wonder if there isn't any method to transform the Grobid Date to its ISO format.

Update: I just found that in the TEIFormatter, there is already some code to format the date: around line 189

if (biblio.getNormalizedPublicationDate() != null) {
                Date date = biblio.getNormalizedPublicationDate();
                int year = date.getYear();
                int month = date.getMonth();
                int day = date.getDay();

                String when = "";
                if (year != -1) {
[...]

Maybe we could move this somewhere and reuse it to get out the ISO date?

About the case where we have a list of normalized dates returned from a raw extracted date field, I think the problem is not that much impossible dates (I don't remember really having seen some), but rather several incomplete dates that could be combined in one. The way to study that could be to process some thousand PDF, get all the extracted dates and see where are the problems - but it's quite a lot of time and we probably don't want to do that now with this fix?

I guess it's better to limit this PR to just the fix 😉
Since we have already the normalised date separated into the year, month and day, if they are not null (or equal to -1), we could just check that year has 4 digits, months and days has 2 digits. We could enforce the intervals for days and months when they are available but I guess that would be a bonus.

In bug #760 I got a year of 5 digits for example, in which case I have preferred to have none.

kermitt2 · 2021-05-28T01:24:48Z

That method requires a java.util.Date, while we have org.grobid.data.Date. I wonder if there isn't any method to transform the Grobid Date to its ISO format.

Ah yes, I've forgotten the two Date objects in Grobid! The Grobid normalized Date is serialized into ISO format in the TEI. However, I think it would be easier to apply well-formedness criteria to the Grobid object rather than considering java.util.Date, because days and months can often be unspecified (and it does not mean it's first day of the month or first month of the year).

The DateParser.normalize() is the place where date normalization takes place - when I wrote it I didn't apply well-formedness criteria at this stage. I think my idea was to leave further constrains to the applications using Grobid, but for years with 5 digits or far in the future, it's for sure good to reject it in GROBID directly.

…the consolidation, parse the date only when there is no normalised date

… validation part

lfoppiano · 2021-05-31T03:02:21Z

WIth the last commit, I've improved a bit the efficiency (avoiding parsing the date if there is already a normalised date) and we obtain the ISO version:

<date type="published" when="2008-05-28">2008-05-28</date>

For that, I created a new static method in the TEIFormatter.java, and I modified the formatter to use it.

I've added a few code improvement to avoid NPE in the various IFs and I've removed a hardcoded date (year 2025) in the DateParser.normalise() method.

Update: in the last commit (7df7f9f) I've overhauled the normalise() method to:

return a date rather than modify the passed object
ignore information that is not passing the postValidation() which is the simple test on the number of digits

The output now look like the following (the when is absent because the normalised date is not valid):

          <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
                <date>10360 10370 10380 10390 10400</date>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date type="published">10360 10370 10380 10390 10400</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">3C20A67187D8EAB4C3B08E9E158A908D</idno>

@kermitt2, does it make sense this behaviour?

lfoppiano · 2021-06-07T06:59:24Z

I think this PR can be reviewed

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java

de-code · 2021-06-07T16:33:18Z

Many parts seem to read better. I don't feel fully qualified to review what the fix is actually about, as I haven't really used the consolidation. (It does seem to make sense in general)

lfoppiano · 2021-06-09T07:06:48Z

As discussed, I've migrated the DateParser to the Clusteror classes.
Few comments:

I've renamed the various method using process() instead of processing. It seems more correct to me, howver we can revert it back. I left the original one with the deprecation note, we can revert.
I've kept the List<String> because I wasn't feeling too confident to replace it with the List<LayoutToken>.
The unit tests are all working except one (see comment).

kermitt2 · 2021-07-18T14:19:02Z

@lfoppiano Hi Luca! Would it be possible on your side to update the test class for new config to have the branch merging without conflicts? Thanks !

lfoppiano · 2021-07-19T01:31:43Z

I updated the branch. The ParseDate test is fixed.

There are two tests in the org.grobid.core.engines.SegmentationTest which are failing for me on the macOs..
The change at rev a012f6b seems to be causing the problem for me, but I guess it solved it for you on Linux.

I wonder whether we have the same pdfalto version in both platforms...

BTW the ci-build seems dead...

kermitt2

When running test with latest PR version, I have 16 fails all due to this:

  java.lang.ExceptionInInitializerError
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
  Caused by: java.lang.NullPointerException
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java

lfoppiano · 2021-07-25T23:31:08Z

When running test with latest PR version, I have 16 fails all due to this:

  java.lang.ExceptionInInitializerError
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
  Caused by: java.lang.NullPointerException
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)

Sorry, there was a modification that I forgot to commit. It's pushed now.

override normalised data only when is empty

9e4cff0

lfoppiano requested a review from kermitt2 May 27, 2021 07:58

lfoppiano linked an issue May 27, 2021 that may be closed by this pull request

Publication date potentially corrected from the consolidation service seems to be lost in a limbo #760

Closed

Luca Foppiano added 2 commits May 28, 2021 08:13

remove re-formatting

ea4bdaf

re-fix formatting

e064ee3

lfoppiano mentioned this pull request May 31, 2021

Regression in the segmentation model results due to changes in pdfalto #759

Closed

Luca Foppiano added 4 commits May 31, 2021 11:11

override the publication date when the normalised date is taken from …

e94aa95

…the consolidation, parse the date only when there is no normalised date

reduce duplicated code in the TEI formatter

1776add

minor code improvements including removing hardcoded year 2025 in the…

e579f83

… validation part

remove hardcoded max date

a5b900c

Luca Foppiano added 7 commits May 31, 2021 15:39

postValidate parsed Date

7df7f9f

fix bug spotted thanks to our tests

07ddfd3

adding more unit tests

ef92564

add boolean to enable/disable validation after normalising a date

ec09f48

forgot to update the date parser

0c698e1

update pdfalto

762d40e

minor improvements

dfe480c

lfoppiano marked this pull request as ready for review June 7, 2021 06:58

lfoppiano requested a review from de-code June 7, 2021 06:59

lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Jun 7, 2021

de-code reviewed Jun 7, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java Outdated Show resolved Hide resolved

de-code reviewed Jun 7, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java Outdated Show resolved Hide resolved

de-code reviewed Jun 7, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java Show resolved Hide resolved

de-code reviewed Jun 7, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java Outdated Show resolved Hide resolved

Luca Foppiano added 3 commits June 8, 2021 10:55

to remove the queen from the code :-)

bc84070

use better naming

9dffe67

migrate DateParser to the Clusteror

7668ffa

kermitt2 changed the title ~~override normalised data only when is empty~~ override normalised date only when is empty Jul 18, 2021

kermitt2 added this to the 0.7.1 milestone Jul 18, 2021

Luca Foppiano added 2 commits July 19, 2021 10:15

Merge branch 'master' into bugfix/fix-date-normalisation

031e3eb

fix tests

bf5d666

kermitt2 reviewed Jul 23, 2021

View reviewed changes

grobid-core/src/main/java/org/grobid/core/engines/DateParser.java Show resolved Hide resolved

forgotten fixes

4fcbf67

kermitt2 merged commit b6d7281 into master Jul 27, 2021

kermitt2 mentioned this pull request Aug 2, 2021

Review date handling with header consolidation and date serialization in references #807

Merged

lfoppiano deleted the bugfix/fix-date-normalisation branch August 4, 2021 06:10

btut mentioned this pull request Aug 12, 2021

Revise BibTeX date output #814

Closed

Conversation

lfoppiano commented May 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coveralls commented May 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kermitt2 commented May 27, 2021

Uh oh!

lfoppiano commented May 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kermitt2 commented May 28, 2021

Uh oh!

lfoppiano commented May 31, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lfoppiano commented Jun 7, 2021

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

de-code commented Jun 7, 2021

Uh oh!

lfoppiano commented Jun 9, 2021

Uh oh!

kermitt2 commented Jul 18, 2021

Uh oh!

lfoppiano commented Jul 19, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kermitt2 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

lfoppiano commented Jul 25, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lfoppiano commented May 27, 2021 •

edited

Loading

coveralls commented May 27, 2021 •

edited

Loading

lfoppiano commented May 28, 2021 •

edited

Loading

lfoppiano commented May 31, 2021 •

edited

Loading

lfoppiano commented Jul 19, 2021 •

edited

Loading