Skip to content

override normalised date only when is empty#761

Merged
kermitt2 merged 20 commits intomasterfrom
bugfix/fix-date-normalisation
Jul 27, 2021
Merged

override normalised date only when is empty#761
kermitt2 merged 20 commits intomasterfrom
bugfix/fix-date-normalisation

Conversation

@lfoppiano
Copy link
Copy Markdown
Member

@lfoppiano lfoppiano commented May 27, 2021

I've implemented a fix for #760. I leave the PR as a draft as it might require a couple of iterations...

Right now the normalised date is updated only if it's empty. We could add some additional checks from the date parser to check that the date is actually within the minimal requirement to be valid. Is there already code for doing that?

The output is:

<date type="published" when="2008-05-28">2775 2773 2773 0 2 3 2968 2947 2947</date>

@kermitt2 let me know if you have any remark.

@coveralls
Copy link
Copy Markdown

coveralls commented May 27, 2021

Coverage Status

Coverage increased (+0.1%) to 39.213% when pulling dfe480c on bugfix/fix-date-normalisation into d80b464 on master.

@kermitt2
Copy link
Copy Markdown
Collaborator

Right now, the only way to have the normalised date not empty (and so a date that should not be normalized) is after successful consolidation. So in this case, I think it's better to also update non-normalized date during consolidation (via the method Utilities.dateToString() for example), to have something like that:

<date type="published" when="2008-05-28">2008-05-28</date>

About the case where we have a list of normalized dates returned from a raw extracted date field, I think the problem is not that much impossible dates (I don't remember really having seen some), but rather several incomplete dates that could be combined in one. The way to study that could be to process some thousand of PDF, get all the extracted dates and see where are the problems - but it's quite a lot of time and we probably don't want to do that now with this fix?

@lfoppiano
Copy link
Copy Markdown
Member Author

lfoppiano commented May 28, 2021

Right now, the only way to have the normalised date not empty (and so a date that should not be normalized) is after successful consolidation. So in this case, I think it's better to also update non-normalized date during consolidation (via the method Utilities.dateToString() for example), to have something like that:
[..]

That method requires a java.util.Date, while we have org.grobid.data.Date. I wonder if there isn't any method to transform the Grobid Date to its ISO format.

Update: I just found that in the TEIFormatter, there is already some code to format the date: around line 189

if (biblio.getNormalizedPublicationDate() != null) {
                Date date = biblio.getNormalizedPublicationDate();
                int year = date.getYear();
                int month = date.getMonth();
                int day = date.getDay();

                String when = "";
                if (year != -1) {
[...]

Maybe we could move this somewhere and reuse it to get out the ISO date?

About the case where we have a list of normalized dates returned from a raw extracted date field, I think the problem is not that much impossible dates (I don't remember really having seen some), but rather several incomplete dates that could be combined in one. The way to study that could be to process some thousand PDF, get all the extracted dates and see where are the problems - but it's quite a lot of time and we probably don't want to do that now with this fix?

I guess it's better to limit this PR to just the fix 😉
Since we have already the normalised date separated into the year, month and day, if they are not null (or equal to -1), we could just check that year has 4 digits, months and days has 2 digits. We could enforce the intervals for days and months when they are available but I guess that would be a bonus.

In bug #760 I got a year of 5 digits for example, in which case I have preferred to have none.

@kermitt2
Copy link
Copy Markdown
Collaborator

That method requires a java.util.Date, while we have org.grobid.data.Date. I wonder if there isn't any method to transform the Grobid Date to its ISO format.

Ah yes, I've forgotten the two Date objects in Grobid! The Grobid normalized Date is serialized into ISO format in the TEI. However, I think it would be easier to apply well-formedness criteria to the Grobid object rather than considering java.util.Date, because days and months can often be unspecified (and it does not mean it's first day of the month or first month of the year).

The DateParser.normalize() is the place where date normalization takes place - when I wrote it I didn't apply well-formedness criteria at this stage. I think my idea was to leave further constrains to the applications using Grobid, but for years with 5 digits or far in the future, it's for sure good to reject it in GROBID directly.

@lfoppiano
Copy link
Copy Markdown
Member Author

lfoppiano commented May 31, 2021

WIth the last commit, I've improved a bit the efficiency (avoiding parsing the date if there is already a normalised date) and we obtain the ISO version:

<date type="published" when="2008-05-28">2008-05-28</date>

For that, I created a new static method in the TEIFormatter.java, and I modified the formatter to use it.

I've added a few code improvement to avoid NPE in the various IFs and I've removed a hardcoded date (year 2025) in the DateParser.normalise() method.

Update: in the last commit (7df7f9f) I've overhauled the normalise() method to:

  • return a date rather than modify the passed object
  • ignore information that is not passing the postValidation() which is the simple test on the number of digits

The output now look like the following (the when is absent because the normalised date is not valid):

          <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
                <date>10360 10370 10380 10390 10400</date>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic></analytic>
                    <monogr>
                        <imprint>
                            <date type="published">10360 10370 10380 10390 10400</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">3C20A67187D8EAB4C3B08E9E158A908D</idno>

@kermitt2, does it make sense this behaviour?

@lfoppiano lfoppiano marked this pull request as ready for review June 7, 2021 06:58
@lfoppiano
Copy link
Copy Markdown
Member Author

I think this PR can be reviewed

@lfoppiano lfoppiano requested a review from de-code June 7, 2021 06:59
@lfoppiano lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Jun 7, 2021
@de-code
Copy link
Copy Markdown
Collaborator

de-code commented Jun 7, 2021

Many parts seem to read better. I don't feel fully qualified to review what the fix is actually about, as I haven't really used the consolidation. (It does seem to make sense in general)

@lfoppiano
Copy link
Copy Markdown
Member Author

As discussed, I've migrated the DateParser to the Clusteror classes.
Few comments:

  • I've renamed the various method using process() instead of processing. It seems more correct to me, howver we can revert it back. I left the original one with the deprecation note, we can revert.
  • I've kept the List<String> because I wasn't feeling too confident to replace it with the List<LayoutToken>.
  • The unit tests are all working except one (see comment).

@kermitt2 kermitt2 changed the title override normalised data only when is empty override normalised date only when is empty Jul 18, 2021
@kermitt2
Copy link
Copy Markdown
Collaborator

@lfoppiano Hi Luca! Would it be possible on your side to update the test class for new config to have the branch merging without conflicts? Thanks !

@kermitt2 kermitt2 added this to the 0.7.1 milestone Jul 18, 2021
@lfoppiano
Copy link
Copy Markdown
Member Author

lfoppiano commented Jul 19, 2021

I updated the branch. The ParseDate test is fixed.

There are two tests in the org.grobid.core.engines.SegmentationTest which are failing for me on the macOs..
The change at rev a012f6b seems to be causing the problem for me, but I guess it solved it for you on Linux.

I wonder whether we have the same pdfalto version in both platforms...

BTW the ci-build seems dead...

Copy link
Copy Markdown
Collaborator

@kermitt2 kermitt2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When running test with latest PR version, I have 16 fails all due to this:

  java.lang.ExceptionInInitializerError
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
  Caused by: java.lang.NullPointerException
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)

@lfoppiano
Copy link
Copy Markdown
Member Author

When running test with latest PR version, I have 16 fails all due to this:

  java.lang.ExceptionInInitializerError
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
  Caused by: java.lang.NullPointerException
      at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)

Sorry, there was a modification that I forgot to commit. It's pushed now.

@kermitt2 kermitt2 merged commit b6d7281 into master Jul 27, 2021
@lfoppiano lfoppiano deleted the bugfix/fix-date-normalisation branch August 4, 2021 06:10
@btut btut mentioned this pull request Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug From Hemiptera and especially its suborder Heteroptera

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Publication date potentially corrected from the consolidation service seems to be lost in a limbo

4 participants