override normalised date only when is empty#761
Conversation
|
Right now, the only way to have the normalised date not empty (and so a date that should not be normalized) is after successful consolidation. So in this case, I think it's better to also update non-normalized date during consolidation (via the method <date type="published" when="2008-05-28">2008-05-28</date>About the case where we have a list of normalized dates returned from a raw extracted date field, I think the problem is not that much impossible dates (I don't remember really having seen some), but rather several incomplete dates that could be combined in one. The way to study that could be to process some thousand of PDF, get all the extracted dates and see where are the problems - but it's quite a lot of time and we probably don't want to do that now with this fix? |
That method requires a Update: I just found that in the TEIFormatter, there is already some code to format the date: around line 189 if (biblio.getNormalizedPublicationDate() != null) {
Date date = biblio.getNormalizedPublicationDate();
int year = date.getYear();
int month = date.getMonth();
int day = date.getDay();
String when = "";
if (year != -1) {
[...]Maybe we could move this somewhere and reuse it to get out the ISO date?
I guess it's better to limit this PR to just the fix 😉 In bug #760 I got a year of 5 digits for example, in which case I have preferred to have none. |
Ah yes, I've forgotten the two Date objects in Grobid! The Grobid normalized Date is serialized into ISO format in the TEI. However, I think it would be easier to apply well-formedness criteria to the Grobid object rather than considering java.util.Date, because days and months can often be unspecified (and it does not mean it's first day of the month or first month of the year). The |
…the consolidation, parse the date only when there is no normalised date
|
WIth the last commit, I've improved a bit the efficiency (avoiding parsing the date if there is already a normalised date) and we obtain the ISO version: For that, I created a new static method in the TEIFormatter.java, and I modified the formatter to use it. I've added a few code improvement to avoid NPE in the various IFs and I've removed a hardcoded date (year 2025) in the Update: in the last commit (7df7f9f) I've overhauled the normalise() method to:
The output now look like the following (the when is absent because the normalised date is not valid): <publicationStmt>
<publisher/>
<availability status="unknown">
<licence/>
</availability>
<date>10360 10370 10380 10390 10400</date>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic></analytic>
<monogr>
<imprint>
<date type="published">10360 10370 10380 10390 10400</date>
</imprint>
</monogr>
<idno type="MD5">3C20A67187D8EAB4C3B08E9E158A908D</idno>@kermitt2, does it make sense this behaviour? |
|
I think this PR can be reviewed |
grobid-core/src/main/java/org/grobid/core/engines/DateParser.java
Outdated
Show resolved
Hide resolved
grobid-core/src/main/java/org/grobid/core/engines/DateParser.java
Outdated
Show resolved
Hide resolved
grobid-core/src/main/java/org/grobid/core/engines/DateParser.java
Outdated
Show resolved
Hide resolved
|
Many parts seem to read better. I don't feel fully qualified to review what the fix is actually about, as I haven't really used the consolidation. (It does seem to make sense in general) |
|
As discussed, I've migrated the DateParser to the Clusteror classes.
|
|
@lfoppiano Hi Luca! Would it be possible on your side to update the test class for new config to have the branch merging without conflicts? Thanks ! |
|
I updated the branch. The ParseDate test is fixed. There are two tests in the org.grobid.core.engines.SegmentationTest which are failing for me on the macOs.. I wonder whether we have the same pdfalto version in both platforms... BTW the ci-build seems dead... |
kermitt2
left a comment
There was a problem hiding this comment.
When running test with latest PR version, I have 16 fails all due to this:
java.lang.ExceptionInInitializerError
at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
Caused by: java.lang.NullPointerException
at org.grobid.core.engines.DateParserTest.setUp(DateParserTest.java:38)
Sorry, there was a modification that I forgot to commit. It's pushed now. |
I've implemented a fix for #760. I leave the PR as a draft as it might require a couple of iterations...
Right now the normalised date is updated only if it's empty. We could add some additional checks from the date parser to check that the date is actually within the minimal requirement to be valid. Is there already code for doing that?
The output is:
@kermitt2 let me know if you have any remark.