Fix date_parser with prefer_month_of_year wrong results#1224
Fix date_parser with prefer_month_of_year wrong results#1224wRAR merged 4 commits intoscrapinghub:masterfrom
Conversation
Fix two problems 1. Parser would use current month even if prefer_month_of_year was not current when relative_base was not none 2. Parser would use current month to derive 'what is the last day of this month' - for example, with prefer_month=last and prefer_day=past, but current_month=april, it would return december 30th, because it would use april to find that the last day was the 30th, when it should use the month. Additionally, add a test to test_date_parser that uses prefer_month
|
Tagging folks here who were involved in the original PR #1146 @adnan-awan @Gallaecio @serhii73 could I have your comments and reviews please |
|
Could you have a look at the 2 failing tests? |
It is parsing "0:4", french, with settings
```
"PREFER_DATES_FROM": "past",
"PREFER_DAY_OF_MONTH": "first",
"PREFER_LOCALE_DATE_ORDER": True,
"PREFER_MONTH_OF_YEAR": "current",
"RELATIVE_BASE": datetime(
year=1970, month=1, day=1, hour=0, minute=0, second=0
),
```
It used to expect to get `expected_date=datetime(1969, 12, 31, 14, 4)` but after my change it gets `datetime(1969, 1, 31, 14, 4)`
I would argue that with PREFER_MONTH_OF_YEAR set to "Current", and "Current" being January 1st 1970, that `datetime(1969, 1, 31, 14, 4)` is a better result
However with this particular set of configuration, I am not exactly 100% sure what to expect. These settings were generated by a fuzzer so perhaps they don't really make a ton of sense together anyway; rather than change the settings (and thus deviate from what the parser caught) I have opted to update the test expectation to accept January.
'Die' It is searching a German string for dates and asserting that when it finds the word "Die" in the string, it should be parsed as `datetime.datetime(1999, 12, 28, 0, 0)` Similarly, my change makes this `datetime.datetime(1999, 1, 28, 0, 0)` instead. I don't speak German, but as far as I can tell "Die" just means "The" so I have no idea why it is even matching it. In my opinion, this could be a bug with the search identifying a non-date word, and so I can't really guess as to what a sensible result would be. For the sake of simplicity, I also just updated this test to accept January,
|
Certainly @Gallaecio - sorry I missed that locally - tox / all tests do not run well on Windows where I developed this so I just ran select tests and expected to check all of them in CI In the first case (test_dates_parse_utc_offset_does_not_throw), it is parsing "0:4", French, with settings it used to expect to get The next one is even more perplexing, it is searching a German string for dates and asserting that when it finds the word "Die" in the string, it should be parsed as |
Fix two problems
Parser would use current month even if prefer_month_of_year was not current when relative_base was not none
Parser would use current month to derive 'what is the last day of this month' - for example, with prefer_month=last and prefer_day=last, but current_month=april, it would return december 30th, because it would use april to find that the last day was the 30th, when it should have used December (Parsing using
{"PREFER_DAY_OF_MONTH": "last", "PREFER_MONTH_OF_YEAR": "last"}doesn't use last day of the last month #1217)Additionally, add a test to test_date_parser that uses prefer_month
Fixes #1217