possible fix of 'ago' problem in Russian#340

Closed
eszakharova wants to merge 1 commit into scrapinghub:master from eszakharova:fix-ago-problem

Conversation

@eszakharova
Contributor

'2000 год' in Russian means 'year 2000'.
Before, it was parsed as '2000 years ago':

parse('2000 год')
datetime.datetime(17, 8, 3, 18, 44, 48, 222615)

After

parse('2000 год')
datetime.datetime(2000, 8, 3, 0, 0)
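
The year 17 in the "before" output is consistent with 2000 years being subtracted from the current date. A minimal sketch of that arithmetic, assuming the call was made on Aug 3, 2017 at the time shown in the output (the `now` value is an assumption, and this is not dateparser's actual implementation):

```python
from datetime import datetime

# Assumed "current time" when the example was run; taken from the
# month, day, and time fields of the "before" output.
now = datetime(2017, 8, 3, 18, 44, 48, 222615)

# Reading '2000 год' as "2000 years ago" means subtracting 2000 years.
years_ago = 2000
result = now.replace(year=now.year - years_ago)

print(repr(result))  # datetime.datetime(17, 8, 3, 18, 44, 48, 222615)
```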

@codecov

codecov bot commented Aug 3, 2017

Codecov Report

Merging #340 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #340      +/-   ##
==========================================
+ Coverage   97.61%   97.61%   +<.01%     
==========================================
  Files          20       20              
  Lines        1674     1677       +3     
==========================================
+ Hits         1634     1637       +3     
  Misses         40       40

| Impacted Files | Coverage Δ |
| dateparser/freshness_date_parser.py | 98.97% <100%> (+0.03%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5308434...3cf105a. Read the comment docs.

@Gallaecio
Contributor

@noviluni
Contributor

noviluni commented May 14, 2020

The reason behind this is that the relative-time parser is applied before the absolute-time parser, so it can be worked around with the PARSERS setting, either by removing the relative-time parser or by moving it to the last position.
For example:

dateparser.parse('2000 год', settings={'PARSERS': ['timestamp', 'custom-formats', 'absolute-time', 'base-formats', 'relative-time']})

Your approach isn't suitable because we can't hardcode that regex there; it would have side effects for other languages. However, the issue is still valid.

We could probably add a regex into the ru.yaml file.

A simplification such as - (\d+)\s*год: \1 might work. However, I don't speak Russian, so I don't know whether more is needed: when other words precede or follow the match (such as "ago" or "in"), this simplification may need to be skipped.
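
To make the concern concrete, here is a rough sketch of how such a simplification could behave, assuming it is applied as a plain regex substitution on the input string before parsing. The `simplify` helper and the `NAIVE` and `GUARDED` rules are hypothetical illustrations, not dateparser code, and the guarded pattern is only one possible way to skip the "ago" case ("назад" is Russian for "ago"):

```python
import re

# Hypothetical (pattern, replacement) pairs mimicking a simplification entry.
# NAIVE is the rule suggested above; GUARDED skips matches followed by "назад".
NAIVE = (r'(\d+)\s*год', r'\1')
GUARDED = (r'(\d+)\s*год\b(?!\s+назад)', r'\1')


def simplify(text, rule):
    """Apply one regex substitution rule to the input string."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, text)


# The intended case: "2000 год" (year 2000) is reduced to the bare year.
print(simplify('2000 год', NAIVE))       # 2000

# The risky case: "1 год назад" (1 year ago) loses its unit under the naive rule.
print(simplify('1 год назад', NAIVE))    # 1 назад

# The guarded rule still handles the intended case and leaves "ago" phrases alone.
print(simplify('2000 год', GUARDED))     # 2000
print(simplify('1 год назад', GUARDED))  # 1 год назад
```

Whether a lookahead like this is acceptable in ru.yaml, and which other surrounding words would need the same treatment, would still have to be checked by someone who knows Russian.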

@serhii73 serhii73 self-assigned this Nov 7, 2022
serhii73 added a commit that referenced this pull request Jan 24, 2023
@serhii73 serhii73 mentioned this pull request Jan 24, 2023
serhii73 added a commit that referenced this pull request Mar 9, 2023
5 participants