Japanese test improvements #962

Merged
eikek merged 4 commits into eikek:master from wallace11:japanese-test-improvements
Jul 28, 2021

@wallace11
Contributor

Hi there,
Here are some more sensible Japanese tests.
I hope that they pass 😆

@eikek
Owner

eikek commented Jul 28, 2021

Hi @wallace11 thanks! I'm afraid that this won't pass. The current algorithm first generates a sequence of words that are determined by some "separating characters" like whitespace/punctuation etc. But now the dates are surrounded by text: 付は2021.7.21で - there is no whitespace or something like that here?

Edit: the CI complains about formatting, this can be fixed by running sbt fix (just fyi)
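
To illustrate the problem described in this comment, here is a minimal Python sketch of a word-based date search. It is not Docspell's actual Scala code: the separator set, the date pattern, and the names `split_words` and `find_dates` are all assumptions made up for this example.

```python
import re

# Hypothetical separator set (whitespace and some punctuation).
# The real list of "separating characters" is not shown in this thread.
SEPARATORS = re.compile(r"[\s,;:!?()\[\]。、]+")

# Hypothetical date shape: 4-digit year, then month and day,
# joined by '.', '/' or '-'.
DATE = re.compile(r"\d{4}[./-]\d{1,2}[./-]\d{1,2}")


def split_words(text: str) -> list[str]:
    """Split text into words at runs of separator characters."""
    return [w for w in SEPARATORS.split(text) if w]


def find_dates(text: str) -> list[str]:
    """Return only those words that are, as a whole, a date."""
    return [w for w in split_words(text) if DATE.fullmatch(w)]


# Separators around the date: the date becomes its own word.
print(find_dates("Invoice date: 2021.7.21 thanks"))

# No separators around the date: the whole string stays one word,
# so the date is never seen on its own.
print(find_dates("付は2021.7.21で"))
```

With these assumptions, the first call finds the date because whitespace isolates it, while the second call finds nothing because the entire string is treated as a single word.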

@wallace11
Contributor Author

@eikek
Hey!
Sorry, I noticed your message only after pushing a possible fix (manual, I don't have a Scala environment set up...).

Regarding spaces, that's the thing: in "normal" Japanese there's no such thing. That's exactly why I wanted to create proper Japanese tests, to see if it catches that.

I looked at some of my documents and indeed on some of them you've got the date as part of the first sentence or the title (which is also a sentence).

Do you think it'd be possible to fix that?
If these tests won't work, would you like me to give it a test run on a couple of documents and see how it catches the dates?

@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 no worries! (You would only need to install sbt for this.) Thanks for your explanation! I just read on Wikipedia that there are no spaces in Japanese :) Well, I guess this means doing it completely differently here. If you have some documents you could share, that would help! That way I could run this against some "real" data. I might be able to remove all characters that are not Arabic numerals or the characters for year/month/day… maybe this gives some results.

Not very efficient, but should work to find the position of dates in Japanese text.
@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 I just pushed a quite crude fix :-). It preprocesses the text and removes all characters that don't take part in a date. Your tests should pass now. You could try this against your documents. I can merge this and some minutes later a nightly version is published.
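
The preprocessing idea in this comment can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the code that was pushed: which characters count as "taking part in a date" (here ASCII digits, 年, 月, 日, and the separators `.`, `/`, `-`), the date pattern, and the names `strip_non_date_chars` and `find_dates_after_stripping` are all hypothetical.

```python
import re

# Assumption: only these characters can be part of a date.
# Every run of other characters is replaced by a single space.
NOT_DATE_CHARS = re.compile(r"[^0-9年月日./-]+")

# Hypothetical date shape: 2021.7.21, 2021/7/21, 2021-7-21 or 2021年7月21日.
DATE = re.compile(r"\d{4}[年./-]\d{1,2}[月./-]\d{1,2}日?")


def strip_non_date_chars(text: str) -> list[str]:
    """Blank out non-date characters and split on the resulting spaces."""
    return NOT_DATE_CHARS.sub(" ", text).split()


def find_dates_after_stripping(text: str) -> list[str]:
    """Keep only the remaining fragments that look like a full date."""
    return [w for w in strip_non_date_chars(text) if DATE.fullmatch(w)]


# The surrounding text is removed, so the date stands alone.
print(find_dates_after_stripping("付は2021.7.21で"))

# A stray 日 from 発行日 survives the stripping step,
# but is dropped by the date pattern afterwards.
print(strip_non_date_chars("発行日は2021年7月21日です"))
print(find_dates_after_stripping("発行日は2021年7月21日です"))
```

The second example shows why such an approach is crude: characters like 日 also occur in ordinary words, so the stripping step alone leaves fragments that a later matching step still has to reject.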

@wallace11
Contributor Author

@eikek
Looks perfect.
I'll definitely give it a go and let you know how it went.
I guess the upcoming weekend is going to be all about organizing documents ;)

@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 Great 😃 ! Sounds like a weekend 😉 Thank you for your help!

@eikek eikek merged commit 16ade69 into eikek:master Jul 28, 2021
@eikek eikek added this to the Docspell 0.25.0 milestone Jul 29, 2021