Japanese test improvements by wallace11 · Pull Request #962 · eikek/docspell

wallace11 · 2021-07-28T22:14:24Z

Hi there,
Here's some more sensible Japanese tests.
I hope that they pass 😆

eikek · 2021-07-28T23:05:09Z

Hi @wallace11 thanks! I'm afraid that this won't pass. The current algorithm first splits the text into a sequence of words, using "separating characters" such as whitespace and punctuation. But now the dates are surrounded by text: 付は2021.7.21で - there is no whitespace or anything like that here?

Edit: the CI complains about formatting, this can be fixed by running sbt fix (just fyi)
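
To illustrate the word-splitting problem described above, here is a minimal Python sketch. It is not docspell's actual code (docspell is written in Scala). The separator set, the date pattern, and the names `split_words` and `date_words` are assumptions made for this example only.

```python
import re

# Assumed separator set: whitespace plus a few punctuation marks.
# The real separators used by docspell may differ.
SEPARATORS = re.compile(r"[\s,;:!?()\[\]]+")

# Assumed date shape for this sketch: year.month.day, e.g. 2021.7.21
DATE = re.compile(r"\d{4}\.\d{1,2}\.\d{1,2}")


def split_words(text):
    """Split text into words at the separating characters."""
    return [w for w in SEPARATORS.split(text) if w]


def date_words(text):
    """Return only those words that are, as a whole, a date."""
    return [w for w in split_words(text) if DATE.fullmatch(w)]


# With separators around the date, it becomes its own word.
print(date_words("Invoice date: 2021.7.21 (due soon)"))

# Without separators, the whole fragment is one word and nothing matches.
print(split_words("付は2021.7.21で"))
print(date_words("付は2021.7.21で"))
```

In the Japanese fragment the date is glued to the surrounding characters, so a word-by-word check never sees it on its own.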

wallace11 · 2021-07-28T23:15:29Z

@eikek
Hey!
Sorry, I noticed your message only after pushing a possible fix (manual, I don't have a Scala environment set up...).

Regarding spaces, that's the thing - in "normal" Japanese there's no such thing. That's exactly why I wanted to create proper Japanese tests, to see if it catches that.

I looked at some of my documents and indeed on some of them you've got the date as part of the first sentence or the title (which is also a sentence).

Do you think it'd be possible to fix that?
If these tests won't work, would you like me to give it a test run on a couple of documents and see how it catches the dates?

eikek · 2021-07-28T23:20:12Z

@wallace11 no worries! (you would only need to install sbt for this) Thanks for your explanation! I just read on Wikipedia that there are no spaces in Japanese :) Well, I guess this means doing it completely differently here. If you have some documents you could share, that would help! That way I could run this against some "real" data. I might be able to remove all characters that are not Arabic numerals or the characters for year/month/day… maybe this gives some results.

Not very efficient, but should work to find the position of dates in japanese text.

eikek · 2021-07-28T23:40:40Z

@wallace11 I just pushed a quite crude fix :-). It preprocesses the text and removes all characters that don't take part in a date. Your tests should pass now. You could try this against your documents. I can merge this, and a few minutes later a nightly version will be published.
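
The preprocessing idea described above can be sketched roughly as follows. This is a Python illustration, not the code from the commit. The set of characters kept (ASCII digits, 年/月/日, and the separators `.`, `/`, `-`), the date pattern, and the names `preprocess` and `find_dates` are all assumptions for this example.

```python
import re

# Assumption: characters that may take part in a date are ASCII digits,
# the kanji for year/month/day, and the separators . / -
NOT_DATE_CHARS = re.compile(r"[^0-9年月日./\-]+")

# Assumed date shape: year, month, day with a kanji or punctuation separator.
DATE = re.compile(r"(\d{4})[年./\-](\d{1,2})[月./\-](\d{1,2})日?")


def preprocess(text):
    """Replace every run of non-date characters with a single space."""
    return NOT_DATE_CHARS.sub(" ", text).strip()


def find_dates(text):
    """Return (year, month, day) tuples found in the preprocessed text."""
    cleaned = preprocess(text)
    return [tuple(int(g) for g in m.groups()) for m in DATE.finditer(cleaned)]


print(preprocess("付は2021.7.21で"))
print(find_dates("付は2021.7.21で"))
print(find_dates("本書は2021年7月21日に発行"))
```

Replacing the removed characters with a space, rather than deleting them outright, keeps neighbouring numbers from being merged into one. Unlike the fix discussed in this thread, this sketch returns only the date values and does not track their positions in the original text.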

wallace11 · 2021-07-28T23:44:59Z

@eikek
Looks perfect.
I'll definitely give it a go and let you know how it went.
I guess the upcoming weekend is going to be all about organizing documents ;)

eikek · 2021-07-28T23:46:56Z

@wallace11 Great 😃 ! Sounds like a weekend 😉 Thank you for your help!

wallace11 added 2 commits July 29, 2021 01:08

Update Japanese tests with more sensible data

119a4ff

Add another Japanese test

1095a7d

Remove excessive spaces

e8348e2

Preprocess japanese texts to find dates

4af8dd0

Not very efficient, but should work to find the position of dates in japanese text.

eikek merged commit 16ade69 into eikek:master Jul 28, 2021

eikek added this to the Docspell 0.25.0 milestone Jul 29, 2021

wallace11 mentioned this pull request Jul 31, 2021

Issue with Japanese and numbers (digits) #973

Closed

eikek merged 4 commits into eikek:master from wallace11:japanese-test-improvements
