OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false#792
OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false#792mawiesne merged 4 commits intoapache:mainfrom
Conversation
|
Dear Reviewers, This PR is opened to clarify a few things around the usage of 'useTokenEnd' flag in SentenceDetector. 1. We have below issue prioritized for release 2.6.0. Question : Could someone pls. provide some clarification on the changes required to fix OPENNLP-205? 2. The Sentence Detector documentation says that for training : We see examples of two sentences in one line. E.g. Should the text sample be standardized as per documentation?
3. Can we add some documentation in the manual for this flag? Best Regards. |
| sentdetectModel = train(factory, Locale.GERMAN); | ||
| Assertions.assertNotNull(sentdetectModel); | ||
| Assertions.assertEquals("deu", sentdetectModel.getLanguage()); | ||
| private void prepareResources(boolean useTokenEnd) { |
There was a problem hiding this comment.
We need a way to configure the SentenceDetectorFactory differently i.e. useTokenEnd=true for some tests and with
useTokenEnd=false for one other test.
Approach1 - @BeforeAll is removed, and the method prepareResources() is made private and parameterized so that the respective test case can get the trained model based on its need.
Appraoch-2 : Exclusively configure the SentenceDetectorFactory for the new test (i.e. useTokenEnd=false scenario) in its own definition and leave the previous @BeforeAll method as is.
Pls. suggest any preference of approach-2 over approach1 ?
|
FYI @NishantShri4: |
|
Looks like the rebase went wrong. Would expect was less changes. |
|
Thanks Reviewers, this is now rebased against latest main branch. |
|
Regarding your questions / thoughts above: (1) I wasn't involved in the project back than, so no idea about the original thoughts of the issue you mentioned. Perhaps, it is already solved (2) We should fix the test sample. (3) Feel free to add docs ;-) |
|
Is this good to merge? |
|
have u tried with 'Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz' ? |
| Assertions.assertEquals("deu", sentdetectModel.getLanguage()); | ||
| private void prepareResources(boolean useTokenEnd) { | ||
| try { | ||
| Dictionary abbreviationDict = loadAbbDictionary(Locale.GERMAN); |
There was a problem hiding this comment.
This could be done in an @BeforeAll test preparation step, as the abbreviationDict is constant for this scenario.
If left like this, more effort is required, as it is reloaded per actual test case.
There was a problem hiding this comment.
Thanks. This is done.
| @@ -45,19 +48,25 @@ public class SentenceDetectorMEGermanTest extends AbstractSentenceDetectorTest { | |||
|
|
|||
| private static SentenceModel sentdetectModel; | |||
There was a problem hiding this comment.
Given prepareResources(..) is not static any longer, this field should also avoid "static" and become an instance per Test case. Please consider removing static here.
There was a problem hiding this comment.
Thanks this is done.
| String[] sents = sentDetect.sentDetect(sent1 + sent2); | ||
| double[] probs = sentDetect.getSentenceProbabilities(); | ||
| Assertions.assertEquals(1, probs.length); | ||
| assertAll( |
There was a problem hiding this comment.
Could you apply the assertAll(..) pattern to the other two test scenario analogously?
It reads nicely this way.
There was a problem hiding this comment.
Sure. This is done.
|
Thanks very much Reviewers. I am Converting this to Draft to finish accommodating review comments and also update documentation for this flag. Will present for review again shortly. One query - Is there any work required for https://issues.apache.org/jira/browse/OPENNLP-205 (Refactor the SentenceDetectorME class to do the mapping of end-of-sent positions to spans better)? |
Great that you show interest in OPENNLP-205. Potentially: Yes. It all starts with a test case that demonstrates the topic / issue. Haven't had the time to look into it deeper, that is, from a functional perspective. If you have an understanding of "205", feel free to open a separate branch with a test exemplifying what's described in the issue for now. If it is hard to extract that, the issue might not be relevant any longer. Keep in mind: It's a lower three-digit issue. Other opinions/comments welcome. |
This doesn't include any punctuation marks. Is the suggestion to use this for sentence detection? |
…se (#792) * OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false * Added useTokenEnd to the list of optional params available for sentence detector tool.
- adapts PR #792 for OpenNLP 2.x
- adapts PR #792 for OpenNLP 2.x
- adapts PR #792 for OpenNLP 2.x
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.