OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false by NishantShri4 · Pull Request #792 · apache/opennlp

NishantShri4 · 2025-06-14T17:57:15Z

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?

For code changes:

Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
Have you written or updated unit tests to verify your changes?
If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

NishantShri4 · 2025-06-14T18:56:00Z

Dear Reviewers,

This PR is opened to clarify a few things around the usage of the 'useTokenEnd' flag in SentenceDetector.

1. We have the issue below prioritized for release 2.6.0:

https://issues.apache.org/jira/browse/OPENNLP-205
(Refactor the SentenceDetectorME class to do the mapping of end-of-sent positions to spans better)

The above issue says that the code fails in some scenarios when useTokenEnd is false.
However, I see that a fix for the usage of this flag was already made in
https://issues.apache.org/jira/browse/OPENNLP-711.
I have added a simple test which demonstrates the use of the useTokenEnd flag when it is set to false.

Question: Could someone please clarify which changes are still required to fix OPENNLP-205?

2. The Sentence Detector documentation says that, for training:
"The data must be converted to the OpenNLP Sentence Detector training format. Which is one sentence per line."

However, the test data sample for German text,
https://github.com/apache/opennlp/blob/main/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/Sentences_DE.txt
contains lines that hold two sentences, e.g.:

Ein älterer Herr gesellt sich zu ihm und schimpft über den König von Italien. Am Ende der Anhöhe geht er dann viel leichter.

Should the text sample be standardized as per the documentation?

3. Can we add some documentation in the manual for this flag?

Best Regards.
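
Regarding question 2, a rough way to spot lines that seem to hold more than one sentence could look like the sketch below. It is only a heuristic: the class name, the pattern, and the sample lines are made up for illustration and are not part of OpenNLP, and abbreviations followed by a capitalized word would be flagged as false positives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class OneSentencePerLineCheck {

  // Assumption: '.', '!' or '?' followed by whitespace and an uppercase
  // letter inside a line hints at a second sentence on that line.
  private static final Pattern INNER_BOUNDARY = Pattern.compile("[.!?]\\s+\\p{Lu}");

  // Returns the 1-based numbers of all lines that match the heuristic.
  static List<Integer> suspiciousLines(List<String> lines) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < lines.size(); i++) {
      if (INNER_BOUNDARY.matcher(lines.get(i)).find()) {
        hits.add(i + 1);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    // Made-up sample lines; the first one holds two sentences.
    List<String> sample = List.of(
        "Er kommt heute. Sie geht morgen.",
        "Sie geht morgen.");
    System.out.println(suspiciousLines(sample)); // prints [1]
  }
}
```

Flagged lines would still need a manual look before the sample file is changed.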

NishantShri4 · 2025-06-15T06:38:06Z

opennlp-tools/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEGermanTest.java

-    sentdetectModel = train(factory, Locale.GERMAN);
-    Assertions.assertNotNull(sentdetectModel);
-    Assertions.assertEquals("deu", sentdetectModel.getLanguage());
+  private void prepareResources(boolean useTokenEnd) {


We need a way to configure the SentenceDetectorFactory differently, i.e. with useTokenEnd=true for some tests and with
useTokenEnd=false for one other test.

Approach 1: @BeforeAll is removed, and the method prepareResources() is made private and parameterized so that each test case can get a trained model that fits its needs.

Approach 2: Configure the SentenceDetectorFactory exclusively for the new test (i.e. the useTokenEnd=false scenario) in its own definition and leave the previous @BeforeAll method as is.

Please advise whether approach 2 is preferred over approach 1.
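
To make approach 1 concrete, here is a minimal, self-contained sketch. The Model class is a hypothetical stand-in for a trained sentence model and the method body is a placeholder; it only illustrates the shape of a parameterized helper, not the actual test code in this PR.

```java
// Hypothetical sketch of approach 1; none of these types come from OpenNLP.
final class PrepareResourcesSketch {

  // Stand-in for a trained sentence model.
  static final class Model {
    final String language;
    final boolean useTokenEnd;

    Model(String language, boolean useTokenEnd) {
      this.language = language;
      this.useTokenEnd = useTokenEnd;
    }
  }

  // No shared setup: each test asks for the model it needs.
  static Model prepareResources(boolean useTokenEnd) {
    // A real test would train a model here, which costs time on every call.
    return new Model("deu", useTokenEnd);
  }

  public static void main(String[] args) {
    Model defaultModel = prepareResources(true);  // what the existing tests need
    Model noTokenEnd = prepareResources(false);   // what the new test needs
    System.out.println(defaultModel.useTokenEnd + " " + noTokenEnd.useTokenEnd); // prints true false
  }
}
```

The trade-off is that approach 1 repeats the setup for every test, while approach 2 keeps the shared setup and pays the extra cost only in the new test.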

mawiesne · 2025-06-18T15:52:32Z

FYI @NishantShri4:
Given #786 has been merged today, please rebase your changes against latest main branch and the different module structure. Once this is consistent with 3.x ideas, we can have a deeper look into this PR.

rzo1 · 2025-06-18T19:15:10Z

Looks like the rebase went wrong. I would have expected far fewer changes.

NishantShri4 · 2025-06-18T21:28:28Z

Thanks Reviewers, this is now rebased against latest main branch.

rzo1 · 2025-06-20T14:28:51Z

Regarding your questions / thoughts above:

(1) I wasn't involved in the project back then, so I have no idea about the original thoughts behind the issue you mentioned. Perhaps it is already solved.

(2) We should fix the test sample.

(3) Feel free to add docs ;-)
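
As a starting point for such docs, the effect of the flag could be illustrated along the following lines. This is a simplified sketch under an assumption: with useTokenEnd=true the next sentence starts after the whitespace-delimited token that contains the end-of-sentence character, and with useTokenEnd=false it starts directly after that character. The class and method names are made up, and the code is not the SentenceDetectorME implementation.

```java
// Simplified illustration only; names are hypothetical and the behaviour
// is an assumption about the flag, not code taken from OpenNLP.
final class TokenEndSketch {

  // Index of the first whitespace character at or after 'from'.
  static int firstWhitespace(String s, int from) {
    while (from < s.length() && !Character.isWhitespace(s.charAt(from))) {
      from++;
    }
    return from;
  }

  // Index of the first non-whitespace character at or after 'from'.
  static int firstNonWhitespace(String s, int from) {
    while (from < s.length() && Character.isWhitespace(s.charAt(from))) {
      from++;
    }
    return from;
  }

  // Start of the next sentence, given the index of an end-of-sentence character.
  static int nextSentenceStart(String s, int eosIndex, boolean useTokenEnd) {
    int from = useTokenEnd ? firstWhitespace(s, eosIndex + 1) : eosIndex + 1;
    return firstNonWhitespace(s, from);
  }

  public static void main(String[] args) {
    String text = "Er sagte \"Halt.\" Dann ging er.";
    int eos = text.indexOf('.');
    // true: the closing quote stays with the first sentence.
    System.out.println(text.substring(nextSentenceStart(text, eos, true)));
    // false: the closing quote becomes the start of the second sentence.
    System.out.println(text.substring(nextSentenceStart(text, eos, false)));
  }
}
```

Under this assumption the two settings only differ when other characters, such as a closing quote, directly follow the end-of-sentence character.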

jzonthemtn · 2025-06-23T16:31:00Z

Is this good to merge?

smarthi · 2025-06-23T16:35:56Z

have u tried with 'Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz' ?

mawiesne · 2025-06-23T17:23:40Z

...ore/opennlp-runtime/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEGermanTest.java

-    Assertions.assertEquals("deu", sentdetectModel.getLanguage());
+  private void prepareResources(boolean useTokenEnd) {
+    try {
+      Dictionary abbreviationDict = loadAbbDictionary(Locale.GERMAN);


This could be done in an @BeforeAll test preparation step, as the abbreviationDict is constant for this scenario.
If left like this, more effort is required, as it is reloaded per actual test case.

NishantShri4 replied (Jun 24, 2025): Thanks. This is done.

mawiesne · 2025-06-23T17:26:26Z

...ore/opennlp-runtime/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEGermanTest.java

@@ -45,19 +48,25 @@ public class SentenceDetectorMEGermanTest extends AbstractSentenceDetectorTest {

  private static SentenceModel sentdetectModel;


Given prepareResources(..) is no longer static, this field should also drop "static" and become an instance field per test case. Please consider removing static here.

NishantShri4 replied (Jun 24, 2025): Thanks, this is done.

mawiesne · 2025-06-23T17:28:20Z

...ore/opennlp-runtime/src/test/java/opennlp/tools/sentdetect/SentenceDetectorMEGermanTest.java

+    String[] sents = sentDetect.sentDetect(sent1 + sent2);
    double[] probs = sentDetect.getSentenceProbabilities();
-    Assertions.assertEquals(1, probs.length);
+    assertAll(


Could you apply the assertAll(..) pattern to the other two test scenario analogously?
It reads nicely this way.

NishantShri4 replied (Jun 24, 2025): Sure. This is done.
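
For readers unfamiliar with the pattern: JUnit 5's assertAll(..) runs every grouped assertion and reports all failures together instead of stopping at the first one. The sketch below imitates that idea with plain Java so that it runs without JUnit; runAll and expectEquals are hypothetical stand-ins, not JUnit API.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java stand-in for grouped assertions; not JUnit code.
final class GroupedAssertionsSketch {

  // Runs every check and collects the messages of those that failed.
  static List<String> runAll(Runnable... checks) {
    List<String> failures = new ArrayList<>();
    for (Runnable check : checks) {
      try {
        check.run();
      } catch (AssertionError e) {
        failures.add(e.getMessage());
      }
    }
    return failures;
  }

  // Minimal equality check that throws an AssertionError on mismatch.
  static void expectEquals(Object expected, Object actual) {
    if (!expected.equals(actual)) {
      throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
    }
  }

  public static void main(String[] args) {
    String[] sents = {"Erster Satz.", "Zweiter Satz."}; // made-up detector output
    double[] probs = {0.9};                             // made-up probabilities
    List<String> failures = runAll(
        () -> expectEquals(1, sents.length),
        () -> expectEquals(2, probs.length));
    // Both mismatches are reported, not only the first one.
    failures.forEach(System.out::println);
  }
}
```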

NishantShri4 · 2025-06-23T19:01:03Z

Thanks very much, Reviewers. I am converting this to Draft to finish accommodating the review comments and to update the documentation for this flag. I will present it for review again shortly.

One query - Is there any work required for https://issues.apache.org/jira/browse/OPENNLP-205 (Refactor the SentenceDetectorME class to do the mapping of end-of-sent positions to spans better)?

mawiesne · 2025-06-23T20:49:01Z

One query - Is there any work required for https://issues.apache.org/jira/browse/OPENNLP-205

Great that you show interest in OPENNLP-205. Potentially: Yes.

It all starts with a test case that demonstrates the topic / issue.

Haven't had the time to look into it deeper, that is, from a functional perspective. If you have an understanding of "205", feel free to open a separate branch with a test exemplifying what's described in the issue for now. If it is hard to extract that, the issue might not be relevant any longer. Keep in mind: It's a lower three-digit issue.

Other opinions/comments welcome.

NishantShri4 · 2025-06-24T18:47:04Z

have u tried with 'Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz' ?

This doesn't include any punctuation marks. Is the suggestion to use this for sentence detection?
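
The point about missing punctuation can be shown with a small sketch: if a text contains no end-of-sentence character, there is no position at which a detector could consider a split. The character set and the names below are assumptions for illustration, not OpenNLP's actual scanner.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; the end-of-sentence character set is an assumption.
final class EosCandidateSketch {

  private static final String EOS_CHARS = ".!?";

  // Returns the indices of all characters that could end a sentence.
  static List<Integer> candidatePositions(String text) {
    List<Integer> positions = new ArrayList<>();
    for (int i = 0; i < text.length(); i++) {
      if (EOS_CHARS.indexOf(text.charAt(i)) >= 0) {
        positions.add(i);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    // The word from the comment above, with u-umlaut written as \u00fc.
    String word = "Rindfleischetikettierungs\u00fcberwachungsaufgaben\u00fcbertragungsgesetz";
    System.out.println(candidatePositions(word));        // prints []
    System.out.println(candidatePositions("Ja. Nein!")); // prints [2, 8]
  }
}
```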

NishantShri4 marked this pull request as draft June 14, 2025 18:03

NishantShri4 force-pushed the main branch from 78bb0a3 to 428b626 Compare June 15, 2025 06:21

NishantShri4 commented Jun 15, 2025

NishantShri4 marked this pull request as ready for review June 15, 2025 06:38

NishantShri4 force-pushed the main branch from 11105a4 to fe9da10 Compare June 15, 2025 06:42

NishantShri4 changed the title ~~OPENNLP-1745 : SentenceDetector - Add Junit test for useTokenEnd = false~~ OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false Jun 16, 2025

mawiesne added the tests Pull requests that add or update test code label Jun 18, 2025

NishantShri4 marked this pull request as draft June 18, 2025 19:31

NishantShri4 force-pushed the main branch from d242368 to 84d1068 Compare June 18, 2025 21:05

NishantShri4 closed this Jun 18, 2025

NishantShri4 force-pushed the main branch from 84d1068 to b4559df Compare June 18, 2025 21:06

OPENNLP-1745 : SentenceDetector - Add Junit test for useTokenEnd = false

983b183

NishantShri4 reopened this Jun 18, 2025

NishantShri4 marked this pull request as ready for review June 18, 2025 21:28

jzonthemtn approved these changes Jun 18, 2025

mawiesne assigned rzo1 Jun 19, 2025

mawiesne requested a review from rzo1 June 19, 2025 08:13

rzo1 requested a review from mawiesne June 19, 2025 08:53

rzo1 approved these changes Jun 20, 2025

mawiesne reviewed Jun 23, 2025

NishantShri4 marked this pull request as draft June 23, 2025 19:01

NishantShri4 and others added 3 commits June 24, 2025 19:33

Merge branch 'apache:main' into main

e95168b

Added useTokenEnd to the list of optional params available for sentence detector tool.

fc94e5f

Merge remote-tracking branch 'origin/main'

038a6f2

mawiesne approved these changes Jun 24, 2025

mawiesne marked this pull request as ready for review June 24, 2025 18:56

mawiesne merged commit eab70aa into apache:main Jun 24, 2025
10 checks passed

mawiesne added a commit that referenced this pull request Jul 7, 2025

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false

8169b47

- adapts PR #792 for OpenNLP 2.x

mawiesne mentioned this pull request Jul 7, 2025

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false (OpenNLP 2.x) #809

Merged

10 tasks

mawiesne added a commit that referenced this pull request Jul 7, 2025

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false

1e6fffb

- adapts PR #792 for OpenNLP 2.x

mawiesne added a commit that referenced this pull request Jul 7, 2025

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false

c52dea8

- adapts PR #792 for OpenNLP 2.x

mawiesne added a commit that referenced this pull request Jul 7, 2025

OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false (OpenNLP 2.x) (#809)

5fb0530

- adapts PR #792 for OpenNLP 2.x
