OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false#792

Merged
mawiesne merged 4 commits into apache:main from
NishantShri4:main
Jun 24, 2025

Conversation

@NishantShri4
Contributor

@NishantShri4 NishantShri4 commented Jun 14, 2025

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check GitHub Actions for build issues and submit an update to your PR as soon as possible.

@NishantShri4 NishantShri4 marked this pull request as draft June 14, 2025 18:03
@NishantShri4
Contributor Author

NishantShri4 commented Jun 14, 2025

Dear Reviewers,

This PR is opened to clarify a few things around the usage of 'useTokenEnd' flag in SentenceDetector.

1. The following issue is prioritized for release 2.6.0:

https://issues.apache.org/jira/browse/OPENNLP-205
(Refactor the SentenceDetectorME class to do the mapping of end-of-sent positions to spans better)

That issue says the code fails in some scenarios when useTokenEnd is false.
However, I see that a fix for the usage of this flag was already made in
https://issues.apache.org/jira/browse/OPENNLP-711.
I have added a simple test that demonstrates the use of the useTokenEnd flag when it is set to false.

Question: Could someone please clarify which changes are still required to fix OPENNLP-205?

2. The Sentence Detector documentation says about training:
"The data must be converted to the OpenNLP Sentence Detector training format. Which is one sentence per line."

However, the test data sample for German text,
https://github.com/apache/opennlp/blob/main/opennlp-tools/src/test/resources/opennlp/tools/sentdetect/Sentences_DE.txt
contains lines with two sentences, e.g.:

Ein älterer Herr gesellt sich zu ihm und schimpft über den König von Italien. Am Ende der Anhöhe geht er dann viel leichter.

Should the text sample be standardized as per the documentation?

3. Can we add some documentation in the manual for this flag?

Best Regards.
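
For readers unfamiliar with the flag, here is a minimal sketch of the idea, assuming useTokenEnd decides whether a sentence boundary is moved to the end of the token that contains the end-of-sentence character. It is a simplification and not the SentenceDetectorME implementation: the class name, the helper names, and the pre-accepted end-of-sentence indices are hypothetical, and no model or abbreviation dictionary is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration only; not the OpenNLP implementation. */
class TokenEndSketch {

  /** Index of the first whitespace at or after pos, or s.length() if there is none. */
  static int firstWhitespace(String s, int pos) {
    while (pos < s.length() && !Character.isWhitespace(s.charAt(pos))) {
      pos++;
    }
    return pos;
  }

  /** Index of the first non-whitespace at or after pos, or s.length() if there is none. */
  static int firstNonWhitespace(String s, int pos) {
    while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
      pos++;
    }
    return pos;
  }

  /**
   * Maps already accepted end-of-sentence character indices (an assumption of
   * this sketch) to the start offsets of the sentences that follow them.
   */
  static List<Integer> nextSentenceStarts(String s, int[] eosIndices, boolean useTokenEnd) {
    List<Integer> starts = new ArrayList<>();
    for (int eos : eosIndices) {
      // With useTokenEnd, skip to the end of the token containing the EOS character.
      int from = useTokenEnd ? firstWhitespace(s, eos + 1) : eos + 1;
      starts.add(firstNonWhitespace(s, from));
    }
    return starts;
  }

  public static void main(String[] args) {
    String text = "Stop.Go now. Done.";
    int[] eos = {4, 11}; // the '.' after "Stop" and the '.' after "now"
    System.out.println(nextSentenceStarts(text, eos, true));  // [8, 13]
    System.out.println(nextSentenceStarts(text, eos, false)); // [5, 13]
  }
}
```

In this sketch the two settings differ only where an end-of-sentence character is directly followed by more text, which is the situation created when two sentences are concatenated without a separating space.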

sentdetectModel = train(factory, Locale.GERMAN);
Assertions.assertNotNull(sentdetectModel);
Assertions.assertEquals("deu", sentdetectModel.getLanguage());
private void prepareResources(boolean useTokenEnd) {
Contributor Author

We need a way to configure the SentenceDetectorFactory differently, i.e. with useTokenEnd=true for some tests and with
useTokenEnd=false for one other test.

Approach 1: @BeforeAll is removed, and the method prepareResources() is made private and parameterized so that each test case can obtain a trained model that fits its needs.

Approach 2: Configure the SentenceDetectorFactory exclusively for the new test (i.e. the useTokenEnd=false scenario) in its own definition and leave the previous @BeforeAll method as is.

Please advise: is approach 2 preferable to approach 1?

@NishantShri4 NishantShri4 marked this pull request as ready for review June 15, 2025 06:38
@NishantShri4 NishantShri4 changed the title from "OPENNLP-1745 : SentenceDetector - Add Junit test for useTokenEnd = false" to "OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false" Jun 16, 2025
@mawiesne
Contributor

FYI @NishantShri4:
Given #786 has been merged today, please rebase your changes against latest main branch and the different module structure. Once this is consistent with 3.x ideas, we can have a deeper look into this PR.

@mawiesne mawiesne added the tests Pull requests that add or update test code label Jun 18, 2025
@rzo1
Contributor

rzo1 commented Jun 18, 2025

Looks like the rebase went wrong. I would expect way fewer changes.

@NishantShri4 NishantShri4 reopened this Jun 18, 2025
@NishantShri4
Contributor Author

Thanks Reviewers, this is now rebased against latest main branch.

@NishantShri4 NishantShri4 marked this pull request as ready for review June 18, 2025 21:28
@mawiesne mawiesne requested a review from rzo1 June 19, 2025 08:13
@rzo1 rzo1 requested a review from mawiesne June 19, 2025 08:53
@rzo1
Contributor

rzo1 commented Jun 20, 2025

Regarding your questions / thoughts above:

(1) I wasn't involved in the project back then, so I have no idea about the original thoughts behind the issue you mentioned. Perhaps it is already solved.

(2) We should fix the test sample.

(3) Feel free to add docs ;-)

@jzonthemtn
Contributor

Is this good to merge?

@smarthi
Member

smarthi commented Jun 23, 2025

have u tried with 'Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz' ?

Assertions.assertEquals("deu", sentdetectModel.getLanguage());
private void prepareResources(boolean useTokenEnd) {
try {
Dictionary abbreviationDict = loadAbbDictionary(Locale.GERMAN);
Contributor

This could be done in an @BeforeAll test preparation step, as the abbreviationDict is constant for this scenario.
If left like this, more effort is required, as it is reloaded per actual test case.

Contributor Author

Thanks. This is done.
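
The setup discussed in this thread can be sketched roughly as follows: the constant abbreviation data is loaded once for all tests, while each test builds its own model through a parameterized helper. This is only a structural sketch under stated assumptions. `Model`, `loadAbbreviations` and the counter are hypothetical stand-ins so that the example runs without OpenNLP or JUnit; the real test class uses Dictionary, SentenceDetectorFactory, SentenceModel and a @BeforeAll method instead.

```java
import java.util.Set;

/** Structural sketch with stand-in types; not the actual test class. */
class SetupSketch {

  /** Hypothetical stand-in for a trained model that remembers its configuration. */
  static final class Model {
    final String language;
    final boolean useTokenEnd;
    final Set<String> abbreviations;

    Model(String language, boolean useTokenEnd, Set<String> abbreviations) {
      this.language = language;
      this.useTokenEnd = useTokenEnd;
      this.abbreviations = abbreviations;
    }
  }

  // Counts how often the constant resource is loaded (illustration only).
  static int dictionaryLoads = 0;

  // Loaded once for all tests; this plays the role of the @BeforeAll step.
  static final Set<String> ABBREVIATIONS = loadAbbreviations();

  // Per-test state, so each test instance holds the model it asked for.
  Model model;

  static Set<String> loadAbbreviations() {
    dictionaryLoads++;
    return Set.of("z.B.", "usw.");
  }

  // Called at the start of each test with the flag value that test needs.
  void prepareResources(boolean useTokenEnd) {
    model = new Model("deu", useTokenEnd, ABBREVIATIONS);
  }

  public static void main(String[] args) {
    SetupSketch defaultCase = new SetupSketch();
    defaultCase.prepareResources(true);

    SetupSketch tokenEndOff = new SetupSketch();
    tokenEndOff.prepareResources(false);

    System.out.println(defaultCase.model.useTokenEnd);
    System.out.println(tokenEndOff.model.useTokenEnd);
    System.out.println(dictionaryLoads);
  }
}
```

The trade-off is that training still happens once per test, but only the part that actually depends on the flag is repeated.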

@@ -45,19 +48,25 @@ public class SentenceDetectorMEGermanTest extends AbstractSentenceDetectorTest {

private static SentenceModel sentdetectModel;
Contributor

Given prepareResources(..) is not static any longer, this field should also avoid "static" and become an instance per Test case. Please consider removing static here.

Contributor Author

Thanks, this is done.

String[] sents = sentDetect.sentDetect(sent1 + sent2);
double[] probs = sentDetect.getSentenceProbabilities();
Assertions.assertEquals(1, probs.length);
assertAll(
Contributor

Could you apply the assertAll(..) pattern to the other two test scenarios analogously?
It reads nicely this way.

Contributor Author

Sure. This is done.

@NishantShri4
Contributor Author

Thanks very much, Reviewers. I am converting this to a draft to finish accommodating the review comments and to update the documentation for this flag. I will present it for review again shortly.

One query - Is there any work required for https://issues.apache.org/jira/browse/OPENNLP-205 (Refactor the SentenceDetectorME class to do the mapping of end-of-sent positions to spans better)?

@NishantShri4 NishantShri4 marked this pull request as draft June 23, 2025 19:01
@mawiesne
Contributor

One query - Is there any work required for https://issues.apache.org/jira/browse/OPENNLP-205

Great that you show interest in OPENNLP-205. Potentially: Yes.

It all starts with a test case that demonstrates the topic / issue.

Haven't had the time to look into it deeper, that is, from a functional perspective. If you have an understanding of "205", feel free to open a separate branch with a test exemplifying what's described in the issue for now. If it is hard to extract that, the issue might not be relevant any longer. Keep in mind: It's a lower three-digit issue.

Other opinions/comments welcome.

@NishantShri4
Contributor Author

have u tried with 'Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz' ?

This doesn't include any punctuation marks. Is the suggestion to use this for sentence detection?

@mawiesne mawiesne marked this pull request as ready for review June 24, 2025 18:56
@mawiesne mawiesne merged commit eab70aa into apache:main Jun 24, 2025
10 checks passed
mawiesne pushed a commit that referenced this pull request Jun 27, 2025
…se (#792)

* OPENNLP-1745: SentenceDetector - Add Junit test for useTokenEnd = false

* Added useTokenEnd to the list of optional params available for sentence detector tool.
mawiesne added a commit that referenced this pull request Jul 7, 2025
mawiesne added a commit that referenced this pull request Jul 7, 2025
mawiesne added a commit that referenced this pull request Jul 7, 2025
mawiesne added a commit that referenced this pull request Jul 7, 2025

5 participants