Skip to content

ICU-22080 Add plain ASCII as an explicitly detected type#2127

Open
koppor wants to merge 1 commit into
unicode-org:mainfrom
JabRef:addASciiTest
Open

ICU-22080 Add plain ASCII as an explicitly detected type#2127
koppor wants to merge 1 commit into
unicode-org:mainfrom
JabRef:addASciiTest

Conversation

@koppor

@koppor koppor commented Jul 4, 2022

Copy link
Copy Markdown
Checklist
  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22080
  • Required: The PR title must be prefixed with a JIRA Issue number.
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number.
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@CLAassistant

CLAassistant commented Jul 4, 2022

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@jira-pull-request-webhook

Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/charset/src/com/ibm/icu/dev/test/charset/TestCharSetRecognition.java is no longer changed in the branch
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook

Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@koppor koppor marked this pull request as draft July 5, 2022 06:55
@jira-pull-request-webhook

Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java is different
  • icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/CharsetDetectionTests.xml is different
  • icu4j/main/tests/core/src/com/ibm/icu/dev/test/charsetdet/TestCharsetDetector.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@koppor koppor marked this pull request as ready for review July 6, 2022 22:02
@markusicu markusicu self-assigned this Jul 14, 2022
@markusicu markusicu requested a review from richgillam July 14, 2022 16:02
@richgillam

Copy link
Copy Markdown
Contributor

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

@markusicu markusicu requested a review from FrankYFTang July 14, 2022 16:03
@markusicu

Copy link
Copy Markdown
Member

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.

Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

Comment thread icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java Outdated
Comment thread icu4j/main/classes/core/src/com/ibm/icu/text/CharsetDetector.java Outdated
Comment thread icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java Outdated
return null;
} else {
// ASCII, because ALL bytes in the stream are <= 127.
// However, there could be some unicode (such as Hebrew) which also has this property.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm? It could be some unusual charset like UTF-7 or HZ, but those are "prohibited character encodings" in modern HTML and have generally fallen out of favor.

I also don't think that Hebrew is relevant here.

Seems like the confidence for ASCII should be high if all bytes are 00..7F.

Comment thread icu4j/main/classes/core/src/com/ibm/icu/text/CharsetRecog_ASCII.java Outdated
@markusicu

Copy link
Copy Markdown
Member

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

@macchiati

macchiati commented Sep 10, 2022 via email

Copy link
Copy Markdown
Member

@srl295

srl295 commented Sep 10, 2022

Copy link
Copy Markdown
Member

FYI @srl295 @aheninger

Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic?

ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical id is ansi xj 1967 something.

So.. I think ASCII is fine probably more recognizable these days.

@koppor

koppor commented Sep 24, 2022

Copy link
Copy Markdown
Author

I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++.
Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR...

Oh, OK, I understood the "we" in a wrong way. I would suggest to get the Java code finished and then I'll look around for a C++ expert having the time to work on this.

@koppor

koppor commented Sep 24, 2022

Copy link
Copy Markdown
Author

I committed the suggestions using GitHub's features. I did some other tweaks. The tests in "TestCharsetDector" successfully run locally. Not sure why. I think, there will be some other classes failing, so I'll wait for the CI.

After "we" sorted everything out, I'll squash into one commit and add a Co-authored-by for @markusicu

@koppor

koppor commented Nov 27, 2022

Copy link
Copy Markdown
Author

Note that I needed to replace ISO-2022-JP in the context of com.ibm.icu.dev.test.charsetdet.TestCharsetDetector#TestBufferOverflow by ASCII as the byte sequence (line 283) does not contain any non-7-bit characters. Think, it is no harm, because the shift state "at the start" is "a bad one".

@koppor

koppor commented Nov 27, 2022

Copy link
Copy Markdown
Author

Java code finished. Now, I would try to port it to the C implementation.

@koppor

koppor commented Jun 7, 2026

Copy link
Copy Markdown
Author

It was hard to find time and knowledge concerning the CPP port. Fortunately, there is Claude Code now. I have used it to a) do the pot and b) merge the upstream/main.

I left the commits separate to ease reviewing. I can do a squash after everything is settled.

Update: Will create one commit - I think, for everyone too long ago.

@jira-pull-request-webhook

Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

srl295
srl295 previously approved these changes Jun 8, 2026
Comment thread icu4c/source/i18n/csrascii.cpp Outdated
Comment on lines +4 to +8
**********************************************************************
* Copyright (C) 2005-2022, International Business Machines
* Corporation and others. All Rights Reserved.
**********************************************************************
*/

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can drop the IBM copyright if it's a new file— this was only for source being brought in before 2016.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, new file. I adapted the copyright accordingly.

Add a CharsetRecog_ASCII recognizer that reports "ASCII" when every byte
is 7-bit (<= 127). It reuses CharsetRecog_8859_1 for language detection
and confidence (95 when no language matches, otherwise the 8859-1
confidence + 1), and is registered first so pure 7-bit text prefers
ASCII over ISO-8859-1.

Implemented in both ICU4J (CharsetRecog_ASCII.java) and ICU4C
(csrascii.h/.cpp), with the C/C++ files added to sources.txt, the Visual
Studio projects and the depstest dependency list. Charset detection
tests in both languages are updated to expect ASCII for pure 7-bit input.

Co-authored-by: Carl Christian Snethlage <50491877+calixtus@users.noreply.github.com>
Co-authored-by: Christoph <siedlerkiller@gmail.com>
Co-authored-by: Markus Scherer <markus.icu@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jira-pull-request-webhook

Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants