ICU-22080 Add plain ASCII as an explicitly detected type #2127
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
|
I notice the implementation is only in Java. To put a change like this into ICU we'll also need it ported back to C/C++. |
Hi @koppor I see that you gave Rich's comment a 👍 but I think @richgillam was really asking whether you would be willing to add the C/C++ port in this PR... |
| return null; | ||
| } else { | ||
| // ASCII, because ALL bytes in the stream are <= 127. | ||
| // However, there could be some unicode (such as Hebrew) which also has this property. |
Hm? It could be some unusual charset like UTF-7 or HZ, but those are "prohibited character encodings" in modern HTML and have generally fallen out of favor.
I also don't think that Hebrew is relevant here.
Seems like the confidence for ASCII should be high if all bytes are 00..7F.
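To make the "all bytes are 00..7F" condition concrete, here is a minimal sketch in Java. The class and method names (`SevenBitCheck`, `isSevenBit`) are hypothetical and not taken from the PR. The one subtlety is that Java bytes are signed, so values 0x80..0xFF show up as negative numbers.

```java
import java.nio.charset.StandardCharsets;

public class SevenBitCheck {
    /** Returns true if every byte is in 0x00..0x7F. Hypothetical helper, not from the PR. */
    public static boolean isSevenBit(byte[] input) {
        for (byte b : input) {
            // Java bytes are signed: 0x80..0xFF appear as negative values.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] plain = "Hello, world".getBytes(StandardCharsets.US_ASCII);
        // The Hebrew word "shalom", written with escapes so the source file stays ASCII.
        byte[] hebrew = "\u05E9\u05DC\u05D5\u05DD".getBytes(StandardCharsets.UTF_8);
        System.out.println(isSevenBit(plain));
        System.out.println(isSevenBit(hebrew));
    }
}
```

Hebrew text encoded as UTF-8 or ISO-8859-8 contains bytes above 0x7F, so it fails this check, which is why it should not affect the ASCII confidence.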
|
FYI @srl295 @aheninger Also, should we return "US-ASCII" which is more specific than "ASCII"? Or is that too pedantic? |
|
I think ASCII is sufficiently unambiguous.
|
ASCII is the same. US-ASCII is probably slightly more pedantic. The canonical ID is something like ANSI_X3.4-1968. So I think ASCII is fine, and probably more recognizable these days. |
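For readers who consume the detected name from Java, a small sketch of how the JDK treats these labels may help. This only uses `java.nio.charset.Charset` and says nothing about how ICU itself resolves names; the class name `AsciiNames` and the helper `canonical` are hypothetical.

```java
import java.nio.charset.Charset;

public class AsciiNames {
    /** Returns the JDK's canonical name for a charset label. Hypothetical helper. */
    public static String canonical(String label) {
        return Charset.forName(label).name();
    }

    public static void main(String[] args) {
        // In the JDK, "ASCII" is an alias of the standard charset US-ASCII.
        System.out.println(canonical("ASCII"));
        System.out.println(canonical("US-ASCII"));
    }
}
```

So a caller that receives "ASCII" from the detector can pass it straight to `Charset.forName` and get the same charset as with "US-ASCII".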
Oh, OK, I misunderstood the "we". I suggest finishing the Java code first, and then I'll look around for a C++ expert who has the time to work on this. |
|
I committed the suggestions using GitHub's features and made some other tweaks. The tests in "TestCharsetDetector" run successfully locally. Not sure why. I think there will be some other classes failing, so I'll wait for the CI. After "we" have sorted everything out, I'll squash into one commit and add a |
|
Note that I needed to replace |
|
The Java code is finished. Next, I will try to port it to the C implementation. |
|
It was hard to find the time and knowledge for the C++ port. Fortunately, there is Claude Code now. I have used it to a) do the port and b) merge upstream/main.
Update: I will squash everything into one commit, since I think the earlier commits are too long ago for everyone. |
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
| ********************************************************************** | ||
| * Copyright (C) 2005-2022, International Business Machines | ||
| * Corporation and others. All Rights Reserved. | ||
| ********************************************************************** | ||
| */ |
You can drop the IBM copyright if it's a new file; this was only for source brought in before 2016.
Yeah, new file. I adapted the copyright accordingly.
Add a CharsetRecog_ASCII recognizer that reports "ASCII" when every byte is 7-bit (<= 127). It reuses CharsetRecog_8859_1 for language detection and confidence (95 when no language matches, otherwise the 8859-1 confidence + 1), and is registered first so pure 7-bit text prefers ASCII over ISO-8859-1.

Implemented in both ICU4J (CharsetRecog_ASCII.java) and ICU4C (csrascii.h/.cpp), with the C/C++ files added to sources.txt, the Visual Studio projects and the depstest dependency list.

Charset detection tests in both languages are updated to expect ASCII for pure 7-bit input.

Co-authored-by: Carl Christian Snethlage <50491877+calixtus@users.noreply.github.com>
Co-authored-by: Christoph <siedlerkiller@gmail.com>
Co-authored-by: Markus Scherer <markus.icu@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
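The confidence rule in the commit message can be sketched roughly as follows. This is not the code from the PR: `AsciiConfidenceSketch`, `asciiConfidence` and the `NO_MATCH` sentinel are hypothetical, the 8859-1 result is passed in as a plain number instead of calling the real recognizer, and the cap at 100 is an assumption.

```java
public class AsciiConfidenceSketch {
    /** Sentinel for "not ASCII"; an assumption of this sketch, not the PR's return value. */
    public static final int NO_MATCH = -1;

    /**
     * @param input            raw bytes to classify
     * @param latin1Confidence confidence reported by an ISO-8859-1 language match,
     *                         or 0 when no language matched (assumed representation)
     */
    public static int asciiConfidence(byte[] input, int latin1Confidence) {
        for (byte b : input) {
            if (b < 0) {
                return NO_MATCH; // a byte above 0x7F rules out ASCII
            }
        }
        if (latin1Confidence <= 0) {
            return 95; // pure 7-bit input, no language recognized
        }
        // One more than ISO-8859-1 so ASCII ranks above it; capped at 100 (assumption).
        return Math.min(100, latin1Confidence + 1);
    }

    public static void main(String[] args) {
        byte[] text = {0x68, 0x69}; // "hi"
        System.out.println(asciiConfidence(text, 0));
        System.out.println(asciiConfidence(text, 40));
    }
}
```

Adding 1 to the 8859-1 confidence is what lets ASCII outrank ISO-8859-1 for the same 7-bit input when matches are sorted by confidence.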