feat(developer): ldml load ccc from icu 🙀 #10515
Conversation
User Test ResultsTest specification and instructions User tests are not required Test Artifacts
|
e0b6acf to
af3e339
Compare
- use nfd-table - note: Array.indexOf() was 10x faster than a function attempting to do a smart search of the array For: #10317
mcdurdin
left a comment
There was a problem hiding this comment.
This change looks solid, but I think I'd like to reduce the tech debt hole on file generation, by moving the tooling into a separate module, even though that's a little more work.
Various other feedback too 😁
| @@ -0,0 +1,923 @@ | |||
| // NFD hasBoundaryBefore | |||
| // Generated by test_transforms.cpp | |||
| // regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd | |||
There was a problem hiding this comment.
Ideally, the tooling should not be under core 😁. But having instructions on it is probably good enough for now.
That said, I think we should put it under /core/tools if possible rather than under core/tests because it is going to become a maintenance burden otherwise. Appreciate that this means meson work.
| // Generated by test_transforms.cpp | ||
| // regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd | ||
| export const ICU_VERSION='70.1'; | ||
| export const UNICODE_VERSION='14.0'; |
There was a problem hiding this comment.
Shouldn't we be on Unicode 15.1?
There was a problem hiding this comment.
I'm at a loss here. Where did this ICU come from ?!
maybe it was picked up on the path?
There was a problem hiding this comment.
Oof, if we are picking up ICU from path then we're going to get random versions. Not ideal!
| // regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd | ||
| export const ICU_VERSION='70.1'; | ||
| export const UNICODE_VERSION='14.0'; | ||
| export const nfdNoBoundaryBefore = [ |
There was a problem hiding this comment.
This smells like it would really benefit from RLE. I threw together a quick test. It comes to 186 pairs, reduction in data down from 10kb to roughly 3kb (before minification), including an expansion function.
const nfdNoBoundaryBeforeRLE = [
[0x300, 78],
...
[0x1e944, 6],
];
export const nfdNoBoundaryBefore = nfdNoBoundaryBeforeRLE.reduce((prev,cur)=>{
for(let i = cur[0]; i <= cur[0]+cur[1]; i++) {
prev.push(i);
}
return prev;
},[]);Hack function for generation (in js):
let rle = [];
let firstcp = null, lastcp = -1;
for(let cp of nfdNoBoundaryBefore) {
if(cp !== lastcp+1) {
if(firstcp != null) {
rle.push(firstcp, lastcp-firstcp);
}
firstcp = cp;
}
lastcp = cp;
}
rle.push(firstcp, lastcp-firstcp);
console.log("export const nfdNoBoundaryBeforeRLE = [");
for(let i = 0; i < rle.length; i+=2) {
console.log(' [0x'+rle[i].toString(16)+', ' + rle[i+1].toString() + '],');
}
console.log('];');Needs further verification and polish, but I think a 70% reduction in size is probably worth pursuing, especially if we may need this data later in KeymanWeb.
Happy for this to be an optimization task for 18.0 -- as this is only being used for kmc in 17.0.
There was a problem hiding this comment.
we will need it in Keyman Web.
There was a problem hiding this comment.
well, yeah, it's a complex folded trie structure in ICU.
| const rest = a.slice(i).join(''); | ||
| const p = MarkerParser.parse_next_marker(rest, forMatch); | ||
| const have_marker = !!(p?.match); | ||
| const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1); |
There was a problem hiding this comment.
When can a[i] be undefined and not a string? If a[i] is guaranteed to be a string then we should not be using optional chaining.
| const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1); | |
| const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i].codePointAt(0)) === -1); |
There was a problem hiding this comment.
when i === str_end (see line 203) - this loop is called once at the end of the string.
so a[i] can be undefined.
There was a problem hiding this comment.
can you add a comment accordingly somewhere? Because that's not necessarily intuitive to run loop past the end 😁
| if (bb != lccc) { | ||
| printf("0x%04x - bb=%s but lccc=%s\n", (unsigned int)ch, bb ? "y" : "n", lccc ? "y" : "n"); | ||
| } | ||
| assert(bb == lccc); |
There was a problem hiding this comment.
If this is an error condition then it really should fail in all cases, not just print a warning. asserts only fail in debug builds.
There was a problem hiding this comment.
this is a test, so don't asserts always fail in tests?
There was a problem hiding this comment.
This is not really a test, it's a build process, and while it is currently hosted within a unit test executable, that will hopefully change, at which point we want to be aborting here even on a release build.
| fprintf(f, "export const UNICODE_VERSION='%s';\n", U_UNICODE_VERSION); | ||
|
|
||
| fprintf(f, "export const nfdNoBoundaryBefore = [\n"); | ||
| for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) { |
There was a problem hiding this comment.
Might as well start at 0x20 ... less weird than starting at 0, and looks more like a normal "Unicode Character Range".
| for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) { | |
| for (km_core_usv ch = 0x20; ch < 0x10FFFF; ch++) { |
There was a problem hiding this comment.
why 0x20? Nulls are characters too!
There was a problem hiding this comment.
ok, no worries. doesn't change output anyway.
|
|
||
| #define NFD_FILE "common/web/types/src/ldml-keyboard/nfd-table.ts" | ||
| int | ||
| write_nfd_table() { |
There was a problem hiding this comment.
This function does not appear to be called from main()?
There was a problem hiding this comment.
bad merge, ok i will fix this. (I had shelved this in favor of the segment)
|
Obviated, see note on #10565 |
#10317
nfd.hasBoundaryBefore(ch)is exactly equivalent toisNFD(ch) && u_getCombiningClass(ch)==0Currently this is a .ts file in common/web/types which is updated by a special call to the
test_transformsexecutable. Alternatives:@keymanapp-test-bot skip