Skip to content

feat(developer): ldml load ccc from icu 🙀 #10515

Closed
srl295 wants to merge 3 commits intomasterfrom
feat/developer/10317-ccc-from-icu-epic-ldml
Closed

feat(developer): ldml load ccc from icu 🙀 #10515
srl295 wants to merge 3 commits intomasterfrom
feat/developer/10317-ccc-from-icu-epic-ldml

Conversation

@srl295
Copy link
Copy Markdown
Member

@srl295 srl295 commented Jan 25, 2024

#10317

  • segmenter isn't right, use a table from ICU4C
  • also verify that nfd.hasBoundaryBefore(ch) is exactly equivalent to isNFD(ch) && u_getCombiningClass(ch)==0

Currently this is a .ts file in common/web/types which is updated by a special call to the test_transforms executable. Alternatives:

  • ICU-aware regex was considered, but is not exposed in any current implementation
  • could be loaded through wasm - but, common doesn't link against wasm!
  • this file could be generated at build time, but this then makes common/ depend on core, or on kmcmplib (the two places ICU code is done)

@keymanapp-test-bot skip

@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Jan 25, 2024
@keymanapp-test-bot keymanapp-test-bot bot added this to the A17S31 milestone Jan 25, 2024
@srl295 srl295 changed the title feat(developer): ldml load ccc from icu :scream feat(developer): ldml load ccc from icu 🙀 Jan 25, 2024
@srl295 srl295 self-assigned this Jan 25, 2024
@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Jan 25, 2024
- verify that hasBoundaryBefore is exactly equivalent to lccc's API docs

For: #10317
@srl295 srl295 force-pushed the feat/developer/10317-ccc-from-icu-epic-ldml branch from e0b6acf to af3e339 Compare January 29, 2024 15:41
@srl295 srl295 changed the base branch from master to feat/developer/10317-compiler-norm-epic-ldml January 29, 2024 15:42
- use nfd-table
- note: Array.indexOf() was 10x faster than a function attempting to do
a smart search of the array

For: #10317
Copy link
Copy Markdown
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks solid, but I think I'd like to reduce the tech debt hole on file generation, by moving the tooling into a separate module, even though that's a little more work.

Various other feedback too 😁

@@ -0,0 +1,923 @@
// NFD hasBoundaryBefore
// Generated by test_transforms.cpp
// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, the tooling should not be under core 😁. But having instructions on it is probably good enough for now.

That said, I think we should put it under /core/tools if possible rather than under core/tests because it is going to become a maintenance burden otherwise. Appreciate that this means meson work.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I can work on that.

// Generated by test_transforms.cpp
// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd
export const ICU_VERSION='70.1';
export const UNICODE_VERSION='14.0';
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we be on Unicode 15.1?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm at a loss here. Where did this ICU come from ?!

maybe it was picked up on the path?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, if we are picking up ICU from path then we're going to get random versions. Not ideal!

// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd
export const ICU_VERSION='70.1';
export const UNICODE_VERSION='14.0';
export const nfdNoBoundaryBefore = [
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smells like it would really benefit from RLE. I threw together a quick test. It comes to 186 pairs, reduction in data down from 10kb to roughly 3kb (before minification), including an expansion function.

const nfdNoBoundaryBeforeRLE = [
  [0x300, 78],
  ...
  [0x1e944, 6],
];

export const nfdNoBoundaryBefore = nfdNoBoundaryBeforeRLE.reduce((prev,cur)=>{
  for(let i = cur[0]; i <= cur[0]+cur[1]; i++) {
    prev.push(i);
  }
  return prev;
},[]);

Hack function for generation (in js):

let rle = [];
let firstcp = null, lastcp = -1;
for(let cp of nfdNoBoundaryBefore) {
  if(cp !== lastcp+1) {
    if(firstcp != null) {
      rle.push(firstcp, lastcp-firstcp);
    }
    firstcp = cp;
  }
  lastcp = cp;
}
rle.push(firstcp, lastcp-firstcp);

console.log("export const nfdNoBoundaryBeforeRLE = [");
for(let i = 0; i < rle.length; i+=2) {
  console.log('  [0x'+rle[i].toString(16)+', ' + rle[i+1].toString() + '],');
}
console.log('];');

Needs further verification and polish, but I think a 70% reduction in size is probably worth pursuing, especially if we may need this data later in KeymanWeb.

Happy for this to be an optimization task for 18.0 -- as this is only being used for kmc in 17.0.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we will need it in Keyman Web.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, yeah, it's a complex folded trie structure in ICU.

const rest = a.slice(i).join('');
const p = MarkerParser.parse_next_marker(rest, forMatch);
const have_marker = !!(p?.match);
const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can a[i] be undefined and not a string? If a[i] is guaranteed to be a string then we should not be using optional chaining.

Suggested change
const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1);
const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i].codePointAt(0)) === -1);

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when i === str_end (see line 203) - this loop is called once at the end of the string.

so a[i] can be undefined.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment accordingly somewhere? Because that's not necessarily intuitive to run loop past the end 😁

Comment on lines +1061 to +1064
if (bb != lccc) {
printf("0x%04x - bb=%s but lccc=%s\n", (unsigned int)ch, bb ? "y" : "n", lccc ? "y" : "n");
}
assert(bb == lccc);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is an error condition then it really should fail in all cases, not just print a warning. asserts only fail in debug builds.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a test, so don't asserts always fail in tests?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really a test, it's a build process, and while it is currently hosted within a unit test executable, that will hopefully change, at which point we want to be aborting here even on a release build.

fprintf(f, "export const UNICODE_VERSION='%s';\n", U_UNICODE_VERSION);

fprintf(f, "export const nfdNoBoundaryBefore = [\n");
for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well start at 0x20 ... less weird than starting at 0, and looks more like a normal "Unicode Character Range".

Suggested change
for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {
for (km_core_usv ch = 0x20; ch < 0x10FFFF; ch++) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 0x20? Nulls are characters too!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, no worries. doesn't change output anyway.


#define NFD_FILE "common/web/types/src/ldml-keyboard/nfd-table.ts"
int
write_nfd_table() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function does not appear to be called from main()?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bad merge, ok i will fix this. (I had shelved this in favor of the segment)

Base automatically changed from feat/developer/10317-compiler-norm-epic-ldml to master January 30, 2024 16:06
@srl295
Copy link
Copy Markdown
Member Author

srl295 commented Jan 31, 2024

Obviated, see note on #10565

@srl295 srl295 closed this Jan 31, 2024
@srl295 srl295 deleted the feat/developer/10317-ccc-from-icu-epic-ldml branch February 1, 2024 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants