feat(developer): ldml load ccc from icu 🙀 by srl295 · Pull Request #10515 · keymanapp/keyman

srl295 · 2024-01-25T22:12:49Z

- segmenter isn't right, use a table from ICU4C
- also verify that nfd.hasBoundaryBefore(ch) is exactly equivalent to isNFD(ch) && u_getCombiningClass(ch)==0

Currently this is a .ts file in common/web/types which is updated by a special call to the test_transforms executable. Alternatives:

- ICU-aware regex was considered, but is not exposed in any current implementation
- could be loaded through wasm - but, common doesn't link against wasm!
- this file could be generated at build time, but this then makes common/ depend on core, or on kmcmplib (the two places ICU code is done)

@keymanapp-test-bot skip

keymanapp-test-bot · 2024-01-25T22:12:54Z

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

Android
Developer
iOS
- Keyman for iOS (simulator image)
- FirstVoices Keyboards for iOS (simulator image)
- TestFlight internal PR build version - 17.0.255 (0.10515.10241)
Keyboards
- Test Keyboards
Linux
- Keyman for Linux
macOS
- Keyman for macOS
Web
- KeymanWeb Test Home
Windows

- add nfd table from ICU For: #10317

- verify that hasBoundaryBefore is exactly equivalent to lccc's API docs For: #10317

- use nfd-table - note: Array.indexOf() was 10x faster than a function attempting to do a smart search of the array For: #10317
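The commit note above compares Array.indexOf() with a hand-written search, but the PR does not show that search function. For readers who want to reproduce the comparison, here is a minimal sketch of the two lookups, assuming the table is a sorted array of unique code points. The names `sampleTable`, `inTableIndexOf` and `inTableBinary` are hypothetical and the sample values are illustrative only; this makes no claim about which variant is faster.

```typescript
// Hypothetical stand-in for the generated table: sorted, unique code points.
// The real table is generated from ICU; these values are illustrative only.
const sampleTable: number[] = [0x300, 0x301, 0x302, 0x327, 0x328, 0x1e944];

/** Linear scan, in the style used by the PR: true when cp is in the table. */
function inTableIndexOf(table: readonly number[], cp: number): boolean {
  return table.indexOf(cp) !== -1;
}

/** Binary search over the same sorted table (an assumed alternative). */
function inTableBinary(table: readonly number[], cp: number): boolean {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1; // midpoint of the current window
    const v = table[mid];
    if (v === cp) {
      return true;
    }
    if (v < cp) {
      lo = mid + 1; // cp can only be in the upper half
    } else {
      hi = mid - 1; // cp can only be in the lower half
    }
  }
  return false;
}
```

Both functions answer the same question, so either can back a check such as has_nfd_boundary_before; which one wins in practice would need to be measured against the real table.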

mcdurdin

This change looks solid, but I think I'd like to reduce the tech debt hole on file generation, by moving the tooling into a separate module, even though that's a little more work.

Various other feedback too 😁

mcdurdin · 2024-01-29T22:35:15Z

common/web/types/src/ldml-keyboard/nfd-table.ts

@@ -0,0 +1,923 @@
+// NFD hasBoundaryBefore
+// Generated by test_transforms.cpp
+// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd


Ideally, the tooling should not be under core 😁. But having instructions on it is probably good enough for now.

That said, I think we should put it under /core/tools if possible rather than under core/tests because it is going to become a maintenance burden otherwise. Appreciate that this means meson work.

Ok, I can work on that.

mcdurdin · 2024-01-29T22:36:38Z

common/web/types/src/ldml-keyboard/nfd-table.ts

+// Generated by test_transforms.cpp
+// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd
+export const ICU_VERSION='70.1';
+export const UNICODE_VERSION='14.0';


Shouldn't we be on Unicode 15.1?

keyman/resources/standards-data/unicode-character-database/Blocks.txt

Line 1 in ec2732f

# Blocks-15.1.0.txt

I'm at a loss here. Where did this ICU come from ?!

maybe it was picked up on the path?

Oof, if we are picking up ICU from path then we're going to get random versions. Not ideal!

mcdurdin · 2024-01-29T22:57:58Z

common/web/types/src/ldml-keyboard/nfd-table.ts

+// regenerate with: core/build/SOMETHING/SOMETHING/tests/unit/ldml/test_transforms --write-nfd
+export const ICU_VERSION='70.1';
+export const UNICODE_VERSION='14.0';
+export const nfdNoBoundaryBefore = [


This smells like it would really benefit from RLE. I threw together a quick test. It comes to 186 pairs, reduction in data down from 10kb to roughly 3kb (before minification), including an expansion function.

const nfdNoBoundaryBeforeRLE = [
  [0x300, 78],
  ...
  [0x1e944, 6],
];
export const nfdNoBoundaryBefore = nfdNoBoundaryBeforeRLE.reduce((prev,cur)=>{
  for(let i = cur[0]; i <= cur[0]+cur[1]; i++) {
    prev.push(i);
  }
  return prev;
},[]);

Hack function for generation (in js):

let rle = [];
let firstcp = null, lastcp = -1;
for(let cp of nfdNoBoundaryBefore) {
  if(cp !== lastcp+1) {
    if(firstcp != null) {
      rle.push(firstcp, lastcp-firstcp);
    }
    firstcp = cp;
  }
  lastcp = cp;
}
rle.push(firstcp, lastcp-firstcp);
console.log("export const nfdNoBoundaryBeforeRLE = [");
for(let i = 0; i < rle.length; i+=2) {
  console.log(' [0x'+rle[i].toString(16)+', ' + rle[i+1].toString() + '],');
}
console.log('];');

Needs further verification and polish, but I think a 70% reduction in size is probably worth pursuing, especially if we may need this data later in KeymanWeb.

Happy for this to be an optimization task for 18.0 -- as this is only being used for kmc in 17.0.
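A typed version of the run-length idea can be sketched as follows, assuming the same pair layout as in the comment above: each pair holds the first code point of a run and the number of additional code points after it (run length minus one). The names `Run`, `encodeRuns`, `decodeRuns` and `inRuns` are hypothetical, and `inRuns` is an added illustration of testing membership without expanding the table; none of this is code from the PR.

```typescript
// [start, extra]: a run covers start..start+extra inclusive (extra = length - 1).
type Run = [number, number];

/** Collapse a sorted list of unique code points into runs. */
function encodeRuns(sorted: readonly number[]): Run[] {
  const runs: Run[] = [];
  for (const cp of sorted) {
    const last = runs[runs.length - 1];
    if (last !== undefined && cp === last[0] + last[1] + 1) {
      last[1]++; // cp extends the current run by one
    } else {
      runs.push([cp, 0]); // gap (or first element): start a new run
    }
  }
  return runs;
}

/** Expand runs back into the flat list of code points. */
function decodeRuns(runs: readonly Run[]): number[] {
  const out: number[] = [];
  for (const [start, extra] of runs) {
    for (let cp = start; cp <= start + extra; cp++) {
      out.push(cp);
    }
  }
  return out;
}

/** Membership test directly on the runs, without expanding them. */
function inRuns(runs: readonly Run[], cp: number): boolean {
  return runs.some(([start, extra]) => cp >= start && cp <= start + extra);
}
```

A round trip through `encodeRuns` and `decodeRuns` should reproduce the input exactly, which is one way to do the "further verification" mentioned above before replacing the flat table.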

we will need it in Keyman Web.

well, yeah, it's a complex folded trie structure in ICU.

mcdurdin · 2024-01-29T23:02:12Z

common/web/types/src/ldml-keyboard/pattern-parser.ts

      const rest = a.slice(i).join('');
      const p = MarkerParser.parse_next_marker(rest, forMatch);
      const have_marker = !!(p?.match);
+      const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1);


When can a[i] be undefined and not a string? If a[i] is guaranteed to be a string then we should not be using optional chaining.

Suggested change

const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i]?.codePointAt(0)) === -1);

const has_nfd_boundary_before = (nfdNoBoundaryBefore.indexOf(a[i].codePointAt(0)) === -1);

when i === str_end (see line 203) - this loop is called once at the end of the string.

so a[i] can be undefined.

can you add a comment accordingly somewhere? Because it's not necessarily intuitive to run the loop past the end 😁
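The end-of-string case discussed here can be illustrated with a small sketch. It assumes a loop that makes one extra pass at i === a.length, as described in the reply above; the table `noBoundaryBefore`, the helper `hasNfdBoundaryBefore` and the sample string are hypothetical and are not taken from pattern-parser.ts.

```typescript
// Hypothetical stand-in for the generated nfdNoBoundaryBefore table.
const noBoundaryBefore: number[] = [0x300, 0x301, 0x327];

/**
 * True when there is an NFD boundary before ch.
 * ch is undefined on the extra pass one position past the end of the string,
 * which is treated as a boundary.
 */
function hasNfdBoundaryBefore(ch: string | undefined): boolean {
  // undefined?.codePointAt(0) evaluates to undefined rather than throwing.
  const cp = ch?.codePointAt(0);
  return cp === undefined || noBoundaryBefore.indexOf(cp) === -1;
}

// Split by code point: ['e', '\u0301', 'x']
const a = Array.from('e\u0301x');
const boundaries: boolean[] = [];
// Note the <=: the loop deliberately runs once past the end of the string.
for (let i = 0; i <= a.length; i++) {
  boundaries.push(hasNfdBoundaryBefore(a[i]));
}
// boundaries is [true, false, true, true]
```

Making the undefined case explicit in a helper like this is one way to document why the optional chaining is needed, as an alternative to a comment at the loop itself.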

mcdurdin · 2024-01-29T23:05:32Z

core/tests/unit/ldml/test_transforms.cpp

+    if (bb != lccc) {
+      printf("0x%04x - bb=%s but lccc=%s\n", (unsigned int)ch, bb ? "y" : "n", lccc ? "y" : "n");
+    }
+    assert(bb == lccc);


If this is an error condition then it really should fail in all cases, not just print a warning. asserts only fail in debug builds.

this is a test, so don't asserts always fail in tests?

This is not really a test, it's a build process, and while it is currently hosted within a unit test executable, that will hopefully change, at which point we want to be aborting here even on a release build.

mcdurdin · 2024-01-29T23:07:01Z

core/tests/unit/ldml/test_transforms.cpp

+  fprintf(f, "export const UNICODE_VERSION='%s';\n", U_UNICODE_VERSION);
+
+  fprintf(f, "export const nfdNoBoundaryBefore = [\n");
+  for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {


Might as well start at 0x20 ... less weird than starting at 0, and looks more like a normal "Unicode Character Range".

Suggested change

for (km_core_usv ch = 0; ch < 0x10FFFF; ch++) {

for (km_core_usv ch = 0x20; ch < 0x10FFFF; ch++) {

why 0x20? Nulls are characters too!

ok, no worries. doesn't change output anyway.

mcdurdin · 2024-01-29T23:08:13Z

core/tests/unit/ldml/test_transforms.cpp

+
+#define NFD_FILE "common/web/types/src/ldml-keyboard/nfd-table.ts"
+int
+write_nfd_table() {


This function does not appear to be called from main()?

bad merge, ok i will fix this. (I had shelved this in favor of the segment)

srl295 · 2024-01-31T21:25:17Z

Obviated, see note on #10565

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Jan 25, 2024

keymanapp-test-bot bot added this to the A17S31 milestone Jan 25, 2024

github-actions bot added common/ core/ Keyman Core common/web/ developer/ feat labels Jan 25, 2024

srl295 changed the title ~~feat(developer): ldml load ccc from icu :scream~~ feat(developer): ldml load ccc from icu 🙀 Jan 25, 2024

srl295 self-assigned this Jan 25, 2024

keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Jan 25, 2024

github-actions bot added developer/ and removed developer/ labels Jan 25, 2024

srl295 added 2 commits January 29, 2024 09:39

feat(developer): ldml marker normalization in ts 🙀

71b6674

- add nfd table from ICU For: #10317

feat(developer): ldml marker normalization in ts 🙀

af3e339

- verify that hasBoundaryBefore is exactly equivalent to lccc's API docs For: #10317

srl295 force-pushed the feat/developer/10317-ccc-from-icu-epic-ldml branch from e0b6acf to af3e339 Compare January 29, 2024 15:41

github-actions bot added the developer/compilers/ label Jan 29, 2024

srl295 changed the base branch from master to feat/developer/10317-compiler-norm-epic-ldml January 29, 2024 15:42

feat(developer): dev side norm 🙀

769327e

- use nfd-table - note: Array.indexOf() was 10x faster than a function attempting to do a smart search of the array For: #10317

github-actions bot added developer/ developer/compilers/ and removed developer/ developer/compilers/ labels Jan 29, 2024

srl295 marked this pull request as ready for review January 29, 2024 21:07

srl295 requested review from jahorton, mcdurdin and rc-swag as code owners January 29, 2024 21:07

mcdurdin requested changes Jan 29, 2024


Base automatically changed from feat/developer/10317-compiler-norm-epic-ldml to master January 30, 2024 16:06

srl295 mentioned this pull request Jan 31, 2024

feat(core,developer): simplify markers 🙀 #10565

Merged

srl295 closed this Jan 31, 2024

srl295 deleted the feat/developer/10317-ccc-from-icu-epic-ldml branch February 1, 2024 04:30

srl295 mentioned this pull request May 24, 2024

feat(core): devolve normalization to js 🙀 #11541

Merged

