
feat(core): support normalization in Core 🌱 #9999

@mcdurdin

Description

Impacted modules

  • Core Consumers:
    • Keyman Engine for Windows
    • Keyman Engine for macOS
    • Keyman Engine for Linux
    • Keyman Developer Debugger
  • Keyman Core
  • Core Keyboard Processors:
    • LDML Keyboard Processor
  • Unit tests

Steps

  1. Add a Core API to convert km_core_actions to km_core_action_item* (mirrors action_item_list_to_actions_object)

  2. Update Engines to always call km_core_state_context_set_if_needed, and stop managing their own cached context.

  3. LDML processor can then generate km_core_actions and existing engines can consume it with no further modification.

  4. Normalization support for LDML keyboards:

  • input context (km_core_state_context_set_if_needed)
  • and action output (km_core_state_get_actions and actions_object_to_action_item_list)

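A toy model of the conversion in step 1, with hypothetical shapes for both representations (the real km_core_actions and km_core_action_item definitions live in Core's headers; the tags and field names here are purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class Actions:
    # Hypothetical mirror of km_core_actions: one consolidated struct.
    code_points_to_delete: int = 0
    output: str = ""
    emit_keystroke: bool = False

def actions_to_items(a: Actions) -> list[tuple[str, object]]:
    """Flatten the consolidated struct into an ordered action-item
    queue, roughly what a km_core_actions -> km_core_action_item*
    converter would produce (tag names are illustrative)."""
    items: list[tuple[str, object]] = []
    for _ in range(a.code_points_to_delete):
        items.append(("BACK", 1))
    if a.output:
        items.append(("CHAR", a.output))
    if a.emit_keystroke:
        items.append(("EMIT_KEYSTROKE", None))
    items.append(("END", None))
    return items
```

Because the struct is a summary and the queue is a sequence, the conversion is lossy-free in this direction; the reverse direction (mirroring action_item_list_to_actions_object) has to coalesce runs of items instead.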
17.0 target API usage is:

  • kmn: emit queue
  • ldml: emit struct
  • windows: read queue
  • mac: read struct
  • linux: read queue
  • debugger: read queue
  • ldml unit tests: read struct
  • Core will translate between the two representations only as needed; this way, the big blocker (the debugger) can wait until 18.0.

17.0 TODO

Move to km_core_state_context_set_if_needed

Longer term

  1. Move to the action struct in Engine for Windows, Developer Debugger, and Engine for Linux (#10353) {tech debt}
  • Removes dead-wood APIs
  • Possible difficulties with the Keyman Developer debugger (it will need sideband data on the cached context, as the debugger is, and must continue to be, marker-aware)
  2. Move to the action struct for output from the kmn Keyboard Processor. The Action Queue then becomes an internal component of the kmn Keyboard Processor (and the debugger will query the kmn Keyboard Processor directly).

Implementation notes

In km_core_state_context_set_if_needed, we'll need to do something like this:

  // TODO: normalization
  km_core_cp* text;
  switch(get_keyboard_processor_normalization()) {
    case NFD: text = normalize_nfd(application_context); break;
    case NFC: text = normalize_nfc(application_context); break;
    default: /* NFU: pass the application context through unchanged */ text = application_context;
  }
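A rough Python model of that dispatch, using `unicodedata` from the standard library (the `Normalization` enum and `normalize_context` name are illustrative, not the Core API):

```python
import unicodedata
from enum import Enum

class Normalization(Enum):
    NFU = 0   # Normalization Form Unknown: pass through untouched
    NFC = 1
    NFD = 2

def normalize_context(app_context: str, form: Normalization) -> str:
    """Mirror of the switch in km_core_state_context_set_if_needed:
    transform the application context into the form the keyboard
    processor expects, or leave it alone for NFU."""
    if form is Normalization.NFD:
        return unicodedata.normalize("NFD", app_context)
    if form is Normalization.NFC:
        return unicodedata.normalize("NFC", app_context)
    return app_context  # NFU: keep the app's exact text
```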

In km_core_state_get_actions and actions_object_to_action_item_list, Core will need to track the context back to the last base codepoint (per Unicode spec) to give a starting anchor for normalization of output.
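A minimal sketch of finding that anchor, approximating "base codepoint" as any starter (canonical combining class 0); a full implementation would also consult NFC_Quick_Check per UAX #15:

```python
import unicodedata

def normalization_anchor(context: str) -> int:
    """Walk back from the end of the context to the last base
    codepoint (ccc == 0), which can serve as the starting anchor
    for renormalizing output.  ccc == 0 is an approximation of a
    normalization-stable position."""
    i = len(context)
    while i > 0 and unicodedata.combining(context[i - 1]) != 0:
        i -= 1
    return max(i - 1, 0)  # index of the anchor codepoint
```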

Core will need to maintain a copy of the exact text from the app as well as the NFD cached context for LDML, and sync the two up to determine the number of codepoints to delete from the app. Remember that the input context is in NFU ("Normalization Form Unknown") and may well have mixed content; we need to know exactly how many codepoints to delete from that mixed content.

Application context synchronization

Note: this discussion is working in UTF32; if working in UTF16 then consideration needs to be given to whether counts are in codepoints or codeunits.

Given an application context in NFU and a cached context in NFD (for the purposes of this discussion, stripped of markers), we need to be able to calculate the number of characters to delete.

  • The transform function has two parameters: del (number of characters) and ins (string to insert).
  • cached_context is always NFD.
  • app_context is always NFU.

The LDML keyboard processor will return a number of NFD codepoints to delete. However, this number cannot be passed straight to the app, because the normalization of the string may not match up. We also need to consider normalization with string concatenation. This requires a two-step process:

  1. Normalize the insertion side of the concatenation boundary (https://unicode.org/reports/tr15/#Concatenation):
  • remove del characters from cached_context
  • while cached_context is not empty and the last character of cached_context is not a stable character, prepend that character to ins and increment del.
let chopped_cached_context = cached_context.substr(0, cached_context.length - del)
while(chopped_cached_context != "" && !is_stable(chopped_cached_context.last())) {
  let ch = chopped_cached_context.pop()
  ins.prepend(ch)
  del++
}
  2. Calculate the number of characters to delete from app_context, plus any fixups that need to be prepended to the inserted text ins because the change happened over a normalization boundary:
let app_del = 0
let new_app_context = app_context
let chopped_cached_context = cached_context.substr(0, cached_context.length - del)

while(nfd(new_app_context).length > chopped_cached_context.length) {
  new_app_context.pop()
  app_del++
}

ins.prepend(chopped_cached_context.substr(nfd(new_app_context).length))
  • app_del will then be the number of characters to delete from the document
  • ins will be the text to insert
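Under the same assumptions as the pseudocode above (strings of codepoints, markers already stripped, and `is_stable` approximated by ccc == 0), the two steps can be sketched as runnable Python:

```python
import unicodedata

def is_stable(ch: str) -> bool:
    # Approximation: treat any starter (ccc == 0) as a stable anchor.
    # A full check would also require NFC_Quick_Check == Yes (UAX #15).
    return unicodedata.combining(ch) == 0

def adjust_transform(cached_context: str, del_count: int, ins: str):
    """Step 1: widen the deletion back to a stable character so the
    insertion side of the concatenation boundary normalizes cleanly."""
    chopped = cached_context[: len(cached_context) - del_count]
    while chopped and not is_stable(chopped[-1]):
        ins = chopped[-1] + ins
        chopped = chopped[:-1]
        del_count += 1
    return chopped, del_count, ins

def app_deletions(app_context: str, chopped_cached_context: str, ins: str):
    """Step 2: compute how many codepoints to delete from the NFU
    application context, plus any fixup text to prepend to ins when
    the change straddled a normalization boundary."""
    new_app_context = app_context
    app_del = 0
    while len(unicodedata.normalize("NFD", new_app_context)) > len(chopped_cached_context):
        new_app_context = new_app_context[:-1]
        app_del += 1
    fixup = chopped_cached_context[len(unicodedata.normalize("NFD", new_app_context)):]
    return app_del, fixup + ins
```

For example, with an NFC app context "café" and NFD cached context "cafe\u0301", an LDML transform deleting one NFD codepoint (the U+0301) must become "delete one codepoint (é) from the app, insert e".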

UTF-16 Considerations

Core has enough data to provide both the number of codepoints and the number of UTF-16 code units to delete. We recommend providing both in the actions object to consumers, as that covers both of our deletion use cases (injecting backspaces vs. direct string manipulation).
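A small sketch of computing both counts for a deletion span, assuming Python strings of codepoints (supplementary-plane characters cost two UTF-16 code units):

```python
def deletion_counts(deleted_text: str) -> tuple[int, int]:
    """Return (codepoints, utf16_code_units) for a span of text to be
    deleted, so a consumer can either inject that many backspaces
    (codepoint-based) or adjust a UTF-16 buffer directly."""
    codepoints = len(deleted_text)
    utf16_units = sum(2 if ord(ch) > 0xFFFF else 1 for ch in deleted_text)
    return codepoints, utf16_units
```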
