Skip to content

spec(core): marker spec 🙀#9196

Merged
srl295 merged 4 commits intomasterfrom
core/9118-spec-markers-epic-ldml
Jul 7, 2023
Merged

spec(core): marker spec 🙀#9196
srl295 merged 4 commits intomasterfrom
core/9118-spec-markers-epic-ldml

Conversation

@srl295
Copy link
Copy Markdown
Member

@srl295 srl295 commented Jul 5, 2023

  • first cut

Fixes: #9118

@keymanapp-test-bot skip

@srl295 srl295 self-assigned this Jul 5, 2023
@srl295 srl295 requested review from mcdurdin and rc-swag as code owners July 5, 2023 23:16
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Jul 5, 2023
@keymanapp-test-bot
Copy link
Copy Markdown

keymanapp-test-bot bot commented Jul 5, 2023

@srl295 srl295 force-pushed the core/9118-spec-markers-epic-ldml branch from 9b276ec to 5510321 Compare July 5, 2023 23:16
@github-actions github-actions bot added core/ Keyman Core epic-ldml labels Jul 5, 2023
@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Jul 5, 2023
@srl295 srl295 requested review from ermshiperete and jahorton July 5, 2023 23:17
@srl295 srl295 added this to the A17S16 milestone Jul 5, 2023
## Implementation (core)

- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility. I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the LDML processor can push the U+FFFF sequences as markers to the core processor using state->context().push_marker(a.dwData). I think it should be required that it does correlate the marker number with the marker id . This is because in the Windows implementation, the platform layer has the context including markers (when) possible and it will set the context in the core on each keystroke, this propagates down to the ldml processor and therefore there needs to be a correlation of some type so the when converted back to the ldml U+FFFF ..... it matches the ldml rules.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To define terms here, the 'number' is just the serial number, so if we had \m{a},\m{banana},\m{c} as the only markers in a keyboard they would get number 0, 1, 2 respectively (alphabetic order). with ids (strings) of a, banana, c. Looking at push_marker() it takes a uint32_t. So I would expect we would do state->context().push_marker(0), state->context().push_marker(1), state->context().push_marker(2) for those three markers, right?

- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility. I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Core needs to maintain its markers. Backspace processsing does not ignore markers, but rather recognises them and removes them as appropriate, and emit backpaces to the OS as appropriate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Web, if a marker exists between the current caret position and the text to be deleted by a backspace, the marker is deleted. I believe this also affects markers immediately preceding backspace-deleted text as well. I'm pretty sure this should line up with Core.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 different backspace modes in Core:

  1. User-pressed default backspace handling: this deletes one codepoint to left of caret, along with all markers on either side of that codepoint. (A submode of this is where there is no context, but that doesn't impact marker handling). Result of no matching rule for backspace in the keyboard.
  2. Delete marker: emitted by the keyboard processor to the engine to delete one and only one marker prior to caret. Action coming from rule processing.
  3. Delete codepoint: emitted by the keyboard processor to the engine to delete one and only one character prior to caret. Action coming from rule processing.

Copy link
Copy Markdown
Member Author

@srl295 srl295 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I've updated slightly, though haven't captured everything here.
Marc to your numbers:
№1: Ok, this is the default behavior. it's something like the regex from the spec (though see note): (:?\m{.})*.(:?\m{.})*

№2 / №3 - don't these correspond to:

№2: <transforms type="backspace">…<transform from="\m{.}"/>…</transforms>
№3: <transforms type="backspace">…<transform from="."/>…</transforms>

… but the user could use these or any number of other possible rules?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more about how keyboard processor implements the transform in terms of actions that it passes back out to Core to handover to the Engine.

- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility. I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Web, if a marker exists between the current caret position and the text to be deleted by a backspace, the marker is deleted. I believe this also affects markers immediately preceding backspace-deleted text as well. I'm pretty sure this should line up with Core.

- Keyman already uses U_SENTINEL `U+FFFF` (noncharacter)
- The general proposal here is to use the sequence `U+FFFF U+EXXX` to represent marker #XXX
- `U+FFFF` cannot otherwise occur in text, so it is unique
- `U+EXXX` is always a private use character, so the marker number cannot collide with non-PUA text, however it may certainly collide with PUA text. The intention here is to reduce this possibility of collision, at least when (a) humans read a binary stream or (b) human _error_ causes the marker to show up in acutal text. As a counter example, if we used `U+FFFF U+0022` to indicate marker 0x22, then the marker might show up as a doublequote (`"`).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's worth noting that the kmx keyboard processor uses U+0001 and up as its marker code space. These have never escaped the confines of Keyman Core (and in fact, are entirely internal to kmx keyboard processor and are not even visible on any API surfaces, except debug log ones).

So, the proposed safety net may not be necessary -- if we start at U+0001, then we get 65533 (U+0000, U+FFFF, and U+FFFE are reserved) codes to play with, which I think is more generous than 4095 markers, even if we can't see an immediate need for that many.

Copy link
Copy Markdown
Member Author

@srl295 srl295 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think following in kmx's footsteps is an even better safety net. probably use U+FFFF for 'all'

- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility. I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 3 different backspace modes in Core:

  1. User-pressed default backspace handling: this deletes one codepoint to left of caret, along with all markers on either side of that codepoint. (A submode of this is where there is no context, but that doesn't impact marker handling). Result of no matching rule for backspace in the keyboard.
  2. Delete marker: emitted by the keyboard processor to the engine to delete one and only one marker prior to caret. Action coming from rule processing.
  3. Delete codepoint: emitted by the keyboard processor to the engine to delete one and only one character prior to caret. Action coming from rule processing.

Comment on lines +56 to +57
- Transforms will need to match against the marker or markers desired, so may need to emit sequences such as `(?:\uFFFF\uE123)` meaning a match to marker #0x123
- matching `\m{.}` may need to expand to `(?:\uFFFF[\uE000-\uEFFE])`
Copy link
Copy Markdown
Member Author

@srl295 srl295 Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mcdurdin although… this is why I was using a PUA marker: because otherwise the regex will need to emit (?:\uFFFF\u0022) for the 34th marker or (?:\uFFFF\u005C) for the 92nd, that is U+FFFF" or U+FFFF\ respectively.

Copy link
Copy Markdown
Member

@mcdurdin mcdurdin Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just (?:\uFFFF.), assuming that U+FFFF will always be followed by a single codepoint for the marker id.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 56, needs to match a specific marker. so \m{abc} needs to match the _n_th marker.

- switch from PUA to U+0001 based indices
- add some notes from comments (discussion ongoing)
## Implementation (core)

- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream, with `state->context().push_marker(marker_number)`
- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility. I.e. `U+FFFF U+0123` corresponding to entry 0x0123 in the `vars.markers` -> `list` table.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that U+FFFF U+0000 is illegal so we may need to shift by one (kmx does this)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's shifted by one (noted above)

@srl295 srl295 merged commit 90eeb2a into master Jul 7, 2023
@srl295 srl295 deleted the core/9118-spec-markers-epic-ldml branch July 7, 2023 12:44
@keyman-server
Copy link
Copy Markdown
Collaborator

Changes in this pull request will be available for download in Keyman version 17.0.136-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

spec(core): ldml spec for marker implementation 🙀

5 participants