spec(core): marker spec 🙀 by srl295 · Pull Request #9196 · keymanapp/keyman

srl295 · 2023-07-05T23:16:06Z

first cut

Fixes: #9118

@keymanapp-test-bot skip

- first cut Fixes: #9118

keymanapp-test-bot · 2023-07-05T23:16:10Z

User Test Results

Test specification and instructions

User tests are not required

Test Artifacts

Developer
Keyboards
- Test Keyboards
Linux
- Keyman for Linux
- Keyman for Linux
macOS
- Keyman for macOS
Web
- KeymanWeb Test Home
Windows

rc-swag · 2023-07-06T00:39:37Z

core/src/ldml/C9134_ldml_markers.md

+## Implementation (core)
+
+- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
+- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility.  I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.


Yes, the LDML processor can push the U+FFFF sequences as markers to the core processor using state->context().push_marker(a.dwData). I think it should be required that it does correlate the marker number with the marker id. This is because in the Windows implementation the platform layer holds the context, including markers (when possible), and sets the context in Core on each keystroke. This propagates down to the LDML processor, so there needs to be a correlation of some type so that, when converted back to the LDML U+FFFF … sequence, it matches the LDML rules.

To define terms here, the 'number' is just the serial number, so if we had \m{a}, \m{banana}, \m{c} as the only markers in a keyboard they would get numbers 0, 1, 2 respectively (alphabetic order), with ids (strings) of a, banana, c. Looking at push_marker(), it takes a uint32_t. So I would expect we would do state->context().push_marker(0), state->context().push_marker(1), state->context().push_marker(2) for those three markers, right?
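
As a rough illustration of the numbering described above, here is a minimal sketch that sorts the marker ids and numbers them from 0. The helper names `assign_marker_numbers` and `marker_number` are hypothetical and not part of Keyman Core, and plain byte-wise sorting stands in for "alphabetic order".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: sort the marker ids and number them from 0,
// so {a, banana, c} become 0, 1, 2.
std::map<std::string, uint32_t> assign_marker_numbers(std::vector<std::string> ids) {
  std::sort(ids.begin(), ids.end());                          // byte-wise order (assumption)
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());  // drop duplicate ids
  std::map<std::string, uint32_t> numbers;
  for (uint32_t n = 0; n < ids.size(); ++n) {
    numbers[ids[n]] = n;
  }
  return numbers;
}

// Hypothetical lookup: the id must have been declared by the keyboard.
uint32_t marker_number(const std::map<std::string, uint32_t>& numbers,
                       const std::string& id) {
  auto it = numbers.find(id);
  assert(it != numbers.end() && "unknown marker id");
  return it->second;
}
```

The resulting number is what would be handed to `push_marker()`; whether it is later shifted by one is a separate question raised further down the thread.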

rc-swag · 2023-07-06T00:45:58Z

core/src/ldml/C9134_ldml_markers.md

+- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
+- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility.  I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
+- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
+- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.


The Core needs to maintain its markers. Backspace processing does not ignore markers, but rather recognises them and removes them as appropriate, and emits backspaces to the OS as appropriate.

On Web, if a marker exists between the current caret position and the text to be deleted by a backspace, the marker is deleted. I believe this also affects markers immediately preceding backspace-deleted text as well. I'm pretty sure this should line up with Core.

There are 3 different backspace modes in Core:

User-pressed default backspace handling: this deletes one codepoint to left of caret, along with all markers on either side of that codepoint. (A submode of this is where there is no context, but that doesn't impact marker handling). Result of no matching rule for backspace in the keyboard.

Delete marker: emitted by the keyboard processor to the engine to delete one and only one marker prior to caret. Action coming from rule processing.

Delete codepoint: emitted by the keyboard processor to the engine to delete one and only one character prior to caret. Action coming from rule processing.
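
The three modes above can be sketched against a simplified context model. This is only an illustration: `Item` and the three functions are hypothetical stand-ins, not Keyman Core's actual context types or actions, and the context is assumed to end at the caret.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for one context item: a character or a marker.
struct Item {
  bool is_marker;
  uint32_t value;  // code point, or marker number
};

// Mode 1, default backspace: drop markers directly before the caret, then
// one character, then any markers directly before that character.
// Returns how many characters were removed (0 or 1).
int default_backspace(std::vector<Item>& ctx) {
  while (!ctx.empty() && ctx.back().is_marker) ctx.pop_back();
  int removed = 0;
  if (!ctx.empty()) {
    ctx.pop_back();
    removed = 1;
  }
  while (!ctx.empty() && ctx.back().is_marker) ctx.pop_back();
  return removed;
}

// Mode 2, delete marker: removes exactly one marker before the caret.
// Assumes rule processing only emits this when a marker is there.
void delete_marker(std::vector<Item>& ctx) {
  assert(!ctx.empty() && ctx.back().is_marker);
  ctx.pop_back();
}

// Mode 3, delete codepoint: removes exactly one character before the caret.
void delete_codepoint(std::vector<Item>& ctx) {
  assert(!ctx.empty() && !ctx.back().is_marker);
  ctx.pop_back();
}
```

In this model only the default backspace touches more than one item, and only removed characters would translate into backspaces sent to the OS.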

Thanks. I've updated slightly, though haven't captured everything here.
Marc, to your numbers:
№1: OK, this is the default behavior. It's something like the regex from the spec (though see note): (?:\m{.})*.(?:\m{.})*

№2 / №3 - don't these correspond to:

№2: <transforms type="backspace">…<transform from="\m{.}"/>…</transforms>
№3: <transforms type="backspace">…<transform from="."/>…</transforms>

… but the user could use these or any number of other possible rules?

This is more about how keyboard processor implements the transform in terms of actions that it passes back out to Core to handover to the Engine.

core/src/ldml/C9134_ldml_markers.md

jahorton · 2023-07-06T04:40:48Z

core/src/ldml/C9134_ldml_markers.md

+- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
+- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility.  I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
+- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
+- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.


On Web, if a marker exists between the current caret position and the text to be deleted by a backspace, the marker is deleted. I believe this also affects markers immediately preceding backspace-deleted text as well. I'm pretty sure this should line up with Core.

mcdurdin · 2023-07-06T07:00:41Z

core/src/ldml/C9134_ldml_markers.md

+- Keyman already uses U_SENTINEL `U+FFFF` (noncharacter)
+- The general proposal here is to use the sequence `U+FFFF U+EXXX` to represent marker #XXX
+- `U+FFFF` cannot otherwise occur in text, so it is unique
+- `U+EXXX` is always a private use character, so the marker number cannot collide with non-PUA text, however it may certainly collide with PUA text. The intention here is to reduce this possibility of collision, at least when (a) humans read a binary stream or (b) human _error_ causes the marker to show up in actual text.  As a counter example, if we used `U+FFFF U+0022` to indicate marker 0x22, then the marker might show up as a doublequote (`"`).


It's worth noting that the kmx keyboard processor uses U+0001 and up as its marker code space. These have never escaped the confines of Keyman Core (and in fact, are entirely internal to kmx keyboard processor and are not even visible on any API surfaces, except debug log ones).

So, the proposed safety net may not be necessary -- if we start at U+0001, then we get 65533 (U+0000, U+FFFF, and U+FFFE are reserved) codes to play with, which I think is more generous than 4095 markers, even if we can't see an immediate need for that many.

I think following in kmx's footsteps is an even better safety net. Probably use U+FFFF for 'all'.

mcdurdin · 2023-07-06T07:06:28Z

core/src/ldml/C9134_ldml_markers.md

+- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream.
+- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility.  I.e. `U+FFFF U+E123` corresponding to entry 0x123 in the `vars.markers` -> `list` table.
+- Core needs to remove `U+FFFF …` sequences before they are passed to the OS.
+- The default backspace processing needs to ignore `U+FFFF …` markers as it is deleting.


There are 3 different backspace modes in Core:

User-pressed default backspace handling: this deletes one codepoint to left of caret, along with all markers on either side of that codepoint. (A submode of this is where there is no context, but that doesn't impact marker handling). Result of no matching rule for backspace in the keyboard.

Delete marker: emitted by the keyboard processor to the engine to delete one and only one marker prior to caret. Action coming from rule processing.

Delete codepoint: emitted by the keyboard processor to the engine to delete one and only one character prior to caret. Action coming from rule processing.

Co-authored-by: Joshua Horton <joshua_horton@sil.org>

srl295 · 2023-07-06T21:43:08Z

core/src/ldml/C9134_ldml_markers.md

+- Transforms will need to match against the marker or markers desired, so may need to emit sequences such as `(?:\uFFFF\uE123)` meaning a match to marker #0x123
+- matching `\m{.}` may need to expand to `(?:\uFFFF[\uE000-\uEFFE])`


@mcdurdin although… this is why I was using a PUA marker: because otherwise the regex will need to emit (?:\uFFFF\u0022) for the 34th marker or (?:\uFFFF\u005C) for the 92nd, that is U+FFFF" or U+FFFF\ respectively.

Why not just (?:\uFFFF.), assuming that U+FFFF will always be followed by a single codepoint for the marker id.

Line 56 needs to match a specific marker, so \m{abc} needs to match the _n_th marker.
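
One way to sidestep the quoting concern while still matching a specific marker is to always emit the marker code as a `\uXXXX` escape in the generated pattern text. The sketch below only builds pattern strings; `marker_pattern` and `any_marker_pattern` are hypothetical names, and the non-zero precondition is an assumption based on the note elsewhere in this review that `U+FFFF U+0000` is illegal.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Pattern text for one specific marker: U+FFFF followed by the marker's
// code, both written as \uXXXX escapes so the generated regex never
// contains a raw quote or backslash.
std::string marker_pattern(uint16_t code) {
  assert(code != 0);  // assumption: U+FFFF U+0000 is not a valid marker
  char buf[32];
  std::snprintf(buf, sizeof buf, "(?:\\uFFFF\\u%04X)", static_cast<unsigned>(code));
  return buf;
}

// Pattern text for "any marker" (\m{.}), following the suggestion above that
// U+FFFF is always followed by exactly one code for the marker.
std::string any_marker_pattern() {
  return "(?:\\uFFFF.)";
}
```

With escapes, the marker with code 0x22 becomes `(?:\uFFFF\u0022)` in the pattern source rather than a literal doublequote after U+FFFF.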

- switch from PUA to U+0001 based indices - add some notes from comments (discussion ongoing)

mcdurdin · 2023-07-07T03:30:08Z

core/src/ldml/C9134_ldml_markers.md

+## Implementation (core)
+
+- Core needs to recognize `U+FFFF …` sequences and convert them to markers in the context stream, with `state->context().push_marker(marker_number)`
+- For normal processing, Core does _not_ need to correlate the marker _number_ with a marker _id_, although this would be helpful for a debugging or tracing facility.  I.e. `U+FFFF U+0123` corresponding to entry 0x0123 in the `vars.markers` -> `list` table.


Note that U+FFFF U+0000 is illegal so we may need to shift by one (kmx does this)

yes, it's shifted by one (noted above)
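
A minimal sketch of the shifted encoding, assuming a 0-based marker index is stored as `U+FFFF` followed by index + 1. The names `encode_marker`, `split_markers`, `Split`, and the cap `kMaxIndex` are hypothetical; the cap simply reflects the U+0001 to U+FFFD range mentioned earlier in the review, and the handling of a stray `U+FFFF` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr char16_t kSentinel = 0xFFFF;      // U_SENTINEL
constexpr uint32_t kMaxIndex = 0xFFFD - 1;  // hypothetical cap: codes U+0001..U+FFFD

// Encode a 0-based marker index as U+FFFF followed by (index + 1), so that
// U+FFFF U+0000 is never produced.
std::u16string encode_marker(uint32_t index) {
  assert(index <= kMaxIndex);
  return std::u16string{kSentinel, static_cast<char16_t>(index + 1)};
}

struct Split {
  std::u16string text;            // input with marker sequences removed
  std::vector<uint32_t> markers;  // 0-based marker indices, in order of appearance
};

// Walk the string, pulling out U+FFFF sequences as marker indices and
// keeping everything else as plain text (what the OS would see).
Split split_markers(const std::u16string& in) {
  Split out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == kSentinel && i + 1 < in.size() && in[i + 1] != 0) {
      out.markers.push_back(static_cast<uint32_t>(in[i + 1]) - 1);
      ++i;  // skip the marker code
    } else if (in[i] != kSentinel) {
      out.text.push_back(in[i]);
    }
    // A stray U+FFFF without a valid code after it is dropped (assumption).
  }
  return out;
}
```

Each recovered index is what Core would pass on via `push_marker()`, while `text` is the marker-free string.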

keyman-server · 2023-07-07T18:02:00Z

Changes in this pull request will be available for download in Keyman version 17.0.136-alpha

srl295 self-assigned this Jul 5, 2023

srl295 requested review from mcdurdin and rc-swag as code owners July 5, 2023 23:16

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Jul 5, 2023

spec(core): marker spec 🙀

5510321

- first cut Fixes: #9118

srl295 force-pushed the core/9118-spec-markers-epic-ldml branch from 9b276ec to 5510321 Compare July 5, 2023 23:16

github-actions bot added core/ Keyman Core epic-ldml labels Jul 5, 2023

keymanapp-test-bot bot removed the user-test-missing User tests have not yet been defined for the PR label Jul 5, 2023

srl295 requested review from ermshiperete and jahorton July 5, 2023 23:17

srl295 added this to the A17S16 milestone Jul 5, 2023

rc-swag reviewed Jul 6, 2023

jahorton reviewed Jul 6, 2023

mcdurdin requested changes Jul 6, 2023

srl295 and others added 2 commits July 6, 2023 16:32

Merge branch 'master' into core/9118-spec-markers-epic-ldml

bab22c8

Update core/src/ldml/C9134_ldml_markers.md

4c096b7

Co-authored-by: Joshua Horton <joshua_horton@sil.org>

srl295 commented Jul 6, 2023

spec(core): marker spec 🙀

4950f74

- switch from PUA to U+0001 based indices - add some notes from comments (discussion ongoing)

mcdurdin approved these changes Jul 7, 2023

srl295 merged commit 90eeb2a into master Jul 7, 2023

srl295 deleted the core/9118-spec-markers-epic-ldml branch July 7, 2023 12:44
