Add Arabic and Chinese punctuation symbols by OzancanKaratas · Pull Request #13661 · nvaccess/nvda

OzancanKaratas · 2022-05-03T13:55:24Z

Link to issue number:

Fixes #12097
Fixes #12086

Summary of the issue:

Some punctuation marks in Chinese and Arabic are treated as emoji characters by NVDA.

Description of how this pull request fixes the issue:

This pull request adds those symbols into NVDA's symbols dictionary.

Testing strategy:

Manual test: Chinese- and Arabic-speaking users should download the AppVeyor build and test it.

Known issues with pull request:

Wait for test results.

Change log entries:

Bug fixes

Chinese and Arabic punctuation marks are no longer treated as emoji by NVDA.

Code Review Checklist:

Pull Request description:
- description is up to date
- change log entries
Testing:
- Unit tests
- System (end to end) tests
- Manual testing
API is compatible with existing add-ons.
Documentation:
- User Documentation
- Developer / Technical Documentation
- Context sensitive help for GUI changes
UX of all users considered:
- Speech
- Braille
- Low Vision
- Different web browsers
- Localization in other languages / culture than English


mzanm · 2022-05-04T05:29:42Z

I have tested with Arabic and it works. The only issue is that the Arabic question mark does not trigger the pitch change in eSpeak and other synths; only the normal question mark does. Is there some way to make them both behave the same? Thanks.

CyrilleB79 · 2022-05-04T07:53:11Z

Actually do you have any idea why these characters are part of CLDR? And why not characters such as English punctuation (dot, comma, question mark, etc.)?

I do not understand this Western-centric implementation.

An alternative implementation in NVDA may be considered: to disable CLDR processing for characters that are unicode punctuation. Have you investigated this path?
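
As a rough illustration of the alternative raised above, the sketch below classifies characters by their Unicode general category using Python's standard `unicodedata` module. The helper names `is_unicode_punctuation` and `should_use_cldr` are hypothetical and do not exist in NVDA; this only shows that the characters under discussion fall into a punctuation category while a typical emoji does not.

```python
import unicodedata


def is_unicode_punctuation(char: str) -> bool:
    """Return True if the character's general category is a punctuation class (P*)."""
    return unicodedata.category(char).startswith("P")


def should_use_cldr(char: str) -> bool:
    # Hypothetical gate (an assumption, not NVDA code):
    # skip CLDR descriptions for anything Unicode classifies as punctuation.
    return not is_unicode_punctuation(char)


samples = [
    "\u3002",  # ideographic full stop
    "\u3001",  # ideographic comma
    "\u061F",  # Arabic question mark
    "\u060C",  # Arabic comma
    "\u061B",  # Arabic semicolon
    "\U0001F600",  # grinning face emoji
]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), should_use_cldr(ch))
```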

OzancanKaratas · 2022-05-04T10:36:30Z

Actually do you have any idea why these characters are part of CLDR?

Actually, I don't know why either. Is there currently a better solution than CLDR?

@Mazen428 said:

I have tested with Arabic and it works. The only issue is that the Arabic question mark does not trigger the pitch change in eSpeak and other synths; only the normal question mark does. Is there some way to make them both behave the same? Thanks.

Does the pitch issue continue after disabling the “Include Unicode Consortium data (including emoji) when processing characters and symbols” option?


CyrilleB79

Hi
If the strategy of modifying the symbol file is validated, there are many things to change anyway.

CyrilleB79 · 2022-05-04T12:24:49Z

 # identifier	regexp
 # Sentence endings.
 . sentence ending	(?<=[^\s.])\.(?=[\"'”’)\s]|$)
+。 sentence ending	(?<=[^\s.])\.(?=[\"'”’)\s]|$)


Does the concept of a sentence-ending period really exist for the ideographic period, i.e. is the ideographic period used anywhere other than at the end of a sentence in the languages using it? And if so, does this regexp really match its usage in these types of languages, i.e. should it be followed by a space, and are single/double quotes used as in Latin writing? I doubt it.

In any case, even if all of these assumptions were true, the regexp does not contain the ideographic period, so it is not correct.

If there is no concept of sentence ending period in languages using ideographs, you should remove the regexp instead and set the 'preserve' parameter to 'always' to avoid pitch issues of the synth in the sentence prosody.

Lastly, the same regexp is used twice, for the normal period and the ideographic period. I do not know which rule NVDA will use in this case; in any case, it does not make sense.
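
The point about the regexp not containing the ideographic period can be checked with a small Python sketch. The first pattern is copied from the diff above. The second is a hypothetical variant that merely swaps in the ideographic character, to show that the character has to appear in the pattern for a match to be possible; it is not a claim about how sentence endings should be detected in Chinese text.

```python
import re

# Pattern copied from the diff: it looks for an ASCII "." only.
LATIN_PERIOD = re.compile(r"""(?<=[^\s.])\.(?=[\"'”’)\s]|$)""")

# Hypothetical variant (assumption): same shape, with U+3002 substituted.
IDEOGRAPHIC_PERIOD = re.compile(r"""(?<=[^\s。])。(?=[\"'”’)\s]|$)""")

latin_text = "Hello."
chinese_text = "你好。"

print(LATIN_PERIOD.search(latin_text))          # matches the final "."
print(LATIN_PERIOD.search(chinese_text))        # no match: there is no ASCII "."
print(IDEOGRAPHIC_PERIOD.search(chinese_text))  # matches the final "。"
```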

Let's decide what to do after users test it.

I do not know what user can test if ideographic period and English period are mapped to the same regexp. Please clarify it because it makes no sense to me.

I think the ideographic period should be treated the same as the Latin period. Because the ideographic period is only used in certain languages. However, the Latin period is also used in these languages.

I think the ideographic period should be treated the same as the Latin period. Because the ideographic period is only used in certain languages. However, the Latin period is also used in these languages.

This is not the point. But it seems we do not understand each other.
What I am saying is that it does not make sense to use the same regexp for two complex symbols in the complexSymbols: section.
You have associated the same regexp with both the Latin and the ideographic sentence-ending period.
When the text is parsed and the corresponding regexp matches, NVDA will report it either as a Latin sentence-ending period or as an ideographic sentence-ending period, but it will choose only one way to report it. The other way will never be used.

To be more concrete, try to associate a dummy pronunciation with the Latin sentence-ending period and another one with the ideographic sentence-ending period. Then, try to write a text in which each case gets reported. You will never be able to create such a text since both regexps are the same.

I hope to have clarified my point now.
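
The experiment described above can be approximated in a few lines of Python. This sketch assumes, purely for illustration, that the complex symbol patterns are joined into a single alternation of named groups; the `rules` table, the group naming, and the dummy pronunciations are all hypothetical and are not taken from NVDA's code. Under that assumption, two rules with an identical pattern can never both be reported, because the first alternative always wins.

```python
import re

# The regexp shared by both rules in the diff.
SHARED_PATTERN = r"""(?<=[^\s.])\.(?=[\"'”’)\s]|$)"""

# Hypothetical rule table: (identifier, pattern, dummy pronunciation).
rules = [
    (". sentence ending", SHARED_PATTERN, "latin dot"),
    ("。 sentence ending", SHARED_PATTERN, "ideographic dot"),
]

# Assumption: the rules are combined into one alternation of named groups.
combined = re.compile(
    "|".join(f"(?P<r{i}>{pattern})" for i, (_, pattern, _) in enumerate(rules))
)


def reported(text: str) -> list[str]:
    """Return the dummy pronunciation chosen for each match in the text."""
    results = []
    for match in combined.finditer(text):
        rule_index = int(match.lastgroup[1:])  # "r0" -> 0
        results.append(rules[rule_index][2])
    return results


print(reported("One. Two."))  # only the first rule is ever reported
print(reported("你好。"))      # nothing matches: the pattern has no "。"
```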

I invited you as a collaborator. Please help me.

CyrilleB79 · 2022-05-04T12:31:08Z

+。 sentence ending	(?<=[^\s.])\.(?=[\"'”’)\s]|$)
 ! sentence ending	(?<=[^\s!])\!(?=[\"'”’)\s]|$)
 ? sentence ending	(?<=[^\s?])\?(?=[\"'”’)\s]|$)
+؟ sentence ending	(?<=[^\s?])\?(?=[\"'”’)\s]|$)


Same comments as above for this regexp.

CyrilleB79 · 2022-05-04T12:31:34Z

+؟ sentence ending	(?<=[^\s?])\?(?=[\"'”’)\s]|$)
 # Phrase endings.
 ; phrase ending	(?<=[^\s;]);(?=\s|$)
+؛ phrase ending	(?<=[^\s;]);(?=\s|$)


CyrilleB79 · 2022-05-04T12:33:52Z


 # Complex symbols
 . sentence ending	dot	all	always
+。 sentence ending	dot	all	always


In English, this should have a name that indicates the character's name and differentiates it from the English one.

Suggested change

-。 sentence ending	dot	all	always
+。 sentence ending	ideographic period	all	always
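
One possible way to find a distinguishing English name for each of these characters is to look up its official Unicode character name with Python's standard `unicodedata` module, as in the sketch below. Note that the Unicode name for U+3002 is "IDEOGRAPHIC FULL STOP", which differs slightly from the "ideographic period" wording suggested above; which wording suits the symbol file is a separate decision.

```python
import unicodedata

# The Arabic and ideographic punctuation marks touched by this PR,
# written as escapes to avoid bidirectional-text display issues.
marks = [
    "\u3002",  # 。
    "\u3001",  # 、
    "\u061F",  # Arabic question mark
    "\u060C",  # Arabic comma
    "\u061B",  # Arabic semicolon
]

for ch in marks:
    # unicodedata.name returns the official upper-case character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch).lower())
```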

CyrilleB79 · 2022-05-04T12:34:20Z

+。 sentence ending	dot	all	always
 ! sentence ending	bang	all	always
 ? sentence ending	question	all	always
+؟ sentence ending	question	all	always


CyrilleB79 · 2022-05-04T12:34:34Z

 ? sentence ending	question	all	always
+؟ sentence ending	question	all	always
 ; phrase ending	semi	most	always
+؛ phrase ending	semi	most	always


CyrilleB79 · 2022-05-04T12:36:19Z

 ⌘	mac Command key	none
+
+#Locale specific punctuations
+。	all


Missing replacement.

Suggested change

-。	all
+。	ideographic period	all

CyrilleB79 · 2022-05-04T12:42:57Z

+،	comma	all	always
+؟	question	all
+؛	semi	most
+、	comma	all	always


Idem for all these ones: use distinct names with respect to English:


AppVeyorBot · 2022-05-07T11:38:25Z

PASS: Translation comments check.
PASS: Unit tests.
PASS: Lint check.
FAIL: System tests. See test results for more information.
Build (for testing PR): https://ci.appveyor.com/api/buildjobs/pp1gplci5yk1vxlo/artifacts/output/nvda_snapshot_pr13661-25361,716b687f.exe

See test results for failed build of commit 716b687f6e

OzancanKaratas · 2022-05-08T11:56:30Z

@Mazen428, please test again.

CyrilleB79 · 2022-05-09T10:00:38Z

@OzancanKaratas, you write:

I invited you as a collaborator. Please help me.

For now, I do not need to suggest any code.
I have actually commented on some details and bugs in this PR; I probably should not have done so at this stage,
because, as I wrote before, more general questions need to be answered before implementing a solution.

Does the concept of a sentence-ending period really exist for the ideographic period, i.e. is the ideographic period used anywhere other than at the end of a sentence in the languages using it? And if so, does this regexp really match its usage in these types of languages, i.e. should it be followed by a space, and are single/double quotes used as in Latin writing?

Actually do you have any idea why these characters are part of CLDR? And why not characters such as English punctuation (dot, comma, question mark, etc.)?

An alternative implementation in NVDA may be considered: to disable CLDR processing for characters that are unicode punctuation. Have you investigated this path?

There is also a new general question I am thinking about:
Why is the English symbol file modification needed? Wouldn't it be enough to modify the symbol files of the impacted languages (ar, zh_CN, zh_TW, etc.)?

Side note: regarding collaboration, I do not need to be invited to your NVDA repo as a collaborator; I can suggest some code in this PR if needed.

OzancanKaratas · 2022-05-09T12:51:13Z

Actually it makes more sense to remove the problematic marks from the CLDR dictionary. However, I don't know exactly how to do this. Because we need to be able to understand what these marks mean when shown in other languages.

For example, although the Latin apostrophe is found in both the CLDR and NVDA dictionary, NVDA primarily uses its own dictionary. This is why I changed the NVDA symbols dictionary.
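
The precedence described here can be pictured as a layered lookup. The sketch below uses `collections.ChainMap` from the Python standard library with made-up dictionary contents; the variable names and the replacement strings are illustrative assumptions and do not reflect how NVDA actually loads `cldr.dic` or `symbols.dic`.

```python
from collections import ChainMap

# Illustrative contents only (assumptions, not the real dictionary entries).
cldr_entries = {
    "'": "apostrophe (from CLDR)",
    "\u3002": "ideographic full stop (from CLDR)",
}
symbol_entries = {
    "'": "tick",
}

# Earlier maps take priority: the symbols dictionary shadows the CLDR data.
lookup = ChainMap(symbol_entries, cldr_entries)

print(lookup["'"])       # the symbols dictionary entry wins
print(lookup["\u3002"])  # falls through to the CLDR entry

# Adding the character to the symbols dictionary changes what is reported.
symbol_entries["\u3002"] = "ideographic period"
print(lookup["\u3002"])
```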

CyrilleB79 · 2022-05-09T21:56:59Z

Actually, I just realized that all the punctuation symbols are in the cldr.dic file: the ideographic and Arabic ones, but also the Latin ones. This changes my questions a bit.
However, since the Latin ones are in the symbol files for all languages, the CLDR ones are useless.

You write:

Actually it makes more sense to remove the problematic marks from the CLDR dictionary. However, I don't know exactly how to do this.

No, I think that cldr.dic is auto-generated and mirrors what is defined by the Unicode Consortium. Thus this source should remain as is.
Symbol files are here to perform the needed fixes.

So the question that still stands is: which symbol files need to be modified?

1. English symbol files and the symbol files for all other languages.
2. Only the symbol files for languages that are impacted by this punctuation.

I assume that Latin punctuation symbols are commonly used in Arabic or Chinese whereas the contrary is not true: Arabic and Chinese punctuation symbols are not used in Latin-writing languages. If this assumption is confirmed, the second solution should be implemented, probably by the translators of these languages in their symbol files.

Once the target symbol file is identified, we will be able to discuss how to modify its content.

OzancanKaratas · 2022-05-09T23:31:20Z

I saw the email you sent to the nvda-translations group. I think for now we should wait for responses and keep this pull request as a draft.

seanbudd · 2022-06-10T08:59:36Z

What's the status here?

OzancanKaratas · 2022-06-10T11:13:57Z

Problematic symbols should be translated by translators who speak the relevant language. There are no changes to be made in the main symbol file. But I keep this pull request open in case it needs to be changed.

seanbudd · 2022-06-13T23:44:50Z

This PR should be closed if the approach is invalid. Feel free to reopen it or open a new PR if that changes.

seanbudd · 2022-06-16T06:26:25Z

Can you remove the regex matches? It would be good to add the symbols to /en/symbols.dic, but not the semantic matching.

I would suggest following Cyrille's review comments


seanbudd · 2022-06-23T09:46:12Z

Thanks @OzancanKaratas

Do not trigger emoji characters when processing Arabic and Chinese pu…

513c0d4

…nctuations Fixes nvaccess#12097 Fixes nvaccess#12086

OzancanKaratas added 2 commits May 4, 2022 14:21

Fix pitch change after punctuation

d7ae807

Merge remote-tracking branch 'remotes/origin/master' into doNotTrigge…

bfb9724

…rEmoji

CyrilleB79 reviewed May 4, 2022

View reviewed changes

OzancanKaratas added 3 commits May 4, 2022 16:17

Apply suggestions partially

2bddd3d

Rework for suggestions

493acd8

Merge remote-tracking branch 'remotes/origin/master' into doNotTrigge…

f3715aa

…rEmoji

Adapted to real name

7176281

CyrilleB79 mentioned this pull request Jun 9, 2022

Arabic punctuation missing from symbols dic #13770

Closed

seanbudd closed this Jun 13, 2022

seanbudd reopened this Jun 16, 2022

OzancanKaratas added 3 commits June 16, 2022 15:14

Merge remote-tracking branch 'remotes/origin/master' into doNotTrigge…

49987a3

…rEmoji

Remove regex matches

9171963

Add as normal symbols

bd791cd

OzancanKaratas marked this pull request as ready for review June 22, 2022 17:59

OzancanKaratas requested a review from a team as a code owner June 22, 2022 17:59

OzancanKaratas requested a review from seanbudd June 22, 2022 17:59

seanbudd approved these changes Jun 23, 2022

View reviewed changes

seanbudd changed the title ~~Do not trigger emoji characters when processing Arabic and Chinese punctuations~~ Add Arabic and Chinese punctuation symbols Jun 23, 2022

seanbudd merged commit ac7215b into nvaccess:master Jun 23, 2022

nvaccessAuto added this to the 2022.3 milestone Jun 23, 2022

OzancanKaratas deleted the doNotTriggerEmoji branch June 23, 2022 13:00
