Skip to content

Add Arabic and Chinese punctuation symbols#13661

Merged
seanbudd merged 10 commits into
nvaccess:masterfrom
OzancanKaratas:doNotTriggerEmoji
Jun 23, 2022
Merged

Add Arabic and Chinese punctuation symbols#13661
seanbudd merged 10 commits into
nvaccess:masterfrom
OzancanKaratas:doNotTriggerEmoji

Conversation

@OzancanKaratas

Copy link
Copy Markdown
Collaborator

Link to issue number:

Fixes #12097
Fixes #12086

Summary of the issue:

Some punctuation marks in Chinese and Arabic are treated as emoji characters by NVDA.

Description of how this pull request fixes the issue:

This pull request adds those symbols into NVDA's symbols dictionary.

Testing strategy:

Manual test: Chinese and Arab users should download the AppVeyor build and test.

Known issues with pull request:

Wait for test results.

Change log entries:

Bug fixes

  • Chinese and Arabic punctuation marks are no longer treated as emoji by NVDA.

Code Review Checklist:

  • Pull Request description:
    • description is up to date
    • change log entries
  • Testing:
    • Unit tests
    • System (end to end) tests
    • Manual testing
  • API is compatible with existing add-ons.
  • Documentation:
    • User Documentation
    • Developer / Technical Documentation
    • Context sensitive help for GUI changes
  • UX of all users considered:
    • Speech
    • Braille
    • Low Vision
    • Different web browsers
    • Localization in other languages / culture than English

@mzanm

mzanm commented May 4, 2022

Copy link
Copy Markdown
Contributor

Have tested with Arabic and it works, the only issue is the Arabic question mark does not trigger the pitch change in Espeak and other synths, only the normal question mark does. Is there some way to make them both behayve the same? Thanks.

@CyrilleB79

Copy link
Copy Markdown
Contributor

Actually do you have any idea why these characters are part of CLDR? And why not characters such as English punctuation (dot, comma, question mark, etc.)?

I do not understand this Western-centric implementation.

An alternative implementation in NVDA may be considered: to disable CLDR processing for characters that are unicode punctuation. Have you investigated this path?

@OzancanKaratas

OzancanKaratas commented May 4, 2022

Copy link
Copy Markdown
Collaborator Author

Actually do you have any idea why these characters are part of CLDR?

Actually, I don't know why either. Is there currently a better solution than CLDR?

@Mazen428 said:

Have tested with Arabic and it works, the only issue is the Arabic question mark does not trigger the pitch change in Espeak and other synths, only the normal question mark does. Is there some way to make them both behayve the same? Thanks.

Does the pitch issue continues after disabling the “Include Unicode Consortium data (including emoji) when processing characters and symbols” option?

@CyrilleB79 CyrilleB79 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi
If the strategy of modifying the symbol file is validated, there are many things to change anyway.

Comment thread source/locale/en/symbols.dic Outdated
# identifier regexp
# Sentence endings.
. sentence ending (?<=[^\s.])\.(?=[\"'”’)\s]|$)
。 sentence ending (?<=[^\s.])\.(?=[\"'”’)\s]|$)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the concept of sentence ending period really exist for ideographic period, i.e. is ideographic period used elsewhere than to end a sentence in the languages using it? And if yes, does this regexp really match its usage in this types of languages, i.e. should it be followed by a space and are the simple/double quotes used as in latin writing? I doubt it.

In any case, even if all of these assumptions were true, the regexp does not contain the ideographic period, so it is not correct.

If there is no concept of sentence ending period in languages using ideographs, you should remove the regexp instead and set the 'preserve' parameter to 'always' to avoid pitch issues of the synth in the sentence prosody.

At last, the same regexp is used two times, for normal period and ideographic period. I do not know which rule will be used by NVDA in this case; anyway, it does not make sense.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's decide what to do after users test it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know what user can test if ideographic period and English period are mapped to the same regexp. Please clarify it because it makes no sense to me.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ideographic period should be treated the same as the Latin period. Because the ideographic period is only used in certain languages. However, the Latin period is also used in these languages.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ideographic period should be treated the same as the Latin period. Because the ideographic period is only used in certain languages. However, the Latin period is also used in these languages.

This is not the point. But it seems we do not understand each other.
What I am saying is that this does not make sense to use the same regexp for two complex symbols in the section complexSymbols:.
You have associated the same regexp for latin and ideographic sentence ending period.
When the text will be parsed and if the corresponding regexp is recognized, NVDA will report it either as latin sentence ending period or as ideographic sentence ending period, but it will choose one way to report. And the other way to report will never be used.

To be more concrete, try to associate a dummy pronunciation to latin sendence ending period and another one to ideographic sentence ending period. Then, try to generate a text to have each case reported. You will never be able to create such text since both regexps are the same.

I hope to have clarified my point now.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I invited you as a collaborator. Please help me.

Comment thread source/locale/en/symbols.dic Outdated
。 sentence ending (?<=[^\s.])\.(?=[\"'”’)\s]|$)
! sentence ending (?<=[^\s!])\!(?=[\"'”’)\s]|$)
? sentence ending (?<=[^\s?])\?(?=[\"'”’)\s]|$)
؟ sentence ending (?<=[^\s?])\?(?=[\"'”’)\s]|$)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as above for this regexp.

Comment thread source/locale/en/symbols.dic Outdated
؟ sentence ending (?<=[^\s?])\?(?=[\"'”’)\s]|$)
# Phrase endings.
; phrase ending (?<=[^\s;]);(?=\s|$)
؛ phrase ending (?<=[^\s;]);(?=\s|$)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

Comment thread source/locale/en/symbols.dic Outdated

# Complex symbols
. sentence ending dot all always
。 sentence ending dot all always

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In English, should have a name indicating the charater name and allowing to differenciate it from English one.

Suggested change
。 sentence ending dot all always
。 sentence ending ideographic period all always

Comment thread source/locale/en/symbols.dic Outdated
。 sentence ending dot all always
! sentence ending bang all always
? sentence ending question all always
؟ sentence ending question all always

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

Comment thread source/locale/en/symbols.dic Outdated
? sentence ending question all always
؟ sentence ending question all always
; phrase ending semi most always
؛ phrase ending semi most always

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem

Comment thread source/locale/en/symbols.dic Outdated
⌘ mac Command key none

#Locale specific punctuations
。 all

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing replacement.

Suggested change
。 all
ideographic period all

Comment thread source/locale/en/symbols.dic Outdated
Comment on lines +368 to +371
، comma all always
؟ question all
؛ semi most
、 comma all always

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idem for all these ones: use distinct names with respect to English:

@AppVeyorBot

Copy link
Copy Markdown

See test results for failed build of commit 716b687f6e

@OzancanKaratas

Copy link
Copy Markdown
Collaborator Author

@Mazen428, please test again.

@CyrilleB79

Copy link
Copy Markdown
Contributor

@OzancanKaratas, you write:

I invited you as a collaborator. Please help me.

For now, I do not need to suggest some code.
I have actually commented on some details and bugs in this PR; probably I should not have done it at this stage.
Because as also written before, more general questions need to be answered before implementing a solution.

Does the concept of sentence ending period really exist for ideographic period, i.e. is ideographic period used elsewhere than to end a sentence in the languages using it? And if yes, does this regexp really match its usage in this types of languages, i.e. should it be followed by a space and are the simple/double quotes used as in latin writing?

Actually do you have any idea why these characters are part of CLDR? And why not characters such as English punctuation (dot, comma, question mark, etc.)?

An alternative implementation in NVDA may be considered: to disable CLDR processing for characters that are unicode punctuation. Have you investigated this path?

And also a new general question I am thinking too:
Why is the English symbol file modification needed? Wouldn't it be enough to modify the symbol files of the impacted languages (ar, zh_CN, zh_TW, etc.)

Side note: regarding collaboration, I do not need to be invited to your NVDA repo as a collaborator; I can suggest you some code in this PR if needed.

@OzancanKaratas

Copy link
Copy Markdown
Collaborator Author

Actually it makes more sense to remove the problematic marks from the CLDR dictionary. However, I don't know exactly how to do this. Because we need to be able to understand what these marks mean when shown in other languages.

For example, although the Latin apostrophe is found in both the CLDR and NVDA dictionary, NVDA primarily uses its own dictionary. This is why I changed the NVDA symbols dictionary.

@CyrilleB79

Copy link
Copy Markdown
Contributor

Actually, I just realize that all the punctuation symbols are in the cldr.dic file, ideographic and arabic ones, but also Latin ones. Thus this changes a bit my questions.
However, since Latin ones are in the symbol files for all languages, CLDR ones are useless.

You write:

Actually it makes more sense to remove the problematic marks from the CLDR dictionary. However, I don't know exactly how to do this.

No I think that cldr.dic is auto-generated and mirrors what is determined in the Unicode Consortium. Thus this source should remain as is.
Symbol files are here to perform the needed fixes.

So the question that still stands is which symbol files need to be modified?

  1. English symbol files and the symbol files for all other languages
  2. Only the symbol files for languages that are impacted by this punctuation.

I assume that Latin punctuation symbols are commonly used in Arabic or Chinese whereas the contrary is not true: Arabic and Chinese punctuation symbols are not used in Latin-writing languages. If this assumption is confirmed, the second solution should be implemented, probably by the translators of these languages in their symbol files.

Once the targeted symbol file will be identified, we will be able to discuss how to modify its content.

@OzancanKaratas

Copy link
Copy Markdown
Collaborator Author

I saw the email you sent to the nvda-translations group. I think for now we should wait for responses and keep this pull request as a draft.

@seanbudd

Copy link
Copy Markdown
Member

What's the status here?

@OzancanKaratas

Copy link
Copy Markdown
Collaborator Author

Problematic symbols should be translated by translators who speak the relevant language. There are no changes to be made in the main symbol file. But I keep this pull request open in case it needs to be changed.

@seanbudd

Copy link
Copy Markdown
Member

This PR should be closed if the approach is invalid. Feel free to reopen it or a new PR if that changes.

@seanbudd seanbudd closed this Jun 13, 2022
@seanbudd

seanbudd commented Jun 16, 2022

Copy link
Copy Markdown
Member

Can you remove the regex matches? It would be good to add the symbols to /en/symbols.dic, but not the semantic matching.

I would suggest following Cyrille's review comments

@seanbudd seanbudd reopened this Jun 16, 2022
@OzancanKaratas OzancanKaratas marked this pull request as ready for review June 22, 2022 17:59
@OzancanKaratas OzancanKaratas requested a review from a team as a code owner June 22, 2022 17:59
@OzancanKaratas OzancanKaratas requested a review from seanbudd June 22, 2022 17:59
@seanbudd

Copy link
Copy Markdown
Member

Thanks @OzancanKaratas

@seanbudd seanbudd changed the title Do not trigger emoji characters when processing Arabic and Chinese punctuations Add Arabic and Chinese punctuation symbols Jun 23, 2022
@seanbudd seanbudd merged commit ac7215b into nvaccess:master Jun 23, 2022
@nvaccessAuto nvaccessAuto added this to the 2022.3 milestone Jun 23, 2022
@OzancanKaratas OzancanKaratas deleted the doNotTriggerEmoji branch June 23, 2022 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

language detection for some Chinese punctuations some symbols are treated as emojis

6 participants