Skip to content

Add characters count in the table of contents#14589

Closed
Jackie6 wants to merge 9 commits intoWordPress:trunkfrom
Jackie6:update/word-count
Closed

Add characters count in the table of contents#14589
Jackie6 wants to merge 9 commits intoWordPress:trunkfrom
Jackie6:update/word-count

Conversation

@Jackie6
Copy link
Copy Markdown
Contributor

@Jackie6 Jackie6 commented Mar 22, 2019

Description

fix #13796

When writing a post in Chinese the word count shown in the content structure does not show an accurate word count.

Screenshot

image

How has this been tested?

  • Local
  • Browser testing

Types of changes

Add the count of characters(no spaces) in the table of contents

Testing

Steps to reproduce the behavior:

  1. Use http://generator.lorem-ipsum.info/_chinese to generate some sample Chinese text
  2. Create new page in Gutenberg
  3. Paste in content
  4. Click the info icon at the top
  5. See if the number of characters is correct

Checklist:

  • My code is tested.
  • My code follows the WordPress code style.
  • My code follows the accessibility standards.
  • My code has proper inline documentation.
  • I've included developer documentation if appropriate.

@Jackie6 Jackie6 force-pushed the update/word-count branch from 3864006 to bde47d3 Compare March 22, 2019 23:52
@gziolo gziolo added Internationalization (i18n) Issues or PRs related to internationalization efforts Needs Design Feedback Needs general design feedback. labels Mar 25, 2019
@gziolo gziolo added the [Type] Enhancement A suggestion for improvement. label Mar 25, 2019
@kjellr
Copy link
Copy Markdown
Contributor

kjellr commented Apr 2, 2019

With the addition of this item, we'll also want to adjust the spacing for the table-of-contents__count items. Currently, it looks a little weird to have that last "Blocks" item sitting off on its own:

Screen Shot 2019-04-02 at 4 12 25 PM

Two options:

  1. Widen the modal, remove the existing 25% width rule, and give each items some right-padding instead:

Screen Shot 2019-04-02 at 4 08 17 PM

  1. Change the 25% width rule to 33.33% instead, to force the 4th and 5th items onto the next line. We'll also want to add a tiny bit of bottom padding to these items. This keeps things compact, but also avoids having just one item all alone:

Screen Shot 2019-04-02 at 4 01 42 PM

Option 2 seems like the simplest fix to me.

@Jackie6
Copy link
Copy Markdown
Contributor Author

Jackie6 commented Apr 2, 2019

Hi, @kjellr Thanks for your comments. I think the option two looks better for me. But we may still need to wait for some design feedback.

@talldan talldan removed the Needs Design Feedback Needs general design feedback. label Apr 3, 2019
Copy link
Copy Markdown
Contributor

@talldan talldan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Jackie6, this looks good, thanks so much for your contribution. I did some manual testing, and the character count looks good. I tried a few edge cases, like characters that are represented by multiple code points, and the character count looks correct.

I've noticed a couple of minor things with the code, if you're able to tackle those and the discussed style changes it'd be much appreciated.

const charCountType = _x( 'characters_excluding_spaces', 'Word count type. Do not translate!' );

return (
<span className="word-count">{ wordCount( content, charCountType ) }</span>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

word-count isn't the right class name here. Instead I think it should be called char-count, reflecting the component name. To get the styling to work an extra selector will have to be added here for the new .char-count class:

.table-of-contents__number,
.table-of-contents__popover .word-count {
font-size: 21px;
font-weight: 400;
line-height: 30px;
color: $dark-gray-500;
}

It is a bit unusual that these components (CharCount, WordCount) receive their style from a completely different component (TableOfContentsPanel), but I think it's outside of the scope of this PR to fix that.

*/
import { withSelect } from '@wordpress/data';
import { _x } from '@wordpress/i18n';
import { count as wordCount } from '@wordpress/wordcount';
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wordCount alias is a bit misleading, given it returns that characterCount when used in this file. It's minor, but it'd be good to either leave it as count or alias it as charCount.

@talldan
Copy link
Copy Markdown
Contributor

talldan commented Apr 3, 2019

Just thought I'd also comment that the feedback from @kjellr was the design feedback.

Your PR was discussed during the design triage meeting:
(https://wordpress.slack.com/archives/C02S78ZAL/p1554221572033000 - you may have to sign up to slack to view the discussion).

It seems like you both agree on option 2, so that looks like a good way forward.

@Jackie6
Copy link
Copy Markdown
Contributor Author

Jackie6 commented Apr 10, 2019

Hi @ellatrix, thanks for clarifying this pr further. This pr still does not fix the counting for mixed script post. But it can count the correct number of Chinese words.
image
I expect to create a new feature to support counting the words in a mixed script post in the future. But at this point, I think it is good to add characters count in the content structure as most editors already provide this. An example is MS Word:
image

@ellatrix
Copy link
Copy Markdown
Member

@Jackie6 Thanks, this is looking great!

One final thing: for Latin scripts it might make sense to include both word count and character count, but does it also make sense for e.g. a Chinese translation? Should word count not be removed in that case (perhaps by allowing an empty translation)?

@gziolo gziolo removed the Needs Decision Needs a decision to be actionable or relevant label Apr 24, 2019
@Jackie6
Copy link
Copy Markdown
Contributor Author

Jackie6 commented Apr 25, 2019

Hi @ellatrix thanks for comments. Do you mean we need to remove the word count when the language is Chinese? I think it will cause the inconsistency of UI. That is sometimes both word count and char count show but sometimes only char count show.

@ellatrix
Copy link
Copy Markdown
Member

@Jackie6 Yes, the UI will be different, but is word count useful at all in a Chinese translation? If it's not useful, it shouldn't be there?

@Jackie6
Copy link
Copy Markdown
Contributor Author

Jackie6 commented Apr 25, 2019

@ellatrix But apart from Chinese, Korean and Japenese are also character-based rather than word-based. For languages like these, we also need to remove the word count?

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Apr 25, 2019

Just a random drive-by, but even though we have characters in CJK languages that by themselves are words it doesn't mean that there aren't multi-character words.

Other languages such as early Greek, Linear B, early Egyption also don't use spaces to separate words. Okay, we'll ignore the fact that these earlier languages could be equally written left-to-right, right-to-left, top-to-bottom, bottom-to-top, or changing text direction at intervals to form spirals and other patterns.

Word segmentation is a real problem with an extremely complicated solution. Check out this post from the Solr project talking about ambiguity in Chinese word segmentation - ambiguity that exists because there are multiple valid ways to segment the words in the sentence.

the first segmentation means “I like New Zealand Flowers.” The second segmentation means “I like fresh broccoli”


it's impractical to think that we have the energy and time at the moment to conclusively solve this generally-unsolved problem but we do have some reasonable ways we can move forward in the meantime:

  • cut out word counts entirely: a drastic choice that excludes being helpful to a large swath of cases where it otherwise would be
  • cut out word counts for any language we think might be troublesome: this is the conservative option to prevent making basic mistakes but may not be super helpful
  • add some qualifying notice to the word count when we think it could be wrong, such as when we detect characters from languages that don't us spaces for segmentation or when the ratio of spaces to characters is below a given threshold

my shallow thoughts on the matter are that it'd be more helpful to show the word count with some disclaimer - maybe a popover tooltip explaining that we think it could be wrong - because if we hide it altogether at the first detection of a Chinese character then we'll be degrading the experience without an explanation for not only Chinese posts but also posts that quote Chinese text, even if it's just a small one.

for inclusionary purposes I think it would make more sense to develop some basic heuristics to detect when we think the word count could be wrong; possible indicators based on no research could be:

  • word-to-character ratio is out of whack, less than 1:20 (spaces to other characters)
  • we encounter a long stream of characters in non-roman letters, we could treat the blob as a whole or split it based on approximate character-per-words estimates

@ellatrix
Copy link
Copy Markdown
Member

Thanks @dmsnell for you comment.

My (small) problem with this change is that each of these counts can be translated.

So e.g. a Chinese translating might be:

  • words => characters_excluding_spaces.
  • characters => characters_excluding_spaces.

So you'd just have two counts that are the same thing.

Possible solution:

  • Remove word count if __( 'words' ) === __( 'characters' ).
  • OR remove word count if __( 'words' ) === 'characters_excluding_spaces' && __( 'characters' ) === 'characters_excluding_spaces', and if __( 'words' ) === 'characters_excluding_spaces', force character count to be characters_including_spaces, or if __( 'words' ) === 'characters_including_spaces', remove character count.

What do you think @dmsnell and @Jackie6?

@ellatrix
Copy link
Copy Markdown
Member

Another possible solution is to display all three possible counters, and don't translate anything.

@dmsnell
Copy link
Copy Markdown
Member

dmsnell commented Apr 26, 2019

Remove word count if __( 'words' ) === __( 'characters' ).

@ellatrix this suggestion is much more favorable in my opinion because it's starting at least with some kind of principle: if the translator says that these are the same then only show one

special-casing specific languages in groups I'm sure will cause more stir and degradation than we intend.

having looked at @wordpress/wordcount it became clear that we're starting from the point of poking-at-holes vs. building a solid ground. that is, making this robustly-solved is a much bigger project than just showing characters and so I think getting character counts out even when word counts will be wrong is a positive step forward

I'm a fan of adding a disclaimer popup but I can see where that could introduce needless confusion - for the time being checking if __( 'words' ) === __( 'characters' ) or whatever form is used here __( 'words', { context: 'label for word-count' } ) is a good tradeoff between getting it perfect and getting it shipped

@Jackie6
Copy link
Copy Markdown
Contributor Author

Jackie6 commented Apr 26, 2019

@ellatrix I am a little confused. Why would __( 'words' ) === __( 'characters' ) ? In Chinese, the translated strings of words and characters would be different. And I guess we cannot control the localization here, right? From my understanding, we just wrap translatable strings in the __() function, and translators will translate the strings.

@ellatrix
Copy link
Copy Markdown
Member

ellatrix commented May 1, 2019

@Jackie6 If it is different, it would still be displayed. If it's translated the same, one of them is removed. It doesn't make sense to have both counts if they are the same. This may not be the case in Chinese, but maybe it is in another language.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops! Word vs. Words

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Sorry about that.

@Jackie6 Jackie6 force-pushed the update/word-count branch from 41e7fc2 to f869445 Compare May 1, 2019 15:42
@Jackie6 Jackie6 closed this May 2, 2019
@Jackie6 Jackie6 reopened this May 2, 2019
tabIndex="0"
>
{
__( 'Words' ) !== __( 'Characters' ) &&
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the goal of this? Do you rely on translators on changing this? That's not possible as there is no context.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swissspidy Ah, yes, I meant #14589 (comment)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should export the translation in both count components, then compare the types here. https://github.com/WordPress/gutenberg/pull/14589/files#r282803952

@swissspidy
Copy link
Copy Markdown
Member

@ellatrix I assume with Remove word count if __( 'words' ) === __( 'characters' ) you meant charCountType === wordCountType? Because __( 'words' ) is totally different from _x( 'words', 'Word count type. Do not translate!' ).

* Two options available for counting: 'characters_excluding_spaces' or 'characters_including_spaces'
* Do not translate into your own language.
*/
const charCountType = _x( 'characters_including_spaces', 'Character count type. Do not translate literally!' );
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to export this, as well in word-count, to be used table-of-contents to compare.

@aristath
Copy link
Copy Markdown
Member

If I'm not mistaken this one can be closed now that #24823 was merged?

@swissspidy
Copy link
Copy Markdown
Member

@aristath Sort of. #24823 doesn't really fully address #13796, which this PR aimed to do.

Base automatically changed from master to trunk March 1, 2021 15:42
@t-hamano
Copy link
Copy Markdown
Contributor

I tried to understand what this PR is trying to solve.

The character count is implemented by #24823. I think the rest of what this PR is trying to do is to add an option to control the presence or absence of spaces in the character count.

In word count, we can control the count type by words, characters_excluding_spaces and characters_including_spaces.

/*
* translators: If your word count is based on single characters (e.g. East Asian characters),
* enter 'characters_excluding_spaces' or 'characters_including_spaces'. Otherwise, enter 'words'.
* Do not translate into your own language.
*/
const wordCountType = _x( 'words', 'Word count type. Do not translate!' );

In character count, on the logic is fixed on characters_including_spaces.

return characterCount( content, 'characters_including_spaces' );

Is this the right goal for this PR?

@youknowriad
Copy link
Copy Markdown
Contributor

Doing some cleanup in the repository. I was wondering if this PR still worth keeping open. Probably needs to be redone entirely if we ever get to it. Let's close it. Please let me know if you think otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Internationalization (i18n) Issues or PRs related to internationalization efforts [Type] Enhancement A suggestion for improvement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Word Count in content structures does not count Chinese words properly