Skip to content

Allow dashes in author name #42#43

Merged
sebastianquek merged 16 commits intose-edu:masterfrom
yamgent:42-dashes-author-name
Dec 13, 2016
Merged

Allow dashes in author name #42#43
sebastianquek merged 16 commits intose-edu:masterfrom
yamgent:42-dashes-author-name

Conversation

@yamgent
Copy link
Member

@yamgent yamgent commented Dec 9, 2016

Fixes #42.

Ready for review.

The new files for test case for 'author name with dashes' caused a change in
the overall statistics.
HTML comment end tags contain dashes, and dashes are no
longer forbidden characters.
The new files for test case for 'author name in HTML files' caused a change in
the overall statistics.
@riwu
Copy link
Contributor

riwu commented Dec 9, 2016

Should leading/trailing dashes in names be removed? eg. "--author-name -" to "author-name"
If so, this would do the job: master...riwu:dashInName

@damithc
Copy link
Contributor

damithc commented Dec 9, 2016

@sebastianquek for your review please.

@sebastianquek
Copy link
Contributor

Thanks @yamgent for your contributions!

Background

Looking at the various use cases, the author tag can be followed by several variants of author names:

  1. // @@author A0123456Z
  2. // @@author generated
  3. // @@author A0123456Z-unused
  4. // @@author A0123456Z-reused
  5. <!-- @@author A0123456Z -->
  6. <!-- @@author A0123456Z-unused -->
  7. /* @@author A0123456Z */

The current implementation allows for spaces in names, and when we allow dashes as well, we will need to handle edge cases such as in point 6 (as @yamgent has found). However, the fix is very specific to -->, and may be adding unnecessary complexity to the codebase.

Given that we currently restrict author names to only alphanumeric characters and dashes as in the above examples, I propose that we remove the feature that allows for spaces in names.

Proposed implementation

One approach is to extract the author's name according to a regex expression. In a very verbose manner: ^[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9].

  • The ^ indicates the beginning of the sentence,
  • the 1st matching character set ensures the author's name has to start with an alphanumeric character,
  • the 2nd set allows any number of alphanumeric characters and -,
  • the 3rd set is similar to the 1st set.

This ensures that author names that start and/or end with non-alphanumeric characters are illegal (as @riwu has mentioned). Furthermore, alphanumeric characters can be interleaved with - characters (e.g. long-author-name).


Any thoughts/ideas?

@damithc
Copy link
Contributor

damithc commented Dec 10, 2016

I'm ok with not allowing spaces in author name.

Copy link
Contributor

@sebastianquek sebastianquek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove ability to have whitespace in author names, use regex that accepts and extracts the valid author names.

This streamlines the extraction of the author name from an author tag as special rules (HTML comments) are not required anymore.

Regex now extracts name, instead of looking for forbidden characters.
Now spaces in author's name are forbidden.

private static final String REGEX_NEITHER_ALPHANUMERIC_NOR_SPACE_NOR_DASH =
"[^ a-zA-Z0-9\\-]+";
private static final String REGEX_AUTHOR_NAME_FORMAT = "(^[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]).*";
Copy link
Contributor

@sebastianquek sebastianquek Dec 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .* at the end will match any number of characters that are not line breaks! Do test it with the new test file you created (testFileHtml.html).

@sebastianquek
Copy link
Contributor

One last thing, @yamgent could you add test cases for author names that start and/or end with -?

@yamgent
Copy link
Member Author

yamgent commented Dec 10, 2016

@sebastianquek Okay sure, am working on it.


private static final String REGEX_NEITHER_ALPHANUMERIC_NOR_SPACE =
"[^ a-zA-Z0-9]+";
private static final String REGEX_AUTHOR_NAME_FORMAT = "(^[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intended to fail if author string doesn't start with alphanumeric?
Removing '^' from the regex would simply strip leading non-alphanumeric instead of failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, if we allow @@author matric-num----, which results with the author name being matric-num, @@author --matric-num---- and @@author !@%$&*^matric-num likewise should not fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it in 769d213.

@yamgent
Copy link
Member Author

yamgent commented Dec 10, 2016

@sebastianquek I added test cases for the dashes, do take a look.

@riwu thanks for your input!

@sebastianquek
Copy link
Contributor

Awesome! 👍
As an exercise about equivalence partitions and their associated boundary values, you can also try adding additional test cases. Recall that any number of non-alphanumeric characters before and after the author's name should not cause a failure. 😄

@yamgent
Copy link
Member Author

yamgent commented Dec 11, 2016

@sebastianquek Few more test cases added

@sebastianquek
Copy link
Contributor

LGTM 👍

@sebastianquek sebastianquek merged commit 14cb274 into se-edu:master Dec 13, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants