invert the regex matching contentline values to also allow emojis by N-Coder · Pull Request #227 · ics-py/ics-py

N-Coder · 2020-02-15T16:20:16Z

contributed by @Azhrei:
"Inverting the REs and selecting the characters I don't want vs. those that I do, not only does the list get shorter but the other Unicode characters are implicitly allowed."

Fixes #206, fixes #211
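
To make the idea concrete, here is a minimal sketch of allow-list versus inverted (deny-list) matching. Both patterns are assumptions for illustration: the allow-list is a hypothetical one that stops at the Basic Multilingual Plane, and the deny-list follows the control-character ranges the RFC excludes from values. Neither is copied from the project's grammar.

```python
import re

# Hypothetical allow-list: printable ASCII plus the Basic Multilingual Plane only.
# Anything above U+FFFF (most emoji) is silently not covered.
ALLOW_LIST = re.compile(r"[\x20-\x7E\u0080-\uFFFF]*")

# Inverted form: name only the characters that are NOT wanted (control
# characters except the tab \x09); every other code point is implicitly allowed.
DENY_LIST = re.compile(r"[^\x00-\x08\x0A-\x1F\x7F]*")

summary = "Birthday party \U0001F389"  # ends with an emoji outside the BMP

allow_ok = ALLOW_LIST.fullmatch(summary) is not None
deny_ok = DENY_LIST.fullmatch(summary) is not None
print(allow_ok, deny_ok)
```

The inverted class is shorter and does not need to be extended whenever a new Unicode block becomes relevant.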


N-Coder · 2020-02-15T16:39:14Z

We should also add testcases to make sure we have no regressions regarding unicode / emoji / weird characters support in values. I'll merge once that is covered.

N-Coder · 2020-02-16T15:39:21Z

Here are some links that could be useful for testing:
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
https://unicode.org/emoji/charts/emoji-style.txt
https://unicode.org/Public/emoji/latest/emoji-test.txt
http://www.madore.org/~david/misc/unitest/
https://character.construction/az
https://github.com/janlelis/unicode-emoji/blob/master/README.md
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/TeX.txt
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/combining-keycap.txt
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/digraphs.txt
https://www.cl.cam.ac.uk/~mgk25/ucs/examples/postscript-utf-8.txt

N-Coder · 2020-02-21T21:47:50Z

So I've added some test files with a lot of emoji data resulting in pretty long contentlines, but I don't have the slightest idea why tatsu fails to handle those in the CI tests.

Azhrei · 2020-02-21T23:52:00Z

It appears to be because the text used for the test has embedded new lines and the CRLF rule is looking for CR-LF, so the rule never matches.

I don't understand why it's looking for CR-LF in the first place, since Python 3 converts CR-LF into just \n when reading a file in text mode... However, if someone manages to get \r\n into the parser somehow, that CR will need to be ignored, so just changing the CRLF rule to \n probably isn't the right solution overall...

N-Coder · 2020-02-23T13:35:48Z

I'm also wondering why it is looking for a line break at that specific place. It kind of looks like a bug in Tatsu. I'm stuck here, sorry.

Azhrei · 2020-02-23T14:12:08Z

Yeah, I don’t know tatsu at all. When I get a chance, I’ll try replacing CRLF with just LF and see if that works. It’ll fail when both chars are actually there, but I expect that’ll be rare anyway (and good enough for my case!).

Ideally, the CR should be optional in front of LF, I just don’t know how to do that. Normally, I’d do that in the tokenizer, not the parser. Are they one and the same with tatsu? (See? I don’t know anything about it!)
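
An optional CR in front of LF can be expressed as the regular expression `\r?\n`. The sketch below shows it with Python's `re` module; `split_lines` is a hypothetical helper, not part of ics-py. TatSu grammars can contain regular-expression patterns, so a similar expression could in principle go into the CRLF rule, but that is untested here.

```python
import re

# "\r?\n" matches a bare LF as well as a CR followed by LF.
LINE_BREAK = re.compile(r"\r?\n")


def split_lines(text: str) -> list:
    """Split text on LF or CRLF and drop the empty tail after the last break."""
    return [line for line in LINE_BREAK.split(text) if line]


mixed = "BEGIN:VEVENT\r\nSUMMARY:Lunch\nEND:VEVENT\r\n"
lines = split_lines(mixed)
print(lines)
```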

Thanks for getting it this far! I may have time to play with it again late in the week...

N-Coder · 2020-02-23T18:45:34Z

Oh yeah, I forgot that python translates newlines. Anyways, the newlines within the description are escaped ones, i.e. a backslash followed by an 'n' and not an actual newline.
The code that does the parsing is the following:

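from pathlib import Path

import tatsu

CRLF = '\r\n'
# Compile the EBNF grammar that sits next to this module.
grammar_path = Path(__file__).parent.joinpath('contentline.ebnf')
with open(grammar_path) as fd:
    GRAMMAR = tatsu.compile(fd.read())
# [...]
try:
    # The trailing CRLF is appended here; it is not part of the input line.
    ast = GRAMMAR.parse(line + CRLF)
except tatsu.exceptions.FailedToken:
    raise ParseError()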

So the only carriage return (followed by the only line feed) in the content line is actually appended by us. I'm slowly wondering whether using a whole EBNF engine just for the separation of content lines having the format KEY;PARAM=PARAM_VALUE:VALUE is actually worth it...
Maybe we could reuse the parser from icalendar instead or revert back to our own parser that previously directly built on regexes? Or maybe we could try another PEG engine? Tatsu has the advantage of mostly following the EBNF specification, but I also read people reporting it to be pretty slow...
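
For comparison, a regex-only split of the KEY;PARAM=PARAM_VALUE:VALUE format could look roughly like the sketch below. It is an illustration under simplifying assumptions, not the project's earlier parser: `ContentLine`, `parse_contentline` and all pattern names are hypothetical, empty parameter values are dropped, and no unescaping or validation of the value is done.

```python
import re
from typing import Dict, List, NamedTuple


class ContentLine(NamedTuple):
    name: str
    params: Dict[str, List[str]]
    value: str


NAME = r"[A-Za-z0-9-]+"
# A parameter value is either double-quoted or free of the separators ";:,".
PVALUE = r'(?:"[^"]*"|[^";:,]*)'
PVALUES = rf"{PVALUE}(?:,{PVALUE})*"

LINE_RE = re.compile(rf"(?P<name>{NAME})(?P<params>(?:;{NAME}={PVALUES})*):(?P<value>.*)")
PARAM_RE = re.compile(rf";(?P<key>{NAME})=(?P<values>{PVALUES})")
PVALUE_RE = re.compile(r'"[^"]*"|[^";:,]+')


def parse_contentline(line: str) -> ContentLine:
    """Split one unfolded content line into name, parameters and value."""
    match = LINE_RE.fullmatch(line)
    if match is None:
        raise ValueError(f"not a valid content line: {line!r}")
    params = {
        pm["key"]: [v.strip('"') for v in PVALUE_RE.findall(pm["values"])]
        for pm in PARAM_RE.finditer(match["params"])
    }
    return ContentLine(match["name"], params, match["value"])


parsed = parse_contentline('ATTENDEE;CN="Doe, John";ROLE=CHAIR:mailto:john@example.com')
print(parsed)
```

The first colon outside a quoted parameter value ends the parameter section, so colons inside the value (as in `mailto:`) are left alone.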

CRs and CRLFs (\r\n) are converted to LF (\n) by Python's universal newline mode when reading a file [1]. Even though the RFC [2] says the ics files should be using CRLF, we should simply ignore the '\r' when working with Python strings.
[1] https://docs.python.org/3/library/functions.html#open-newline-parameter
[2] https://www.ietf.org/rfc/rfc2445.html#page-15
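
The translation can be observed with a temporary file, as in the short sketch below. The file content and variable names are made up for the demonstration. Note that the translation only happens when reading a file in text mode with the default `newline` argument; a string handed to the parser directly keeps any '\r' it contains.

```python
import os
import tempfile

# Write a file with CRLF line endings as raw bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".ics") as tmp:
    tmp.write(b"BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n")
    path = tmp.name

try:
    # Default text mode: universal newlines turn "\r\n" into "\n".
    with open(path, encoding="utf-8") as fd:
        translated = fd.read()
    # newline="" disables the translation and keeps the original line endings.
    with open(path, encoding="utf-8", newline="") as fd:
        untranslated = fd.read()
finally:
    os.remove(path)

print(repr(translated))
print(repr(untranslated))
```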

the horizontal tab \x09 is a legal whitespace (WSP) character and may appear in content line values without escaping

N-Coder · 2020-02-23T19:43:50Z

Ahhhh I fixed it!! We forgot to include the tab in our regex, see a060afe. The RFC doesn't differentiate between "space" \x20 and "horizontal tab" \x09 as whitespace ("WSP") characters, so the latter also needs no escaping when appearing in content line values. But we accidentally excluded it along with all the other control characters. It also seems that this wasn't handled in the code previously: before 0478f61 we only matched the space, not the tab character.
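
The effect of the missing tab can be reproduced with two character classes. Both patterns below are assumptions written for this illustration, not the exact expressions from the grammar before and after a060afe.

```python
import re

# Assumed "too strict" class: rejects every control character, tab included.
TOO_STRICT = re.compile(r"[^\x00-\x1F\x7F]*")
# Assumed corrected class: leaves \x09 out of the excluded range.
WITH_TAB = re.compile(r"[^\x00-\x08\x0A-\x1F\x7F]*")

value = "Room 12\tBring snacks"

strict_ok = TOO_STRICT.fullmatch(value) is not None
fixed_ok = WITH_TAB.fullmatch(value) is not None

# Which of the control characters below \x20 does the corrected class accept?
accepted_controls = [c for c in map(chr, range(0x20)) if WITH_TAB.fullmatch(c)]
print(strict_ok, fixed_ok, accepted_controls)
```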

Azhrei · 2020-02-23T21:19:45Z

"Anyways, the newlines within the description are escaped ones, i.e. a backslash followed by an 'n' and not an actual newline."

Hm, interesting. I guess it makes sense, since the ics file will depict a multiline description that way, I just hadn't thought about it.
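
A small sketch of the difference between an escaped newline and a real one: the raw value contains a backslash followed by 'n', and only unescaping turns it into a line break. `unescape_text` is a hypothetical helper covering the text escapes the RFC defines (backslash, semicolon, comma and newline); it is not the function ics-py uses.

```python
import re

# Escape sequences as they appear in a raw text value and what they stand for.
_ESCAPES = {"\\": "\\", ";": ";", ",": ",", "n": "\n", "N": "\n"}


def unescape_text(value: str) -> str:
    """Replace backslash escapes in a content line text value."""
    return re.sub(r"\\([\\;,nN])", lambda m: _ESCAPES[m.group(1)], value)


raw = r"Line one\nLine two\, still line two"
unescaped = unescape_text(raw)
print("\n" in raw, "\n" in unescaped)
```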

"Ahhhh I fixed it!!"

Good catch! I never read through the RFC so didn't think to compare it against the actual grammar being used.

I look forward to giving it a shot when I get a chance! Thanks for being persistent! 😉

C4ptainCrunch

👌
Should I merge it? Or do you want to wait? (You can also merge it yourself 🙂)

C4ptainCrunch · 2020-02-29T13:39:11Z

"I'm slowly wondering whether using a whole EBNF engine just for the separation of content lines having the format KEY;PARAM=PARAM_VALUE:VALUE is actually worth it..."

It might indeed have been overkill 😞
Tatsu was indeed chosen for its EBNF-like grammar, which stays close to the RFC, and it might not have been a great choice. Maybe we could schedule the change for something like ics 0.9 :)

N-Coder · 2020-02-29T17:17:31Z

"Maybe we could schedule the change for something like ics 0.9 :)"

I guess we can leave it in now and as long as it isn't causing more problems than it is solving there's no need to invest much time. Being able to re-use the grammar from the RFC is actually quite nice.

invert the regex matching contentline values to also allow emojis

0478f61


N-Coder mentioned this pull request Feb 15, 2020

Remove arrow, transition to attrs and add Timespan #222

Merged

add UTF-8 and emoji test data and try to parse all test fixtures

d9923b3

add test data for long descriptions

3ab3ddc

N-Coder added 4 commits February 23, 2020 19:57

add fixture for spaces and tabs in long descriptions

a02978b

fix tab character being ignored by ebnf grammar

a060afe

the horizontal tab \x09 is a legal whitespace (WSP) character and may appear in content line values without escaping

fix missing trailing newlines in romeo and juliet

56de393

N-Coder added this to the Version 0.7 milestone Feb 25, 2020

C4ptainCrunch approved these changes Feb 29, 2020


N-Coder merged commit b206dd2 into master Feb 29, 2020

N-Coder deleted the fix-emoji-values branch February 29, 2020 17:18

N-Coder mentioned this pull request Mar 1, 2020

[POC] Create a corpus of various ics files for more robust tests #236

Closed

N-Coder merged 7 commits into master from fix-emoji-values
