Skip to content

invert the regex matching contentline values to also allow emojis#227

Merged
N-Coder merged 7 commits intomasterfrom
fix-emoji-values
Feb 29, 2020
Merged

invert the regex matching contentline values to also allow emojis#227
N-Coder merged 7 commits intomasterfrom
fix-emoji-values

Conversation

@N-Coder
Copy link
Copy Markdown
Member

@N-Coder N-Coder commented Feb 15, 2020

contributed by @Azhrei:
"Inverting the REs and selecting the characters I don't want vs. those that I do, not only does the list get shorter but the other Unicode characters are implicitly allowed."

Fixes #206, fixes #211

contributed by @Azhrei:
"Inverting the REs and selecting the characters I don't want vs. those that I do, not only does the list get shorter but the other Unicode characters are implicitly allowed."

Fixes #206, #211
@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 15, 2020

We should also add testcases to make sure we have no regressions regarding unicode / emoji / weird characters support in values. I'll merge once that is covered.

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 21, 2020

So I've added some test files with a lot of emoji data resulting in pretty long contentlines, but I don't have the slightest idea why tatsu fails to handle those in the CI tests.

@Azhrei
Copy link
Copy Markdown

Azhrei commented Feb 21, 2020

It appears to be because the text used for the test has embedded new lines and the CRLF rule is looking for CR-LF, so the rule never matches.

I don't understand why it's looking for CR-LF in the first case, since Python3 will always convert CR-LF into just \n on input... However, if someone manages to get \r\n into the parser somehow, that CR will need to be ignored, so just changing the CRLF rule to \n probably isn't the right solution overall...

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 23, 2020

I'm also wondering why it is looking for a line break at that specific place. It kind of looks like a bug in Tatsu. I'm stuck here, sorry.

@Azhrei
Copy link
Copy Markdown

Azhrei commented Feb 23, 2020

Yeah, I don’t know tatsu at all. When I get a chance, I’ll try replacing CRLF with just LF and see if that works. It’ll fail when both chars are actually there, but I expect that’ll be rare anyway (and good enough for my case!).

Ideally, the CR should be optional in front of LF, I just don’t know how to do that. Normally, I’d do that in the tokenizer, not the parser. Are they one and the same with tatsu? (See? I don’t know anything about it!)

Thanks for getting it this far! I may have time to play with it again late in the week...

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 23, 2020

Oh yeah, I forgot that python translates newlines. Anyways, the newlines within the description are escaped ones, i.e. a backslash followed by an 'n' and not an actual newline.
The code that does the parsing is the following:

CRLF = '\r\n'
grammar_path = Path(__file__).parent.joinpath('contentline.ebnf')
with open(grammar_path) as fd:
    GRAMMAR = tatsu.compile(fd.read())
# [...]
try:
    ast = GRAMMAR.parse(line + CRLF)
except tatsu.exceptions.FailedToken:
    raise ParseError()

So the only carriage return (followed by the only line feed) in the content line is actually appended by us. I'm slowly wondering whether using a whole EBNF engine just for the separation of content lines having the format KEY;PARAM=PARAM_VALUE:VALUE is actually worth it...
Maybe we could reuse the parser from icalendar instead or revert back to our own parser that previously directly built on regexes? Or maybe we could try another PEG engine? Tatsu has the advantage of mostly following the EBNF specification, but I also read people reporting it to be pretty slow...

CRs and CRLFs (\r\n) are converted to LF (\n) by python's universal
newline mode when reading a file [1].
Even though the RFC [2] says the ics files should be using CRLF, we
should simply ignore the '\r' when working with python strings.

[1] https://docs.python.org/3/library/functions.html#open-newline-parameter
[2] https://www.ietf.org/rfc/rfc2445.html#page-15
the horizontal tab \x09 is a legal whitespace (WSP) character and may
appear in content line values without escaping
@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 23, 2020

Ahhhh I fixed it!! We forgot to include the tab in our regex, see a060afe. The RFC doesn't differentiate between "space" \x20 and "horizontal tab" \x09 as whitespace ("WSP") characters, so the latter also needs no escaping when appearing in content line values. But we accidentally excluded it with all the other control caracters. It also seems as if that wasn't handled in the code previously, before 0478f61 we also only matched space but no tab character.

@Azhrei
Copy link
Copy Markdown

Azhrei commented Feb 23, 2020

Anyways, the newlines within the description are escaped ones, i.e. a backslash followed by an 'n' and not an actual newline.

Hm, interesting. I guess it makes sense, since the ics file will depict a multiline description that way, I just hadn't thought about it.

Ahhhh I fixed it!!

Good catch! I never read through the RFC so didn't think to compare it against the actual grammar being used.

I look forward to giving it a shot when I get a chance! Thanks for being persistent! 😉

@N-Coder N-Coder added this to the Version 0.7 milestone Feb 25, 2020
Copy link
Copy Markdown
Member

@C4ptainCrunch C4ptainCrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌
Should i merge it ? Or do you want to wait ? (you can also merge it yourself 🙂 )

@C4ptainCrunch
Copy link
Copy Markdown
Member

I'm slowly wondering whether using a whole EBNF engine just for the separation of content lines having the format KEY;PARAM=PARAM_VALUE:VALUE is actually worth it...

It might indeed have been overkill 😞
The choice of Tatsu was indeed the EBNF-like grammar that was close the the RFC and it might not have been a great choice. Maybe we could schedule the change for something like ics 0.9 :)

@N-Coder
Copy link
Copy Markdown
Member Author

N-Coder commented Feb 29, 2020

Maybe we could schedule the change for something like ics 0.9 :)

I guess we can leave it in now and as long as it isn't causing more problems than it is solving there's no need to invest much time. Being able to re-use the grammar from the RFC is actually quite nice.

@N-Coder N-Coder merged commit b206dd2 into master Feb 29, 2020
@N-Coder N-Coder deleted the fix-emoji-values branch February 29, 2020 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New grammar approach doesn't allow for some Unicode chars ParseError if emojis/unicode in description/text

3 participants