invert the regex matching contentline values to also allow emojis#227
invert the regex matching contentline values to also allow emojis#227
Conversation
|
We should also add testcases to make sure we have no regressions regarding unicode / emoji / weird characters support in values. I'll merge once that is covered. |
|
So I've added some test files with a lot of emoji data resulting in pretty long contentlines, but I don't have the slightest idea why tatsu fails to handle those in the CI tests. |
|
It appears to be because the text used for the test has embedded new lines and the CRLF rule is looking for CR-LF, so the rule never matches. I don't understand why it's looking for CR-LF in the first case, since Python3 will always convert CR-LF into just |
|
I'm also wondering why it is looking for a line break at that specific place. It kind of looks like a bug in Tatsu. I'm stuck here, sorry. |
|
Yeah, I don’t know tatsu at all. When I get a chance, I’ll try replacing CRLF with just LF and see if that works. It’ll fail when both chars are actually there, but I expect that’ll be rare anyway (and good enough for my case!). Ideally, the CR should be optional in front of LF, I just don’t know how to do that. Normally, I’d do that in the tokenizer, not the parser. Are they one and the same with tatsu? (See? I don’t know anything about it!) Thanks for getting it this far! I may have time to play with it again late in the week... |
|
Oh yeah, I forgot that python translates newlines. Anyways, the newlines within the description are escaped ones, i.e. a backslash followed by an 'n' and not an actual newline. CRLF = '\r\n'
grammar_path = Path(__file__).parent.joinpath('contentline.ebnf')
with open(grammar_path) as fd:
GRAMMAR = tatsu.compile(fd.read())
# [...]
try:
ast = GRAMMAR.parse(line + CRLF)
except tatsu.exceptions.FailedToken:
raise ParseError()So the only carriage return (followed by the only line feed) in the content line is actually appended by us. I'm slowly wondering whether using a whole EBNF engine just for the separation of content lines having the format |
CRs and CRLFs (\r\n) are converted to LF (\n) by python's universal newline mode when reading a file [1]. Even though the RFC [2] says the ics files should be using CRLF, we should simply ignore the '\r' when working with python strings. [1] https://docs.python.org/3/library/functions.html#open-newline-parameter [2] https://www.ietf.org/rfc/rfc2445.html#page-15
the horizontal tab \x09 is a legal whitespace (WSP) character and may appear in content line values without escaping
|
Ahhhh I fixed it!! We forgot to include the tab in our regex, see a060afe. The RFC doesn't differentiate between "space" |
Hm, interesting. I guess it makes sense, since the ics file will depict a multiline description that way, I just hadn't thought about it.
Good catch! I never read through the RFC so didn't think to compare it against the actual grammar being used. I look forward to giving it a shot when I get a chance! Thanks for being persistent! 😉 |
It might indeed have been overkill 😞 |
I guess we can leave it in now and as long as it isn't causing more problems than it is solving there's no need to invest much time. Being able to re-use the grammar from the RFC is actually quite nice. |
contributed by @Azhrei:
"Inverting the REs and selecting the characters I don't want vs. those that I do, not only does the list get shorter but the other Unicode characters are implicitly allowed."
Fixes #206, fixes #211