Skip to content

3.7 regex fix#2162

Merged
cvium merged 27 commits intoFlexget:developfrom
tobinjt:3.7_regex_fix
Sep 27, 2018
Merged

3.7 regex fix#2162
cvium merged 27 commits intoFlexget:developfrom
tobinjt:3.7_regex_fix

Conversation

@tobinjt
Copy link
Copy Markdown
Contributor

@tobinjt tobinjt commented Jul 1, 2018

Motivation for changes:

Fix breakage with Python 3.7.

Detailed changes:

https://docs.python.org/3.7/library/re.html#re.sub
In Python 3.6 onwards, the replacement text can't contain \w or any similar regex sequences. Use re.split and join instead.

Addressed issues:

Without this change this crash occurs:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/flexget/task.py", line 486, in __run_plugin
    return method(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flexget/event.py", line 23, in __call__
    return self.func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flexget/plugins/filter/all_series.py", line 48, in on_task_metainfo
    if guess_entry(entry, config=group_settings):
  File "/usr/local/lib/python3.7/site-packages/flexget/plugins/metainfo/series.py", line 47, in guess_entry
    allow_seasonless=allow_seasonless)
  File "/usr/local/lib/python3.7/site-packages/flexget/plugins/parsers/plugin_parsing.py", line 74, in parse_series
    return parser.parse_series(data, name=name, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flexget/plugins/parsers/parser_internal.py", line 47, in parse_series
    parser.parse(data)
  File "/usr/local/lib/python3.7/site-packages/flexget/utils/titles/series.py", line 225, in parse
    name_to_re(name, self.ignore_prefixes, self) for name in [self.name] + self.alternate_names)
  File "/usr/local/lib/python3.7/site-packages/flexget/utils/tools.py", line 205, in __init__
    list.__init__(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/flexget/utils/titles/series.py", line 225, in <genexpr>
    name_to_re(name, self.ignore_prefixes, self) for name in [self.name] + self.alternate_names)
  File "/usr/local/lib/python3.7/site-packages/flexget/plugins/parsers/parser_common.py", line 85, in name_to_re
    res = re.sub(' +', blank + '*', res, re.UNICODE)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 309, in _subx
    template = _compile_repl(template, pattern)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/re.py", line 300, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/sre_parse.py", line 1024, in parse_template
    raise s.error('bad escape %s' % this, len(this))
re.error: bad escape \w at position 5

To Do:

3 tests still fail with similar errors, but tracking them down is hard because an exception is raised when handling the exception :(

To make debugging easier, log contents of fetched URLs at trace level.
Browsers process Javascript, handle iframes, and more that flexget
doesn't; seeing the actual data processed by flexget rather than looking
at the URL in your browser makes it easier to figure out why you're not
getting any results.
#1558 will systemically handle
this problem.
https://docs.python.org/3.7/library/re.html#re.sub
In Python 3.6 onwards, the replacement text can't contain \w or any
similar regex sequences.  Use re.split and join instead.
@serhiy-storchaka
Copy link
Copy Markdown

An alternate fix is duplicating backslashes in the replacement string:

res = re.sub(' +', blank.replace('\\', r'\\') + '*', res, flags=re.UNICODE)

Note also that the third positional argument of re.sub() is a maximal count of replacements, not flags. Passing re.UNICODE as a third positional argument you just order re.sub() to make not more than 128 replacements. There is the same bug in other places.

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Jul 6, 2018

I can do that instead - would that be preferred?

Thanks,

@paranoidi
Copy link
Copy Markdown
Member

@tobinjt
Hmm, I think duplicate \'s are preferred, it's proper escaping after all ..

@JohnDoee JohnDoee mentioned this pull request Aug 19, 2018
4 tasks
@JohnDoee
Copy link
Copy Markdown

This is currently blocking 3.7 support and the only issue left discussed seems to be how to style the line.

The proposed fix works as far as I can see and the flags issue highlighted by @serhiy-storchaka should be moved to a different issue.

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Aug 21, 2018

I've changed it to use replace() instead of split() and join().
There are still a small number of test failures but this resolves the vast majority.

https://docs.python.org/3.7/library/re.html#re.sub
In Python 3.6 onwards, the replacement text can't contain \w or any
similar regex sequences.
https://docs.python.org/3/library/re.html#re.escape
"Changed in version 3.7: Only characters that can have special meaning
in a regular expression are escaped."
"/" does not have special meaning in regular expressions so it is no
longer escaped.
@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Aug 25, 2018

Added commits fixing another 2 failing tests, 1 failing test remains.

assert 'd.directory.set=\\/data\\/downloads' in fields
assert ('d.directory.set=\\/data\\/downloads' in fields
# Python 3.7+.
or 'd.directory.set=/data/downloads' in fields)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that additional slash intentional ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python <= 3.6 more characters were escaped by re.escape(), including '/', but in Python >= 3.7 re.escape() only escapes regex metacharacters like '$' and '.'. To make the test pass with Python >= 3.7 we need to look for 'd.directory.set=/data/downloads', with earlier versions we need to look for 'd.directory.set=\\/data\\/downloads'.

I've pushed a new commit that expands the comment, let me know if it's clear enough or it needs more details.

fields = [p for p in called_args[2:]]
assert len(fields) == 3
assert 'd.directory.set=\\/data\\/downloads' in fields
assert ('d.directory.set=\\/data\\/downloads' in fields
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use implementation independent test:

assert ('d.directory.set=' + re.escape('/data/downloads')) in fields

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Aug 25, 2018

Hmm, I don't think those failing tests are my fault:

Build-agent version 0.1.437-d180d4d7 (2018-08-22T16:07:19+0000)
Starting container flexget/cci-python:2.7
  image cache not found on this host, downloading flexget/cci-python:2.7

  Error pulling image flexget/cci-python:2.7: Error response from daemon: received unexpected HTTP status: 503 Service Unavailable... retrying
  image cache not found on this host, downloading flexget/cci-python:2.7

  Error pulling image flexget/cci-python:2.7: Error response from daemon: received unexpected HTTP status: 503 Service Unavailable... retrying
  image cache not found on this host, downloading flexget/cci-python:2.7

How can I trigger those tests again?

@JohnDoee
Copy link
Copy Markdown

@tobinjt The rtorrent test case isn't doing anything regexp related, it is testing the exact data sent to the rtorrent API. That's why I find it all a bit weird and question the slash change.

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Aug 25, 2018

@JohnDoee https://github.com/Flexget/Flexget/blob/develop/flexget/plugins/clients/rtorrent.py#L315
re.escape() is used by the plugin, not by the test, and the test validates that the data is set correctly by the plugin so it needs to change.

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Sep 11, 2018

What's the status on this?

@JohnDoee
Copy link
Copy Markdown

@cvium To me it looks like it can be merged as is.

The rtorrent regex escape stuff is just plain strange and completely unrelated to how rtorrent API works. Might be smart to fix correctly in a different pull request if the tests doesn't pass (although I think they were fixed).

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Sep 12, 2018

@JohnDoee @cvium As far as I know it can be merged - there are no outstanding comments and it fixes a huge number of test failures under Python 3.7.
There are other problems with Python 3.7 but they will need to be fixed separately.

Thanks,

@cvium cvium merged commit a72fd85 into Flexget:develop Sep 27, 2018
@tobinjt tobinjt deleted the 3.7_regex_fix branch September 27, 2018 22:35
cvium added a commit that referenced this pull request Sep 30, 2018
@cvium
Copy link
Copy Markdown
Contributor

cvium commented Sep 30, 2018

@tobinjt @JohnDoee I had to revert these changes. I should've looked at it more carefully. The changes in this PR are definitely not the correct way to handle this.

Changed in version 3.7: Unknown escapes in repl consisting of '\' and an ASCII letter now are errors.

The keyword here is unknown escapes. There are quite a few known escapes that are affected by simply replacing \ with \\ in the repl string.

I think a better solution for the manipulate plugin may be to restrict the use of backslashes in the config schema.

@asm0dey
Copy link
Copy Markdown
Contributor

asm0dey commented Sep 30, 2018

@cvium should I report bug here or IRC is enough?

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Sep 30, 2018

There is no bug (anymore).

@serhiy-storchaka
Copy link
Copy Markdown

As the author of this change in Python 3.7 I found this PR correct. Have I missed something?

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Sep 30, 2018

Well, it broke something that worked previously. This is an excerpt from IRC:

<removed> Hey guys
<removed> Suddenly my replace patterns are broken
<removed> they look like this: replace:
<removed> regexp: 'e(\d{2}-)?(\d{2})'
<removed> format: 'e\2'
<removed> and suddenly I see \2 in names!

I also added a unit test that verifies that the capture groups could be referenced with \1 prior to the changes, but not after.

@serhiy-storchaka
Copy link
Copy Markdown

Ah, if replace_config['format'] can contain group references, it shouldn't be escaped.

@asm0dey
Copy link
Copy Markdown
Contributor

asm0dey commented Sep 30, 2018

@serhiy-storchaka well, then you should update wiki, I think. Because right now Manipulate post on wiki tells that syntax is correct.
@cvium thank you for reverting merge and fixing bug this way, but I suppose one day you'll need to add compatibility with python 3.7.

@asm0dey
Copy link
Copy Markdown
Contributor

asm0dey commented Sep 30, 2018

@serhiy-storchaka oh, I see, it was notice for yourself, not for me.

@tobinjt
Copy link
Copy Markdown
Contributor Author

tobinjt commented Sep 30, 2018

Sorry for the breakage :(

@serhiy-storchaka Is there a function in Python 3.7 that escapes the necessary characters but doesn't touch others? I wasn't able to find one.

@cvium It would be great if there were tests for that functionality. Are there other places where this is likely to break?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants