
added urlrewrite plugin for rmz.cr (rapidmoviez.com|rapidmoviez.eu) #1879

Merged
liiight merged 11 commits into Flexget:develop from thawn:develop
Jun 20, 2017

Conversation

@thawn (Contributor) commented Jun 19, 2017

Motivation for changes:

rmz.cr needs a urlrewrite plugin for its RSS feed (rmz.cr/feed) because the feed only provides links back to the rmz.cr release page.

Detailed changes:

rmz.cr (rapidmoviez.com) urlrewriter
Version 0.1

Configuration

rmz:
  url_re:
    - domain\.com
    - domain2\.org

Only links that match one of the regular expressions listed under url_re are added.

If more than one valid link is found, the url of the entry is rewritten to
the first link found. The complete list of valid links is placed in the
'urls' field of the entry.

Therefore, it is recommended that you configure your output to use the
'urls' field instead of the 'url' field.

For example, to use JDownloader 2 as output, you would use the exec plugin:
exec:
  - echo "text={{urls}}" >> "/path/to/jd2/folderwatch/{{title}}.crawljob"

Config usage if relevant (new plugin or updated schema):

    rmz:
      url_re:
        - domain\.com
        - domain2\.org

Log and/or tests output (preferably both):

DEBUG  rmz  TASK  Using links_re filters: <list of filter expressions>
DEBUG  rmz  TASK  Url: "<link found at entry url>" matched filter: <expression>

INFO      rmz  TASK  Found 2 links at <entry url>

To Do:

  • maybe extend this to work also as a search plugin for rmz.cr

                        accept = True
                        log.debug('Url: "%s" matched filter: %s', urls[i], regexp)
                        break
                if not accept:

Contributor:

Since you break when finding a match, you could just use for-else instead of the accept variable.
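The suggested for-else pattern can be sketched like this (a minimal standalone example, not the plugin's actual code; `keep_url` is a hypothetical name):

```python
import re

def keep_url(url, regexps):
    # A for loop's else branch runs only when the loop finishes without
    # hitting break, which makes the manual `accept` flag unnecessary.
    for regexp in regexps:
        if re.search(regexp, url):
            break  # matched: skip the else branch
    else:
        # no regexp matched (or the filter list was empty)
        return False
    return True
```

With an empty filter list the loop body never runs, so the else branch fires and the URL is rejected, just as the original `accept = False` initialisation would.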

Contributor (Author):

Thanks for the tip, I'll try that.

    schema = {
        'type': 'object',
        'properties': {
            'links_re': {'type': 'array', 'items': {'format': 'regexp'}}

Member:

  1. There's a mismatch between this value and the one in the doc (url_re).
  2. I'm not sure why this is needed. Don't you know the TLD and format you need for the rewrite? Why does the user need to input this? URL rewrite plugins do not need to rely on config.

Contributor (Author):

  1. I'll fix that asap.
  2. The link_re config parameter works just like the link_re parameter in the html input plugin (that's why I renamed it).
    The reason for using this parameter is that on the rmz.cr site, they often offer links to several filehosters. This parameter allows the user to choose which filehosters they prefer. For example, the user may have an account with one of the filehosters, or may prefer some filehosters because they offer superior download speed or no file-size restrictions for free users.

Member:

ah, gotcha. rmz.cr is an indexing site and the links are to external sites.
maybe name it filehoster instead then?

Contributor (Author):

good idea, done in the latest commit.

    def url_rewritable(self, task, entry):
        url = entry['url']
        rewritable_regex = '^https?:\/\/(www.)?(rmz\.cr|rapidmoviez\.(com|eu))\/.*'
        if re.match(rewritable_regex, url):

Member:

suggestion:

return re.match(rewritable_regex, url) is not None

            log.debug('Using links_re filters: %s', str(regexps))
        else:
            log.debug('No link filters configured, using all found links.')
        for i in reversed(range(len(urls))):

Member:

i'd use enumerate here

Contributor (Author):

I need to go through the array in reversed order because I am removing some elements within the for loop.

I just tried, and enumerate did not work well with reversed:
when I tried to use reversed(enumerate(urls)), I got the error:
TypeError: argument to reversed() must be a sequence
Then I tried reversed(list(enumerate(urls))) (which is imho at least as ugly as what I used originally), but that gave me the error:
TypeError: list indices must be integers, not tuple

so I gave up and reverted to reversed(range(len(urls)))
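For the record, reversed(list(enumerate(urls))) does work when the loop variable is unpacked into an index/value pair; the "list indices must be integers, not tuple" error suggests the (index, value) tuple itself was used as an index. A small sketch with toy data:

```python
urls = ['a', 'b', 'c']
seen = []
# enumerate() returns an iterator, so reversed() needs list() first;
# each element is an (index, value) tuple and must be unpacked:
for i, url in reversed(list(enumerate(urls))):
    seen.append((i, url))
# seen is [(2, 'c'), (1, 'b'), (0, 'a')]
```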

            urls[i]=urls[i].encode('ascii','ignore')
            if regexps:
                accept = False
                for regexp in regexps:

Member:

this will crash if regexps is None

Contributor:

Look at line 90

Member:

ah ok, maybe just do regexps = self.config.get('links_re', []) and avoid the if statement

Contributor (Author):

o.k. got it to work like this.

                if not accept:
                    log.debug('Url: "%s" does not match any of the given filters: %s', urls[i], str(regexps))
                    del(urls[i])
        numlinks=len(urls)

Member:

use snake case num_links

                    del(urls[i])
        numlinks=len(urls)
        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:

Member:

not pythonic, no need to check if higher than 0, just do:

if numlinks:

        else:
            log.debug('No link filters configured, using all found links.')
        for i in reversed(range(len(urls))):
            urls[i]=urls[i].encode('ascii','ignore')

Contributor:

Lacking some whitespace around equals and after comma.

                if not accept:
                    log.debug('Url: "%s" does not match any of the given filters: %s', urls[i], str(regexps))
                    del(urls[i])
        numlinks=len(urls)

Contributor:

Space

                    del(urls[i])
        numlinks=len(urls)
        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:

Contributor:

Space

        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:
            entry.setdefault('urls', urls)
            entry['url']=urls[0]

Contributor:

space


@event('plugin.register')
def register_plugin():
    plugin.register(UrlRewriteRmz, 'rmz', interfaces=['urlrewriter', 'task'], api_ver=2)
\ No newline at end of file

Contributor:

There should be a newline at the end of the file.

Contributor (Author):

I added a newline a few commits ago and I just checked on my local machine and with the online editor and there is definitely a newline at the end of the file now.

Don't know why this review did not get marked as outdated...

        urls=[]
        for element in link_elements:
            urls.extend(element.text.splitlines())
        regexps = self.config.get('links_re', None)

Contributor:

None is already the default value returned by dict.get().

@thawn (Contributor, Author), Jun 19, 2017:

I replaced this with [] to make sure it does not break my for loop in line 86

    def url_rewrite(self, task, entry):
        try:
            page = task.requests.get(entry['url'])
        except Exception as e:

Contributor:

Too broad

Contributor (Author):

Could you please suggest an error class to use? I looked at the html plugin but they do not catch errors at all.
I tried to use requests.exceptions.RequestException but got an error.

Member:

that is the correct exception. What error did you get? did you import it correctly?

Contributor (Author):

I got this fixed now. I had to add from requests.exceptions import RequestException. Now I am using RequestException instead of Exception.
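The narrowed exception handling might look roughly like this (a sketch: UrlRewritingError here is a stand-in for FlexGet's class of the same name, and a plain requests call replaces task.requests):

```python
import requests
from requests.exceptions import RequestException

class UrlRewritingError(Exception):
    """Stand-in for FlexGet's UrlRewritingError."""

def fetch_page(url):
    try:
        page = requests.get(url)
    except RequestException as e:
        # RequestException is the base class of requests' own errors
        # (ConnectionError, Timeout, HTTPError, ...), so this catches network
        # problems without swallowing unrelated bugs the way a bare
        # `except Exception` would.
        raise UrlRewritingError(str(e))
    return page.text
```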

        try:
            page = task.requests.get(entry['url'])
        except Exception as e:
            raise UrlRewritingError(e)

Contributor:

Probably need to wrap e in str()

Contributor (Author):

done

        try:
            soup = get_soup(page.text)
        except Exception as e:
            raise UrlRewritingError(e)

Contributor:

Probably need to wrap e in str()

            raise UrlRewritingError(e)
        try:
            soup = get_soup(page.text)
        except Exception as e:

Contributor:

This one is also too broad, but since BS4 can raise different exceptions it's fine in this case...

                else:
                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links=len(urls)

Contributor:

space

                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links=len(urls)
        log.info('Found %d links at %s.',num_links, entry['url'])

Contributor:

space after comma

                        break
                else:
                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])

Contributor:

It's a bad idea to iterate over a list and delete elements from it. You should either create a copy of the list or add to another list instead of deleting.

Contributor (Author):

I never experienced any problems when traversing the list in reversed order. Are there any performance implications (although we are talking about 100 entries in the list, tops)?

Contributor (Author):

I read up on the performance implications of del and switched to adding to a new list.
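The hazard the reviewer describes can be shown with toy data: mutating a list while a forward iteration is in progress shifts later elements left, so the element that slides into a deleted slot gets skipped.

```python
# Unsafe: mutate the list while a forward for loop walks it.
items = [1, 2, 2, 3]
for item in items:
    if item == 2:
        items.remove(item)
# The iterator's index advanced past the element that slid into the removed
# slot, so one of the 2s survives: items is now [1, 2, 3].

# Safe: build a new list of the elements to keep (the approach adopted here;
# the plugin filters URL strings, but the pattern is the same).
source = [1, 2, 2, 3]
kept = [item for item in source if item != 2]
# kept is [1, 3]
```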

                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links = len(urls)
        log.info('Found %d links at %s.',num_links, entry['url'])

Member:

I think log.verbose level would fit here better

Contributor (Author):

changed that and added the space after the comma.

@thawn (Contributor, Author) commented Jun 19, 2017

@cvium @liiight
Thank you so much for the kind code review. I learned a lot today!
I think I now managed to address all the issues you raised. Let me know if there is more that needs to be improved.

Cheers,

Thawn

                log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
        if regexps:
            log.debug('Using filehosters_re filters: %s', str(regexps))
        urls=filtered_urls

Member:

Missing spaces. Consider running pep8 on your code. You can also use PyCharm, which is free.

Contributor (Author):

Thanks for the tip. I used autopep8 (via eclipse).

liiight merged commit 5ae77ef into Flexget:develop Jun 20, 2017
@liiight (Member) commented Jun 20, 2017

Thanks, care to update the wiki?

@thawn (Contributor, Author) commented Jun 20, 2017

@liiight Sure, I'll write a wiki page and link to it under the "Data operations" section next to the generic urlrewrite plugin.

Thanks again for your help with improving the code!

@thawn (Contributor, Author) commented Jun 20, 2017

Wiki page is done:
https://www.flexget.com/Plugins/rmz

and linked from the plugins page (section Modification/Data Operations):
https://flexget.com/Plugins#modification


3 participants