
added urlrewrite plugin for rmz.cr (rapidmoviez.com|rapidmoviez.eu) #1879

Merged
liiight merged 11 commits into Flexget:develop from thawn:develop
Jun 20, 2017

Conversation

@thawn (Contributor) commented Jun 19, 2017

Motivation for changes:

rmz.cr needs a urlrewrite plugin for its RSS feed (rmz.cr/feed) because the feed only provides links back to the rmz.cr release page.

Detailed changes:

rmz.cr (rapidmoviez.com) urlrewriter
Version 0.1

Configuration

rmz:
  url_re:
    - domain\.com
    - domain2\.org

Only links that match one of the regular expressions listed under url_re are added.

If more than one valid link is found, the url of the entry is rewritten to
the first link found. The complete list of valid links is placed in the
'urls' field of the entry.

Therefore, it is recommended that you configure your output to use the
'urls' field instead of the 'url' field.

For example, to use JDownloader 2 as output, you would use the exec plugin:
exec:
  - echo "text={{urls}}" >> "/path/to/jd2/folderwatch/{{title}}.crawljob"

Config usage if relevant (new plugin or updated schema):

    rmz:
      url_re:
        - domain\.com
        - domain2\.org

Log and/or tests output (preferably both):

DEBUG  rmz  TASK  Using links_re filters: <list of filter expressions>
DEBUG  rmz  TASK  Url: "<link found at entry url>" matched filter: <expression>

INFO      rmz  TASK  Found 2 links at <entry url>

To Do:

  • maybe extend this to work also as a search plugin for rmz.cr

                        accept = True
                        log.debug('Url: "%s" matched filter: %s', urls[i], regexp)
                        break
                if not accept:

Contributor:

Since you break when finding a match, you could just use for-else instead of the accept variable.
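The suggested for-else pattern can be sketched like this (a minimal standalone example, not the plugin's actual code; `keep_url` is a hypothetical name):

```python
import re

def keep_url(url, regexps):
    # A for loop's else branch runs only when the loop finishes without
    # hitting break, which makes the manual `accept` flag unnecessary.
    for regexp in regexps:
        if re.search(regexp, url):
            break  # matched: skip the else branch
    else:
        # no regexp matched (or the filter list was empty)
        return False
    return True
```

With an empty filter list the loop body never runs, so the else branch fires and the URL is rejected, just as the original `accept = False` initialisation would.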

Contributor (Author):

Thanks for the tip, I'll try that.

    schema = {
        'type': 'object',
        'properties': {
            'links_re': {'type': 'array', 'items': {'format': 'regexp'}}

Member:

  1. There's a mismatch between this value and the one in the doc (url_re).
  2. I'm not sure why this is needed. Don't you know the TLD and format you need for the rewrite? Why does the user need to input this? URL rewrite plugins do not need to rely on config.

Contributor (Author):

  1. I'll fix that asap.
  2. The link_re config parameter works just like the link_re parameter in the html input plugin (that's why I renamed it).
    The reason for using this parameter is that on the rmz.cr site, they often offer links to several filehosters. This parameter allows the user to choose which filehosters they prefer. For example, the user may have an account with one of the filehosters, or may prefer some filehosters because they offer superior download speed or no file-size restrictions for free users.

Member:

ah, gotcha. rmz.cr is an indexing site and the links are to external sites.
maybe name it filehoster instead then?

Contributor (Author):

good idea, done in the latest commit.

    def url_rewritable(self, task, entry):
        url = entry['url']
        rewritable_regex = '^https?:\/\/(www.)?(rmz\.cr|rapidmoviez\.(com|eu))\/.*'
        if re.match(rewritable_regex, url):

Member:

suggestion:

return re.match(rewritable_regex, url) is not None

            log.debug('Using links_re filters: %s', str(regexps))
        else:
            log.debug('No link filters configured, using all found links.')
        for i in reversed(range(len(urls))):

Member:

i'd use enumerate here

Contributor (Author):

I need to go through the array in reversed order because I am removing some elements within the for loop.

I just tried, and enumerate did not work well with reversed:
when I tried to use reversed(enumerate(urls)), I got the error:
TypeError: argument to reversed() must be a sequence
Then I tried reversed(list(enumerate(urls))) (which is imho at least as ugly as what I used originally), but that gave me the error:
TypeError: list indices must be integers, not tuple

so I gave up and reverted to reversed(range(len(urls)))
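For the record, reversed(list(enumerate(urls))) does work when the loop variable is unpacked into an index/value pair; the "list indices must be integers, not tuple" error suggests the (index, value) tuple itself was used as an index. A small sketch with toy data:

```python
urls = ['a', 'b', 'c']
seen = []
# enumerate() returns an iterator, so reversed() needs list() first;
# each element is an (index, value) tuple and must be unpacked:
for i, url in reversed(list(enumerate(urls))):
    seen.append((i, url))
# seen is [(2, 'c'), (1, 'b'), (0, 'a')]
```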

            urls[i]=urls[i].encode('ascii','ignore')
            if regexps:
                accept = False
                for regexp in regexps:

Member:

this will crash if regexps is None

Contributor:

Look at line 90

Member:

ah ok, maybe just do regexps = self.config.get('links_re', []) and avoid the if statement

Contributor (Author):

o.k. got it to work like this.

                if not accept:
                    log.debug('Url: "%s" does not match any of the given filters: %s', urls[i], str(regexps))
                    del(urls[i])
        numlinks=len(urls)

Member:

use snake case num_links

                    del(urls[i])
        numlinks=len(urls)
        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:

Member:

not pythonic, no need to check if higher than 0, just do:

if numlinks:

        else:
            log.debug('No link filters configured, using all found links.')
        for i in reversed(range(len(urls))):
            urls[i]=urls[i].encode('ascii','ignore')

Contributor:

Lacking some whitespace around equals and after comma.

                if not accept:
                    log.debug('Url: "%s" does not match any of the given filters: %s', urls[i], str(regexps))
                    del(urls[i])
        numlinks=len(urls)

Contributor:

Space

                    del(urls[i])
        numlinks=len(urls)
        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:

Contributor:

Space

        log.info('Found %d links at %s.',numlinks, entry['url'])
        if numlinks>0:
            entry.setdefault('urls', urls)
            entry['url']=urls[0]

Contributor:

space


@event('plugin.register')
def register_plugin():
    plugin.register(UrlRewriteRmz, 'rmz', interfaces=['urlrewriter', 'task'], api_ver=2)
\ No newline at end of file

Contributor:

There should be a newline at the end of the file.

Contributor (Author):

I added a newline a few commits ago and I just checked on my local machine and with the online editor and there is definitely a newline at the end of the file now.

Don't know why this review did not get marked as outdated...

        urls=[]
        for element in link_elements:
            urls.extend(element.text.splitlines())
        regexps = self.config.get('links_re', None)

Contributor:

None is already the default value returned by dict.get().

@thawn (Contributor, Author), Jun 19, 2017:

I replaced this with [] to make sure it does not break my for loop in line 86

    def url_rewrite(self, task, entry):
        try:
            page = task.requests.get(entry['url'])
        except Exception as e:

Contributor:

Too broad

Contributor (Author):

Could you please suggest an error class to use? I looked at the html plugin but they do not catch errors at all.
I tried to use requests.exceptions.RequestException but got an error.

Member:

that is the correct exception. What error did you get? did you import it correctly?

Contributor (Author):

I got this fixed now. I had to add from requests.exceptions import RequestException. Now I am using RequestException instead of Exception.
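The narrowed exception handling might look roughly like this (a sketch: UrlRewritingError here is a stand-in for FlexGet's class of the same name, and a plain requests call replaces task.requests):

```python
import requests
from requests.exceptions import RequestException

class UrlRewritingError(Exception):
    """Stand-in for FlexGet's UrlRewritingError."""

def fetch_page(url):
    try:
        page = requests.get(url)
    except RequestException as e:
        # RequestException is the base class of requests' own errors
        # (ConnectionError, Timeout, HTTPError, ...), so this catches network
        # problems without swallowing unrelated bugs the way a bare
        # `except Exception` would.
        raise UrlRewritingError(str(e))
    return page.text
```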

        try:
            page = task.requests.get(entry['url'])
        except Exception as e:
            raise UrlRewritingError(e)

Contributor:

Probably need to wrap e in str()

Contributor (Author):

done

        try:
            soup = get_soup(page.text)
        except Exception as e:
            raise UrlRewritingError(e)

Contributor:

Probably need to wrap e in str()

            raise UrlRewritingError(e)
        try:
            soup = get_soup(page.text)
        except Exception as e:

Contributor:

This one is also too broad, but since BS4 can raise different exceptions it's fine in this case...

                else:
                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links=len(urls)

Contributor:

space

                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links=len(urls)
        log.info('Found %d links at %s.',num_links, entry['url'])

Contributor:

space after comma

                        break
                else:
                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])

Contributor:

It's a bad idea to iterate over a list and delete elements from it. You should either create a copy of the list or add to another list instead of deleting.

Contributor (Author):

I never experienced any problems when traversing the list in reversed order. Are there any performance implications (although we are talking about 100 entries in the list, tops)?

Contributor (Author):

I read up on the performance implications of del and switched to adding to a new list.
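The hazard the reviewer describes can be shown with toy data: mutating a list while a forward iteration is in progress shifts later elements left, so the element that slides into a deleted slot gets skipped.

```python
# Unsafe: mutate the list while a forward for loop walks it.
items = [1, 2, 2, 3]
for item in items:
    if item == 2:
        items.remove(item)
# The iterator's index advanced past the element that slid into the removed
# slot, so one of the 2s survives: items is now [1, 2, 3].

# Safe: build a new list of the elements to keep (the approach adopted here;
# the plugin filters URL strings, but the pattern is the same).
source = [1, 2, 2, 3]
kept = [item for item in source if item != 2]
# kept is [1, 3]
```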

                    log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
                    del(urls[i])
        num_links = len(urls)
        log.info('Found %d links at %s.',num_links, entry['url'])

Member:

I think log.verbose level would fit here better

Contributor (Author):

changed that and added the space after the comma.

@thawn (Contributor, Author) commented Jun 19, 2017

@cvium @liiight
Thank you so much for the kind code review. I learned a lot today!
I think I now managed to address all the issues you raised. Let me know if there is more that needs to be improved.

Cheers,

Thawn

                log.debug('Url: "%s" does not match any of the given filehoster filters: %s', urls[i], str(regexps))
        if regexps:
            log.debug('Using filehosters_re filters: %s', str(regexps))
        urls=filtered_urls

Member:

Missing spaces. Consider running pep8 on your code. You can also use PyCharm, which is free.

Contributor (Author):

Thanks for the tip. I used autopep8 (via eclipse).

liiight merged commit 5ae77ef into Flexget:develop Jun 20, 2017
@liiight (Member) commented Jun 20, 2017

Thanks, care to update the wiki?

@thawn (Contributor, Author) commented Jun 20, 2017

@liiight Sure, I'll write a wiki page and link to it under the "Data operations" section next to the generic urlrewrite plugin.

Thanks again for your help with improving the code!

@thawn (Contributor, Author) commented Jun 20, 2017

Wiki page is done:
https://www.flexget.com/Plugins/rmz

and linked from the plugins page (section Modification/Data Operations):
https://flexget.com/Plugins#modification


3 participants