Skip to content

imdb_watchlist plugin fixed issue #2035#2050

Merged
liiight merged 9 commits intoFlexget:developfrom
thawn:develop
Dec 31, 2017
Merged

imdb_watchlist plugin fixed issue #2035#2050
liiight merged 9 commits intoFlexget:developfrom
thawn:develop

Conversation

@thawn
Copy link
Copy Markdown
Contributor

@thawn thawn commented Dec 30, 2017

Motivation for changes:

imdb_watchlist plugin did not return any entries any more. See issue #2035 .

Detailed changes:

  • added new function to parse the new JSON code used by a new React widget. The new widget was introduced by imdb on watchlist pages early December 2017. Interestingly, the react widget uses an openly accessible api that returns JSON code which is easier to parse than the html code
  • updated the existing code to parse other user lists, where imdb changed the layout (but does not use the react widget)

Addressed issues:

Log and/or tests output (preferably both):

new JSON parser:

2017-12-30 15:56 VERBOSE  imdb_watchlist IMPORT_TEST_W   Retrieving imdb list: watchlist
2017-12-30 15:56 DEBUG    imdb_watchlist IMPORT_TEST_W   Requesting: http://www.imdb.com/user/ur25395647/watchlist {u'Accept-Language': u'en-us'}
2017-12-30 15:56 DEBUG    utils.requests IMPORT_TEST_W   Fetching URL http://www.imdb.com/user/ur25395647/watchlist with args () and kwargs {'headers': {u'Accept-Language': u'en-us'}, 'allow_redirects': True, 'params': {u'sort': u'list_order%2Casc', u'title_type': u'tvMiniSeries', u'view': u'detail'}, u'timeout': 30}
2017-12-30 15:56 DEBUG    utils.requests IMPORT_TEST_W   Fetching URL http://www.imdb.com/title/data with args () and kwargs {'allow_redirects': True, 'params': {u'ids': u'tt3959576,tt3847626'}, u'timeout': 30}
2017-12-30 15:56 VERBOSE  imdb_watchlist IMPORT_TEST_W   imdb list contains 2 items
2017-12-30 15:56 DEBUG    imdb_watchlist IMPORT_TEST_W   First entry (imdb id: tt3959576) looks like this: {u'starbar': {u'aggregate': 6.7, u'rating': 0, u'votes': 931, u'id': u'tt3959576', u'auth': u''}, u'ribbon': {u'inWL': False, u'tconst': u'tt3959576'}, u'title': {u'plot': u'A policewoman sets out to discover who murdered her husband, an undercover officer.', u'ratings': {u'rating': 6.7, u'votes': 931, u'canVote': True}, u'movieMeterCurrentRank': 5584, u'primary': {u'href': u'/title/tt3959576', u'year': [u'2015', u''], u'title': u'Black Work'}, u'credits': {u'star': [{u'href': u'/name/nm0809956', u'name': u'Sheridan Smith'}, {u'href': u'/name/nm2439913', u'name': u'Matthew McNulty'}, {u'href': u'/name/nm4359366', u'name': u'Oliver Woollford'}, {u'href': u'/name/nm4835867', u'name': u'Honor Kneafsey'}], u'creator': [{u'href': u'/name/nm4131020', u'name': u'Matt Charman'}]}, u'poster': {u'url': u'https://images-na.ssl-images-amazon.com/images/M/MV5BMDI1YmQ2MGMtMDBkNS00OWViLTlkZTAtY2I4ZmNjZDNlMjAzXkEyXkFqcGdeQXVyMjExMjk0ODk@._V1_.jpg', u'width': 1064, u'height': 1500}, u'type': u'series', u'id': u'tt3959576', u'metadata': {u'release': 1434844800000, u'genres': [u'Crime', u'Drama', u'Thriller'], u'runtime': 2700, u'numberOfEpisodes': 3, u'certificate': u''}}}

updated html parser:

2017-12-30 15:45 VERBOSE  imdb_watchlist IMPORT_TEST     Retrieving imdb list: ls056853267
2017-12-30 15:45 DEBUG    imdb_watchlist IMPORT_TEST     Requesting: http://www.imdb.com/list/ls056853267 {u'Accept-Language': u'en-us'}
2017-12-30 15:45 DEBUG    utils.requests IMPORT_TEST     Fetching URL http://www.imdb.com/list/ls056853267 with args () and kwargs {'headers': {u'Accept-Language': u'en-us'}, 'allow_redirects': True, 'params': {u'sort': u'list_order%2Casc', u'title_type': u'movie', u'page': 1, u'view': u'detail'}, u'timeout': 30}
2017-12-30 15:45 VERBOSE  imdb_watchlist IMPORT_TEST     imdb list contains 123 items
2017-12-30 15:45 DEBUG    imdb_watchlist IMPORT_TEST     100 items found on page 1
2017-12-30 15:45 DEBUG    imdb_watchlist IMPORT_TEST     Requesting: http://www.imdb.com/list/ls056853267 {u'Accept-Language': u'en-us'}
2017-12-30 15:45 DEBUG    utils.requests IMPORT_TEST     Fetching URL http://www.imdb.com/list/ls056853267 with args () and kwargs {'headers': {u'Accept-Language': u'en-us'}, 'allow_redirects': True, 'params': {u'sort': u'list_order%2Casc', u'title_type': u'movie', u'page': 2, u'view': u'detail'}, u'timeout': 30}
2017-12-30 15:45 DEBUG    imdb_watchlist IMPORT_TEST     23 items found on page 2

To Do:

  • Maybe add the new JSON api to flexget/utils/imdb.py (comments?)

entries = []
for imdb_id in imdb_ids:
entry = Entry()
if not 'title' in json_data[imdb_id] or not 'primary' in json_data[imdb_id]['title'] or not 'href' in json_data[imdb_id]['title']['primary'] or not 'title' in json_data[imdb_id]['title']['primary']:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

120 char limit

if 'type' in json_data[imdb_id]['title']:
imdb_type = json_data[imdb_id]['title']['type']
else:
imdb_type = ''
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should just initialize this to an empty string at the start of the loop

total_item_count = len(json_vars['list']['items'])
else:
total_item_count = 0
if total_item_count == 0:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not total_item_count is more pythonic

if 'list' in json_vars and 'items' in json_vars['list']:
total_item_count = len(json_vars['list']['items'])
else:
total_item_count = 0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just initialize this to 0 above the if.

page = self.fetch_page(task, url, params, headers)
# log.debug(page.text)
re_result = re.search('IMDbReactInitialState.push\((.+?)\);\n', page.text)
json_text = re_result.group(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash if it doesn't find a match. Should be wrapped in a try-except and spit out a suitable error message informing the user that the plugin may be broken

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, cvium for the thorough code review!
I left line 121 out of the try-except, since the fetch_page function has its own code that handles HTTPErrors or wrong http status codes.

@thawn
Copy link
Copy Markdown
Contributor Author

thawn commented Dec 30, 2017

Thanks, cvium for the thorough code review!
I think I managed to implement the suggested changes.

params = {'ids': ','.join(imdb_ids)}
url = 'http://www.imdb.com/title/data'
try:
page = task.requests.get(url, params=params)
Copy link
Copy Markdown
Contributor

@cvium cvium Dec 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you give it the argument raise_status=True it will raise an exception when the error code is 4xx, 5xx etc. I think that's what you're going for?

And fyi, Requests raises a RequestException, so catch that instead of HTTPError.

if page.status_code != 200:
raise plugin.PluginError('Unable to get imdb title data. Html status code was: %d.' % page.status_code)
json_data = json.loads(page.text)
log.verbose('imdb list contains %d items' % len(json_data))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All logging calls should use implicit string interpolation ie. the logger takes an arbitrary number of arguments that it will use for string interpolation. As an example, the above line should be log.verbose('imdb list contains %d items', len(json_data)) (notice the comma).


items = soup.find_all(attrs={'data-item-id': True, 'class': 'list_item'})
items = soup.find_all('div', class_='lister-item')
if len(items) == 0:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not items is the pythonic way

page = self.fetch_page(task, url, params, headers)
try:
re_result = re.search('IMDbReactInitialState.push\((.+?)\);\n', page.text)
json_text = re_result.group(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any need for an explicit variable here. You can just change the line below to json.loads(re_result.group(1)).

raise plugin.PluginError(
'Unable to get imdb list from imdb react widget.' +
' Either the list is empty or the imdb parser plugin is broken.' +
' Original error: %s.' % e.args[0])
Copy link
Copy Markdown
Contributor

@cvium cvium Dec 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think str(e) is better. You lose some information with e.args[0] (like what kind of exception it is). You will need to add from builtins import * # noqa pylint: disable=unused-import, redefined-builtin to the list of imports though for compatibility.

return
imdb_ids = []
for item in json_vars['list']['items']:
if is_valid_imdb_title_id(item['const']):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will crash if const is not present. Better to use dict.get() in this case.

try:
page = task.requests.get(url, params=params)
except HTTPError as e:
raise plugin.PluginError(e.args[0])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str(e)

Copy link
Copy Markdown
Contributor

@cvium cvium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it's pretty good. Just a few corrections needed.

@thawn
Copy link
Copy Markdown
Contributor Author

thawn commented Dec 30, 2017

I think I got it now. Also checked entire code again for non-pythonic mishaps ;-).

if 'all' not in config['type']:
title_types = []
for title_type in config['type']:
title_types.append(TITLE_TYPE_MAP[title_type])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a style suggestion, list comprehension fits here well:

title_types = [TITLE_TYPE_MAP[title_type] for title_type in config['type']

url = 'http://www.imdb.com/title/data'
page = self.fetch_page(task, url, params, headers)
try:
json_data = json.loads(page.text)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use .json() on a returned Response

for imdb_id in imdb_ids:
imdb_type = ''
entry = Entry()
if not ('title' in json_data[imdb_id] and
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant parentheses

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deliberately added the parentheses here in order to be able to break the long if statement into several lines (as suggested by cvium).
Also: AFAIK, I would have to change this to if not ... or not ... or not ... : in order to remove the parentheses and still get the same result. Therefore, the parentheses save me several additional negations.
For these reasons, I would like to keep the parentheses if that is o.k.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

log.debug('First entry (imdb id: %s) looks like this: %s', imdb_ids[0], json_data[imdb_ids[0]])
entries = []
for imdb_id in imdb_ids:
imdb_type = ''
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it makes sense to init this to an empty string?

@thawn
Copy link
Copy Markdown
Contributor Author

thawn commented Dec 31, 2017

Thanks, @liiight for the thorough code review!
I think I managed to implement the suggested changes (except for the parentheses see my answer to your code comment, why I would like to keep them if that is o.k.).

@liiight liiight merged commit be381aa into Flexget:develop Dec 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants