Skip to content

Series performance improvements#2020

Merged
stevezau merged 17 commits intodevelopfrom
series_speedup
Nov 25, 2017
Merged

Series performance improvements#2020
stevezau merged 17 commits intodevelopfrom
series_speedup

Conversation

@stevezau
Copy link
Copy Markdown
Contributor

@stevezau stevezau commented Nov 21, 2017

Motivation for changes:

If 100's of series exist flexget's performance can suffer when parsing.

Detailed changes:

  • Prefetch Series Data
  • Better Logic to parsing, right now it loops every entry * configured series..

@stevezau
Copy link
Copy Markdown
Contributor Author

stevezau commented Nov 21, 2017

@gazpachoking @liiight can you take a look at this? not sure other ways we can improve the speeds?

@gazpachoking
Copy link
Copy Markdown
Member

Hmm. Is all the manual normalization needed? Series.name is meant to automatically compare with normalized version of the name. How much of a speed difference does this make?

@stevezau
Copy link
Copy Markdown
Contributor Author

@gazpachoking it didn't work with name.in_. Do you know how to handle that in https://github.com/Flexget/Flexget/blob/develop/flexget/plugins/filter/series.py#L236-L238

@stevezau
Copy link
Copy Markdown
Contributor Author

With pre-fetch it cut down the process time by 50% for me.

It's still very slow though. if you have an rss feed that has 100 entires and series with 100-200 entries it loops over the 100 entries times by the amount of series you have.. Can you think of a better way? I couldn't.

@gazpachoking
Copy link
Copy Markdown
Member

@stevezau
Copy link
Copy Markdown
Contributor Author

stevezau commented Nov 22, 2017

ok, i'll look into that.

Any other idea how we can speed it up. Seems crazy to have to loop over each entry times by the number of series.

If 200 series and 100 entry rss feed that's 20,000 loops

In my case it's more like 30,000-40,000

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Nov 22, 2017

Could we not sort the configured series alphabetically and put them in a dict keyed on first letter like this?

series:
  - a series
  - another series
  - other series
  - third series

-->

series_map = {
    'a': ['a series', 'another series'],
    'o': ['other series'],
    't': ['third series']
}

@stevezau
Copy link
Copy Markdown
Contributor Author

How will that help @cvium?

Won’t we still need to loop over them all right?

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Nov 22, 2017

We can guess the series name from the title, grab the first letter and look up in the dict and try to match with the (hopefully much shorter) list

@stevezau
Copy link
Copy Markdown
Contributor Author

Hmm is that safe to do? I guess we need to also check alt names..

@cvium
Copy link
Copy Markdown
Contributor

cvium commented Nov 22, 2017

All possible names would need to have its own node in the tree

@stevezau
Copy link
Copy Markdown
Contributor Author

@gazpachoking i'm getting lost when trying to figure out how to properly impl series.name.in_

Do you understand how Comparator work? I think we need to impl def in_() but i'm not sure what to return?

@gazpachoking
Copy link
Copy Markdown
Member

gazpachoking commented Nov 22, 2017

@stevezau Maybe try this for the comparator:

def in_(self, other):
    return super().in_([normalize_series_name(e) for e in other])

@stevezau
Copy link
Copy Markdown
Contributor Author

@gazpachoking that's what i thought yesterday but it still calls operate https://github.com/Flexget/Flexget/blob/series_speedup/flexget/plugins/filter/series.py#L238

I can fix it by checking if it's a list in operate but that does not seem right?

entries_map = {}
for entry in task.entries:
parsed = parser.parse_series(entry['title'])
if parsed.name:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe also check parsed.valid, not sure if it'll have a name property if it isn't valid


# Sort Entries into data model similar to https://en.wikipedia.org/wiki/Trie
# Only process series if both the entry title and series title first letter match
entries_map = {}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can set it to defaultdict(list), makes for a neater implementation later

# str() added to make sure number shows (e.g. 24) are turned into strings
series_names = [str(s.keys()[0]) for s in config]
existing_db_series = session.query(Series).filter(Series.name.in_(series_names))
existing_db_series = dict([(s.name_normalized, s) for s in existing_db_series])
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

existing_db_series = {s.name_normalized: s for s in existing_db_series}

@Flexget Flexget deleted a comment from liiight Nov 23, 2017
@stevezau
Copy link
Copy Markdown
Contributor Author

@gazpachoking @liiight @cvium tests are passing. Ready for review.

@stevezau stevezau self-assigned this Nov 23, 2017
@stevezau stevezau changed the title Prefetch Series Data Speedup Series Parsing Nov 23, 2017
@stevezau stevezau changed the title Speedup Series Parsing Series performance improvements Nov 23, 2017
@stevezau stevezau merged commit 99ff56c into develop Nov 25, 2017
@stevezau stevezau deleted the series_speedup branch November 26, 2017 07:26
db_series.alternate_names = [alt for alt in db_series.alternate_names if alt.alt_name in alts]
# Add/update the possibly new alternate names
else:
# TODO: Remove, added for debugging
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forget something @stevezau ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants