Add lxml dependency #3944

@bastimeyer

Description

Streamlink should finally switch to a proper HTML/XML parser for extracting data, instead of relying on cheap regex workarounds that don't work properly. I already commented on this last year:
#3241 (comment)

The reason I'm suggesting this again right now is that I was trying to fix the deutschewelle plugin (https://dw.com) yesterday and ran into issues with the itertags utility method, which uses simple regexes to iterate over HTML nodes and their attributes and body. itertags, for example, does not work with nested nodes, which makes ridiculous custom regexes necessary. Just take a look at this madness:

channel_re = re.compile(r'''<a.*?data-id="(\d+)".*?class="ici"''')
live_stream_div = re.compile(r'''
<div\s+class="mediaItem"\s+data-channel-id="(\d+)".*?>.*?
<input\s+type="hidden"\s+name="file_name"\s+value="(.*?)"\s*>.*?<div
''', re.DOTALL | re.VERBOSE)
smil_api_url = "http://www.dw.com/smil/{}"
html5_api_url = "http://www.dw.com/html5Resource/{}"
vod_player_type_re = re.compile(r'<input type="hidden" name="player_type" value="(?P<stream_type>.+?)">')
stream_vod_data_re = re.compile(r'<input\s+type="hidden"\s+name="file_name"\s+value="(?P<stream_url>.+?)">.*?'
                                r'<input\s+type="hidden"\s+name="media_id"\s+value="(?P<stream_id>\d+)">',
                                re.DOTALL)

With lxml (https://lxml.de/), HTML page contents can be parsed and the data extracted via XPath queries and/or the respective API methods. The API is similar to Python's native xml.etree.ElementTree, which itself is considered too slow and unsafe in certain cases. I am by no means an expert on Python's standard library though, so if someone has better insight here, please share. With regard to packaging, lxml is available on basically every packaging system, so I don't see a downside to adding it as a dependency here.
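To illustrate, here is a rough sketch of how the values targeted by the regexes above could be pulled out with lxml and XPath. The sample markup and the extract_media helper are made up for this example (they are not taken from the actual dw.com pages or from Streamlink), so treat it as an assumption-laden sketch rather than a finished plugin implementation.

```python
from lxml import html

# Hypothetical page fragment, modelled only on the attribute names
# that the regexes above look for. The real page layout may differ.
SAMPLE = """
<html><body>
  <div class="mediaItem" data-channel-id="5">
    <div class="wrapper">
      <input type="hidden" name="player_type" value="video">
      <input type="hidden" name="file_name" value="https://example.invalid/stream.m3u8">
      <input type="hidden" name="media_id" value="12345">
    </div>
  </div>
</body></html>
"""


def extract_media(markup):
    """Return one dict per mediaItem node (hypothetical helper)."""
    root = html.fromstring(markup)
    items = []
    # Nested nodes are no problem here: .//input matches at any depth
    # below the current mediaItem element.
    for item in root.xpath(".//div[@class='mediaItem'][@data-channel-id]"):
        items.append({
            "channel_id": item.get("data-channel-id"),
            # string(...) yields "" instead of raising if the node is missing
            "player_type": item.xpath("string(.//input[@name='player_type']/@value)"),
            "file_name": item.xpath("string(.//input[@name='file_name']/@value)"),
            "media_id": item.xpath("string(.//input[@name='media_id']/@value)"),
        })
    return items


media = extract_media(SAMPLE)
```

Unlike the regexes, the queries don't depend on attribute order, whitespace, or how deeply the input elements are nested inside the mediaItem node.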

I'd suggest that we add lxml as a dependency now and start using it for extracting data from HTML documents. The validation schema methods could be improved for this as well. There's also the parse_xml utility method, which is currently based on the native module.
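As a starting point for the XML side, an lxml-based parsing helper could look roughly like the following. The name parse_xml_lxml, the sample document and the chosen parser options are assumptions for the sake of the example; this is not the existing parse_xml implementation.

```python
from lxml import etree


def parse_xml_lxml(data):
    """Parse XML from str or bytes and return the root element (hypothetical helper)."""
    # lxml rejects str input that carries an encoding declaration, so encode first.
    # Assumption: str input is meant to be UTF-8.
    if isinstance(data, str):
        data = data.encode("utf-8")
    # Don't expand entities and don't fetch external resources while parsing.
    parser = etree.XMLParser(resolve_entities=False, no_network=True)
    return etree.fromstring(data, parser)


# Made-up document, only used to show the XPath access
doc = parse_xml_lxml(
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<smil><video src="a.mp4"/><video src="b.mp4"/></smil>'
)
sources = doc.xpath("//video/@src")
```

The returned element supports the same find/findall/get calls as xml.etree.ElementTree elements, plus full XPath via xpath().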

Comments?
