Checklist
- This is a feature request and not a different kind of issue
- I have read the contribution guidelines
- I have checked the list of open and recently closed plugin requests
Description
Streamlink should finally switch to a proper HTML/XML parser for extracting data instead of using cheap regex workarounds which don't work properly. I've already commented on this issue last year:
#3241 (comment)
The reason I'm suggesting this again now is that yesterday I was trying to fix the deutschewelle plugin (https://dw.com) and ran into issues with the itertags utility method, which uses simple regexes to iterate HTML nodes and their attributes and body. itertags does not work with nested nodes, for example, which makes ridiculous custom regexes necessary. Just take a look at this madness:
streamlink/src/streamlink/plugins/deutschewelle.py
Lines 18 to 29 in 3668770
```python
channel_re = re.compile(r'''<a.*?data-id="(\d+)".*?class="ici"''')
live_stream_div = re.compile(r'''
    <div\s+class="mediaItem"\s+data-channel-id="(\d+)".*?>.*?
    <input\s+type="hidden"\s+name="file_name"\s+value="(.*?)"\s*>.*?<div
''', re.DOTALL | re.VERBOSE)
smil_api_url = "http://www.dw.com/smil/{}"
html5_api_url = "http://www.dw.com/html5Resource/{}"
vod_player_type_re = re.compile(r'<input type="hidden" name="player_type" value="(?P<stream_type>.+?)">')
stream_vod_data_re = re.compile(r'<input\s+type="hidden"\s+name="file_name"\s+value="(?P<stream_url>.+?)">.*?'
                                r'<input\s+type="hidden"\s+name="media_id"\s+value="(?P<stream_id>\d+)">',
                                re.DOTALL)
```
With lxml (https://lxml.de/), HTML page content can be parsed and data extracted via XPath queries and/or the respective API methods. The API is similar to Python's native xml.etree.ElementTree, which itself is considered too slow and unsafe in certain cases. I am by no means an expert on Python's standard library though, so if someone has better insight here, please share. Regarding packaging, lxml is available on basically every packaging system, so adding it as a dependency here only has benefits.
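To illustrate, here is a minimal sketch of what the nested-node case could look like with lxml and XPath. The HTML snippet, attribute names and URL are made up for the example and not taken from the actual dw.com markup:

```python
from lxml import html

# Illustrative markup only, not the real dw.com page structure
page = html.fromstring("""
<div class="mediaItem" data-channel-id="5">
    <input type="hidden" name="file_name" value="http://example.invalid/stream.m3u8">
    <input type="hidden" name="media_id" value="12345">
</div>
""")

# XPath handles nesting, whitespace and attribute order,
# which the regexes above cannot do reliably.
for div in page.xpath('//div[@class="mediaItem"]'):
    channel_id = div.get("data-channel-id")
    file_name = div.xpath('.//input[@name="file_name"]/@value')[0]
    media_id = div.xpath('.//input[@name="media_id"]/@value')[0]
    print(channel_id, file_name, media_id)
```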
I'd suggest that we add lxml as a dependency now and start using it for extracting data from HTML documents. The validation schema methods could be improved for this as well. There's also the parse_xml utility method, which is currently based on the native xml.etree module.
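As a rough sketch of how such a helper could look, a parse_html utility analogous to parse_xml might be as simple as the following. The function name and signature are purely my assumption here and not existing Streamlink API:

```python
# Hypothetical sketch only: a parse_html helper built on lxml,
# mirroring the idea of the existing parse_xml utility.
from lxml.html import HTMLParser, fromstring


def parse_html(data, exception=ValueError):
    """Parse an HTML string/bytes into an lxml element tree."""
    try:
        return fromstring(data, parser=HTMLParser(recover=True))
    except Exception as err:
        raise exception(f"Unable to parse HTML: {err}")
```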
Comments?