Checklist
- This is a feature request and not a different kind of issue
- I have read the contribution guidelines
- I have checked the list of open and recently closed plugin requests
Description
Streamlink should finally switch to a proper HTML/XML parser for extracting data instead of using cheap regex workarounds which don't work properly. I've already commented on this issue last year:
#3241 (comment)
The reason I'm suggesting this again now is that yesterday I was trying to fix the deutschewelle plugin (https://dw.com) and ran into issues with the itertags utility method, which uses simple regexes to iterate HTML nodes and their attributes and body. itertags does not work with nested nodes, for example, which makes ridiculous custom regexes necessary. Just take a look at this madness:
streamlink/src/streamlink/plugins/deutschewelle.py
Lines 18 to 29 in 3668770
```python
channel_re = re.compile(r'''<a.*?data-id="(\d+)".*?class="ici"''')
live_stream_div = re.compile(r'''
    <div\s+class="mediaItem"\s+data-channel-id="(\d+)".*?>.*?
    <input\s+type="hidden"\s+name="file_name"\s+value="(.*?)"\s*>.*?<div
''', re.DOTALL | re.VERBOSE)
smil_api_url = "http://www.dw.com/smil/{}"
html5_api_url = "http://www.dw.com/html5Resource/{}"
vod_player_type_re = re.compile(r'<input type="hidden" name="player_type" value="(?P<stream_type>.+?)">')
stream_vod_data_re = re.compile(r'<input\s+type="hidden"\s+name="file_name"\s+value="(?P<stream_url>.+?)">.*?'
                                r'<input\s+type="hidden"\s+name="media_id"\s+value="(?P<stream_id>\d+)">',
                                re.DOTALL)
```
With lxml (https://lxml.de/), HTML page content can be parsed and data extracted via XPath queries and/or the respective API methods. The API is similar to Python's native xml.etree.ElementTree, which itself is considered too slow and unsafe in certain cases. I am by no means an expert on Python's standard library though, so if someone has better insight here, please share. Regarding packaging, lxml is available on basically every packaging system, so adding it as a dependency here only has benefits.
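To illustrate, here is a minimal sketch of what the nested-node case could look like with lxml and XPath. The HTML snippet, attribute names and URL are made up for the example and not taken from the actual dw.com markup:

```python
from lxml import html

# Illustrative markup only, not the real dw.com page structure
page = html.fromstring("""
<div class="mediaItem" data-channel-id="5">
    <input type="hidden" name="file_name" value="http://example.invalid/stream.m3u8">
    <input type="hidden" name="media_id" value="12345">
</div>
""")

# XPath handles nesting, whitespace and attribute order,
# which the regexes above cannot do reliably.
for div in page.xpath('//div[@class="mediaItem"]'):
    channel_id = div.get("data-channel-id")
    file_name = div.xpath('.//input[@name="file_name"]/@value')[0]
    media_id = div.xpath('.//input[@name="media_id"]/@value')[0]
    print(channel_id, file_name, media_id)
```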
I'd suggest that we add lxml as a dependency now and start using it for extracting data from HTML documents. The validation schema methods could be improved for this as well. There's also the parse_xml utility method, which is currently based on the native xml.etree module.
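As a rough sketch of how such a helper could look, a parse_html utility analogous to parse_xml might be as simple as the following. The function name and signature are purely my assumption here and not existing Streamlink API:

```python
# Hypothetical sketch only: a parse_html helper built on lxml,
# mirroring the idea of the existing parse_xml utility.
from lxml.html import HTMLParser, fromstring


def parse_html(data, exception=ValueError):
    """Parse an HTML string/bytes into an lxml element tree."""
    try:
        return fromstring(data, parser=HTMLParser(recover=True))
    except Exception as err:
        raise exception(f"Unable to parse HTML: {err}")
```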
Comments?