deprecate bytes and charset by davidism · Pull Request #2641 · pallets/werkzeug

davidism · 2023-04-11T21:13:44Z

This deprecates accepting bytes everywhere that str was the correct type. It also deprecates the charset and errors parameters throughout the code. Closes #2602, which has more of my reasoning and notes.

Here are the modern standards, which all strongly suggest or require UTF-8.

HTML, URL, fetch, etc.: https://spec.whatwg.org/
HTTP, cookies, headers, etc.: https://httpwg.org/specs/

In general, I looked for everywhere that was using instance checks, encoding, decoding, or internal helper functions. Many places were already not using them consistently. Despite what the standards say, browsers will still accept HTML documents with other encodings, and will send data from such documents in that encoding. However, 98% of websites use UTF-8, and as long as we send UTF-8 documents we will receive UTF-8 data.

Ideally, this provides a speedup once all the deprecations are removed, but this is hard to test. Anecdotally, I talked to devs who noticed a corresponding slowdown when Python 3 support was initially added to the Python 2-only code. Many places were doing instance checks or excessive encoding/decoding multiple times while handling a request or creating a response.

These notes are in the order that GitHub displays the changes in the diff, except for the first one, which is probably the biggest change.

Request.charset, Request.encoding_errors, Request.url_charset, Response.charset, and Client.charset attributes are deprecated. Request body, form, and cookie data will always be decoded using UTF-8. Response body data will always be encoded using UTF-8. That doesn't mean it's impossible to send or receive non-UTF-8 data, since it's still possible to get and set bytes directly.
Remove all uses of the _to_str and _to_bytes functions, left over from Python 2/3 compatibility.
Remove _encode_idna; it can be replaced with str.encode("idna").
_decode_idna still exists, only for use in iri_to_uri. It allows leaving invalid segments IDNA-encoded. I added a fast path where it tries to decode the full host first, before falling back to decoding each segment.
Anywhere that was annotated, documented, or tested to accept bytes shows a deprecation warning, then decodes as UTF-8. All other places only accept str now. Errors in request data are replaced; errors in response data are strict and raise.
Header keys and values must be strings, not bytes. Values may still be other types, like int, but all types are converted to strings. The as_bytes parameter when getting header values is deprecated; values are always returned as strings.
When setting a header value, it is validated to not contain newline characters using a regex, which requires only one scan rather than two.
multipart/form-data, and some headers, have a way to specify a charset for individual fields. Both of these parsers only allow ASCII, UTF-8, and ISO-8859-1.
request.args, and application/x-www-form-urlencoded, keep invalid bytes percent-encoded rather than replacing them with a placeholder character.
Creating a response with a list of strings will set the content length, instead of only doing so for bytes.
Map.charset is deprecated. Technically, it allowed building URLs that percent-encoded values as something besides UTF-8 bytes. However, Flask never exposed this option, so the vast, vast majority of devs have never been able to build routes with non-UTF-8 encoding anyway, even if they changed the response charset.
Validating a variable's type was removed in a few places, with the assumption that this is the job of type annotations and a static type checker instead.
Very few tests needed to be modified/removed to accommodate all these changes.
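
To illustrate the IDNA notes above, here is a minimal standard-library sketch of encoding with str.encode("idna") and of a decoder with a fast path that falls back to per-segment decoding. The function name decode_idna_host is hypothetical and this is not the code from the PR; it only shows the general shape described in the note.

```python
def decode_idna_host(host: str) -> str:
    """Decode an IDNA (punycode) host, leaving undecodable segments as-is.

    Hypothetical helper for illustration only.
    """
    try:
        data = host.encode("ascii")
    except UnicodeEncodeError:
        # A non-ASCII host is assumed to be decoded already.
        return host

    try:
        # Fast path: try to decode the full host in one call.
        return data.decode("idna")
    except UnicodeError:
        pass

    # Fallback: decode each segment, keeping invalid ones IDNA-encoded.
    parts = []
    for part in data.split(b"."):
        try:
            parts.append(part.decode("idna"))
        except UnicodeError:
            parts.append(part.decode("ascii"))
    return ".".join(parts)


# Encoding needs no helper at all.
encoded = "bücher.example".encode("idna")
decoded = decode_idna_host("xn--bcher-kva.example")
```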
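
The bytes deprecation described above roughly follows a "warn, then decode as UTF-8" pattern. The sketch below assumes a hypothetical helper named ensure_str with an illustrative warning message; the real code is spread across many call sites and may differ in detail. The errors argument stands in for the request/response distinction: "replace" for request data, "strict" for response data.

```python
import warnings


def ensure_str(value, name="value", errors="strict"):
    """Return value as str, warning if bytes were passed.

    Hypothetical helper: use errors="replace" for request data and
    errors="strict" for response data.
    """
    if isinstance(value, bytes):
        warnings.warn(
            f"Passing bytes for '{name}' is deprecated. Pass str instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return value.decode("utf-8", errors)
    return value
```

With errors="replace", invalid input such as b"caf\xe9" becomes "caf" followed by U+FFFD; with errors="strict", the same input raises UnicodeDecodeError.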
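
For the header notes above, a single-scan newline check could look like the following. This is a sketch under assumptions: the function name validate_header_value and the error message are illustrative, not necessarily what the PR uses.

```python
import re

# One character class matches either newline character, so the value is
# scanned once instead of once per character.
_newline_re = re.compile(r"[\r\n]")


def validate_header_value(value):
    """Convert a header value to str and reject embedded newlines.

    Hypothetical helper for illustration only.
    """
    if not isinstance(value, str):
        # Other types, like int, are converted to strings.
        value = str(value)

    if _newline_re.search(value) is not None:
        raise ValueError("Header values must not contain newline characters.")

    return value
```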
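
One way to keep invalid bytes percent-encoded when parsing a query string, as in the request.args note above, is a custom codec error handler passed to urllib.parse. The handler name "sketch.url_quote" and the function below are assumptions made for this example; they show the technique with the standard library only and are not a copy of the PR's code.

```python
import codecs
from urllib.parse import parse_qsl, quote


def _keep_percent_encoded(error):
    """Codec error handler: re-quote the undecodable bytes."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position to resume decoding at.
    return quote(bad, safe=""), error.end


# "sketch.url_quote" is a made-up handler name for this example.
codecs.register_error("sketch.url_quote", _keep_percent_encoded)

query = "name=caf%E9&ok=%C3%A9"

# Default behavior: the invalid byte becomes the U+FFFD placeholder.
replaced = parse_qsl(query, keep_blank_values=True)

# With the handler: the invalid byte stays percent-encoded.
kept = parse_qsl(query, keep_blank_values=True, errors="sketch.url_quote")
```

The advantage of keeping the percent-encoded form is that the original bytes remain recoverable, whereas the placeholder character discards them.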

davidism · 2023-04-11T22:29:38Z

Removed charset information from the docs in 4a871c1

davidism added 10 commits April 10, 2023 13:14

inline _encode_idna, _decode_idna takes str only

9c92a5f

remove uses of _to_str and _to_bytes

72e4927

deprecate bytes where str is expected

67776f7

deprecate bytes in headers

c81896d

deprecate charset in routing

f931577

finish deprecating charset for cookies

7bf8357

remove request.url_charset

6d4de9e

restrict multipart charsets

ff3df42

deprecate request and response charset

a6d96ca

deprecate iri charset and errors params

42e898c

davidism added this to the 2.3.0 milestone Apr 11, 2023

davidism merged commit c7a1dbc into main Apr 11, 2023

davidism deleted the deprecate-bytes branch April 11, 2023 22:25

github-actions bot locked as resolved and limited conversation to collaborators Apr 26, 2023
