deprecate bytes and charset #2641

Merged: davidism merged 10 commits into main from deprecate-bytes on Apr 11, 2023

@davidism
Member

This deprecates every place that accepted bytes where str was the correct type. It also deprecates the charset and errors parameters throughout the code. Closes #2602, which has more of my reasoning and notes.

Here are the modern standards, which all strongly suggest or require UTF-8.

In general, I looked for everywhere that was using instance checks, encoding, decoding, or internal helper functions. Many places were not using them consistently already. The modern HTML, URL, and HTTP standards all strongly suggest or require using UTF-8. Despite that, browsers will still accept HTML documents with other encodings, and send data from such documents in that encoding. However, 98% of websites use UTF-8, and as long as we send UTF-8 documents we will receive UTF-8 data.
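To make the encoding argument concrete, here is a small sketch using only the standard library (not Werkzeug itself). It assumes a hypothetical form value `"café"` and compares what would be submitted from a UTF-8 page with what would be submitted from a Latin-1 page, then decodes both as UTF-8 with the `replace` error handler.

```python
from urllib.parse import quote, unquote

# Hypothetical form value, chosen only for illustration.
value = "café"

# The percent-encoded bytes depend on the encoding of the submitting page.
sent_from_utf8_page = quote(value, encoding="utf-8")
sent_from_latin1_page = quote(value, encoding="latin-1")

# Decoding as UTF-8 recovers the value that came from the UTF-8 page.
decoded_utf8 = unquote(sent_from_utf8_page, encoding="utf-8", errors="replace")

# The Latin-1 byte 0xE9 is not valid UTF-8 here, so it is replaced with U+FFFD.
decoded_latin1 = unquote(sent_from_latin1_page, encoding="utf-8", errors="replace")

print(sent_from_utf8_page, repr(decoded_utf8))
print(sent_from_latin1_page, repr(decoded_latin1))
```

The mismatch only shows up when the page itself was served in another encoding, which is why sending UTF-8 documents is enough to receive UTF-8 data back.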

Ideally, this provides a speedup once all the deprecations are removed, but this is hard to test. Anecdotally, I talked to devs who noticed a corresponding slowdown when Python 3 support was initially added to the Python 2-only code. Many places were doing instance checks or excessive encoding/decoding multiple times while handling a request or creating a response.

These notes are in the order that GitHub displays the changes in the diff, except for the first one, which is probably the biggest change.

  • Request.charset, Request.encoding_errors, Request.url_charset, Response.charset, and Client.charset attributes are deprecated. Request body, form, and cookie data will always be decoded using UTF-8. Response body data will always be encoded using UTF-8. That doesn't mean it's impossible to send or receive non-UTF-8 data, since it's still possible to get and set bytes directly.
  • Remove all use of the _to_str and _to_bytes functions, leftover from 2/3 compatibility.
  • Remove _encode_idna; it could be replaced with str.encode("idna").
  • _decode_idna still exists, only for use in iri_to_uri. It allows leaving invalid segments IDNA-encoded. I added a fast path where it tries to decode the full host first, before falling back to decoding each segment.
  • Anywhere that was annotated, documented, or tested to accept bytes shows a deprecation warning, then decodes as UTF-8. Any other places only accept str now. Errors in request data are replaced; errors in response data are strict and raise.
  • Header keys and values must be strings, not bytes. Values may still be other types, like int, but all types are converted to string. The as_bytes parameter when getting header values is deprecated; values are always returned as strings.
  • When setting a header value, it is validated to not contain newline characters using a regex, which requires only one scan rather than two.
  • multipart/form-data, and some headers, have a way to specify a charset for individual fields. Both these parsers only allow ASCII, UTF-8, and ISO-8859-1.
  • request.args, and application/x-www-form-urlencoded, keep invalid bytes percent encoded rather than replacing them with a placeholder character.
  • Creating a response with a list of strings will set the content length, instead of only doing so for bytes.
  • Map.charset is deprecated. Technically, it allowed building URLs that percent-encoded values as something besides UTF-8 bytes. However, Flask never exposed this option, so the vast, vast majority of devs have never been able to build routes with non-UTF-8 encoding anyway, even if they changed the response charset.
  • Validating a variable's type was removed in a few places, with the assumption that this is the job of type annotations and a static type checker instead.
  • Very few tests needed to be modified/removed to accommodate all these changes.
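Three of the notes above (the IDNA fast path, the single-scan newline check on header values, and keeping invalid query bytes percent-encoded) can be sketched with the standard library alone. This is an illustration under assumptions, not the code from this PR: the names `decode_idna_host`, `header_value_to_str`, `parse_query`, and the error handler name `sketch.url_quote` are all hypothetical.

```python
import codecs
import re
import warnings
from urllib.parse import parse_qsl, quote


def decode_idna_host(host: str) -> str:
    """Decode an IDNA host, leaving undecodable labels IDNA-encoded."""
    try:
        data = host.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII input is treated as already decoded.
        return host

    try:
        # Fast path: try to decode the full host in one call.
        return data.decode("idna")
    except UnicodeError:
        pass

    # Fallback: decode each label separately, keeping invalid ones as-is.
    labels = []
    for label in data.split(b"."):
        try:
            labels.append(label.decode("idna"))
        except UnicodeError:
            labels.append(label.decode("ascii"))
    return ".".join(labels)


# One regex search finds either newline character in a single scan.
_NEWLINE_RE = re.compile(r"[\r\n]")


def header_value_to_str(value: object) -> str:
    """Convert a header value to str and reject embedded newlines."""
    if isinstance(value, bytes):
        # Assumed handling for the deprecation period: warn, then decode.
        warnings.warn("bytes header values are deprecated", DeprecationWarning)
        value = value.decode()
    text = value if isinstance(value, str) else str(value)
    if _NEWLINE_RE.search(text) is not None:
        raise ValueError("header values must not contain newline characters")
    return text


def _requote_invalid_bytes(error: UnicodeError) -> tuple[str, int]:
    """Codec error handler: percent-encode bytes that are not valid UTF-8."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    return quote(bad, safe=""), error.end


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("sketch.url_quote", _requote_invalid_bytes)


def parse_query(query: str) -> list[tuple[str, str]]:
    """Parse a query string, keeping invalid bytes percent-encoded."""
    return parse_qsl(
        query, keep_blank_values=True, encoding="utf-8", errors="sketch.url_quote"
    )
```

For example, `decode_idna_host("xn--mnchen-3ya.de")` takes the fast path, and `parse_query("name=%ff&city=%C3%A9")` keeps the invalid byte visible as `%FF` rather than turning it into a placeholder character.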

@davidism davidism added this to the 2.3.0 milestone Apr 11, 2023
@davidism davidism merged commit c7a1dbc into main Apr 11, 2023
@davidism davidism deleted the deprecate-bytes branch April 11, 2023 22:25
@davidism
Member Author

Removed charset information from the docs in 4a871c1

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 26, 2023

Development

Successfully merging this pull request may close these issues.

drop support for bytes