URL.host returns Punycode instead of Unicode for some URLs #850

@Kludex

Originally opened by @loic-bellinger on 2024-10-04 13:51:49 in encode/httpx

Description

The URL.host property does not always decode IDNA hostnames into Unicode, which contradicts the httpx documentation. According to the documentation, the host should always be returned as a string, normalized to lowercase, with IDNA hosts decoded into Unicode.

Steps to reproduce

from urllib.parse import urlparse
from httpx import URL

test_url = "https://www.égalité-femmes-hommes.gouv.fr"
# Expected: "www.égalité-femmes-hommes.gouv.fr"
# Actual:   "www.xn--galit-femmes-hommes-9ybf.gouv.fr"
print(URL(test_url).host)

# Returns "www.égalité-femmes-hommes.gouv.fr".
# idna.decode() on the Punycode host also returns this.
print(urlparse(test_url).hostname)

Expected behavior

The URL.host property should return the Unicode version of the host, in this case: www.égalité-femmes-hommes.gouv.fr.

Actual behavior

The URL.host property returns the Punycode-encoded version of the host: www.xn--galit-femmes-hommes-9ybf.gouv.fr.

Potential fix

It seems the issue arises in this part of the httpx code:

@property
def host(self) -> str:
    host: str = self._uri_reference.host

    if host.startswith("xn--"):
        host = idna.decode(host)

    return host

The startswith("xn--") check only matches when the host string as a whole begins with the prefix, that is, when the first label is Punycode-encoded. In the example above the host begins with "www.", so the check fails and the encoded label that follows is never decoded.

Replacing host.startswith("xn--") with something like if "xn--" in host might handle a broader set of cases.
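A stricter alternative to the substring test would be to check each dot-separated label on its own. The snippet below is only a minimal sketch of that idea, not the httpx implementation: `decode_idna_host` and `ACE_PREFIX` are hypothetical names, and the standard library's `punycode` codec is used as a stand-in for `idna.decode()` so the example runs without third-party packages. It skips the validation that a full IDNA decoder performs.

```python
ACE_PREFIX = "xn--"  # prefix that marks a Punycode-encoded label


def decode_idna_host(host: str) -> str:
    """Decode every Punycode label in `host`, leaving other labels untouched.

    Hypothetical helper for illustration. It uses the stdlib "punycode"
    codec rather than the `idna` package, so no IDNA validation is applied.
    """
    labels = []
    for label in host.split("."):
        # Check the prefix per label, not once for the whole host.
        if label.lower().startswith(ACE_PREFIX):
            label = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


host = "www.xn--galit-femmes-hommes-9ybf.gouv.fr"
print(host.startswith(ACE_PREFIX))  # False: the host starts with "www."
print(decode_idna_host(host))
```

Compared with if "xn--" in host, the per-label check only fires when a label actually begins with the prefix, so a host that merely contains "xn--" somewhere inside a label is left alone.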

Environment

httpx version: 0.27.2
Python version: 3.12.x
OS: Linux/Windows
