fix(http): improve HTML detection by using Content-Type header by phateffect · Pull Request #9407 · jdx/mise

phateffect · 2026-04-26T17:33:32Z

The current implementation of get_html uses a strict string-prefix check starts_with("<!DOCTYPE html>"))
which fails in several real-world scenarios common to PyPI registries.

Case Sensitivity: As noted by MDN, the DOCTYPE declaration is case-insensitive (e.g., <!doctype html>). The current starts_with check is case-sensitive and too rigid.
Leading Whitespace: mirrors.aliyun.com, one of the most widely used PyPI mirrors, returns an empty line before the DOCTYPE, causing the current check to fail.
Missing DOCTYPE: Legacy registries (like Nexus 3.1.x) often omit the <!DOCTYPE html> declaration entirely, starting directly with <html>.

Detailed HTTP response examples for cases 2 and 3 can be found in #9399.

This PR refactors the detection logic to rely on the Content-Type HTTP header, which is the industry-standard way to identify media types.
This change aligns mise with the behavior of other tools like pip, pipx, and uv, ensuring better compatibility across diverse infrastructure.

gemini-code-assist

Code Review

This pull request updates the get_html method to identify HTML content using the Content-Type header instead of checking for a <!DOCTYPE html> prefix, which improves robustness against leading whitespace and non-standard HTML. It also adds unit tests to verify this new behavior. A review comment suggests refactoring this detection logic into a shared helper to ensure consistency with other methods, such as get_text_with_headers, which still use the legacy validation approach.

gemini-code-assist · 2026-04-26T17:35:06Z

+        let is_html = resp
+            .headers()
+            .get(CONTENT_TYPE)
+            .and_then(|content_type| content_type.to_str().ok())
+            .is_some_and(|content_type| {
+                content_type
+                    .split_once(';')
+                    .map_or(content_type, |(media_type, _)| media_type)
+                    .trim()
+                    .eq_ignore_ascii_case("text/html")
+            });


The improved HTML detection logic using the Content-Type header is a significant improvement over the rigid starts_with check. However, get_text_with_headers (lines 163-170) still uses the old starts_with("<!DOCTYPE html>") logic to detect accidental HTML responses.

To ensure consistency and fix the issues described in the PR (case sensitivity, leading whitespace, and missing DOCTYPE) across the entire client, it would be beneficial to refactor this detection into a shared helper method and update get_text_with_headers as well. This would prevent get_text from incorrectly returning HTML content as text when mirrors return non-standard HTML or have leading whitespace.

greptile-apps · 2026-04-26T17:35:18Z

Greptile Summary

This PR replaces the brittle starts_with("<!DOCTYPE html>") body-prefix check in get_html with a Content-Type: text/html header check, correctly handling case-insensitive MIME types and stripped media-type parameters (e.g. ; charset=utf-8). Two mock-server tests are added to cover the accept and reject paths.

Confidence Score: 4/5

Safe to merge with the open review-thread issues in mind; the logic for the happy path is correct and the existing threads capture the most impactful gaps.

The implementation is correct for well-behaved servers. The only new finding here is a P2 error-message clarity issue. However, the two open threads (no-header fallback and missing charset test) have not yet been addressed, keeping the ceiling at 4.

src/http.rs — specifically the get_html function and its test block regarding the no-Content-Type fallback path.

Important Files Changed

Filename	Overview
src/http.rs	Refactors `get_html` to use `Content-Type` header instead of body-prefix inspection; adds two mock-based tests, though the no-header fallback and `text/html; charset=utf-8` test cases raised in existing review threads remain open.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant get_html
    participant HTTP as HTTP Client
    participant Server

    Caller->>get_html: get_html(url)
    get_html->>HTTP: get_async(url)
    HTTP->>Server: GET /path
    Server-->>HTTP: 200 OK + headers + body
    HTTP-->>get_html: Response (headers available, body streaming)
    get_html->>get_html: resp.headers().get(CONTENT_TYPE)
    alt Content-Type present and is text/html
        get_html->>get_html: is_html = true
        get_html->>HTTP: resp.text() — consume body
        HTTP-->>get_html: html String
        get_html-->>Caller: Ok(html)
    else Content-Type absent or not text/html
        get_html->>get_html: is_html = false
        get_html-->>Caller: Err("Got non-HTML text from …")
    end

_{Reviews (5): Last reviewed commit: "Merge branch 'main' into main" | Re-trigger Greptile}

greptile-apps · 2026-04-26T17:35:21Z

+        let is_html = resp
+            .headers()
+            .get(CONTENT_TYPE)
+            .and_then(|content_type| content_type.to_str().ok())
+            .is_some_and(|content_type| {
+                content_type
+                    .split_once(';')
+                    .map_or(content_type, |(media_type, _)| media_type)
+                    .trim()
+                    .eq_ignore_ascii_case("text/html")
+            });
+        if !is_html {
            bail!("Got non-HTML text from {}", url);


No fallback for absent Content-Type header

When a server omits the Content-Type header entirely, is_some_and evaluates to false and the function unconditionally bails. The old code would still succeed in that case if the body began with <!DOCTYPE html>. While RFC-compliant servers always include Content-Type, some legacy registries (the very case this PR targets) may not. Consider adding a fallback to the original body-prefix heuristic when the header is absent, or at minimum adding a test that documents the new "no-header → reject" behavior.

greptile-apps · 2026-04-28T15:33:26Z

Yes, exactly — that's the regression I flagged in my original comment. The old starts_with("<!DOCTYPE html>") check would succeed even with no Content-Type header, as long as the body looked like HTML. The new code's is_some_and(...) returns false when the header is absent entirely, so any legacy server that sends valid HTML but omits Content-Type goes from working → broken.

The fix is to add a body-prefix fallback for the no-header case. Read the body unconditionally, then:

let html = resp.text().await?;
let is_html = resp
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|ct| ct.to_str().ok())
    .map(|ct| {
        ct.split_once(';')
            .map_or(ct, |(media_type, _)| media_type)
            .trim()
            .eq_ignore_ascii_case("text/html")
    })
    .unwrap_or_else(|| {
        // No Content-Type header — fall back to body heuristic
        let trimmed = html.trim_start();
        let lower = trimmed.to_ascii_lowercase();
        lower.starts_with("<!doctype html>") || lower.starts_with("<html")
    });
if !is_html {
    bail!("Got non-HTML text from {}", url);
}
Ok(html)

Header present → trust it; header absent → case-insensitive, whitespace-tolerant body check. Fixes the original three cases while preserving backwards compatibility for headerless responses.

@jdx

### 🚀 Features - **(deps)** add aube provider by @jdx in [#9452](#9452) - **(ls-remote)** add strict metadata mode by @jdx in [#9448](#9448) ### 🐛 Bug Fixes - **(env)** parse concatenated short form `-Eval` correctly by @bts in [#9456](#9456) - **(http)** improve HTML detection by using Content-Type header by @phateffect in [#9407](#9407) - **(task)** install monorepo subdir tools before running deps by @jdx in [#9454](#9454) ### 📦️ Dependency Updates - update astral-tokio-tar advisory by @jdx in [#9449](#9449) - respect -q flag for provider command stream by @JamBalaya56562 in [#9457](#9457) ### New Contributors - @JamBalaya56562 made their first contribution in [#9457](#9457) - @bts made their first contribution in [#9456](#9456) - @phateffect made their first contribution in [#9407](#9407) --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>

leyan-pang and others added 2 commits April 27, 2026 01:09

fix(http): validate html content type

fa0e6b6

Merge branch 'jdx:main' into main

0c64d5b

gemini-code-assist Bot reviewed Apr 26, 2026

View reviewed changes

greptile-apps Bot reviewed Apr 26, 2026

View reviewed changes

phateffect added 2 commits April 27, 2026 02:38

Merge branch 'main' into main

eeeb648

Merge branch 'main' into main

0640ed1

greptile-apps Bot reviewed Apr 27, 2026

View reviewed changes

Comment thread src/http.rs

leyan-pang and others added 2 commits April 28, 2026 00:56

test(http): close get_html rejection test

9ef2fa5

Merge branch 'main' into main

b25dc08

jdx merged commit 1c870a8 into jdx:main Apr 28, 2026
34 checks passed

mise-en-dev mentioned this pull request Apr 28, 2026

chore: release 2026.4.26 #9450

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(http): improve HTML detection by using Content-Type header#9407

fix(http): improve HTML detection by using Content-Type header#9407
jdx merged 6 commits intojdx:mainfrom
phateffect:main

phateffect commented Apr 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 26, 2026

Uh oh!

greptile-apps Bot commented Apr 26, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Apr 26, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

phateffect commented Apr 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Apr 26, 2026 •

edited

Loading