Skip to content

fix(http): improve HTML detection by using Content-Type header#9407

Merged
jdx merged 6 commits intojdx:mainfrom
phateffect:main
Apr 28, 2026
Merged

fix(http): improve HTML detection by using Content-Type header#9407
jdx merged 6 commits intojdx:mainfrom
phateffect:main

Conversation

@phateffect
Copy link
Copy Markdown
Contributor

The current implementation of get_html uses a strict string-prefix check starts_with("<!DOCTYPE html>"))
which fails in several real-world scenarios common to PyPI registries.

  1. Case Sensitivity: As noted by MDN, the DOCTYPE declaration is case-insensitive (e.g., <!doctype html>). The current starts_with check is case-sensitive and too rigid.
  2. Leading Whitespace: mirrors.aliyun.com, one of the most widely used PyPI mirrors, returns an empty line before the DOCTYPE, causing the current check to fail.
  3. Missing DOCTYPE: Legacy registries (like Nexus 3.1.x) often omit the <!DOCTYPE html> declaration entirely, starting directly with <html>.

Detailed HTTP response examples for cases 2 and 3 can be found in #9399.

This PR refactors the detection logic to rely on the Content-Type HTTP header, which is the industry-standard way to identify media types.
This change aligns mise with the behavior of other tools like pip, pipx, and uv, ensuring better compatibility across diverse infrastructure.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the get_html method to identify HTML content using the Content-Type header instead of checking for a <!DOCTYPE html> prefix, which improves robustness against leading whitespace and non-standard HTML. It also adds unit tests to verify this new behavior. A review comment suggests refactoring this detection logic into a shared helper to ensure consistency with other methods, such as get_text_with_headers, which still use the legacy validation approach.

Comment thread src/http.rs
Comment on lines +210 to +220
let is_html = resp
.headers()
.get(CONTENT_TYPE)
.and_then(|content_type| content_type.to_str().ok())
.is_some_and(|content_type| {
content_type
.split_once(';')
.map_or(content_type, |(media_type, _)| media_type)
.trim()
.eq_ignore_ascii_case("text/html")
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The improved HTML detection logic using the Content-Type header is a significant improvement over the rigid starts_with check. However, get_text_with_headers (lines 163-170) still uses the old starts_with("<!DOCTYPE html>") logic to detect accidental HTML responses.

To ensure consistency and fix the issues described in the PR (case sensitivity, leading whitespace, and missing DOCTYPE) across the entire client, it would be beneficial to refactor this detection into a shared helper method and update get_text_with_headers as well. This would prevent get_text from incorrectly returning HTML content as text when mirrors return non-standard HTML or have leading whitespace.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 26, 2026

Greptile Summary

This PR replaces the brittle starts_with("<!DOCTYPE html>") body-prefix check in get_html with a Content-Type: text/html header check, correctly handling case-insensitive MIME types and stripped media-type parameters (e.g. ; charset=utf-8). Two mock-server tests are added to cover the accept and reject paths.

Confidence Score: 4/5

Safe to merge with the open review-thread issues in mind; the logic for the happy path is correct and the existing threads capture the most impactful gaps.

The implementation is correct for well-behaved servers. The only new finding here is a P2 error-message clarity issue. However, the two open threads (no-header fallback and missing charset test) have not yet been addressed, keeping the ceiling at 4.

src/http.rs — specifically the get_html function and its test block regarding the no-Content-Type fallback path.

Important Files Changed

Filename Overview
src/http.rs Refactors get_html to use Content-Type header instead of body-prefix inspection; adds two mock-based tests, though the no-header fallback and text/html; charset=utf-8 test cases raised in existing review threads remain open.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant get_html
    participant HTTP as HTTP Client
    participant Server

    Caller->>get_html: get_html(url)
    get_html->>HTTP: get_async(url)
    HTTP->>Server: GET /path
    Server-->>HTTP: 200 OK + headers + body
    HTTP-->>get_html: Response (headers available, body streaming)
    get_html->>get_html: resp.headers().get(CONTENT_TYPE)
    alt Content-Type present and is text/html
        get_html->>get_html: is_html = true
        get_html->>HTTP: resp.text() — consume body
        HTTP-->>get_html: html String
        get_html-->>Caller: Ok(html)
    else Content-Type absent or not text/html
        get_html->>get_html: is_html = false
        get_html-->>Caller: Err("Got non-HTML text from …")
    end
Loading

Reviews (5): Last reviewed commit: "Merge branch 'main' into main" | Re-trigger Greptile

Comment thread src/http.rs
Comment on lines +210 to 222
let is_html = resp
.headers()
.get(CONTENT_TYPE)
.and_then(|content_type| content_type.to_str().ok())
.is_some_and(|content_type| {
content_type
.split_once(';')
.map_or(content_type, |(media_type, _)| media_type)
.trim()
.eq_ignore_ascii_case("text/html")
});
if !is_html {
bail!("Got non-HTML text from {}", url);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No fallback for absent Content-Type header

When a server omits the Content-Type header entirely, is_some_and evaluates to false and the function unconditionally bails. The old code would still succeed in that case if the body began with <!DOCTYPE html>. While RFC-compliant servers always include Content-Type, some legacy registries (the very case this PR targets) may not. Consider adding a fallback to the original body-prefix heuristic when the header is absent, or at minimum adding a test that documents the new "no-header → reject" behavior.

Comment thread src/http.rs
Comment thread src/http.rs
@jdx jdx merged commit 1c870a8 into jdx:main Apr 28, 2026
34 checks passed
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Yes, exactly — that's the regression I flagged in my original comment. The old starts_with("<!DOCTYPE html>") check would succeed even with no Content-Type header, as long as the body looked like HTML. The new code's is_some_and(...) returns false when the header is absent entirely, so any legacy server that sends valid HTML but omits Content-Type goes from working → broken.

The fix is to add a body-prefix fallback for the no-header case. Read the body unconditionally, then:

let html = resp.text().await?;
let is_html = resp
    .headers()
    .get(CONTENT_TYPE)
    .and_then(|ct| ct.to_str().ok())
    .map(|ct| {
        ct.split_once(';')
            .map_or(ct, |(media_type, _)| media_type)
            .trim()
            .eq_ignore_ascii_case("text/html")
    })
    .unwrap_or_else(|| {
        // No Content-Type header — fall back to body heuristic
        let trimmed = html.trim_start();
        let lower = trimmed.to_ascii_lowercase();
        lower.starts_with("<!doctype html>") || lower.starts_with("<html")
    });
if !is_html {
    bail!("Got non-HTML text from {}", url);
}
Ok(html)

Header present → trust it; header absent → case-insensitive, whitespace-tolerant body check. Fixes the original three cases while preserving backwards compatibility for headerless responses.

mise-en-dev added a commit that referenced this pull request Apr 29, 2026
### 🚀 Features

- **(deps)** add aube provider by @jdx in
[#9452](#9452)
- **(ls-remote)** add strict metadata mode by @jdx in
[#9448](#9448)

### 🐛 Bug Fixes

- **(env)** parse concatenated short form `-Eval` correctly by @bts in
[#9456](#9456)
- **(http)** improve HTML detection by using Content-Type header by
@phateffect in [#9407](#9407)
- **(task)** install monorepo subdir tools before running deps by @jdx
in [#9454](#9454)

### 📦️ Dependency Updates

- update astral-tokio-tar advisory by @jdx in
[#9449](#9449)
- respect -q flag for provider command stream by @JamBalaya56562 in
[#9457](#9457)

### New Contributors

- @JamBalaya56562 made their first contribution in
[#9457](#9457)
- @bts made their first contribution in
[#9456](#9456)
- @phateffect made their first contribution in
[#9407](#9407)

---------

Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants