fix(http): improve HTML detection by using Content-Type header#9407
fix(http): improve HTML detection by using Content-Type header#9407
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the get_html method to identify HTML content using the Content-Type header instead of checking for a <!DOCTYPE html> prefix, which improves robustness against leading whitespace and non-standard HTML. It also adds unit tests to verify this new behavior. A review comment suggests refactoring this detection logic into a shared helper to ensure consistency with other methods, such as get_text_with_headers, which still use the legacy validation approach.
| let is_html = resp | ||
| .headers() | ||
| .get(CONTENT_TYPE) | ||
| .and_then(|content_type| content_type.to_str().ok()) | ||
| .is_some_and(|content_type| { | ||
| content_type | ||
| .split_once(';') | ||
| .map_or(content_type, |(media_type, _)| media_type) | ||
| .trim() | ||
| .eq_ignore_ascii_case("text/html") | ||
| }); |
There was a problem hiding this comment.
The improved HTML detection logic using the Content-Type header is a significant improvement over the rigid starts_with check. However, get_text_with_headers (lines 163-170) still uses the old starts_with("<!DOCTYPE html>") logic to detect accidental HTML responses.
To ensure consistency and fix the issues described in the PR (case sensitivity, leading whitespace, and missing DOCTYPE) across the entire client, it would be beneficial to refactor this detection into a shared helper method and update get_text_with_headers as well. This would prevent get_text from incorrectly returning HTML content as text when mirrors return non-standard HTML or have leading whitespace.
Greptile SummaryThis PR replaces the brittle Confidence Score: 4/5Safe to merge with the open review-thread issues in mind; the logic for the happy path is correct and the existing threads capture the most impactful gaps. The implementation is correct for well-behaved servers. The only new finding here is a P2 error-message clarity issue. However, the two open threads (no-header fallback and missing charset test) have not yet been addressed, keeping the ceiling at 4. src/http.rs — specifically the Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant get_html
participant HTTP as HTTP Client
participant Server
Caller->>get_html: get_html(url)
get_html->>HTTP: get_async(url)
HTTP->>Server: GET /path
Server-->>HTTP: 200 OK + headers + body
HTTP-->>get_html: Response (headers available, body streaming)
get_html->>get_html: resp.headers().get(CONTENT_TYPE)
alt Content-Type present and is text/html
get_html->>get_html: is_html = true
get_html->>HTTP: resp.text() — consume body
HTTP-->>get_html: html String
get_html-->>Caller: Ok(html)
else Content-Type absent or not text/html
get_html->>get_html: is_html = false
get_html-->>Caller: Err("Got non-HTML text from …")
end
Reviews (5): Last reviewed commit: "Merge branch 'main' into main" | Re-trigger Greptile |
| let is_html = resp | ||
| .headers() | ||
| .get(CONTENT_TYPE) | ||
| .and_then(|content_type| content_type.to_str().ok()) | ||
| .is_some_and(|content_type| { | ||
| content_type | ||
| .split_once(';') | ||
| .map_or(content_type, |(media_type, _)| media_type) | ||
| .trim() | ||
| .eq_ignore_ascii_case("text/html") | ||
| }); | ||
| if !is_html { | ||
| bail!("Got non-HTML text from {}", url); |
There was a problem hiding this comment.
No fallback for absent
Content-Type header
When a server omits the Content-Type header entirely, is_some_and evaluates to false and the function unconditionally bails. The old code would still succeed in that case if the body began with <!DOCTYPE html>. While RFC-compliant servers always include Content-Type, some legacy registries (the very case this PR targets) may not. Consider adding a fallback to the original body-prefix heuristic when the header is absent, or at minimum adding a test that documents the new "no-header → reject" behavior.
|
Yes, exactly — that's the regression I flagged in my original comment. The old The fix is to add a body-prefix fallback for the no-header case. Read the body unconditionally, then: let html = resp.text().await?;
let is_html = resp
.headers()
.get(CONTENT_TYPE)
.and_then(|ct| ct.to_str().ok())
.map(|ct| {
ct.split_once(';')
.map_or(ct, |(media_type, _)| media_type)
.trim()
.eq_ignore_ascii_case("text/html")
})
.unwrap_or_else(|| {
// No Content-Type header — fall back to body heuristic
let trimmed = html.trim_start();
let lower = trimmed.to_ascii_lowercase();
lower.starts_with("<!doctype html>") || lower.starts_with("<html")
});
if !is_html {
bail!("Got non-HTML text from {}", url);
}
Ok(html)Header present → trust it; header absent → case-insensitive, whitespace-tolerant body check. Fixes the original three cases while preserving backwards compatibility for headerless responses. |
### 🚀 Features - **(deps)** add aube provider by @jdx in [#9452](#9452) - **(ls-remote)** add strict metadata mode by @jdx in [#9448](#9448) ### 🐛 Bug Fixes - **(env)** parse concatenated short form `-Eval` correctly by @bts in [#9456](#9456) - **(http)** improve HTML detection by using Content-Type header by @phateffect in [#9407](#9407) - **(task)** install monorepo subdir tools before running deps by @jdx in [#9454](#9454) ### 📦️ Dependency Updates - update astral-tokio-tar advisory by @jdx in [#9449](#9449) - respect -q flag for provider command stream by @JamBalaya56562 in [#9457](#9457) ### New Contributors - @JamBalaya56562 made their first contribution in [#9457](#9457) - @bts made their first contribution in [#9456](#9456) - @phateffect made their first contribution in [#9407](#9407) --------- Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
The current implementation of
get_htmluses a strict string-prefix checkstarts_with("<!DOCTYPE html>"))which fails in several real-world scenarios common to PyPI registries.
starts_withcheck is case-sensitive and too rigid.<!DOCTYPE html>declaration entirely, starting directly with<html>.Detailed HTTP response examples for cases 2 and 3 can be found in #9399.
This PR refactors the detection logic to rely on the Content-Type HTTP header, which is the industry-standard way to identify media types.
This change aligns mise with the behavior of other tools like
pip,pipx, anduv, ensuring better compatibility across diverse infrastructure.