HTML API

WordPress HTML parsing and modification framework providing spec-compliant HTML5 processing.

Since: 6.2.0
Source: wp-includes/html-api/

Components

| Component | Description |
| --- | --- |
| class-wp-html-tag-processor.md | Tag-level HTML scanner and modifier |
| class-wp-html-processor.md | Full HTML5 parser with tree construction |
| class-wp-html-decoder.md | HTML character reference decoding |
| class-wp-html-doctype-info.md | DOCTYPE token parsing |
| class-wp-html-open-elements.md | Stack of open elements |
| class-wp-html-active-formatting-elements.md | Active formatting elements list |
| class-wp-html-processor-state.md | Parser state management |
| class-wp-html-token.md | Token representation |
| class-wp-html-attribute-token.md | Attribute token data |
| class-wp-html-span.md | Text span data structure |
| class-wp-html-text-replacement.md | Text replacement data |
| class-wp-html-stack-event.md | Stack operation record |
| class-wp-html-unsupported-exception.md | Unsupported markup exception |
| hooks.md | Actions and filters |

Architecture

WP_HTML_Tag_Processor (6.2.0)
    ├── Linear token scanner
    ├── Attribute modification
    └── Bookmark navigation

WP_HTML_Processor (6.4.0) extends WP_HTML_Tag_Processor
    ├── HTML5 tree construction
    ├── Stack of open elements
    ├── Active formatting elements
    └── Breadcrumb navigation
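
Bookmark navigation on the Tag Processor can be sketched as follows. This is a minimal illustration, assuming WordPress 6.2 or later is loaded; the function name `example_flag_heading_with_link()` and the `has-link` class are hypothetical and not part of the API.

```php
// Hypothetical helper: add a class to the first H2, but only if a link appears after it.
// Assumes WordPress (6.2+) is loaded so WP_HTML_Tag_Processor is available.
function example_flag_heading_with_link( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );

    // Find the first H2 and remember its position under a named bookmark.
    if ( ! $processor->next_tag( 'h2' ) || ! $processor->set_bookmark( 'heading' ) ) {
        return $html;
    }

    // Scan forward; if a link exists, jump back to the bookmark and modify the heading.
    if ( $processor->next_tag( 'a' ) && $processor->seek( 'heading' ) ) {
        $processor->add_class( 'has-link' );
    }

    // Bookmarks are a limited resource, so release them when done.
    $processor->release_bookmark( 'heading' );

    return $processor->get_updated_html();
}
```

Because the scanner only moves forward on its own, a bookmark is the way to return to an earlier tag once later content has been inspected.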

Usage Patterns

Tag Processor (Simple Modifications)

php
// Modify attributes on specific tags
$processor = new WP_HTML_Tag_Processor( $html );
while ( $processor->next_tag( 'img' ) ) {
    $processor->set_attribute( 'loading', 'lazy' );
}
$html = $processor->get_updated_html();

HTML Processor (Structure-Aware)

php
// Modify based on document structure
$processor = WP_HTML_Processor::create_fragment( $html );
// create_fragment() returns null when it cannot create a processor for the input.
if ( null !== $processor ) {
    while ( $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
        $processor->add_class( 'figure-image' );
    }
    $html = $processor->get_updated_html();
}

Processing Flow

Tag Processor

next_tag() / next_token()
    ├── parse_next_tag()
    ├── parse_next_attribute() (loop)
    └── matches() query check

Modification methods:
    ├── set_attribute() → lexical_updates[]
    ├── remove_attribute() → lexical_updates[]
    ├── add_class() → classname_updates[]
    ├── remove_class() → classname_updates[]
    └── set_modifiable_text() → lexical_updates[]

get_updated_html()
    └── apply_attributes_updates()
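
The queue-then-apply flow above can be seen from the public API alone. The sketch below assumes WordPress 6.2 or later is loaded; `example_queue_image_updates()` and the `wide` class are hypothetical names used only for illustration.

```php
// Hypothetical helper showing several edits queued on one tag and applied together.
// Assumes WordPress (6.2+) is loaded so WP_HTML_Tag_Processor is available.
function example_queue_image_updates( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );

    if ( ! $processor->next_tag( 'img' ) ) {
        return $html; // No IMG found; nothing to change.
    }

    $processor->set_attribute( 'loading', 'lazy' ); // Queued as a lexical update.
    $processor->remove_attribute( 'width' );        // Queued as a lexical update.
    $processor->add_class( 'wide' );                // Queued as a class name update.

    // Returns the document with the queued edits applied.
    return $processor->get_updated_html();
}
```

None of the modification calls rewrites the string on its own; the caller gets the combined result from `get_updated_html()`.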

HTML Processor

create_fragment() / create_full_parser()
    └── Initialize state, stacks

next_tag() / next_token()
    └── step() insertion mode dispatch
        ├── step_in_body()
        ├── step_in_table()
        ├── step_in_head()
        └── ... other modes

Tree operations:
    ├── insert_html_element()
    ├── generate_implied_end_tags()
    ├── reconstruct_active_formatting_elements()
    └── run_adoption_agency_algorithm()
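
One visible effect of the tree operations is that breadcrumbs reflect implied end tags. The following sketch assumes a recent WordPress version (list items are not supported by the earliest HTML Processor releases); `example_list_breadcrumbs()` is a hypothetical helper.

```php
// Hypothetical helper: return the breadcrumb path of every opening tag in a fragment.
// Assumes a recent WordPress version is loaded so WP_HTML_Processor supports the markup.
function example_list_breadcrumbs( $html ) {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return array(); // The processor could not be created.
    }

    $paths = array();
    while ( $processor->next_tag() ) {
        // get_breadcrumbs() lists the open elements from the root down to the current tag.
        $paths[] = implode( ' > ', $processor->get_breadcrumbs() );
    }

    return $paths;
}
```

For input such as `<ul><li>one<li>two</ul>`, both LI elements should report the same path under UL, because the second `<li>` implicitly closes the first.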

Token Types

| Token | Description |
| --- | --- |
| #tag | HTML element (opening or closing) |
| #text | Text node content |
| #comment | HTML comment |
| #doctype | DOCTYPE declaration |
| #cdata-section | CDATA section (only in foreign content: SVG/MathML) |
| #funky-comment | Invalid tag closer as comment |
| #presumptuous-tag | Empty end tag </> |
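
The token types above are what `get_token_type()` reports while scanning with `next_token()`. A minimal sketch, assuming WordPress 6.5 or later is loaded; `example_list_token_types()` is a hypothetical helper.

```php
// Hypothetical helper: collect the type of every token in a document.
// Assumes WordPress (6.5+) is loaded, where next_token() is available.
function example_list_token_types( $html ) {
    $processor = new WP_HTML_Tag_Processor( $html );

    $types = array();
    while ( $processor->next_token() ) {
        // One of the type strings from the table, such as "#tag" or "#text".
        $types[] = $processor->get_token_type();
    }

    return $types;
}
```

Unlike `next_tag()`, `next_token()` also stops at text nodes, comments, and closing tags, so the returned list covers the whole input.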

Compatibility Modes

| Mode | Description |
| --- | --- |
| no-quirks | Standards mode |
| limited-quirks | Almost standards mode |
| quirks | Quirks mode (legacy) |
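
The compatibility mode is indicated by the document's DOCTYPE. The sketch below reads it through `WP_HTML_Doctype_Info`, assuming WordPress 6.7 or later is loaded; `example_compat_mode()` is a hypothetical helper.

```php
// Hypothetical helper: report the compatibility mode a DOCTYPE token indicates.
// Assumes WordPress (6.7+) is loaded so WP_HTML_Doctype_Info is available.
function example_compat_mode( $doctype_html ) {
    $doctype = WP_HTML_Doctype_Info::from_doctype_token( $doctype_html );

    // from_doctype_token() returns null when the input is not a DOCTYPE token.
    if ( null === $doctype ) {
        return null;
    }

    // One of "no-quirks", "limited-quirks", or "quirks".
    return $doctype->indicated_compatibility_mode;
}
```

Passing `<!DOCTYPE html>` should yield `no-quirks`, the mode used by modern documents.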

Special Elements

Certain elements have special parsing rules:

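  • SCRIPT: Raw text, with the legacy HTML comment (`<!--`) escaping rules
  • STYLE: Raw text
  • TITLE/TEXTAREA: Plain text in which character references are decoded but tags are not recognized
  • IFRAME/NOSCRIPT/NOEMBED: Raw text, no character reference decoding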
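
The difference between decoded and raw text shows up in `get_modifiable_text()`. A minimal sketch, assuming WordPress 6.5 or later is loaded; `example_modifiable_text()` is a hypothetical helper.

```php
// Hypothetical helper: return the text content of the first matching special element.
// Assumes WordPress (6.5+) is loaded, where get_modifiable_text() is available.
function example_modifiable_text( $html, $tag_name ) {
    $processor = new WP_HTML_Tag_Processor( $html );

    if ( ! $processor->next_tag( $tag_name ) ) {
        return null; // No such element in the input.
    }

    // For TITLE and TEXTAREA this is decoded text; for SCRIPT and STYLE it is raw.
    return $processor->get_modifiable_text();
}
```

With `<title>Fish &amp; Chips</title>` the TITLE text comes back with a literal `&`, while the same `&amp;` inside a SCRIPT element is returned unchanged.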

Namespaces

| Namespace | Context |
| --- | --- |
| html | Standard HTML |
| svg | SVG foreign content |
| math | MathML foreign content |
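
The namespace of the current element can be read with `get_namespace()`. This sketch assumes WordPress 6.7 or later is loaded, where that method and foreign content support are available; `example_list_namespaces()` is a hypothetical helper.

```php
// Hypothetical helper: list the namespace of every opening tag in a fragment.
// Assumes WordPress (6.7+) is loaded so get_namespace() is available.
function example_list_namespaces( $html ) {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return array(); // The processor could not be created.
    }

    $namespaces = array();
    while ( $processor->next_tag() ) {
        // One of "html", "svg", or "math".
        $namespaces[] = $processor->get_namespace();
    }

    return $namespaces;
}
```

For `<p><svg><circle /></svg></p>`, the P element should report `html` and both the SVG and CIRCLE elements should report `svg`.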

Design Principles

  1. Spec Compliance: Implements the HTML5 parsing specification
  2. Safety First: The HTML Processor aborts (bails) on unsupported or unrecognized markup rather than producing incorrect output or breaking the document
  3. Garbage-in, Garbage-out: The Tag Processor passes invalid input through unchanged instead of trying to repair it
  4. Minimal Diff: Changes only the parts of the document that were modified and preserves the original formatting where possible
  5. No Tree Construction: The Tag Processor scans tokens linearly and does not build a tree
  6. Memory Efficient: Works on the original string and tracks positions within it, so memory overhead stays low
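
In calling code, the "abort rather than break" principle means checking whether the HTML Processor stopped early before trusting its output. A minimal sketch, assuming WordPress 6.4 or later is loaded; `example_add_class_safely()` and the `processed` class are hypothetical names.

```php
// Hypothetical helper: modify IMG tags, falling back to the original HTML if parsing aborted.
// Assumes WordPress (6.4+) is loaded so WP_HTML_Processor is available.
function example_add_class_safely( $html ) {
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return $html; // The processor could not be created.
    }

    while ( $processor->next_tag( 'IMG' ) ) {
        $processor->add_class( 'processed' );
    }

    // A non-null error means the parser bailed, e.g. on unsupported markup,
    // so some tags may not have been visited. Keep the original document.
    if ( null !== $processor->get_last_error() ) {
        return $html;
    }

    return $processor->get_updated_html();
}
```

Returning the untouched input on error is one possible policy; a caller could instead keep the partial result if edits to the visited portion are acceptable.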