MEDIA: tag silently drops .md (and other) files due to regex whitelist mismatch #34517

@crazyhulk

Related

Introduced by PR #28350 (diagnosable MEDIA rejections + canonical cache roots + null-path guard).

Problem

extract_media uses a strict extension whitelist that does not include .md (nor .json, .yaml, .xml, .tsv, etc.), while the fallback extract_local_files does support them.

However, the cleanup at line 3709 unconditionally strips every MEDIA: tag from the response text with a loose regex (MEDIA:\s*\S+), including tags that extract_media failed to match.

This creates a black hole for unsupported extensions:

  1. extract_media (strict regex) → no match for .md
  2. Cleanup regex re.sub(r"MEDIA:\s*\S+", "", ...) → removes the path from text
  3. extract_local_files (broad extension list) → runs on already-cleaned text, path is gone

Result: The file is neither extracted as media nor detected as a bare path. The user receives nothing.

Reproduction

import re

# extract_media pattern (line 2524). The "..." stands for the path sub-pattern,
# which is elided in this report; as a regex it matches any three characters.
media_pattern = re.compile(
    r'''[`"']?MEDIA:\s*(?P<path>...\.(?:png|jpe?g|gif|webp|mp4|mov|avi|mkv|webm|ogg|opus|mp3|wav|m4a|flac|epub|pdf|zip|rar|7z|docx?|xlsx?|pptx?|txt|csv|apk|ipa)(?=[\s\`"',;:)\]}]|$))[`"']?'''
)

# cleanup pattern (line 3709)
cleanup = re.compile(r'MEDIA:\s*\S+')

text = 'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'

assert media_pattern.search(text) is None        # not extracted
cleaned = cleanup.sub('', text).strip()
assert '/tmp/paid_users_up_analysis.md' not in cleaned  # path gone
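
Step 3 of the chain can be checked the same way. The sketch below is only an illustration: bare_path is a hypothetical stand-in for extract_local_files, whose real pattern is not shown in this issue, and its extension list is an assumed subset. It shows that a broad bare-path detector finds the file in the original text but finds nothing once the cleanup regex has run.

```python
import re

# Hypothetical stand-in for extract_local_files: a bare-path detector with a
# broader extension list. The real pattern is not reproduced in this issue.
bare_path = re.compile(r'(?:~/|/)\S+\.(?:md|json|ya?ml|xml|tsv|txt|csv)\b')

# Cleanup pattern as quoted in the report (line 3709).
cleanup = re.compile(r'MEDIA:\s*\S+')

text = 'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'
cleaned = cleanup.sub('', text).strip()

# Running the fallback before cleanup would have detected the path ...
found_before = bare_path.findall(text)
# ... but it runs afterwards, on text that no longer contains it.
found_after = bare_path.findall(cleaned)
```

Here found_before holds the .md path while found_after is empty, which matches the "user receives nothing" outcome described above.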

Fix

Align extract_media's extension whitelist with extract_local_files's supported set. Missing extensions include: md, json, xml, ya?ml, tsv, odt, rtf, bmp, tiff, svg, tar, gz, tgz, bz2, xz, ods, odp, key. (xls and ppt are already matched by xlsx? and pptx? in the pattern above.)
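
One way to keep the two lists from drifting apart again is to build the MEDIA: regex from a single shared extension list. The following is a minimal sketch under stated assumptions, not the project's actual code: SHARED_EXTENSIONS and build_media_pattern are hypothetical names, and the path sub-pattern (?:~/|/)\S+? is a simplified placeholder for the one elided in the reproduction.

```python
import re

# Hypothetical shared list of regex fragments; in the real code base this
# would be the set extract_local_files already supports.
SHARED_EXTENSIONS = [
    # already in the extract_media whitelist quoted in this issue
    "png", "jpe?g", "gif", "webp", "mp4", "mov", "avi", "mkv", "webm",
    "ogg", "opus", "mp3", "wav", "m4a", "flac", "epub", "pdf", "zip",
    "rar", "7z", "docx?", "xlsx?", "pptx?", "txt", "csv", "apk", "ipa",
    # listed as missing in this issue
    "md", "json", "xml", "ya?ml", "tsv", "odt", "rtf", "bmp", "tiff",
    "svg", "tar", "gz", "tgz", "bz2", "xz", "ods", "odp", "key",
]


def build_media_pattern(extensions):
    """Compile a MEDIA: tag regex whose whitelist comes from `extensions`."""
    ext_alt = "|".join(extensions)
    return re.compile(
        # Assumed, simplified path sub-pattern: an absolute or ~/ path
        # without whitespace, ending in a whitelisted extension.
        r'''[`"']?MEDIA:\s*(?P<path>(?:~/|/)\S+?\.(?:'''
        + ext_alt
        + r''')(?=[\s`"',;:)\]}]|$))[`"']?'''
    )


media_pattern = build_media_pattern(SHARED_EXTENSIONS)
match = media_pattern.search(
    'Here is your report: MEDIA:/tmp/paid_users_up_analysis.md'
)
```

With the extended list, the tag from the reproduction is matched and match.group('path') yields the .md path, so the later cleanup step no longer discards an unextracted file.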

Labels

  P2: Medium, degraded but workaround exists
  comp/gateway: Gateway runner, session dispatch, delivery
  type/bug: Something isn't working