
feat(msteams): extract structured quote/reply context from HTML attachments #44739

Closed
robinhanksliu wants to merge 3 commits into openclaw:main from robinhanksliu:feat/msteams-quote-context

Conversation

@robinhanksliu

Problem

When a Teams user quotes/replies to a message, the quoted sender name and body are merged into activity.text as a flat string. The agent receives something like:

Jianmei YuRobin's Claw Did you secretly change the formatting?

...with no way to distinguish the quoted content from the actual message, or identify who originally wrote the quoted part.

Solution

  • inbound.ts: Add extractMSTeamsQuoteInfo() to parse Teams <blockquote> HTML from text/html attachments. Supports both the schema.skype.com/Reply format and simpler blockquote variants.
  • message-handler.ts:
    • Call quote extraction on inbound messages
    • Populate ReplyToSender and ReplyToBody in the inbound context (consistent with Telegram and WhatsApp implementations)
    • Format the agent body with a [Replying to ...] annotation block

After this change, the agent sees:

actual message text

[Replying to Jianmei Yu]
Did you secretly change the formatting?
[/Replying]
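
The annotation format above can be sketched as a small helper. This is a minimal illustration rather than the PR's actual code: `formatReplyAnnotation` is a hypothetical name, while the `"unknown"` and `"(no text)"` defaults mirror the template string quoted later in the review thread.

```typescript
// Hypothetical helper; the name and signature are illustrative assumptions.
function formatReplyAnnotation(
  cleanBody: string,
  quotedSender?: string,
  quotedBody?: string,
): string {
  const sender = quotedSender ?? "unknown";
  const body = quotedBody ?? "(no text)";
  // The user's own message comes first, followed by the delimited quote block.
  return `${cleanBody}\n\n[Replying to ${sender}]\n${body}\n[/Replying]`;
}

console.log(
  formatReplyAnnotation("actual message text", "Jianmei Yu", "quoted message content"),
);
```

Keeping the quote in a delimited trailing block lets the agent read the user's own words first and treat the quoted text as context.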

Changes

| File | Change |
| --- | --- |
| extensions/msteams/src/inbound.ts | +extractMSTeamsQuoteInfo(), MSTeamsQuoteInfo type, HTML parsing helpers |
| extensions/msteams/src/inbound.test.ts | +6 test cases covering various blockquote formats |
| extensions/msteams/src/monitor-handler/message-handler.ts | Wire up quote extraction, populate ReplyToSender/ReplyToBody, format [Replying to ...] block |

Testing

  • All 233 existing msteams tests pass ✅
  • 6 new tests added for quote extraction ✅
  • 1 pre-existing failure in probe.test.ts (unrelated to this change)

…hments

When a Teams user quotes/replies to a message, the quoted sender name and
body are mixed into activity.text as a flat string, making it impossible
for the agent to distinguish the quote from the actual message.

This change:
- Adds extractMSTeamsQuoteInfo() to parse Teams blockquote HTML from
  text/html attachments (supports both schema.skype.com/Reply format
  and simpler blockquote variants)
- Populates ReplyToSender and ReplyToBody in the inbound context
  (consistent with Telegram and WhatsApp implementations)
- Formats the agent body with a [Replying to ...] annotation block
  so the LLM receives structured context about quoted messages
- Adds 6 test cases covering various blockquote formats

The agent now sees:

  actual message text

  [Replying to Jianmei Yu]
  quoted message content
  [/Replying]

instead of the previous flat string where all content was merged.
@openclaw-barnacle openclaw-barnacle Bot added channel: msteams Channel integration: msteams size: M labels Mar 13, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 13, 2026

Greptile Summary

This PR adds structured quote/reply extraction for Microsoft Teams inbound messages. When a Teams user quotes a message, the quoted sender name and body are now parsed from the text/html attachment (using the schema.skype.com/Reply blockquote format or a simpler fallback), and the agent sees a clean [Replying to …] annotation block instead of the flat merged string. ReplyToSender and ReplyToBody are also populated on the inbound context payload, bringing Teams in line with the Telegram and WhatsApp implementations.

Key observations:

  • The HTML parsing relies on regex rather than a DOM parser, which is sufficient for Teams' well-structured output but has known limitations around nested blockquotes.
  • htmlToPlainText only decodes six named HTML entities (&nbsp;, &amp;, &lt;, &gt;, &quot;, &#39;); numeric entities (e.g. &#160;, &#x2019;) would be left as literal strings in extracted sender/body text.
  • When the text/html attachment contains only the blockquote and no trailing content, cleanBody falls back to the full merged activity.text. In that path, agentBody emits the quote context twice — once embedded in cleanBody and again in the [Replying to …] block — without any structured separation of the actual message.
  • 6 new unit tests cover the main code paths well, including the fallback case.

Confidence Score: 4/5

  • Safe to merge; the feature is additive and well-tested, with no changes to existing message routing or auth logic.
  • The implementation is logically sound and consistent with the Telegram/WhatsApp pattern. The two concerns (incomplete entity decoding and the duplicate-content fallback path) are edge cases that don't affect the common case and don't cause data loss or incorrect routing. The test suite is comprehensive.
  • extensions/msteams/src/inbound.ts (entity decoding) and extensions/msteams/src/monitor-handler/message-handler.ts (agentBody fallback path)
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/msteams/src/inbound.ts
Line: 67-81

Comment:
**Incomplete HTML entity decoding**

`htmlToPlainText` only decodes six named entities (`&nbsp;`, `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&#39;`). Any other numeric or named HTML entity — e.g. `&#160;` (non-breaking space), `&#x2019;` (right single quotation mark), `&mdash;`, `&eacute;`, etc. — will appear as raw literal strings in the extracted `quotedSender` / `quotedBody` / `cleanBody`.

Teams can produce numeric entities in generated HTML (e.g. for curly quotes or special Unicode characters). Consider adding a generic numeric-entity fallback:

```ts
.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(Number(dec)))
```

or use a lightweight library like `he` to handle the full entity spectrum.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/msteams/src/monitor-handler/message-handler.ts
Line: 514-517

Comment:
**Quote content duplicated in agent body when HTML lacks post-blockquote text**

When the `text/html` attachment contains only the `<blockquote>` and no trailing content (e.g. some Teams clients put the new message only in `activity.text`), `htmlToPlainText(afterBlockquote)` returns `""` and `cleanBody` falls back to the full `fallbackText` — the merged flat string that already contains the quoted sender name, the quoted body, and the actual message all concatenated.

In that case `agentBody` becomes:

```
Bobsome quotemy actual message

[Replying to Bob]
some quote
[/Replying]
```

The `quotedBody` (`some quote`) appears twice — once buried inside `cleanBody` and again inside the `[Replying to …]` block. The agent cannot distinguish the user's actual message from the quoted portion inside `cleanBody`, so the annotation provides little value in this fallback path.

One approach: when `cleanBody === fallbackText`, skip emitting the `[Replying to …]` block (since the structured separation can't be guaranteed anyway), or at least avoid duplicating the annotation:

```ts
const agentBody = quoteInfo
  ? quoteInfo.cleanBody !== fallbackText
    ? quoteInfo.cleanBody +
      `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
    : rawBody  // structured extraction failed; surface as-is
  : rawBody;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/msteams/src/inbound.ts
Line: 147-158

Comment:
**Regex can mis-match on nested or malformed blockquotes**

The non-greedy `([\s\S]*?)<\/blockquote>` pattern stops at the **first** `</blockquote>` it encounters. If Teams ever nests a quoted reply inside another reply (e.g. a thread reply to a reply), or if the blockquote's inner content itself contains a literal `</blockquote>` string (unlikely but possible in forwarded HTML content), the captured group `bqMatch[1]` will be truncated and both `quotedSender` and `quotedBody` may be wrong or empty.

For the current Teams use-case this is probably fine, but it's worth a brief code comment acknowledging the limitation so future maintainers aren't surprised:

```ts
// Note: non-greedy match stops at the first </blockquote>, so nested
// blockquotes are not supported. Teams does not currently produce them
// for quote/reply scenarios.
const blockquoteRe = ...
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: bf744ac

Comment on lines +67 to +81
function htmlToPlainText(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/p>/gi, "\n")
    .replace(/<[^>]*>/g, "")
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/&lt;/gi, "<")
    .replace(/&gt;/gi, ">")
    .replace(/&quot;/gi, '"')
    .replace(/&#39;/gi, "'")
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

Incomplete HTML entity decoding

htmlToPlainText only decodes six named entities (&nbsp;, &amp;, &lt;, &gt;, &quot;, &#39;). Any other numeric or named HTML entity — e.g. &#160; (non-breaking space), &#x2019; (right single quotation mark), &mdash;, &eacute;, etc. — will appear as raw literal strings in the extracted quotedSender / quotedBody / cleanBody.

Teams can produce numeric entities in generated HTML (e.g. for curly quotes or special Unicode characters). Consider adding a generic numeric-entity fallback:

.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(Number(dec)))

or use a lightweight library like he to handle the full entity spectrum.


Comment on lines +514 to +517
const agentBody = quoteInfo
  ? quoteInfo.cleanBody +
    `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
  : rawBody;

Quote content duplicated in agent body when HTML lacks post-blockquote text

When the text/html attachment contains only the <blockquote> and no trailing content (e.g. some Teams clients put the new message only in activity.text), htmlToPlainText(afterBlockquote) returns "" and cleanBody falls back to the full fallbackText — the merged flat string that already contains the quoted sender name, the quoted body, and the actual message all concatenated.

In that case agentBody becomes:

Bobsome quotemy actual message

[Replying to Bob]
some quote
[/Replying]

The quotedBody (some quote) appears twice — once buried inside cleanBody and again inside the [Replying to …] block. The agent cannot distinguish the user's actual message from the quoted portion inside cleanBody, so the annotation provides little value in this fallback path.

One approach: when cleanBody === fallbackText, skip emitting the [Replying to …] block (since the structured separation can't be guaranteed anyway), or at least avoid duplicating the annotation:

const agentBody = quoteInfo
  ? quoteInfo.cleanBody !== fallbackText
    ? quoteInfo.cleanBody +
      `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
    : rawBody  // structured extraction failed; surface as-is
  : rawBody;

Comment on lines +147 to +158
const blockquoteRe =
  /<blockquote[^>]*itemtype=["']http:\/\/schema\.skype\.com\/Reply["'][^>]*>([\s\S]*?)<\/blockquote>/i;
const bqMatch = blockquoteRe.exec(html);
if (!bqMatch) {
  // Try a more lenient blockquote match (some Teams clients omit schema attrs).
  const simpleBqRe = /<blockquote[^>]*>([\s\S]*?)<\/blockquote>/i;
  const simpleBqMatch = simpleBqRe.exec(html);
  if (!simpleBqMatch) {
    continue;
  }
  return parseBlockquoteContent(simpleBqMatch, html, fallbackText);
}

Regex can mis-match on nested or malformed blockquotes

The non-greedy ([\s\S]*?)<\/blockquote> pattern stops at the first </blockquote> it encounters. If Teams ever nests a quoted reply inside another reply (e.g. a thread reply to a reply), or if the blockquote's inner content itself contains a literal </blockquote> string (unlikely but possible in forwarded HTML content), the captured group bqMatch[1] will be truncated and both quotedSender and quotedBody may be wrong or empty.

For the current Teams use-case this is probably fine, but it's worth a brief code comment acknowledging the limitation so future maintainers aren't surprised:

// Note: non-greedy match stops at the first </blockquote>, so nested
// blockquotes are not supported. Teams does not currently produce them
// for quote/reply scenarios.
const blockquoteRe = ...
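
The truncation is easy to reproduce in isolation. The sketch below runs the same non-greedy pattern against a hand-written nested input; the HTML is invented for illustration and is not a captured Teams payload.

```typescript
const blockquoteRe =
  /<blockquote[^>]*itemtype=["']http:\/\/schema\.skype\.com\/Reply["'][^>]*>([\s\S]*?)<\/blockquote>/i;

// Invented input: an outer reply blockquote that itself contains a blockquote.
const nestedHtml =
  '<blockquote itemtype="http://schema.skype.com/Reply">' +
  "<strong>Alice</strong>" +
  "<blockquote>inner quote</blockquote>" +
  "<p>outer quote text</p>" +
  "</blockquote>" +
  "<p>actual message</p>";

const captured = blockquoteRe.exec(nestedHtml)?.[1] ?? "";

// The capture ends at the inner </blockquote>, so "outer quote text" is lost
// and would instead be treated as content following the quote.
console.log(captured);
```

Handling nesting correctly would need a depth counter or a real HTML parser; a single regex cannot balance the tags.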


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bf744ac425

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread extensions/msteams/src/inbound.ts Outdated
Comment on lines +152 to +153
const simpleBqRe = /<blockquote[^>]*>([\s\S]*?)<\/blockquote>/i;
const simpleBqMatch = simpleBqRe.exec(html);

P2: Avoid treating generic blockquotes as reply metadata

The fallback path matches any <blockquote> when the Skype reply schema is missing, so a normal user-authored quote block in a text/html attachment is reinterpreted as a reply and then rewritten downstream into [Replying to ...] context. This changes message meaning in conversations that use blockquote formatting without using Teams reply UI, because extractMSTeamsQuoteInfo will still return quote data from plain formatting HTML.
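
A strict variant that only accepts blockquotes carrying the `schema.skype.com/Reply` item type might look like the sketch below. It is written under assumptions: the names `QuoteInfoSketch`, `stripTags`, and `extractQuoteSketch` are hypothetical, and the inner markup (sender in the first `<strong>` element, remaining content as the quoted body) is a guess at the Teams format, since `parseBlockquoteContent` is not shown in this thread.

```typescript
// Hypothetical result shape; the PR's MSTeamsQuoteInfo type is not shown here.
interface QuoteInfoSketch {
  quotedSender: string | undefined;
  quotedBody: string | undefined;
  cleanBody: string;
}

// Minimal tag stripper for the sketch (no entity decoding).
function stripTags(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/p>/gi, "\n")
    .replace(/<[^>]*>/g, "")
    .replace(/[ \t]+/g, " ")
    .trim();
}

function extractQuoteSketch(html: string, fallbackText: string): QuoteInfoSketch | undefined {
  // Only blockquotes carrying the Skype reply schema count as reply metadata.
  const replyRe =
    /<blockquote[^>]*itemtype=["']http:\/\/schema\.skype\.com\/Reply["'][^>]*>([\s\S]*?)<\/blockquote>/i;
  const match = replyRe.exec(html);
  if (!match) {
    return undefined; // plain formatting blockquote, or no blockquote at all
  }
  const inner = match[1] ?? "";
  // Assumption: the sender name is the first <strong> element inside the quote.
  const senderMatch = /<strong[^>]*>([\s\S]*?)<\/strong>/i.exec(inner);
  const quotedSender = senderMatch ? stripTags(senderMatch[1] ?? "") : undefined;
  const bodyHtml = senderMatch ? inner.replace(senderMatch[0], "") : inner;
  const quotedBody = stripTags(bodyHtml) || undefined;
  // Everything after the closing tag is treated as the user's own message.
  const afterQuote = stripTags(html.slice(match.index + match[0].length));
  return { quotedSender, quotedBody, cleanBody: afterQuote || fallbackText };
}
```

With this shape, a user-authored `<blockquote>` without the schema attribute yields `undefined`, so the message is passed through unchanged.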


Comment on lines +514 to +517
const agentBody = quoteInfo
  ? quoteInfo.cleanBody +
    `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
  : rawBody;

P2: Preserve attachment placeholder in quoted attachment-only messages

When quote info is present, BodyForAgent is always rebuilt from quoteInfo.cleanBody plus the reply annotation and never falls back to rawBody. For quote messages where text is empty and rawBody came from buildMSTeamsAttachmentPlaceholder (for example, attachment-only replies), this drops the placeholder and hides that media/doc content was sent, reducing the agent's available context.
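
One way to cover both this case and the earlier duplicate-content concern is sketched below. `buildAgentBody` and `QuoteFields` are hypothetical names, and the equality check against `fallbackText` is the heuristic proposed in the review above, not a confirmed part of the final implementation.

```typescript
// Hypothetical shape; field names follow the snippets quoted in this thread.
interface QuoteFields {
  quotedSender?: string;
  quotedBody?: string;
  cleanBody: string;
}

function buildAgentBody(
  rawBody: string,
  fallbackText: string,
  quoteInfo?: QuoteFields,
): string {
  if (!quoteInfo) {
    return rawBody;
  }
  const clean = quoteInfo.cleanBody.trim();
  // Extraction could not separate the message from the quote: cleanBody is the
  // merged flat text, so annotating it would duplicate the quoted content.
  if (clean !== "" && clean === fallbackText.trim()) {
    return rawBody;
  }
  // Attachment-only reply: no message text, so keep rawBody (which may be an
  // attachment placeholder) as the visible message.
  const message = clean === "" ? rawBody : clean;
  const sender = quoteInfo.quotedSender ?? "unknown";
  const body = quoteInfo.quotedBody ?? "(no text)";
  return `${message}\n\n[Replying to ${sender}]\n${body}\n[/Replying]`;
}
```

The empty-text check runs before the placeholder substitution so that an attachment-only reply still carries its quote annotation.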


- Add numeric entity decoding (&#x...; and &#...;) in htmlToPlainText
- Remove lenient blockquote fallback to avoid misinterpreting generic
  blockquotes as reply metadata (only match schema.skype.com/Reply)
- Add comment about nested blockquote limitation
- Guard against duplication when cleanBody falls back to raw text
- Preserve attachment placeholder for attachment-only replies
- Add test for numeric entity decoding
- Update simple blockquote test to verify it is now ignored

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 75ddf47325


Comment on lines +80 to +81
.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(Number(dec)))

P2: Guard numeric entity decoding against invalid code points

htmlToPlainText decodes &#x...; / &#...; with String.fromCodePoint(...) directly, which throws RangeError for values above 0x10FFFF (for example &#x110000;). Because extractMSTeamsQuoteInfo is called on every inbound message before context finalization, a malformed or untrusted HTML attachment can make message handling fail for that turn instead of gracefully treating the entity as plain text.
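
A guarded decoder could validate the code point before calling `String.fromCodePoint` and leave the entity as literal text otherwise. The sketch below is one possible shape; `decodeCodePoint` and `decodeNumericEntities` are hypothetical names, not functions from the PR.

```typescript
// Returns the decoded character, or the original entity text when the value is
// outside the Unicode range (String.fromCodePoint would throw a RangeError).
function decodeCodePoint(entity: string, codePoint: number): string {
  const valid =
    Number.isInteger(codePoint) && codePoint >= 0 && codePoint <= 0x10ffff;
  return valid ? String.fromCodePoint(codePoint) : entity;
}

function decodeNumericEntities(text: string): string {
  return text
    // Hexadecimal form, e.g. &#x2019;
    .replace(/&#x([0-9a-f]+);/gi, (entity: string, hex: string) =>
      decodeCodePoint(entity, parseInt(hex, 16)),
    )
    // Decimal form, e.g. &#160;
    .replace(/&#([0-9]+);/g, (entity: string, dec: string) =>
      decodeCodePoint(entity, Number(dec)),
    );
}

console.log(decodeNumericEntities("it&#x2019;s fine, &#x110000; is left alone"));
```

Leaving an invalid entity untouched keeps a malformed attachment from aborting message handling while still showing the reader what was sent.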


@steipete
Contributor

Closing this as implemented after Codex review.

Current main already extracts structured Microsoft Teams reply/quote data from HTML attachments, stores it as inbound reply context, and exposes that context to the agent through the shared inbound metadata prompt. The requested capability is present and was already shipped in v2026.4.22.

What I checked:

So I’m closing this as already implemented rather than keeping a duplicate issue open.

Review notes: reviewed against 38caa6832d4e; fix evidence: release v2026.4.22, commit 00bd2cf7a376.

@steipete steipete closed this Apr 25, 2026
