feat(msteams): extract structured quote/reply context from HTML attachments #44739
robinhanksliu wants to merge 3 commits into openclaw:main
…hments

When a Teams user quotes/replies to a message, the quoted sender name and body are mixed into `activity.text` as a flat string, making it impossible for the agent to distinguish the quote from the actual message.

This change:
- Adds `extractMSTeamsQuoteInfo()` to parse Teams blockquote HTML from `text/html` attachments (supports both the `schema.skype.com/Reply` format and simpler blockquote variants)
- Populates `ReplyToSender` and `ReplyToBody` in the inbound context (consistent with the Telegram and WhatsApp implementations)
- Formats the agent body with a `[Replying to ...]` annotation block so the LLM receives structured context about quoted messages
- Adds 6 test cases covering various blockquote formats

The agent now sees:

    actual message text

    [Replying to Jianmei Yu]
    quoted message content
    [/Replying]

instead of the previous flat string where all content was merged.
Greptile Summary
This PR adds structured quote/reply extraction for Microsoft Teams inbound messages. When a Teams user quotes a message, the quoted sender name and body are now parsed from the `text/html` attachment.
Confidence Score: 4/5
Last reviewed commit: bf744ac
```ts
function htmlToPlainText(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")
    .replace(/<\/p>/gi, "\n")
    .replace(/<[^>]*>/g, "")
    .replace(/&nbsp;/gi, " ")
    .replace(/&amp;/gi, "&")
    .replace(/&lt;/gi, "<")
    .replace(/&gt;/gi, ">")
    .replace(/&quot;/gi, '"')
    .replace(/&#39;/gi, "'")
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```
**Incomplete HTML entity decoding**
Path: extensions/msteams/src/inbound.ts (lines 67-81)

`htmlToPlainText` only decodes six entities (`&nbsp;`, `&amp;`, `&lt;`, `&gt;`, `&quot;`, and the apostrophe entity). Any other numeric or named HTML entity, e.g. `&#160;` (non-breaking space), `&#8217;` (right single quotation mark), `&mdash;`, or `&eacute;`, will appear as a raw literal string in the extracted `quotedSender` / `quotedBody` / `cleanBody`.

Teams can produce numeric entities in generated HTML (e.g. for curly quotes or special Unicode characters). Consider adding a generic numeric-entity fallback:
```ts
.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(Number(dec)))
```
or use a lightweight library like `he` to handle the full entity spectrum.
```ts
const agentBody = quoteInfo
  ? quoteInfo.cleanBody +
    `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
  : rawBody;
```
**Quote content duplicated in agent body when HTML lacks post-blockquote text**
Path: extensions/msteams/src/monitor-handler/message-handler.ts (lines 514-517)

When the `text/html` attachment contains only the `<blockquote>` and no trailing content (e.g. some Teams clients put the new message only in `activity.text`), `htmlToPlainText(afterBlockquote)` returns `""` and `cleanBody` falls back to the full `fallbackText`: the merged flat string that already contains the quoted sender name, the quoted body, and the actual message, all concatenated.

In that case `agentBody` becomes:
```
Bobsome quotemy actual message

[Replying to Bob]
some quote
[/Replying]
```
The `quotedBody` (`some quote`) appears twice: once buried inside `cleanBody` and again inside the `[Replying to …]` block. The agent cannot distinguish the user's actual message from the quoted portion inside `cleanBody`, so the annotation provides little value in this fallback path.

One approach: when `cleanBody === fallbackText`, skip emitting the `[Replying to …]` block (since the structured separation can't be guaranteed anyway), or at least avoid duplicating the annotation:
```ts
const agentBody = quoteInfo
  ? quoteInfo.cleanBody !== fallbackText
    ? quoteInfo.cleanBody +
      `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
    : rawBody // structured extraction failed; surface as-is
  : rawBody;
```
```ts
const blockquoteRe =
  /<blockquote[^>]*itemtype=["']http:\/\/schema\.skype\.com\/Reply["'][^>]*>([\s\S]*?)<\/blockquote>/i;
const bqMatch = blockquoteRe.exec(html);
if (!bqMatch) {
  // Try a more lenient blockquote match (some Teams clients omit schema attrs).
  const simpleBqRe = /<blockquote[^>]*>([\s\S]*?)<\/blockquote>/i;
  const simpleBqMatch = simpleBqRe.exec(html);
  if (!simpleBqMatch) {
    continue;
  }
  return parseBlockquoteContent(simpleBqMatch, html, fallbackText);
}
```
**Regex can mis-match on nested or malformed blockquotes**
Path: extensions/msteams/src/inbound.ts (lines 147-158)

The non-greedy `([\s\S]*?)<\/blockquote>` pattern stops at the **first** `</blockquote>` it encounters. If Teams ever nests a quoted reply inside another reply (e.g. a thread reply to a reply), or if the blockquote's inner content itself contains a literal `</blockquote>` string (unlikely but possible in forwarded HTML content), the captured group `bqMatch[1]` will be truncated and both `quotedSender` and `quotedBody` may be wrong or empty.

For the current Teams use case this is probably fine, but it is worth a brief code comment acknowledging the limitation so that future maintainers are not surprised:
```ts
// Note: non-greedy match stops at the first </blockquote>, so nested
// blockquotes are not supported. Teams does not currently produce them
// for quote/reply scenarios.
const blockquoteRe = ...
```
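The truncation described in this comment can be reproduced in isolation. The following is a minimal sketch, assuming only the lenient pattern quoted above; `firstBlockquoteInner` and the sample HTML strings are hypothetical and not part of the PR.

```ts
// Hypothetical helper for illustration only; it is not part of the PR.
// It applies the same non-greedy pattern as the lenient match above.
function firstBlockquoteInner(html: string): string | null {
  const re = /<blockquote[^>]*>([\s\S]*?)<\/blockquote>/i;
  const match = re.exec(html);
  return match ? match[1] : null;
}

// A flat quote: the capture group holds the whole quoted content.
const flat = "<blockquote><p>quoted</p></blockquote><p>reply</p>";

// A nested quote: the lazy match ends at the inner </blockquote>,
// so "<p>tail</p>" never reaches the capture group.
const nested =
  "<blockquote><p>outer</p><blockquote><p>inner</p></blockquote><p>tail</p></blockquote>";

const flatInner = firstBlockquoteInner(flat);
const nestedInner = firstBlockquoteInner(nested);

console.log(flatInner);   // <p>quoted</p>
console.log(nestedInner); // <p>outer</p><blockquote><p>inner</p>
```

Handling nesting correctly would likely need a real HTML parser or a depth counter rather than a single regex, which is why a code comment documenting the limitation may be the more proportionate fix here.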
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bf744ac425
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```ts
const simpleBqRe = /<blockquote[^>]*>([\s\S]*?)<\/blockquote>/i;
const simpleBqMatch = simpleBqRe.exec(html);
```
Avoid treating generic blockquotes as reply metadata
The fallback path matches any <blockquote> when the Skype reply schema is missing, so a normal user-authored quote block in a text/html attachment is reinterpreted as a reply and then rewritten downstream into [Replying to ...] context. This changes message meaning in conversations that use blockquote formatting without using Teams reply UI, because extractMSTeamsQuoteInfo will still return quote data from plain formatting HTML.
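To illustrate the distinction, here is a small sketch that accepts only blockquotes carrying the Skype reply schema. The regex is the strict pattern quoted earlier in this review; `isTeamsReplyHtml` and both sample strings are hypothetical, and the attribute layout of real Teams reply HTML is an assumption.

```ts
// Strict pattern from the PR diff: requires the schema.skype.com/Reply itemtype.
const REPLY_BLOCKQUOTE_RE =
  /<blockquote[^>]*itemtype=["']http:\/\/schema\.skype\.com\/Reply["'][^>]*>([\s\S]*?)<\/blockquote>/i;

// Hypothetical predicate (not in the PR): true only for reply-schema blockquotes.
function isTeamsReplyHtml(html: string): boolean {
  return REPLY_BLOCKQUOTE_RE.test(html);
}

// Assumed shape of a Teams reply attachment; real payloads may carry more attributes.
const replyHtml =
  '<blockquote itemscope itemtype="http://schema.skype.com/Reply"><p>quoted</p></blockquote><p>my reply</p>';

// A blockquote the user typed as ordinary formatting, with no reply schema.
const plainQuoteHtml =
  "<blockquote><p>a quote the user typed</p></blockquote><p>comment</p>";

console.log(isTeamsReplyHtml(replyHtml));      // true
console.log(isTeamsReplyHtml(plainQuoteHtml)); // false
```

With the lenient fallback removed, the second message would pass through unchanged instead of being rewritten into a `[Replying to ...]` block.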
```ts
const agentBody = quoteInfo
  ? quoteInfo.cleanBody +
    `\n\n[Replying to ${quoteInfo.quotedSender ?? "unknown"}]\n${quoteInfo.quotedBody ?? "(no text)"}\n[/Replying]`
  : rawBody;
```
Preserve attachment placeholder in quoted attachment-only messages
When quote info is present, BodyForAgent is always rebuilt from quoteInfo.cleanBody plus the reply annotation and never falls back to rawBody. For quote messages where text is empty and rawBody came from buildMSTeamsAttachmentPlaceholder (for example, attachment-only replies), this drops the placeholder and hides that media/doc content was sent, reducing the agent's available context.
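One way to combine this concern with the earlier duplication concern is sketched below. It is an illustration under stated assumptions, not the PR's implementation: `QuoteInfo` and `buildAgentBody` are hypothetical names, and the sketch assumes `rawBody` already holds the attachment placeholder when the message text is empty.

```ts
// Hypothetical shape; the PR's MSTeamsQuoteInfo type may differ.
interface QuoteInfo {
  cleanBody: string;
  quotedSender?: string;
  quotedBody?: string;
}

// Hypothetical helper, not taken from the PR.
function buildAgentBody(rawBody: string, fallbackText: string, quoteInfo?: QuoteInfo): string {
  if (!quoteInfo) return rawBody;

  const clean = quoteInfo.cleanBody.trim();

  // Duplication guard: cleanBody fell back to the merged flat string, so the
  // quote could not be separated from the reply. Surface rawBody unchanged.
  if (clean !== "" && clean === fallbackText.trim()) return rawBody;

  // Attachment-only reply: no text of its own, so keep rawBody
  // (assumed to be the attachment placeholder) and still annotate the quote.
  const body = clean === "" ? rawBody : clean;

  const sender = quoteInfo.quotedSender ?? "unknown";
  const quoted = quoteInfo.quotedBody ?? "(no text)";
  return `${body}\n\n[Replying to ${sender}]\n${quoted}\n[/Replying]`;
}

const annotated = buildAgentBody("Bobsome quotemy actual message", "Bobsome quotemy actual message", {
  cleanBody: "my actual message",
  quotedSender: "Bob",
  quotedBody: "some quote",
});
console.log(annotated);
```

The empty-body check runs after the duplication guard so that an attachment-only reply, where both `cleanBody` and `fallbackText` are empty, keeps its placeholder and still gets the annotation.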
- Add numeric entity decoding (`&#x...;` and `&#...;`) in `htmlToPlainText`
- Remove lenient blockquote fallback to avoid misinterpreting generic blockquotes as reply metadata (only match `schema.skype.com/Reply`)
- Add comment about nested blockquote limitation
- Guard against duplication when `cleanBody` falls back to raw text
- Preserve attachment placeholder for attachment-only replies
- Add test for numeric entity decoding
- Update simple blockquote test to verify it is now ignored
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 75ddf47325
```ts
.replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
.replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(Number(dec)))
```
Guard numeric entity decoding against invalid code points
`htmlToPlainText` decodes `&#x...;` / `&#...;` with `String.fromCodePoint(...)` directly, which throws `RangeError` for values above 0x10FFFF (for example `&#x110000;`). Because `extractMSTeamsQuoteInfo` is called on every inbound message before context finalization, a malformed or untrusted HTML attachment can make message handling fail for that turn instead of gracefully treating the entity as plain text.
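A guarded variant might look like the sketch below, which leaves out-of-range entities as literal text instead of throwing. `decodeCodePoint` and `decodeNumericEntities` are hypothetical names introduced for illustration; only `String.fromCodePoint` and the two regexes come from the code under review.

```ts
// Hypothetical helper: returns the decoded character, or the original entity
// text when the value is not a valid Unicode code point.
function decodeCodePoint(value: number, original: string): string {
  // String.fromCodePoint throws RangeError for non-integers and for values
  // outside 0..0x10FFFF, so those inputs are returned unchanged.
  if (!Number.isInteger(value) || value < 0 || value > 0x10ffff) return original;
  return String.fromCodePoint(value);
}

// Hypothetical wrapper around the two replacements suggested in the review.
function decodeNumericEntities(text: string): string {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (entity, hex: string) =>
      decodeCodePoint(parseInt(hex, 16), entity),
    )
    .replace(/&#([0-9]+);/g, (entity, dec: string) =>
      decodeCodePoint(Number(dec), entity),
    );
}

console.log(decodeNumericEntities("it&#8217;s"));       // it’s
console.log(decodeNumericEntities("bad &#x110000; x")); // bad &#x110000; x
```

Returning the original entity keeps the turn alive and makes the malformed input visible in the extracted text, which seems preferable to failing message handling on untrusted HTML.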
Closing this as implemented after Codex review. Current What I checked:
So I’m closing this as already implemented rather than keeping a duplicate issue open. Review notes: reviewed against 38caa6832d4e; fix evidence: release v2026.4.22, commit 00bd2cf7a376.
Problem
When a Teams user quotes/replies to a message, the quoted sender name and body are merged into `activity.text` as a flat string. The agent receives that merged string with no way to distinguish the quoted content from the actual message, or to identify who originally wrote the quoted part.

Solution
- `inbound.ts`: Add `extractMSTeamsQuoteInfo()` to parse Teams `<blockquote>` HTML from `text/html` attachments. Supports both the `schema.skype.com/Reply` format and simpler blockquote variants.
- `message-handler.ts`:
  - Populate `ReplyToSender` and `ReplyToBody` in the inbound context (consistent with the Telegram and WhatsApp implementations)
  - Format the agent body with a `[Replying to ...]` annotation block

After this change, the agent sees the actual message text followed by a separate `[Replying to <sender>]` ... `[/Replying]` block containing the quoted content.

Changes

| File | Change |
| --- | --- |
| `extensions/msteams/src/inbound.ts` | `extractMSTeamsQuoteInfo()`, `MSTeamsQuoteInfo` type, HTML parsing helpers |
| `extensions/msteams/src/inbound.test.ts` | 6 test cases covering various blockquote formats |
| `extensions/msteams/src/monitor-handler/message-handler.ts` | Populate `ReplyToSender`/`ReplyToBody`, format `[Replying to ...]` block |

Testing
`probe.test.ts` (unrelated to this change)