Refactor/base layer message split#436
Conversation
- Move FindLast, findLast, and SplitMessage from discord.go to pkg/utils/message.go - Update discord.go to use utils.SplitMessage() - Makes splitting logic reusable across other channels
- Accept hard upper limit (maxLen) instead of pre-subtracted value - Caller now passes actual platform limit (e.g., 2000 for Discord) - Internal buffer of 500 chars is handled within message.go - Preferred split at maxLen - 500, may extend to maxLen for code blocks - Never exceeds maxLen, no more mental math for callers
There was a problem hiding this comment.
Pull request overview
This PR moves Discord message chunking/splitting logic into a reusable pkg/utils helper so other channel adapters can share the same behavior.
Changes:
- Added
utils.SplitMessage(and related helpers) to split long outbound messages while trying to preserve fenced code blocks. - Updated the Discord channel to use
utils.SplitMessageand removed the now-inlined Discord-only implementation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/utils/message.go | Introduces shared message-splitting utility intended for reuse across channels. |
| pkg/channels/discord.go | Replaces local Discord splitting logic with the shared utility. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // SplitMessage splits long messages into chunks, preserving code block integrity. | ||
| // The function prefers to split at maxLen - defaultCodeBlockBuffer to leave room for code blocks, | ||
| // but may extend up to maxLen when needed to avoid breaking incomplete code blocks. | ||
| // Please refer to pkg/channels/discord.go for usage. | ||
| func SplitMessage(content string, maxLen int) []string { | ||
| var messages []string | ||
| codeBlockBuffer := defaultCodeBlockBuffer | ||
|
|
||
| for len(content) > 0 { | ||
| if len(content) <= maxLen { | ||
| messages = append(messages, content) | ||
| break | ||
| } | ||
|
|
||
| // Effective split point: maxLen minus buffer, to leave room for code blocks | ||
| effectiveLimit := maxLen - codeBlockBuffer | ||
| if effectiveLimit < maxLen/2 { | ||
| effectiveLimit = maxLen / 2 | ||
| } | ||
|
|
||
| // Find natural split point within the effective limit | ||
| msgEnd := FindLastNewline(content[:effectiveLimit], 200) | ||
| if msgEnd <= 0 { | ||
| msgEnd = FindLastSpace(content[:effectiveLimit], 100) | ||
| } | ||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } | ||
|
|
||
| // Check if this would end with an incomplete code block | ||
| candidate := content[:msgEnd] | ||
| unclosedIdx := FindLastUnclosedCodeBlock(candidate) | ||
|
|
||
| if unclosedIdx >= 0 { | ||
| // Message would end with incomplete code block | ||
| // Try to extend up to maxLen to include the closing ``` | ||
| if len(content) > msgEnd { | ||
| closingIdx := FindNextClosingCodeBlock(content, msgEnd) | ||
| if closingIdx > 0 && closingIdx <= maxLen { | ||
| // Extend to include the closing ``` | ||
| msgEnd = closingIdx | ||
| } else { | ||
| // Can't find closing within maxLen, split before the code block | ||
| msgEnd = FindLastNewline(content[:unclosedIdx], 200) | ||
| if msgEnd <= 0 { | ||
| msgEnd = FindLastSpace(content[:unclosedIdx], 100) | ||
| } | ||
| if msgEnd <= 0 { | ||
| msgEnd = unclosedIdx | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } | ||
|
|
||
| messages = append(messages, content[:msgEnd]) | ||
| content = strings.TrimSpace(content[msgEnd:]) | ||
| } | ||
|
|
||
| return messages | ||
| } |
There was a problem hiding this comment.
SplitMessage introduces non-trivial behavior (natural boundaries, code block fence handling, and the 500-char buffer) but there are no unit tests verifying chunk sizes, code-block integrity, and edge cases (e.g., long code blocks without closing fences, no whitespace, Unicode input). Given the repo has existing Go unit tests, please add focused tests in pkg/utils for this helper.
pkg/utils/message.go
Outdated
| for len(content) > 0 { | ||
| if len(content) <= maxLen { | ||
| messages = append(messages, content) | ||
| break | ||
| } | ||
|
|
||
| // Effective split point: maxLen minus buffer, to leave room for code blocks | ||
| effectiveLimit := maxLen - codeBlockBuffer | ||
| if effectiveLimit < maxLen/2 { | ||
| effectiveLimit = maxLen / 2 | ||
| } | ||
|
|
||
| // Find natural split point within the effective limit | ||
| msgEnd := FindLastNewline(content[:effectiveLimit], 200) | ||
| if msgEnd <= 0 { | ||
| msgEnd = FindLastSpace(content[:effectiveLimit], 100) | ||
| } | ||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } | ||
|
|
||
| // Check if this would end with an incomplete code block | ||
| candidate := content[:msgEnd] | ||
| unclosedIdx := FindLastUnclosedCodeBlock(candidate) |
There was a problem hiding this comment.
SplitMessage uses byte-based len() and byte indices when slicing (e.g., content[:msgEnd]). If msgEnd lands in the middle of a multi-byte UTF-8 rune, this will split the string into invalid UTF-8 and can produce garbled output or API rejections. Consider operating on []rune (or otherwise ensuring msgEnd aligns to rune boundaries) so chunk boundaries are Unicode-safe and the maxLen semantics align with character-based limits.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Find natural split point within the effective limit | ||
| msgEnd := findLastNewline(content[:effectiveLimit], 200) | ||
| if msgEnd <= 0 { | ||
| msgEnd = findLastSpace(content[:effectiveLimit], 100) | ||
| } | ||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } |
There was a problem hiding this comment.
This function computes split points using len(content) and slices strings by byte index (e.g., content[:effectiveLimit] / msgEnd = effectiveLimit). When there’s no newline/space, msgEnd can land in the middle of a multi-byte UTF-8 rune, producing invalid UTF-8 chunks and potentially garbling output. Consider doing all length accounting/splitting in runes (or adjust msgEnd back to a valid rune boundary using utf8 helpers).
| messages = append(messages, strings.TrimRight(content[:msgEnd], " \t\n\r")+"\n```") | ||
| content = strings.TrimSpace(header + "\n" + content[msgEnd:]) | ||
| continue |
There was a problem hiding this comment.
When splitting inside a code block, content = strings.TrimSpace(header + "\n" + content[msgEnd:]) removes all leading whitespace from the reopened chunk. If the code block content starts with indentation (common in code), this changes the message content rather than just splitting it. Consider avoiding TrimSpace here (e.g., only trimming leading newlines) to preserve code formatting.
| msgEnd = maxLen - 5 | ||
| messages = append(messages, strings.TrimRight(content[:msgEnd], " \t\n\r")+"\n```") | ||
| content = strings.TrimSpace(header + "\n" + content[msgEnd:]) | ||
| continue |
There was a problem hiding this comment.
Same as above: strings.TrimSpace(header + "\n" + content[msgEnd:]) in this branch can strip meaningful leading indentation/whitespace from code content after the reopened fence, altering output. Prefer preserving the original whitespace (or only trimming the newline(s) introduced by the split).
| headerEnd := strings.Index(content[unclosedIdx:], "\n") | ||
| if headerEnd == -1 { | ||
| headerEnd = unclosedIdx + 3 | ||
| } else { | ||
| headerEnd += unclosedIdx | ||
| } | ||
| header := strings.TrimSpace(content[unclosedIdx:headerEnd]) | ||
|
|
There was a problem hiding this comment.
If the opening fence line doesn’t contain a newline (e.g., lang immediately followed by content), the `headerEnd == -1` branch sets `headerEnd = unclosedIdx + 3`, so `header` becomes just "" and the language tag is lost when reopening. Consider falling back to the next whitespace/end-of-string (or msgEnd) so the language specifier is preserved when present.
pkg/utils/message.go
Outdated
| // SplitMessage splits long messages into chunks, preserving code block integrity. | ||
| // The function prefers to split at maxLen - defaultCodeBlockBuffer to leave room for code blocks, | ||
| // but may extend up to maxLen when needed to avoid breaking incomplete code blocks. | ||
| // Please refer to pkg/channels/discord.go for usage. | ||
| func SplitMessage(content string, maxLen int) []string { | ||
| var messages []string | ||
| codeBlockBuffer := defaultCodeBlockBuffer |
There was a problem hiding this comment.
SplitMessage introduces fairly intricate splitting behavior (code-fence detection, fence injection/reopening, buffer logic). Adding unit tests (e.g., plain text, long code blocks, missing closing fence, and Unicode/multi-byte input) would help ensure this shared utility stays stable as other channels start using it.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if !strings.HasSuffix(chunks[0], "\n```") { | ||
| t.Error("First chunk should end with injected closing fence") | ||
| } | ||
| // Check that second chunk starts with execution header |
There was a problem hiding this comment.
The comment says "Check that second chunk starts with execution header" but should say "Check that second chunk starts with code block header" for consistency with the error message on line 76.
| // Check that second chunk starts with execution header | |
| // Check that second chunk starts with code block header |
| messages = append(messages, content[:msgEnd]) | ||
| content = strings.TrimSpace(content[msgEnd:]) |
There was a problem hiding this comment.
The SplitMessage function uses byte-based indexing (string slicing) which can split multi-byte UTF-8 characters, potentially creating invalid UTF-8 sequences. When slicing at positions found by findLastNewline, findLastSpace, or at arbitrary positions like effectiveLimit or innerLimit, the function may split in the middle of a multi-byte character. Consider using rune-based indexing or validating split points to ensure they fall on character boundaries. The test case "Preserve Unicode characters" acknowledges this issue but doesn't verify the output is valid UTF-8.
|
|
||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } |
There was a problem hiding this comment.
There is a potential infinite loop when msgEnd is set to 0. This can occur on line 95 when unclosedIdx is 0 (code block starts at the beginning), or potentially from other code paths. When msgEnd is 0, line 112 performs content = strings.TrimSpace(content[msgEnd:]) which doesn't reduce the content length (it's just trimming the same string), causing an infinite loop. The guard on lines 107-109 sets msgEnd to effectiveLimit only if msgEnd <= 0, but this happens AFTER the code block handling logic which may have set msgEnd to 0 and then taken a continue path earlier. Consider adding validation that msgEnd must be > 0 before line 111, or ensure line 95 sets msgEnd to at least 1.
| } | |
| } | |
| // Ensure msgEnd is a valid, positive index so the loop always makes progress. | |
| if msgEnd <= 0 { | |
| msgEnd = 1 | |
| } else if msgEnd > len(content) { | |
| msgEnd = len(content) | |
| } |
| checkContent: func(t *testing.T, chunks []string) { | ||
| // Check that first chunk ends with closing fence | ||
| if !strings.HasSuffix(chunks[0], "\n```") { | ||
| t.Error("First chunk should end with injected closing fence") | ||
| } | ||
| // Check that second chunk starts with execution header | ||
| if !strings.HasPrefix(chunks[1], "```go") { | ||
| t.Error("Second chunk should start with injected code block header") | ||
| } | ||
| }, |
There was a problem hiding this comment.
The test cases only verify that specific chunks don't exceed maxLen, but don't comprehensively check all chunks. For example, in the "Simple split regular text" test (lines 36-39), only chunk 0 is checked, not chunk 1. In the "Long code block split" test (lines 69-78), neither chunk is explicitly checked for length. Consider adding a check that iterates through all chunks and verifies each one is <= maxLen to ensure the function never produces oversized chunks.
| checkContent: func(t *testing.T, chunks []string) { | ||
| // Just verify we didn't panic and got valid strings. | ||
| // Go strings are UTF-8, if we split mid-rune it would be bad, | ||
| // but standard slicing might do that. | ||
| // Let's assume standard behavior is acceptable or check if it produces invalid rune? | ||
| if !strings.Contains(chunks[0], "\u4e16") { | ||
| t.Error("Chunk should contain unicode characters") | ||
| } | ||
| }, |
There was a problem hiding this comment.
The test case acknowledges that splitting mid-rune (multi-byte UTF-8 character) would be bad, but doesn't verify that this doesn't happen. The test only checks that the chunk contains at least one instance of the Unicode character, but doesn't validate that all chunks contain valid UTF-8. Consider adding validation using utf8.ValidString() to ensure all chunks are valid UTF-8 sequences.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| runes := []rune(msg.Content) | ||
| if len(runes) == 0 { | ||
| return nil | ||
| } | ||
|
|
||
| chunks := splitMessage(msg.Content, 1500) // Discord has a limit of 2000 characters per message, leave 500 for natural split e.g. code blocks | ||
| chunks := utils.SplitMessage(msg.Content, 2000) // Split messages into chunks, Discord length limit: 2000 chars |
There was a problem hiding this comment.
Send converts the full message to []rune only to check emptiness (len(runes) == 0), which allocates and can be expensive for long messages. This can be replaced with a simple msg.Content == "" (or len(msg.Content)==0) check without changing behavior.
| // Find natural split point within the effective limit | ||
| msgEnd := findLastNewline(content[:effectiveLimit], 200) | ||
| if msgEnd <= 0 { | ||
| msgEnd = findLastSpace(content[:effectiveLimit], 100) | ||
| } | ||
| if msgEnd <= 0 { | ||
| msgEnd = effectiveLimit | ||
| } |
There was a problem hiding this comment.
SplitMessage computes split points using byte indexes (len, content[:effectiveLimit], content[:msgEnd]). When there are no ASCII split characters near the boundary (e.g., long CJK/emoji text), msgEnd can land in the middle of a UTF-8 sequence, producing invalid UTF-8 in a chunk (and JSON encoding will replace invalid bytes). Consider adjusting msgEnd to the nearest prior UTF-8 rune boundary (or operating on []rune) before slicing.
…plit Refactor/base layer message split from sipeed#143
📝 Description
This PR extracts the Discord message splitting logic from pkg/channels/discord.go into the reusable base-layer utility in pkg/utils/message.go. This enables other channels (Telegram, WhatsApp, etc.) to share the same message splitting functionality. For other channel usage: please import pkg/utils folder (for most channels this is already imported) and use
chunks := utils.SplitMessage(msg.Content, <lenth limit>). You can refer to pkg/channels/discord.go and #251 for details on handling message chunks.This PR also fixes the edge case(Long code blocks) mentioned in #143 .
🗣️ Type of Change
🤖 AI Code Generation
🔗 Related Issue
Refactor from #143 , related #251 , related discussion #244 and #285
📚 Technical Context (Skip for Docs)
🧪 Test Environment
📸 Evidence (Optional)
Click to view Logs/Screenshots
☑️ Checklist