Skip to content

feat(skills): add memory-leak-debug skill for heap snapshot diagnosis#4468

Merged
huww98 merged 1 commit into
QwenLM:mainfrom
huww98:feat/memory-leak-debug-skill
May 23, 2026
Merged

feat(skills): add memory-leak-debug skill for heap snapshot diagnosis#4468
huww98 merged 1 commit into
QwenLM:mainfrom
huww98:feat/memory-leak-debug-skill

Conversation

@huww98

@huww98 huww98 commented May 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • What changed: Added a new .qwen/skills/memory-leak-debug/ skill that provides a step-by-step workflow for diagnosing memory leaks in the Qwen Code CLI using Node.js heap snapshots and the chrome-devtools CLI memory analysis tools.
  • Why it changed: Memory leaks (like the recent PerformanceMeasure leak from ink 7) require a repeatable diagnosis workflow. This skill codifies the process so future investigations can be done systematically with retained-size analysis, retainer chain tracing, and before/after comparison.
  • Reviewer focus: Workflow correctness and clarity of the SKILL.md instructions.

Validation

  • Commands run:
    # Tested both npm-run-dev and bundled paths end-to-end:
    bash .qwen/skills/tmux-real-user-testing/scripts/tmux-real-user-log.sh start memleak . \
      env QWEN_CODE_NO_RELAUNCH=true NODE_OPTIONS=--heapsnapshot-signal=SIGUSR2 npm run dev
    bash .qwen/skills/memory-leak-debug/scripts/find-leaf-node.sh <session>
    kill -USR2 <pid>
    chrome-devtools start --experimentalMemory --headless --no-usage-statistics
    chrome-devtools get_memory_snapshot_details /path/to/snapshot.heapsnapshot
  • Expected result: Heap snapshots are captured, chrome-devtools CLI analyzes them offline showing class aggregates with retained sizes.
  • Observed result: Successfully identified PerformanceMeasure leak (3,682 instances / 3,422 kB retained after a few prompts in dev mode; absent in bundled build confirming the tree-shake fix).
  • Quickest reviewer verification path: Run the skill workflow from Step 1 through Step 4 with any heapsnapshot file.

Scope / Risk

  • Main risk or tradeoff: Skill depends on globally-installed chrome-devtools-mcp package.
  • Not covered / not validated: Linux/Windows (tested on macOS only).
  • Breaking changes / migration notes: None — new files only.

Testing Matrix

🍏 🪟 🐧
npm run ⚠️ ⚠️
npx N/A N/A N/A
Docker N/A N/A N/A
Podman N/A N/A N/A
Seatbelt N/A N/A N/A

Testing matrix notes:

  • Skill is development tooling, not runtime. Tested on macOS only.

Linked Issues / Bugs

No linked issues.


🤖 Generated with Qwen Code

Provides a step-by-step workflow for diagnosing memory leaks in the CLI
using Node.js heap snapshots and the chrome-devtools CLI memory tools.
Includes a helper script for tmux PID discovery and a worked example
from the react-reconciler PerformanceMeasure leak (dbdc94b).
@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR adds a well-structured debugging skill for diagnosing memory leaks using heap snapshots and the chrome-devtools CLI. The documentation is clear, the workflow is logical, and the included example demonstrates real-world applicability (the ink 7 PerformanceMeasure leak). The shell script is simple and effective. Overall, this is a valuable addition to the debugging toolkit.

🔍 General Feedback

  • Strong documentation structure: The step-by-step workflow is clear, with appropriate prerequisites, notes, and cleanup instructions
  • Real-world validation: The worked example from the ink 7 upgrade provides credibility and a template for future investigations
  • Good integration: Leverages existing tmux-real-user-testing infrastructure appropriately
  • Practical focus: Emphasizes retained-size analysis over shallow-size, which is the correct approach for memory leak diagnosis
  • Platform limitation acknowledged: macOS-only testing is clearly stated, though Linux/Windows paths are marked as untested

🎯 Specific Feedback

🔵 Low

  • SKILL.md:line 23 — The eval pattern with export statements is correct, but the note about "shell environment does not persist across separate tool calls" seems specific to Qwen Code's agent context. Consider clarifying this is about tool call boundaries, not tmux pane boundaries.

  • SKILL.md:line 38 — The find-leaf-node.sh script walks the process tree to find the innermost node child. Consider adding a brief explanation of why this is necessary (e.g., "npm spawns a process chain; we need the actual CLI process for USR2 signals").

  • SKILL.md:line 45 — For production bundle profiling, you mention NODE_PID=$(tmux list-panes -t "<session-name>" -F '#{pane_pid}'). This assumes the pane PID is the node process. Consider adding a verification step or note that this works because node dist/cli.js is the direct command (no npm/tsx intermediary).

  • SKILL.md:line 80 — The get_memory_snapshot_details output format is described as CSV with specific columns. Consider noting that column order may vary by chrome-devtools version, or suggest using --format=csv explicitly if available.

  • SKILL.md:line 92 — The retainer chain command uses <nodeId> from the id field. Consider clarifying the distinction between uid (class aggregate identifier) and id (instance identifier) earlier, perhaps in a glossary or inline note.

  • find-leaf-node.sh:line 1 — The shebang uses #!/usr/bin/env bash, which is correct for portability. However, the script is marked executable (mode 100755), which is good. Consider adding a brief comment about the algorithm: "Walks down the process tree until no more node children exist."

  • find-leaf-node.sh:line 11 — The || true pattern prevents exit on failure, which is correct here. Consider adding a comment explaining this is intentional (we expect pgrep to return non-zero when no children exist).

  • react-reconciler-performance-measure-leak.md:line 4 — The symptom description mentions "300+ MB" heap growth. Consider adding the timeframe (e.g., "over ~25 minutes of moderate usage") to set expectations for leak detection sensitivity.

  • react-reconciler-performance-measure-leak.md:line 38 — The fix shows the esbuild define config. Consider linking to the actual commit dbdc94be9 as a reference, or noting that this is already applied (since the example is retrospective).

🟢 Medium

  • SKILL.md:line 17 — The prerequisite mentions installing chrome-devtools-mcp@latest -g. Consider adding a version constraint or minimum version if the memory tools require specific chrome-devtools features (e.g., --experimentalMemory flag availability).

  • SKILL.md:line 56 — The snapshot naming pattern Heap.<timestamp>.<pid>.<seq>.heapsnapshot is mentioned. Consider suggesting a renaming or organization strategy for multi-session investigations (e.g., "Consider prefixing with session name for clarity").

  • SKILL.md:line 100 — The "Common patterns" section lists four leak types. Consider adding a fifth: "Circular references preventing GC" — while less common in modern V8, it still occurs with certain native bindings or weak reference patterns.

  • react-reconciler-performance-measure-leak.md:line 26 — The retainer chain output description mentions (object elements)Array. Consider showing the actual chrome-devtools output format for clarity (e.g., a truncated example of the JSON or text output).

🟡 High

  • SKILL.md:line 23-25 — The eval pattern captures SESSION and OUTDIR, but the subsequent find-leaf-node.sh call requires <session-name> as a literal string replacement. This is a potential footgun: users might forget to substitute the actual session name. Consider adding an explicit example:

    # After eval, use the actual session name:
    NODE_PID=$(bash .qwen/skills/memory-leak-debug/scripts/find-leaf-node.sh "$SESSION")
  • SKILL.md:line 62-65 — The snapshot workflow shows three snapshots but doesn't explicitly state the expected outcome. Consider adding: "Compare maxRetainedSize for each class across snapshots — leaking classes will show monotonic growth."

✅ Highlights

  • Excellent worked example: The PerformanceMeasure leak case study is thorough, showing symptom → diagnosis → root cause → fix → verification. This is exactly the kind of documentation that helps future investigators.

  • Clear prerequisite chain: The skill explicitly states Node.js 22+ requirement for --heapsnapshot-signal, and provides installation instructions for chrome-devtools.

  • Smart use of tmux: The tmux-based approach allows interactive debugging while capturing snapshots — much more practical than automated scripting for exploratory leak hunting.

  • Retainer chain emphasis: The skill correctly prioritizes retained-size analysis and retainer-chain tracing, which are the key techniques for finding root causes (not just symptoms).

  • Verification step included: Step 6 (Verify Fix) closes the loop, ensuring the skill supports the full debugging lifecycle, not just diagnosis.

  • find-leaf-node.sh is elegant: The process-tree-walking algorithm is simple, robust, and handles arbitrary nesting depth. The || true pattern is correctly used to handle the "no children" termination case.

@pomelo-nwu pomelo-nwu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: LGTM, approve to merge

The motivation is well-founded — memory leak diagnosis is a core operational capability for a long-running CLI. Codifying the workflow while the ink 7 PerformanceMeasure leak investigation is still fresh is the right engineering culture.

The worked example is high quality: a complete five-phase closed loop (symptom → diagnosis → root cause → fix → verification) referencing a specific commit dbdc94be9, which shows the author actually walked through the full process. find-leaf-node.sh is clean and effective. Integration with tmux-real-user-testing is natural.

Two non-blocking suggestions — can follow up after merge:

  • Pin chrome-devtools-mcp version: Prerequisites say npm i chrome-devtools-mcp@latest -g. @latest has no version pinning; a breaking change upstream could silently break this skill in a few months. Consider pinning to a verified minimum version (e.g. >=0.x.y).
  • Unify $SESSION variable usage in examples: In Step 1, after eval captures $SESSION, subsequent examples use the literal "<session-name>" placeholder instead of $SESSION. This can trip up both agents and humans. Consider using $SESSION consistently.
中文说明

整体判断:认可,可以 merge

动机完全正当——内存泄漏诊断是长期运行 CLI 的核心运维能力,趁 ink 7 PerformanceMeasure 泄漏事件的热度把经验固化下来,是正确的工程文化。

Worked example 质量很高:symptom → diagnosis → root cause → fix → verification 五段完整闭环,引用了具体 commit dbdc94be9,说明作者确实走过一遍。find-leaf-node.sh 简洁有效,与 tmux-real-user-testing 的集成也很自然。

两个 non-blocking 建议,merge 后跟进也行:

  • chrome-devtools-mcp 版本锁定:Prerequisites 里写的是 npm i chrome-devtools-mcp@latest -g@latest 没有版本锁定,三个月后可能 breaking change。建议锁定一个已验证的最低版本号(如 >=0.x.y),避免 Skill 静默腐化。
  • $SESSION 变量示例统一:Step 1 里 eval 拿到 $SESSION 后,后续示例用的是 "<session-name>" 字面量而不是 $SESSION,容易让 agent 或人卡住。建议统一改成 $SESSION

— Qwen Code

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — clean skill addition, no runtime impact. The worked example and script are solid.

@huww98

huww98 commented May 23, 2026

Copy link
Copy Markdown
Collaborator Author

It is intentionally written as <session-name>, since shell variable does not suvive tool call boundary. All agents I know have this limitation.

@huww98 huww98 merged commit 84f4080 into QwenLM:main May 23, 2026
11 checks passed
doudouOUC added a commit that referenced this pull request May 25, 2026
…4500)

Pulls 5 main commits since #4469 (2026-05-24):
- #4464 fix(weixin): send decryptable image payloads
- #4465 fix(weixin): allow Windows image paths inside workspace
- #4470 fix(cli): resolve stale closure race in text buffer submit handler
- #4468 feat(skills): add memory-leak-debug skill for heap snapshot diagnosis
- #4288 feat(cli): do not append trailing space for directory completions (#4092)

11 manual conflicts resolved + 2 add/add conflicts taken from main wholesale:

Manual UU (12, all daemon-side preferred except text-buffer.ts):
- packages/acp-bridge/package.json — kept HEAD's fuller description (F1 lift expanded the package surface; main has stale pre-F1 wording).
- packages/cli/src/acp-integration/acpAgent.ts — kept HEAD's WorkspaceMcpBudget import (F2 needs it).
- packages/cli/src/acp-integration/acpAgent.worktree.test.ts (AA): kept HEAD's superset of mocks
  (MCP_BUDGET_WARN_FRACTION, getMCPDiscoveryState, MCPServerStatus, McpTransportPool, WorkspaceMcpBudget, workspace/debug/mcp config mocks). HEAD already includes
  main-side SessionStartSource + SessionEndReason mocks.
- packages/cli/src/ui/commands/directoryCommand.tsx — pure formatting (HEAD wrapped vs main inline). Kept HEAD.
- packages/cli/src/ui/commands/directoryCommand.test.tsx — pure formatting. Kept HEAD.
- packages/cli/src/ui/commands/skillsCommand.ts — pure formatting. Kept HEAD.
- packages/cli/src/ui/hooks/useCommandCompletion.tsx — pure formatting. Kept HEAD.
- packages/cli/src/ui/hooks/useCommandCompletion.test.ts — pure formatting. Kept HEAD.
- packages/cli/src/ui/hooks/useSlashCompletion.test.ts — pure formatting. Kept HEAD.
- packages/core/src/config/config.test.ts — kept HEAD's TrustGateError import (daemon-added).

text-buffer.ts (4 zones — took MAIN wholesale for #4470's stale-closure fix):
- Import: useRef instead of useReducer (daemon side had useReducer as a dead import — file uses dispatch via useCallback, not useReducer; verified via grep). useRef is needed for stateRef + #4470's currentText capture.
- writeFileSync zone: use stateRef.current.lines.join('\n') instead of stale closure-captured `text`. Fixes #4470's bug.
- text comparison: `newText !== currentText` not `newText !== text`.
- dep array: `[dispatch, ...]` not `[text, ...]` (callback reads from ref now, doesn't need to re-bind on text change).

AA (2, main wholesale via git checkout --theirs):
- packages/core/src/permissions/dangerousRules.ts + .test.ts
  Original #4151 Auto-mode added these on main, came into daemon via #4469 squash.
  Main then landed #4371 ("strip additional dangerous interpreter rules") as a follow-up
  that daemon side never saw. Take main's evolved version wholesale.

Verification:
- packages/core tsc: 50 errors PRE-merge, 50 errors POST-merge (pre-existing baseline — none introduced by this sync).
- packages/acp-bridge tsc: clean.
- 5 spot-test runs on conflict-resolved files: 132 + 17 + 24 + 30 + 1 = 204 tests pass (text-buffer / directoryCommand / useCommandCompletion / useSlashCompletion / skillsCommand).

Mirrors #4469's pattern (squash merge daemon_mode_b_main-side). Unblocks
#4490 daemon_mode_b_main → main reverse integration merge (currently
CONFLICTING precisely because of these 5 main commits).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants