feat: add commit attribution with per-file AI contribution tracking#3115
Conversation
…ia git notes Track character-level AI vs human contributions per file and store detailed attribution metadata as git notes (refs/notes/ai-attribution) after each successful git commit. This enables open-source AI disclosure and enterprise compliance audits without polluting commit messages.
📋 Review SummaryThis PR introduces a commit attribution system that tracks character-level AI contributions per file and stores them as git notes after successful commits. The implementation is well-structured with comprehensive test coverage (21 new tests), follows existing code conventions, and provides a non-blocking mechanism for AI contribution tracking. The approach of using git notes keeps commit messages clean while enabling compliance audits. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
…ted file exclusion
- Replace line-based diff with a prefix/suffix character-level algorithm
for precise contribution calculation (e.g. "Esc"→"esc" = 1 char, not whole line)
- Compute real AI vs human contribution percentages at commit time by analyzing
git diff --stat output: humanChars = max(0, diffSize - trackedAiChars)
- Add generated file exclusion (lock files, dist/, .min.js, .d.ts, etc.)
ported from an existing generatedFiles.ts
- Add file deletion tracking via recordDeletion()
- Update git notes payload format: {aiChars, humanChars, percent} per file
with real percentages instead of hardcoded 100%
… PR attribution Align with the full attribution feature set: - Surface tracking: read QWEN_CODE_ENTRYPOINT env var (cli/ide/api/sdk), include surfaceBreakdown in git notes payload - Prompt counting: incrementPromptCount() hooked into client.ts message loop, tracks promptCount/permissionPromptCount/escapeCount - Session persistence: toSnapshot()/restoreFromSnapshot() for serializing attribution state; ChatRecordingService.recordAttributionSnapshot() writes to session JSONL; client.ts restores on session resume - PR attribution: addAttributionToPR() in shell.ts detects `gh pr create` and appends "🤖 Generated with Qwen Code (N-shotted by Qwen-Coder)" - Session baseline: saves content hash on first AI edit of each file for precise human/AI contribution detection - generatePRAttribution() method for programmatic access
…iled commit counter preservation - Handle initial commit (no HEAD~1) by detecting parent with rev-parse and falling back to --root for first commit in repo - Exclude Cron-triggered messages from promptCount (not user-initiated) - Add commitSucceeded parameter to clearAttributions() so failed/disabled commits don't reset the prompts-since-last-commit counter - Add test for clearAttributions(false) behavior
- Normalize path.relative() to forward slashes for Windows compatibility - Use diff-tree --root for initial commits (git diff --root is invalid) - Replace String.replace() with indexOf+slice to avoid $& special patterns - Fix clearAttributions(false→true) when co-author disabled but commit succeeded - Use real newlines instead of literal \n in PR attribution text - Add surface fallback in restoreFromSnapshot for version compatibility - Fix single-quote regex to not assume bash supports \' escaping - Case-insensitive directory matching in generated file detection - Handle renamed file brace notation in parseDiffStat
# Conflicts: # packages/core/src/core/client.ts
# Conflicts: # packages/core/src/core/client.ts # packages/core/src/services/chatRecordingService.ts
74d152d to
180daa1
Compare
…ool edits Previously, recordAttributionSnapshot() only ran at the start of UserQuery and Cron turns — before the tools for that turn had executed. A session that wrote a file in turn 1 and committed in turn 2 (across process boundaries via --resume) lost the tracked edit: the last persisted snapshot was the turn-1-start snapshot (empty fileStates), so on resume the attribution service restored empty state and no git notes were attached to the commit. Move the snapshot call out of the UserQuery/Cron conditional and run it on every non-Retry turn. ToolResult turns are scheduled right after tools execute, so their start-of-turn snapshot now captures any edits those tools made. Retry turns are skipped since the state is unchanged from the prior turn. Added unit tests asserting the snapshot fires for ToolResult/UserQuery turns and skips Retry turns. Verified end-to-end in a scratch repo: write-file in turn 1 (no commit) → exit → --resume → commit in turn 2 → git notes now contain the recorded file with correct aiChars and promptCount: 2.
Collapse the two back-to-back messageType !== Retry blocks in sendMessageStream into one, and refresh chatRecordingService's recordAttributionSnapshot doc comment to reflect that snapshots fire on every non-retry turn (not just after user prompts).
# Conflicts: # packages/core/src/services/chatRecordingService.ts
…oggles
Matches the shape used upstream in Claude Code's `attribution.{commit,pr}`
so users can disable the PR body line without losing the commit-message
Co-authored-by trailer (or vice versa). The previous boolean forced both
to move together, which conflated two different surfaces.
- settingsSchema: gitCoAuthor becomes an object with nested commit/pr
booleans, each `showInDialog: true` so both appear in /settings.
- Config constructor accepts legacy boolean (coerced to { commit: v, pr: v })
so stored preferences from the pre-split schema carry over.
- shell.ts: attachCommitAttribution and addCoAuthorToGitCommit read .commit;
addAttributionToPR reads .pr.
Legacy gitCoAuthor was a single boolean and shipped ~4 months ago; the
previous commit split it into { commit, pr } sub-toggles. Without a
migration, users who had set gitCoAuthor: false would see the settings
dialog show the default (true) for both sub-toggles — misleading and
likely to flip their preference on the next save because getNestedValue
returns undefined when asked for .commit on a boolean.
- New v3-to-v4 migration expands boolean → { commit: v, pr: v },
preserves already-object values, resets invalid values to {} with a
warning.
- SETTINGS_VERSION bumped 3 → 4; existing integration assertions use the
constant so the next bump is a single-line change.
- Regenerate vscode-ide-companion settings.schema.json to reflect the
new nested shape.
- Docs: split the single gitCoAuthor row into .commit and .pr.
The migration already treats any non-boolean, non-object value as invalid
(reset to {} with warning), but the existing test only exercised the
string "yes" branch. Add parameterized cases for null, array, and number
so a future regression that accepts these in the valid bucket gets caught.
Also cover partial objects — the migration must not paternalistically
fill defaults; that responsibility lives in normalizeGitCoAuthor at the
Config boundary.
# Conflicts: # packages/core/src/services/chatRecordingService.ts # packages/core/src/tools/shell.ts
Two critical issues called out in review: 1. attachCommitAttribution treated the final shell exit code as proof that `git commit` itself failed. For compound commands like `git commit -m "x" && npm test`, the commit can succeed and a later step can fail; the previous code then cleared attribution without writing the git note. Now we snapshot HEAD before the command (via `git rev-parse HEAD` through child_process.execFile, kept independent of the mockable ShellExecutionService) and detect commit creation by HEAD movement, so attribution lands whenever a new commit was created regardless of later steps. 2. addAttributionToPR spliced the configured generator name into the user-approved `gh pr create --body "..."` argument verbatim. A name containing `"`, `$`, a backtick, or `'` could break the command or be evaluated as command substitution. Now we shell-escape the appended text per the surrounding quote style before splicing. Tests cover the new escape paths for both double- and single-quoted bodies, including a generator name designed to break interpolation (`$(rm -rf /) "danger" \`eval\``) and one with an apostrophe.
# Conflicts: # packages/cli/src/ui/commands/tasksCommand.ts # packages/core/src/tools/edit.ts # packages/core/src/tools/shell.ts # packages/core/src/tools/write-file.ts
There was a problem hiding this comment.
Pull request overview
This PR introduces structured AI contribution attribution for commits (via git notes) and PR descriptions (via gh pr create body modification), along with settings/migrations and session persistence to keep attribution state across resumes.
Changes:
- Add a singleton
CommitAttributionServiceto track per-file AI edits and generate git-notes JSON payloads. - Extend
ShellToolto (a) attach git notes aftergit commitand (b) append attribution togh pr create --body .... - Split
general.gitCoAuthorinto{ commit, pr }with CLI migration + docs/schema updates.
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/vscode-ide-companion/schemas/settings.schema.json | Updates VS Code settings schema to split commit vs PR attribution toggles. |
| packages/core/src/tools/write-file.ts | Records AI-originated writes into attribution tracking. |
| packages/core/src/tools/shell.ts | Detects commits to attach git notes; appends PR attribution to gh pr create; adds helpers for quoting and diff parsing. |
| packages/core/src/tools/shell.test.ts | Updates config shape in tests; adds PR attribution tests; resets attribution singleton between tests. |
| packages/core/src/tools/edit.ts | Records AI-originated edits into attribution tracking. |
| packages/core/src/services/generatedFiles.ts | Adds generated/vendored file detection for attribution exclusion. |
| packages/core/src/services/generatedFiles.test.ts | Unit tests for generated-file detection rules. |
| packages/core/src/services/commitAttribution.ts | Implements attribution tracking, snapshot/restore, note payload generation, prompt counters, and model-name sanitization. |
| packages/core/src/services/commitAttribution.test.ts | Unit tests for attribution service behavior and payload generation. |
| packages/core/src/services/chatRecordingService.ts | Adds a new chat record subtype + recorder method for attribution snapshots. |
| packages/core/src/services/attributionTrailer.ts | Builds git-notes command strings and enforces a max payload size. |
| packages/core/src/services/attributionTrailer.test.ts | Unit tests for git-notes command generation and formatting. |
| packages/core/src/core/client.ts | Increments prompt counters and records/restores attribution snapshots for session resume. |
| packages/core/src/core/client.test.ts | Tests snapshot recording on ToolResult/UserQuery turns and skipping on Retry. |
| packages/core/src/config/config.ts | Splits gitCoAuthor into {commit, pr} and normalizes legacy boolean shape. |
| packages/core/src/config/config.test.ts | Tests defaults and legacy boolean normalization for the new config shape. |
| packages/cli/src/utils/settingsUtils.ts | Updates settings dialog ordering to surface the two new toggles. |
| packages/cli/src/ui/components/subagents/runtime/AgentExecutionDisplay.test.tsx | Formatting-only change (single-line expectation). |
| packages/cli/src/config/settingsSchema.ts | Updates CLI settings schema to define gitCoAuthor as an object with commit/pr. |
| packages/cli/src/config/settings.ts | Bumps settings version from 3 → 4. |
| packages/cli/src/config/migration/versions/v3-to-v4.ts | Adds migration expanding legacy boolean general.gitCoAuthor into {commit, pr}. |
| packages/cli/src/config/migration/versions/v3-to-v4.test.ts | Unit tests for the new V3→V4 migration behavior. |
| packages/cli/src/config/migration/index.ts | Registers the new migration in the migration chain. |
| packages/cli/src/config/migration/index.test.ts | Updates integration expectations to the new current settings version. |
| integration-tests/terminal-capture/subagent-flicker-regression.ts | Formatting-only import compaction. |
| integration-tests/cli/settings-migration.test.ts | Updates expected migrated settings version to 4. |
| docs/users/configuration/settings.md | Documents the new general.gitCoAuthor.commit and .pr settings keys. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Six items called out on PR #3115 by Copilot: - shell.ts: addAttributionToPR's bash quote escaping doesn't apply to cmd.exe / PowerShell, where `\$` and `'\''` aren't honored. Skip the PR body rewrite entirely on Windows — losing PR attribution there is preferable to corrupting the user-approved `gh pr create` command. - attributionTrailer.ts + shell.ts call site: buildGitNotesCommand used bash-style single-quote escaping on the JSON note, which is broken on Windows. Switched to argv form (`{ command, args }`) and routed the invocation through child_process.execFile so shell quoting is bypassed entirely. Tests updated to assert the argv shape. - commitAttribution.ts: when a tracked file's aiChars exceeded the diff --stat-derived diffSize (long-line edits where diffSize ≈ lines * 40), humanChars clamped to 0 but aiChars stayed inflated, leaving aiChars + humanChars > the committed change magnitude. Clamp aiChars to diffSize so the totals stay consistent. - shell.ts parseDiffStat: only normalized rename brace notation (`{old => new}`). Cross-directory renames emit `old/path => new/path` without braces, leaving diffSizes keyed by the full string. Added a second normalization step. - shell.ts: addAttributionToPR docstring claimed `(X% N-shotted)` but the implementation only emits `(N-shotted by Generator)`. Updated the docstring to match the actual behavior. - settingsSchema.ts + generator: gitCoAuthor went from boolean to object in the V4 migration. The exported JSON Schema now wraps the field in `anyOf: [boolean, object]` (via a new `legacyTypes` hint on SettingDefinition) so users with a stored boolean don't see a spurious IDE warning before their next launch runs the migration.
…ation, sudo/env shifts, dir-stack gpt-5.5 review (issue 4389405179): 1. realpathOrSelf falls back to the non-canonical input when the leaf doesn't exist (deleted file). recordEdit stored the entry under the canonical path; lookup post-deletion misses on macOS where /var ↔ /private/var. Canonicalise the parent and rejoin the basename for missing leaves so deleted-file getFileAttribution still resolves the canonical key. Test updated to assert the lookup-after-unlink path explicitly. 2. validateOnDiskHashes read the LIVE working-tree, so a user who `git add`'d AI's content and then made additional unstaged edits would have the entry dropped on a commit whose blob still matched AI's hash. Replace with `validateAgainst(getContent)` that takes a caller-supplied reader; attachCommitAttribution now passes a reader that fetches the COMMITTED blob via `git show HEAD:<rel>`. Working-tree validation kept as `validateAgainstWorkingTree` for code paths without a committed ref. Returns null = no comparison signal (entry preserved). Tests cover all three readers (committed-blob via stub, working-tree, null-passthrough). deepseek-v4-pro review #1: sanitiseAttribution defaults missing contentHash to '' on legacy-snapshot restore. recordEdit's divergence check would then trip on every subsequent edit and silently reset all the AI work. Skip the divergence check when existing.contentHash is empty — we have no baseline to compare against, so don't drop. Test added covering legacy-snapshot preservation through validateAgainst. deepseek #4: validateAgainst now logs every entry drop via debugLogger.debug so a 3am operator can see WHICH entry got dropped and tied to which canonical key. deepseek #8: GIT_NAMESPACE removed from GIT_ENV_SHIFTS_REPO. It prefixes ref names within the same repo but doesn't redirect git to a different on-disk repository, so a commit underneath it still lands in our cwd's repo. Doc comment explains the distinction. deepseek #9: pushd/popd treated as cwd-shifting alongside cd in gitCommitContext / isAmendCommit / findAttributableCommitSegment. pushd reuses cdTargetMayChangeRepo (relative-no-escape stays in-repo); popd unconditionally flips cwdShifted because we don't track the bash dir-stack. deepseek #10: sudo's value-taking flag table now has a parallel SUDO_FLAGS_SHIFT_CWD set covering -D / --chdir (Linux sudo 1.9.2+). Any segment whose sudo wrapper sees one of those flags returns null from tokeniseSegment — same contract as env -C / --chdir and GIT_DIR=... 328 tests pass; typecheck clean both packages.
Follow-up — addressing two recent review batchesRe: gpt-5.5 issue comment (4389405179) + deepseek-v4-pro review (4237493276) — landed in gpt-5.5 #1 (deleted-file canonicalisation) ✅ gpt-5.5 #2 (working-tree vs committed blob) ✅ Refactored deepseek-v4-pro #1 (empty contentHash on legacy snapshot) ✅ Skip the divergence reset when deepseek-v4-pro #4 (silent deletes) ✅ deepseek-v4-pro #8 (GIT_NAMESPACE) ✅ Removed from deepseek-v4-pro #9 (pushd/popd) ✅ All three commit-walk sites (gitCommitContext, isAmendCommit, findAttributableCommitSegment) now treat deepseek-v4-pro #10 (sudo --chdir) ✅ Added Deferred with rationale:
328 tests pass; typecheck clean both packages. |
…der, intent-aware migration Round 1 of multi-pass audit on b3a06a7. Three correctness fixes: 1. validateAgainst was iterating ALL fileAttributions but the committed-blob reader (git show HEAD:<rel>) returns HEAD's pre-AI content for files NOT in the just-made commit. Result: pending unstaged AI work was silently wiped on every commit because the divergence check ran against the wrong baseline for unrelated files. Fix: build the committed scope first via matchCommittedFiles, scope the reader to that set (return null for everything else), validate, then RE-run matchCommittedFiles to pick up dropped entries. The validateAgainstWorkingTree wrapper had no production caller — removed it and its test. 2. The committed-blob reader used symbolic `HEAD` instead of the captured postHead SHA — same TOCTOU concern buildGitNotesCommand already addressed. A post-commit hook moving HEAD between capture and the reader's `git show` would silently compare against the wrong commit's content and trip the divergence check spuriously. Pin the reader to `git show <postHead>:<rel>`. 3. v3→v4 migration's invalid-string fallback used to reset to {}. Combined with the runtime pickBool's "absent → schema default true" rule, that silently re-enabled attribution for users who hand-edited `"gitCoAuthor": "off"` to disable. Migration now recognises enable-intent strings (true/yes/on/1/enabled) and disable-intent strings (false/no/off/0/disabled/'') and maps them to {commit, pr} explicitly. Unrecognised strings fall to {commit: false, pr: false} with a warning — same safer-by-default contract as runtime pickBool. Test grid covers all 11 cases. Also tidied the FileAttribution.contentHash JSDoc to reference the renamed `validateAgainst` (was still pointing at the dropped `validateOnDiskHashes` name). 1085 tests pass; typecheck clean both packages.
…Bool inputs
Round 2 of multi-pass audit. Two cleanups, no behaviour changes:
1. addCoAuthorToGitCommit and addAttributionToPR each carried their
own copy of the matchRange / isInside / "pick LAST non-nested
match" logic (~25 LOC duplicated). Extracted to module-level
helpers `matchSpan`, `isMatchInside`, and `pickOuterLastMatch<T>`
so a future bug fix can't apply to only one of the two
rewriters. Behaviour identical — same algorithm, same edge cases.
2. normalizeGitCoAuthor's pickBool silently maps unrecognised
strings to false (safer-by-default vs the old "default-to-true
on mismatch" policy, but a user who hand-edited
`{ commit: "maybe" }` had no signal that their setting was being
ignored). Add a `gitCoAuthorLogger.warn` listing the accepted
forms so a debug-mode user can see the actual coercion. Known
disable-intent strings (false/no/off/0/empty) stay silent —
they're explicit user intent. Also pass the field name so the
warning identifies which sub-toggle (commit vs pr) was bad.
1101 tests pass; typecheck clean.
Round 3 of multi-pass audit. One real correctness fix. Edit and WriteFile preserve the file's BOM and CRLF line-ending choice when writing back, so the on-disk bytes can include a leading U+FEFF and CRLFs even when AI's recordEdit input was given with LF and no BOM. The committed-blob reader's `git show <sha>:<rel>` returns those raw bytes verbatim, and computeContentHash hashed them as-is — so a UTF-8 BOM file or a CRLF-line-ending file would always have a mismatch between AI's recorded hash and the on-disk hash, and validateAgainst would drop the entry on every commit. Add `canonicaliseForHash`: strips a leading U+FEFF and normalises CRLF→LF before computing the SHA-256. Both sides (recordEdit when storing the post-write hash, and validateAgainst when comparing to the on-disk read) flow through computeContentHash, so the canonicalisation is symmetric. The hash is metadata used only for divergence detection — collapsing these visual differences is the right comparison semantics. Three regression tests added: BOM-only, CRLF-only, and BOM+CRLF combined. All exercise the typical case where AI's recordEdit input is LF + no BOM but the on-disk content (post-writeTextFile) has the file's preserved BOM/lineEnding choice.
…d file Round 4 of multi-pass audit + Copilot finding from review 4236842362 (I missed it in the previous refresh). recordEdit's existing prior-state check was symmetric on diverged oldContent but ASYMMETRIC on a fresh file lifetime: when AI creates `foo.ts` (oldContent=null), then user `rm foo.ts`, then AI re-creates `foo.ts` (oldContent=null again), the second recordEdit saw `existing` (from the first lifetime) and SKIPPED the divergence check (because oldContent === null bails out of that branch). The accumulator carried 100 chars from the deleted file plus 5 chars from the new content = 155, vs the actual 5 on disk. Subsequent generateNotePayload's clamp against `(adds+dels) * 40` couldn't catch this — the diff size for a 1-line addition is 40, far above the actual content size. Add a fresh-file-lifetime branch: when `existing` is set AND the caller reports `oldContent === null`, reset aiContribution and aiCreated before counting the new contribution. The new edit is treated as a brand-new file at the same path (which is what the caller's null oldContent means semantically). Test added covering the exact `AI create → delete → AI re-create` flow. Also verified `should treat new files as ai-created` and `should accumulate contributions across multiple edits` still pass.
Round 5 of multi-pass audit. Two related correctness/efficiency fixes around the cwd-shift parser and the preHead capture. 1. `git -C .` (and `-C ./`, `-C.`) is a no-op cwd shift but the "any -C → cwd-shifted" rule was treating it the same as `-C /tmp/other`, suppressing attribution for what's effectively `git commit` with an explicit current-dir marker. Add an `isNoopCwdTarget` helper used in both the spaced (`-C .`) and attached (`-C.`) branches of `parseGitInvocation`. `--git-dir` / `--work-tree` are left unconditional — those aren't cwd in the same sense. 2. preHead was being captured for ANY hasCommit, including the non-attributable cases (`cd /elsewhere && git commit`, `git -C /other commit`). The only consumer of preHead is the `attachCommitAttribution` call inside the `attributableInCwd` branch — there is intentionally NO cleanup branch for the non-attributable case (see the existing comment around the `else if (commitCtx.hasCommit)` non-branch). The execFileSync for `getGitHeadSync` is dead work in that path: ~10–50 ms blocking the event loop before the user's real command spawns. Gate the capture on `attributableInCwd` to match the consumer. Tests added for the three -C dot-form variants. Full suite green: 146 in shell.test.ts, 56 in commitAttribution.test.ts.
… into feat/commit-attribution
Round 7 of multi-pass audit. Two related fixes around how
`shell-quote` handles env-var references and how the cwd-shift
detector reads them.
1. `shell-quote.parse` collapses `$NAME` references it cannot
resolve to the empty string. The downstream cwd-shift checks
(`cdTargetMayChangeRepo`'s `target.includes('$')` repo-shift
detector, and the new `isNoopCwdTarget` no-op detector) were
designed to catch env-var targets but received `''` instead of
`$NAME` from `tokeniseSegment` and silently failed. Concretely,
`cd $HOME && git commit` and `git -C $HOME commit` would both
pass through as in-cwd attributable, stamping our trailer onto
commits that land in whatever repo `$HOME`/`$REPO_ROOT`
resolves to at runtime.
Pass an env getter `(key) => '$' + key` to `shell-quote.parse`
inside `tokeniseSegment` so unresolved references stay literal
in tokens (`['cd', '$HOME']` instead of `['cd', '']`).
`target.includes('$')` now fires correctly, and the no-op
detector sees `$HOME` (non-`.`) and rejects it. KEY=value
leading-env detection is unaffected (shell-quote doesn't
interpolate inside KEY=value tokens).
2. Even with env preservation, an `''` target can still slip
through (literal `-C ""`, escaped quotes, edge cases in
shell-quote). Round 5's `isNoopCwdTarget` accepted `''` as a
no-op alongside `'.'` / `'./'`, which would re-introduce the
attribution-on-wrong-repo problem if any path produced an
empty token. Tighten to `'.'` and `'./'` only — the only
missed cases are literal `-C ""` (malformed, won't actually
commit) and the rare `-C $PWD` (now also caught conservatively,
since `$PWD` becomes literal `$PWD` and isn't `.` or `./`).
Tests added for `cd $HOME` / `cd $REPO_ROOT && git commit` and
`git -C $HOME commit` / `git -C "" commit`. Full suite green
(150 in shell.test.ts, 58 in commitAttribution.test.ts).
There was a problem hiding this comment.
Review
Most of the prior round is closed: the note write is SHA-pinned, the three-layer hash-divergence detection holds up against paste-replace and delete-then-recreate, and the V3→V4 migration now honours user opt-out intent on string forms. The PR description was rewritten and matches what the code emits. Two residuals remain — one carried over from last round, one newly surfaced.
1. The diff phase still races against HEAD
The previous round identified that the note was being attached to a symbolic HEAD, with a window in which a post-commit hook could move HEAD before the note landed. That fix went in for the note write itself — the note now correctly attaches to the commit the user cared about. But the content of the note (which files changed, how big the changes were, what got deleted, what got renamed) is still computed by reading symbolic HEAD several times inside getCommittedFileInfo. Several awaited git subprocesses run between the postHead capture and these reads, so any post-commit hook, husky/lefthook auto-amender, chained git tag -m ..., or parallel git process that moves HEAD in that window leaves the note attached to commit A but describing commit B's contents. It's the same TOCTOU class as the previous critical, half-closed. Threading postHead (and the captured preHead) through the diff calls — so the range becomes ${preHead}..${postHead} instead of HEAD~1..HEAD — closes it.
2. aiChars reads as a character count but isn't one
The field is emitted as a plain integer named aiChars, and the PR description's example shows values like 3200 / 1500 / 4700 — anyone parsing the git note will read these as literal character counts. They aren't. Internally, the value is (added_lines + deleted_lines) × 40 for text files and a flat 1024 for binary files, with the per-file AI accumulator clamped against that ceiling. The consequence: a commit adding 1000 one-character lines and a commit adding 1000 thousand-character lines both report aiChars = 40000; a 5 MB image change and a 1-byte binary tweak both report 1024. The aiPercent is fine in most cases (the same heuristic appears on both sides of the ratio), but anyone aggregating raw aiChars for compliance or audit reporting — which is the motivation in the PR description — gets systematically wrong numbers.
Verdict
COMMENT — the architectural work is solid and most prior criticals are closed; these two are scoped fixes or doc clarifications, not rework.
…stic
Addresses tanzhenxin's review (4240760004) — two residuals after
the prior pinning round.
1. Diff phase still races against HEAD.
The note write itself was already pinned to the captured `postHead`
(`git notes add -f <postHead>`), but the *content* of the note —
`getCommittedFileInfo`'s probe + diff calls and the multi-commit
guard's `rev-list --count` — were still going through symbolic
`HEAD` / `HEAD~1` / `HEAD@{1}`. Several awaited subprocesses run
between the postHead capture and these reads, so a husky / lefthook
auto-amender, signed-commits hook, chained `git tag -m`, or
parallel git process moving HEAD in that window would leave the
note attached to commit A but describing commit B's contents.
Same TOCTOU class as the prior critical, half-closed.
Thread `postHead` (and `preHead` for amend) through
`getCommittedFileInfo`. Probes become `rev-parse --verify
${postHead}~1` and `log -1 --pretty=%P ${postHead}`; diffs become
`${postHead}~1..${postHead}` (parent case),
`${preHead}..${postHead}` (amend — preHead is the pre-amend SHA
captured before the user's command and is exactly what HEAD@{1}
resolved to at parse time, with the added benefit that it can't be
GC'd between capture and use), and `diff-tree --root <postHead>`
(root commit). The amend branch keeps the existing reflog-vs-
no-reflog warning, just driven off `preHead` instead of HEAD@{1}.
Same pin applied to `countCommitsAfter` (now `${preHead}..
${postHead}`) and `countCommitsFromRoot` (now `${postHead}`).
Why parent case uses `${postHead}~1` and NOT `${preHead}`: in
`git reset HEAD~3 && git commit` chains the captured preHead
points well above postHead's parent, and `${preHead}..${postHead}`
would describe the reset-away commits as deletions, drastically
over-attributing. The actual parent of the just-landed commit is
what we want, and `${postHead}~1` is the SHA-pinned form of that.
2. `aiChars` reads as a literal char count but isn't.
The field is emitted as a plain integer named `aiChars`; the PR
description's example shows values like 3200 / 1500 / 4700 that
anyone parsing the note will read as literal character counts.
Internally it's `(addedLines + deletedLines) × 40` for text and a
flat 1024 for binary, with the per-file AI accumulator clamped
against that ceiling. So 1000 one-character lines and 1000
thousand-character lines both report aiChars=40000, and a 5 MB
image change and a 1-byte binary tweak both report 1024. Anyone
aggregating raw aiChars for compliance reporting gets
systematically wrong numbers.
Add a comprehensive doc block on `FileAttributionDetail` (and
`CommitAttributionNote`) calling out the heuristic explicitly,
noting that `percent` / `summary.aiPercent` are the correct
fields for aggregation since both numerator and denominator use
the same proxy. Also expand the `APPROX_CHARS_PER_LINE` /
`BINARY_DIFF_SIZE_FALLBACK` const docs to point at the same
caveat. (Not renaming the fields — that'd break any downstream
consumer already parsing the existing schema; the doc is the
minimum-disruption call here.)
208 attribution tests pass; type-check clean.
|
@tanzhenxin both residuals addressed in 8c33120. 1. Diff-phase TOCTOU (now fully closed)
Why parent case uses 2. Not renaming the field (it would break any downstream consumer already parsing the existing schema), but added a comprehensive doc block on
The 208 attribution tests pass; type-check clean. |
|
The follow-up commit cleanly closes both findings — every git probe in the analysis phase is now SHA-pinned to the captured commit, and the schema doc explains the Two things to clear before merging: the branch is currently conflicting with main, and the |
…ttribution # Conflicts: # packages/cli/src/config/settingsSchema.ts # scripts/generate-settings-schema.ts
… compat Windows CI failure on the two new rename tests (visible at PR #3115's `Test (windows-latest, *)` jobs): AssertionError: expected undefined to be defined ❯ src/services/commitAttribution.test.ts:572:66 (basic move) AssertionError: expected 11 to be 22 (merge into existing) Root cause: `path.join(canonicalRepoRoot, ...renamedRel.split('/'))` calls `path.win32.join` on Windows, which forces backslash separators regardless of input form. The test's `fs.realpathSync` mock returns forward-slash paths (matching the macOS `/var` ↔ `/private/var` fixture style), so `recordEdit` stores keys like `/private/var/repo/src/old.ts`. The rename's joined target then came out as `\\private\\var\\repo\\src\\new.ts`, the mock left it unchanged (no `/var/` prefix to translate), and the subsequent `fileAttributions.get(renamedAbs)` / `getFileAttribution(...)` lookups missed the just-set entry — the rename silently dropped attribution. The fix: build the joined path with `path.posix.join` against a forward-slash-normalised `posixRepoRoot`, then let `realpathOrSelf` canonicalise to the platform's storage form. This way: - On real Windows production: posix-joined `D:/repo/src/new.ts` is accepted by `fs.realpathSync` (Win32 API takes mixed slashes) and returned in backslash form, matching what `recordEdit` stored. - On real Linux/macOS production: forward-slash throughout, no-op. - In the symlink-aware test (any platform): forward-slash matches the mock-fixture storage form. `matchCommittedFiles` already does the inverse normalisation (`.split(path.sep).join('/')` for the relative-form check), so the in/out paths line up either way. Skipped adding a path.sep-mocked Linux-side regression because the ESM module namespace doesn't allow `vi.spyOn` on path's exports. The Windows CI job is the regression catcher; a focused-rerun should now go green.
|
@tanzhenxin both blockers cleared. Branch conflict with main: Resolved in merge commit b8d242d. Two conflicts both touched additive expansions of the same interfaces —
CI should be green on the next run. |
tanzhenxin
left a comment
There was a problem hiding this comment.
Re-reviewed against the latest revision. The prior critical issues (HEAD-symbolic diff analysis TOCTOU and aiChars units mismatch) are addressed. Merge conflicts are resolved and CI is green on all completed jobs.
Approving.
The SHA-pinning round (8c33120) replaced symbolic `HEAD~1..HEAD` / `HEAD@{1}..HEAD` with `${postHead}~1..${postHead}` and `${preHead}..${postHead}` in `getCommittedFileInfo` and the rev-list helpers, but three docstrings / inline comments still described the old shapes: - `isAmendCommit` JSDoc said the amend switch goes from `HEAD~1..HEAD` to `HEAD@{1}..HEAD`. Updated to reference `${postHead}~1..${postHead}` and `${preHead}..${postHead}`, with the why (amended commit's parent is the original's parent so the standard parent diff lumps both commits' changes). - `attachCommitAttribution`'s amend branch comment had the same drift; updated to mention `${preHead}..${postHead}` directly. - `getCommittedFileInfo` JSDoc said it diffs "HEAD against its parent (HEAD~1)" and listed "--amend with no reflog" as an analysis-failure case. Updated to mention postHead-pinning and the preHead-driven amend bail (the reflog-GC dependency was dropped in the SHA-pin round). The remaining `HEAD~1..HEAD` references at countCommitsAfter:1959 and getCommittedFileInfo:2523 are intentional — they describe the old buggy shape as contrast for why we pin now. No code change; tests + tsc still clean.
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
The two residuals from the prior round are closed. The diff/rev-list phase is now SHA-pinned to the captured post-commit head (and to the captured pre-amend head on the amend path), matching the note write — a post-commit hook moving HEAD between analysis and the note write can no longer leave the note attached to one commit while describing another. The aiChars / humanChars field naming stays, but the schema docstring now spells out plainly that the values are heuristic diff-size proxies (lines × 40 for text, 1024 for binary), with the worst-case examples called out — consumers parsing the note as literal character counts will see the caveat.
Verdict
APPROVE — feature is in good shape; remaining items are minor / followup-class.
Round 13 audit found a real bug: `sudo --chdir=/tmp git commit`,
`env -C/tmp git commit`, `env --chdir=/tmp git commit`, and
`sudo -D/tmp git commit` were all silently slipping through the
cwd-shift detector and getting our `Co-authored-by` trailer stamped
onto commits that landed in a different repo.
Root cause: `shell-quote` tokenises both the long attached form
(`--chdir=/tmp`) and the short attached form (`-C/tmp`) as a single
argv entry. The previous SHIFT_CWD detector did set-membership only
against the bare flag (`{'-C', '--chdir'}` for env;
`{'-D', '--chdir'}` for sudo), so the attached-form tokens never
matched and `tokeniseSegment` returned a normally-attributable
`['git', 'commit', ...]` segment.
Fix: introduce `isShiftCwdFlag(flag, set)` that catches:
- bare set-membership (existing behavior),
- long attached: `--name=...` when `--name` is in the set,
- short attached: `-Xanything` when `-X` is in the set and the
token is longer than the flag itself.
The flag does NOT need to consume an extra value token in the
attached-form case (the value is already embedded), so the existing
TAKES_VALUE bookkeeping is unaffected — we just bail with `null`
from `tokeniseSegment` before reaching the value-skip step.
Tests added: `env --chdir=`, `env -C/...` (attached), `sudo --chdir=`,
`sudo -D/...` (attached) — each is asserted NOT to add a co-author
trailer. 154 shell tests pass; type-check + lint clean.
Adds three regression cases to the existing "git -C <path>" suppression test: the short attached form `-C/path` (single shell-quote token) and the long attached forms `--git-dir=/path` / `--work-tree=/path`. parseGitInvocation already had the prefix checks at lines 416/425, but no test exercised them — paired with the b89b655 sudo/env attached- form fix this round closes the family of "shell-quote single-token flag with embedded value" cases that the bare set-membership checks would otherwise miss. 157 shell tests pass; type-check clean.
The addCoAuthorToGitCommit body capture has a known truncation case when an inner unescaped `"` appears inside the captured body — handled for `$(...)` command substitution with an explicit bailout, but not for backtick command substitution. The trade-off was unspoken; spell it out so a future reviewer doesn't read the asymmetry as an oversight. Bare-backtick bodies (`\`func()\`` markdown-style) are common in commit messages, have no inner `"`, and the regex captures them correctly. Pathological backtick-with-inner-quote bodies (`\`cmd "with" quotes\``) are a near-zero-traffic case where bash itself already interprets the backticks as command substitution, so the user has likely already broken their own command before our rewrite runs. Bailing on any backtick would lose attribution for the common case to defend against the rare one. Also drops a stray blank line in commitAttribution.test.ts left over from an earlier regression-test attempt.
Round 13 follow-on. Both `addCoAuthorToGitCommit` and
`addAttributionToPR` ran their `-m` / `--body` regex against the full
segment string, including any trailing shell comment. For a command
like `git commit -m "real" # -m "fake"` (a human-authored script
might leave a comment-out flag in place), `lastMatchOf` would pick
the comment's `-m "fake"`, splice the `Co-authored-by:` trailer in
there, and bash would silently discard the entire segment as a
comment — leaving the actual commit unattributed. Same shape for
`gh pr create --body "real" # --body "fake"`.
Fix: introduce `findUnquotedCommentStart(s)` — a bash-aware position
scanner that tracks single/double-quote state and treats `#` as a
comment marker only when it begins a word (start of input or
preceded by whitespace), not when it appears inside a quoted region
or mid-token like `foo#bar`. Both rewriters slice the segment to
`[0, commentStart)` before running their regex, so the trailer can
only land in the live (pre-comment) part.
Tests added:
- `git commit -m "real" # -m "fake"` — trailer lands in `"real"`
body BEFORE the `#`, comment's `-m "fake"` is left untouched.
- `git commit -m "fix #123 add feature"` — `#` inside the quoted
body is correctly NOT treated as a comment; the `#123` stays
inside the body and the trailer is appended.
159 shell tests pass; type-check clean.
wenshao
left a comment
There was a problem hiding this comment.
…+ cover legacy gitCoAuthor migration end-to-end
Two residuals from this morning's review pass.
1. ANm7O — `addAttributionToPR` silently skipped for `--body-file`,
`--fill`, and bare `gh pr create` (editor) flows.
The rewriter only knows how to splice into an inline `--body`/`-b`
argv entry. For a `gh pr create` that uses `--body-file path`,
`--fill` (uses commit messages), or no body flag at all (editor
prompt), there's no inline body to splice into and the function
returned the unmodified command. Users with `gitCoAuthor.pr`
enabled would see PRs created without the attribution line and
have no signal as to why.
Add a debugLogger.warn at the no-match path naming the unsupported
flows and pointing the user at the inline form. Don't try to
handle `--body-file` automatically — that would mean mutating the
user's file on disk, which is well outside what an unprompted
command rewriter should do; `--fill` and editor flows have no body
in argv at all and can't be rewritten without re-architecting.
Tests added for `--body-file <path>`, `--fill`, and bare
`gh pr create` — each is asserted to leave the command unchanged
(no `Generated with Qwen Code` line spliced in).
2. ANm7L — settings-migration integration suite didn't cover the
exact V3 legacy shape this PR introduces.
`v3-to-v4.test.ts` already pins the migration body, but the end-
to-end CLI load → migrate → write path could regress without the
integration suite noticing. The existing v3LegacyDisableSettings
fixture has no `general.gitCoAuthor` field, so the V3→V4 step
technically fires but doesn't exercise the new boolean-expansion
logic.
Add a `v3GitCoAuthorBooleanSettings` fixture and a paired test
case that writes `general: { gitCoAuthor: false }` at $version 3,
runs the same `mcp list` CLI invocation, and asserts the saved
file has $version 4 plus `general.gitCoAuthor` exactly
`{ commit: false, pr: false }` — with sibling general.* keys and
unrelated top-level sections preserved.
162 shell tests pass; type-check + lint clean.
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
Re-stating approve after the prior one was auto-dismissed by the new commits. Nothing in the follow-up batch regresses the earlier round; one commit (the attached-form env / sudo cwd-shift fix) actually closes a real cross-repo attribution bug — sudo --chdir=/tmp git commit and env -C/tmp git commit would otherwise have stamped the Co-authored-by trailer onto commits landing in a different repository. The shell-comment-bypass fix for git commit -m "real" # -m "fake" and the no-inline-body warn for gh pr create are also legitimate scoped improvements, and the new integration test for the V3 boolean gitCoAuthor migration closes the only end-to-end gap from the self-review pass.
Verdict
APPROVE — feature is in good shape; the five follow-up commits sharpen edge-case handling without changing the architecture.
Motivation
When developers use AI agents to write code, there is no way to distinguish AI-generated changes from human-authored ones in git history. This creates challenges for:
This PR adds automatic, per-file AI contribution tracking that attaches structured metadata to commits without polluting commit messages.
How it works
Example output
{ "version": 1, "generator": "Qwen-Coder", "files": { "src/services/commitAttribution.ts": { "aiChars": 3200, "humanChars": 0, "percent": 100, "surface": "cli" }, "src/tools/shell.ts": { "aiChars": 1500, "humanChars": 0, "percent": 100, "surface": "cli" } }, "summary": { "aiPercent": 100, "aiChars": 4700, "humanChars": 0, "totalFilesTouched": 2, "surfaces": ["cli"] }, "surfaceBreakdown": { "cli": { "aiChars": 4700, "percent": 100 } }, "excludedGenerated": [], "excludedGeneratedCount": 0, "promptCount": 3 }Design decisions
CreateProcessargv ceiling (~32 KB UTF-16); the JSON note travels as a separateexecFileargv entry alongsidegit, the path, and other args, so we cap the payload itself well below thatEdge case behavior
validateOnDiskHashes()drops any entry whose on-disk content has diverged so AI is not credited for human edits made between attemptsChanged files
New files
Modified files
Test plan
Automated
commitAttribution.test.ts: 27 tests (character contribution, snapshot/restore, baselines, generated-file detection, surface, prompt counters)attributionTrailer.test.ts: 7 tests (shell-safe note command, size guard, formatting)tools/edit.test.ts/write-file.test.ts/shell.test.ts: 136 tests (integration with recordEdit and attachCommitAttribution)core/client.test.ts: 69 tests incl. 3 asserting attribution snapshot fires on UserQuery + ToolResult and skips RetryEnd-to-end (scratch repo + real CLI + real LLM)
aiChars/aiPercent: 100/surface: clihumanChars > 0, percent reflects real ratiopackage-lock.json) excluded fromfiles, appears inexcludedGenerateddiff-tree --rootand records correctlygeneral.gitCoAuthor: false→ commit succeeds with no git notes writtengh pr create --body "..."→ body auto-appended with🤖 Generated with Qwen Code--resume→ commit in turn 2 → notes contain the turn-1 edits (requires the ToolResult-turn snapshot fix in this PR)clearAttributions(false)on failed commit preservesgetPromptsSinceLastCommit()counter; next successful commit resets it