Bash line comments not recognized — verb chain extracted from comment text #25

@Aaronontheweb

Observed

When the input contains a bash line comment (a # that starts a token; the comment runs to end-of-line per POSIX), the parser appears to treat the comment text as regular tokens. For a multi-line script with a leading comment, the extracted verb chain is the comment text rather than the actual command.

Repro from production (Netclaw consumer)

The agent issued a shell_execute call with this command body (from a real Slack session, 2026-05-12):

# Extract all unique branch names from worktrees
git -C ~/repositories/stannardlabs/netclaw worktree list | awk '{print $NF}' | tr -d '[]' | sort -u

The downstream approval prompt rendered as:

Approve # Extract in ~/repositories/stannardlabs/netclaw?

The displayed verb is # Extract — the first two whitespace-separated tokens of the comment line. The expected verb is git worktree list (BashArity collapses git to a 2-token verb, plus the actual worktree subcommand following the -C <path> flag-with-value).
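
To illustrate the failure mode, here is a minimal Python sketch. It is not Netclaw's ShellTokenizer code; `naive_verb` and its `arity` parameter are hypothetical names, and the function only assumes what the prompt shows: the verb is taken from the first whitespace-separated tokens with no comment handling.

```python
# The command body from the repro above, as a two-line script.
command = (
    "# Extract all unique branch names from worktrees\n"
    "git -C ~/repositories/stannardlabs/netclaw worktree list"
    " | awk '{print $NF}' | tr -d '[]' | sort -u"
)


def naive_verb(source: str, arity: int = 2) -> str:
    """Hypothetical verb extractor that ignores comment syntax entirely.

    It splits on any whitespace (including the newline) and keeps the
    first `arity` tokens, so a leading comment line wins.
    """
    return " ".join(source.split()[:arity])


print(naive_verb(command))  # prints "# Extract"
```

Any extractor that splits before comments are removed will behave this way, regardless of how it later handles flags such as -C <path>.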

Note: this specific report was traced through Netclaw's v2 ShellTokenizer (the legacy parser still consulted by the prompt builder), but ShellSyntaxTree's BashLexer would have the same gap if comments aren't handled — and consumers migrating to ShellSyntaxTree as the primary parser need this fixed.

Expected behavior

Per POSIX (and bash) shell grammar:

  • A # that begins a word (that is, not inside single quotes, not inside double quotes, and not in the middle of a word) starts a comment that runs up to, but not including, the next newline.
  • Comments produce no tokens — the lexer should skip them entirely.
  • A multi-line input where the first non-comment statement is git -C <path> worktree list | awk ... should parse as that statement (with the pipe-chained awk/tr/sort clauses), with no influence from the preceding comment.
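
The rules above can be sketched as a small word splitter. This is an illustrative Python sketch under stated assumptions, not BashLexer's implementation: the name `lex` is hypothetical, operators such as | are only recognized when surrounded by whitespace, and expansions, here-documents, and backslash-newline inside double quotes are ignored.

```python
def lex(source: str) -> list[list[str]]:
    """Return one word list per non-empty line, skipping `#` comments."""
    lines: list[list[str]] = []
    current: list[str] = []   # words of the line being read
    word: list[str] = []      # characters of the word being built
    in_word = False           # True once a word has started (even an empty '' word)
    quote = None              # None, "'" or '"'
    i, n = 0, len(source)

    def flush_word() -> None:
        nonlocal in_word
        if in_word:
            current.append("".join(word))
            word.clear()
            in_word = False

    def flush_line() -> None:
        nonlocal current
        flush_word()
        if current:
            lines.append(current)
            current = []

    while i < n:
        ch = source[i]
        if quote == "'":
            if ch == "'":
                quote = None
            else:
                word.append(ch)          # `#` is literal inside single quotes
        elif quote == '"':
            if ch == '"':
                quote = None
            elif ch == "\\" and i + 1 < n and source[i + 1] in '"\\$`':
                i += 1
                word.append(source[i])   # escaped character inside double quotes
            else:
                word.append(ch)          # `#` is literal inside double quotes
        elif ch in "'\"":
            quote = ch
            in_word = True
        elif ch == "\\" and i + 1 < n:
            i += 1
            if source[i] != "\n":        # backslash-newline is a line continuation
                word.append(source[i])
                in_word = True
        elif ch == "\n":
            flush_line()
        elif ch in " \t":
            flush_word()
        elif ch == "#" and not in_word:
            # Comment: skip up to, but not including, the next newline.
            while i < n and source[i] != "\n":
                i += 1
            continue
        else:
            word.append(ch)              # includes `#` in the middle of a word
            in_word = True
        i += 1
    flush_line()
    return lines


print(lex("# fetch the latest\ngit pull"))  # [['git', 'pull']]
print(lex("echo abc#def"))                  # [['echo', 'abc#def']]
```

The key design point is that the comment check depends on `in_word`: the same `#` character is a comment start at a word boundary and an ordinary character anywhere else, which is why the decision has to be made in the lexer rather than by pre-filtering lines.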

What this affects

  • BashLexer token output: comment text should not produce WORD tokens.
  • BashParser: should not see comment content as part of any clause.
  • ParsedCommand.Source should retain the original input verbatim per the existing contract; it's only the structured AST that excludes comment content.
  • Consumers that walk ParsedCommand.Clauses to extract verb chains for security gates or audit logs get the wrong verb chain when the input has a leading or interleaved comment.

Suggested test cases for the corpus

# Leading line comment, single command
input:    "# fetch the latest\ngit pull"
expected: one clause with verb [git, pull]

# Inline mid-line comment (after whitespace, before newline)
input:    "git pull   # update local"
expected: one clause with verb [git, pull]; comment after the command is dropped

# Comment-only input
input:    "# just a note"
expected: zero clauses (or IsUnparseable=true with reason "no executable statements")

# Comment between commands in a script
input:    "git pull\n# now build\ndotnet build"
expected: two clauses with verbs [git, pull] and [dotnet, build]

# `#` inside double quotes — NOT a comment
input:    "echo \"hash is #1234\""
expected: one clause with verb [echo], one literal arg \"hash is #1234\"

# `#` inside single quotes — NOT a comment
input:    "echo 'use #foo'"
expected: one clause with verb [echo], one literal arg 'use #foo'

# `#` in the middle of a word (not preceded by whitespace): NOT a comment.
# bash treats `#` as a comment start only when it is the first character of a
# "word", i.e. preceded by whitespace or at start-of-input.
input:    "echo abc#def"
expected: one clause with verb [echo], one literal arg abc#def

# `#` at the start of a word mid-command (preceded by whitespace)
input:    "echo abc #def"
expected: one clause with verb [echo], one literal arg abc; #def starts a comment and is dropped

SPEC.md gap

SPEC.md §5 (Tokenization Rules) doesn't currently mention comments. The fix should:

  1. Add a "Comment handling" subsection to §5 documenting the rule (# starting a word and not inside quotes begins a comment to end-of-line; comments produce no tokens).
  2. Update the BNF in §4 to be explicit that comments are whitespace-equivalent at the lexer level (don't appear in the grammar).
  3. Add the test cases above to tests/Corpus/bash/.

Severity

Low for parser correctness (comments are degenerate input), but medium for downstream consumers. Agents naturally include explanatory comments in scripts they author for human review, especially scripts they propose to run under a single approval per the scripts-as-units-of-approval pattern. Surfacing comment text in approval prompts confuses the user about what they are actually approving.
