Documentation
Overview ¶
Package tokendiff provides token-level diffing with delimiter support.
Unlike traditional line-based diff tools, tokendiff operates at the token level and treats configurable delimiter characters as separate tokens. This allows for more precise diffs when comparing code or structured text.
For example, when comparing:
someFunction(SomeType var)
someFunction(SomeOtherType var)
A line-based diff would report the entire line as changed. A word-based diff without delimiter awareness might report "someFunction(SomeType" changed to "someFunction(SomeOtherType". When "(" is treated as a delimiter (via Options.Delimiters or Options.UsePunctuation), tokendiff identifies that only "SomeType" changed to "SomeOtherType".
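The tokenization idea behind this example can be sketched with the standard library alone. The `tokenize` helper below is hypothetical and is not the package's Tokenize function; it assumes whitespace is dropped and every delimiter character becomes its own token.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize is a hypothetical stand-in for a delimiter-aware tokenizer.
// It splits on whitespace and emits each delimiter character as a token.
func tokenize(text, delimiters, whitespace string) []string {
	var tokens []string
	var current strings.Builder
	flush := func() {
		if current.Len() > 0 {
			tokens = append(tokens, current.String())
			current.Reset()
		}
	}
	for _, r := range text {
		switch {
		case strings.ContainsRune(whitespace, r):
			flush() // whitespace ends the current word and is dropped
		case strings.ContainsRune(delimiters, r):
			flush() // a delimiter ends the current word...
			tokens = append(tokens, string(r)) // ...and is a token itself
		default:
			current.WriteRune(r)
		}
	}
	flush()
	return tokens
}

func main() {
	const ws = " \t\n\r"
	fmt.Printf("%q\n", tokenize("someFunction(SomeType var)", "()", ws))
	fmt.Printf("%q\n", tokenize("someFunction(SomeOtherType var)", "()", ws))
}
```

With "(" and ")" as delimiters, the two token lists differ in a single position, which is what lets a token-level diff isolate the type name.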
This package uses github.com/dacharyc/diffx to provide this functionality.
Index ¶
- Constants
- Variables
- func ColorCode(fg, bg string, bold bool) (string, error)
- func ColorNames() []string
- func ComputeTokenSimilarity(text1, text2 string, opts Options) float64
- func DiscardConfusingTokens(tokens1, tokens2 []string) (filtered1, filtered2 []string, map1, map2 []int)
- func FormatDiff(diffs []Diff) string
- func FormatDiffResultAdvanced(result DiffResult, opts FormatOptions) string
- func FormatDiffWithOptions(diffs []Diff, opts FormatOptions) string
- func FormatDiffsAdvanced(diffs []Diff, opts FormatOptions) string
- func HasChanges(diffs []Diff) bool
- func NeedsSpaceAfter(token string) bool
- func NeedsSpaceBefore(token string) bool
- func OverstrikeBold(text string) string
- func OverstrikeUnderline(text string) string
- func ParseColor(spec string) (string, error)
- func ParseColorSpec(spec string) (deleteColor, insertColor string, err error)
- func ProcessUnifiedDiff(input io.Reader, output io.Writer, opts Options, fmtOpts FormatOptions) error
- func Tokenize(text string, opts Options) []string
- type Diff
- func AggregateDiffs(diffs []Diff) []Diff
- func ApplyMatchContext(diffs []Diff, minContext int) []Diff
- func ApplyWordDiff(hunk DiffHunk, opts Options) []Diff
- func DiffStrings(text1, text2 string, opts Options) []Diff
- func DiffStringsWithPreprocessing(text1, text2 string, opts Options) []Diff
- func DiffTokens(tokens1, tokens2 []string) []Diff
- func DiffTokensRaw(tokens1, tokens2 []string) []Diff
- func DiffTokensWithPreprocessing(tokens1, tokens2 []string) []Diff
- func EliminateStopwordAnchors(diffs []Diff) []Diff
- func InterleaveDiffs(diffs []Diff) []Diff
- func ShiftBoundaries(diffs []Diff) []Diff
- type DiffHunk
- type DiffResult
- type DiffStatistics
- type FormatOptions
- type LineDiffOutput
- type LineDiffResult
- type LinePairing
- type Operation
- type Options
- type TokenPos
- type UnifiedDiff
- type WholeFileDiffResult
Constants ¶
const (
ANSIReset = "\033[0m"
ANSIClearEOL = "\033[K"
ANSIDeleteColor = "\033[0;31;1m" // bold red
ANSIInsertColor = "\033[0;32;1m" // bold green
ANSIBold = "\033[1m"
)
ANSI escape code constants.
const DefaultDelimiters = ""
DefaultDelimiters contains the default set of delimiter characters, that is, characters treated as separate tokens even when not surrounded by whitespace. The default is the empty string, matching the original dwdiff, which has no default delimiters: words are split only on whitespace unless -d or -P is specified.
const DefaultWhitespace = " \t\n\r"
DefaultWhitespace contains the default set of whitespace characters.
Variables ¶
var BackgroundColors = map[string]string{
"black": "\033[40m",
"red": "\033[41m",
"green": "\033[42m",
"yellow": "\033[43m",
"blue": "\033[44m",
"magenta": "\033[45m",
"cyan": "\033[46m",
"white": "\033[47m",
"brightblack": "\033[100m",
"brightred": "\033[101m",
"brightgreen": "\033[102m",
"brightyellow": "\033[103m",
"brightblue": "\033[104m",
"brightmagenta": "\033[105m",
"brightcyan": "\033[106m",
"brightwhite": "\033[107m",
}
BackgroundColors maps color names to ANSI background escape codes.
var ForegroundColors = map[string]string{
"black": "\033[30m",
"red": "\033[31m",
"green": "\033[32m",
"yellow": "\033[33m",
"blue": "\033[34m",
"magenta": "\033[35m",
"cyan": "\033[36m",
"white": "\033[37m",
"brightblack": "\033[90m",
"brightred": "\033[91m",
"brightgreen": "\033[92m",
"brightyellow": "\033[93m",
"brightblue": "\033[94m",
"brightmagenta": "\033[95m",
"brightcyan": "\033[96m",
"brightwhite": "\033[97m",
}
ForegroundColors maps color names to ANSI foreground escape codes.
Functions ¶
func ColorCode ¶
func ColorCode(fg, bg string, bold bool) (string, error)
ColorCode builds an ANSI escape sequence from component parts. fg is the foreground color name (or empty for default). bg is the background color name (or empty for none). bold adds the bold attribute if true. Returns an error if any color name is not recognized.
func ColorNames ¶
func ColorNames() []string
ColorNames returns a list of all available color names.
func ComputeTokenSimilarity ¶
func ComputeTokenSimilarity(text1, text2 string, opts Options) float64
ComputeTokenSimilarity calculates similarity between two strings based on shared tokens. Returns a value between 0.0 (no similarity) and 1.0 (identical). Similarity is computed as the ratio of Equal tokens to total diff operations.
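As a rough illustration of the ratio described above, the sketch below counts Equal operations over all operations. The `op` type and `similarity` function are hypothetical names, and returning 1.0 for an empty diff is an assumption, not documented behavior.

```go
package main

import "fmt"

// op is a hypothetical operation type used only for this sketch.
type op int

const (
	opEqual op = iota
	opDelete
	opInsert
)

// similarity returns the number of Equal operations divided by the
// total number of operations. Assumption: an empty diff counts as 1.0.
func similarity(ops []op) float64 {
	if len(ops) == 0 {
		return 1.0
	}
	equal := 0
	for _, o := range ops {
		if o == opEqual {
			equal++
		}
	}
	return float64(equal) / float64(len(ops))
}

func main() {
	// "a b c" vs "a x c": Equal(a), Delete(b), Insert(x), Equal(c).
	ops := []op{opEqual, opDelete, opInsert, opEqual}
	fmt.Println(similarity(ops)) // 0.5
}
```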
func DiscardConfusingTokens ¶
func DiscardConfusingTokens(tokens1, tokens2 []string) (filtered1, filtered2 []string, map1, map2 []int)
DiscardConfusingTokens filters tokens that appear too frequently, which would create many spurious match points during diff calculation. Returns filtered token slices and index maps back to original positions.
Algorithm (inspired by GNU diff's discard_confusing_lines):
1. Count occurrences of each token in the OTHER file (equiv_count)
2. Mark tokens:
- equiv_count == 0: definitely discard (can't match anything)
- equiv_count > √n: provisionally discard (too many potential matches)
- else: keep for matching
3. Apply provisional discard rules:
- Provisional tokens are kept only if they form runs with non-provisional endpoints AND at least 25% of the run is provisional
Note: Filtered tokens are still included in the final diff output - they're just excluded from the LCS matching to prevent spurious anchoring.
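Steps 1 and 2 of the algorithm above can be sketched as a classification pass. This is an illustration only: the names are hypothetical, and it assumes n in the √n threshold is the token count of the other file, which the description does not state. Step 3 (the run-based rule for provisional tokens) is not shown.

```go
package main

import "fmt"

// verdict is a hypothetical label for the outcome of step 2.
type verdict int

const (
	keep verdict = iota
	provisional
	discard
)

// classify marks each token by how often it occurs in the other file.
// Assumption: n in the threshold is len(other).
func classify(tokens, other []string) []verdict {
	counts := make(map[string]int, len(other))
	for _, t := range other {
		counts[t]++ // step 1: occurrence counts in the OTHER file
	}
	n := len(other)
	out := make([]verdict, len(tokens))
	for i, t := range tokens {
		c := counts[t]
		switch {
		case c == 0:
			out[i] = discard // cannot match anything
		case c*c > n: // same as c > √n, without floating point
			out[i] = provisional
		default:
			out[i] = keep
		}
	}
	return out
}

func main() {
	other := []string{"the", "the", "the", "cat", "sat", "on", "the", "mat", "x"}
	fmt.Println(classify([]string{"the", "cat", "dog"}, other))
}
```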
func FormatDiff ¶
func FormatDiff(diffs []Diff) string
FormatDiff returns a human-readable representation of the diff. Deleted tokens are wrapped in [-...-] and inserted tokens in {+...+}.
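The marker convention can be illustrated with a small formatter. The `diff` struct and `format` function are hypothetical; this sketch joins every token with a single space and does not reproduce the package's NeedsSpaceBefore/NeedsSpaceAfter spacing heuristics.

```go
package main

import (
	"fmt"
	"strings"
)

type op int

const (
	opEqual op = iota
	opDelete
	opInsert
)

// diff is a hypothetical stand-in for one diff operation on a token.
type diff struct {
	kind  op
	token string
}

// format wraps deletions in [-...-] and insertions in {+...+}.
// Assumption: tokens are simply joined with one space.
func format(diffs []diff) string {
	parts := make([]string, 0, len(diffs))
	for _, d := range diffs {
		switch d.kind {
		case opDelete:
			parts = append(parts, "[-"+d.token+"-]")
		case opInsert:
			parts = append(parts, "{+"+d.token+"+}")
		default:
			parts = append(parts, d.token)
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	fmt.Println(format([]diff{
		{opEqual, "hello"},
		{opDelete, "world"},
		{opInsert, "there"},
	})) // hello [-world-] {+there+}
}
```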
func FormatDiffResultAdvanced ¶
func FormatDiffResultAdvanced(result DiffResult, opts FormatOptions) string
FormatDiffResultAdvanced formats a DiffResult preserving original spacing for Equal content. This uses position information to extract original text for Equal runs instead of reconstructing from tokens, which loses whitespace information. When opts.ShowLineNumbers is true, it tracks and displays line numbers based on SOURCE positions in the original texts.
func FormatDiffWithOptions ¶
func FormatDiffWithOptions(diffs []Diff, opts FormatOptions) string
FormatDiffWithOptions returns a formatted representation of the diff using the specified formatting options.
func FormatDiffsAdvanced ¶
func FormatDiffsAdvanced(diffs []Diff, opts FormatOptions) string
FormatDiffsAdvanced formats diffs with comprehensive options including colors, line numbers, overstrike modes, and marker repetition. This is a more feature-rich alternative to FormatDiffWithOptions.
func HasChanges ¶
func HasChanges(diffs []Diff) bool
HasChanges returns true if the diff slice contains any non-Equal operations.
func NeedsSpaceAfter ¶
func NeedsSpaceAfter(token string) bool
NeedsSpaceAfter returns true if a space should follow this token when formatting diff output. Used internally by FormatDiff.
func NeedsSpaceBefore ¶
func NeedsSpaceBefore(token string) bool
NeedsSpaceBefore returns true if a space should precede this token when formatting diff output. Used internally by FormatDiff.
func OverstrikeBold ¶
func OverstrikeBold(text string) string
OverstrikeBold returns text with overstrike bold (char\bchar for each char). This is used for printer mode to highlight inserted text.
func OverstrikeUnderline ¶
func OverstrikeUnderline(text string) string
OverstrikeUnderline returns text with overstrike underlining (_\bchar for each char). This is used for less -r mode to highlight deleted text.
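Both overstrike encodings follow directly from the patterns given above (char\bchar and _\bchar). The lowercase functions below are a hypothetical sketch; they apply the pattern to every rune, and the real functions may treat whitespace or newlines differently.

```go
package main

import (
	"fmt"
	"strings"
)

// overstrikeBold emits each rune, a backspace, then the rune again.
func overstrikeBold(text string) string {
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		b.WriteByte('\b')
		b.WriteRune(r)
	}
	return b.String()
}

// overstrikeUnderline emits an underscore, a backspace, then the rune.
func overstrikeUnderline(text string) string {
	var b strings.Builder
	for _, r := range text {
		b.WriteString("_\b")
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", overstrikeBold("ab"))      // "a\bab\bb"
	fmt.Printf("%q\n", overstrikeUnderline("ab")) // "_\ba_\bb"
}
```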
func ParseColor ¶
func ParseColor(spec string) (string, error)
ParseColor parses a color specification and returns the ANSI escape sequence. The spec can be:
- A single color name: "red" -> foreground red
- Foreground:background: "red:white" -> red text on white background
- Empty string returns empty string (no color)
Returns an error if the color name is not recognized.
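The accepted spec forms can be sketched with a lookup against small excerpts of the color tables. `parseColor` here is hypothetical; in particular, concatenating the foreground and background escapes is an assumption, and the package may emit a single combined sequence instead.

```go
package main

import (
	"fmt"
	"strings"
)

// Excerpts of the ForegroundColors and BackgroundColors tables.
var fgColors = map[string]string{"red": "\033[31m", "green": "\033[32m"}
var bgColors = map[string]string{"white": "\033[47m", "black": "\033[40m"}

// parseColor handles "", "fg", and "fg:bg".
// Assumption: the result is the fg escape followed by the bg escape.
func parseColor(spec string) (string, error) {
	if spec == "" {
		return "", nil // no color requested
	}
	fgName, bgName, hasBg := strings.Cut(spec, ":")
	fg, ok := fgColors[fgName]
	if !ok {
		return "", fmt.Errorf("unknown foreground color %q", fgName)
	}
	if !hasBg {
		return fg, nil
	}
	bg, ok := bgColors[bgName]
	if !ok {
		return "", fmt.Errorf("unknown background color %q", bgName)
	}
	return fg + bg, nil
}

func main() {
	for _, spec := range []string{"red", "red:white", "", "mauve"} {
		seq, err := parseColor(spec)
		fmt.Printf("%q -> %q, err=%v\n", spec, seq, err)
	}
}
```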
func ParseColorSpec ¶
func ParseColorSpec(spec string) (deleteColor, insertColor string, err error)
ParseColorSpec parses a color specification for diff output. The format is "delete_color,insert_color", where each color can be "fg" or "fg:bg" (e.g., "red,green" or "red:white,green:black").
If only one color is specified, it's used for deletions and the default insert color (bold green) is used for insertions.
Returns the ANSI escape sequences for delete and insert colors.
func ProcessUnifiedDiff ¶
func ProcessUnifiedDiff(input io.Reader, output io.Writer, opts Options, fmtOpts FormatOptions) error
ProcessUnifiedDiff reads a unified diff from input and applies word-level diffing to each hunk. The result is written to output with diff headers preserved and hunk content replaced with word-level diff output.
Types ¶
type Diff ¶
Diff represents a single diff operation on a token.
func AggregateDiffs ¶
func AggregateDiffs(diffs []Diff) []Diff
AggregateDiffs combines adjacent diffs of the same type into single tokens. For example, consecutive Delete operations are merged into one Delete with tokens joined appropriately (spaces between words, no spaces between punctuation/delimiters).
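A minimal sketch of the merging step, using hypothetical `diff` and `aggregate` names. It always joins merged tokens with a space, so it does not implement the punctuation-aware joining described above.

```go
package main

import "fmt"

type op int

const (
	opEqual op = iota
	opDelete
	opInsert
)

// diff is a hypothetical stand-in for one diff operation.
type diff struct {
	kind  op
	token string
}

// aggregate merges neighbouring diffs that share the same operation.
// Simplification: merged tokens are always separated by one space.
func aggregate(diffs []diff) []diff {
	var out []diff
	for _, d := range diffs {
		if n := len(out); n > 0 && out[n-1].kind == d.kind {
			out[n-1].token += " " + d.token
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	in := []diff{
		{opDelete, "a"}, {opDelete, "b"},
		{opInsert, "c"},
		{opEqual, "d"}, {opEqual, "e"},
	}
	for _, d := range aggregate(in) {
		fmt.Printf("%d %q\n", d.kind, d.token)
	}
}
```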
func ApplyMatchContext ¶
func ApplyMatchContext(diffs []Diff, minContext int) []Diff
ApplyMatchContext processes diffs to require a minimum amount of matching context between changes. A run of Equal tokens that sits between changes and is shorter than minContext is converted to Delete and Insert operations. This reduces noise from coincidental short matches between larger changes.
If minContext is 0 or negative, the diffs are returned unchanged.
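One way to read this rule is sketched below with hypothetical names. The sketch treats a run of Equal tokens as "between changes" only when it has a change on both sides, and it emits a Delete followed by an Insert for each converted token; the ordering the package actually produces is not specified here.

```go
package main

import "fmt"

type op int

const (
	opEqual op = iota
	opDelete
	opInsert
)

type diff struct {
	kind  op
	token string
}

// applyMatchContext converts short Equal runs that are sandwiched
// between changes into Delete+Insert pairs (an illustrative reading).
func applyMatchContext(diffs []diff, minContext int) []diff {
	if minContext <= 0 {
		return diffs
	}
	var out []diff
	for i := 0; i < len(diffs); {
		if diffs[i].kind != opEqual {
			out = append(out, diffs[i])
			i++
			continue
		}
		j := i // find the end of this run of Equal tokens
		for j < len(diffs) && diffs[j].kind == opEqual {
			j++
		}
		sandwiched := i > 0 && j < len(diffs)
		if sandwiched && j-i < minContext {
			for _, d := range diffs[i:j] {
				out = append(out, diff{opDelete, d.token}, diff{opInsert, d.token})
			}
		} else {
			out = append(out, diffs[i:j]...)
		}
		i = j
	}
	return out
}

func main() {
	in := []diff{
		{opDelete, "a"}, {opEqual, "the"}, {opInsert, "b"},
		{opEqual, "x"}, {opEqual, "y"},
	}
	for _, d := range applyMatchContext(in, 2) {
		fmt.Printf("%d %q\n", d.kind, d.token)
	}
}
```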
func ApplyWordDiff ¶
func ApplyWordDiff(hunk DiffHunk, opts Options) []Diff
ApplyWordDiff applies word-level diffing to a unified diff hunk. It returns the word-level diff result for the changed lines.
func DiffStrings ¶
func DiffStrings(text1, text2 string, opts Options) []Diff
DiffStrings tokenizes both strings and computes their diff.
func DiffStringsWithPreprocessing ¶
func DiffStringsWithPreprocessing(text1, text2 string, opts Options) []Diff
DiffStringsWithPreprocessing tokenizes both strings and computes their diff using histogram-based preprocessing that filters confusing tokens.
func DiffTokens ¶
func DiffTokens(tokens1, tokens2 []string) []Diff
DiffTokens computes the diff between two token slices. It uses the Myers diff algorithm via diffx.
func DiffTokensRaw ¶
func DiffTokensRaw(tokens1, tokens2 []string) []Diff
DiffTokensRaw computes the diff without semantic cleanup. Use this when you need the raw Myers diff output.
func DiffTokensWithPreprocessing ¶
func DiffTokensWithPreprocessing(tokens1, tokens2 []string) []Diff
DiffTokensWithPreprocessing computes the diff using histogram-style preprocessing. This uses diffx's histogram diff algorithm, which:
1. Filters stopwords (common words like "the", "for", "in") from anchor selection
2. Uses low-frequency tokens as anchors for divide-and-conquer
3. Produces cleaner output without spurious matches on common words
This produces readable output that groups semantically related changes together.
func EliminateStopwordAnchors ¶
func EliminateStopwordAnchors(diffs []Diff) []Diff
EliminateStopwordAnchors converts stopword Equal tokens to Delete+Insert when they appear as single tokens sandwiched between changes. Unlike ApplyMatchContext, this only affects specific stopwords, preserving meaningful single-token Equals like "support", "config", etc.
The stopword is added to both the preceding Delete run and the following Insert run, so they merge together during formatting instead of appearing as separate `[---] {+-+}` markers.
func InterleaveDiffs ¶
func InterleaveDiffs(diffs []Diff) []Diff
InterleaveDiffs reorders diffs so that Delete/Insert pairs are interleaved. When there's a sequence of Deletes followed by Inserts, this function pairs them positionally: Delete[0] Insert[0] Delete[1] Insert[1], etc. Excess Deletes or Inserts (if the counts don't match) are output at the end.
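The positional pairing can be sketched as follows, with hypothetical names. As a simplification, this version groups all Deletes and Inserts found between two Equal tokens, regardless of their original order, and appends leftover Deletes before leftover Inserts.

```go
package main

import "fmt"

type op int

const (
	opEqual op = iota
	opDelete
	opInsert
)

type diff struct {
	kind  op
	token string
}

// interleave pairs Delete[k] with Insert[k] inside each change block.
func interleave(diffs []diff) []diff {
	var out, dels, ins []diff
	flush := func() {
		n := len(dels)
		if len(ins) < n {
			n = len(ins)
		}
		for k := 0; k < n; k++ {
			out = append(out, dels[k], ins[k])
		}
		out = append(out, dels[n:]...) // excess deletes, if any
		out = append(out, ins[n:]...)  // excess inserts, if any
		dels, ins = nil, nil
	}
	for _, d := range diffs {
		switch d.kind {
		case opDelete:
			dels = append(dels, d)
		case opInsert:
			ins = append(ins, d)
		default:
			flush() // an Equal token closes the current change block
			out = append(out, d)
		}
	}
	flush()
	return out
}

func main() {
	in := []diff{
		{opDelete, "a"}, {opDelete, "b"}, {opDelete, "c"},
		{opInsert, "x"}, {opInsert, "y"},
		{opEqual, "z"},
	}
	for _, d := range interleave(in) {
		fmt.Printf("%d %q\n", d.kind, d.token)
	}
}
```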
func ShiftBoundaries ¶
func ShiftBoundaries(diffs []Diff) []Diff
ShiftBoundaries adjusts diff boundaries to create cleaner output. When a deleted token matches an adjacent equal token, the boundary is shifted.
This is a standard diff post-processing step (similar to GNU diff's shift_boundaries).
Patterns detected and shifted:
- EQUAL[...x] DELETE[x] INSERT[y] → EQUAL[...x] INSERT[y] (shift delete into equal)
- DELETE[x] INSERT[x...] EQUAL[...] → INSERT[...] EQUAL[x...] (shift common prefix)
type DiffHunk ¶
type DiffHunk struct {
// OldStart is the starting line number in the old file.
OldStart int
// OldCount is the number of lines from the old file.
OldCount int
// NewStart is the starting line number in the new file.
NewStart int
// NewCount is the number of lines in the new file.
NewCount int
// OldLines contains the removed lines (without the leading "-").
OldLines []string
// NewLines contains the added lines (without the leading "+").
NewLines []string
// ContextBefore contains context lines before the change.
ContextBefore []string
// ContextAfter contains context lines after the change.
ContextAfter []string
}
DiffHunk represents a single hunk from a unified diff.
type DiffResult ¶
type DiffResult struct {
Diffs []Diff
Text1 string // original old text
Text2 string // original new text
Positions1 []TokenPos // token positions in text1
Positions2 []TokenPos // token positions in text2
}
DiffResult contains diff output along with position information needed to reconstruct original spacing for Equal content.
func DiffStringsWithPositions ¶
func DiffStringsWithPositions(text1, text2 string, opts Options) DiffResult
DiffStringsWithPositions tokenizes and diffs strings, returning position info. This allows formatters to preserve original spacing for Equal content.
func DiffStringsWithPositionsAndPreprocessing ¶
func DiffStringsWithPositionsAndPreprocessing(text1, text2 string, opts Options) DiffResult
DiffStringsWithPositionsAndPreprocessing tokenizes and diffs strings using histogram-based preprocessing, returning position info for formatting. This allows formatters to preserve original spacing for Equal content.
type DiffStatistics ¶
type DiffStatistics struct {
OldWords int // total words in old text
NewWords int // total words in new text
DeletedWords int // words deleted (present in old but not new)
InsertedWords int // words inserted (present in new but not old)
CommonWords int // words common to both texts
}
DiffStatistics holds statistics about a diff operation.
func ComputeStatistics ¶
func ComputeStatistics(text1, text2 string, diffs []Diff, opts Options) DiffStatistics
ComputeStatistics calculates statistics for a diff.
type FormatOptions ¶
type FormatOptions struct {
// StartDelete is the string to mark the beginning of deleted text.
// Default: "[-"
StartDelete string
// StopDelete is the string to mark the end of deleted text.
// Default: "-]"
StopDelete string
// StartInsert is the string to mark the beginning of inserted text.
// Default: "{+"
StartInsert string
// StopInsert is the string to mark the end of inserted text.
// Default: "+}"
StopInsert string
// NoDeleted, when true, suppresses deleted tokens from output.
NoDeleted bool
// NoInserted, when true, suppresses inserted tokens from output.
NoInserted bool
// NoCommon, when true, suppresses unchanged tokens from output.
NoCommon bool
// UseColor enables ANSI color output. When true, DeleteColor and InsertColor
// are used instead of text markers.
UseColor bool
// DeleteColor is the ANSI escape sequence for deleted text color.
// Example: "\033[31m" for red
DeleteColor string
// InsertColor is the ANSI escape sequence for inserted text color.
// Example: "\033[32m" for green
InsertColor string
// ColorReset is the ANSI escape sequence to reset colors.
// Default: "\033[0m"
ColorReset string
// ClearToEOL is the ANSI escape sequence to clear to end of line.
// Default: "\033[K"
ClearToEOL string
// RepeatMarkers, when true, repeats markers at line boundaries for
// multi-line changes.
RepeatMarkers bool
// AggregateChanges, when true, combines adjacent changes of the same type.
AggregateChanges bool
// LessMode uses overstrike underlining for deleted text (for less -r).
LessMode bool
// PrinterMode uses overstrike bold for inserted text (for printing).
PrinterMode bool
// MatchContext is the minimum number of matching words between changes.
// Equal tokens sandwiched between changes with fewer than this many
// matches are converted to Delete+Insert pairs. 0 disables this feature.
MatchContext int
// ShowLineNumbers enables dual line number display (old:new format).
ShowLineNumbers bool
// LineNumWidth is the minimum width for line numbers. 0 means auto-calculate.
LineNumWidth int
// HeuristicSpacing uses NeedsSpaceBefore/After heuristics for spacing
// when PreserveWhitespace is false. When true, spaces are not tokens and
// spacing is determined heuristically.
HeuristicSpacing bool
}
FormatOptions configures diff output formatting.
func DefaultFormatOptions ¶
func DefaultFormatOptions() FormatOptions
DefaultFormatOptions returns FormatOptions with default settings.
type LineDiffOutput ¶
type LineDiffOutput struct {
Lines []LineDiffResult // individual line results
HasChanges bool // true if there are any differences
Statistics DiffStatistics // aggregate statistics
}
LineDiffOutput holds the results of a line-by-line diff operation.
func DiffLineByLine ¶
func DiffLineByLine(text1, text2 string, opts Options, fmtOpts FormatOptions, algorithm string, threshold float64) LineDiffOutput
DiffLineByLine compares files line by line with proper line-level diff tracking. This correctly tracks dual line numbers:
- For equal lines: both old and new line numbers increment
- For deleted lines: only old line number increments
- For inserted lines: only new line number increments
The algorithm parameter controls how deleted and inserted lines are paired:
- "best": similarity-based matching (pairs lines with highest token overlap)
- "normal" or "fast": positional matching (pairs lines by position)
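The dual line-number rules above can be sketched as a small counter loop. All names are hypothetical, and the use of 0 to mean "no line on that side" is an assumption of this sketch rather than documented behavior of LineDiffResult.

```go
package main

import "fmt"

type lineOp int

const (
	lineEqual lineOp = iota
	lineDelete
	lineInsert
)

// numbered holds the old and new line numbers for one output line.
// Assumption: 0 means the line does not exist on that side.
type numbered struct {
	oldLine, newLine int
}

// numberLines applies the increment rules for equal, deleted and
// inserted lines, starting both counters at 1.
func numberLines(ops []lineOp) []numbered {
	oldNum, newNum := 1, 1
	out := make([]numbered, 0, len(ops))
	for _, o := range ops {
		switch o {
		case lineEqual:
			out = append(out, numbered{oldNum, newNum})
			oldNum++
			newNum++
		case lineDelete:
			out = append(out, numbered{oldNum, 0})
			oldNum++
		case lineInsert:
			out = append(out, numbered{0, newNum})
			newNum++
		}
	}
	return out
}

func main() {
	ops := []lineOp{lineEqual, lineDelete, lineInsert, lineInsert, lineEqual}
	for _, n := range numberLines(ops) {
		fmt.Printf("%d:%d\n", n.oldLine, n.newLine)
	}
}
```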
type LineDiffResult ¶
type LineDiffResult struct {
OldLineNum int // line number in old file
NewLineNum int // line number in new file
HasChanges bool // true if this line contains changes
Output string // formatted output for this line
}
LineDiffResult holds diff results for a single line in line-by-line mode.
func FilterWithContext ¶
func FilterWithContext(lines []LineDiffResult, contextLines int) []LineDiffResult
FilterWithContext returns only the lines that are changes or within contextLines of a change.
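A sketch of the context filter, using a hypothetical `lineResult` type that carries only the fields needed here. Treating a negative contextLines as zero is an assumption.

```go
package main

import "fmt"

// lineResult is a trimmed, hypothetical version of LineDiffResult.
type lineResult struct {
	hasChanges bool
	output     string
}

// filterWithContext keeps changed lines plus up to contextLines
// neighbours on each side, preserving the original order.
func filterWithContext(lines []lineResult, contextLines int) []lineResult {
	if contextLines < 0 {
		contextLines = 0 // assumption: negative means no context
	}
	keep := make([]bool, len(lines))
	for i, l := range lines {
		if !l.hasChanges {
			continue
		}
		lo, hi := i-contextLines, i+contextLines
		if lo < 0 {
			lo = 0
		}
		if hi > len(lines)-1 {
			hi = len(lines) - 1
		}
		for k := lo; k <= hi; k++ {
			keep[k] = true
		}
	}
	var out []lineResult
	for i, l := range lines {
		if keep[i] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	lines := []lineResult{
		{false, "a"}, {false, "b"}, {false, "c"},
		{true, "d"},
		{false, "e"}, {false, "f"}, {false, "g"},
	}
	for _, l := range filterWithContext(lines, 1) {
		fmt.Println(l.output)
	}
}
```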
type LinePairing ¶
type LinePairing struct {
DeleteIndex int // index in deletes slice
InsertIndex int // index in inserts slice
Similarity float64 // similarity score (0.0-1.0)
}
LinePairing represents a pairing between a deleted line and an inserted line.
func FindPositionalPairings ¶
func FindPositionalPairings(deletes, inserts []string) []LinePairing
FindPositionalPairings pairs deleted and inserted lines by position. Delete[0] pairs with Insert[0], Delete[1] with Insert[1], etc. Returns pairings only up to min(len(deletes), len(inserts)).
func FindSimilarityPairings ¶
func FindSimilarityPairings(deletes, inserts []string, opts Options, threshold float64) []LinePairing
FindSimilarityPairings pairs deleted and inserted lines by content similarity. Uses a greedy algorithm: for each deleted line, find the most similar unmatched inserted line. Lines with similarity below threshold are left unpaired.
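The greedy strategy can be sketched as below. Everything here is hypothetical: in particular, the Jaccard overlap of whitespace-separated words stands in for the package's token similarity, which is computed differently.

```go
package main

import (
	"fmt"
	"strings"
)

// pairing is a hypothetical mirror of LinePairing.
type pairing struct {
	deleteIndex, insertIndex int
	similarity               float64
}

// jaccard is a stand-in similarity: shared words / all distinct words.
func jaccard(a, b string) float64 {
	setA, setB := map[string]bool{}, map[string]bool{}
	for _, t := range strings.Fields(a) {
		setA[t] = true
	}
	for _, t := range strings.Fields(b) {
		setB[t] = true
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	shared := 0
	for t := range setA {
		if setB[t] {
			shared++
		}
	}
	return float64(shared) / float64(len(setA)+len(setB)-shared)
}

// greedyPairings gives each deleted line its most similar unmatched
// inserted line, skipping matches below threshold.
func greedyPairings(deletes, inserts []string, threshold float64) []pairing {
	used := make([]bool, len(inserts))
	var out []pairing
	for di, d := range deletes {
		best, bestSim := -1, 0.0
		for ii, ins := range inserts {
			if used[ii] {
				continue
			}
			if s := jaccard(d, ins); best == -1 || s > bestSim {
				best, bestSim = ii, s
			}
		}
		if best >= 0 && bestSim >= threshold {
			used[best] = true
			out = append(out, pairing{di, best, bestSim})
		}
	}
	return out
}

func main() {
	deletes := []string{"foo bar baz", "hello world"}
	inserts := []string{"hello there world", "foo bar qux"}
	for _, p := range greedyPairings(deletes, inserts, 0.4) {
		fmt.Printf("delete %d <-> insert %d (%.2f)\n", p.deleteIndex, p.insertIndex, p.similarity)
	}
}
```

Because the loop commits to the best remaining match for each deleted line in order, the result depends on the order of deletes and is not guaranteed to be a globally optimal assignment.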
type Options ¶
type Options struct {
// Delimiters is the set of characters to treat as separate tokens.
// If empty, DefaultDelimiters is used.
// This is ignored if UsePunctuation is true.
Delimiters string
// Whitespace is the set of characters to treat as whitespace (word separators).
// If empty, DefaultWhitespace is used.
Whitespace string
// UsePunctuation, when true, uses Unicode punctuation characters as
// delimiters instead of the Delimiters string. This matches dwdiff's
// -P/--punctuation flag behavior.
UsePunctuation bool
// PreserveWhitespace, when true, includes whitespace as separate tokens.
// When false (default), whitespace is used only to separate words and
// is not included in the diff output.
PreserveWhitespace bool
// IgnoreCase, when true, performs case-insensitive comparison.
// The original case is preserved in the output.
IgnoreCase bool
}
Options configures the diff behavior.
func DefaultOptions ¶
func DefaultOptions() Options
DefaultOptions returns Options with default settings.
type TokenPos ¶
type TokenPos struct {
Start int // byte offset of token start
End int // byte offset of token end (exclusive)
}
TokenPos represents a token's position in the original text.
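Byte offsets with an exclusive end map directly onto Go string slicing, so the original text of a token is text[Start:End]. The sketch below records such offsets for whitespace-separated words; the `tokenPos` type and `positions` function are hypothetical and ignore delimiters.

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenPos is a hypothetical mirror of TokenPos.
type tokenPos struct {
	start, end int // byte offsets; end is exclusive
}

// positions returns the byte ranges of whitespace-separated words.
func positions(text string) []tokenPos {
	var out []tokenPos
	start := -1
	for i, r := range text { // i is a byte offset, not a rune index
		if unicode.IsSpace(r) {
			if start >= 0 {
				out = append(out, tokenPos{start, i})
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		out = append(out, tokenPos{start, len(text)})
	}
	return out
}

func main() {
	text := "héllo  wörld"
	for _, p := range positions(text) {
		fmt.Printf("[%d,%d) %q\n", p.start, p.end, text[p.start:p.end])
	}
}
```

Keeping offsets rather than re-joining tokens is what allows a formatter to reproduce runs of whitespace (such as the double space above) exactly.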
type UnifiedDiff ¶
type UnifiedDiff struct {
// OldFile is the name of the old file (from "---" line).
OldFile string
// NewFile is the name of the new file (from "+++" line).
NewFile string
// Hunks contains all the diff hunks.
Hunks []DiffHunk
}
UnifiedDiff represents a parsed unified diff.
func ParseUnifiedDiff ¶
func ParseUnifiedDiff(input string) ([]UnifiedDiff, error)
ParseUnifiedDiff parses a unified diff string into structured data. It handles standard unified diff format as produced by diff -u or git diff.
type WholeFileDiffResult ¶
type WholeFileDiffResult struct {
Result DiffResult // the raw diff result
Formatted string // formatted output
HasChanges bool // true if there are any differences
Statistics DiffStatistics // statistics about the diff
}
WholeFileDiffResult holds the result of a whole-file diff operation.
func DiffWholeFiles ¶
func DiffWholeFiles(text1, text2 string, opts Options, fmtOpts FormatOptions) WholeFileDiffResult
DiffWholeFiles performs a whole-file word-level diff and returns structured results. This is the main API for comparing two complete texts.