tokendiff

v0.1.1
Published: Dec 29, 2025 License: MIT Imports: 7 Imported by: 0

README

tokendiff

A Go library and CLI for token-level diffing with delimiter support.

tokendiff uses a histogram diff algorithm that groups semantically related changes together, producing more readable output than traditional Myers-based approaches for complex structural changes.

Motivation

Traditional diff tools operate at the line level. Word-based tools like wdiff improve on this but can produce suboptimal results when comparing code. For example, when comparing:

void someFunction(SomeType var)
void someFunction(SomeOtherType var)

wdiff reports that someFunction(SomeType changed to someFunction(SomeOtherType, grouping the function name with the parameter type.

tokendiff treats delimiter characters like ( as separate tokens, correctly identifying that only SomeType changed to SomeOtherType.

Algorithm

This library uses the histogram diff algorithm via diffx. The histogram algorithm is a variant of the patience diff algorithm that performs well on real-world text by:

  1. Finding unique tokens that appear exactly once in each input (strong anchors)
  2. Using frequency analysis to avoid matching common tokens that would create confusing output
  3. Recursively diffing the regions between anchors

This approach produces output that groups semantically related changes together, making diffs easier to read than traditional Myers-based algorithms when comparing files with significant structural changes.
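The first step in the list above (finding tokens that occur exactly once in each input) can be sketched in a few lines. This is only an illustration of anchor selection, not the code used by diffx; the helper name uniqueAnchors is hypothetical.

```go
package main

import "fmt"

// uniqueAnchors returns the tokens that occur exactly once in a and exactly
// once in b, in the order they appear in a. Hypothetical helper for
// illustration; it is not part of tokendiff or diffx.
func uniqueAnchors(a, b []string) []string {
	countA := map[string]int{}
	countB := map[string]int{}
	for _, t := range a {
		countA[t]++
	}
	for _, t := range b {
		countB[t]++
	}
	var anchors []string
	for _, t := range a {
		if countA[t] == 1 && countB[t] == 1 {
			anchors = append(anchors, t)
		}
	}
	return anchors
}

func main() {
	a := []string{"void", "f", "(", "int", "x", ",", "int", "y", ")"}
	b := []string{"void", "f", "(", "long", "x", ",", "int", "y", ")"}
	// "int" appears twice in a, so it is not a strong anchor.
	fmt.Println(uniqueAnchors(a, b)) // [void f ( x , y )]
}
```

A full histogram diff would then recurse into the regions between matched anchors; that part is omitted here.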

Installation

Library

go get github.com/dacharyc/tokendiff

CLI Tool

go install github.com/dacharyc/tokendiff/cmd/tokendiff@latest

CLI Usage

tokendiff [options] file1 file2
tokendiff [options] -stdin file2

Options

Input/Output:

Flag                       Description
-d "..."                   Custom delimiter characters
-P, --punctuation          Use Unicode punctuation as delimiters
-W, --white-space "..."    Custom whitespace characters
--line-mode                Compare files line by line
-C N                       Show N lines of context (implies --line-mode)
-L N, --line-numbers N     Show line numbers with width N (0 for auto)
-stdin                     Read first input from stdin
--diff-input               Read unified diff from stdin and apply token-level diff

Output Formatting:

Flag                       Description
-w "..."                   String to mark start of deleted text (default: [-)
-x "..."                   String to mark end of deleted text (default: -])
-y "..."                   String to mark start of inserted text (default: {+)
-z "..."                   String to mark end of inserted text (default: +})
-c, --color SPEC           Set colors (format: del_fg[:bg],ins_fg[:bg], or list)
--no-color                 Disable colored output
-l, --less-mode            Use overstrike for less -r viewing
-p, --printer              Use overstrike for printing
-R, --repeat-markers       Repeat markers at line boundaries
-a, --aggregate-changes    Combine adjacent insertions/deletions

Output Suppression:

Flag                       Description
-1                         Suppress deleted words
-2                         Suppress inserted words
-3                         Suppress common words

Comparison:

Flag                       Description
-i, --ignore-case          Case-insensitive comparison
-m N, --match-context N    Minimum matching words between changes

Other:

Flag                       Description
-s, --statistics           Print diff statistics
--profile NAME             Use settings from ~/.tokendiffrc.<NAME>
-v, --version              Show version
-h                         Show help

The CLI respects the NO_COLOR environment variable.

Configuration Files

tokendiff supports configuration files to set default options:

  • ~/.tokendiffrc - Default configuration (loaded automatically)
  • ~/.config/tokendiff/config - XDG-compliant location (fallback)
  • ~/.tokendiffrc.<profile> - Named profile (use with --profile)

Config file format:

# Comment
option-name
option-name=value

Example ~/.tokendiffrc.html:

# HTML output profile
start-delete=<del>
stop-delete=</del>
start-insert=<ins>
stop-insert=</ins>
no-color

Usage:

tokendiff --profile=html old.txt new.txt

Command-line options override configuration file settings.
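To make the config file format concrete, here is a minimal sketch of a parser for it. It is not tokendiff's own loader: the function name parseConfig is hypothetical, and storing bare options with an empty value is an assumption made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseConfig reads "option-name" and "option-name=value" lines, skipping
// blank lines and lines starting with "#". Bare options are stored with an
// empty value (an assumption of this sketch, not documented behavior).
func parseConfig(text string) map[string]string {
	opts := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, value, _ := strings.Cut(line, "=")
		opts[strings.TrimSpace(name)] = value
	}
	return opts
}

func main() {
	cfg := parseConfig("# HTML output profile\nstart-delete=<del>\nstop-delete=</del>\nno-color\n")
	fmt.Println(cfg["start-delete"], cfg["stop-delete"]) // <del> </del>
	_, noColor := cfg["no-color"]
	fmt.Println(noColor) // true
}
```

Command-line flags would be applied after a map like this is loaded, which is how they end up overriding file settings.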

Exit Codes

Code  Meaning
0     Files are identical
1     Files differ
2     Error occurred

Examples

# Compare two files
tokendiff old.txt new.txt

# Line-by-line with context
tokendiff --line-mode -C 3 old.go new.go

# Compare git versions
git show HEAD~1:file.go | tokendiff -stdin file.go

# Custom delimiters
tokendiff -d "(){}[]" file1.txt file2.txt

# Case-insensitive comparison with statistics
tokendiff -i -s old.txt new.txt

# HTML-style markers
tokendiff -w '<del>' -x '</del>' -y '<ins>' -z '</ins>' old.txt new.txt

# View in less with overstrike highlighting
tokendiff -l old.txt new.txt | less -r

# Apply token-level diff to a unified diff
git diff | tokendiff --diff-input
diff -u old.txt new.txt | tokendiff --diff-input

Library Usage

Basic Usage

package main

import (
    "fmt"

    "github.com/dacharyc/tokendiff"
)

func main() {
    // Named oldText/newText so that Go's built-in new is not shadowed.
    oldText := "void someFunction(SomeType var)"
    newText := "void someFunction(SomeOtherType var)"

    diffs := tokendiff.DiffStrings(oldText, newText, tokendiff.DefaultOptions())
    fmt.Println(tokendiff.FormatDiff(diffs))
    // Output: void someFunction([-SomeType-]{+SomeOtherType+} var)
}

Working with Tokens

// Tokenize text with delimiter awareness
tokens := tokendiff.Tokenize("foo(bar, baz)", tokendiff.DefaultOptions())
// tokens = ["foo", "(", "bar", ",", "baz", ")"]

// Diff pre-tokenized content (tokens1 and tokens2 are []string values,
// for example two results of Tokenize)
diffs := tokendiff.DiffTokens(tokens1, tokens2)

Custom Delimiters

opts := tokendiff.Options{
    Delimiters: "|:-",  // Custom delimiter set
}
diffs := tokendiff.DiffStrings(text1, text2, opts)

Preserving Whitespace

opts := tokendiff.Options{
    Delimiters:         tokendiff.DefaultDelimiters,
    PreserveWhitespace: true,  // Include whitespace as tokens
}

API

Types

type Operation int
const (
    Equal  Operation = iota  // Token unchanged
    Insert                   // Token was added
    Delete                   // Token was removed
)

type Diff struct {
    Type  Operation
    Token string
}

type Options struct {
    Delimiters         string  // Characters to treat as separate tokens
    Whitespace         string  // Characters to treat as whitespace
    UsePunctuation     bool    // Use Unicode punctuation as delimiters
    PreserveWhitespace bool    // Include whitespace as tokens
    IgnoreCase         bool    // Case-insensitive comparison
}

type FormatOptions struct {
    StartDelete string  // Marker for start of deleted text (default: "[-")
    StopDelete  string  // Marker for end of deleted text (default: "-]")
    StartInsert string  // Marker for start of inserted text (default: "{+")
    StopInsert  string  // Marker for end of inserted text (default: "+}")
    NoDeleted   bool    // Suppress deleted tokens
    NoInserted  bool    // Suppress inserted tokens
    NoCommon    bool    // Suppress unchanged tokens
}

Functions

Tokenizing and Diffing:

  • Tokenize(text string, opts Options) []string - Split text into tokens
  • DiffTokens(tokens1, tokens2 []string) []Diff - Diff two token slices
  • DiffStrings(text1, text2 string, opts Options) []Diff - Tokenize and diff two strings
  • DefaultOptions() Options - Get default options

Diff Transformations:

  • AggregateDiffs(diffs []Diff) []Diff - Combine adjacent same-type operations
  • ApplyMatchContext(diffs []Diff, minContext int) []Diff - Require minimum matching words between changes

Formatting:

  • FormatDiff(diffs []Diff) string - Format diff with default markers
  • FormatDiffWithOptions(diffs []Diff, opts FormatOptions) string - Format with custom markers
  • DefaultFormatOptions() FormatOptions - Get default format options
  • HasChanges(diffs []Diff) bool - Check if diff contains any changes
  • NeedsSpaceBefore(token string) bool - Check if space should precede token
  • NeedsSpaceAfter(token string) bool - Check if space should follow token

Unified Diff Parsing:

  • ParseUnifiedDiff(input string) ([]UnifiedDiff, error) - Parse unified diff format
  • ApplyWordDiff(hunk DiffHunk, opts Options) []Diff - Apply token-level diff to a hunk

Default Delimiters

(){}[]<>,.;:!?"'`@#$%^&*+-=/\|~

Performance

Benchmarks on Apple M1:

BenchmarkTokenize      ~2.5 µs/op
BenchmarkDiffStrings   ~10.5 µs/op

License

MIT

Documentation

Overview

Package tokendiff provides token-level diffing with delimiter support.

Unlike traditional line-based diff tools, tokendiff operates at the token level and treats configurable delimiter characters as separate tokens. This allows for more precise diffs when comparing code or structured text.

For example, when comparing:

someFunction(SomeType var)
someFunction(SomeOtherType var)

A line-based diff would show the entire line changed. A word-based diff without delimiter awareness might show "someFunction(SomeType" changed to "someFunction(SomeOtherType". But tokendiff correctly identifies that only "SomeType" changed to "SomeOtherType" because it treats "(" as a delimiter.

This package uses github.com/dacharyc/diffx to provide this functionality.

Constants

const (
	ANSIReset       = "\033[0m"
	ANSIClearEOL    = "\033[K"
	ANSIDeleteColor = "\033[0;31;1m" // bold red
	ANSIInsertColor = "\033[0;32;1m" // bold green
	ANSIBold        = "\033[1m"
)

ANSI escape code constants

const DefaultDelimiters = ""

DefaultDelimiters contains the default set of delimiter characters. These are characters that are treated as separate tokens even when not surrounded by whitespace. NOTE: Original dwdiff has NO default delimiters (empty string). Words are split only on whitespace unless -d or -P is specified.

const DefaultWhitespace = " \t\n\r"

DefaultWhitespace contains the default set of whitespace characters.

Variables

var BackgroundColors = map[string]string{
	"black":         "\033[40m",
	"red":           "\033[41m",
	"green":         "\033[42m",
	"yellow":        "\033[43m",
	"blue":          "\033[44m",
	"magenta":       "\033[45m",
	"cyan":          "\033[46m",
	"white":         "\033[47m",
	"brightblack":   "\033[100m",
	"brightred":     "\033[101m",
	"brightgreen":   "\033[102m",
	"brightyellow":  "\033[103m",
	"brightblue":    "\033[104m",
	"brightmagenta": "\033[105m",
	"brightcyan":    "\033[106m",
	"brightwhite":   "\033[107m",
}

BackgroundColors maps color names to ANSI background escape codes.

var ForegroundColors = map[string]string{
	"black":         "\033[30m",
	"red":           "\033[31m",
	"green":         "\033[32m",
	"yellow":        "\033[33m",
	"blue":          "\033[34m",
	"magenta":       "\033[35m",
	"cyan":          "\033[36m",
	"white":         "\033[37m",
	"brightblack":   "\033[90m",
	"brightred":     "\033[91m",
	"brightgreen":   "\033[92m",
	"brightyellow":  "\033[93m",
	"brightblue":    "\033[94m",
	"brightmagenta": "\033[95m",
	"brightcyan":    "\033[96m",
	"brightwhite":   "\033[97m",
}

ForegroundColors maps color names to ANSI foreground escape codes.

Functions

func ColorCode

func ColorCode(fg, bg string, bold bool) (string, error)

ColorCode builds an ANSI escape sequence from component parts. fg is the foreground color name (or empty for default). bg is the background color name (or empty for none). bold adds the bold attribute if true. Returns an error if any color name is not recognized.

func ColorNames

func ColorNames() []string

ColorNames returns a list of all available color names.

func ComputeTokenSimilarity

func ComputeTokenSimilarity(text1, text2 string, opts Options) float64

ComputeTokenSimilarity calculates similarity between two strings based on shared tokens. Returns a value between 0.0 (no similarity) and 1.0 (identical). Similarity is computed as the ratio of Equal tokens to total diff operations.
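As a rough illustration of the ratio described above, the sketch below computes it from an already-computed diff slice. The Diff and Operation types are re-declared locally so the example is self-contained; the function name similarity and the handling of an empty diff are assumptions, not the library's behavior.

```go
package main

import "fmt"

type Operation int

const (
	Equal Operation = iota
	Insert
	Delete
)

type Diff struct {
	Type  Operation
	Token string
}

// similarity returns the share of Equal operations among all operations.
// Assumption: an empty diff (two empty inputs) counts as identical.
func similarity(diffs []Diff) float64 {
	if len(diffs) == 0 {
		return 1.0
	}
	equal := 0
	for _, d := range diffs {
		if d.Type == Equal {
			equal++
		}
	}
	return float64(equal) / float64(len(diffs))
}

func main() {
	diffs := []Diff{
		{Equal, "void"}, {Equal, "someFunction"}, {Equal, "("},
		{Delete, "SomeType"}, {Insert, "SomeOtherType"},
		{Equal, "var"}, {Equal, ")"},
	}
	fmt.Printf("%.3f\n", similarity(diffs)) // 0.714 (5 of 7 operations are Equal)
}
```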

func DiscardConfusingTokens

func DiscardConfusingTokens(tokens1, tokens2 []string) (filtered1, filtered2 []string, map1, map2 []int)

DiscardConfusingTokens filters tokens that appear too frequently, which would create many spurious match points during diff calculation. Returns filtered token slices and index maps back to original positions.

Algorithm (inspired by GNU diff's discard_confusing_lines):

  1. Count occurrences of each token in the OTHER file (equiv_count)
  2. Mark tokens:
       • equiv_count == 0: definitely discard (can't match anything)
       • equiv_count > √n: provisionally discard (too many potential matches)
       • else: keep for matching
  3. Apply provisional discard rules:
       • Provisional tokens are kept only if they form runs with non-provisional endpoints AND at least 25% of the run is provisional

Note: Filtered tokens are still included in the final diff output - they're just excluded from the LCS matching to prevent spurious anchoring.
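The counting and marking steps (1 and 2) might look roughly like the sketch below. It is not the package's implementation: the names classify and mark are hypothetical, and because the description does not say what n is, the sketch assumes n is the number of tokens in the other input. Step 3 (the run-based rule for provisional tokens) is left out.

```go
package main

import (
	"fmt"
	"math"
)

type mark int

const (
	keep        mark = iota // usable for matching
	discard                 // never occurs in the other input
	provisional             // occurs too often in the other input
)

// classify marks each token by how often it occurs in other.
// Assumption: the √n threshold uses n = len(other).
func classify(tokens, other []string) []mark {
	counts := map[string]int{}
	for _, t := range other {
		counts[t]++
	}
	limit := math.Sqrt(float64(len(other)))
	marks := make([]mark, len(tokens))
	for i, t := range tokens {
		c := counts[t]
		switch {
		case c == 0:
			marks[i] = discard
		case float64(c) > limit:
			marks[i] = provisional
		default:
			marks[i] = keep
		}
	}
	return marks
}

func main() {
	other := []string{"a", "a", "a", "a", "b", "c", "d", "e", "f"} // n = 9, √n = 3
	for i, m := range classify([]string{"a", "b", "z"}, other) {
		fmt.Println(i, m == keep, m == discard, m == provisional)
	}
}
```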

func FormatDiff

func FormatDiff(diffs []Diff) string

FormatDiff returns a human-readable representation of the diff. Deleted tokens are wrapped in [-...-] and inserted tokens in {+...+}.

func FormatDiffResultAdvanced

func FormatDiffResultAdvanced(result DiffResult, opts FormatOptions) string

FormatDiffResultAdvanced formats a DiffResult preserving original spacing for Equal content. This uses position information to extract original text for Equal runs instead of reconstructing from tokens, which loses whitespace information. When opts.ShowLineNumbers is true, it tracks and displays line numbers based on SOURCE positions in the original texts.

func FormatDiffWithOptions

func FormatDiffWithOptions(diffs []Diff, opts FormatOptions) string

FormatDiffWithOptions returns a formatted representation of the diff using the specified formatting options.

func FormatDiffsAdvanced

func FormatDiffsAdvanced(diffs []Diff, opts FormatOptions) string

FormatDiffsAdvanced formats diffs with comprehensive options including colors, line numbers, overstrike modes, and marker repetition. This is a more feature-rich alternative to FormatDiffWithOptions.

func HasChanges

func HasChanges(diffs []Diff) bool

HasChanges returns true if the diff slice contains any non-Equal operations.

func NeedsSpaceAfter

func NeedsSpaceAfter(token string) bool

NeedsSpaceAfter returns true if a space should follow this token when formatting diff output. Used internally by FormatDiff.

func NeedsSpaceBefore

func NeedsSpaceBefore(token string) bool

NeedsSpaceBefore returns true if a space should precede this token when formatting diff output. Used internally by FormatDiff.

func OverstrikeBold

func OverstrikeBold(text string) string

OverstrikeBold returns text with overstrike bold (char\bchar for each char). This is used for printer mode to highlight inserted text.

func OverstrikeUnderline

func OverstrikeUnderline(text string) string

OverstrikeUnderline returns text with overstrike underlining (_\bchar for each char). This is used for less -r mode to highlight deleted text.
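Both overstrike encodings are simple enough to sketch directly from their descriptions. The functions below are illustrative stand-ins with lower-case names, not the package's own code; in particular, whether the real functions skip whitespace or newlines is not stated, so this sketch applies the encoding to every rune.

```go
package main

import (
	"fmt"
	"strings"
)

// overstrikeBold emits char, backspace, char for each rune.
func overstrikeBold(text string) string {
	var b strings.Builder
	for _, r := range text {
		b.WriteRune(r)
		b.WriteByte('\b')
		b.WriteRune(r)
	}
	return b.String()
}

// overstrikeUnderline emits underscore, backspace, char for each rune.
func overstrikeUnderline(text string) string {
	var b strings.Builder
	for _, r := range text {
		b.WriteString("_\b")
		b.WriteRune(r)
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", overstrikeBold("ab"))      // "a\bab\bb"
	fmt.Printf("%q\n", overstrikeUnderline("ab")) // "_\ba_\bb"
}
```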

func ParseColor

func ParseColor(spec string) (string, error)

ParseColor parses a color specification and returns the ANSI escape sequence. The spec can be:

  • A single color name: "red" -> foreground red
  • Foreground:background: "red:white" -> red text on white background
  • Empty string returns empty string (no color)

Returns an error if the color name is not recognized.

func ParseColorSpec

func ParseColorSpec(spec string) (deleteColor, insertColor string, err error)

ParseColorSpec parses a color specification for diff output. The format is: "delete_color,insert_color" where each color can be "fg" or "fg:bg" (e.g., "red,green" or "red:white,green:black").

If only one color is specified, it's used for deletions and the default insert color (bold green) is used for insertions.

Returns the ANSI escape sequences for delete and insert colors.
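The two-level format ("delete,insert", each part "fg" or "fg:bg") can be illustrated with a small parser. This is a sketch under assumptions rather than the package's code: it uses a reduced color table copied from the maps above, and it simply concatenates the foreground and background sequences, which may differ from how the real functions build the escape code.

```go
package main

import (
	"fmt"
	"strings"
)

// Reduced color tables for the sketch; values match the maps documented above.
var fgColors = map[string]string{"red": "\033[31m", "green": "\033[32m", "white": "\033[37m"}
var bgColors = map[string]string{"black": "\033[40m", "white": "\033[47m"}

// parseColor handles "", "fg" and "fg:bg".
func parseColor(spec string) (string, error) {
	if spec == "" {
		return "", nil
	}
	fgName, bgName, hasBG := strings.Cut(spec, ":")
	code, ok := fgColors[fgName]
	if !ok {
		return "", fmt.Errorf("unknown foreground color %q", fgName)
	}
	if hasBG {
		bg, ok := bgColors[bgName]
		if !ok {
			return "", fmt.Errorf("unknown background color %q", bgName)
		}
		code += bg // assumption: sequences are concatenated
	}
	return code, nil
}

// parseColorSpec handles "delete_color[,insert_color]".
func parseColorSpec(spec string) (del, ins string, err error) {
	delSpec, insSpec, hasIns := strings.Cut(spec, ",")
	if del, err = parseColor(delSpec); err != nil {
		return "", "", err
	}
	ins = "\033[0;32;1m" // default insert color: bold green (ANSIInsertColor)
	if hasIns {
		if ins, err = parseColor(insSpec); err != nil {
			return "", "", err
		}
	}
	return del, ins, nil
}

func main() {
	del, ins, err := parseColorSpec("red:white,green")
	fmt.Printf("%q %q %v\n", del, ins, err)
}
```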

func ProcessUnifiedDiff

func ProcessUnifiedDiff(input io.Reader, output io.Writer, opts Options, fmtOpts FormatOptions) error

ProcessUnifiedDiff reads a unified diff from input and applies word-level diffing to each hunk. The result is written to output with diff headers preserved and hunk content replaced with word-level diff output.

func Tokenize

func Tokenize(text string, opts Options) []string

Tokenize splits text into tokens, treating delimiters as separate tokens. Whitespace separates words but is not included in output unless PreserveWhitespace is true.
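The splitting rule can be sketched as a single pass over the input. The function below is a simplified stand-in, not the package's Tokenize: it takes the delimiter and whitespace sets as plain strings, has no PreserveWhitespace or UsePunctuation handling, and its name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize splits text on whitespace and emits each delimiter rune as its
// own token. Whitespace itself is never emitted.
func tokenize(text, delimiters, whitespace string) []string {
	var tokens []string
	var word strings.Builder
	flush := func() {
		if word.Len() > 0 {
			tokens = append(tokens, word.String())
			word.Reset()
		}
	}
	for _, r := range text {
		switch {
		case strings.ContainsRune(whitespace, r):
			flush()
		case strings.ContainsRune(delimiters, r):
			flush()
			tokens = append(tokens, string(r))
		default:
			word.WriteRune(r)
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Printf("%q\n", tokenize("foo(bar, baz)", "(),", " \t\n\r"))
	// ["foo" "(" "bar" "," "baz" ")"]
	fmt.Printf("%q\n", tokenize("foo(bar, baz)", "", " \t\n\r"))
	// ["foo(bar," "baz)"]
}
```

The second call shows what an empty delimiter set does: words are split on whitespace only.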

Types

type Diff

type Diff struct {
	Type  Operation
	Token string
}

Diff represents a single diff operation on a token.

func AggregateDiffs

func AggregateDiffs(diffs []Diff) []Diff

AggregateDiffs combines adjacent diffs of the same type into single tokens. For example, consecutive Delete operations are merged into one Delete with tokens joined appropriately (spaces between words, no spaces between punctuation/delimiters).

func ApplyMatchContext

func ApplyMatchContext(diffs []Diff, minContext int) []Diff

ApplyMatchContext processes diffs to require minimum context between changes. Equal tokens that appear between changes with fewer than minContext matches are converted to both Delete and Insert operations. This reduces noise from coincidental short matches between larger changes.

If minContext is 0 or negative, the diffs are returned unchanged.
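One way to read the rule above is sketched below: a run of Equal tokens that has changes on both sides and is shorter than minContext is rewritten as Delete plus Insert for each token. This is an interpretation for illustration only. The types are re-declared locally, and the output ordering (Delete then Insert per token) is an assumption; the real function may group the converted tokens differently.

```go
package main

import "fmt"

type Operation int

const (
	Equal Operation = iota
	Insert
	Delete
)

type Diff struct {
	Type  Operation
	Token string
}

// applyMatchContext converts short Equal runs that sit between two changes
// into Delete+Insert pairs. Leading and trailing Equal runs are kept.
func applyMatchContext(diffs []Diff, minContext int) []Diff {
	if minContext <= 0 {
		return diffs
	}
	var out []Diff
	for i := 0; i < len(diffs); {
		if diffs[i].Type != Equal {
			out = append(out, diffs[i])
			i++
			continue
		}
		j := i
		for j < len(diffs) && diffs[j].Type == Equal {
			j++
		}
		// diffs[i:j] is a maximal Equal run.
		between := i > 0 && j < len(diffs)
		if between && j-i < minContext {
			for _, d := range diffs[i:j] {
				out = append(out,
					Diff{Type: Delete, Token: d.Token},
					Diff{Type: Insert, Token: d.Token})
			}
		} else {
			out = append(out, diffs[i:j]...)
		}
		i = j
	}
	return out
}

func main() {
	diffs := []Diff{
		{Delete, "a"}, {Equal, "the"}, {Insert, "b"}, {Equal, "x"}, {Equal, "y"},
	}
	for _, d := range applyMatchContext(diffs, 2) {
		fmt.Println(d.Type, d.Token)
	}
}
```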

func ApplyWordDiff

func ApplyWordDiff(hunk DiffHunk, opts Options) []Diff

ApplyWordDiff applies word-level diffing to a unified diff hunk. It returns the word-level diff result for the changed lines.

func DiffStrings

func DiffStrings(text1, text2 string, opts Options) []Diff

DiffStrings tokenizes both strings and computes their diff.

func DiffStringsWithPreprocessing

func DiffStringsWithPreprocessing(text1, text2 string, opts Options) []Diff

DiffStringsWithPreprocessing tokenizes both strings and computes their diff using histogram-based preprocessing that filters confusing tokens.

func DiffTokens

func DiffTokens(tokens1, tokens2 []string) []Diff

DiffTokens computes the diff between two token slices. It uses the Myers diff algorithm via diffx.

func DiffTokensRaw

func DiffTokensRaw(tokens1, tokens2 []string) []Diff

DiffTokensRaw computes the diff without semantic cleanup. Use this when you need the raw Myers diff output.

func DiffTokensWithPreprocessing

func DiffTokensWithPreprocessing(tokens1, tokens2 []string) []Diff

DiffTokensWithPreprocessing computes the diff using histogram-style preprocessing. This uses diffx's histogram diff algorithm which:

  1. Filters stopwords (common words like "the", "for", "in") from anchor selection
  2. Uses low-frequency tokens as anchors for divide-and-conquer
  3. Produces cleaner output without spurious matches on common words

This produces readable output that groups semantically related changes together.

func EliminateStopwordAnchors

func EliminateStopwordAnchors(diffs []Diff) []Diff

EliminateStopwordAnchors converts stopword Equal tokens to Delete+Insert when they appear as single tokens sandwiched between changes. Unlike ApplyMatchContext, this only affects specific stopwords, preserving meaningful single-token Equals like "support", "config", etc.

The stopword is added to both the preceding Delete run and the following Insert run, so they merge together during formatting instead of appearing as separate `[---] {+-+}` markers.

func InterleaveDiffs

func InterleaveDiffs(diffs []Diff) []Diff

InterleaveDiffs reorders diffs so that Delete/Insert pairs are interleaved. When there's a sequence of Deletes followed by Inserts, this function pairs them positionally: Delete[0] Insert[0] Delete[1] Insert[1], etc. Excess Deletes or Inserts (if the counts don't match) are output at the end.
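The positional pairing described above can be sketched as follows. The types are re-declared locally and the function name interleave is hypothetical; this is an illustration of the reordering, not the package's code.

```go
package main

import "fmt"

type Operation int

const (
	Equal Operation = iota
	Insert
	Delete
)

type Diff struct {
	Type  Operation
	Token string
}

// interleave rewrites each "Deletes followed by Inserts" block as
// Delete[0] Insert[0] Delete[1] Insert[1] ..., then any leftovers.
func interleave(diffs []Diff) []Diff {
	var out []Diff
	for i := 0; i < len(diffs); {
		if diffs[i].Type != Delete {
			out = append(out, diffs[i])
			i++
			continue
		}
		d := i
		for d < len(diffs) && diffs[d].Type == Delete {
			d++
		}
		n := d
		for n < len(diffs) && diffs[n].Type == Insert {
			n++
		}
		dels, inss := diffs[i:d], diffs[d:n]
		k := 0
		for ; k < len(dels) && k < len(inss); k++ {
			out = append(out, dels[k], inss[k])
		}
		out = append(out, dels[k:]...) // excess Deletes, if any
		out = append(out, inss[k:]...) // excess Inserts, if any
		i = n
	}
	return out
}

func main() {
	diffs := []Diff{
		{Delete, "a"}, {Delete, "b"},
		{Insert, "x"}, {Insert, "y"}, {Insert, "z"},
		{Equal, "e"},
	}
	for _, d := range interleave(diffs) {
		fmt.Print(d.Token, " ")
	}
	fmt.Println() // a x b y z e
}
```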

func ShiftBoundaries

func ShiftBoundaries(diffs []Diff) []Diff

ShiftBoundaries adjusts diff boundaries to create cleaner output. When a deleted token matches an adjacent equal token, shift the boundary.

This is a standard diff post-processing step (similar to GNU diff's shift_boundaries).

Patterns detected and shifted:

  • EQUAL[...x] DELETE[x] INSERT[y] → EQUAL[...x] INSERT[y] (shift delete into equal)
  • DELETE[x] INSERT[x...] EQUAL[...] → INSERT[...] EQUAL[x...] (shift common prefix)

type DiffHunk

type DiffHunk struct {
	// OldStart is the starting line number in the old file.
	OldStart int
	// OldCount is the number of lines from the old file.
	OldCount int
	// NewStart is the starting line number in the new file.
	NewStart int
	// NewCount is the number of lines in the new file.
	NewCount int
	// OldLines contains the removed lines (without the leading "-").
	OldLines []string
	// NewLines contains the added lines (without the leading "+").
	NewLines []string
	// ContextBefore contains context lines before the change.
	ContextBefore []string
	// ContextAfter contains context lines after the change.
	ContextAfter []string
}

DiffHunk represents a single hunk from a unified diff.

type DiffResult

type DiffResult struct {
	Diffs      []Diff
	Text1      string     // original old text
	Text2      string     // original new text
	Positions1 []TokenPos // token positions in text1
	Positions2 []TokenPos // token positions in text2
}

DiffResult contains diff output along with position information needed to reconstruct original spacing for Equal content.

func DiffStringsWithPositions

func DiffStringsWithPositions(text1, text2 string, opts Options) DiffResult

DiffStringsWithPositions tokenizes and diffs strings, returning position info. This allows formatters to preserve original spacing for Equal content.

func DiffStringsWithPositionsAndPreprocessing

func DiffStringsWithPositionsAndPreprocessing(text1, text2 string, opts Options) DiffResult

DiffStringsWithPositionsAndPreprocessing tokenizes and diffs strings using histogram-based preprocessing, returning position info for formatting. This allows formatters to preserve original spacing for Equal content.

type DiffStatistics

type DiffStatistics struct {
	OldWords      int // total words in old text
	NewWords      int // total words in new text
	DeletedWords  int // words deleted (present in old but not new)
	InsertedWords int // words inserted (present in new but not old)
	CommonWords   int // words common to both texts
}

DiffStatistics holds statistics about a diff operation.

func ComputeStatistics

func ComputeStatistics(text1, text2 string, diffs []Diff, opts Options) DiffStatistics

ComputeStatistics calculates statistics for a diff.

type FormatOptions

type FormatOptions struct {
	// StartDelete is the string to mark the beginning of deleted text.
	// Default: "[-"
	StartDelete string

	// StopDelete is the string to mark the end of deleted text.
	// Default: "-]"
	StopDelete string

	// StartInsert is the string to mark the beginning of inserted text.
	// Default: "{+"
	StartInsert string

	// StopInsert is the string to mark the end of inserted text.
	// Default: "+}"
	StopInsert string

	// NoDeleted, when true, suppresses deleted tokens from output.
	NoDeleted bool

	// NoInserted, when true, suppresses inserted tokens from output.
	NoInserted bool

	// NoCommon, when true, suppresses unchanged tokens from output.
	NoCommon bool

	// UseColor enables ANSI color output. When true, DeleteColor and InsertColor
	// are used instead of text markers.
	UseColor bool

	// DeleteColor is the ANSI escape sequence for deleted text color.
	// Example: "\033[31m" for red
	DeleteColor string

	// InsertColor is the ANSI escape sequence for inserted text color.
	// Example: "\033[32m" for green
	InsertColor string

	// ColorReset is the ANSI escape sequence to reset colors.
	// Default: "\033[0m"
	ColorReset string

	// ClearToEOL is the ANSI escape sequence to clear to end of line.
	// Default: "\033[K"
	ClearToEOL string

	// RepeatMarkers, when true, repeats markers at line boundaries for
	// multi-line changes.
	RepeatMarkers bool

	// AggregateChanges, when true, combines adjacent changes of the same type.
	AggregateChanges bool

	// LessMode uses overstrike underlining for deleted text (for less -r).
	LessMode bool

	// PrinterMode uses overstrike bold for inserted text (for printing).
	PrinterMode bool

	// MatchContext is the minimum number of matching words between changes.
	// Equal tokens sandwiched between changes with fewer than this many
	// matches are converted to Delete+Insert pairs. 0 disables this feature.
	MatchContext int

	// ShowLineNumbers enables dual line number display (old:new format).
	ShowLineNumbers bool

	// LineNumWidth is the minimum width for line numbers. 0 means auto-calculate.
	LineNumWidth int

	// HeuristicSpacing uses NeedsSpaceBefore/After heuristics for spacing
	// when PreserveWhitespace is false. When true, spaces are not tokens and
	// spacing is determined heuristically.
	HeuristicSpacing bool
}

FormatOptions configures diff output formatting.

func DefaultFormatOptions

func DefaultFormatOptions() FormatOptions

DefaultFormatOptions returns FormatOptions with default settings.

type LineDiffOutput

type LineDiffOutput struct {
	Lines      []LineDiffResult // individual line results
	HasChanges bool             // true if there are any differences
	Statistics DiffStatistics   // aggregate statistics
}

LineDiffOutput holds the results of a line-by-line diff operation.

func DiffLineByLine

func DiffLineByLine(text1, text2 string, opts Options, fmtOpts FormatOptions, algorithm string, threshold float64) LineDiffOutput

DiffLineByLine compares files line by line with proper line-level diff tracking. This correctly tracks dual line numbers:

  • For equal lines: both old and new line numbers increment
  • For deleted lines: only old line number increments
  • For inserted lines: only new line number increments

The algorithm parameter controls how deleted and inserted lines are paired:

  • "best": similarity-based matching (pairs lines with highest token overlap)
  • "normal" or "fast": positional matching (pairs lines by position)

type LineDiffResult

type LineDiffResult struct {
	OldLineNum int    // line number in old file
	NewLineNum int    // line number in new file
	HasChanges bool   // true if this line contains changes
	Output     string // formatted output for this line
}

LineDiffResult holds diff results for a single line in line-by-line mode.

func FilterWithContext

func FilterWithContext(lines []LineDiffResult, contextLines int) []LineDiffResult

FilterWithContext returns only the lines that are changes or within contextLines of a change.
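A context filter of this kind is commonly written as a two-pass loop: mark every line within contextLines of a change, then collect the marked lines. The sketch below follows that pattern with a local copy of LineDiffResult; it is an assumption about the approach, not the package's implementation.

```go
package main

import "fmt"

type LineDiffResult struct {
	OldLineNum int
	NewLineNum int
	HasChanges bool
	Output     string
}

// filterWithContext keeps changed lines plus up to contextLines unchanged
// lines on each side of every change.
func filterWithContext(lines []LineDiffResult, contextLines int) []LineDiffResult {
	keep := make([]bool, len(lines))
	for i, l := range lines {
		if !l.HasChanges {
			continue
		}
		lo := i - contextLines
		if lo < 0 {
			lo = 0
		}
		hi := i + contextLines
		if hi > len(lines)-1 {
			hi = len(lines) - 1
		}
		for k := lo; k <= hi; k++ {
			keep[k] = true
		}
	}
	var out []LineDiffResult
	for i, l := range lines {
		if keep[i] {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	var lines []LineDiffResult
	for n := 1; n <= 7; n++ {
		lines = append(lines, LineDiffResult{OldLineNum: n, NewLineNum: n, HasChanges: n == 4})
	}
	for _, l := range filterWithContext(lines, 1) {
		fmt.Println(l.OldLineNum, l.HasChanges)
	}
}
```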

type LinePairing

type LinePairing struct {
	DeleteIndex int     // index in deletes slice
	InsertIndex int     // index in inserts slice
	Similarity  float64 // similarity score (0.0-1.0)
}

LinePairing represents a pairing between a deleted line and an inserted line.

func FindPositionalPairings

func FindPositionalPairings(deletes, inserts []string) []LinePairing

FindPositionalPairings pairs deleted and inserted lines by position. Delete[0] pairs with Insert[0], Delete[1] with Insert[1], etc. Returns pairings only up to min(len(deletes), len(inserts)).

func FindSimilarityPairings

func FindSimilarityPairings(deletes, inserts []string, opts Options, threshold float64) []LinePairing

FindSimilarityPairings pairs deleted and inserted lines by content similarity. Uses a greedy algorithm: for each deleted line, find the most similar unmatched inserted line. Lines with similarity below threshold are left unpaired.
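The greedy strategy can be illustrated with a stand-in similarity measure. The sketch below uses Jaccard overlap of whitespace-separated words purely as an assumption so that the example is self-contained; the package describes its own measure under ComputeTokenSimilarity. The function names and the tie-breaking (first best match wins) are also assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

type LinePairing struct {
	DeleteIndex int
	InsertIndex int
	Similarity  float64
}

// jaccard is a stand-in similarity: shared words / distinct words overall.
func jaccard(a, b string) float64 {
	setA, setB := map[string]bool{}, map[string]bool{}
	for _, t := range strings.Fields(a) {
		setA[t] = true
	}
	for _, t := range strings.Fields(b) {
		setB[t] = true
	}
	if len(setA) == 0 && len(setB) == 0 {
		return 1
	}
	shared := 0
	for t := range setA {
		if setB[t] {
			shared++
		}
	}
	return float64(shared) / float64(len(setA)+len(setB)-shared)
}

// findSimilarityPairings greedily pairs each deleted line with the most
// similar unmatched inserted line, skipping matches below threshold.
func findSimilarityPairings(deletes, inserts []string, threshold float64) []LinePairing {
	used := make([]bool, len(inserts))
	var pairs []LinePairing
	for di, d := range deletes {
		best, bestSim := -1, threshold
		for ii, ins := range inserts {
			if used[ii] {
				continue
			}
			sim := jaccard(d, ins)
			if (best == -1 && sim >= bestSim) || (best != -1 && sim > bestSim) {
				best, bestSim = ii, sim
			}
		}
		if best >= 0 {
			used[best] = true
			pairs = append(pairs, LinePairing{DeleteIndex: di, InsertIndex: best, Similarity: bestSim})
		}
	}
	return pairs
}

func main() {
	deletes := []string{"x := 1", "return y"}
	inserts := []string{"return z", "x := 2"}
	for _, p := range findSimilarityPairings(deletes, inserts, 0.3) {
		fmt.Printf("delete %d -> insert %d (%.2f)\n", p.DeleteIndex, p.InsertIndex, p.Similarity)
	}
}
```

Because the loop is greedy, an early deleted line can claim an inserted line that would have been a better match for a later one; that is the trade-off against a globally optimal assignment.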

type Operation

type Operation int

Operation represents a diff operation type.

const (
	// Equal indicates the token is unchanged.
	Equal Operation = iota
	// Insert indicates the token was added.
	Insert
	// Delete indicates the token was removed.
	Delete
)

func (Operation) String

func (o Operation) String() string

String returns a human-readable representation of the operation.

type Options

type Options struct {
	// Delimiters is the set of characters to treat as separate tokens.
	// If empty, DefaultDelimiters is used.
	// This is ignored if UsePunctuation is true.
	Delimiters string

	// Whitespace is the set of characters to treat as whitespace (word separators).
	// If empty, DefaultWhitespace is used.
	Whitespace string

	// UsePunctuation, when true, uses Unicode punctuation characters as
	// delimiters instead of the Delimiters string. This matches dwdiff's
	// -P/--punctuation flag behavior.
	UsePunctuation bool

	// PreserveWhitespace, when true, includes whitespace as separate tokens.
	// When false (default), whitespace is used only to separate words and
	// is not included in the diff output.
	PreserveWhitespace bool

	// IgnoreCase, when true, performs case-insensitive comparison.
	// The original case is preserved in the output.
	IgnoreCase bool
}

Options configures the diff behavior.

func DefaultOptions

func DefaultOptions() Options

DefaultOptions returns Options with default settings.

type TokenPos

type TokenPos struct {
	Start int // byte offset of token start
	End   int // byte offset of token end (exclusive)
}

TokenPos represents a token's position in the original text.

func TokenizeWithPositions

func TokenizeWithPositions(text string, opts Options) ([]string, []TokenPos)

TokenizeWithPositions splits text into tokens and tracks their positions. This allows reconstructing original spacing for Equal content in diffs.

type UnifiedDiff

type UnifiedDiff struct {
	// OldFile is the name of the old file (from "---" line).
	OldFile string
	// NewFile is the name of the new file (from "+++" line).
	NewFile string
	// Hunks contains all the diff hunks.
	Hunks []DiffHunk
}

UnifiedDiff represents a parsed unified diff.

func ParseUnifiedDiff

func ParseUnifiedDiff(input string) ([]UnifiedDiff, error)

ParseUnifiedDiff parses a unified diff string into structured data. It handles standard unified diff format as produced by diff -u or git diff.

type WholeFileDiffResult

type WholeFileDiffResult struct {
	Result     DiffResult     // the raw diff result
	Formatted  string         // formatted output
	HasChanges bool           // true if there are any differences
	Statistics DiffStatistics // statistics about the diff
}

WholeFileDiffResult holds the result of a whole-file diff operation.

func DiffWholeFiles

func DiffWholeFiles(text1, text2 string, opts Options, fmtOpts FormatOptions) WholeFileDiffResult

DiffWholeFiles performs a whole-file word-level diff and returns structured results. This is the main API for comparing two complete texts.

Directories

Path            Synopsis
cmd/tokendiff   Command tokendiff performs token-level diffs with delimiter support.