Milestone [1] Core Configuration and Rule Engine #1

Closed
mcode-app[bot] wants to merge 3 commits into master-modelcode-ai from gitleaks-milestone_1-5f2d33
@mcode-app mcode-app bot commented Jan 2, 2026

Status

Milestone Partially Completed

  • Task 1 - Main Entry Point and Basic CLI Structure: Successfully completed
  • Task 2 - Core Configuration Structures and TOML Parsing: Failed (not attempted)
  • ⏸️ Task 3 - Regex Compilation and Keyword Prefilter Index: Ready (blocked by Task 2 failure)
  • ⏸️ Task 4 - Allowlist System Implementation: Ready (blocked by Task 2 failure)
  • ⏸️ Task 5 - Configuration Extension and Merging System: Ready (blocked by Task 2 failure)
  • ⏸️ Task 6 - Configuration Loading and Precedence System: Ready (blocked by Task 2 and other dependencies)

Task 2 was not attempted, owing to an implementation failure or resource constraints; the root cause has not yet been established. Tasks 3-6 all depend on Task 2's configuration structures and were therefore not executed. The milestone establishes the foundational CLI infrastructure but does not deliver the configuration system that was its primary goal.

Feature Overview

This milestone aimed to implement the Core Configuration and Rule Engine for the gitleaks Rust port, establishing the foundation for all secret detection functionality. While the milestone was only partially completed, Task 1 successfully delivers a runnable CLI application with proper infrastructure.

What Was Implemented

Task 1 - Basic CLI Infrastructure

  • A functional Rust binary that can be executed and accepts command-line arguments
  • Complete CLI structure using clap with derive macros, supporting all flags from the Go version including:
    • Configuration flags (--config, --exit-code)
    • Reporting flags (--report-path, --report-format, --report-template, --baseline-path)
    • Scanning control flags (--max-target-megabytes, --max-decode-depth, --max-archive-depth, --timeout)
    • Logging flags (--log-level, --verbose, --no-color)
    • Feature flags (--ignore-gitleaks-allow, --redact, --enable-rule)
    • Diagnostic flags (--diagnostics, --diagnostics-dir)
    • UI flags (--no-banner)
  • Version information display via both --version flag and version subcommand
  • Structured logging using tracing/tracing-subscriber (Rust equivalent to Go's zerolog)
  • Graceful interrupt handling (Ctrl+C) for clean shutdown
  • ASCII banner display with suppression option

What Was Not Implemented

The core configuration system, which was the primary goal of this milestone, was not implemented:

  • TOML configuration parsing for detection rules
  • Rule validation and compilation
  • Allowlist system for false positive suppression
  • Configuration extension and inheritance
  • Keyword prefilter indices for performance
  • Default embedded configuration

As a result, the application cannot load detection rules or perform any secret scanning. The CLI is ready but lacks the configuration backend needed for the detection engine.

Testing

Automated Testing

Unit Tests (2 passing)

  • src/cli/version.rs::tests::test_get_version - Verifies version string is non-empty and contains semantic version format
  • src/logging.rs::tests::test_parse_log_level - Validates log level parsing for all supported levels (trace, debug, info, warn, error)

Run tests with:

cd /l2l/dst/gitleaks
cargo test

All tests pass successfully.

Manual Testing

1. Version Display

# Test version flag
cargo run -- --version
# Expected: "gitleaks 8.0.0"

# Test version subcommand
cargo run -- version
# Expected: Prints banner followed by "8.0.0"

2. Help Text

cargo run -- --help
# Expected: Displays comprehensive help with all flags documented
# Verify: All CLI flags from the Go version are present

3. Banner Display

# With banner (default)
cargo run --
# Expected: ASCII banner displayed on stderr, followed by info message

# Without banner
cargo run -- --no-banner
# Expected: No banner, only info message

4. Logging Levels

# Test different log levels
cargo run -- --log-level=debug
cargo run -- --log-level=trace
cargo run -- --log-level=error
# Expected: Logging system initializes with specified level

5. Build Verification

cargo build
cargo check
# Expected: Builds successfully with no errors or warnings

6. Signal Handling

cargo run --
# Press Ctrl+C
# Expected: Displays "Interrupt signal received. Exiting..." and exits cleanly

Architecture

Overview

graph TB
    subgraph "CLI Layer - Implemented ✅"
        MAIN[src/main.rs<br/>Entry Point]
        CLI[src/cli/mod.rs<br/>CLI Structure]
        VERSION[src/cli/version.rs<br/>Version Command]
        LOG[src/logging.rs<br/>Logging Setup]
        
        MAIN --> CLI
        CLI --> VERSION
        MAIN --> LOG
    end
    
    subgraph "Configuration Layer - Not Implemented ❌"
        CONFIG[src/config/mod.rs<br/>Config Parser]
        RULES[src/config/rule.rs<br/>Rule Definitions]
        ALLOW[src/config/allowlist.rs<br/>Allowlist Logic]
        EXTEND[src/config/extend.rs<br/>Extension System]
        
        CLI -.-> CONFIG
        CONFIG -.-> RULES
        CONFIG -.-> ALLOW
        CONFIG -.-> EXTEND
    end
    
    subgraph "Future Layers - Not Started"
        SOURCE[Source Layer<br/>Git/Files/Stdin]
        DETECT[Detection Engine<br/>Regex/Entropy]
        REPORT[Reporting Layer<br/>JSON/CSV/SARIF]
    end
    
    CONFIG -.-> SOURCE
    SOURCE -.-> DETECT
    DETECT -.-> REPORT
    
    style MAIN fill:#90EE90
    style CLI fill:#90EE90
    style VERSION fill:#90EE90
    style LOG fill:#90EE90
    style CONFIG fill:#FFB6C6
    style RULES fill:#FFB6C6
    style ALLOW fill:#FFB6C6
    style EXTEND fill:#FFB6C6
    style SOURCE fill:#D3D3D3
    style DETECT fill:#D3D3D3
    style REPORT fill:#D3D3D3
    
    classDef legend fill:none,stroke:none
    class Legend legend
    
    Legend["<br/>Legend:<br/>🟢 Implemented<br/>🔴 Failed/Not Implemented<br/>⚪ Future Work"]:::legend

Changes

CLI Infrastructure (Implemented)

src/main.rs

  • Application entry point with signal handling using ctrlc crate
  • Sets up atomic flag for interrupt detection
  • Initializes logging system before command execution
  • Parses CLI arguments using clap
  • Displays banner unless suppressed
  • Routes to appropriate command handler

src/cli/mod.rs

  • Defines Cli struct with all command-line flags using clap derive macros
  • Implements Commands enum for subcommand routing (currently only Version)
  • Includes comprehensive flag documentation matching Go version
  • Banner display logic with show_banner() method
  • Configuration precedence documentation in help text

src/cli/version.rs

  • Simple version command implementation
  • Uses CARGO_PKG_VERSION environment variable (set by Cargo at compile time)
  • Provides get_version() and run() functions

src/logging.rs

  • Logging wrapper around tracing/tracing-subscriber
  • Supports trace, debug, info, warn, error levels
  • Configurable via --log-level flag or RUST_LOG environment variable
  • Outputs to stderr with level indicators
  • Defaults to info level for invalid specifications
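
The fallback behaviour described for src/logging.rs can be sketched as follows. This is only an illustration under stated assumptions: the `LogLevel` enum, the exact signature of `parse_log_level`, and the case-insensitive matching are hypothetical, and the real module works with tracing's own level types rather than a local enum.

```rust
/// The five levels named in the description above (hypothetical local enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogLevel {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Parse a level name. Matching is case-insensitive (an assumption), and any
/// unrecognized value falls back to `Info`, as the description states.
fn parse_log_level(s: &str) -> LogLevel {
    match s.trim().to_ascii_lowercase().as_str() {
        "trace" => LogLevel::Trace,
        "debug" => LogLevel::Debug,
        "info" => LogLevel::Info,
        "warn" => LogLevel::Warn,
        "error" => LogLevel::Error,
        // Invalid specification: default to info instead of failing.
        _ => LogLevel::Info,
    }
}
```

Under these assumptions, `parse_log_level("verbose")` yields `LogLevel::Info` rather than an error, so a mistyped `--log-level` value never prevents startup.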

src/lib.rs

  • Library crate root exposing public modules
  • Makes cli and logging modules available

Cargo.toml

  • Project metadata (name, version, edition, authors, license)
  • Dependencies: clap (CLI framework), tracing/tracing-subscriber (logging), anyhow (error handling), ctrlc (signal handling)

.gitignore

  • Standard Rust ignore patterns for /target/ and Cargo.lock

Design Decisions

1. CLI Framework Selection

  • Decision: Use clap v4 with derive macros
  • Description: Leverages Rust's macro system for ergonomic CLI definition, similar to how Go's cobra provides structured command handling
  • Justification: clap is the de facto standard in Rust, provides excellent help generation, supports environment variables natively, and the derive API reduces boilerplate while maintaining type safety

2. Logging Infrastructure

  • Decision: Use tracing/tracing-subscriber instead of simpler env_logger
  • Description: Structured logging framework with support for spans, events, and multiple subscribers
  • Justification: Go's zerolog provides structured logging; tracing is the Rust ecosystem equivalent, offering similar capabilities with better async support and extensibility for future needs (e.g., distributed tracing)

3. Signal Handling Approach

  • Decision: Use ctrlc crate with atomic boolean flag
  • Description: Sets up a Ctrl+C handler that sets an atomic flag and immediately exits
  • Justification: Simple, cross-platform solution that matches Go's signal handling behavior. Atomic boolean ensures thread-safe access for potential future use in long-running operations
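
How a shared atomic flag could serve later long-running operations is sketched below. The sketch is dependency-free: a plain thread stands in for the ctrlc handler, and `run_until_interrupted` and `simulate_interrupt` are hypothetical names, not functions from this PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Process items until the shared flag is raised; returns how many were handled.
/// A future scan loop could poll the flag the same way between units of work.
fn run_until_interrupted(items: &[u32], interrupted: &AtomicBool) -> usize {
    let mut processed = 0;
    for _item in items {
        if interrupted.load(Ordering::SeqCst) {
            break;
        }
        processed += 1;
    }
    processed
}

/// Stand-in for the signal handler: another thread raises the flag.
/// In the real binary this store would happen inside the Ctrl+C handler.
fn simulate_interrupt(flag: Arc<AtomicBool>) {
    thread::spawn(move || flag.store(true, Ordering::SeqCst))
        .join()
        .expect("interrupt thread panicked");
}
```

Because the flag is an `AtomicBool` behind an `Arc`, the handler and any worker can share it without a mutex.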

4. Version Information Strategy

  • Decision: Use Cargo's built-in CARGO_PKG_VERSION environment variable
  • Description: Version is automatically extracted from Cargo.toml at compile time
  • Justification: Avoids the need for build scripts or manual version management (unlike Go's ldflags approach), ensuring version stays in sync with package metadata
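
A minimal sketch of the compile-time version lookup, for illustration only. The PR names `get_version()` but does not show its body; the `option_env!` fallback (so the snippet also compiles outside Cargo) and the `looks_like_semver` helper are assumptions added here.

```rust
/// Version baked in at compile time. Cargo sets CARGO_PKG_VERSION for every
/// build; `option_env!` with a fallback keeps this compilable with plain rustc.
fn get_version() -> &'static str {
    option_env!("CARGO_PKG_VERSION").unwrap_or("0.0.0-dev")
}

/// Loose check (hypothetical helper) that a string starts with
/// MAJOR.MINOR.PATCH, ignoring any pre-release or build suffix.
fn looks_like_semver(v: &str) -> bool {
    let core = v
        .split(|c: char| c == '-' || c == '+')
        .next()
        .unwrap_or("");
    let parts: Vec<&str> = core.split('.').collect();
    parts.len() == 3
        && parts
            .iter()
            .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()))
}
```

A check of this shape is roughly what the `test_get_version` unit test is described as doing: the string is non-empty and has a semantic-version form.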

5. Library vs Binary Structure

  • Decision: Use both lib.rs and main.rs (hybrid crate)
  • Description: Separates library code (reusable modules) from binary entry point
  • Justification: Follows Rust best practices, enables unit testing of library code, allows potential reuse of modules in other binaries or tests

6. Error Handling Strategy

  • Decision: Use anyhow for application-level errors
  • Description: Provides convenient error handling with context and chaining
  • Justification: Appropriate for application code where specific error types aren't needed at boundaries; will be complemented with thiserror for library error types in future tasks

Suggested Order of Review

  1. Cargo.toml - Start with project structure and dependencies to understand the foundation
  2. src/lib.rs - See the module structure and public API
  3. src/logging.rs - Review logging setup (small, self-contained module)
  4. src/cli/version.rs - Review version command (small, self-contained module)
  5. src/cli/mod.rs - Review comprehensive CLI structure and all flags
  6. src/main.rs - Review entry point that ties everything together
  7. .gitignore - Quick sanity check on ignored files

This ordering introduces simpler concepts first, then builds up to the more complex CLI structure, and finally shows how it all integrates in main.

Challenges

Task 2 Implementation Failure

Challenge: Task 2 (Core Configuration Structures and TOML Parsing) was not completed, blocking all subsequent tasks in the milestone.

Impact:

  • Tasks 3-6 form a dependency chain all requiring Task 2's configuration structures
  • The milestone's primary deliverable (configuration and rule engine) was not achieved
  • Without configuration parsing, the application cannot load detection rules or perform its core function
  • Approximately 80% of the planned milestone work (five of the six tasks) remains incomplete

Technical Implications:

  • No TOML deserialization for rule definitions
  • No configuration validation logic
  • No rule structures (Config, ViperConfig, Rule, Extend)
  • Cannot proceed with regex compilation (Task 3)
  • Cannot implement allowlist system (Task 4)
  • Cannot add configuration extension (Task 5)
  • Cannot integrate configuration loading into CLI (Task 6)

Next Steps:
The milestone work should be resumed by:

  1. Investigating the root cause of Task 2's failure (resource constraints, underestimated complexity, or technical blockers)
  2. Re-attempting Task 2 with additional context or a different approach
  3. Proceeding sequentially through Tasks 3-6 once Task 2 completes
  4. Breaking Task 2 into smaller incremental pieces if complexity turns out to be the issue

Task 1 nonetheless provides a working foundation: the CLI infrastructure is complete, covered by two unit tests and the manual checks listed under Testing, and ready to integrate with the configuration system once it is implemented.

mcode-bot and others added 3 commits December 31, 2025 21:55
Establish the foundational entry point for the gitleaks Rust application by migrating the main.rs file and setting up the basic CLI structure using clap.

Key components implemented:
- Project structure with Cargo.toml and proper dependencies (clap, tracing, anyhow, ctrlc)
- Logging infrastructure using tracing/tracing-subscriber as a Rust equivalent to zerolog
- CLI framework with clap derive macros supporting all flags from the Go version
- Version command that displays version from Cargo.toml
- Main entry point with Ctrl+C signal handling for graceful shutdown
- Banner display with --no-banner suppression option
- Comprehensive help text and --version flag

The application now has a runnable binary that can parse command-line arguments, display version information, and handle interrupts gracefully. While it doesn't perform any scanning yet (detection logic is in future tasks), it provides the foundation for all subsequent milestone work.

All acceptance criteria met per TASK.md Definition of Done:
- CLI structure supports --help, --version, and subcommands
- Version command functional via both --version and version subcommand
- Logging operational with configurable levels via --log-level flag
- Project builds successfully with no warnings (cargo check passes)
- All unit tests pass (2 tests in logging and version modules)

Note: Linting uses cargo check instead of cargo clippy (not available in environment).

Milestone No.: 1
Task No.: 1
Task ID: 15421
…rge from gitleaks-milestone_1-task_1-426cfd

mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
This task implements the foundational configuration parsing system for the gitleaks Rust migration. It migrates the core configuration structures (Config, ViperConfig, Rule, Extend, Required) and implements TOML deserialization using serde.

## Changes Made

### Core Configuration Structures
- Defined `Config` struct: The main runtime configuration with compiled rules, keywords map, and ordered rules list
- Defined `ViperConfig` struct: Raw TOML deserialization target using serde
- Defined `Rule` struct: Detection rule with ID, description, regex pattern (as string), path pattern (as string), entropy threshold, secret group, keywords, tags, and required rules
- Defined `Required` struct: Composite rule dependency with within_lines and within_columns constraints
- Defined `Extend` struct: Configuration extension/inheritance settings (in separate extend.rs file as specified)

### TOML Deserialization
- Implemented serde derives on all configuration structures
- Used appropriate serde attributes: `#[serde(rename)]` for camelCase fields, `#[serde(default)]` for optional fields
- Handled deprecated fields (allowList vs allowlists) with appropriate attributes
- Uses toml 0.8.x as specified, with indexmap pinned to 2.0.0 for Rust 1.75 compatibility

### Configuration Translation
- Implemented `ViperConfig::translate()` method that converts raw TOML to validated runtime Config
- Converts keywords to lowercase during translation
- Validates rule IDs are not empty
- Validates either regex or path is present
- Validates required rule IDs exist in the configuration
- Builds keywords map for efficient keyword lookup
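
The keyword and required-rule parts of the translation step might look roughly like this. It is a trimmed, dependency-free sketch, not the code from the commit: `RawRule`, the field names, the plain `String` errors, and the use of a set for the global keywords are all assumptions, and the structural checks on ID and regex/path are left out.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical trimmed-down rule as it might arrive from TOML.
#[derive(Debug, Clone, Default)]
struct RawRule {
    id: String,
    keywords: Vec<String>,
    required: Vec<String>, // IDs of rules this rule depends on
}

/// Hypothetical runtime config: rules by ID, rule order, global keyword set.
#[derive(Debug, Default)]
struct Config {
    rules: HashMap<String, RawRule>,
    ordered_rules: Vec<String>,
    keywords: BTreeSet<String>,
}

fn translate(raw: Vec<RawRule>) -> Result<Config, String> {
    let mut cfg = Config::default();
    for mut rule in raw {
        // Lowercase keywords once, at translation time.
        for kw in rule.keywords.iter_mut() {
            *kw = kw.to_lowercase();
            cfg.keywords.insert(kw.clone());
        }
        cfg.ordered_rules.push(rule.id.clone());
        cfg.rules.insert(rule.id.clone(), rule);
    }
    // Second pass: every required ID must name a rule in this config.
    for rule in cfg.rules.values() {
        for req in &rule.required {
            if !cfg.rules.contains_key(req) {
                return Err(format!("{}: required rule '{}' not found", rule.id, req));
            }
        }
    }
    Ok(cfg)
}
```

The required-rule check runs as a second pass so that a rule may reference another rule defined later in the same file.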

### Validation Logic
- Implemented `Rule::validate()` method for structural validation
- Validates rule ID is present and non-empty
- Validates at least one of regex or path is present
- Provides helpful error messages with context (description, regex, path)
- Note: Regex compilation and secretGroup validation deferred to Task 3

### Error Handling
- Created `ConfigError` enum using thiserror for comprehensive error types
- Error types for missing rule ID, no regex or path, invalid secret group, required rule not found, etc.
- Errors include context (rule ID, field values) for better debugging
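
A sketch of how the structural validation and its error type could fit together. The variant names, message wording, and `Rule` fields below are hypothetical. The commit derives its errors with thiserror; `Display` is written by hand here only to keep the example dependency-free.

```rust
use std::fmt;

/// Hypothetical subset of the error variants listed above.
#[derive(Debug, PartialEq)]
enum ConfigError {
    MissingRuleId { description: String },
    NoRegexOrPath { rule_id: String },
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingRuleId { description } => {
                write!(f, "rule id is missing or empty (description: {:?})", description)
            }
            ConfigError::NoRegexOrPath { rule_id } => {
                write!(f, "{}: neither regex nor path is set", rule_id)
            }
        }
    }
}

impl std::error::Error for ConfigError {}

/// Hypothetical rule with only the fields the structural checks need.
#[derive(Debug, Default)]
struct Rule {
    id: String,
    description: String,
    regex: Option<String>,
    path: Option<String>,
}

impl Rule {
    /// Structural validation only; regex compilation is a later step.
    fn validate(&self) -> Result<(), ConfigError> {
        if self.id.trim().is_empty() {
            return Err(ConfigError::MissingRuleId {
                description: self.description.clone(),
            });
        }
        if self.regex.is_none() && self.path.is_none() {
            return Err(ConfigError::NoRegexOrPath {
                rule_id: self.id.clone(),
            });
        }
        Ok(())
    }
}
```

Carrying the description in the missing-ID error gives the user something to search for in the config file when the rule has no ID to report.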

### Testing
- Ported relevant tests from config_test.go (tests that don't involve allowlists or extension)
- Tests for valid configurations: generic, rule_path_only, rule_regex_escaped_character_group, rule_entropy_group
- Tests for invalid configurations: missing ID, no regex or path
- Note: bad_entropy_group test deferred to Task 3 (requires regex compilation)
- All 7 tests pass successfully

### Project Setup
- Created Cargo.toml with dependencies: serde 1.0.195, toml 0.8.0, thiserror 1.0.56, regex 1.10.2, indexmap 2.0.0
- Created modular structure: config module with error, extend, rule, and types submodules
- Copied necessary test data files from source repository
- Created .gitignore to exclude build artifacts

## Design Decisions

### Configuration Deserialization Strategy (Design Decision #1)
Chose approach 1: Use serde derive macros with custom validation in a separate validation pass. This provides:
- Clear separation between parsing and validation logic
- Similar pattern to Go implementation (ViperConfig → Config)
- Easy to extend and maintain
- Clear error reporting

### Module Structure
Created separate files as specified in task requirements:
- `src/config/extend.rs` - Extend struct (basic definition, merging logic will be added in Task 5)
- `src/config/rule.rs` - Rule and Required structs
- `src/config/types.rs` - Config and ViperConfig structs
- `src/config/error.rs` - Configuration error types
- `src/config/mod.rs` - Module exports

### Deferred Items
- Regex compilation and storage (Task 3)
- Allowlist structures and logic (Task 4)
- Configuration extension/merging (Task 5)
- Configuration precedence and file loading (Task 6)
- Keyword prefilter indices (Task 3)

## Verification
- `cargo build` - Compiles successfully
- `cargo test` - All 7 tests pass
- `cargo check` - No warnings or errors
- `cargo build --release` - Release build succeeds

Milestone No.: 1
Task No.: 2
Task ID: 15422
mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
…ng: merge from gitleaks-milestone_1-task_2-5656b0

mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
This task implements regex compilation and storage for rules, along with the keyword prefilter index system using Aho-Corasick.

## Key Changes

### Core Implementation
- Created `CompiledConfig` and `CompiledRule` types that hold compiled regex patterns
- Implemented `CompiledConfig::from_config()` to compile raw Config into runtime-ready form
- Separated raw configuration (Config) from compiled configuration (CompiledConfig) at the type level
- Added comprehensive regex compilation error handling with detailed error messages

### Regex Compilation
- Both content regex and path regex patterns are compiled using the `regex` crate
- Regex compilation errors are caught and reported with rule context
- Invalid regex patterns fail compilation with clear error messages

### Secret Group Validation
- Implemented validation that secret_group doesn't exceed the number of capture groups
- Validation occurs during compilation after regex patterns are compiled
- Test case "invalid/rule_bad_entropy_group" now properly validates and fails compilation

### Keyword Prefilter Index
- Built global keyword index using Aho-Corasick automaton for fast prefiltering
- All keywords from all rules are collected, lowercased, and built into a single automaton
- Maintains mapping from keywords to rule IDs for quick lookup
- Supports case-insensitive keyword matching
- Multiple rules can share the same keyword

### Dependencies
- Added `aho-corasick = "1.1.2"` for efficient keyword matching

### Testing
- Added comprehensive tests for regex compilation (8 new tests)
- Tests validate secret_group checking, invalid regex handling, and keyword index functionality
- All tests pass (18 total tests: 5 in keywords module, 13 in config_test)

## Design Decisions

**Regex Compilation Strategy (Design Decision #2):** Adopted approach #3 - split config into "raw" and "compiled" versions with separate types. This provides clear separation between deserialized configuration and runtime-ready configuration, making it impossible to accidentally use uncompiled patterns.

**Keyword Index Structure (Design Decision #3):** Implemented approach #1 - build a single global Aho-Corasick automaton from all rule keywords with mapping back to rule IDs. This matches the Go implementation's approach and provides optimal performance for prefiltering.
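
The keyword-to-rule mapping behind the prefilter can be illustrated without the automaton. The sketch below swaps the Aho-Corasick search for a naive substring scan, so it shows the data structure rather than the performance. `KeywordIndex::build` and `candidate_rules` are assumed signatures, not the ones in `src/config/keywords.rs`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Maps each lowercased keyword to the IDs of the rules that declare it.
#[derive(Debug, Default)]
struct KeywordIndex {
    by_keyword: HashMap<String, BTreeSet<String>>,
}

impl KeywordIndex {
    /// Collect the keywords of all rules, lowercased, into one shared index.
    fn build(rules: &[(&str, Vec<&str>)]) -> Self {
        let mut by_keyword: HashMap<String, BTreeSet<String>> = HashMap::new();
        for (rule_id, keywords) in rules {
            for kw in keywords {
                by_keyword
                    .entry(kw.to_lowercase())
                    .or_default()
                    .insert(rule_id.to_string());
            }
        }
        Self { by_keyword }
    }

    /// IDs of rules with at least one keyword present in `haystack`.
    /// Naive scan per keyword; an automaton would find all keywords in one pass.
    fn candidate_rules(&self, haystack: &str) -> BTreeSet<String> {
        let lowered = haystack.to_lowercase();
        self.by_keyword
            .iter()
            .filter(|(kw, _)| lowered.contains(kw.as_str()))
            .flat_map(|(_, ids)| ids.iter().cloned())
            .collect()
    }
}
```

Only the rules returned by `candidate_rules` would then need their regexes run against the input, which is the point of prefiltering.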

## Files Modified
- `Cargo.toml` - Added aho-corasick dependency
- `src/config/mod.rs` - Exported new compiled and keywords modules
- `src/config/rule.rs` - Updated comment about secret_group validation

## Files Created
- `src/config/compiled.rs` - CompiledConfig and CompiledRule types with compilation logic (160 lines)
- `src/config/keywords.rs` - KeywordIndex using Aho-Corasick with tests (175 lines)

## Tests
- Updated `tests/config_test.rs` with 8 new compilation tests
- All 18 tests pass successfully
- Tests cover regex compilation, secret_group validation, keyword indexing, and error handling

Milestone No.: 1
Task No.: 3
Task ID: 15423
mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
…x: merge from gitleaks-milestone_1-task_3-2d4399

mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
Implemented the configuration extension and merging system that allows users to extend a base configuration (either from a file or the default embedded config) with their own customizations.

## Changes Made

### Core Extension System (src/config/types.rs)
- Added thread-local `EXTEND_DEPTH` tracking with `MAX_EXTEND_DEPTH = 2` limit
- Implemented `Config::from_file()` method to load configs from file paths
- Implemented `Config::extend_default()` for extending from embedded default config (stub)
- Implemented `Config::extend_path()` for extending from file-based configs
- Implemented `Config::extend_from()` for merging base configs into current config
- Implemented `Config::get_ordered_rules()` helper method
- Added `ViperConfig::translate_with_path()` to support recursive extension with path tracking

### Extension Logic
- Validates that `extend.path` and `extend.use_default` are not both set (returns `ExtendConflict` error)
- Recursively loads and merges base configurations up to depth limit (MAX_EXTEND_DEPTH = 2)
- Properly handles disabled rules via `extend.disabled_rules`
- Merges rule fields with correct precedence:
  - Extending config fields override base config fields (description, entropy, secret_group, regex, path)
  - Arrays are appended (tags, keywords, allowlists) rather than replaced
- Merges global allowlists from both base and extending configs
- Sorts `ordered_rules` after merging for consistency
- Keywords from merged rules are added to global keywords set and lowercased

### Validation
- Extension logic runs before final validation (only at depth 0)
- Targeted allowlists are applied after extension is complete
- Rule validation happens after all extension is complete

### Test Infrastructure (tests/config_test.rs)
- Added 16 extension tests covering:
  - Basic extension chains (multiple levels)
  - Disabled rules
  - Rule field overrides (description, path, regex, entropy, secret_group, tags, keywords)
  - Allowlist merging (OR and AND conditions)
  - Keyword lowercasing in base and extended rules
  - Invalid extension scenarios
- All 45 tests pass (29 from previous tasks + 16 new)

### Test Data
- Copied test data files from source repository (testdata/config/)
- Fixed test data paths to work from repository root (changed `../testdata/config/` to `testdata/config/`)
- Created `testdata/config/extend_3.toml` for depth limit testing
- Copied `testdata/config/simple.toml` for override tests

## Implementation Notes

**Design Decision #4 Resolution**: Adopted approach #1 (parse configs recursively and merge using custom merge function). This provides explicit control over merging semantics and matches the Go implementation's approach.

**Path Resolution**: Extension paths are resolved relative to the working directory (not the config file's directory), matching Viper's `SetConfigFile` behavior in the Go implementation. This design choice means paths in `extend.path` fields are relative to where the program is executed, not to the config file's location.

**Default Config Stub**: The `get_default_config()` function currently returns an empty string as a stub. This will be replaced with the actual embedded gitleaks.toml in Task 6. Tests that require the default config (like `test_extend_invalid_ruleid`) currently expect errors until the default config is implemented.

**Depth Tracking**: Uses thread-local storage (`Cell<usize>`) to track extension depth across recursive calls, ensuring thread-safety while maintaining simple semantics. The depth is incremented before loading extended configs and decremented after merging.

**Merging Semantics**: When a rule exists in both the extending and base configs:
- Start with the base rule
- Override each scalar field only when the extending config sets a non-default value (non-empty description, entropy != 0.0, secret_group != 0, regex/path is `Some`)
- Append arrays (tags, keywords, allowlists) from extending config to base
- Add all merged keywords to global keywords set
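
The merge steps above can be sketched as follows. This is an illustration only, written in Python rather than taken from `src/config/types.rs`: the `Rule` shape and the `merge_rule` name are hypothetical, and allowlists are left out to keep the sketch short.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass
class Rule:
    # Hypothetical, simplified rule shape; field names follow the list above.
    rule_id: str
    description: str = ""
    entropy: float = 0.0
    secret_group: int = 0
    regex: Optional[str] = None
    path: Optional[str] = None
    tags: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)


def merge_rule(base: Rule, ext: Rule, global_keywords: set[str]) -> Rule:
    """Merge an extending rule into a base rule with the same ID."""
    # Start with a copy of the base rule so the base itself is not mutated.
    merged = replace(base, tags=list(base.tags), keywords=list(base.keywords))

    # Scalars: the extending config wins only when it sets a non-default value.
    if ext.description:
        merged.description = ext.description
    if ext.entropy != 0.0:
        merged.entropy = ext.entropy
    if ext.secret_group != 0:
        merged.secret_group = ext.secret_group
    if ext.regex is not None:
        merged.regex = ext.regex
    if ext.path is not None:
        merged.path = ext.path

    # Arrays: append rather than replace.
    merged.tags.extend(ext.tags)
    merged.keywords.extend(ext.keywords)

    # Every merged keyword is lowercased into the global keyword set.
    global_keywords.update(k.lower() for k in merged.keywords)
    return merged
```

One consequence of the non-default check is that an extending config cannot reset `entropy` or `secret_group` back to zero; a zero value is indistinguishable from "not set".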

## Testing Status

All 45 tests pass:
- ✅ 14 unit tests in lib
- ✅ 31 integration tests including:
  - ✅ 16 extension-specific tests
  - ✅ 15 existing configuration tests from previous tasks

Extension tests demonstrate:
- ✅ Multi-level extension chains (up to depth 2)
- ✅ Rule field override and merging for all field types
- ✅ Keyword merging and lowercasing from base and extended rules
- ✅ Allowlist merging with OR and AND conditions
- ✅ Depth limiting (extends stop at max depth)
- ✅ Disabled rules properly excluded
- ✅ Invalid extension error handling
- ✅ Global allowlist targetRules integration with extension

## Future Work

- Task 6 will implement the embedded default configuration to replace the stub
- URL-based extension remains unimplemented (marked as TODO in Go version as well)

Milestone No.: 1
Task No.: 5
Task ID: 15425
mcode-app bot pushed a commit that referenced this pull request Jan 3, 2026
…merge from gitleaks-milestone_1-task_5-094a3d

@mcode-app mcode-app bot closed this Jan 23, 2026
@mcode-app mcode-app bot force-pushed the master-modelcode-ai branch from 89f677f to a2510b4 January 23, 2026 09:16
mcode-app bot pushed a commit that referenced this pull request Jan 23, 2026
… TOML loading

This task implements the complete configuration management system for the gitleaks Python migration, providing the foundation for all secret detection operations.

Core Implementation:

1. Pydantic Models (src/gitleaks/config/models.py - 214 lines, 93% coverage):
   - Config: Main configuration with rules, allowlists, and settings
   - Rule: Detection rules with regex, keywords, entropy, and validation
   - Allowlist: Filtering with commits, paths, regexes, and stop words
   - Extend: Config extension with path or useDefault options
   - Required: Required rule references for multi-part secrets
   - Full validation with regex compilation at config load time
   - Handles deprecated allowlist formats with warnings
   - Translates RE2 syntax (\z → \Z) to Python regex

2. Configuration Loader (src/gitleaks/config/loader.py - 181 lines, 95% coverage):
   - Implements config resolution order: --config flag → GITLEAKS_CONFIG env → GITLEAKS_CONFIG_TOML env → {source}/.gitleaks.toml → default config
   - Config extension and merging with max depth protection
   - Rule override logic during extension (description, regex, keywords, etc.)
   - DisabledRules filtering
   - Case-insensitive TOML field parsing (camelCase and lowercase variations)
   - Path resolution for extended configs relative to parent config directory
   - Default embedded config from gitleaks.toml
   - Clear, actionable error messages for invalid configs

3. Utilities (src/gitleaks/config/utils.py - 17 lines, 100% coverage):
   - Regex helper functions: regex_matched, any_regex_match, join_regex_or
   - Used for allowlist matching and rule prefiltering
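
The config resolution order listed under the Configuration Loader (item 2) might look roughly like the sketch below. It is not the code in `loader.py`: the function name, the return shape, and the reading that `GITLEAKS_CONFIG` holds a path while `GITLEAKS_CONFIG_TOML` holds inline TOML content are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_source(
    config_flag: Optional[str],
    env: Mapping[str, str],
    source: str,
) -> tuple[str, str]:
    """Return (kind, value), where kind is "file", "inline", or "default"."""
    # 1. An explicit --config flag always wins.
    if config_flag:
        return ("file", config_flag)
    # 2. GITLEAKS_CONFIG: assumed to be a path to a config file.
    if env.get("GITLEAKS_CONFIG"):
        return ("file", env["GITLEAKS_CONFIG"])
    # 3. GITLEAKS_CONFIG_TOML: assumed to be the TOML content itself.
    if env.get("GITLEAKS_CONFIG_TOML"):
        return ("inline", env["GITLEAKS_CONFIG_TOML"])
    # 4. A .gitleaks.toml in the scanned source directory, if present.
    candidate = Path(source) / ".gitleaks.toml"
    if candidate.is_file():
        return ("file", str(candidate))
    # 5. Fall back to the embedded default config.
    return ("default", "")
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the precedence logic easy to exercise in tests.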

Python 3.10+ Compatibility:
- Updated from Python 3.11+ to Python 3.10+ minimum version
- Added tomli dependency with conditional import (stdlib tomllib for 3.11+, tomli package for 3.10)
- Updated pyproject.toml: dependencies, classifiers, and tool configurations (black, ruff, mypy)

Test Coverage:
- 73 tests passing with 83% overall coverage
- 47 model tests covering all Pydantic validation logic
- 26 loader tests covering config loading, extension, and edge cases
- 8 utility tests for regex helpers
- Testdata validation tests using actual gitleaks config files from source (config files only)

Design Decisions Implemented:
- Design Decision #1 (Configuration Schema Translation): Flat Pydantic model structure with Field aliases for kebab-case keys
- Design Decision #2 (Regex Engine Selection): Using `regex` library instead of stdlib `re` for better PCRE/RE2 compatibility

Files Created:
- src/gitleaks/config/models.py
- src/gitleaks/config/loader.py
- src/gitleaks/config/utils.py
- src/gitleaks/config/gitleaks.toml (default config)
- tests/config/test_models.py
- tests/config/test_loader.py
- tests/config/test_utils.py
- .gitleaks.toml (repository config with allowlists)
- testdata/config/ (config test files only - 52 files for validation)

Acceptance Criteria Met:
✅ All Pydantic models complete with proper validation
✅ TOML loading works with full config resolution order
✅ Config extension functional with rule merging and DisabledRules
✅ Regex compilation using regex library with error handling
✅ All tests pass (73/73) with strong coverage (83%)
✅ Can load and validate testdata/config/*.toml files
mcode-app bot pushed a commit that referenced this pull request Jan 23, 2026
… TOML loading: merge from gitleaks-milestone_1-task_2-61715a

This task implements the complete configuration management system for the gitleaks Python migration, providing the foundation for all secret detection operations.

**Core Implementation:**

1. **Pydantic Models** (src/gitleaks/config/models.py - 214 lines, 93% coverage):
   - Config: Main configuration with rules, allowlists, and settings
   - Rule: Detection rules with regex, keywords, entropy, and validation
   - Allowlist: Filtering with commits, paths, regexes, and stop words
   - Extend: Config extension with path or useDefault options
   - Required: Required rule references for multi-part secrets
   - Full validation with regex compilation at config load time
   - Handles deprecated allowlist formats with warnings
   - Translates RE2 syntax (\z → \Z) to Python regex

2. **Configuration Loader** (src/gitleaks/config/loader.py - 181 lines, 95% coverage):
   - Implements config resolution order: --config flag → GITLEAKS_CONFIG env → GITLEAKS_CONFIG_TOML env → {source}/.gitleaks.toml → default config
   - Config extension and merging with max depth protection
   - Rule override logic during extension (description, regex, keywords, etc.)
   - DisabledRules filtering
   - Case-insensitive TOML field parsing (camelCase and lowercase variations)
   - Path resolution for extended configs relative to parent config directory
   - Default embedded config from gitleaks.toml
   - Clear, actionable error messages for invalid configs

3. **Utilities** (src/gitleaks/config/utils.py - 17 lines, 100% coverage):
   - Regex helper functions: regex_matched, any_regex_match, join_regex_or
   - Used for allowlist matching and rule prefiltering
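
As a rough illustration of the three helpers named above, here is a minimal sketch. Only the function names come from this commit; the signatures and bodies are assumptions, and the sketch uses the stdlib `re` module so it stays self-contained, whereas the commit states that the project uses the `regex` library.

```python
import re
from typing import Iterable, Optional


def regex_matched(text: str, pattern: Optional[re.Pattern[str]]) -> bool:
    """True if the pattern is set and matches anywhere in the text."""
    return pattern is not None and pattern.search(text) is not None


def any_regex_match(text: str, patterns: Iterable[re.Pattern[str]]) -> bool:
    """True if at least one pattern matches the text."""
    return any(regex_matched(text, p) for p in patterns)


def join_regex_or(patterns: Iterable[re.Pattern[str]]) -> str:
    """Combine patterns into one non-capturing alternation."""
    return "(?:" + "|".join(p.pattern for p in patterns) + ")"
```

For example, joining `\.lock$` and `^vendor/` yields `(?:\.lock$|^vendor/)`, which can be compiled once and checked in a single pass instead of looping over every allowlist pattern.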

**Python 3.10+ Compatibility:**
- Updated from Python 3.11+ to Python 3.10+ minimum version
- Added tomli dependency with conditional import (stdlib tomllib for 3.11+, tomli package for 3.10)
- Updated pyproject.toml: dependencies, classifiers, and tool configurations (black, ruff, mypy)
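
The conditional import described above is commonly written as shown below. This is the usual idiom rather than a copy of the project's code, and `parse_toml` is a hypothetical wrapper added only for the example; on Python 3.10 it requires the third-party `tomli` package to be installed.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # stdlib TOML parser, available from Python 3.11
else:
    import tomli as tomllib  # third-party backport with the same API


def parse_toml(text: str) -> dict:
    """Parse TOML text into a dict, whichever backend is in use."""
    return tomllib.loads(text)
```

Aliasing `tomli` as `tomllib` means the rest of the code can call `tomllib.loads` or `tomllib.load` without caring which interpreter version it runs on.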

**Test Coverage:**
- 73 tests passing with 83% overall coverage
- 47 model tests covering all Pydantic validation logic
- 26 loader tests covering config loading, extension, and edge cases
- 8 utility tests for regex helpers
- Testdata validation tests using actual gitleaks config files from source (config files only)

**Design Decisions Implemented:**
- **Design Decision #1 (Configuration Schema Translation)**: Flat Pydantic model structure with Field aliases for kebab-case keys
- **Design Decision #2 (Regex Engine Selection)**: Using `regex` library instead of stdlib `re` for better PCRE/RE2 compatibility
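
The RE2-to-Python translation mentioned under Pydantic Models (`\z` → `\Z`) could be done along these lines. This is a sketch under assumptions, not the code in `models.py`: the function name is hypothetical, and the stdlib `re` module is used here only to keep the example self-contained. Scanning escape sequences one at a time avoids rewriting a literal `z` that follows an escaped backslash (`\\z`).

```python
import re

# Matches a backslash followed by any single character, i.e. one escape sequence.
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)


def translate_re2_anchors(pattern: str) -> str:
    """Rewrite RE2's end-of-text anchor \\z as Python's \\Z."""

    def fix(match: re.Match) -> str:
        # Only the exact escape "\z" is rewritten; every other escape is kept.
        return r"\Z" if match.group(1) == "z" else match.group(0)

    return _ESCAPE.sub(fix, pattern)
```

Unlike `$`, Python's `\Z` matches only at the very end of the string and not before a trailing newline, which is why it is the closer equivalent of RE2's `\z`.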

**Files Created:**
- src/gitleaks/config/models.py
- src/gitleaks/config/loader.py
- src/gitleaks/config/utils.py
- src/gitleaks/config/gitleaks.toml (default config)
- tests/config/test_models.py
- tests/config/test_loader.py
- tests/config/test_utils.py
- .gitleaks.toml (repository config with allowlists)
- testdata/config/ (config test files only - 52 files for validation)

**Acceptance Criteria Met:**
✅ All Pydantic models complete with proper validation
✅ TOML loading works with full config resolution order
✅ Config extension functional with rule merging and DisabledRules
✅ Regex compilation using regex library with error handling
✅ All tests pass (73/73) with strong coverage (83%)
✅ Can load and validate testdata/config/*.toml files

Milestone No.: 1
Task No.: 2
Task ID: 32