Milestone [1] Foundation - Config, CLI, and Directory Scanning #5
Closed
mcode-app[bot] wants to merge 23 commits into master-modelcode-ai
…gging
Created the foundational Python project structure for gitleaks migration, including:
**Project Structure:**
- Implemented Poetry-based build system with pyproject.toml defining all dependencies (Click, Pydantic, structlog, GitPython, etc.)
- Created src-layout package structure under src/gitleaks/ following modern Python best practices
- Added .gitignore to exclude build artifacts, virtual environments, and Python cache files
**Core Components:**
- src/gitleaks/__init__.py: Package initialization with version export (0.1.0)
- src/gitleaks/__main__.py: Entry point supporting both `gitleaks` command and `python -m gitleaks` execution
- src/gitleaks/logging.py: Structured logging configuration using structlog with JSON output, ISO 8601 timestamps, and support for all log levels (TRACE/DEBUG/INFO/WARN/ERROR/FATAL)
**Module Directories:**
Created empty subdirectories with __init__.py files for future implementation:
- cli/ - Command-line interface components
- config/ - Configuration management
- detector/ - Secret detection engine
- sources/ - Source providers (git, directory, stdin, archives)
- reporting/ - Output formatters (JSON, CSV, SARIF, JUnit, templates)
**Validation:**
- Package builds successfully with `poetry build`, producing wheel and sdist
- Package is importable with `import gitleaks; print(gitleaks.__version__)`
- Logging infrastructure produces JSON-formatted logs with timestamps and contextual fields
- All acceptance criteria from Definition of Done are met
The project is now ready for subsequent tasks to implement CLI, configuration loading, and detection engine components.
Milestone No.: 1
Task No.: 1
Task ID: 31
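The JSON log shape described for src/gitleaks/logging.py (timestamp, level, event, contextual fields) can be illustrated with a minimal sketch. Note that this deliberately swaps in the standard-library `logging` module rather than structlog, so it is not the commit's implementation; `JsonFormatter`, `build_logger`, the `context` key, and the numeric value chosen for `TRACE` are all hypothetical names and choices made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone

# Assumption: TRACE sits below DEBUG (10); the real level value is not stated.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with an ISO 8601 timestamp."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Contextual fields are passed via extra={"context": {...}} (hypothetical).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(stream, level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("gitleaks_sketch")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer, level=TRACE)
    log.info("scan started", extra={"context": {"path": "."}})
    print(buffer.getvalue().strip())
```

structlog offers the same shape through its processor chain; the sketch only shows what a resulting log line could look like.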
…gging: merge from gitleaks-milestone_1-task_1-b79b47
… TOML loading
… TOML loading: merge from gitleaks-milestone_1-task_2-61715a
This task implements the complete configuration management system for the gitleaks Python migration, providing the foundation for all secret detection operations.
**Core Implementation:**
1. **Pydantic Models** (src/gitleaks/config/models.py - 214 lines, 93% coverage):
- Config: Main configuration with rules, allowlists, and settings
- Rule: Detection rules with regex, keywords, entropy, and validation
- Allowlist: Filtering with commits, paths, regexes, and stop words
- Extend: Config extension with path or useDefault options
- Required: Required rule references for multi-part secrets
- Full validation with regex compilation at config load time
- Handles deprecated allowlist formats with warnings
- Translates RE2 syntax (\z → \Z) to Python regex
2. **Configuration Loader** (src/gitleaks/config/loader.py - 181 lines, 95% coverage):
- Implements config resolution order: --config flag → GITLEAKS_CONFIG env → GITLEAKS_CONFIG_TOML env → {source}/.gitleaks.toml → default config
- Config extension and merging with max depth protection
- Rule override logic during extension (description, regex, keywords, etc.)
- DisabledRules filtering
- Case-insensitive TOML field parsing (camelCase and lowercase variations)
- Path resolution for extended configs relative to parent config directory
- Default embedded config from gitleaks.toml
- Clear, actionable error messages for invalid configs
3. **Utilities** (src/gitleaks/config/utils.py - 17 lines, 100% coverage):
- Regex helper functions: regex_matched, any_regex_match, join_regex_or
- Used for allowlist matching and rule prefiltering
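The resolution order listed for the configuration loader can be sketched as a small precedence function. This is an illustration under stated assumptions, not the code in loader.py: the function name `resolve_config_source`, its return labels, and the interpretation that `GITLEAKS_CONFIG` holds a file path while `GITLEAKS_CONFIG_TOML` holds inline TOML text are all assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_config_source(
    config_flag: Optional[str],
    source_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (origin, value) for the first config source that applies.

    Order: --config flag, GITLEAKS_CONFIG, GITLEAKS_CONFIG_TOML,
    {source}/.gitleaks.toml, then the embedded default.
    """
    env = os.environ if env is None else env
    if config_flag:
        return ("flag", config_flag)
    if env.get("GITLEAKS_CONFIG"):
        # Assumption: this variable carries a path to a TOML file.
        return ("env_path", env["GITLEAKS_CONFIG"])
    if env.get("GITLEAKS_CONFIG_TOML"):
        # Assumption: this variable carries the TOML content itself.
        return ("env_toml", env["GITLEAKS_CONFIG_TOML"])
    candidate = Path(source_dir) / ".gitleaks.toml"
    if candidate.is_file():
        return ("source", str(candidate))
    return ("default", None)


if __name__ == "__main__":
    print(resolve_config_source(None, ".", {"GITLEAKS_CONFIG": "/etc/gl.toml"}))
```

Passing the environment as a parameter keeps the precedence logic testable without mutating `os.environ`.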
**Python 3.10+ Compatibility:**
- Updated from Python 3.11+ to Python 3.10+ minimum version
- Added tomli dependency with conditional import (stdlib tomllib for 3.11+, tomli package for 3.10)
- Updated pyproject.toml: dependencies, classifiers, and tool configurations (black, ruff, mypy)
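The conditional import mentioned above usually takes the following form. This is a sketch of the common pattern rather than a copy of the commit's code; the helper name `parse_toml` is hypothetical, and on Python 3.10 the block requires the third-party `tomli` package to be installed.

```python
import sys

if sys.version_info >= (3, 11):
    import tomllib  # standard library from Python 3.11 onward
else:
    import tomli as tomllib  # third-party backport with the same API


def parse_toml(text: str) -> dict:
    """Parse TOML text into a plain dict using whichever parser was imported."""
    return tomllib.loads(text)


if __name__ == "__main__":
    print(parse_toml('title = "demo"\n[[rules]]\nid = "example-rule"\n'))
```

Aliasing the backport to `tomllib` means the rest of the module can use a single name regardless of interpreter version.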
**Test Coverage:**
- 73 tests passing with 83% overall coverage
- 47 model tests covering all Pydantic validation logic
- 26 loader tests covering config loading, extension, and edge cases
- 8 utility tests for regex helpers
- Testdata validation tests using actual gitleaks config files from source (config files only)
**Design Decisions Implemented:**
- **Design Decision #1 (Configuration Schema Translation)**: Flat Pydantic model structure with Field aliases for kebab-case keys
- **Design Decision #2 (Regex Engine Selection)**: Using `regex` library instead of stdlib `re` for better PCRE/RE2 compatibility
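The `\z` → `\Z` translation listed under the Pydantic models is needed because RE2 spells the absolute end-of-text anchor `\z`, while Python spells it `\Z`. A minimal sketch follows, using stdlib `re` only to keep the example dependency-free (the commit itself selects the `regex` library); the function name `translate_re2_anchors` is hypothetical. Walking the pattern one escape pair at a time avoids rewriting an escaped backslash that happens to be followed by a literal `z`.

```python
import re

# Matches a backslash plus the character it escapes, so escape sequences
# are consumed pair by pair from left to right.
_ESCAPE_PAIR = re.compile(r"\\(.)", re.DOTALL)


def translate_re2_anchors(pattern: str) -> str:
    """Rewrite RE2's end-of-text anchor (backslash-z) to Python's (backslash-Z)."""

    def swap(match: "re.Match[str]") -> str:
        # Only the escape whose escaped character is 'z' is rewritten.
        return "\\Z" if match.group(1) == "z" else match.group(0)

    return _ESCAPE_PAIR.sub(swap, pattern)


if __name__ == "__main__":
    translated = translate_re2_anchors(r"token\z")
    print(translated)
    print(bool(re.search(translated, "my token")))
```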
**Files Created:**
- src/gitleaks/config/models.py
- src/gitleaks/config/loader.py
- src/gitleaks/config/utils.py
- src/gitleaks/config/gitleaks.toml (default config)
- tests/config/test_models.py
- tests/config/test_loader.py
- tests/config/test_utils.py
- .gitleaks.toml (repository config with allowlists)
- testdata/config/ (config test files only - 52 files for validation)
**Acceptance Criteria Met:**
✅ All Pydantic models complete with proper validation
✅ TOML loading works with full config resolution order
✅ Config extension functional with rule merging and DisabledRules
✅ Regex compilation using regex library with error handling
✅ All tests pass (73/73) with strong coverage (83%)
✅ Can load and validate testdata/config/*.toml files
Milestone No.: 1
Task No.: 2
Task ID: 32
…c integration
Implemented the CLI foundation for gitleaks using Click 8.1.x framework, including:
**Core CLI Structure:**
- Created `cli/common.py` with Click root command group and persistent flags
- Implemented all persistent flags: --config, --exit-code, --report-path, --report-format, --report-template, --baseline-path, --log-level, --verbose, --no-color, --max-target-megabytes, --ignore-gitleaks-allow, --redact, --no-banner, --enable-rule, --gitleaks-ignore-path, --max-decode-depth, --max-archive-depth, --timeout
- Created GitleaksContext class to pass CLI state between commands
- Implemented utility functions: bytes_convert() for human-readable sizes, format_duration() for time display
**Configuration Integration:**
- Integrated config loading from Task 2 via init_config() function
- Supports --config flag, GITLEAKS_CONFIG env var, GITLEAKS_CONFIG_TOML env var, auto-discovery of .gitleaks.toml in target directory, and default config fallback
- Configuration loading order follows specification exactly
**Directory Command:**
- Created `cli/dir.py` implementing the directory scanning command
- Command accepts path argument (defaults to ".") with validation
- Includes --follow-symlinks flag
- Implements async integration using asyncio.run() to bridge Click's synchronous interface
- Supports timeout handling via --timeout flag using asyncio.wait_for()
**Async Integration:**
- Implemented dir_scan_async() as async handler function
- Used asyncio.run() to invoke async handlers from Click commands
- Added timeout support with proper error handling
- Async implementation serves as tracer bullet for future detection engine
**User Experience:**
- Banner display with ASCII art (suppressible via --no-banner)
- Version handling via --version flag
- Logging configuration integrated with --log-level and --no-color flags
- Structured logging output using structlog with console renderer
- Signal handling for graceful shutdown (Ctrl+C)
**Testing:**
- Created comprehensive test suite in tests/cli/test_common.py
- Created integration tests in tests/cli/test_dir.py
- All 13 CLI tests passing with 73% overall code coverage
**Files Created:**
- src/gitleaks/cli/common.py (374 lines)
- src/gitleaks/cli/dir.py (112 lines)
- tests/cli/test_common.py (67 lines)
- tests/cli/test_dir.py (71 lines)
- tests/cli/__init__.py
**Files Modified:**
- src/gitleaks/cli/__init__.py - Exposed cli and dir_command
- src/gitleaks/__main__.py - Integrated Click CLI with signal handling
- src/gitleaks/logging.py - Updated configure_logging to accept no_color parameter
The CLI is now fully functional and ready for detection engine integration in Task 5. All acceptance criteria from the task specification are met:
- ✅ CLI commands are defined with help text
- ✅ Persistent flags work across all commands
- ✅ Config loading is integrated with all precedence rules
- ✅ Dir command scaffolding validates input and loads configuration
- ✅ Async integration is functional with timeout support
Milestone No.: 1
Task No.: 3
Task ID: 33
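The bridge between a synchronous command callback and an async handler, with the timeout applied through `asyncio.wait_for()`, can be sketched as below. The names `run_with_timeout` and `fake_scan` are hypothetical stand-ins (the commit names its handler dir_scan_async()), and treating `None` as "no timeout" is an assumption about how the --timeout flag maps onto the call.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_timeout(
    handler: Callable[[], Awaitable[T]], timeout: Optional[float]
) -> T:
    """Run an async handler to completion from synchronous code.

    Raises asyncio.TimeoutError if the handler exceeds `timeout` seconds;
    timeout=None waits indefinitely.
    """

    async def _runner() -> T:
        return await asyncio.wait_for(handler(), timeout=timeout)

    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(_runner())


async def fake_scan(delay: float) -> str:
    """Stand-in for a real scan: sleeps, then reports completion."""
    await asyncio.sleep(delay)
    return "done"


if __name__ == "__main__":
    print(run_with_timeout(lambda: fake_scan(0.0), timeout=1.0))
    try:
        run_with_timeout(lambda: fake_scan(1.0), timeout=0.01)
    except asyncio.TimeoutError:
        print("scan timed out")
```

The handler is passed as a zero-argument callable so the coroutine object is created inside the event loop that `asyncio.run()` owns.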
…c integration: merge from gitleaks-milestone_1-task_3-985dfb
…JSON reporter
Implemented the reporting foundation for gitleaks Python migration, including:
## Core Components
**Finding Model (src/gitleaks/reporting/finding.py)**
- Complete Finding dataclass with all required fields: rule_id, description, line numbers, columns, match, secret, file info, git metadata, entropy, tags, and fingerprint
- RequiredFinding dataclass for composite rule support
- Secret redaction with configurable percentage masking (0-100%)
- mask_secret() function implementing banker's rounding to match Go behavior
- add_required_findings() method for multi-part rule tracking
**Reporter Protocol (src/gitleaks/reporting/__init__.py)**
- Reporter protocol defining the write(writer, findings) interface
- Type-annotated with IO and List[Finding] for type safety
- Clear documentation of interface contract
- Exports for all public APIs
**JSON Reporter (src/gitleaks/reporting/json_reporter.py)**
- JsonReporter implementing the Reporter protocol
- Serializes findings to JSON with single-space indentation matching Go output
- Handles entropy float-to-int conversion for Go compatibility
- Properly excludes non-serializable fields (line, fragment, required_findings)
- Implements omitempty behavior for Link field
- PascalCase field names matching Go's JSON output
**Constants (src/gitleaks/reporting/constants.py)**
- CWE-798 identifier and description
- STDOUT_REPORT_PATH constant
- VERSION and DRIVER constants
## Testing
**Comprehensive Test Suite**
- test_finding.py: 13 tests covering Finding model and redaction
  - Full redaction tests (100% -> "REDACTED")
  - Partial redaction tests with various percentages
  - Edge cases: empty secrets, short secrets, invalid percentages
  - RequiredFinding accumulation
- test_json_reporter.py: 4 tests for JSON output
  - Simple finding serialization with expected output comparison
  - Empty findings list handling
  - Field inclusion/exclusion verification
  - omitempty behavior for Link field
**Test Results**
- All 17 new tests pass
- 100% code coverage for reporting module
- All 165 project tests pass (no regressions)
- JSON output matches Go implementation testdata exactly
## Key Implementation Details
1. **Redaction Algorithm**: Implements same masking logic as Go:
   - Calculates visible characters as: round(length * (100-percent) / 100)
   - Uses Python's banker's rounding to match Go's math.RoundToEven
   - Appends "..." to masked secrets
2. **JSON Compatibility**: Ensures output matches Go version:
   - Single-space indentation (indent=" ")
   - Integer entropy when whole number (0 not 0.0)
   - PascalCase field names
   - Proper field exclusion
3. **Type Safety**: Uses Python type hints throughout:
   - dataclasses for Finding and RequiredFinding
   - Protocol for Reporter interface
   - IO type for file/stream handling
Files ported from Go:
- report/finding.go (126 LOC) -> finding.py
- report/report.go (16 LOC) -> __init__.py
- report/json.go (17 LOC) -> json_reporter.py
- report/constants.go (4 LOC) -> constants.py
- report/finding_test.go -> test_finding.py
- report/json_test.go -> test_json_reporter.py
This completes Task 6, providing the foundation for all future reporter implementations (CSV, SARIF, JUnit, templates) in later milestones.
Milestone No.: 1
Task No.: 6
Task ID: 36
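The redaction arithmetic described in this commit can be worked through with a short sketch. The signature of `mask_secret` below, the clamping of out-of-range percentages, and the handling of an empty secret are assumptions made for illustration; only the formula, the "REDACTED" result at 100%, and the trailing "..." come from the commit message. Python's built-in `round()` rounds exact halves to the nearest even integer, which is the behaviour the message relies on.

```python
def mask_secret(secret: str, percent: int = 100) -> str:
    """Hide `percent` percent of a secret, keeping a visible prefix.

    Visible length = round(len(secret) * (100 - percent) / 100), where
    round() uses round-half-to-even (banker's rounding).
    """
    percent = max(0, min(100, percent))  # assumption: clamp invalid values
    if percent == 100:
        return "REDACTED"
    if not secret:
        return secret  # assumption: nothing to mask in an empty secret
    visible = round(len(secret) * (100 - percent) / 100)
    return secret[:visible] + "..."


if __name__ == "__main__":
    # 10 * 50 / 100 = 5.0, so five characters stay visible.
    print(mask_secret("abcdefghij", 50))
    # 5 * 50 / 100 = 2.5, which rounds half-to-even down to 2.
    print(mask_secret("abcde", 50))
    # 7 * 50 / 100 = 3.5, which rounds half-to-even up to 4.
    print(mask_secret("abcdefg", 50))
```

The two half cases (2.5 and 3.5) show why the rounding mode matters: round-half-up would give 3 and 4, while round-half-to-even gives 2 and 4.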
…JSON reporter: merge from gitleaks-milestone_1-task_6-017036
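The JSON compatibility rules named for the JSON reporter (single-space indentation, integer entropy for whole numbers, PascalCase keys, and omitting an empty Link) can be sketched as follows. The trimmed-down `Finding` dataclass, the helper names, and the specific key spellings are illustrative assumptions; the real model has many more fields.

```python
import io
import json
from dataclasses import dataclass
from typing import IO, Any, Dict, List


@dataclass
class Finding:
    """Reduced, hypothetical stand-in for the real Finding model."""

    rule_id: str
    secret: str
    entropy: float = 0.0
    link: str = ""


def finding_to_dict(finding: Finding) -> Dict[str, Any]:
    """Map snake_case attributes to PascalCase keys for serialization."""
    entropy = finding.entropy
    if float(entropy).is_integer():
        entropy = int(entropy)  # emit 3 rather than 3.0
    record: Dict[str, Any] = {
        "RuleID": finding.rule_id,
        "Secret": finding.secret,
        "Entropy": entropy,
    }
    if finding.link:
        record["Link"] = finding.link  # omitted when empty
    return record


def write_json(writer: IO[str], findings: List[Finding]) -> None:
    """Write findings as a JSON array indented with a single space."""
    json.dump([finding_to_dict(f) for f in findings], writer, indent=" ")


if __name__ == "__main__":
    out = io.StringIO()
    write_json(out, [Finding("example-rule", "s3cr3t", entropy=3.0)])
    print(out.getvalue())
```

`json.dump` accepts a string for `indent`, so `indent=" "` produces one space per nesting level.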
Implemented version display and diagnostics commands for gitleaks Python implementation with full support for build-time version configuration:
Version Command:
- Added `gitleaks version` subcommand that displays "gitleaks version X.Y.Z"
- Updated `--version` flag to display "gitleaks version X.Y.Z" format (without comma)
- Implemented build-time version override via GITLEAKS_VERSION environment variable
- Version defaults to "0.1.0" from pyproject.toml but can be overridden at runtime
- Both `--version` flag and `version` subcommand produce identical output
Diagnostics Command:
- Added `gitleaks diagnostics` subcommand that displays comprehensive diagnostic information
- Shows runtime info: Python version, OS/platform, architecture
- Displays configuration details: config path, rule count, keyword count
- Shows allowlist statistics (paths, commits, regexes, stop words)
- Displays config settings like extends and title
- Shows GITLEAKS_* environment variables if set
- Supports --source option to specify directory for config file discovery
- Handles cases where no config file is found gracefully
Build-Time Version Configuration:
- Version can be set via GITLEAKS_VERSION environment variable at runtime
- Example: `GITLEAKS_VERSION=8.18.0 gitleaks --version` outputs "gitleaks version 8.18.0"
- This mimics Go's ldflags approach for setting version during build process
- Works consistently across all commands (--version, version, diagnostics)
Implementation Details:
- Created src/gitleaks/cli/version.py with version command
- Created src/gitleaks/cli/diagnostics.py with diagnostics command
- Updated src/gitleaks/__init__.py to support GITLEAKS_VERSION env var override
- Updated src/gitleaks/cli/__init__.py to export new commands
- Modified cli/common.py to customize version_option message format
- Both commands integrate with existing CLI infrastructure (logging, config loading)
- Follows same config discovery logic as other commands (--config flag, env vars, .gitleaks.toml)
Note on Diagnostics Scope:
The task specification intentionally redefined the diagnostics command to show configuration and environment information rather than performance profiling (which is what the Go source's diagnostics.go does). This implementation matches the task specification's acceptance criteria, providing users with useful troubleshooting information about their gitleaks setup.
All acceptance criteria from Task gitleaks#12 have been met and verified through comprehensive testing.
Milestone No.: 1
Task No.: 12
Task ID: 42
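The GITLEAKS_VERSION override and the shared output format can be sketched in a few lines. The function names `resolve_version` and `version_message` are hypothetical, and treating an empty variable as unset is an assumption; the default "0.1.0" and the "gitleaks version X.Y.Z" format come from the commit message.

```python
import os
from typing import Mapping, Optional

DEFAULT_VERSION = "0.1.0"


def resolve_version(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GITLEAKS_VERSION if set and non-empty, else the default."""
    env = os.environ if env is None else env
    return env.get("GITLEAKS_VERSION") or DEFAULT_VERSION


def version_message(env: Optional[Mapping[str, str]] = None) -> str:
    """Format the line shared by the --version flag and the version subcommand."""
    return f"gitleaks version {resolve_version(env)}"


if __name__ == "__main__":
    print(version_message({}))
    print(version_message({"GITLEAKS_VERSION": "8.18.0"}))
```

Routing both the flag and the subcommand through one formatting function is one way to guarantee the identical output the commit describes.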
…eaks-milestone_1-task_12-8e9642
…raction

This commit implements comprehensive archive scanning capabilities for gitleaks, enabling detection of secrets within compressed archives with configurable recursion depth limits.

## Key Features Implemented

### 1. Archive Type Detection
- Created `sources/archive_source.py` (759 lines) with detection and extraction logic
- Extension-based detection for all common archive formats
- Magic byte detection for reliable format identification without file extensions
- Support for: ZIP, TAR (plain/gz/bz2/xz/zst), GZIP, BZIP2, XZ, 7Z, RAR, ZSTD

### 2. Archive Extraction & Scanning
- ZIP extraction using stdlib `zipfile`
- TAR extraction using stdlib `tarfile` with multiple compression modes
- Single-file compression (gzip, bzip2, xz) using stdlib modules
- 7z support via `py7zr` library
- RAR support via `rarfile` library
- Zstandard support via `zstandard` library (added to dependencies)
- Graceful error handling for corrupted or unsupported archives

### 3. Recursive Archive Handling
- Configurable `max_archive_depth` limit (default: 0, disabled)
- Archives within archives are detected and recursively scanned
- Depth tracking across nested levels prevents infinite recursion
- Virtual path construction using "!" separator (e.g., `archive.zip!inner/file.txt`)
- Logging when depth limits are reached

### 4. Integration with File Source
- Enhanced `sources/file.py` to detect and delegate archives to `archive_source`
- Archive scanning is transparent during file/directory traversal
- Fallback to regular file handling if archive scanning fails
- Allowlist filtering applies to archive contents via virtual paths
- `dir_source.py` already passes `max_archive_depth` to File instances

### 5. Enhanced Archive Detection
- Updated `sources/common.py` `is_archive()` function to support magic byte detection
- Combined extension and content-based detection for improved accuracy

### 6. Dependencies
- Added `zstandard ^0.22.0` to `pyproject.toml` for .zst archive support
- All other dependencies (py7zr, rarfile) were already present

### 7. Test Data
- Copied `testdata/repos/archives/` from source repository for integration testing

## Testing

Created comprehensive test suite with three test files:

### test_archive_source.py (20 tests)
- Archive type detection (extensions and magic bytes)
- Extraction for all supported formats (ZIP, TAR, 7Z, GZIP, etc.)
- Recursive extraction with depth enforcement
- Virtual path construction with "!" separator
- Allowlist application to archive contents
- Error handling for corrupted archives

### test_archive_integration.py (4 tests - NEW)
- End-to-end secret detection in ZIP archives
- Scanning multiple archive formats in testdata/archives/
- Integration test for testdata/repos/archives/ directory
- Nested archive scanning with depth limits

### test_dir_source.py (1 new test added)
- Directory scanning with archive support enabled
- Verification that archives are transparently scanned during directory traversal
- Virtual path verification with "!" separator

**Test Results:**
- All 25 archive-specific tests pass successfully
- All 57 source module tests pass (no regressions)

## Technical Details
- Archive entries are processed in-memory when possible for efficiency
- Temporary files used for seekable-only formats (7z, RAR) when source is non-seekable
- Proper error handling and structured logging for corrupted or unsupported archives
- Async implementation for consistency with other source providers
- Architecture consistent with Go implementation's design patterns

## Definition of Done

All acceptance criteria met:
✅ Archive detection works (extensions and magic bytes for all supported formats)
✅ Archive extraction works (handlers for each format with proper error handling)
✅ Recursive extraction works (depth limits enforced, accurate tracking)
✅ Integration with Files source works (transparent archive scanning during traversal)
✅ Tests pass (all testdata archives scan successfully with proper virtual paths)
✅ Secrets in archives are detected (verified with end-to-end integration tests)
✅ testdata/repos/archives/ integration test passes (copied and tested)

Milestone No.: 1 Task No.: 11 Task ID: 41
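
The magic-byte detection, "!" virtual paths, and depth limit described in this commit can be illustrated with a small stdlib-only sketch. It is not the code in `sources/archive_source.py`: the names `sniff_archive_type` and `walk_zip` are hypothetical, only ZIP nesting is handled, and the depth convention (the outermost archive counts as depth 1) is an assumption made for the example.

```python
import io
import zipfile
from typing import Iterator, Optional, Tuple

# Well-known leading signatures for a few archive/compression formats.
MAGIC_SIGNATURES = {
    b"PK\x03\x04": "zip",
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"7z\xbc\xaf\x27\x1c": "7z",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def sniff_archive_type(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature."""
    for magic, name in MAGIC_SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def walk_zip(
    data: bytes, virtual_path: str, depth: int, max_depth: int
) -> Iterator[Tuple[str, bytes]]:
    """Yield (virtual_path, content) pairs, descending into nested ZIPs.

    Assumption for this sketch: the outermost archive is depth 1, and a
    nested archive is only opened while depth < max_depth. Otherwise the
    nested archive is yielded as an opaque file.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            content = zf.read(info)
            inner_path = f"{virtual_path}!{info.filename}"
            is_nested_zip = sniff_archive_type(content[:8]) == "zip"
            if is_nested_zip and depth < max_depth:
                yield from walk_zip(content, inner_path, depth + 1, max_depth)
            else:
                yield inner_path, content
```

Detecting the nested archive from its content rather than its file name is what lets the walk find archives that carry no recognizable extension.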
…raction: merge from gitleaks-milestone_1-task_11-620bff
…ource protocol integration, and comprehensive testing
…ource protocol integration, and comprehensive testing
…515c

Resolved merge conflicts by accepting incoming changes from shared branch for:
- Reporting module updates (Task 10 - template reporter)
- Archive scanning updates (Task 11 - archive source)
- Test data for archives

Task 7 (detection engine) changes remain intact.
…ource protocol integration, and comprehensive testing

Implemented the core detection engine for gitleaks Python migration, porting pattern matching, keyword prefiltering, allowlist filtering, finding generation, and reader utilities from Go to Python. Fixed critical architectural issue with Source protocol integration. Successfully merged with changes from Tasks 10 (template reporter) and 11 (archive scanning).

## Key Components Implemented

### 1. Detection Engine (detector/engine.py) - 86% coverage
- Main Detector class with async fragment processing
- Aho-Corasick keyword prefiltering using pyahocorasick library
- Regex pattern matching with capture group support
- Shannon entropy calculation for secret randomness validation
- Allowlist filtering with AND/OR match conditions
- Fingerprint-based deduplication
- Support for gitleaks:allow inline comments
- Fast-path optimization for commit/path-only allowlists
- max_archive_depth attribute for archive scanning control
- **Proper Source protocol integration using callback-based fragments()**

### 2. Location Tracking (detector/location.py) - 94% coverage
- Precise line and column position calculation
- Newline index computation for multi-line content
- Line extraction from raw content
- Compatible with Go implementation's line/column semantics

### 3. Reader Utilities (detector/reader.py) - 96% coverage
- **detect_reader()**: Synchronous detection from binary streams
  - Wraps sources.File for backward compatibility
  - Handles buffer size configuration (KB)
  - Collects all findings and returns as list
- **stream_detect_reader()**: Streaming detection with async iteration
  - Returns tuple of (findings_iterator, error_future)
  - Yields findings as they're detected
  - Proper error handling and EOF management
- Both functions properly handle:
  - BytesIO and custom readers
  - Buffer boundary conditions
  - EOF with data scenarios
  - Error propagation

### 4. Utility Functions (detector/utils.py) - 49% coverage
- Shannon entropy calculation (matches Go implementation)
- Finding filtering and redaction
- SCM deep link generation for multiple platforms:
  - GitHub, GitLab, Bitbucket, Azure DevOps, Gitea
- Generic vs specific finding deduplication

## Critical Bug Fixes

### Source Protocol Integration (CRITICAL FIX)
Fixed architectural bug where `detect_source()` incorrectly tried to iterate over `source.fragments()` as an async iterator. The correct implementation now:
- Uses callback-based approach as defined in Source protocol
- Creates async tasks within sync callback for concurrent fragment processing
- Properly manages task lifecycle with asyncio.gather()
- Handles errors returned by the callback appropriately

This fix ensures the detector can work with all Source implementations (File, Directory, Git, etc.).

### AND/OR Allowlist Logic
Fixed critical bug in fast-path allowlist checking where AND conditions with multiple criteria types (e.g., path + regex) were incorrectly evaluated. The fix ensures:
- Fast-path checks skip AND conditions with regex/stopwords
- All criteria must be evaluated together for AND conditions
- OR conditions can still use fast-path optimization

### Secret Extraction
Fixed bug where regex patterns without capture groups returned empty strings instead of the full match.

### Location Calculation
Corrected column position calculation to match Go implementation's newline index semantics.

### Reader Implementation
Implemented deprecated reader functions by:
- Wrapping sources.File with proper buffer sizing
- Using asyncio.create_task() to handle async detection in sync callbacks
- Properly managing task lifecycle and error propagation
- Supporting both collect-all and streaming patterns

## Comprehensive Test Coverage

Added 75 tests across four test files achieving excellent coverage:

### test_engine.py (31 tests) - 86% coverage
- Detector initialization
- Keyword prefiltering (case-insensitive, extraction)
- Pattern matching (simple, capture groups, multiline)
- Allowlist filtering (path, regex, stopwords)
- **AND/OR condition tests (9 comprehensive test cases)**:
  - AND with all criteria matching
  - AND with partial criteria (should not filter)
  - OR with any criteria matching
  - OR with no criteria matching
  - AND with commit+path+regex
  - AND fast-path for commit+path only
  - AND with stopwords
  - OR with mixed criteria
  - Rule-specific AND allowlists
- gitleaks:allow comment handling
- Finding metadata and fingerprint generation
- Path-only rules
- **Integration tests (4 tests)**:
  - detect_source() with File source
  - detect_source() with on_finding callback
  - detect_source() with allowlist filtering
  - detect_source() with empty source

### test_location.py (13 tests) - 94% coverage
- Newline index computation
- Single/multi-line location tracking
- Column position calculation
- Line extraction

### test_reader.py (8 tests) - 96% coverage
- **TestDetectReader** (2 tests):
  - BytesIO reader handling
  - EOF with data scenarios
- **TestStreamDetectReader** (6 tests):
  - Single and multiple secret streaming
  - Empty reader handling
  - Mock reader with EOF
  - Secret split across buffer boundaries
  - Error handling and propagation

### test_utils.py (23 tests) - 49% coverage
- Shannon entropy calculation
- Finding filtering and deduplication
- Redaction with various lengths
- SCM link generation for all platforms
- Archive path and URL encoding handling

## Dependencies Added
- pyahocorasick ^2.0: Efficient keyword prefiltering using Aho-Corasick automaton

## Test Results

All 75 tests pass:
- Engine: 31 passed (86% coverage) - includes 4 integration tests
- Location: 13 passed (94% coverage)
- Reader: 8 passed (96% coverage)
- Utils: 23 passed (49% coverage - uncovered lines are platform-specific edge cases)

## Merge Resolution

Successfully merged changes from shared branch including:
- Task 10: Template reporter implementation
- Task 11: Archive scanning support

All merge conflicts resolved by accepting incoming changes for reporting and sources modules while preserving Task 7 detection engine implementation.

The detection engine is now fully functional with proper Source protocol integration and meets all definition of done criteria including reader implementation, testing, and end-to-end integration verification.

Milestone No.: 1 Task No.: 7 Task ID: 37
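
To make the callback-based Source integration concrete, here is a minimal sketch of the pattern this commit describes: a synchronous callback schedules one asyncio task per fragment, and the tasks are collected with `asyncio.gather()`. It also shows the full-match fallback for patterns without capture groups and a Shannon entropy gate. `ListSource`, `scan_fragment`, the `(fragment, err)` callback signature, and the `min_entropy` default are all assumptions made for illustration; the real `Detector` and Source protocol may differ.

```python
import asyncio
import math
import re
from collections import Counter
from typing import Callable, List, Optional


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character (0.0 for empty input)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


class ListSource:
    """Hypothetical stand-in for a Source: pushes fragments to a callback."""

    def __init__(self, fragments: List[str]) -> None:
        self._fragments = fragments

    async def fragments(self, callback: Callable) -> Optional[Exception]:
        for fragment in self._fragments:
            err = callback(fragment, None)
            if err is not None:
                return err  # stop early if the callback reports an error
        return None


async def scan_fragment(fragment: str, pattern: re.Pattern, min_entropy: float) -> List[str]:
    secrets = []
    for match in pattern.finditer(fragment):
        # Use the first capture group if present, else fall back to the full match.
        group = match.group(1) if pattern.groups else None
        secret = group or match.group(0)
        if shannon_entropy(secret) >= min_entropy:
            secrets.append(secret)
    return secrets


async def detect_source(source: ListSource, pattern: re.Pattern, min_entropy: float = 3.0) -> List[str]:
    tasks: List[asyncio.Task] = []

    def on_fragment(fragment: str, err: Optional[Exception]) -> Optional[Exception]:
        if err is not None:
            return err
        # Sync callback, but it runs inside the event loop, so a task can be scheduled.
        tasks.append(asyncio.create_task(scan_fragment(fragment, pattern, min_entropy)))
        return None

    err = await source.fragments(on_fragment)
    results = await asyncio.gather(*tasks)  # results keep task order
    if err is not None:
        raise err
    return [secret for per_fragment in results for secret in per_fragment]
```

A caller would run it with something like `asyncio.run(detect_source(ListSource(["..."]), re.compile(r"AKIA[0-9A-Z]{16}")))`. Scheduling tasks from the callback keeps the Source interface synchronous while still letting fragments be processed concurrently.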
…ource protocol integration, and comprehensive testing: merge from gitleaks-milestone_1-task_7-b3515c
…dir` command

Complete the integration of all components to create a working end-to-end detection flow for the `gitleaks dir` command. The implementation wires together configuration loading, detector initialization, directory scanning, finding detection, reporting, and exit code handling.

## Changes Made

### Core Integration (dir.py)
- Instantiate Detector with loaded configuration
- Configure detector with CLI flags (verbose, redact, max_target_megabytes, etc.)
- Create Files source with appropriate parameters
- Execute async detection pipeline via detector.detect_source()
- Implement error handling for partial scans
- Display scan summary with byte counts and duration
- Generate reports in multiple formats (JSON, CSV, SARIF, JUnit)
- Exit with appropriate codes (0 for clean, 1 for errors, ctx.exit_code for findings)

### Bug Fixes

1. **Size Limit Check (detector/engine.py)**
   - Fixed bug where max_target_megabytes=0 was incorrectly blocking all files
   - Changed condition to only enforce limit when max_target_megabytes > 0
   - This allows default behavior of "no limit" when set to 0

2. **Verbose Redaction (detector/engine.py)**
   - Added redaction support for verbose output
   - Create deep copy of finding before applying redaction to printed output
   - Preserves unredacted findings in reports while redacting console output

3. **CLI Redact Option (cli/common.py)**
   - Fixed redact option to accept explicit values (e.g., --redact=50)
   - Set is_flag=False to allow both flag usage (--redact) and value usage (--redact=20)
   - Maintains backward compatibility with flag_value=100

4. **SARIF Reporter (reporting/sarif_reporter.py)**
   - Fixed AttributeError where code tried to access rule.rule_id
   - Corrected to use rule.id which matches the config.Rule model
   - SARIF reports now generate successfully with all rule metadata

### Test Enhancements

1. **End-to-End Integration Test (tests/cli/test_dir.py)**
   - Added comprehensive `test_dir_command_with_testdata()` function
   - Scans testdata/repos/nogit/ and verifies JSON output
   - Validates exit code is 1 when leaks are found
   - Verifies exactly 2 findings for aws-access-token rule
   - Confirms files, secrets, line numbers, and metadata are correct
   - Checks fingerprints, entropy, descriptions, and empty commit fields

2. **Fixed Test Paths (tests/sources/test_dir_integration.py)**
   - Replaced hardcoded source repository paths (/l2l/src/gitleaks/...)
   - Now uses relative paths computed from test file location
   - Tests use destination repository's testdata: repo_root / "testdata" / "repos" / "nogit"
   - Ensures tests work standalone without source repository

### Test Data (testdata/repos/nogit/)
Added minimal test data files required for end-to-end testing:
- api.go - Test file containing AWS secret token
- main.go - Test file containing AWS secret token
- .gitleaksignore - Test ignore patterns
- .env.prod - Test environment file

These files enable the integration tests to verify secret detection without requiring the full source repository testdata.

### Manual Testing
All components tested with testdata/repos/nogit/:
- ✓ Basic detection finds AWS access tokens in test files
- ✓ Verbose output displays findings to console
- ✓ Redaction works (full --redact and partial --redact=50)
- ✓ JSON report output validated
- ✓ CSV report output validated
- ✓ SARIF report output validated with all 222 rules
- ✓ JUnit report output validated
- ✓ Report to stdout (-r -) works correctly
- ✓ Exit codes: 0 for no leaks, ctx.exit_code for leaks found, 1 for errors
- ✓ Empty directory handling returns exit code 0

### Test Suite Results
All 10 tests pass:
- 8 tests in tests/cli/test_dir.py (including new end-to-end test)
- 2 tests in tests/sources/test_dir_integration.py
- Test coverage: 51% overall (73% for detector/engine.py, 77% for cli/dir.py)

## Architecture

The implementation follows the pipeline architecture:
1. Load configuration (init_config)
2. Initialize detector with config and CLI flags
3. Create directory source (Files)
4. Run async detection (detector.detect_source)
5. Display summary and findings (if verbose)
6. Write report (if requested)
7. Exit with appropriate code

The solution maintains separation of concerns across modules and preserves the async/await patterns for concurrent file scanning.

## File Count

Total files changed: 10 (well within the 100 file limit)
- 6 modified source/test files
- 4 new testdata files

Milestone No.: 1 Task No.: 8 Task ID: 38
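
Two of the bug fixes in this commit, the size-limit check and the verbose redaction, are easy to show in isolation. The sketch below is an illustration under stated assumptions, not the code in `detector/engine.py`: the `Finding` dataclass, the function names, the decimal megabyte (1,000,000 bytes), and the exact masking format are all hypothetical.

```python
import copy
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical minimal finding; the real model has many more fields."""
    rule_id: str
    secret: str


def exceeds_size_limit(size_bytes: int, max_target_megabytes: int) -> bool:
    """Enforce the limit only when it is positive; 0 means "no limit".

    Assumes a decimal megabyte (1,000,000 bytes) for illustration.
    """
    return max_target_megabytes > 0 and size_bytes > max_target_megabytes * 1_000_000


def redacted_copy(finding: Finding, percent: int) -> Finding:
    """Return a redacted deep copy for console output.

    The original finding is left untouched so that reports can still
    carry the unredacted secret.
    """
    percent = max(0, min(100, percent))
    shown = copy.deepcopy(finding)
    if percent == 0:
        return shown
    if percent == 100:
        shown.secret = "REDACTED"
    else:
        # Keep the leading (100 - percent)% of the secret and mask the rest.
        keep = len(finding.secret) * (100 - percent) // 100
        shown.secret = finding.secret[:keep] + "..."
    return shown
```

Working on a copy is the point of the fix: redacting the finding in place would also redact it in any report written afterwards.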
…dir` command: merge from gitleaks-milestone_1-task_8-895b49

Complete the integration of all components to create a working end-to-end detection flow for the `gitleaks dir` command. The implementation wires together configuration loading, detector initialization, directory scanning, finding detection, reporting, and exit code handling.

## Changes Made

### Core Integration (dir.py)
- Instantiate Detector with loaded configuration
- Configure detector with CLI flags (verbose, redact, max_target_megabytes, etc.)
- Create Files source with appropriate parameters
- Execute async detection pipeline via detector.detect_source()
- Implement error handling for partial scans
- Display scan summary with byte counts and duration
- Generate reports in multiple formats (JSON, CSV, SARIF, JUnit)
- Exit with appropriate codes (0 for clean, 1 for errors, ctx.exit_code for findings)

### Bug Fixes
1. **Size Limit Check (detector/engine.py)**
   - Fixed bug where max_target_megabytes=0 was incorrectly blocking all files
   - Changed condition to only enforce limit when max_target_megabytes > 0
   - This allows default behavior of "no limit" when set to 0
2. **Verbose Redaction (detector/engine.py)**
   - Added redaction support for verbose output
   - Create deep copy of finding before applying redaction to printed output
   - Preserves unredacted findings in reports while redacting console output
3. **CLI Redact Option (cli/common.py)**
   - Fixed redact option to accept explicit values (e.g., --redact=50)
   - Set is_flag=False to allow both flag usage (--redact) and value usage (--redact=20)
   - Maintains backward compatibility with flag_value=100
4. **SARIF Reporter (reporting/sarif_reporter.py)**
   - Fixed AttributeError where code tried to access rule.rule_id
   - Corrected to use rule.id which matches the config.Rule model
   - SARIF reports now generate successfully with all rule metadata

### Test Enhancements
1. **End-to-End Integration Test (tests/cli/test_dir.py)**
   - Added comprehensive `test_dir_command_with_testdata()` function
   - Scans testdata/repos/nogit/ and verifies JSON output
   - Validates exit code is 1 when leaks are found
   - Verifies exactly 2 findings for aws-access-token rule
   - Confirms files, secrets, line numbers, and metadata are correct
   - Checks fingerprints, entropy, descriptions, and empty commit fields
2. **Fixed Test Paths (tests/sources/test_dir_integration.py)**
   - Replaced hardcoded source repository paths (/l2l/src/gitleaks/...)
   - Now uses relative paths computed from test file location
   - Tests use destination repository's testdata: repo_root / "testdata" / "repos" / "nogit"
   - Ensures tests work standalone without source repository

### Test Data (testdata/repos/nogit/)
Added minimal test data files required for end-to-end testing:
- api.go - Test file containing AWS secret token
- main.go - Test file containing AWS secret token
- .gitleaksignore - Test ignore patterns
- .env.prod - Test environment file

These files enable the integration tests to verify secret detection without requiring the full source repository testdata.

### Manual Testing
All components tested with testdata/repos/nogit/:
- ✓ Basic detection finds AWS access tokens in test files
- ✓ Verbose output displays findings to console
- ✓ Redaction works (full --redact and partial --redact=50)
- ✓ JSON report output validated
- ✓ CSV report output validated
- ✓ SARIF report output validated with all 222 rules
- ✓ JUnit report output validated
- ✓ Report to stdout (-r -) works correctly
- ✓ Exit codes: 0 for no leaks, ctx.exit_code for leaks found, 1 for errors
- ✓ Empty directory handling returns exit code 0

### Test Suite Results
All 10 tests pass:
- 8 tests in tests/cli/test_dir.py (including new end-to-end test)
- 2 tests in tests/sources/test_dir_integration.py
- Test coverage: 51% overall (73% for detector/engine.py, 77% for cli/dir.py)

## Architecture
The implementation follows the pipeline architecture:
1. Load configuration (init_config)
2. Initialize detector with config and CLI flags
3. Create directory source (Files)
4. Run async detection (detector.detect_source)
5. Display summary and findings (if verbose)
6. Write report (if requested)
7. Exit with appropriate code

The solution maintains separation of concerns across modules and preserves the async/await patterns for concurrent file scanning.

## File Count
Total files changed: 10 (well within the 100 file limit)
- 6 modified source/test files
- 4 new testdata files

Milestone No.: 1
Task No.: 8
Task ID: 38
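The first two bug fixes in this commit (the size limit check and verbose redaction) can be illustrated with a minimal sketch. This is not the code from detector/engine.py: the helper names, the reduced `Finding` class, the decimal megabyte constant, and the `"REDACTED"` placeholder are all assumptions made for illustration, and only full redaction is shown.

```python
import copy
from dataclasses import dataclass

# Assumption: a megabyte is counted as 1,000,000 bytes.
BYTES_PER_MEGABYTE = 1_000_000


def exceeds_size_limit(size_bytes: int, max_target_megabytes: int) -> bool:
    """Return True only when a positive limit is set and the file is larger."""
    if max_target_megabytes <= 0:
        # 0 means "no limit", so nothing is ever skipped.
        return False
    return size_bytes > max_target_megabytes * BYTES_PER_MEGABYTE


@dataclass
class Finding:
    """Hypothetical finding reduced to the two fields that redaction touches."""
    secret: str
    match: str


def redacted_copy(finding: Finding) -> Finding:
    """Return a fully redacted deep copy for console output.

    The original object is left untouched so reports keep the real values.
    """
    shown = copy.deepcopy(finding)
    shown.match = shown.match.replace(shown.secret, "REDACTED")
    shown.secret = "REDACTED"
    return shown
```

Copying before redacting is what keeps the console output and the report independent: the printed object and the reported object no longer share state.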
This commit implements baseline comparison and .gitleaksignore file support to allow
users to suppress known findings and focus on new secrets.
Key Changes:
1. Created baseline.py module (src/gitleaks/detector/baseline.py):
- load_baseline(): Loads baseline findings from JSON report files
- is_new_finding(): Compares findings against baseline with proper field matching
- Supports redaction-aware comparison (skips secret/match when redacted)
- Tags are intentionally ignored in comparison (as per Go implementation)
2. Enhanced Detector engine (src/gitleaks/detector/engine.py):
- Added baseline and gitleaks_ignore storage to Detector.__init__()
- Implemented add_baseline(): Loads baseline report and computes relative path
- Implemented add_gitleaks_ignore(): Parses .gitleaksignore files with:
* Support for global fingerprints (file:rule-id:line)
* Support for commit fingerprints (commit:file:rule-id:line)
* Comment and empty line handling
* Windows path normalization (backslash to forward slash)
- Added _should_suppress_finding(): Unified suppression logic that checks:
* Allowlist matching (existing functionality)
* Gitleaksignore fingerprint matching (global and commit-specific)
* Baseline comparison for non-new findings
- Integrated suppression into finding detection pipeline
3. Added CLI support (src/gitleaks/cli/common.py):
- Created setup_detector(): Configures baseline and gitleaksignore for detector
- Handles baseline-path flag to load baseline report
- Handles gitleaks-ignore-path flag with multiple resolution strategies:
* Direct file path
* Directory containing .gitleaksignore
* Auto-discovery in source directory
- Provides appropriate error handling and logging
4. Updated dir command (src/gitleaks/cli/dir.py):
- Imported setup_detector function
- Integrated setup_detector() call after detector initialization
- Ensures baseline and gitleaksignore are configured before scanning
5. Comprehensive test coverage:
- tests/test_baseline.py: 11 unit tests for baseline loading and comparison
- tests/detector/test_baseline_integration.py: 11 integration tests for:
* Baseline loading and filtering in detector
* Gitleaksignore loading with various formats
* Fingerprint matching and suppression
* Windows path normalization
* Comment and empty line handling
All 315 tests pass with 78% overall code coverage. The baseline module itself
has 93% coverage.
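As a rough illustration of the .gitleaksignore handling listed above, the sketch below parses fingerprints and checks a finding against them. It is not the code from engine.py: the function names `parse_gitleaks_ignore` and `is_ignored` and their signatures are hypothetical. Only the fingerprint formats, the comment and empty line handling, and the backslash normalization come from the commit message.

```python
def parse_gitleaks_ignore(text: str) -> set[str]:
    """Collect fingerprints, skipping comments and empty lines."""
    fingerprints: set[str] = set()
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Normalize Windows paths so fingerprints compare across platforms.
        fingerprints.add(line.replace("\\", "/"))
    return fingerprints


def is_ignored(
    fingerprints: set[str],
    file: str,
    rule_id: str,
    line: int,
    commit: str = "",
) -> bool:
    """Check the global form first, then the commit-specific form."""
    global_fp = f"{file.replace(chr(92), '/')}:{rule_id}:{line}"
    if global_fp in fingerprints:
        return True
    return bool(commit) and f"{commit}:{global_fp}" in fingerprints
```

Because parsing returns a set, loading several ignore files is a plain set union, which matches the additive behavior described in this commit.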
Implementation Notes:
- Fingerprint format matches Go implementation: "file:rule-id:line" (global)
or "commit:file:rule-id:line" (commit-specific)
- Baseline comparison skips tags field (per Go behavior - updated tags don't
make a finding "new")
- When redaction is enabled (redact > 0), secret and match fields are not
compared in baseline matching
- Multiple .gitleaksignore files can be loaded (additive behavior)
- Windows backslash paths are normalized to forward slashes for cross-platform
compatibility
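The baseline rules in these notes (tags ignored, secret and match skipped when redaction is on) could be sketched as follows. This is a simplified illustration, not the contents of baseline.py: the `Finding` fields shown are a hypothetical subset, and the real comparison presumably checks more fields than these.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    """Hypothetical subset of a finding's fields."""
    rule_id: str
    file: str
    start_line: int
    secret: str = ""
    match: str = ""
    tags: list[str] = field(default_factory=list)


def _same_finding(a: Finding, b: Finding, redact: int) -> bool:
    """Compare two findings; tags are never part of the comparison."""
    same_location = (a.rule_id, a.file, a.start_line) == (
        b.rule_id,
        b.file,
        b.start_line,
    )
    if redact > 0:
        # Redacted values cannot be compared meaningfully, so skip them.
        return same_location
    return same_location and a.secret == b.secret and a.match == b.match


def is_new_finding(finding: Finding, baseline: list[Finding], redact: int = 0) -> bool:
    """A finding is new when no baseline entry matches it."""
    return not any(_same_finding(finding, known, redact) for known in baseline)
```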
Milestone No.: 1
Task No.: 9
Task ID: 39
…rom gitleaks-milestone_1-task_9-760872
a2510b4 to 19cc037
Status
Partial completion. Only Task 1 of the milestone was completed. The milestone specification describes a comprehensive foundation including configuration management, directory scanning, detection engine, and reporting—none of these were implemented.
`gitleaks dir`
Feature Overview
This partial milestone establishes the foundational Python project structure for migrating gitleaks from Go to Python 3.11+. The following capabilities are now available:
- `poetry install` creates a working Python environment with all dependencies
- `gitleaks --help`, `gitleaks --version`, `python -m gitleaks` work correctly
- `--log-level` and `--no-color` flags control logging behavior
What users cannot yet do:
- … (`gitleaks dir` not implemented)
- … `.gitleaks.toml` configuration files
Testing
Automated Testing
15 passing tests in `tests/test_logging.py` covering:
Run tests with:
    cd /l2l/dst/gitleaks
    poetry install
    poetry run pytest -v
Manual Testing
Verify package installation:
Verify CLI works:
Verify module execution:
    poetry run python -m gitleaks --version  # Expected: gitleaks 0.1.0
Architecture
Overview
    graph TB
        subgraph "CLI Layer"
            CLI["cli/__init__.py"]
        end
        subgraph "Core Infrastructure"
            LOG["logging.py"]
            PKG["__init__.py"]
            MAIN["__main__.py"]
        end
        subgraph "Stub Modules (Empty)"
            CONFIG["config/"]
            DETECT["detector/"]
            SRC["sources/"]
            RPT["reporting/"]
        end
        CLI --> LOG
        CLI --> PKG
        MAIN --> CLI
        style CLI fill:#90EE90
        style LOG fill:#90EE90
        style PKG fill:#90EE90
        style MAIN fill:#90EE90
        style CONFIG fill:#FFFF99
        style DETECT fill:#FFFF99
        style SRC fill:#FFFF99
        style RPT fill:#FFFF99
        classDef legend fill:#fff,stroke:#333
        subgraph Legend
            NEW["New (implemented)"]
            STUB["Stub (not implemented)"]
        end
        style NEW fill:#90EE90
        style STUB fill:#FFFF99
Changes
Package Configuration (`pyproject.toml`)
- `src/` layout
- `gitleaks = "gitleaks.cli:main"`
Logging Infrastructure (`src/gitleaks/logging.py`)
- `_add_log_level`, `_rename_event_key` (event→message), `_handle_trace_level`
- `trace()`, `debug()`, `info()`, `warn()`, `error()`, `fatal()`
- `__getattr__`
CLI Foundation (`src/gitleaks/cli/__init__.py`)
- `invoke_without_command=True`
- `--version`/`-v`, `--log-level` (choice of 6 levels), `--no-color`
- `configure_logging()`
Stub Modules
- `config/__init__.py`: Empty, docstring describes TOML/Pydantic purpose
- `detector/__init__.py`: Empty, docstring describes detection engine purpose
- `sources/__init__.py`: Empty, docstring describes source providers purpose
- `reporting/__init__.py`: Empty, docstring describes reporter formats purpose
Design Decisions
src/ Layout Adoption
- `src/gitleaks/` rather than flat `gitleaks/`
structlog over stdlib logging
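To make the processor names from the Logging Infrastructure notes more concrete, here is a minimal sketch of how a structlog-style processor chain behaves. It uses only the standard library: the three-argument processor signature `(logger, method_name, event_dict)` follows structlog's convention, but the function bodies and the `run_chain` helper are assumptions and are not taken from `src/gitleaks/logging.py`.

```python
import json
from typing import Any, Callable

EventDict = dict[str, Any]
Processor = Callable[[Any, str, EventDict], EventDict]


def add_log_level(logger: Any, method_name: str, event_dict: EventDict) -> EventDict:
    """Record which logging method was called as the level."""
    event_dict["level"] = method_name
    return event_dict


def rename_event_key(logger: Any, method_name: str, event_dict: EventDict) -> EventDict:
    """Rename the "event" key to "message" (event→message)."""
    event_dict["message"] = event_dict.pop("event")
    return event_dict


def run_chain(
    processors: list[Processor], method_name: str, event: str, **fields: Any
) -> str:
    """Hypothetical stand-in for the library: apply each processor, emit JSON."""
    event_dict: EventDict = {"event": event, **fields}
    for processor in processors:
        event_dict = processor(None, method_name, event_dict)
    return json.dumps(event_dict)
```

In a real structlog setup, functions of this shape would be passed in the `processors` list of `structlog.configure()` rather than run through a hand-written loop.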
Click for CLI over argparse
Dependency Pre-declaration
Suggested Order of Review
1. `pyproject.toml` - Project configuration, dependencies, tool settings
2. `src/gitleaks/__init__.py` - Package root with version
3. `src/gitleaks/logging.py` - Core logging infrastructure (most substantial implementation)
4. `src/gitleaks/cli/__init__.py` - CLI entry point using logging
5. `src/gitleaks/__main__.py` - Module execution support
6. `tests/test_logging.py` - Test coverage for logging
7. `.gitignore` - Standard Python ignores
8. Stub modules (`config/`, `detector/`, `sources/`, `reporting/`) - Empty placeholders
Challenges
The milestone specification describes a comprehensive foundation including:
- … `.gitleaks.toml`
Only Task 1 (project structure and logging) was completed. The remaining tasks were not implemented, leaving the gitleaks CLI unable to perform its core function of scanning for secrets.
Remaining work required:
- `config/models.py` with Pydantic models for rules, allowlists, and settings
- `config/loader.py` for TOML parsing with `tomllib`
- `cli/dir.py` with Click subcommand for directory scanning
- `sources/dir_source.py` for filesystem walking
- `sources/fragment.py` for content unit representation
- `detector/engine.py` for regex matching orchestration
- `reporting/finding.py` for the Finding data model
- `reporting/json_reporter.py` for JSON output