Add warning for large GitHub repositories with truncated file lists #82

Merged
simonw merged 6 commits into main from
claude/handle-large-repo-warning-011CUNPVCEo76Q5UFhpdUSfC
Oct 22, 2025
Conversation

@simonw
Owner

@simonw simonw commented Oct 22, 2025

Summary

This PR improves the GitHub repository analyzer in sloccount.html to properly handle large repositories by:

  • Fetching the repository's default branch name from the GitHub API
  • Checking the truncated property in the API response
  • Displaying a warning with a zip download link when the file list is truncated

Changes

1. Fetch Default Branch Name

The code now queries the GitHub API to get the repository's actual default branch (e.g., "main" or "master") instead of hardcoding it. This ensures the zip download URL and the tree API call both use the correct branch.

2. Check truncated Property

Instead of using an arbitrary 500-file limit, the code now checks the boolean truncated property returned by the GitHub Trees API. This property indicates whether GitHub had to truncate the response because of repository size (typically around 100,000 files).
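Changes 1 and 2 together can be sketched as follows. This is a minimal sketch, not the actual sloccount.html code; fetchRepoTree and its error handling are illustrative names only:

```javascript
// Sketch of the branch lookup plus truncation check (illustrative, not the PR's code).
async function fetchRepoTree(owner, repo) {
    // 1. Ask the GitHub API for the repository's actual default branch
    const repoResp = await fetch(`https://api.github.com/repos/${owner}/${repo}`);
    const { default_branch: defaultBranch } = await repoResp.json();

    // 2. Fetch the recursive tree for that branch
    const treeResp = await fetch(
        `https://api.github.com/repos/${owner}/${repo}/git/trees/${defaultBranch}?recursive=1`
    );
    const treeData = await treeResp.json();

    // 3. GitHub sets truncated: true when the tree is too large to return in full
    if (treeData.truncated) {
        const zipUrl = `https://github.com/${owner}/${repo}/archive/refs/heads/${defaultBranch}.zip`;
        throw new Error(`File list truncated; download ${zipUrl} and use the Upload ZIP tab.`);
    }
    return { defaultBranch, tree: treeData.tree };
}
```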

3. Warning Message

When truncated: true is detected, a prominent warning message is displayed:

⚠️ This repository is too large and the GitHub API truncated the file list. 
For complete analysis, download the full repository: 
https://github.com/owner/repo/archive/refs/heads/[branch].zip 
and use the "Upload ZIP" tab above.

Benefits

  • Accurate detection: No false warnings for repositories with >500 but <100k files
  • Better user guidance: Clear instructions on how to analyze the complete repository
  • Correct URLs: Zip download URLs use the actual default branch name

🤖 Generated with Claude Code

claude and others added 2 commits October 22, 2025 14:24
When the GitHub repository fetcher encounters more than 500 files, it now:
- Fetches the default branch name from the GitHub API
- Displays a prominent warning message to the user
- Provides a direct zip download URL using the correct branch name
- Encourages users to download the zip and use the Upload ZIP tab

This helps users understand the limitation and provides a clear path to
analyze the complete repository.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changed the warning logic to check the `truncated` property returned by
the GitHub API instead of using an arbitrary 500-file limit. The GitHub
Trees API can return many thousands of files and sets `truncated: true`
only when the repository is genuinely too large.

This fixes false warnings for repositories with >500 but <~100k files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Contributor

claude bot commented Oct 22, 2025

Claude finished @simonw's task


Code Review: Add warning for large GitHub repositories with truncated file lists

I've completed a thorough review of this PR. Overall, this is a solid improvement that addresses an important usability issue. However, there are a few bugs and areas for improvement.

Strengths

  • Correct approach: Using GitHub's truncated property instead of arbitrary file count limits is much more accurate
  • Good user experience: Clear warning message with actionable guidance
  • Proper API usage: Fetching the actual default branch name instead of hardcoding
  • Non-breaking: Changes are backward compatible

🐛 Critical Bug

Variable name mismatch at line 913:

for (let i = 0; i < filesToFetch.length; i += batchSize) {
    const batch = filesToFetch.slice(i, i + batchSize);

The variable filesToFetch is undefined. It should be codeFiles (defined at line 899).

🔧 Issues & Recommendations

1. Performance - Rate Limiting

  • The current implementation fetches files sequentially in small batches (10 files)
  • For large repos, this could trigger GitHub's rate limiting (5000 requests/hour)
  • Consider increasing batch size or adding exponential backoff
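A hypothetical backoff helper along the lines the review suggests (not code from the PR; the function name, defaults, and error message are assumptions):

```javascript
// Retry a fetch with exponential backoff when GitHub signals rate limiting (HTTP 403).
async function fetchWithBackoff(url, maxRetries = 3, baseDelayMs = 1000) {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        const response = await fetch(url);
        if (response.status !== 403) return response; // anything but 403 passes through
        if (attempt === maxRetries) {
            const resetTime = response.headers.get('X-RateLimit-Reset');
            throw new Error(`GitHub API rate limit exceeded. Resets at ${new Date(resetTime * 1000)}`);
        }
        // Wait 1s, 2s, 4s, ... before retrying
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
}
```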

2. Error Handling

  • No specific handling for GitHub API rate limits during file fetching
  • Silent failures in batch processing could lead to incomplete analysis
  • The warning timeout (setTimeout(..., 5000)) blocks the UI unnecessarily

3. Code Quality

  • Magic number 5000 for warning display timeout should be a constant
  • The regex pattern for code extensions is duplicated between GitHub and ZIP analysis
  • Long function analyzeGitHubRepo() (130+ lines) could be split

4. Security Considerations

  • Using raw.githubusercontent.com is appropriate and secure
  • API calls are read-only and safe
  • No injection vulnerabilities in the string concatenation for URLs

5. User Experience

  • The warning message could be more prominent (maybe use a different status type)
  • No indication of progress during the warning delay period
  • Missing user feedback if API rate limit is hit

🧪 Test Coverage

  • No automated tests for the new functionality
  • Manual testing mentioned in PR description is good but not comprehensive
  • Consider adding unit tests for the API response parsing logic

💡 Suggested Improvements

// 1. Fix the variable name bug
const filesToFetch = codeFiles; // Add this line before the batch loop

// 2. Add rate limit handling
if (response.status === 403) {
    const resetTime = response.headers.get('X-RateLimit-Reset');
    throw new Error(`GitHub API rate limit exceeded. Resets at ${new Date(resetTime * 1000)}`);
}

// 3. Make constants configurable
const WARNING_DISPLAY_DURATION = 5000;
const BATCH_SIZE = 10;

// 4. Extract code extension regex to avoid duplication
const CODE_EXTENSIONS = /\.(c|h|cpp|cc|cxx|hpp|py|java|js|ts|rb|pl|pm|php|go|rs|cs|sh|bash|html|css|sql|r|swift|kt|kts|scala|lua|vim|dart|groovy|ex|exs|jl|fs|fsx)$/i;

📊 Overall Assessment

This PR significantly improves the accuracy of large repository detection and provides much better user guidance. The core implementation is sound, but the critical variable name bug needs to be fixed before merging.

Recommendation: Fix the filesToFetch bug and consider the performance improvements for rate limiting. The other issues are minor and could be addressed in follow-up PRs.


Fixed bug where filesToFetch was referenced but not defined after
refactoring. Changed all references to use codeFiles instead.

Also fixed the raw file URL to use defaultBranch instead of hardcoded
HEAD to ensure files are fetched from the correct branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@claude
Contributor

claude bot commented Oct 22, 2025

Claude finished @simonw's task


Code Review: Add warning for large GitHub repositories with truncated file lists

Todo List:

  • Read and analyze the modified sloccount.html file
  • Review code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive feedback

I've completed a thorough review of this PR. The implementation correctly addresses the large repository issue using GitHub's truncated property instead of arbitrary file count limits. Here's my detailed assessment:

Strengths

  • Correct approach: Using GitHub's truncated property (sloccount.html:890) is much more accurate than hardcoded file count limits
  • Proper API usage: Correctly fetches repository info to get the actual default branch (sloccount.html:869-870)
  • Good user experience: Clear warning message with actionable guidance when truncation is detected (sloccount.html:892)
  • Non-breaking: Changes are backward compatible and don't affect existing functionality

🐛 Critical Issues Found

1. Variable Reference Inconsistency
The code references codeFiles correctly throughout (sloccount.html:899, 913-914), which is properly defined. This appears to be working correctly.

2. Hardcoded Warning Delay

await new Promise(resolve => setTimeout(resolve, 5000)); // Line 894

The 5-second delay blocks the UI unnecessarily. This should either be reduced or made non-blocking.

🔧 Performance Considerations

1. Sequential Batch Processing

for (let i = 0; i < codeFiles.length; i += batchSize) { // Line 913
  • Current batch size of 10 files is conservative but may be slow for large repos
  • Consider increasing to 20-50 files per batch for better performance
  • No exponential backoff for rate limiting errors

2. GitHub API Rate Limiting

  • No specific handling for 403 rate limit responses during file fetching
  • Could benefit from retry logic with exponential backoff

🛡️ Security Assessment

Secure practices observed:

  • Uses raw.githubusercontent.com for file fetching (sloccount.html:918) - appropriate and secure
  • API calls are read-only operations
  • No injection vulnerabilities in URL construction
  • Proper error handling for malformed URLs

📊 Code Quality

Good practices:

  • Clear variable naming and logical flow
  • Proper error handling for API failures
  • Consistent use of async/await pattern

Areas for improvement:

  • Magic numbers (5000ms timeout, batch size 10) should be constants
  • Long function analyzeGitHubRepo() (~130 lines) could be split into smaller functions
  • Code extension regex is duplicated between GitHub and ZIP analysis (lines 898, 1001)

🧪 Test Coverage

Current state: No automated tests for the new functionality
Recommendations:

  • Unit tests for the API response parsing logic
  • Mock tests for GitHub API interactions
  • Integration tests for the truncation warning flow

💡 Specific Improvements

1. Extract constants:

const WARNING_DISPLAY_DURATION = 5000;
const BATCH_SIZE = 10;
const CODE_EXTENSIONS = /\.(c|h|cpp|cc|cxx|hpp|py|java|js|ts|rb|pl|pm|php|go|rs|cs|sh|bash|html|css|sql|r|swift|kt|kts|scala|lua|vim|dart|groovy|ex|exs|jl|fs|fsx)$/i;

2. Add rate limit handling:

if (response.status === 403) {
    const resetTime = response.headers.get('X-RateLimit-Reset');
    throw new Error(`GitHub API rate limit exceeded. Resets at ${new Date(resetTime * 1000)}`);
}

3. Make warning non-blocking:

// Show warning but don't block execution
if (treeData.truncated) {
    const zipUrl = `https://github.com/${owner}/${repo}/archive/refs/heads/${defaultBranch}.zip`;
    showStatus(`⚠️ Repository truncated. Download: ${zipUrl}`, 'error');
    // Continue with partial analysis
}

📈 Overall Assessment

This is a solid improvement that addresses a real usability issue. The core implementation is sound and the approach is correct. The main concerns are around performance optimization and the blocking warning delay.

Recommendation:

  • Approve with minor fixes: The blocking timeout should be addressed
  • Nice to have: Rate limiting improvements and constant extraction
  • Future: Add automated tests for the new functionality

The PR successfully achieves its goals and significantly improves the user experience for large repository analysis.


@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Oct 22, 2025

Deploying tools with Cloudflare Pages

Latest commit: 90716ab
Status: ✅  Deploy successful!
Preview URL: https://dd333883.tools-b1q.pages.dev
Branch Preview URL: https://claude-handle-large-repo-war.tools-b1q.pages.dev


When a repository is too large and GitHub returns truncated results,
now show a permanent error message directing users to download the ZIP
file instead of attempting to fetch incomplete data via the API.

This prevents wasted API calls and makes it clear that the ZIP upload
method is required for very large repositories.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Updated the truncated repository error message to render the ZIP
download URL as a clickable HTML link. Modified showStatus() to
accept an optional isHtml parameter for rendering HTML content.

The link opens in a new tab and uses a blue color (#004085) that
contrasts well with the error message's light pink background.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
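The showStatus() change described in this commit might look roughly like the following. This is a sketch under assumptions: the status element's id, the class names, and the exact signature are guesses, not the real sloccount.html implementation:

```javascript
// Hypothetical sketch of showStatus() with the optional isHtml parameter (details assumed).
function showStatus(message, type = 'info', isHtml = false) {
    const el = document.getElementById('status');
    el.className = `status ${type}`;
    if (isHtml) {
        el.innerHTML = message;   // caller supplies trusted HTML, e.g. the ZIP download link
    } else {
        el.textContent = message; // plain text remains the safe default
    }
}
```

Because innerHTML renders whatever it is given, callers should only pass isHtml = true for strings they construct themselves, never for user-supplied input.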
@claude
Contributor

claude bot commented Oct 22, 2025

Claude finished @simonw's task


Code Review: Add warning for large GitHub repositories with truncated file lists

Todo List:

  • Read and analyze the modified sloccount.html file
  • Review code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Assess security concerns
  • Review test coverage
  • Provide comprehensive feedback

I've completed a thorough review of this PR. The implementation correctly addresses the large repository issue using GitHub's truncated property instead of arbitrary file count limits. Here's my detailed assessment:

Strengths

  • Correct approach: Using GitHub's truncated property (sloccount.html:894) is much more accurate than hardcoded file count limits
  • Proper API usage: Correctly fetches repository info to get the actual default branch (sloccount.html:873-874)
  • Good user experience: Clear warning message with actionable guidance when truncation is detected (sloccount.html:896)
  • Non-breaking: Changes are backward compatible and don't affect existing functionality

🐛 Issues Found

1. Non-blocking warning delay

// Line 896: This creates a confusing user experience
showStatus(`⚠️ This repository is too large...`, 'error', true);
return; // Stop processing - user must use ZIP upload

The implementation correctly stops processing when truncation is detected, which is the right behavior. However, the warning message could be more prominent.

2. Hardcoded batch processing

const batchSize = 10; // Line 914
  • Conservative batch size may be slow for medium-sized repos
  • No exponential backoff for rate limiting errors

🔧 Performance Considerations

1. GitHub API Rate Limiting

  • Current implementation has no specific handling for 403 rate limit responses during file fetching (sloccount.html:918-933)
  • Could benefit from retry logic with exponential backoff
  • Batch size of 10 files is conservative but may be inefficient

2. Sequential Processing

for (let i = 0; i < codeFiles.length; i += batchSize) { // Line 916

The batching approach is good for avoiding browser overwhelm, but could be optimized for better throughput.
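One possible shape for that optimization is to parallelize fetches within each batch while keeping batches sequential (an illustrative sketch; fetchInBatches is a hypothetical name, not a function in sloccount.html):

```javascript
// Files within a batch download in parallel; batches still run one at a time.
async function fetchInBatches(urls, batchSize = 10) {
    const results = [];
    for (let i = 0; i < urls.length; i += batchSize) {
        const batch = urls.slice(i, i + batchSize);
        // Promise.all resolves in input order, so results stay aligned with urls
        const texts = await Promise.all(batch.map(u => fetch(u).then(r => r.text())));
        results.push(...texts);
    }
    return results;
}
```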

🛡️ Security Assessment

Secure practices observed:

  • Uses raw.githubusercontent.com for file fetching (sloccount.html:921) - appropriate and secure
  • API calls are read-only operations
  • No injection vulnerabilities in URL construction
  • Proper error handling for malformed URLs

📊 Code Quality

Good practices:

  • Clear variable naming and logical flow
  • Proper error handling for API failures
  • Consistent use of async/await pattern
  • Appropriate separation of concerns

Areas for improvement:

  • Magic numbers (batch size 10) should be constants
  • Code extension regex is duplicated between GitHub and ZIP analysis (lines 901, 1004)
  • Function analyzeGitHubRepo() is long (~130 lines) and could be split

🧪 Test Coverage

Current state: No automated tests for the new functionality
Recommendations:

  • Unit tests for the API response parsing logic
  • Mock tests for GitHub API interactions
  • Integration tests for the truncation warning flow

💡 Specific Improvements

1. Extract constants:

const BATCH_SIZE = 10;
const CODE_EXTENSIONS = /\.(c|h|cpp|cc|cxx|hpp|py|java|js|ts|rb|pl|pm|php|go|rs|cs|sh|bash|html|css|sql|r|swift|kt|kts|scala|lua|vim|dart|groovy|ex|exs|jl|fs|fsx)$/i;

2. Add rate limit handling:

if (response.status === 403) {
    const resetTime = response.headers.get('X-RateLimit-Reset');
    throw new Error(`GitHub API rate limit exceeded. Resets at ${new Date(resetTime * 1000)}`);
}

3. Enhanced error messaging:
The current warning message is clear and includes a clickable link, which is excellent UX.

📈 Overall Assessment

This is a solid improvement that addresses a real usability issue. The core implementation is sound and the approach using GitHub's truncated property is much more reliable than arbitrary file count limits.

Recommendation:

  • Approve: The implementation correctly solves the problem
  • Nice to have: Rate limiting improvements and constant extraction
  • Future: Add automated tests for the new functionality

The PR successfully achieves its goals and significantly improves the user experience for large repository analysis.


@simonw simonw merged commit a9f4362 into main Oct 22, 2025
4 checks passed
@simonw simonw deleted the claude/handle-large-repo-warning-011CUNPVCEo76Q5UFhpdUSfC branch October 22, 2025 16:39
@simonw
Owner Author

simonw commented Oct 22, 2025

Claude Code transcript: https://gistpreview.github.io/?d29b8601735613879152bc8241cb689e

@simonw
Owner Author

simonw commented Oct 22, 2025

Looks like this:

[Screenshot: CleanShot 2025-10-22 at 09.41.15@2x]
