# Semantic Interlinker

A professional tool for discovering internal linking opportunities using semantic analysis and BERT embeddings.

Semantic Interlinker analyzes web pages and identifies semantically related content to suggest strategic internal linking opportunities. It uses state-of-the-art natural language processing models to understand content relationships and cluster similar pages together.
## Features

- Semantic Analysis: Uses BERT-based sentence transformers to understand content meaning
- Intelligent Clustering: Automatically groups semantically similar pages
- Web Scraping: Built-in scraper to extract content from URLs
- Multiple Export Formats: Export results to Excel, JSON, or JSON Lines
- Flexible Configuration: YAML-based configuration with CLI overrides
- Professional Architecture: Clean, modular, and well-documented codebase
- CLI Interface: Easy-to-use command-line interface
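The semantic analysis above boils down to embedding each page and scoring pairs by cosine similarity. A minimal illustration with toy vectors (the real tool uses BERT sentence embeddings with hundreds of dimensions; these tiny vectors only stand in for them):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for three pages.
page_a = [0.9, 0.1, 0.3]
page_b = [0.8, 0.2, 0.4]
page_c = [0.1, 0.9, 0.0]

print(cosine_similarity(page_a, page_b))  # high: good linking candidates
print(cosine_similarity(page_a, page_c))  # low: unrelated pages
```

Pairs whose score clears the configured threshold become linking suggestions.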
## Installation

```bash
# Clone the repository
cd semantic-interlinker

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"
```

Alternatively, install the pinned dependencies first:

```bash
pip install -r requirements.txt
pip install -e .
```

## Usage

```bash
# Analyze URLs from a CSV file
semantic-interlinker analyze urls.csv -o results.xlsx

# Use a specific model
semantic-interlinker analyze urls.csv --model multilingual -o results.xlsx

# Adjust the similarity threshold
semantic-interlinker analyze urls.csv --min-similarity 0.6

# Export to multiple formats
semantic-interlinker analyze urls.csv --format excel json

# Use a custom configuration
semantic-interlinker analyze urls.csv --config my_config.yaml
```

## Input Format

The tool expects a CSV file with at least a URL column. Example:
```csv
URL,Title
https://example.com/page1,Page 1 Title
https://example.com/page2,Page 2 Title
https://example.com/page3,Page 3 Title
```

## Configuration

Create a `config.yaml` file to customize behavior:
```yaml
# Model Settings
default_model: "semantic_clustering"  # Options: semantic_clustering, multilingual, performance

# Analysis Settings
min_similarity: 0.4
min_cluster_size: 2
batch_size: 256

# Scraping Settings
concurrent_requests: 100
request_timeout: 30

# Export Settings
export_formats:
  - excel
  - json

# Logging
log_level: "INFO"
```

## Available Models

| Model | Description | Speed | Best For |
|---|---|---|---|
| `semantic_clustering` | Multi-QA MPNet | Slow | Best accuracy |
| `multilingual` | Multilingual MiniLM | Medium | Multi-language sites |
| `performance` | MiniLM v3 | Fast | Quick analysis |
| `default` | All-MPNet base | Medium | General purpose |
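Each key in the table selects a sentence-transformers checkpoint internally. A hypothetical sketch of such a registry (the checkpoint names below are illustrative assumptions, not necessarily the mapping this tool ships with):

```python
# Hypothetical registry: config keys -> sentence-transformers checkpoints.
# The checkpoint names are assumptions for illustration only.
MODEL_REGISTRY = {
    "semantic_clustering": "multi-qa-mpnet-base-dot-v1",
    "multilingual": "paraphrase-multilingual-MiniLM-L12-v2",
    "performance": "paraphrase-MiniLM-L3-v2",
    "default": "all-mpnet-base-v2",
}

def resolve_model(name):
    """Resolve a config key to a checkpoint, falling back to the default."""
    return MODEL_REGISTRY.get(name, MODEL_REGISTRY["default"])

# Loading would then look like (requires the sentence-transformers package):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer(resolve_model("performance"))
print(resolve_model("performance"))
```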
## Output

The tool generates:
- Excel File: Multiple sheets with all results and results grouped by cluster
- JSON File: Structured data for programmatic access
- JSONL File: Line-delimited JSON for streaming/processing
Example output structure:
```json
{
  "source_url": "https://example.com/page1",
  "source_title": "Page 1 Title",
  "destination_url": "https://example.com/page2",
  "destination_title": "Page 2 Title",
  "similarity_score": 0.85,
  "cluster_name": "topic cluster",
  "anchor_text": null
}
```

## Project Structure

```
semantic-interlinker/
├── src/semantic_interlinker/
│   ├── analyzers/       # Semantic analysis logic
│   ├── exporters/       # Output formatters
│   ├── models/          # Data models
│   ├── scrapers/        # Web scraping
│   ├── utils/           # Utilities
│   ├── cli.py           # CLI interface
│   └── config.py        # Configuration
├── data/
│   ├── input/           # Input CSV files
│   ├── output/          # Generated reports
│   └── cache/           # Model cache
├── tests/               # Unit tests
├── docs/                # Documentation
├── config.yaml          # Configuration file
├── requirements.txt     # Dependencies
└── setup.py             # Package setup
```

## Development

Run the test suite:

```bash
pytest tests/
```

Format the code:

```bash
black src/
```

Type-check:

```bash
mypy src/
```

## Use Cases

- SEO Optimization: Improve internal linking structure
- Content Strategy: Identify content gaps and relationships
- Site Architecture: Understand content clusters
- Link Building: Find natural linking opportunities
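The JSONL export described above is convenient for streaming: each line is one self-contained suggestion record. A minimal sketch of consuming it with the standard library (field names are taken from the example output; the grouping helper is illustrative, not part of the tool):

```python
import io
import json
from collections import defaultdict

# Stand-in for open("results.jsonl"): two records shaped like the example output.
jsonl_data = io.StringIO(
    '{"source_url": "https://example.com/page1", "destination_url": "https://example.com/page2", '
    '"similarity_score": 0.85, "cluster_name": "topic cluster"}\n'
    '{"source_url": "https://example.com/page1", "destination_url": "https://example.com/page3", '
    '"similarity_score": 0.42, "cluster_name": "topic cluster"}\n'
)

def suggestions_by_cluster(lines, min_score=0.5):
    """Group destination URLs by cluster, keeping only pairs above min_score."""
    clusters = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        if record["similarity_score"] >= min_score:
            clusters[record["cluster_name"]].append(record["destination_url"])
    return dict(clusters)

result = suggestions_by_cluster(jsonl_data)
print(result)  # only the 0.85 pair survives the 0.5 threshold
```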
## Requirements

- Python 3.8+
- 4GB+ RAM (for model loading)
- Internet connection (for scraping and model downloads)
## Performance Tips

- Use `--no-scrape` if you already have content in your CSV
- Adjust `--batch-size` based on available RAM
- Use the `performance` model for large datasets
- Enable caching by setting a cache directory
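The `--batch-size` tip trades memory for throughput: pages are embedded in fixed-size chunks rather than all at once, so peak memory scales with the batch rather than the dataset. A minimal sketch of the chunking involved (illustrative, not the tool's internal code):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks; the last chunk may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

pages = [f"page-{i}" for i in range(10)]
sizes = [len(b) for b in batched(pages, batch_size=4)]
print(sizes)  # → [4, 4, 2]
```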
## Troubleshooting

Reduce the batch size:

```bash
semantic-interlinker analyze urls.csv --batch-size 64
```

Use existing data:

```bash
semantic-interlinker analyze urls.csv --no-scrape
```

Use a faster model:

```bash
semantic-interlinker analyze urls.csv --model performance
```

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request
## License

MIT License - see the LICENSE file for details.
## Acknowledgments

- Built with sentence-transformers
- Uses Scrapy for web scraping
- Inspired by SEO best practices and semantic web principles
## Support

For issues and questions:

- Open an issue on GitHub
- Check the documentation in `docs/`
## Changelog

- Initial release
- Semantic clustering using BERT
- Multi-format export
- CLI interface
- Configurable settings