
Semantic Interlinker

A professional tool for discovering internal linking opportunities using semantic analysis and BERT embeddings.

Overview

Semantic Interlinker analyzes web pages and identifies semantically related content to suggest strategic internal linking opportunities. It uses state-of-the-art natural language processing models to understand content relationships and cluster similar pages together.
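The core idea can be sketched in a few lines: embed each page, compute pairwise cosine similarity, and keep pairs above a threshold. The sketch below uses toy hand-written vectors in place of real BERT sentence-transformer embeddings, and `suggest_links` is an illustrative helper, not the tool's actual API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def suggest_links(embeddings, min_similarity=0.4):
    """Return (source, destination, score) pairs above the threshold,
    highest score first. `embeddings` maps URL -> embedding vector."""
    urls = list(embeddings)
    suggestions = []
    for i, src in enumerate(urls):
        for dst in urls[i + 1:]:
            score = cosine_similarity(embeddings[src], embeddings[dst])
            if score >= min_similarity:
                suggestions.append((src, dst, round(score, 3)))
    return sorted(suggestions, key=lambda s: -s[2])

# Toy 3-dimensional vectors for illustration; the real tool derives
# high-dimensional embeddings from a BERT-based sentence transformer.
pages = {
    "https://example.com/page1": [0.9, 0.1, 0.0],
    "https://example.com/page2": [0.8, 0.2, 0.1],
    "https://example.com/page3": [0.0, 0.1, 0.9],
}
```

With these vectors, page1 and page2 point in nearly the same direction and score close to 1.0, while page3 is nearly orthogonal to both and falls below the default 0.4 threshold.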

Features

  • Semantic Analysis: Uses BERT-based sentence transformers to understand content meaning
  • Intelligent Clustering: Automatically groups semantically similar pages
  • Web Scraping: Built-in scraper to extract content from URLs
  • Multiple Export Formats: Export results to Excel, JSON, or JSON Lines
  • Flexible Configuration: YAML-based configuration with CLI overrides
  • Professional Architecture: Clean, modular, and well-documented codebase
  • CLI Interface: Easy-to-use command-line interface

Installation

From Source

# Clone the repository
git clone https://github.com/pushkarsingh32/Semantic-Interlinker.git
cd Semantic-Interlinker

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Using pip

pip install -r requirements.txt
pip install -e .

Quick Start

Basic Usage

# Analyze URLs from a CSV file
semantic-interlinker analyze urls.csv -o results.xlsx

Advanced Usage

# Use a specific model
semantic-interlinker analyze urls.csv --model multilingual -o results.xlsx

# Adjust similarity threshold
semantic-interlinker analyze urls.csv --min-similarity 0.6

# Export to multiple formats
semantic-interlinker analyze urls.csv --format excel json

# Use custom configuration
semantic-interlinker analyze urls.csv --config my_config.yaml

Input Format

The tool expects a CSV file with at least a URL column. Example:

URL,Title
https://example.com/page1,Page 1 Title
https://example.com/page2,Page 2 Title
https://example.com/page3,Page 3 Title
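A quick stdlib check that an input file meets this contract might look like the following; `validate_input` is an illustrative helper, not part of the tool.

```python
import csv

def validate_input(path):
    """Return the rows of a CSV file, raising if the URL column is missing."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or "URL" not in reader.fieldnames:
            raise ValueError("CSV must contain a 'URL' column")
        return list(reader)
```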

Configuration

Create a config.yaml file to customize behavior:

# Model Settings
default_model: "semantic_clustering"  # Options: default, semantic_clustering, multilingual, performance

# Analysis Settings
min_similarity: 0.4
min_cluster_size: 2
batch_size: 256

# Scraping Settings
concurrent_requests: 100
request_timeout: 30

# Export Settings
export_formats:
  - excel
  - json

# Logging
log_level: "INFO"
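Because CLI flags override the YAML file, the effective settings are conceptually just a dict merge where explicit CLI values win. The helper below is a sketch of that precedence rule, with plain dicts standing in for the parsed YAML; it is not the tool's actual implementation.

```python
def merge_config(file_config, cli_overrides):
    """Return the file config with non-None CLI values taking precedence."""
    merged = dict(file_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```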

Available Models

Model                  Description           Speed    Best For
semantic_clustering    Multi-QA MPNet        Slow     Best accuracy
multilingual           Multilingual MiniLM   Medium   Multi-language sites
performance            MiniLM v3             Fast     Quick analysis
default                All-MPNet base        Medium   General purpose

Output

The tool generates:

  • Excel File: Multiple sheets, one with all results plus sheets grouped by cluster
  • JSON File: Structured data for programmatic access
  • JSONL File: Line-delimited JSON for streaming/processing

Example output structure:

{
  "source_url": "https://example.com/page1",
  "source_title": "Page 1 Title",
  "destination_url": "https://example.com/page2",
  "destination_title": "Page 2 Title",
  "similarity_score": 0.85,
  "cluster_name": "topic cluster",
  "anchor_text": null
}
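Because the JSONL export is line-delimited, it can be consumed one record at a time without loading the whole file. A small sketch of grouping suggestions by cluster, assuming only the record shape shown above (`group_by_cluster` is an illustrative helper):

```python
import json
from collections import defaultdict

def group_by_cluster(lines):
    """Group link-suggestion records by cluster_name from JSONL lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        clusters[record["cluster_name"]].append(record)
    return dict(clusters)
```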

Project Structure

semantic-interlinker/
├── src/semantic_interlinker/
│   ├── analyzers/         # Semantic analysis logic
│   ├── exporters/         # Output formatters
│   ├── models/            # Data models
│   ├── scrapers/          # Web scraping
│   ├── utils/             # Utilities
│   ├── cli.py             # CLI interface
│   └── config.py          # Configuration
├── data/
│   ├── input/             # Input CSV files
│   ├── output/            # Generated reports
│   └── cache/             # Model cache
├── tests/                 # Unit tests
├── docs/                  # Documentation
├── config.yaml            # Configuration file
├── requirements.txt       # Dependencies
└── setup.py              # Package setup

Development

Running Tests

pytest tests/

Code Formatting

black src/

Type Checking

mypy src/

Use Cases

  • SEO Optimization: Improve internal linking structure
  • Content Strategy: Identify content gaps and relationships
  • Site Architecture: Understand content clusters
  • Link Building: Find natural linking opportunities

Requirements

  • Python 3.8+
  • 4GB+ RAM (for model loading)
  • Internet connection (for scraping and model downloads)

Performance Tips

  1. Use --no-scrape if you already have content in your CSV
  2. Adjust --batch-size based on available RAM
  3. Use the performance model for large datasets
  4. Enable caching by setting a cache directory
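The --batch-size trade-off in tip 2 comes down to chunking: embeddings are computed in fixed-size batches, so a smaller batch lowers peak memory at the cost of more model calls. A minimal chunking helper of the kind any such pipeline uses (illustrative, not the tool's code):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list; the last may be short."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```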

Troubleshooting

Out of Memory

Reduce batch size:

semantic-interlinker analyze urls.csv --batch-size 64

Scraping Fails

Use existing data:

semantic-interlinker analyze urls.csv --no-scrape

Slow Performance

Use a faster model:

semantic-interlinker analyze urls.csv --model performance

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - See LICENSE file for details

Support

For issues and questions:

  • Open an issue on GitHub
  • Check the documentation in docs/

Changelog

Version 1.0.0

  • Initial release
  • Semantic clustering using BERT
  • Multi-format export
  • CLI interface
  • Configurable settings
