
Semantic Interlinker

A professional tool for discovering internal linking opportunities using semantic analysis and BERT embeddings.

Overview

Semantic Interlinker analyzes web pages and identifies semantically related content to suggest strategic internal linking opportunities. It uses state-of-the-art natural language processing models to understand content relationships and cluster similar pages together.
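The core idea can be sketched in a few lines: embed each page, compute pairwise cosine similarity, and keep pairs above a threshold. The sketch below uses toy hand-written vectors in place of real BERT sentence-transformer embeddings, and `suggest_links` is an illustrative helper, not the tool's actual API.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def suggest_links(embeddings, min_similarity=0.4):
    """Return (source, destination, score) pairs above the threshold,
    highest score first. `embeddings` maps URL -> embedding vector."""
    urls = list(embeddings)
    suggestions = []
    for i, src in enumerate(urls):
        for dst in urls[i + 1:]:
            score = cosine_similarity(embeddings[src], embeddings[dst])
            if score >= min_similarity:
                suggestions.append((src, dst, round(score, 3)))
    return sorted(suggestions, key=lambda s: -s[2])

# Toy 3-dimensional vectors for illustration; the real tool derives
# high-dimensional embeddings from a BERT-based sentence transformer.
pages = {
    "https://example.com/page1": [0.9, 0.1, 0.0],
    "https://example.com/page2": [0.8, 0.2, 0.1],
    "https://example.com/page3": [0.0, 0.1, 0.9],
}
```

With these vectors, page1 and page2 point in nearly the same direction and score close to 1.0, while page3 is nearly orthogonal to both and falls below the default 0.4 threshold.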

Features

  • Semantic Analysis: Uses BERT-based sentence transformers to understand content meaning
  • Intelligent Clustering: Automatically groups semantically similar pages
  • Web Scraping: Built-in scraper to extract content from URLs
  • Multiple Export Formats: Export results to Excel, JSON, or JSON Lines
  • Flexible Configuration: YAML-based configuration with CLI overrides
  • Professional Architecture: Clean, modular, and well-documented codebase
  • CLI Interface: Easy-to-use command-line interface

Installation

From Source

# Clone the repository
git clone https://github.com/pushkarsingh32/Semantic-Interlinker.git
cd Semantic-Interlinker

# Install in development mode
pip install -e .

# Or install with development dependencies
pip install -e ".[dev]"

Using pip

pip install -r requirements.txt
pip install -e .

Quick Start

Basic Usage

# Analyze URLs from a CSV file
semantic-interlinker analyze urls.csv -o results.xlsx

Advanced Usage

# Use a specific model
semantic-interlinker analyze urls.csv --model multilingual -o results.xlsx

# Adjust similarity threshold
semantic-interlinker analyze urls.csv --min-similarity 0.6

# Export to multiple formats
semantic-interlinker analyze urls.csv --format excel json

# Use custom configuration
semantic-interlinker analyze urls.csv --config my_config.yaml

Input Format

The tool expects a CSV file with at least a URL column. Example:

URL,Title
https://example.com/page1,Page 1 Title
https://example.com/page2,Page 2 Title
https://example.com/page3,Page 3 Title
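A quick stdlib check that an input file meets this contract might look like the following; `validate_input` is an illustrative helper, not part of the tool.

```python
import csv

def validate_input(path):
    """Return the rows of a CSV file, raising if the URL column is missing."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or "URL" not in reader.fieldnames:
            raise ValueError("CSV must contain a 'URL' column")
        return list(reader)
```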

Configuration

Create a config.yaml file to customize behavior:

# Model Settings
default_model: "semantic_clustering"  # Options: default, semantic_clustering, multilingual, performance

# Analysis Settings
min_similarity: 0.4
min_cluster_size: 2
batch_size: 256

# Scraping Settings
concurrent_requests: 100
request_timeout: 30

# Export Settings
export_formats:
  - excel
  - json

# Logging
log_level: "INFO"
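Because CLI flags override the YAML file, the effective settings are conceptually just a dict merge where explicit CLI values win. The helper below is a sketch of that precedence rule, with plain dicts standing in for the parsed YAML; it is not the tool's actual implementation.

```python
def merge_config(file_config, cli_overrides):
    """Return the file config with non-None CLI values taking precedence."""
    merged = dict(file_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged
```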

Available Models

Model                  Description           Speed    Best For
semantic_clustering    Multi-QA MPNet        Slow     Best accuracy
multilingual           Multilingual MiniLM   Medium   Multi-language sites
performance            MiniLM v3             Fast     Quick analysis
default                All-MPNet base        Medium   General purpose

Output

The tool generates:

  • Excel File: Multiple sheets, one with all results plus sheets grouped by cluster
  • JSON File: Structured data for programmatic access
  • JSONL File: Line-delimited JSON for streaming/processing

Example output structure:

{
  "source_url": "https://example.com/page1",
  "source_title": "Page 1 Title",
  "destination_url": "https://example.com/page2",
  "destination_title": "Page 2 Title",
  "similarity_score": 0.85,
  "cluster_name": "topic cluster",
  "anchor_text": null
}
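Because the JSONL export is line-delimited, it can be consumed one record at a time without loading the whole file. A small sketch of grouping suggestions by cluster, assuming only the record shape shown above (`group_by_cluster` is an illustrative helper):

```python
import json
from collections import defaultdict

def group_by_cluster(lines):
    """Group link-suggestion records by cluster_name from JSONL lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        clusters[record["cluster_name"]].append(record)
    return dict(clusters)
```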

Project Structure

semantic-interlinker/
├── src/semantic_interlinker/
│   ├── analyzers/         # Semantic analysis logic
│   ├── exporters/         # Output formatters
│   ├── models/            # Data models
│   ├── scrapers/          # Web scraping
│   ├── utils/             # Utilities
│   ├── cli.py             # CLI interface
│   └── config.py          # Configuration
├── data/
│   ├── input/             # Input CSV files
│   ├── output/            # Generated reports
│   └── cache/             # Model cache
├── tests/                 # Unit tests
├── docs/                  # Documentation
├── config.yaml            # Configuration file
├── requirements.txt       # Dependencies
└── setup.py              # Package setup

Development

Running Tests

pytest tests/

Code Formatting

black src/

Type Checking

mypy src/

Use Cases

  • SEO Optimization: Improve internal linking structure
  • Content Strategy: Identify content gaps and relationships
  • Site Architecture: Understand content clusters
  • Link Building: Find natural linking opportunities

Requirements

  • Python 3.8+
  • 4GB+ RAM (for model loading)
  • Internet connection (for scraping and model downloads)

Performance Tips

  1. Use --no-scrape if you already have content in your CSV
  2. Adjust --batch-size based on available RAM
  3. Use the performance model for large datasets
  4. Enable caching by setting a cache directory
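The --batch-size trade-off in tip 2 comes down to chunking: embeddings are computed in fixed-size batches, so a smaller batch lowers peak memory at the cost of more model calls. A minimal chunking helper of the kind any such pipeline uses (illustrative, not the tool's code):

```python
def batched(items, batch_size):
    """Yield successive fixed-size batches from a list; the last may be short."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```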

Troubleshooting

Out of Memory

Reduce batch size:

semantic-interlinker analyze urls.csv --batch-size 64

Scraping Fails

Use existing data:

semantic-interlinker analyze urls.csv --no-scrape

Slow Performance

Use a faster model:

semantic-interlinker analyze urls.csv --model performance

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

MIT License - See LICENSE file for details

Support

For issues and questions:

  • Open an issue on GitHub
  • Check the documentation in docs/

Changelog

Version 1.0.0

  • Initial release
  • Semantic clustering using BERT
  • Multi-format export
  • CLI interface
  • Configurable settings
