Identify structural and stylistic patterns in LaTeX projects at scale
This repository provides a Python-based tool for the automated collection and analysis of LaTeX source code from publicly available GitHub repositories. The goal is to explore how LaTeX is used across different types of projects and to support research into best practices and potential coding conventions for LaTeX.
✅ Automated Data Collection
- Search and clone LaTeX repositories from GitHub using keyword filters.
- Apply size thresholds to avoid trivial or template repositories.
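
The collection step above could be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the helper names `build_search_url` and `select_clone_urls`, the 100 KB default threshold, and the page size are all assumptions. It relies on the GitHub repository search endpoint and its `language:` and `size:` qualifiers (size is measured in kilobytes), and it deliberately makes no network call.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keywords, min_size_kb=100, per_page=50):
    """Build a GitHub repository search URL for TeX projects (hypothetical helper)."""
    # `language:` and `size:` are GitHub search qualifiers; size is in kilobytes.
    query = " ".join([*keywords, "language:TeX", f"size:>={min_size_kb}"])
    return f"{GITHUB_SEARCH_URL}?{urlencode({'q': query, 'per_page': per_page})}"


def select_clone_urls(items, min_size_kb=100):
    """Re-check the size threshold locally and return the clone URLs (hypothetical helper)."""
    # Each search result item carries `size` (in KB) and `clone_url` fields.
    return [item["clone_url"] for item in items if item.get("size", 0) >= min_size_kb]


# Example with hand-written items shaped like search results (not real data).
example_items = [
    {"full_name": "user/tiny-template", "clone_url": "https://example.invalid/tiny.git", "size": 12},
    {"full_name": "user/phd-thesis", "clone_url": "https://example.invalid/thesis.git", "size": 4800},
]
print(build_search_url(["thesis"]))
print(select_clone_urls(example_items))
```

Each selected URL could then be fetched with a shallow clone such as `git clone --depth 1 <url>`, which keeps disk usage low when history is not needed for the analysis.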
✅ Feature Extraction
- Analyze LaTeX projects for:
  - Project structure (number of files, folders, modularization)
  - Macro and command usage
  - Preamble and package organization
  - Code style and readability metrics
  - Structural elements (sections, environments, citations)
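
To make the feature categories above concrete, here is a rough sketch of how a few of them might be computed for a single `.tex` file and serialized as JSON. The function name `extract_features`, the regular expressions, and the keys of the resulting dictionary are assumptions made for illustration; the tool's real metrics and output schema may differ. The patterns are intentionally simple and ignore comments, verbatim blocks, and nested braces.

```python
import json
import re
from collections import Counter

# Simplified patterns (assumptions): no handling of comments or verbatim text.
PACKAGE_RE = re.compile(r"\\usepackage(?:\[[^\]]*\])?\{([^}]*)\}")
MACRO_RE = re.compile(r"\\(?:newcommand|renewcommand|providecommand|def)\b")
SECTION_RE = re.compile(r"\\(?:sub){0,2}section\*?\{")
ENV_RE = re.compile(r"\\begin\{([^}]+)\}")
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{")


def extract_features(tex_source):
    """Return a small, JSON-serializable feature dictionary for one LaTeX source string."""
    # One \usepackage may load several comma-separated packages.
    packages = [
        name.strip()
        for group in PACKAGE_RE.findall(tex_source)
        for name in group.split(",")
        if name.strip()
    ]
    lines = tex_source.splitlines()
    return {
        "packages": packages,
        "macro_definitions": len(MACRO_RE.findall(tex_source)),
        "sections": len(SECTION_RE.findall(tex_source)),
        "environments": dict(Counter(ENV_RE.findall(tex_source))),
        "citations": len(CITE_RE.findall(tex_source)),
        "line_count": len(lines),
        "max_line_length": max((len(line) for line in lines), default=0),
    }


sample = r"""
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage{amsmath, graphicx}
\newcommand{\R}{\mathbb{R}}
\begin{document}
\section{Intro}
See \cite{knuth1984} and \citep[p.~3]{lamport1994}.
\begin{equation} x \in \R \end{equation}
\end{document}
"""

features = extract_features(sample)
print(json.dumps(features, indent=2))
```

Regular expressions keep the sketch short, but they miscount commands that appear inside comments or verbatim environments; a production analysis would likely need a more careful tokenizer.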
✅ Output
- Results saved as JSON for easy downstream analysis and visualization.
Use cases:
- Research into LaTeX coding standards
- Educational insights into LaTeX usage
- Code audits for large LaTeX documents
- Academic studies on writing practices and collaboration