
language-dataset

A dataset for programming language identification.

Methodology

Rules for sample inclusion are:

  • No more than one sample from each repository.
  • Each sample is at least 500 bytes and at most 100 KB.
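
The two inclusion rules can be sketched as a small filter. This is only an illustration, not the code used by the tools in this repository: the function name `select_samples`, the `(repository, content)` input shape, and the "first in-range sample wins" policy are all assumptions, as is reading the upper limit as 100,000 bytes rather than 102,400.

```python
from typing import Iterable, List, Tuple

MIN_SIZE = 500         # bytes, lower bound from the rules above
MAX_SIZE = 100 * 1000  # assumption: upper bound read as 100,000 bytes


def select_samples(
    candidates: Iterable[Tuple[str, bytes]],
) -> List[Tuple[str, bytes]]:
    """Keep at most one in-range sample per repository.

    `candidates` yields hypothetical (repository, content) pairs.
    The first sample of a repository that passes the size check is kept.
    """
    seen_repos = set()
    selected = []
    for repo, content in candidates:
        if repo in seen_repos:
            continue  # rule 1: no more than one sample per repository
        if not (MIN_SIZE <= len(content) <= MAX_SIZE):
            continue  # rule 2: size must be within the allowed range
        seen_repos.add(repo)
        selected.append((repo, content))
    return selected
```

A repository whose first candidate is too small or too large is not marked as seen, so a later candidate from the same repository can still be selected.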

Dataset

The dataset is stored in the data directory. It contains:

  • meta.yml: metadata about the dataset and available languages.
  • dataset.yml: the collection of all samples, with sample paths given relative to data.

Check a summary of the dataset at REPORT.md.
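
Because sample paths in dataset.yml are relative to data, they need to be joined onto that directory before a sample can be opened. A minimal sketch, assuming the script runs from the repository root; the helper name `resolve_sample` and the example path are hypothetical and not taken from the dataset:

```python
from pathlib import Path

DATA_DIR = Path("data")  # the dataset directory, relative to the repository root


def resolve_sample(relative_path: str, data_dir: Path = DATA_DIR) -> Path:
    """Join a sample path from dataset.yml onto the data directory."""
    return data_dir / relative_path


# Hypothetical entry, for illustration only.
sample_file = resolve_sample("python/example.py")
```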

Contributing

See CONTRIBUTING.md.

Tooling

The tools directory contains various Python utilities to maintain the dataset:

  • tools/gen_meta.py: Generates data/meta.yml. This is only needed when upgrading to a new github/linguist or acmeism/RosettaCodeData version.
  • tools/harvest.py: Fetches samples from GitHub.
  • tools/vote.py: Updates the vote annotation.
  • tools/lint.py: Checks the dataset for potential problems.
  • tools/prepare_commit.py: Updates generated files, required before any commit.
  • tools/classify_linguist.py: Updates linguist labels.
  • tools/classify_pygments.py: Updates pygments labels.

To run the tools, first create the virtual environment:

pip install poetry
poetry install

Then run the tool with python -m:

poetry run python -m tools.gen_meta

License

Each sample in data has its own license. Check the origin repository for details.

Everything else is licensed under the MIT License.