In this folder, we show how to fine-tune an autoregressive language model on the following evaluation and downstream tasks, with support for 7 programming languages:
- APPS: Python benchmark for evaluating code generation. It is similar to HumanEval and MBPP, but it is more challenging and has more evaluation problems.
- CodeComplex: Java benchmark with a classification problem to predict the algorithmic complexity of Java programs among 7 labels.
- CodeClone: Java benchmark from the CodeXGLUE dataset, with a binary classification problem of predicting whether two programs are semantically equivalent. [WIP]
- CodeDefect: C benchmark from CodeXGLUE, with a binary classification problem of predicting whether a piece of code is insecure and may be exploited to attack software systems. [WIP]
- Code-to-text: Dataset from CodeXGLUE for generating natural language comments from code in Python, Go, Java, JavaScript, PHP and Ruby. This task can also be done in a zero-shot setting, without fine-tuning. [WIP]
We use the Hugging Face Trainer API for all tasks, which supports distributed training on multiple GPUs.
The evaluation score on the test set is shown at the end of fine-tuning. For implementation details, please refer to the README inside each folder.
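
As a rough illustration of how an evaluation score can be produced for the classification tasks, here is a minimal sketch of a metric function in the shape that `Trainer` accepts through its `compute_metrics` argument. The choice of accuracy as the metric and the function body are assumptions made for illustration; the scripts in each folder may compute their scores differently.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a classification head (illustrative sketch).

    `eval_pred` unpacks to (logits, labels): one row of per-class scores
    per example, and one integer label per example.
    """
    logits, labels = eval_pred
    # Arg-max over the class dimension, written without NumPy so the sketch
    # works on plain lists as well as on arrays.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Toy check with made-up scores: two of the three predictions match the labels.
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]]
toy_labels = [1, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
```

A function like this could be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`, so that the returned dictionary is reported whenever evaluation runs.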