Add training skill: `train-sentence-transformers` by tomaarsen · Pull Request #3752 · huggingface/sentence-transformers

tomaarsen · 2026-05-06T11:31:36Z

Hello!

Pull Request overview

Add the train-sentence-transformers Hugging Face Agent Skill under skills/, covering all three sentence-transformers architectures (bi-encoder, cross-encoder, SPLADE) so users can drive end-to-end training runs from any compatible coding agent
Add .github/workflows/sync-skills.yml to mirror the canonical skill to the huggingface/skills marketplace on each v* tag

Details

The skill follows the Agent Skills format (SKILL.md + references/ + scripts/), so it's tool-neutral: once published, users install via hf skills add train-sentence-transformers, /plugin install train-sentence-transformers@huggingface/skills (Claude Code), or the auto-published Cursor / Codex / Gemini variants. skills/README.md documents both paths plus a local-development recipe for contributors who want to symlink the skill folder into their agent's standard install location for instant edit-loop iteration (junctions via mklink /J on Windows, since those don't require Developer Mode or admin). .gitignore picks up .claude/ and .agents/ so those local symlinks stay untracked.
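The local-development recipe could be sketched in Python roughly as follows. This is a minimal sketch, not the recipe from skills/README.md: the `link_skill` helper and the `.claude/skills` install location used in the usage note are assumptions for illustration. Only the choice of a junction (`mklink /J`) on Windows and a symlink elsewhere comes from the description above.

```python
import os
import subprocess
import sys
from pathlib import Path


def link_skill(skill_dir: Path, install_root: Path) -> Path:
    """Link skill_dir into install_root so edits are visible without copying.

    Hypothetical helper: the name and signature are illustrative only.
    """
    link = install_root / skill_dir.name
    install_root.mkdir(parents=True, exist_ok=True)
    if link.exists() or link.is_symlink():
        # Something is already at the link path; leave it untouched.
        return link
    if sys.platform == "win32":
        # mklink is a cmd.exe built-in; /J creates a directory junction,
        # which needs neither Developer Mode nor admin rights.
        subprocess.run(
            ["cmd", "/c", "mklink", "/J", str(link), str(skill_dir.resolve())],
            check=True,
        )
    else:
        os.symlink(skill_dir.resolve(), link, target_is_directory=True)
    return link
```

Under those assumptions, `link_skill(Path("skills/train-sentence-transformers"), Path(".claude/skills"))` would expose the working copy to the agent, and the `.gitignore` entries keep the resulting link untracked.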

SKILL.md is a router rather than a manual: it identifies the model type (SentenceTransformer / CrossEncoder / SparseEncoder, using tiebreaker rules) and points at the per-type required reading. Per-type loss / evaluator catalogs (references/losses_<type>.md, references/evaluators_<type>.md) and production templates (scripts/train_<type>_example.py) sit alongside cross-cutting refs (training_args.md, dataset_formats.md, troubleshooting.md, base_model_selection.md, plus opt-in model_architectures.md, hardware_guide.md, hf_jobs_execution.md, prompts_and_instructions.md). Variant templates cover Matryoshka, multi-dataset, LoRA, distillation, multilingual, static embedding, listwise CE, and SPLADE distillation. scripts/mine_hard_negatives.py ships as a CLI for the cross-cutting hard-negative mining step.
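The routing idea can be illustrated with a small lookup that expands the `<type>` placeholder in the paths above. This is a sketch under assumptions, not code from the skill: `required_reading` and `Reading` are hypothetical names, and the `sparse_encoder` slug is a guess by analogy with the `sentence_transformer` and `cross_encoder` slugs that appear in the template file names in this PR.

```python
from dataclasses import dataclass

# Assumed <type> slugs; "sparse_encoder" is a guess by analogy.
KNOWN_TYPES = ("sentence_transformer", "cross_encoder", "sparse_encoder")


@dataclass(frozen=True)
class Reading:
    """Per-type files a router would point an agent at (hypothetical type)."""

    losses: str
    evaluators: str
    template: str


def required_reading(model_type: str) -> Reading:
    """Expand the <type> placeholder for one model type."""
    if model_type not in KNOWN_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return Reading(
        losses=f"references/losses_{model_type}.md",
        evaluators=f"references/evaluators_{model_type}.md",
        template=f"scripts/train_{model_type}_example.py",
    )
```

The tiebreaker rules that pick the model type live in SKILL.md itself and are not reproduced here; the sketch only covers the step after the type is known.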

.github/workflows/sync-skills.yml is modelled on huggingface_hub's sync-hf-cli-skill.yml. It fires on v* tags (excluding RCs) and on manual workflow_dispatch, checks out huggingface/skills using the same GitHub App credentials as the hub-cli workflow (reachable at the huggingface org level since the repo moved here), copies skills/train-sentence-transformers/ into the receiving repo, runs that repo's ./scripts/publish.sh to regenerate the cross-tool manifests, and opens a PR. marketplace.json entries are hand-maintained on the receiving end, so first publication needs a one-time manual PR adding the folder and its entry; the workflow takes over for subsequent content updates. workflow_dispatch is the manual escape hatch for skill-only fixes between releases.
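The trigger rule ("v* tags, excluding RCs") can be expressed as a small predicate. The workflow's actual filter is not shown in this description, so the following is only a sketch: `should_sync` is a hypothetical name, and it assumes release candidates are tagged with a trailing `rc` plus optional digits (for example `v5.1.0rc1`, itself a made-up tag).

```python
import re

# Assumption: RC tags end in "rc" followed by optional digits.
RC_SUFFIX = re.compile(r"rc\d*$", re.IGNORECASE)


def should_sync(tag: str) -> bool:
    """Return True for v* tags that are not release candidates."""
    return tag.startswith("v") and RC_SUFFIX.search(tag) is None
```

A manual workflow_dispatch run bypasses this check entirely, which is what makes it usable for skill-only fixes between releases.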

This PR also drops --diff from the typos pre-commit hook. In --diff mode, typos silently exits non-zero with no output when a typo has multiple suggested corrections (e.g. ambiguous prefixes), which makes failures hard to debug. The default error format prints file:line:col with the suggestion inline, which pre-commit displays correctly.

Tom Aarsen

Copilot

Pull request overview

This PR adds three Hugging Face “Agent Skills” under skills/ to enable end-to-end training workflows for sentence-transformers models (SentenceTransformer, CrossEncoder, and SparseEncoder/SPLADE), plus automation to keep shared docs in sync and to publish updates to the huggingface/skills marketplace.

Changes:

Introduces three self-contained skills (train-sentence-transformer, train-cross-encoder, train-sparse-encoder) with reference docs and runnable training templates.
Adds a sync workflow to mirror skills/<name>/ into huggingface/skills on release tags (and via manual dispatch).
Adds a shared-file mirroring script (skills/sync_shared.py) and a pre-commit hook to prevent drift across duplicated docs/scripts.
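The mirror-and-check behaviour described for skills/sync_shared.py could look roughly like the sketch below. The script itself is not shown in this conversation, so the function name, signature, and return value are assumptions; the sketch only illustrates copying canonical files into sibling skills and reporting drift in a `--check`-style mode.

```python
import filecmp
import shutil
from pathlib import Path


def sync_shared(
    canonical: Path, mirrors: list[Path], shared: list[str], check: bool = False
) -> list[Path]:
    """Mirror shared files from canonical into each mirror directory.

    Returns the mirrored paths that were missing or differed. With
    check=True nothing is written, so a non-empty result means drift.
    """
    drifted = []
    for mirror in mirrors:
        for rel in shared:
            src, dst = canonical / rel, mirror / rel
            # shallow=False compares file contents, not just metadata.
            if dst.exists() and filecmp.cmp(src, dst, shallow=False):
                continue
            drifted.append(dst)
            if not check:
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(src, dst)
    return drifted
```

A pre-commit hook built on this would run the check mode and fail when the returned list is non-empty.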

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 19 comments.

File	Description
.github/workflows/sync-skills.yml	Syncs skill folders into `huggingface/skills` and opens an automated PR on releases/manual runs.
.gitignore	Ignores local agent/plugin directories used during skill development.
.pre-commit-config.yaml	Adds a local hook to enforce shared-doc/script synchronization across skills.
skills/README.md	Documents how to install/use the skills and how to develop locally.
skills/sync_shared.py	Copies canonical shared docs/scripts from `train-sentence-transformer` into the other two skills (with `--check`).
skills/train-cross-encoder/SKILL.md	Skill definition and instructions for cross-encoder (reranker) training workflows.
skills/train-cross-encoder/scripts/mine_hard_negatives.py	CLI wrapper for mining hard negatives to support training datasets.
skills/train-cross-encoder/scripts/train_distillation_example.py	Cross-encoder distillation training template.
skills/train-cross-encoder/scripts/train_example.py	Cross-encoder pointwise training template (with hard-negative mining).
skills/train-cross-encoder/scripts/train_listwise_example.py	Cross-encoder listwise training template.
skills/train-cross-encoder/references/dataset_formats.md	Reference guide for supported dataset shapes/formats.
skills/train-cross-encoder/references/evaluators.md	Reference guide for evaluation options/metrics for cross-encoders.
skills/train-cross-encoder/references/hardware_guide.md	Hardware guidance for running training efficiently.
skills/train-cross-encoder/references/hf_jobs_execution.md	Guidance for running these scripts on Hugging Face Jobs.
skills/train-cross-encoder/references/losses.md	Reference guide for cross-encoder losses and when to use them.
skills/train-cross-encoder/references/prompts_and_instructions.md	Guidance for prompts/instructions usage during training.
skills/train-cross-encoder/references/training_args.md	Reference for training arguments and recommended settings.
skills/train-cross-encoder/references/troubleshooting.md	Troubleshooting guide for common training/runtime issues.
skills/train-sentence-transformer/SKILL.md	Skill definition and instructions for SentenceTransformer training workflows.
skills/train-sentence-transformer/scripts/mine_hard_negatives.py	CLI wrapper for mining hard negatives to support training datasets.
skills/train-sentence-transformer/scripts/train_distillation_example.py	SentenceTransformer distillation training template.
skills/train-sentence-transformer/scripts/train_example.py	Baseline SentenceTransformer training template.
skills/train-sentence-transformer/scripts/train_make_multilingual_example.py	Multilingual training template.
skills/train-sentence-transformer/scripts/train_matryoshka_example.py	Matryoshka training template.
skills/train-sentence-transformer/scripts/train_multi_dataset_example.py	Multi-dataset training template.
skills/train-sentence-transformer/scripts/train_static_embedding_example.py	Static embedding model training template.
skills/train-sentence-transformer/scripts/train_with_lora_example.py	LoRA fine-tuning training template.
skills/train-sentence-transformer/references/dataset_formats.md	Reference guide for supported dataset shapes/formats.
skills/train-sentence-transformer/references/evaluators.md	Reference guide for evaluation options/metrics for bi-encoders.
skills/train-sentence-transformer/references/hardware_guide.md	Hardware guidance for running training efficiently.
skills/train-sentence-transformer/references/hf_jobs_execution.md	Guidance for running these scripts on Hugging Face Jobs.
skills/train-sentence-transformer/references/losses.md	Reference guide for SentenceTransformer losses and when to use them.
skills/train-sentence-transformer/references/model_architectures.md	Reference guide for SentenceTransformer model architectures.
skills/train-sentence-transformer/references/prompts_and_instructions.md	Guidance for prompts/instructions usage during training.
skills/train-sentence-transformer/references/training_args.md	Reference for training arguments and recommended settings.
skills/train-sentence-transformer/references/troubleshooting.md	Troubleshooting guide for common training/runtime issues.
skills/train-sparse-encoder/SKILL.md	Skill definition and instructions for SPLADE/sparse-encoder training workflows.
skills/train-sparse-encoder/scripts/mine_hard_negatives.py	CLI wrapper for mining hard negatives to support training datasets.
skills/train-sparse-encoder/scripts/train_distillation_example.py	Sparse-encoder distillation training template.
skills/train-sparse-encoder/scripts/train_example.py	Sparse-encoder (SPLADE) contrastive training template.
skills/train-sparse-encoder/references/dataset_formats.md	Reference guide for supported dataset shapes/formats.
skills/train-sparse-encoder/references/evaluators.md	Reference guide for sparse evaluation options/metrics.
skills/train-sparse-encoder/references/hardware_guide.md	Hardware guidance for running training efficiently.
skills/train-sparse-encoder/references/hf_jobs_execution.md	Guidance for running these scripts on Hugging Face Jobs.
skills/train-sparse-encoder/references/losses.md	Reference guide for sparse-encoder losses and when to use them.
skills/train-sparse-encoder/references/prompts_and_instructions.md	Guidance for prompts/instructions usage during training.
skills/train-sparse-encoder/references/training_args.md	Reference for training arguments and recommended settings.
skills/train-sparse-encoder/references/troubleshooting.md	Troubleshooting guide for common training/runtime issues.

Copilot

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 17 comments.

Copilot

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.

Add 3 training skills

0e6e787

tomaarsen requested a review from Copilot May 6, 2026 11:31

Copilot started reviewing on behalf of tomaarsen May 6, 2026 11:32

Copilot AI reviewed May 6, 2026

Merge into one skill & try 'less is more' strategy

8e71c56

tomaarsen mentioned this pull request May 6, 2026

Add 3 Sentence Transformers training skills huggingface/skills#136

Merged

Harden skill templates: smoke-test, metric keys, Normalize fix

2b8da82

tomaarsen changed the title ~~Add 3 training skills: train-sentence-transformer, train-cross-encoder, and train-sparse-encoder~~ Add training skill: train-sentence-transformer May 7, 2026

tomaarsen requested a review from Copilot May 7, 2026 09:51

Copilot started reviewing on behalf of tomaarsen May 7, 2026 09:52

Log actual pushed URL

d8015f3

Copilot AI reviewed May 7, 2026

Add VERDICT in variant templates + SPARSE keys + VLM precision

da3e061

tomaarsen requested a review from Copilot May 7, 2026 10:12

Copilot started reviewing on behalf of tomaarsen May 7, 2026 10:13

Copilot AI reviewed May 7, 2026

workflow check, listwise eval, multilingual scoring, SKILL.md cross-refs

c2fc078

tomaarsen requested a review from Copilot May 7, 2026 10:47

Copilot started reviewing on behalf of tomaarsen May 7, 2026 10:48

Copilot AI reviewed May 7, 2026

Comment thread skills/train-sentence-transformers/scripts/train_cross_encoder_example.py

Comment thread skills/train-sentence-transformers/scripts/train_sentence_transformer_with_lora_example.py

Comment thread .github/workflows/sync-skills.yml

Comment thread skills/README.md Outdated

tomaarsen added 2 commits May 7, 2026 13:22

Tighten SKILL.md description

70a9634

Rename to plural train-sentence-transformers

cabbc5c

tomaarsen changed the title ~~Add training skill: train-sentence-transformer~~ Add training skill: train-sentence-transformers May 7, 2026

Add mentions of 'hf skills add train-sentence-transformers'

b39c996

tomaarsen enabled auto-merge (squash) May 7, 2026 14:26

tomaarsen disabled auto-merge May 7, 2026 14:47

tomaarsen merged commit e038b8a into huggingface:main May 7, 2026
17 checks passed

2 participants
