
Add training skill: train-sentence-transformers #3752

Merged
tomaarsen merged 9 commits into huggingface:main from tomaarsen:skills/add-training-skills
May 7, 2026

tomaarsen commented May 6, 2026

Member

Hello!

Pull Request overview

  • Add the train-sentence-transformers Hugging Face Agent Skill under skills/, covering all three sentence-transformers architectures (bi-encoder, cross-encoder, SPLADE) so users can drive end-to-end training runs from any compatible coding agent
  • Add .github/workflows/sync-skills.yml to mirror the canonical skill to the huggingface/skills marketplace on each v* tag

Details

The skill follows the Agent Skills format (SKILL.md + references/ + scripts/), so it's tool-neutral. Once published, users install via hf skills add train-sentence-transformers, via /plugin install train-sentence-transformers@huggingface/skills (Claude Code), or via the auto-published Cursor / Codex / Gemini variants. skills/README.md documents these install paths, plus a local-development recipe for contributors who want to symlink the skill folder into their agent's standard install location for instant edit-loop iteration. On Windows the recipe uses junctions via mklink /J, since those don't require Developer Mode or admin rights. .gitignore picks up .claude/ and .agents/ so those local symlinks stay untracked.
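The local-development recipe could be scripted roughly as follows. This is a minimal sketch, not the recipe from skills/README.md: the `link_skill` helper and the `install_dir` argument are hypothetical, and the real install location depends on the agent in use.

```python
import os
import subprocess
import sys
from pathlib import Path


def link_skill(skill_dir: Path, install_dir: Path) -> Path:
    """Link skill_dir into install_dir and return the path of the new link.

    install_dir is a hypothetical agent skills directory; check the agent's
    documentation for the location it actually scans.
    """
    skill_dir = skill_dir.resolve()
    if not skill_dir.is_dir():
        raise FileNotFoundError(f"skill folder not found: {skill_dir}")

    install_dir.mkdir(parents=True, exist_ok=True)
    link = install_dir / skill_dir.name
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"refusing to overwrite: {link}")

    if sys.platform == "win32":
        # mklink is a cmd.exe builtin; /J creates a directory junction,
        # which needs neither Developer Mode nor admin rights.
        subprocess.run(
            ["cmd", "/c", "mklink", "/J", str(link), str(skill_dir)],
            check=True,
        )
    else:
        os.symlink(skill_dir, link, target_is_directory=True)
    return link
```

Because the link points at the working tree, edits to the skill folder are visible to the agent immediately, which is the edit-loop benefit described above.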

SKILL.md is a router rather than a manual: it identifies the model type ([SentenceTransformer] / [CrossEncoder] / [SparseEncoder] via tiebreaker rules) and points at the per-type required reading. Per-type loss / evaluator catalogs (references/losses_<type>.md, references/evaluators_<type>.md) and production templates (scripts/train_<type>_example.py) sit alongside cross-cutting refs (training_args.md, dataset_formats.md, troubleshooting.md, base_model_selection.md, plus opt-in model_architectures.md, hardware_guide.md, hf_jobs_execution.md, prompts_and_instructions.md). Variant templates cover Matryoshka, multi-dataset, LoRA, distillation, multilingual, static embedding, listwise CE, and SPLADE distillation. scripts/mine_hard_negatives.py ships as a CLI for the cross-cutting hard-negative mining step.
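To illustrate the routing idea, here is a small sketch that maps a model class to its per-type files using the naming patterns quoted above. The `<type>` slugs in `TYPE_SLUGS` and the `required_reading` helper are assumptions made for illustration; the skill's actual file names and tiebreaker rules may differ.

```python
from pathlib import PurePosixPath

# Model class -> <type> slug used in file names.
# Assumption: the slugs below are illustrative guesses, not confirmed names.
TYPE_SLUGS = {
    "SentenceTransformer": "sentence_transformer",
    "CrossEncoder": "cross_encoder",
    "SparseEncoder": "sparse_encoder",
}


def required_reading(model_class: str) -> list[str]:
    """Return the per-type loss catalog, evaluator catalog, and template."""
    slug = TYPE_SLUGS[model_class]  # raises KeyError for an unknown class
    refs = PurePosixPath("references")
    scripts = PurePosixPath("scripts")
    return [
        str(refs / f"losses_{slug}.md"),
        str(refs / f"evaluators_{slug}.md"),
        str(scripts / f"train_{slug}_example.py"),
    ]
```

The point of the router design is that an agent only loads the files for the one model type it has identified, rather than reading every catalog up front.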

.github/workflows/sync-skills.yml is modelled on huggingface_hub's sync-hf-cli-skill.yml. It fires on v* tags (excluding RCs) and on manual workflow_dispatch, checks out huggingface/skills using the same GitHub App credentials as the hub-cli workflow (reachable at the huggingface org level since the repo moved here), copies skills/train-sentence-transformers/ into the receiving repo, runs that repo's ./scripts/publish.sh to regenerate the cross-tool manifests, and opens a PR. marketplace.json entries are hand-maintained on the receiving end, so first publication needs a one-time manual PR adding the folder and its entry; the workflow takes over for subsequent content updates. workflow_dispatch is the manual escape hatch for skill-only fixes between releases.
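The trigger and copy steps can be sketched in Python for clarity. This is not the workflow itself, which is YAML: `should_sync` and `mirror_skill` are hypothetical helpers, the "rc" suffix convention for release candidates is an assumption, and so is the destination path inside the receiving repo.

```python
import re
import shutil
from pathlib import Path

# Assumption: release candidates end in an "rc" marker plus a number,
# for example v5.0.0rc1 or v5.0.0-rc.1.
_RC_SUFFIX = re.compile(r"rc\.?\d+$", re.IGNORECASE)


def should_sync(tag: str) -> bool:
    """True for v* release tags, False for release candidates and other refs."""
    return tag.startswith("v") and _RC_SUFFIX.search(tag) is None


def mirror_skill(src: Path, dest_repo: Path) -> Path:
    """Replace dest_repo/skills/<name> with a fresh copy of src.

    The skills/<name> destination layout is assumed for illustration.
    """
    dest = dest_repo / "skills" / src.name
    if dest.exists():
        shutil.rmtree(dest)  # drop stale files that no longer exist upstream
    shutil.copytree(src, dest)
    return dest
```

Removing the destination before copying means files deleted from the canonical skill also disappear from the mirror, so the resulting PR shows a complete diff.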

This PR also drops --diff from the typos pre-commit hook. In --diff mode, typos silently exits non-zero with no output when a typo has multiple suggested corrections (e.g. ambiguous prefixes), which made failures hard to debug. The default error format gives file:line:col plus the suggestion inline, which pre-commit displays correctly.

  • Tom Aarsen

Copilot AI left a comment

Contributor

Pull request overview

This PR adds three Hugging Face “Agent Skills” under skills/ to enable end-to-end training workflows for sentence-transformers models (SentenceTransformer, CrossEncoder, and SparseEncoder/SPLADE), plus automation to keep shared docs in sync and to publish updates to the huggingface/skills marketplace.

Changes:

  • Introduces three self-contained skills (train-sentence-transformer, train-cross-encoder, train-sparse-encoder) with reference docs and runnable training templates.
  • Adds a sync workflow to mirror skills/<name>/ into huggingface/skills on release tags (and via manual dispatch).
  • Adds a shared-file mirroring script (skills/sync_shared.py) and a pre-commit hook to prevent drift across duplicated docs/scripts.
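A drift check of the kind described in the last bullet might look like the sketch below. It is not the contents of skills/sync_shared.py: the `sync_shared` function, its parameters, and its return value are hypothetical and only illustrate the copy-or-check pattern.

```python
import shutil
from pathlib import Path


def sync_shared(
    canonical: Path,
    mirrors: list[Path],
    shared: list[str],
    check: bool = False,
) -> list[Path]:
    """Return mirror files that differ from the canonical copy.

    With check=False the drifted files are overwritten; with check=True
    nothing is written, so a pre-commit hook can fail on a non-empty result.
    """
    drifted = []
    for mirror in mirrors:
        for rel in shared:
            src, dst = canonical / rel, mirror / rel
            if dst.is_file() and dst.read_bytes() == src.read_bytes():
                continue  # already in sync
            drifted.append(dst)
            if not check:
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(src, dst)
    return drifted
```

A hook built on this pattern would call the function with `check=True` and exit non-zero when the returned list is non-empty.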

Reviewed changes

Copilot reviewed 47 out of 48 changed files in this pull request and generated 19 comments.

| File | Description |
| --- | --- |
| .github/workflows/sync-skills.yml | Syncs skill folders into huggingface/skills and opens an automated PR on releases/manual runs. |
| .gitignore | Ignores local agent/plugin directories used during skill development. |
| .pre-commit-config.yaml | Adds a local hook to enforce shared-doc/script synchronization across skills. |
| skills/README.md | Documents how to install/use the skills and how to develop locally. |
| skills/sync_shared.py | Copies canonical shared docs/scripts from train-sentence-transformer into the other two skills (with --check). |
| skills/train-cross-encoder/SKILL.md | Skill definition and instructions for cross-encoder (reranker) training workflows. |
| skills/train-cross-encoder/scripts/mine_hard_negatives.py | CLI wrapper for mining hard negatives to support training datasets. |
| skills/train-cross-encoder/scripts/train_distillation_example.py | Cross-encoder distillation training template. |
| skills/train-cross-encoder/scripts/train_example.py | Cross-encoder pointwise training template (with hard-negative mining). |
| skills/train-cross-encoder/scripts/train_listwise_example.py | Cross-encoder listwise training template. |
| skills/train-cross-encoder/references/dataset_formats.md | Reference guide for supported dataset shapes/formats. |
| skills/train-cross-encoder/references/evaluators.md | Reference guide for evaluation options/metrics for cross-encoders. |
| skills/train-cross-encoder/references/hardware_guide.md | Hardware guidance for running training efficiently. |
| skills/train-cross-encoder/references/hf_jobs_execution.md | Guidance for running these scripts on Hugging Face Jobs. |
| skills/train-cross-encoder/references/losses.md | Reference guide for cross-encoder losses and when to use them. |
| skills/train-cross-encoder/references/prompts_and_instructions.md | Guidance for prompts/instructions usage during training. |
| skills/train-cross-encoder/references/training_args.md | Reference for training arguments and recommended settings. |
| skills/train-cross-encoder/references/troubleshooting.md | Troubleshooting guide for common training/runtime issues. |
| skills/train-sentence-transformer/SKILL.md | Skill definition and instructions for SentenceTransformer training workflows. |
| skills/train-sentence-transformer/scripts/mine_hard_negatives.py | CLI wrapper for mining hard negatives to support training datasets. |
| skills/train-sentence-transformer/scripts/train_distillation_example.py | SentenceTransformer distillation training template. |
| skills/train-sentence-transformer/scripts/train_example.py | Baseline SentenceTransformer training template. |
| skills/train-sentence-transformer/scripts/train_make_multilingual_example.py | Multilingual training template. |
| skills/train-sentence-transformer/scripts/train_matryoshka_example.py | Matryoshka training template. |
| skills/train-sentence-transformer/scripts/train_multi_dataset_example.py | Multi-dataset training template. |
| skills/train-sentence-transformer/scripts/train_static_embedding_example.py | Static embedding model training template. |
| skills/train-sentence-transformer/scripts/train_with_lora_example.py | LoRA fine-tuning training template. |
| skills/train-sentence-transformer/references/dataset_formats.md | Reference guide for supported dataset shapes/formats. |
| skills/train-sentence-transformer/references/evaluators.md | Reference guide for evaluation options/metrics for bi-encoders. |
| skills/train-sentence-transformer/references/hardware_guide.md | Hardware guidance for running training efficiently. |
| skills/train-sentence-transformer/references/hf_jobs_execution.md | Guidance for running these scripts on Hugging Face Jobs. |
| skills/train-sentence-transformer/references/losses.md | Reference guide for SentenceTransformer losses and when to use them. |
| skills/train-sentence-transformer/references/model_architectures.md | Reference guide for SentenceTransformer model architectures. |
| skills/train-sentence-transformer/references/prompts_and_instructions.md | Guidance for prompts/instructions usage during training. |
| skills/train-sentence-transformer/references/training_args.md | Reference for training arguments and recommended settings. |
| skills/train-sentence-transformer/references/troubleshooting.md | Troubleshooting guide for common training/runtime issues. |
| skills/train-sparse-encoder/SKILL.md | Skill definition and instructions for SPLADE/sparse-encoder training workflows. |
| skills/train-sparse-encoder/scripts/mine_hard_negatives.py | CLI wrapper for mining hard negatives to support training datasets. |
| skills/train-sparse-encoder/scripts/train_distillation_example.py | Sparse-encoder distillation training template. |
| skills/train-sparse-encoder/scripts/train_example.py | Sparse-encoder (SPLADE) contrastive training template. |
| skills/train-sparse-encoder/references/dataset_formats.md | Reference guide for supported dataset shapes/formats. |
| skills/train-sparse-encoder/references/evaluators.md | Reference guide for sparse evaluation options/metrics. |
| skills/train-sparse-encoder/references/hardware_guide.md | Hardware guidance for running training efficiently. |
| skills/train-sparse-encoder/references/hf_jobs_execution.md | Guidance for running these scripts on Hugging Face Jobs. |
| skills/train-sparse-encoder/references/losses.md | Reference guide for sparse-encoder losses and when to use them. |
| skills/train-sparse-encoder/references/prompts_and_instructions.md | Guidance for prompts/instructions usage during training. |
| skills/train-sparse-encoder/references/training_args.md | Reference for training arguments and recommended settings. |
| skills/train-sparse-encoder/references/troubleshooting.md | Troubleshooting guide for common training/runtime issues. |


Comment thread .github/workflows/sync-skills.yml
Comment thread .pre-commit-config.yaml Outdated
tomaarsen changed the title from "Add 3 training skills: train-sentence-transformer, train-cross-encoder, and train-sparse-encoder" to "Add training skill: train-sentence-transformer" May 7, 2026
tomaarsen requested a review from Copilot May 7, 2026 09:51

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 17 comments.

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/sync-skills.yml
Comment thread .github/workflows/sync-skills.yml
Comment thread skills/train-sentence-transformer/scripts/train_cross_encoder_listwise_example.py Outdated

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 4 comments.

Comment thread .github/workflows/sync-skills.yml
Comment thread skills/README.md Outdated
tomaarsen changed the title from "Add training skill: train-sentence-transformer" to "Add training skill: train-sentence-transformers" May 7, 2026
tomaarsen enabled auto-merge (squash) May 7, 2026 14:26
tomaarsen disabled auto-merge May 7, 2026 14:47
tomaarsen merged commit e038b8a into huggingface:main May 7, 2026
17 checks passed

2 participants