fix(dataset): make COCO annotation/image ids chainable across splits#2267
Merged
Borda merged 12 commits intoMay 22, 2026
Merged
Conversation
…oboflow#768) Exporting train/valid/test splits with DetectionDataset.as_coco previously restarted image_id and annotation_id at 1 for every split, producing three JSON files whose ids collided and could not be safely merged into a single COCO collection. Adds optional starting_image_id and starting_annotation_id parameters to save_coco_annotations and DetectionDataset.as_coco (default 1 to preserve existing behavior) and returns a (next_image_id, next_annotation_id) tuple so callers can feed the result of one export straight into the next: next_image, next_ann = train.as_coco(annotations_path="train.json") next_image, next_ann = valid.as_coco( annotations_path="valid.json", starting_image_id=next_image, starting_annotation_id=next_ann, ) test.as_coco( annotations_path="test.json", starting_image_id=next_image, starting_annotation_id=next_ann, ) The images-only branch of as_coco (annotations_path=None) round-trips the starting ids unchanged so chaining still works there. Adds 4 regression tests covering defaults, custom starting ids, end-to-end three-split chaining with global uniqueness assertions, and the images-only round-trip. Fixes roboflow#768
- Document Args for save_coco_annotations (was Returns-only) - Move cv2 import in test helper to module scope - Add empty-dataset regression test for save_coco_annotations - Tighten as_coco docstring wording on the images-only branch
Borda
previously approved these changes
May 22, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses COCO image_id / annotation_id collisions when exporting multiple dataset splits by making the id sequences chainable across repeated DetectionDataset.as_coco() / save_coco_annotations() calls.
Changes:
- Add
starting_image_id/starting_annotation_idparameters to COCO export and return(next_image_id, next_annotation_id)for easy chaining across splits. - Update
DetectionDataset.as_coco()to forward the new parameters and to return starting ids unchanged in the images-only branch. - Add regression/unit tests covering default ids, custom starting ids, empty dataset behavior, and cross-split chaining.
Assessment (per repo review checklist):
- Code quality: 4/5
- Testing: 5/5
- Docs: 4/5
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/supervision/dataset/formats/coco.py |
Adds chainable starting ids to COCO writer and returns next unused ids. |
src/supervision/dataset/core.py |
Extends DetectionDataset.as_coco() API to forward starting ids and return chaining tuple. |
tests/dataset/formats/test_coco.py |
Adds coverage for default/custom starting ids, empty dataset, and split-chaining regression for #768. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #2267 +/- ##
=======================================
Coverage 78% 78%
=======================================
Files 66 66
Lines 8406 8410 +4
=======================================
+ Hits 6524 6552 +28
+ Misses 1882 1858 -24 🚀 New features to boost your workflow:
|
…fault to 0.0 Pre-existing default of 0.75 caused aggressive polygon simplification for direct callers while as_coco used 0.0 — inconsistent behavior for the same param. [resolve roboflow#1] /review finding by foundry:solution-architect (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
Docstring opened directly with Args: — violates Google style and broke autodoc rendering (mkdocstrings requires summary sentence as first element). [resolve roboflow#2] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
Documents new starting_image_id/starting_annotation_id params and the None → tuple[int, int] return type change so downstream users can audit. [resolve roboflow#3] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
…annotations Chaining fixes integer ID uniqueness only; file_name uses bare basename so splits with identical filenames still collide when COCO files are merged. Note added to Returns section to prevent false assumptions. [resolve roboflow#4] /review finding by foundry:challenger (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
…>= 1 COCO spec requires all ids to be positive integers (1-indexed). Passing 0 or negative values silently produced spec-noncompliant files rejected by downstream COCO tools. Raises ValueError with the offending values. [resolve roboflow#5] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#5] also flagged by @Copilot (gh comment #3290385582, #3290385701) --- Co-authored-by: Claude Code <noreply@anthropic.com>
Public module-level function lacked any usage example, violating project AGENTS.md convention. Added basic single-split export example. [resolve roboflow#8] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
Silently discarding the return value contradicted the just-described chaining contract, leaving readers unsure whether the discard was intentional. Use _, _ with an explanatory comment. [resolve roboflow#9] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>
…etections Three additions: - Assert cv2.imwrite return value in _tiny_detection_dataset helper so codec failures produce explicit errors rather than cryptic downstream ones - Fix _tiny_detection_dataset xyxy shape for dets_per_image=0 (was (0,), now (0,4) via reshape — required by Detections validator) - test_save_coco_annotations_annotation_image_id_references_correct_image: verifies every annotation.image_id references a known image id after chaining with a non-default starting_image_id - test_save_coco_annotations_zero_annotation_images: pins behaviour for images with zero detections — image ids advance, annotation list empty [resolve roboflow#6] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#7] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#10] @Copilot (gh comment #3290385653): --- Co-authored-by: Claude Code <noreply@anthropic.com>
Borda
approved these changes
May 22, 2026
4 tasks
Borda
added a commit
to madhavcodez/supervision
that referenced
this pull request
May 27, 2026
Documents the behavior-observable change from roboflow#2276/roboflow#1181 in the UnReleased section, consistent with project convention used for roboflow#2267. --- Co-authored-by: Claude Code <noreply@anthropic.com>
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Before submitting
docs/datasets/core.mdalready auto-rendersDetectionDatasetvia mkdocstrings, so the updated docstring (including the multi-split chaining example) shows up there automatically.Description
DetectionDataset.as_cocoresetsimage_idandannotation_idto1on every call. Exporting train / valid / test as three separate COCO files therefore produces three id sequences that all start at 1 and collide when downstream tooling merges them into a single COCO collection — the failure mode reported in #768.This PR adds optional
starting_image_idandstarting_annotation_idparameters (both default to1, preserving existing behavior) and returns the first unused ids so callers can chain exports without bookkeeping:Type of Change
Motivation and Context
DetectionDataset.as_cocois the documented path for converting YOLO/VOC datasets to COCO. Real users hit this when they convert split-by-split and then try to merge into a single COCO file for training/eval — the duplicated ids quietly break downstream tools that assume globally unique ids per the spec. Issue has been open since January 2024.Closes #768
Changes Made
src/supervision/dataset/formats/coco.py—save_coco_annotationsgainsstarting_image_idandstarting_annotation_idparameters (default1) and returns(next_image_id, next_annotation_id). Args and Returns blocks added.src/supervision/dataset/core.py—DetectionDataset.as_cocoforwards the two new params, returns the tuple fromsave_coco_annotations, and falls through to(starting_image_id, starting_annotation_id)when only writing images so the chaining pattern still composes in that branch. Docstring expanded with the three-split chaining example.tests/dataset/formats/test_coco.py— 5 new tests (see below).Testing
5 new tests in
tests/dataset/formats/test_coco.py:test_save_coco_annotations_defaults_start_at_one1..Nand returns the next unused ids.test_save_coco_annotations_respects_starting_idsstarting_image_id/starting_annotation_idshow up both in the written JSON and in the returned tuple.test_save_coco_annotations_empty_dataset_returns_starting_idstest_as_coco_chains_ids_across_splits_without_collisiontest_as_coco_without_annotations_path_returns_starting_idspytest tests/dataset/formats/test_coco.py→ 61 passed locally (56 existing + 5 new).ruff checkandruff format --checkare clean on the touched files.Notes for review
Nonetotuple[int, int]. I checked every call site in the repo (src/,tests/,docs/,examples/) and none store or pattern-match against the return value, so this is backward-compatible in practice. Mentioned here for completeness.1, 1) match the previously hard-coded values byte-for-byte.save_coco_annotations, so the fix lands in one place.