fix(dataset): make COCO annotation/image ids chainable across splits by madhavcodez · Pull Request #2267 · roboflow/supervision

madhavcodez · 2026-05-22T05:23:38Z

Before submitting

Self-reviewed the code
Updated documentation, follow Google-style
Added docs entry for autogeneration (if new functions/classes) — docs/datasets/core.md already auto-renders DetectionDataset via mkdocstrings, so the updated docstring (including the multi-split chaining example) shows up there automatically.
Added/updated tests
All tests pass locally

Description

DetectionDataset.as_coco resets image_id and annotation_id to 1 on every call. Exporting train / valid / test as three separate COCO files therefore produces three id sequences that all start at 1 and collide when downstream tooling merges them into a single COCO collection — the failure mode reported in #768.

This PR adds optional starting_image_id and starting_annotation_id parameters (both default to 1, preserving existing behavior) and returns the first unused ids so callers can chain exports without bookkeeping:

next_image, next_ann = train.as_coco(annotations_path="train.json")
next_image, next_ann = valid.as_coco(
    annotations_path="valid.json",
    starting_image_id=next_image,
    starting_annotation_id=next_ann,
)
test.as_coco(
    annotations_path="test.json",
    starting_image_id=next_image,
    starting_annotation_id=next_ann,
)

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)

Motivation and Context

DetectionDataset.as_coco is the documented path for converting YOLO/VOC datasets to COCO. Real users hit this when they convert split-by-split and then try to merge into a single COCO file for training/eval — the duplicated ids quietly break downstream tools that assume globally unique ids per the spec. Issue has been open since January 2024.

Closes #768

Changes Made

src/supervision/dataset/formats/coco.py — save_coco_annotations gains starting_image_id and starting_annotation_id parameters (default 1) and returns (next_image_id, next_annotation_id). Args and Returns blocks added.
src/supervision/dataset/core.py — DetectionDataset.as_coco forwards the two new params, returns the tuple from save_coco_annotations, and falls through to (starting_image_id, starting_annotation_id) when only writing images so the chaining pattern still composes in that branch. Docstring expanded with the three-split chaining example.
tests/dataset/formats/test_coco.py — 5 new tests (see below).

Testing

I have tested this code locally
I have added unit tests that prove my fix is effective

5 new tests in tests/dataset/formats/test_coco.py:

Test	What it pins
`test_save_coco_annotations_defaults_start_at_one`	Default behavior still produces ids `1..N` and returns the next unused ids.
`test_save_coco_annotations_respects_starting_ids`	Custom `starting_image_id` / `starting_annotation_id` show up both in the written JSON and in the returned tuple.
`test_save_coco_annotations_empty_dataset_returns_starting_ids`	Empty dataset writes a valid (empty) COCO file and returns the starting ids unchanged so chaining still composes around it.
`test_as_coco_chains_ids_across_splits_without_collision`	End-to-end three-split chain produces globally unique image and annotation ids across the three output files (the regression for #768).
`test_as_coco_without_annotations_path_returns_starting_ids`	Images-only path round-trips the starting ids so chaining still composes when only images are being written.

pytest tests/dataset/formats/test_coco.py → 61 passed locally (56 existing + 5 new). ruff check and ruff format --check are clean on the touched files.

Notes for review

The return type changes from None to tuple[int, int]. I checked every call site in the repo (src/, tests/, docs/, examples/) and none store or pattern-match against the return value, so this is backward-compatible in practice. Mentioned here for completeness.
No behavior change when the new parameters are not supplied — the defaults (1, 1) match the previously hard-coded values byte-for-byte.
All COCO write paths go through save_coco_annotations, so the fix lands in one place.

…oboflow#768) Exporting train/valid/test splits with DetectionDataset.as_coco previously restarted image_id and annotation_id at 1 for every split, producing three JSON files whose ids collided and could not be safely merged into a single COCO collection. Adds optional starting_image_id and starting_annotation_id parameters to save_coco_annotations and DetectionDataset.as_coco (default 1 to preserve existing behavior) and returns a (next_image_id, next_annotation_id) tuple so callers can feed the result of one export straight into the next: next_image, next_ann = train.as_coco(annotations_path="train.json") next_image, next_ann = valid.as_coco( annotations_path="valid.json", starting_image_id=next_image, starting_annotation_id=next_ann, ) test.as_coco( annotations_path="test.json", starting_image_id=next_image, starting_annotation_id=next_ann, ) The images-only branch of as_coco (annotations_path=None) round-trips the starting ids unchanged so chaining still works there. Adds 4 regression tests covering defaults, custom starting ids, end-to-end three-split chaining with global uniqueness assertions, and the images-only round-trip. Fixes roboflow#768

- Document Args for save_coco_annotations (was Returns-only) - Move cv2 import in test helper to module scope - Add empty-dataset regression test for save_coco_annotations - Tighten as_coco docstring wording on the images-only branch

…plits

Copilot

Pull request overview

This PR addresses COCO image_id / annotation_id collisions when exporting multiple dataset splits by making the id sequences chainable across repeated DetectionDataset.as_coco() / save_coco_annotations() calls.

Changes:

Add starting_image_id / starting_annotation_id parameters to COCO export and return (next_image_id, next_annotation_id) for easy chaining across splits.
Update DetectionDataset.as_coco() to forward the new parameters and to return starting ids unchanged in the images-only branch.
Add regression/unit tests covering default ids, custom starting ids, empty dataset behavior, and cross-split chaining.

Assessment (per repo review checklist):

Code quality: 4/5
Testing: 5/5
Docs: 4/5

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
`src/supervision/dataset/formats/coco.py`	Adds chainable starting ids to COCO writer and returns next unused ids.
`src/supervision/dataset/core.py`	Extends `DetectionDataset.as_coco()` API to forward starting ids and return chaining tuple.
`tests/dataset/formats/test_coco.py`	Adds coverage for default/custom starting ids, empty dataset, and split-chaining regression for #768.

codecov · 2026-05-22T18:02:37Z

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78%. Comparing base (cb25906) to head (35dfe42).

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #2267   +/-   ##
=======================================
  Coverage       78%     78%           
=======================================
  Files           66      66           
  Lines         8406    8410    +4     
=======================================
+ Hits          6524    6552   +28     
+ Misses        1882    1858   -24

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…fault to 0.0 Pre-existing default of 0.75 caused aggressive polygon simplification for direct callers while as_coco used 0.0 — inconsistent behavior for the same param. [resolve roboflow#1] /review finding by foundry:solution-architect (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

Docstring opened directly with Args: — violates Google style and broke autodoc rendering (mkdocstrings requires summary sentence as first element). [resolve roboflow#2] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

Documents new starting_image_id/starting_annotation_id params and the None → tuple[int, int] return type change so downstream users can audit. [resolve roboflow#3] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

…annotations Chaining fixes integer ID uniqueness only; file_name uses bare basename so splits with identical filenames still collide when COCO files are merged. Note added to Returns section to prevent false assumptions. [resolve roboflow#4] /review finding by foundry:challenger (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

…>= 1 COCO spec requires all ids to be positive integers (1-indexed). Passing 0 or negative values silently produced spec-noncompliant files rejected by downstream COCO tools. Raises ValueError with the offending values. [resolve roboflow#5] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#5] also flagged by @Copilot (gh comment #3290385582, #3290385701) --- Co-authored-by: Claude Code <noreply@anthropic.com>

Public module-level function lacked any usage example, violating project AGENTS.md convention. Added basic single-split export example. [resolve roboflow#8] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

Silently discarding the return value contradicted the just-described chaining contract, leaving readers unsure whether the discard was intentional. Use _, _ with an explanatory comment. [resolve roboflow#9] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): --- Co-authored-by: Claude Code <noreply@anthropic.com>

…etections Three additions: - Assert cv2.imwrite return value in _tiny_detection_dataset helper so codec failures produce explicit errors rather than cryptic downstream ones - Fix _tiny_detection_dataset xyxy shape for dets_per_image=0 (was (0,), now (0,4) via reshape — required by Detections validator) - test_save_coco_annotations_annotation_image_id_references_correct_image: verifies every annotation.image_id references a known image id after chaining with a non-default starting_image_id - test_save_coco_annotations_zero_annotation_images: pins behaviour for images with zero detections — image ids advance, annotation list empty [resolve roboflow#6] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#7] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md): [resolve roboflow#10] @Copilot (gh comment #3290385653): --- Co-authored-by: Claude Code <noreply@anthropic.com>

…plits

Documents the behavior-observable change from roboflow#2276/roboflow#1181 in the UnReleased section, consistent with project convention used for roboflow#2267. --- Co-authored-by: Claude Code <noreply@anthropic.com>

madhavcodez requested a review from SkalskiP as a code owner May 22, 2026 05:23

madhavcodez added 2 commits May 22, 2026 00:30

docs: address review polish on COCO id-chaining

def3fd3

- Document Args for save_coco_annotations (was Returns-only) - Move cv2 import in test helper to module scope - Add empty-dataset regression test for save_coco_annotations - Tighten as_coco docstring wording on the images-only branch

Borda changed the title ~~fix(dataset): make COCO annotation/image ids chainable across splits (#768)~~ fix(dataset): make COCO annotation/image ids chainable across splits May 22, 2026

Borda requested a review from Copilot May 22, 2026 17:57

Copilot started reviewing on behalf of Borda May 22, 2026 17:58 View session

Merge branch 'develop' into fix/coco-annotation-id-collision-across-s…

3abcfd5

…plits

Borda previously approved these changes May 22, 2026

View reviewed changes

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread src/supervision/dataset/formats/coco.py

Comment thread tests/dataset/formats/test_coco.py

Comment thread src/supervision/dataset/core.py

github-actions Bot added the has conflicts label May 22, 2026

Borda and others added 8 commits May 22, 2026 21:28

Borda dismissed their stale review via 20e25ea May 22, 2026 19:32

Merge branch 'develop' into fix/coco-annotation-id-collision-across-s…

35dfe42

…plits

github-actions Bot removed the has conflicts label May 22, 2026

Borda added the enhancement New feature or request label May 22, 2026

Borda approved these changes May 22, 2026

View reviewed changes

Borda merged commit befdb7c into roboflow:develop May 22, 2026
26 checks passed

satishkc7 mentioned this pull request May 25, 2026

feat: add show_progress option to dataset loaders and savers #2181

Open

4 tasks

Borda mentioned this pull request Jun 11, 2026

releasing 0.29.0 #2311

Closed

SkalskiP mentioned this pull request Jun 11, 2026

supervision-0.29.0 release #2310

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(dataset): make COCO annotation/image ids chainable across splits#2267

fix(dataset): make COCO annotation/image ids chainable across splits#2267
Borda merged 12 commits into
roboflow:developfrom
madhavcodez:fix/coco-annotation-id-collision-across-splits

madhavcodez commented May 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

madhavcodez commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Type of Change

Motivation and Context

Changes Made

Testing

Notes for review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

madhavcodez commented May 22, 2026 •

edited

Loading

codecov Bot commented May 22, 2026 •

edited

Loading