Skip to content

fix(dataset): make COCO annotation/image ids chainable across splits#2267

Merged
Borda merged 12 commits into
roboflow:developfrom
madhavcodez:fix/coco-annotation-id-collision-across-splits
May 22, 2026
Merged

fix(dataset): make COCO annotation/image ids chainable across splits#2267
Borda merged 12 commits into
roboflow:developfrom
madhavcodez:fix/coco-annotation-id-collision-across-splits

Conversation

@madhavcodez

@madhavcodez madhavcodez commented May 22, 2026

Copy link
Copy Markdown
Contributor
Before submitting
  • Self-reviewed the code
  • Updated documentation, follow Google-style
  • Added docs entry for autogeneration (if new functions/classes) — docs/datasets/core.md already auto-renders DetectionDataset via mkdocstrings, so the updated docstring (including the multi-split chaining example) shows up there automatically.
  • Added/updated tests
  • All tests pass locally

Description

DetectionDataset.as_coco resets image_id and annotation_id to 1 on every call. Exporting train / valid / test as three separate COCO files therefore produces three id sequences that all start at 1 and collide when downstream tooling merges them into a single COCO collection — the failure mode reported in #768.

This PR adds optional starting_image_id and starting_annotation_id parameters (both default to 1, preserving existing behavior) and returns the first unused ids so callers can chain exports without bookkeeping:

next_image, next_ann = train.as_coco(annotations_path="train.json")
next_image, next_ann = valid.as_coco(
    annotations_path="valid.json",
    starting_image_id=next_image,
    starting_annotation_id=next_ann,
)
test.as_coco(
    annotations_path="test.json",
    starting_image_id=next_image,
    starting_annotation_id=next_ann,
)

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)

Motivation and Context

DetectionDataset.as_coco is the documented path for converting YOLO/VOC datasets to COCO. Real users hit this when they convert split-by-split and then try to merge into a single COCO file for training/eval — the duplicated ids quietly break downstream tools that assume globally unique ids per the spec. Issue has been open since January 2024.

Closes #768

Changes Made

  • src/supervision/dataset/formats/coco.pysave_coco_annotations gains starting_image_id and starting_annotation_id parameters (default 1) and returns (next_image_id, next_annotation_id). Args and Returns blocks added.
  • src/supervision/dataset/core.pyDetectionDataset.as_coco forwards the two new params, returns the tuple from save_coco_annotations, and falls through to (starting_image_id, starting_annotation_id) when only writing images so the chaining pattern still composes in that branch. Docstring expanded with the three-split chaining example.
  • tests/dataset/formats/test_coco.py — 5 new tests (see below).

Testing

  • I have tested this code locally
  • I have added unit tests that prove my fix is effective

5 new tests in tests/dataset/formats/test_coco.py:

Test What it pins
test_save_coco_annotations_defaults_start_at_one Default behavior still produces ids 1..N and returns the next unused ids.
test_save_coco_annotations_respects_starting_ids Custom starting_image_id / starting_annotation_id show up both in the written JSON and in the returned tuple.
test_save_coco_annotations_empty_dataset_returns_starting_ids Empty dataset writes a valid (empty) COCO file and returns the starting ids unchanged so chaining still composes around it.
test_as_coco_chains_ids_across_splits_without_collision End-to-end three-split chain produces globally unique image and annotation ids across the three output files (the regression for #768).
test_as_coco_without_annotations_path_returns_starting_ids Images-only path round-trips the starting ids so chaining still composes when only images are being written.

pytest tests/dataset/formats/test_coco.py61 passed locally (56 existing + 5 new). ruff check and ruff format --check are clean on the touched files.

Notes for review

  • The return type changes from None to tuple[int, int]. I checked every call site in the repo (src/, tests/, docs/, examples/) and none store or pattern-match against the return value, so this is backward-compatible in practice. Mentioned here for completeness.
  • No behavior change when the new parameters are not supplied — the defaults (1, 1) match the previously hard-coded values byte-for-byte.
  • All COCO write paths go through save_coco_annotations, so the fix lands in one place.

@madhavcodez madhavcodez requested a review from SkalskiP as a code owner May 22, 2026 05:23
…oboflow#768)

Exporting train/valid/test splits with DetectionDataset.as_coco
previously restarted image_id and annotation_id at 1 for every split,
producing three JSON files whose ids collided and could not be safely
merged into a single COCO collection.

Adds optional starting_image_id and starting_annotation_id parameters
to save_coco_annotations and DetectionDataset.as_coco (default 1 to
preserve existing behavior) and returns a (next_image_id,
next_annotation_id) tuple so callers can feed the result of one
export straight into the next:

    next_image, next_ann = train.as_coco(annotations_path="train.json")
    next_image, next_ann = valid.as_coco(
        annotations_path="valid.json",
        starting_image_id=next_image,
        starting_annotation_id=next_ann,
    )
    test.as_coco(
        annotations_path="test.json",
        starting_image_id=next_image,
        starting_annotation_id=next_ann,
    )

The images-only branch of as_coco (annotations_path=None) round-trips
the starting ids unchanged so chaining still works there.

Adds 4 regression tests covering defaults, custom starting ids,
end-to-end three-split chaining with global uniqueness assertions,
and the images-only round-trip.

Fixes roboflow#768
- Document Args for save_coco_annotations (was Returns-only)
- Move cv2 import in test helper to module scope
- Add empty-dataset regression test for save_coco_annotations
- Tighten as_coco docstring wording on the images-only branch
@Borda Borda changed the title fix(dataset): make COCO annotation/image ids chainable across splits (#768) fix(dataset): make COCO annotation/image ids chainable across splits May 22, 2026
@Borda Borda requested a review from Copilot May 22, 2026 17:57
Borda
Borda previously approved these changes May 22, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses COCO image_id / annotation_id collisions when exporting multiple dataset splits by making the id sequences chainable across repeated DetectionDataset.as_coco() / save_coco_annotations() calls.

Changes:

  • Add starting_image_id / starting_annotation_id parameters to COCO export and return (next_image_id, next_annotation_id) for easy chaining across splits.
  • Update DetectionDataset.as_coco() to forward the new parameters and to return starting ids unchanged in the images-only branch.
  • Add regression/unit tests covering default ids, custom starting ids, empty dataset behavior, and cross-split chaining.

Assessment (per repo review checklist):

  • Code quality: 4/5
  • Testing: 5/5
  • Docs: 4/5

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/supervision/dataset/formats/coco.py Adds chainable starting ids to COCO writer and returns next unused ids.
src/supervision/dataset/core.py Extends DetectionDataset.as_coco() API to forward starting ids and return chaining tuple.
tests/dataset/formats/test_coco.py Adds coverage for default/custom starting ids, empty dataset, and split-chaining regression for #768.

Comment thread src/supervision/dataset/formats/coco.py
Comment thread tests/dataset/formats/test_coco.py
Comment thread src/supervision/dataset/core.py
@codecov

codecov Bot commented May 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78%. Comparing base (cb25906) to head (35dfe42).

Additional details and impacted files
@@           Coverage Diff           @@
##           develop   #2267   +/-   ##
=======================================
  Coverage       78%     78%           
=======================================
  Files           66      66           
  Lines         8406    8410    +4     
=======================================
+ Hits          6524    6552   +28     
+ Misses        1882    1858   -24     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Borda and others added 8 commits May 22, 2026 21:28
…fault to 0.0

Pre-existing default of 0.75 caused aggressive polygon simplification for direct
callers while as_coco used 0.0 — inconsistent behavior for the same param.

[resolve roboflow#1] /review finding by foundry:solution-architect (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Docstring opened directly with Args: — violates Google style and broke
autodoc rendering (mkdocstrings requires summary sentence as first element).

[resolve roboflow#2] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Documents new starting_image_id/starting_annotation_id params and the
None → tuple[int, int] return type change so downstream users can audit.

[resolve roboflow#3] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…annotations

Chaining fixes integer ID uniqueness only; file_name uses bare basename so
splits with identical filenames still collide when COCO files are merged.
Note added to Returns section to prevent false assumptions.

[resolve roboflow#4] /review finding by foundry:challenger (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…>= 1

COCO spec requires all ids to be positive integers (1-indexed). Passing 0
or negative values silently produced spec-noncompliant files rejected by
downstream COCO tools. Raises ValueError with the offending values.

[resolve roboflow#5] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):
[resolve roboflow#5] also flagged by @Copilot (gh comment #3290385582, #3290385701)

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Public module-level function lacked any usage example, violating project
AGENTS.md convention. Added basic single-split export example.

[resolve roboflow#8] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
Silently discarding the return value contradicted the just-described
chaining contract, leaving readers unsure whether the discard was
intentional. Use _, _ with an explanatory comment.

[resolve roboflow#9] /review finding by foundry:doc-scribe (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
…etections

Three additions:
- Assert cv2.imwrite return value in _tiny_detection_dataset helper so
  codec failures produce explicit errors rather than cryptic downstream ones
- Fix _tiny_detection_dataset xyxy shape for dets_per_image=0 (was (0,),
  now (0,4) via reshape — required by Detections validator)
- test_save_coco_annotations_annotation_image_id_references_correct_image:
  verifies every annotation.image_id references a known image id after
  chaining with a non-default starting_image_id
- test_save_coco_annotations_zero_annotation_images: pins behaviour for
  images with zero detections — image ids advance, annotation list empty

[resolve roboflow#6] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):
[resolve roboflow#7] /review finding by foundry:qa-specialist (report: .reports/review/2026-05-22T18-44-11Z/review-report.md):
[resolve roboflow#10] @Copilot (gh comment #3290385653):

---
Co-authored-by: Claude Code <noreply@anthropic.com>
@Borda Borda added the enhancement New feature or request label May 22, 2026
@Borda Borda merged commit befdb7c into roboflow:develop May 22, 2026
26 checks passed
Borda added a commit to madhavcodez/supervision that referenced this pull request May 27, 2026
Documents the behavior-observable change from roboflow#2276/roboflow#1181 in the
UnReleased section, consistent with project convention used for roboflow#2267.

---
Co-authored-by: Claude Code <noreply@anthropic.com>
@Borda Borda mentioned this pull request Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DetectionDataset.as_coco having annotation id collision between splits

3 participants