fix: retry checksum errors when reading tfevents files by timoffex · Pull Request #10625 · wandb/wandb

timoffex · 2025-10-01T23:45:55Z

Fixes WB-20957.

Retries checksum errors when syncing TensorBoard files. Previously, the integration would stop syncing tfevents files upon encountering these errors; now it logs a warning and retries later (the default delay is 5 seconds).
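The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `errChecksum`, `readWithRetry`, and the `maxAttempts` cap are hypothetical names and assumptions added so the example terminates, and the delay is passed in rather than fixed at 5 seconds.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errChecksum is a hypothetical sentinel for a CRC mismatch.
var errChecksum = errors.New("unexpected CRC-32C checksum for event")

// readWithRetry calls readNext until it succeeds, fails with an error
// other than errChecksum, or has been tried maxAttempts times.
// maxAttempts is an assumption of this sketch; at least one attempt
// is always made.
func readWithRetry(
	readNext func() ([]byte, error),
	delay time.Duration,
	maxAttempts int,
) ([]byte, error) {
	for attempt := 1; ; attempt++ {
		event, err := readNext()
		if err == nil {
			return event, nil
		}
		// Only checksum errors are treated as possibly transient.
		if !errors.Is(err, errChecksum) || attempt >= maxAttempts {
			return nil, err
		}
		fmt.Printf("warning: %v (attempt %d); retrying in %v\n", err, attempt, delay)
		time.Sleep(delay)
	}
}

func main() {
	calls := 0
	// Simulated reader: the first two reads see a half-written event.
	readNext := func() ([]byte, error) {
		calls++
		if calls < 3 {
			return nil, errChecksum
		}
		return []byte("event"), nil
	}

	event, err := readWithRetry(readNext, 10*time.Millisecond, 5)
	fmt.Println(string(event), err, calls)
}
```

Treating only the checksum error as retryable keeps other failures (for example, a missing file) visible immediately.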

See this list of events:

tensorboard: failed reading next event: tensorboard: unexpected CRC-32C checksum for event. Expected: 24, got: 2894970002

In almost every one, the "Expected" number (the checksum read from the file rather than calculated from the data) is small: usually 0, but sometimes 24, 12869, or 2878429. The checksum is stored in little-endian format, so a simple explanation would be that its final (most significant) bytes had not yet been written when the file was read.
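To make the little-endian argument concrete, here is a small sketch that simulates reading a 4-byte checksum of which only the first few bytes have been written. The helper names `readPartial` and `significantBytes` are made up for this illustration, and the sketch assumes unwritten bytes read back as zero.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// readPartial decodes a little-endian uint32 of which only the first
// `written` bytes have reached the file. The remaining bytes are
// assumed to read as zero.
func readPartial(full uint32, written int) uint32 {
	if written < 0 {
		written = 0
	}
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], full)
	for i := written; i < len(buf); i++ {
		buf[i] = 0
	}
	return binary.LittleEndian.Uint32(buf[:])
}

// significantBytes returns how many low-order bytes are needed to
// represent v, i.e. the minimum number of bytes that must have been
// written for v to be read back.
func significantBytes(v uint32) int {
	n := 0
	for ; v != 0; v >>= 8 {
		n++
	}
	return n
}

func main() {
	// An arbitrary full checksum, truncated after 0..4 bytes.
	for written := 0; written <= 4; written++ {
		fmt.Println(written, readPartial(0xB21AEF6D, written))
	}

	// The small "Expected" values quoted above fit in 0 to 3 bytes.
	for _, v := range []uint32{0, 24, 12869, 2878429} {
		fmt.Println(v, significantBytes(v))
	}
}
```

A value that needs all four bytes cannot be explained by this kind of truncation alone.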

Unfortunately for that explanation, there are plenty of counter-examples, like this event:

(big endian bytes)
5F 6C 2F 73 (1600925555) - checksum read from the file
B2 1A EF 6D (2988109677) - checksum computed from the payload
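For reference, the comparison between a stored and a computed checksum can be sketched as below. The TFRecord format used by tfevents files stores a masked CRC-32C (Castagnoli polynomial) after each payload; whether the numbers in the error messages above are masked or unmasked is not stated here, so treat that as an open assumption. The function names are hypothetical and are not taken from `tfeventreader.go`.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32c returns the plain CRC-32C of data.
func crc32c(data []byte) uint32 {
	return crc32.Checksum(data, castagnoli)
}

// maskedCRC32C applies the TFRecord masking step: rotate the CRC
// right by 15 bits and add a constant (with 32-bit wraparound).
func maskedCRC32C(data []byte) uint32 {
	crc := crc32c(data)
	return ((crc >> 15) | (crc << 17)) + 0xa282ead8
}

// checksumMatches reports whether the checksum read from the file
// agrees with the one computed from the payload.
func checksumMatches(payload []byte, stored uint32) bool {
	return maskedCRC32C(payload) == stored
}

func main() {
	payload := []byte("123456789")
	fmt.Printf("crc32c: %#x\n", crc32c(payload))
	fmt.Printf("masked: %#x\n", maskedCRC32C(payload))
	fmt.Println(checksumMatches(payload, maskedCRC32C(payload)))
	fmt.Println(checksumMatches(payload, 0)) // e.g. checksum not yet written
}
```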

A way to account for this is that the checksum and payload may be written independently, possibly using a mechanism like mmap. When mmap is used, we can't assume the data in the file is final: the writer may have reserved space for the event but not yet written all its bytes. When we see an invalid checksum, we should reread the data.

But still, this doesn't explain all the checksum errors we see. In some cases, the same "expected" and "actual" checksums appear in multiple events spaced hours apart, so there is probably some kind of data corruption.

timoffex · 2025-10-01T23:46:14Z

fix: retry checksum errors when reading tfevents files #10625
main

This stack of pull requests is managed by Graphite.

codecov · 2025-10-01T23:57:52Z

Codecov Report

❌ Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
core/internal/tensorboard/tfeventreader.go	88.88%	6 Missing and 1 partial ⚠️

graphite-app · 2025-10-06T18:09:08Z

Merge activity

Oct 6, 6:09 PM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
Oct 6, 6:09 PM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
Oct 13, 9:41 PM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
Oct 13, 9:41 PM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
Oct 14, 1:22 AM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
Oct 14, 1:22 AM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
Oct 14, 1:36 AM UTC: Graphite rebased this pull request as part of a merge.
Oct 14, 1:46 AM UTC: @timoffex merged this pull request with Graphite.

dmitryduev

🦺 🚢

timoffex changed the title ~~tensorboard checksum retry~~ fix: retry checksum errors when reading tfevents files Oct 1, 2025

timoffex marked this pull request as ready for review October 2, 2025 00:11

timoffex requested a review from a team as a code owner October 2, 2025 00:11

timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from c98cc6d to 44af9a7 Compare October 2, 2025 00:14

jacobromero reviewed Oct 2, 2025

Comment thread core/internal/tensorboard/tfeventreader.go Outdated

dmitryduev approved these changes Oct 2, 2025

timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from 44af9a7 to f05fdf4 Compare October 9, 2025 21:27

timoffex requested review from dmitryduev and jacobromero October 9, 2025 21:29

dmitryduev approved these changes Oct 10, 2025

Comment thread core/internal/tensorboard/tfeventreader.go

Comment thread core/internal/tensorboard/tfeventreader.go

Comment thread core/internal/tensorboard/tfeventreader.go

timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch 2 times, most recently from ffe5a8b to 09aa705 Compare October 14, 2025 01:22

tensorboard checksum retry

5fc034e

timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from 09aa705 to 5fc034e Compare October 14, 2025 01:36

timoffex merged commit 476cec8 into main Oct 14, 2025
32 checks passed

timoffex deleted the timoffex/10-01-tensorboard_checksum_retry branch October 14, 2025 01:46
