fix: retry checksum errors when reading tfevents files #10625

Merged
timoffex merged 1 commit into main from timoffex/10-01-tensorboard_checksum_retry on Oct 14, 2025

@timoffex timoffex commented Oct 1, 2025

Fixes WB-20957.

Retries checksum errors when syncing TensorBoard files. Previously, the integration would stop syncing tfevents files upon encountering these errors; now it logs a warning and retries later (the default delay is 5 seconds).

See this list of events:

tensorboard: failed reading next event: tensorboard: unexpected CRC-32C checksum for event. Expected: 24, got: 2894970002

In almost every one, the Expected number (the checksum read from the file, as opposed to the one calculated from the data) is small: usually 0, but in some instances 24, 12869, or 2878429. The checksum is stored in little-endian format, so a simple explanation would be that its final (most significant) bytes had not yet been written when the file was read.
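To make the partial-write reading concrete, here is a minimal sketch (not code from this PR; `significantBytes` is a hypothetical helper) that counts how many of the four on-disk bytes must be non-trivially present to produce each "Expected" value quoted above. A value that needs fewer than four bytes is consistent with a checksum whose trailing bytes were still zero.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// significantBytes reports how many leading on-disk bytes of a little-endian
// uint32 are needed to represent v; every byte after that is zero.
// Hypothetical helper, for illustration only.
func significantBytes(v uint32) int {
	var buf [4]byte
	binary.LittleEndian.PutUint32(buf[:], v)
	n := len(buf)
	for n > 0 && buf[n-1] == 0 {
		n--
	}
	return n
}

func main() {
	// "Expected" values quoted in the description above.
	for _, v := range []uint32{0, 24, 12869, 2878429} {
		fmt.Printf("%8d needs %d of 4 bytes\n", v, significantBytes(v))
	}
}
```

The four values need 0, 1, 2, and 3 bytes respectively, which is what one would see if the reader caught the checksum partway through being written.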

Unfortunately for that explanation, there are plenty of counter-examples, like this event:

(big endian bytes)
5F 6C 2F 73 (1600925555) - checksum read from the file
B2 1A EF 6D (2988109677) - checksum computed from the payload
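As a quick check of the counter-example, the sketch below decodes the two byte sequences as big-endian integers and counts the positions where they differ. The helper names `decodeBE` and `differingBytes` are made up for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeBE interprets four bytes as a big-endian uint32, matching how the
// bytes are listed above (most significant byte first).
func decodeBE(b []byte) uint32 {
	return binary.BigEndian.Uint32(b)
}

// differingBytes counts the positions at which a and b hold different bytes.
// It assumes both slices have the same length.
func differingBytes(a, b []byte) int {
	n := 0
	for i := range a {
		if a[i] != b[i] {
			n++
		}
	}
	return n
}

func main() {
	fromFile := []byte{0x5F, 0x6C, 0x2F, 0x73}
	computed := []byte{0xB2, 0x1A, 0xEF, 0x6D}

	fmt.Println(decodeBE(fromFile), decodeBE(computed))
	fmt.Println("differing bytes:", differingBytes(fromFile, computed))
}
```

All four bytes differ and none of the stored bytes is zero, so this mismatch cannot be explained by missing high-order bytes alone.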

A way to account for this is that the checksum and payload may be written independently, possibly using a mechanism like mmap. When mmap is used, we can't assume the data in the file is final: the writer may have reserved space for the event but not yet written all its bytes. When we see an invalid checksum, we should reread the data.

But still, this doesn't explain all the checksum errors we see. In some cases, the same "expected" and "actual" checksums appear in multiple events spaced hours apart, so there is probably some kind of data corruption.

@timoffex timoffex changed the title from "tensorboard checksum retry" to "fix: retry checksum errors when reading tfevents files" Oct 1, 2025

timoffex commented Oct 1, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

codecov Bot commented Oct 1, 2025

Codecov Report

❌ Patch coverage is 89.06250% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| core/internal/tensorboard/tfeventreader.go | 88.88% | 6 Missing and 1 partial ⚠️ |

@timoffex timoffex marked this pull request as ready for review October 2, 2025 00:11
@timoffex timoffex requested a review from a team as a code owner October 2, 2025 00:11
@timoffex timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from c98cc6d to 44af9a7 Compare October 2, 2025 00:14
Comment thread core/internal/tensorboard/tfeventreader.go Outdated

graphite-app Bot commented Oct 6, 2025

Merge activity

  • Oct 6, 6:09 PM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
  • Oct 6, 6:09 PM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
  • Oct 13, 9:41 PM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
  • Oct 13, 9:41 PM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
  • Oct 14, 1:22 AM UTC: This pull request cannot be added to the Graphite merge queue. Please try rebasing and resubmitting to merge when ready.
  • Oct 14, 1:22 AM UTC: Graphite disabled "merge when ready" on this PR due to: a merge conflict with the target branch; resolve the conflict and try again.
  • Oct 14, 1:36 AM UTC: Graphite rebased this pull request as part of a merge.
  • Oct 14, 1:46 AM UTC: @timoffex merged this pull request with Graphite.

@timoffex timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from 44af9a7 to f05fdf4 Compare October 9, 2025 21:27

@dmitryduev dmitryduev left a comment

🦺 🚢

Comment thread core/internal/tensorboard/tfeventreader.go
Comment thread core/internal/tensorboard/tfeventreader.go
Comment thread core/internal/tensorboard/tfeventreader.go
@timoffex timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch 2 times, most recently from ffe5a8b to 09aa705 Compare October 14, 2025 01:22
@timoffex timoffex force-pushed the timoffex/10-01-tensorboard_checksum_retry branch from 09aa705 to 5fc034e Compare October 14, 2025 01:36
@timoffex timoffex merged commit 476cec8 into main Oct 14, 2025
32 checks passed
@timoffex timoffex deleted the timoffex/10-01-tensorboard_checksum_retry branch October 14, 2025 01:46