[Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch) #27022

@Abacn

Description

What happened?

Reported from GoogleCloudPlatform/DataflowTemplates#759

While implementing a load test for BigtableIO, we encountered the following:

  • Load tests with up to 200 MB of data pass stably.
  • Beyond 5 million records, not all of the data reaches Bigtable, even though the pipeline logs indicate that all data was written.

The Dataflow write pipeline logs say that 10M records were written, but the read job shows only 1.6M records read.

Counting rows with the cbt utility's count command (cbt -instance <instance> count <table>) confirmed that the BigtableIO write did not work correctly. Although the logs say that all 10M records were written, the table in fact contained exactly as many records as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never made it into the table.

  • Dataflow write pipeline logs - 2023-06-05_03_51_23-9051905355392445711
  • Dataflow read pipeline logs - 2023-06-05_03_58_18-7016807525741705033

project: apache-beam-testing
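The symptom above (writes acknowledged, rows missing) is consistent with the cause named in the title: when no timestamp is set, cells default to the epoch. In Bigtable a cell version is keyed by (row key, column family, qualifier, timestamp), so two writes to the same cell coordinates with an identical timestamp leave only one stored version; an epoch timestamp can also make cells immediately eligible for age-based garbage collection. A minimal toy model of the versioning behavior (plain Python, not the Bigtable client, for illustration only):

```python
from collections import defaultdict

class ToyBigtable:
    """Toy model of Bigtable cell versioning (illustration only).

    A cell version is keyed by (row_key, family, qualifier, timestamp):
    writing twice with the same four-tuple keeps only the last value.
    """

    def __init__(self):
        # (row, family, qualifier) -> {timestamp_micros: value}
        self._cells = defaultdict(dict)

    def set_cell(self, row, family, qualifier, value, timestamp_micros=0):
        # timestamp_micros=0 mimics an unset timestamp defaulting to epoch
        self._cells[(row, family, qualifier)][timestamp_micros] = value

    def count_versions(self, row, family, qualifier):
        return len(self._cells[(row, family, qualifier)])

table = ToyBigtable()
# Two writes with the default (epoch) timestamp: only one version survives.
table.set_cell("row1", "cf", "q", b"v1")
table.set_cell("row1", "cf", "q", b"v2")
# Two writes with distinct explicit timestamps: both versions are kept.
table.set_cell("row2", "cf", "q", b"v1", timestamp_micros=1_000)
table.set_cell("row2", "cf", "q", b"v2", timestamp_micros=2_000)
print(table.count_versions("row1", "cf", "q"))  # 1
print(table.count_versions("row2", "cf", "q"))  # 2
```

This suggests the mitigation implied by the title: have the connector set a real (e.g. server-assigned or event-time) timestamp instead of defaulting to the epoch.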

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
