[Bug]: Possible data loss in BigtableIO r/w if timestamp not set (default to epoch) #27022
Open
Description
What happened?
Reported from GoogleCloudPlatform/DataflowTemplates#759
When implementing a load test for BigtableIO, we encountered the following:
- Load tests up to 200 MB pass reliably.
- Beyond roughly 5 million records, not all data reaches Bigtable, even though the pipeline logs indicate that all data was written.

The Dataflow write pipeline logs report that 10M records were written, but the read job shows only 1.6M records read. Counting rows with the `cbt` utility (`cbt -instance <instance> count <table>`) confirmed that the BigtableIO write did not work correctly: despite the logs claiming that all 10M records were written, the table contained exactly as many rows as the read pipeline processed (1.6M). Some of the records processed by the write pipeline never reached the table.

- Dataflow write pipeline job: 2023-06-05_03_51_23-9051905355392445711
- Dataflow read pipeline job: 2023-06-05_03_58_18-7016807525741705033
project: apache-beam-testing
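A plausible mechanism for the loss, as the issue title suggests, is that cells written without an explicit timestamp default to the Unix epoch (`timestamp_micros = 0`). Any age-based garbage-collection policy on the column family then sees those cells as roughly 53 years old and deletes them, silently discarding data the write pipeline reported as committed. A minimal sketch of the arithmetic (the helper names here are illustrative, not Bigtable API):

```python
import time

# Default timestamp_micros when the writer never sets one: the Unix epoch.
EPOCH_MICROS = 0

THIRTY_DAYS_S = 30 * 24 * 3600


def cell_age_seconds(timestamp_micros, now_s=None):
    """Age of a cell, as an age-based GC policy would compute it."""
    if now_s is None:
        now_s = time.time()
    return now_s - timestamp_micros / 1_000_000


def survives_max_age(timestamp_micros, max_age_s):
    """Would an age-based GC rule (max age = max_age_s) keep this cell?"""
    return cell_age_seconds(timestamp_micros) <= max_age_s


# A cell stamped "now" survives a 30-day max-age rule...
assert survives_max_age(int(time.time() * 1_000_000), THIRTY_DAYS_S)

# ...but a cell left at the default epoch timestamp is decades old
# by the policy's reckoning and is eligible for collection.
assert not survives_max_age(EPOCH_MICROS, THIRTY_DAYS_S)
```

If this is the cause, either the connector should fail loudly (or assign a server-side timestamp) when no timestamp is set, or the documentation should warn that epoch-stamped cells interact badly with age-based GC.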
Issue Priority
Priority: 1 (data loss / total loss of function)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner