Pulled from #35337 (comment).
SHA:
https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e
Failed test:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
The test failed on master:
cluster.go:1251,tpcc.go:126,cluster.go:1589,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176948-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
stderr:
stdout:
243.7 253.9 3489.7 10737.4 28991.0 49392.1 newOrder
43m5s 0 25.0 25.6 1208.0 3221.2 3221.2 3221.2 orderStatus
43m5s 0 314.6 255.0 3623.9 27917.3 77309.4 103079.2 payment
43m5s 0 23.0 25.5 2080.4 3489.7 11274.3 11274.3 stockLevel
43m6s 0 15.0 25.5 4563.4 17179.9 34359.7 34359.7 delivery
43m6s 0 248.3 253.9 3221.2 13421.8 47244.6 103079.2 newOrder
43m6s 0 23.0 25.6 805.3 2952.8 2952.8 2952.8 orderStatus
43m6s 0 180.2 255.0 2818.6 15032.4 40802.2 103079.2 payment
43m6s 0 22.0 25.5 771.8 3087.0 3087.0 3087.0 stockLevel
Error: error in newOrder: ERROR: foreign key violation: value [914 3 3050] not found in order@primary [o_w_id o_d_id o_id] (txn=f995c83d-a90d-4856-9b61-72713ca22db5) (SQLSTATE 23503)
Error: exit status 1
: exit status 1
cluster.go:1610,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called
Artifacts:
https://drive.google.com/open?id=1bQWTo6DOlNj8ie1cFepZdMNMirRANMq8
A few theories we can immediately ignore:
this was a very long running transaction
We see from the workload logs a cumulative ops/sec for new_order transactions of 253.9. The load ran for 43m6s, so we expect to have performed about 253.9*(43*60+5) = 656,332 new order transactions.
We see from the error that this was order 3050 in warehouse 914 and district 3. Each district begins the workload with 3000 orders. That means that this was the 50th new_order transaction performed for this warehouse/district.
The test ran with 1350 warehouses, which means it ran with 13500 unique districts. Given a uniform distribution of new_order transactions, we expect the 50th order for a given district to take place after 1350*10*49 = 661,500 completed new_order transactions. Since this is close to our estimate for the total number of new_order transactions performed, we can conclude that the victim transaction began very recently.
the transaction was aborted and its abort span was already GCed
The abort span is not removed until an hour after the transaction is last active:
|
// TxnCleanupThreshold is the threshold after which a transaction is |
|
// considered abandoned and fit for removal, as measured by the |
|
// maximum of its last heartbeat and timestamp. Abort spans for the |
|
// transaction are cleaned up at the same time. |
|
// |
|
// TODO(tschottdorf): need to enforce at all times that this is much |
|
// larger than the heartbeat interval used by the coordinator. |
|
const TxnCleanupThreshold = time.Hour |
The load was only running for 43 minutes, so no abort spans should have been GCed.
Pulled from #35337 (comment).
SHA:
https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e
Failed test:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
Artifacts:
https://drive.google.com/open?id=1bQWTo6DOlNj8ie1cFepZdMNMirRANMq8
A few theories we can immediately ignore:
this was a very long running transaction
We see from the workload logs a cumulative ops/sec for
new_ordertransactions of 253.9. The load ran for43m6s, so we expect to have performed about253.9*(43*60+5) = 656,332new order transactions.We see from the error that this was order 3050 in warehouse 914 and district 3. Each district begins the workload with 3000 orders. That means that this was the 50th
new_ordertransaction performed for this warehouse/district.The test ran with 1350 warehouses, which means it ran with 13500 unique districts. Given a uniform distribution of
new_ordertransactions, we expect the 50th order for a given district to take place after1350*10*49 = 661,500completednew_ordertransactions. Since this is close to our estimate for the total number ofnew_ordertransactions performed, we can conclude that the victim transaction began very recently.the transaction was aborted and its abort span was already GCed
The abort span is not removed until an hour after the transaction is last active:
cockroach/pkg/storage/storagebase/base.go
Lines 37 to 44 in 2695531
The load was only running for 43 minutes, so no abort spans should have been GCed.