Skip to content

roachtest: tpcc/nodes=3/w=max failed with foreign key violation #35812

@nvb

Description

@nvb

Pulled from #35337 (comment).

SHA:

https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Failed test:

https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	cluster.go:1251,tpcc.go:126,cluster.go:1589,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176948-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 243.7          253.9   3489.7  10737.4  28991.0  49392.1 newOrder
		   43m5s        0           25.0           25.6   1208.0   3221.2   3221.2   3221.2 orderStatus
		   43m5s        0          314.6          255.0   3623.9  27917.3  77309.4 103079.2 payment
		   43m5s        0           23.0           25.5   2080.4   3489.7  11274.3  11274.3 stockLevel
		   43m6s        0           15.0           25.5   4563.4  17179.9  34359.7  34359.7 delivery
		   43m6s        0          248.3          253.9   3221.2  13421.8  47244.6 103079.2 newOrder
		   43m6s        0           23.0           25.6    805.3   2952.8   2952.8   2952.8 orderStatus
		   43m6s        0          180.2          255.0   2818.6  15032.4  40802.2 103079.2 payment
		   43m6s        0           22.0           25.5    771.8   3087.0   3087.0   3087.0 stockLevel
		Error: error in newOrder: ERROR: foreign key violation: value [914 3 3050] not found in order@primary [o_w_id o_d_id o_id] (txn=f995c83d-a90d-4856-9b61-72713ca22db5) (SQLSTATE 23503)
		Error:  exit status 1
		: exit status 1
	cluster.go:1610,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

Artifacts:

https://drive.google.com/open?id=1bQWTo6DOlNj8ie1cFepZdMNMirRANMq8

A few theories we can immediately ignore:

this was a very long running transaction

We see from the workload logs a cumulative ops/sec for new_order transactions of 253.9. The load ran for 43m6s, so we expect to have performed about 253.9*(43*60+5) = 656,332 new order transactions.

We see from the error that this was order 3050 in warehouse 914 and district 3. Each district begins the workload with 3000 orders. That means that this was the 50th new_order transaction performed for this warehouse/district.

The test ran with 1350 warehouses, which means it ran with 13500 unique districts. Given a uniform distribution of new_order transactions, we expect the 50th order for a given district to take place after 1350*10*49 = 661,500 completed new_order transactions. Since this is close to our estimate for the total number of new_order transactions performed, we can conclude that the victim transaction began very recently.

the transaction was aborted and its abort span was already GCed

The abort span is not removed until an hour after the transaction is last active:

// TxnCleanupThreshold is the threshold after which a transaction is
// considered abandoned and fit for removal, as measured by the
// maximum of its last heartbeat and timestamp. Abort spans for the
// transaction are cleaned up at the same time.
//
// TODO(tschottdorf): need to enforce at all times that this is much
// larger than the heartbeat interval used by the coordinator.
const TxnCleanupThreshold = time.Hour

The load was only running for 43 minutes, so no abort spans should have been GCed.

Metadata

Metadata

Assignees

Labels

S-1-stabilitySevere stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions