Progressively degrading performance while running block writer example #4196
Description
I started up the block writer example this weekend to see how performance has improved since I last ran it intensively (several months ago). What I discovered was a startling progressive performance degradation:
2016/02/07 03:19:37 88 dumps were executed at 88.0/second (0 total errors)
2016/02/07 03:19:38 905 dumps were executed at 905.0/second (0 total errors)
2016/02/07 03:19:39 663 dumps were executed at 662.2/second (0 total errors)
2016/02/07 03:19:40 548 dumps were executed at 548.6/second (0 total errors)
2016/02/07 03:19:41 453 dumps were executed at 451.3/second (0 total errors)
2016/02/07 03:19:42 411 dumps were executed at 412.1/second (0 total errors)
2016/02/07 03:19:43 397 dumps were executed at 396.2/second (0 total errors)
2016/02/07 03:19:44 377 dumps were executed at 377.7/second (0 total errors)
2016/02/07 03:19:45 327 dumps were executed at 326.9/second (0 total errors)
2016/02/07 03:19:46 321 dumps were executed at 320.7/second (0 total errors)
2016/02/07 03:19:47 314 dumps were executed at 314.2/second (0 total errors)
2016/02/07 03:19:48 280 dumps were executed at 279.2/second (0 total errors)
2016/02/07 03:19:49 271 dumps were executed at 272.3/second (0 total errors)
2016/02/07 03:19:50 261 dumps were executed at 261.0/second (0 total errors)
2016/02/07 03:19:51 246 dumps were executed at 244.8/second (0 total errors)
2016/02/07 03:19:52 235 dumps were executed at 236.1/second (0 total errors)
2016/02/07 03:19:53 222 dumps were executed at 221.5/second (0 total errors)
2016/02/07 03:19:54 211 dumps were executed at 211.3/second (0 total errors)
2016/02/07 03:19:55 199 dumps were executed at 199.1/second (0 total errors)
2016/02/07 03:19:56 207 dumps were executed at 207.0/second (0 total errors)
2016/02/07 03:19:57 203 dumps were executed at 203.0/second (0 total errors)
2016/02/07 03:19:58 192 dumps were executed at 192.0/second (0 total errors)
After the example starts running, performance degrades quickly and does not recover on its own; it eventually seems to level off at around 30 dumps (inserts) per second, roughly a 97% reduction from the peak of about 900.
Notably, restarting the block writer causes this phenomenon to reset: it returns to its peak around 900 inserts/s, then degrades in the same fashion.
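To put a number on the drop, the log lines above can be parsed and compared against their peak. The following is only a minimal sketch in Go: the helper names `parseRates` and `reductionFromPeak` are made up for illustration and are not part of the block writer, and the regular expression assumes the exact log format shown above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// rateRE matches the per-second rate in a block writer log line such as
// "2016/02/07 03:19:38 905 dumps were executed at 905.0/second (0 total errors)".
var rateRE = regexp.MustCompile(`executed at ([0-9]+(?:\.[0-9]+)?)/second`)

// parseRates (hypothetical helper) extracts the reported rate from every
// line that matches the format above; other lines are skipped.
func parseRates(lines []string) []float64 {
	var rates []float64
	for _, line := range lines {
		m := rateRE.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		r, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		rates = append(rates, r)
	}
	return rates
}

// reductionFromPeak (hypothetical helper) returns the highest rate seen and
// the percentage drop from that peak to the most recent rate.
func reductionFromPeak(rates []float64) (peak, pct float64) {
	for _, r := range rates {
		if r > peak {
			peak = r
		}
	}
	if peak == 0 {
		return 0, 0
	}
	last := rates[len(rates)-1]
	return peak, 100 * (peak - last) / peak
}

func main() {
	// Two lines copied from the log above: the peak and the last sample shown.
	lines := []string{
		"2016/02/07 03:19:38 905 dumps were executed at 905.0/second (0 total errors)",
		"2016/02/07 03:19:58 192 dumps were executed at 192.0/second (0 total errors)",
	}
	rates := parseRates(lines)
	peak, pct := reductionFromPeak(rates)
	fmt.Printf("peak %.1f/s, last %.1f/s, reduction %.1f%%\n", peak, rates[len(rates)-1], pct)
}
```

The same arithmetic applied to the figures quoted above, a peak of about 900 inserts/s falling to about 30, gives (900 - 30) / 900, or approximately 97%.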
Experiments
I ran this tool in a number of experimental configurations trying to narrow down the problem:
- I ran the test with six nodes, three nodes, and finally on a single node cluster. The problem is present even on a single node, which eliminated replication traffic as the culprit.
- In order to see if the block writer itself was at fault, I reconfigured the example to run against postgres; postgres had no degradation and write load was very steady (at ~3700 inserts/s). Thus, the block writer itself was not the problem.
- Because restarting the block writer "reset" the problem, my first theory was that some bug in the SQL session itself caused it to slow down over time; thus, I modified the block writer to occasionally force-close all SQL connections. However, the performance profile persisted even with new connections.
- My next theory was that the performance was related to the size of the table: block writer deletes the existing block table whenever it starts up. I disabled this, and found my first real clue; if the table is not deleted, the performance remains degraded even after restarting the block writer.
- Next, I wanted to see if this problem persisted even if CockroachDB was restarted. I stopped the single cockroach node and restarted it, preserving the existing tables. This reset the performance degradation.
- Finally, I tried changing the table being written to by the block writer; the performance problem resets every time a new table is selected, but returns when switching back to a table which was already degraded. The degradation appears to affect each table independently.
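The table-switching experiment above could be reproduced with a small harness along these lines. This is a sketch under stated assumptions, not the block writer's actual code: the `execer` interface, the `insertRate` and `compareTables` helpers, the single `payload` column, and the table names are all hypothetical. The `stubDB` type only stands in for a real connection so the example runs without a database; a `*sql.DB` obtained from `sql.Open` with a suitable driver satisfies the same interface.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// execer is the subset of *sql.DB used here, so the loop can be driven by a
// real connection pool or by a stub.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// insertRate performs n inserts into table and returns inserts per second.
// The single "payload" column is an assumption for illustration. The table
// name is interpolated directly, so it must come from a trusted source.
func insertRate(db execer, table string, n int, payload []byte) (float64, error) {
	query := fmt.Sprintf("INSERT INTO %s (payload) VALUES ($1)", table)
	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := db.Exec(query, payload); err != nil {
			return 0, err
		}
	}
	elapsed := time.Since(start).Seconds()
	if elapsed <= 0 {
		elapsed = 1e-9 // guard against coarse clocks reporting zero
	}
	return float64(n) / elapsed, nil
}

// compareTables alternates between the given tables for several rounds and
// records the insert rate of each round per table. If degradation is tracked
// per table, each table's series should decline independently of the others.
func compareTables(db execer, tables []string, rounds, n int) (map[string][]float64, error) {
	rates := make(map[string][]float64)
	payload := []byte("example block payload")
	for r := 0; r < rounds; r++ {
		for _, t := range tables {
			rate, err := insertRate(db, t, n, payload)
			if err != nil {
				return nil, err
			}
			rates[t] = append(rates[t], rate)
		}
	}
	return rates, nil
}

// stubDB is a stand-in that counts calls instead of talking to a database.
type stubDB struct{ calls int }

func (s *stubDB) Exec(query string, args ...interface{}) (sql.Result, error) {
	s.calls++
	return nil, nil
}

func main() {
	db := &stubDB{} // replace with a real *sql.DB to run against a cluster
	rates, err := compareTables(db, []string{"blocks_a", "blocks_b"}, 3, 100)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for table, series := range rates {
		fmt.Printf("%s: %d rounds measured\n", table, len(series))
	}
}
```

Interleaving the tables within each round, rather than finishing one table before starting the next, keeps both series exposed to the same node uptime, which helps separate per-table effects from anything tied to how long the node has been running.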
Summary of the Issue
I believe I have narrowed down the scope of the issue: successive inserts to the same table on the same running CockroachDB node progressively degrade in performance. The degradation affects each table independently and is reset if CockroachDB is restarted.
Unfortunately, I am having quite a lot of trouble profiling CockroachDB using pprof, which is probably a separate issue; therefore, I am throwing this issue to the team for now, hoping that someone has an idea of what could be causing it.