storage: GC queue busy loop after canceled import #20159
Closed
Labels
S-1-stability: Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting
Description
My laptop started burning and I investigated.
CockroachDB CCL v1.2-alpha.20171113-492-ga093de3cc-dirty (darwin amd64, built 2017/11/18 03:55:36, go1.9)
I found that the GC queue was processing replicas en masse without ever deleting anything:
2017/11/20 11:51:42.877287 | 1.326651 | gc
-- | -- | --
11:51:42.877410 | . 123 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] processing replica with score queue=true with 214.74/fuzz(1.21)=177.49=valScaleScore(177.49)*deadFrac(1.00)+intentScore(0.00) likely last GC: 2m11.298931528s ago, 35 MiB non-live, curr. age 531 TiB*s, min exp. reduction: 528 TiB*s
11:51:42.877769 | . 359 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] iterating through range
11:51:44.199698 | 1.321885 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] pushing up to 0 transactions (concurrency 25)
11:51:44.199703 | . 5 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] resolving up to 0 intents
11:51:44.199705 | . 2 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] processing AbortSpan
11:51:44.202242 | . 2537 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] MVCC stats: {ContainsEstimates:false LastUpdateNanos:1511192683709528696 IntentAge:0 GCBytesAge:583883251303911 LiveBytes:0 LiveCount:0 KeyBytes:9267723 KeyCount:250479 ValBytes:27292481 ValCount:500958 IntentBytes:0 IntentCount:0 SysBytes:529 SysCount:9}
11:51:44.202255 | . 14 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] assembled GC keys, now proceeding to GC; stats so far {Now:1511196702.877305466,0 Policy:{TTLSeconds:90000} NumKeysAffected:0 IntentsConsidered:0 IntentTxns:0 TransactionSpanTotal:0 TransactionSpanGCAborted:0 TransactionSpanGCCommitted:0 TransactionSpanGCPending:0 TxnSpanGCThreshold:1511193102.877305466,0 AbortSpanTotal:0 AbortSpanConsidered:0 AbortSpanGCNum:0 PushTxn:0 ResolveTotal:0 ResolveSuccess:0 Threshold:1511106702.877305466,0}
11:51:44.202260 | . 4 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] sending batch 1 of 1
11:51:44.202268 | . 8 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] read-write path
11:51:44.202275 | . 7 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] command queue
11:51:44.202331 | . 57 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] applied timestamp cache
11:51:44.202387 | . 56 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] done with GC evaluation for 0 keys at 0.00 keys/sec. Deleted 0 versions
11:51:44.202452 | . 65 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] evaluated request
11:51:44.202462 | . 10 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] acquired {raft,replica}mu
11:51:44.203489 | . 1027 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] applying command
11:51:44.203974 | . 484 | ... [gc,n1,s1,r138/1:/Table/53/1/235441{582…-690…}] done GC'ing, new score is queue=true with 214.74/fuzz(1.21)=177.49=valScaleScore(177.49)*deadFrac(1.00)+intentScore(0.00) likely last GC: 1.3265734s ago, 35 MiB non-live, curr. age 531 TiB*s, min exp. reduction: 528 TiB*s
The replicas were created by a canceled import:
root@:26257/> select * from crdb_internal.ranges where range_id=138;
+----------+------------------------------------+--------------------------------+---------------------------------------+--------------------------------+----------+-------+-------+----------+--------------+
| range_id | start_key | start_pretty | end_key | end_pretty | database | table | index | replicas | lease_holder |
+----------+------------------------------------+--------------------------------+---------------------------------------+--------------------------------+----------+-------+-------+----------+--------------+
| 138 | "\xbd\x89\xfd\x03Dt\xdfN=\x80\x01" | /Table/53/1/235441582969290753 | "\xbd\x89\xfd\x03Dt\xf8S\x01\x00\x01" | /Table/53/1/235441690423394305 | | | | {1} | 1 |
+----------+------------------------------------+--------------------------------+---------------------------------------+--------------------------------+----------+-------+-------+----------+--------------+
(1 row)
Time: 27.878744ms
root@:26257/>
root@:26257/> show jobs;
+--------------------+---------+--------------------------------------------------------------------------------+----------+----------+---------------------------------+---------------------------------+----------------------------------+----------------------------------+--------------------+-------+----------------+
| id | type | description | username | status | created | started | finished | modified | fraction_completed | error | coordinator_id |
+--------------------+---------+--------------------------------------------------------------------------------+----------+----------+---------------------------------+---------------------------------+----------------------------------+----------------------------------+--------------------+-------+----------------+
| 297792826687520769 | RESTORE | RESTORE tpch.lineitem FROM 'azure://backup-test/benchmarks/tpch/scalefactor-5' | root | canceled | 2017-11-17 20:12:38.53302+00:00 | 2017-11-17 20:12:38.54475+00:00 | 2017-11-17 21:57:28.539675+00:00 | 2017-11-17 21:57:28.539677+00:00 | 0.7369297742843628 | | 1 |
+--------------------+---------+--------------------------------------------------------------------------------+----------+----------+---------------------------------+---------------------------------+----------------------------------+----------------------------------+--------------------+-------+----------------+
(1 row)
Time: 2.415237ms
Either there aren't actually any keys in that range (so the stats are just wrong), or the keys are there but at unusual timestamps. The latter seems unlikely: since the GC trigger is firing, a successful run should roughly halve GCBytesAge, and that isn't happening. So in all likelihood the stats are wrong. I'll take a look.
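To illustrate the busy loop, here is a minimal sketch of the scoring formula as it appears in the trace output (score = valScaleScore * deadFrac + intentScore). The type and function names are hypothetical, mirroring the trace fields rather than the actual gcQueue code; the point is that with LiveBytes:0 and nonzero Key/ValBytes, deadFrac stays pinned at 1.00, so the replica keeps requeueing even though each GC run deletes nothing:

```go
package main

import "fmt"

// mvccStats: hypothetical subset of the MVCC stats printed in the trace.
type mvccStats struct {
	LiveBytes int64 // bytes of live (visible) data
	KeyBytes  int64 // total key bytes, live and dead
	ValBytes  int64 // total value bytes, live and dead
}

// deadFrac returns the fraction of the range's data that is non-live.
// With the stats above (LiveBytes:0, ~35 MiB of key/value bytes) this is
// exactly 1.00, matching "deadFrac(1.00)" in the trace.
func deadFrac(s mvccStats) float64 {
	total := s.KeyBytes + s.ValBytes
	if total == 0 {
		return 0
	}
	return float64(total-s.LiveBytes) / float64(total)
}

func main() {
	// Values taken from the r138 trace above.
	s := mvccStats{LiveBytes: 0, KeyBytes: 9267723, ValBytes: 27292481}
	fmt.Printf("deadFrac = %.2f\n", deadFrac(s)) // deadFrac = 1.00
	// Since GC deletes 0 versions, these stats never change after a run,
	// the score never drops below the queueing threshold, and the queue
	// immediately reprocesses the replica: a busy loop.
}
```

If the keys genuinely aren't there, the bogus Key/ValBytes (and the 531 TiB*s GCBytesAge) would explain both the high score and the zero deletions.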
Ping #20052