Relevant system information:
- OS: Debian bullseye
- PostgreSQL version: 12.8
- TimescaleDB version: 2.4.1
- Installation method: apt install
Describe the bug
pg_dump was running and currently dumping a completely different table. This apparently holds a AccessSharedLock on all tables that it will dump in the future (see --jobs in the docs), including _timescaledb_catalog.continuous_aggs_invalidation_threshold.
Then, the continuous refresh policy started running, which tries to acquire a AccessExclusiveLock on continuous_aggs_invalidation_threshold, so that started waiting on pg_dump finishing. Why does this need an AccessExclusiveLock?
Then, other queries that tried to write to our main table (NOT an aggregate or anything) were blocked by the waiting continuous refresh policy, since interactions with the main hypertable require an AccessSharedLock on the invalidation threshold thing, and I guess postgresql doesn't reorder lock requests ever, even though two AccessSharedLocks could exist at the same time. This started piling up until all site operation was basically stopped with messages like:
"process 3620065 still waiting for AccessExclusiveLock on relation 2241176 of database 16401 after 1000.130 ms","Process holding the lock: 3594548. Wait queue: 3620065, 3618703, 3619894, 3619893, 3619807, 3619867, 3619759, 3620066, 3620067, 3620068, 3620069."
This took our whole site down yesterday since inserts in a timscaledb table failed, which is pretty surprising since I assumed anything about the continuous aggregate would not affect anything else (nothing was interacting with the aggregate in any way).
To Reproduce
I didn't test this, but unless I missed something this should reproduce it:
- have one hypertable
t and one continuous aggregate a on that table
- start a transaction and
LOCK _timescaledb_catalog.continuous_aggs_invalidation_threshold IN ACCESS SHARE MODE (this emulates the long-running pg_dump process)
- start a continuous aggregate update on
a, this should block on (1) by requiring an AccessExclusiveLock
- run any query on
t, this should block on (2) by requiring an AccessShareLock
Expected behavior
The contiuous aggregate update should either not require an AccessExclusiveLock or fail as soon as possible if this situation happens (e.g. set lock_timeout on aquiring that lock)
In general, I'd think an unused continuous aggregate shouldn't affect operations on the main table in any way.
I'd also expect some documentation about those "hidden" tables and how exactly they interact with each other. Can't find anything about this locking behaviour in the docs, only this issue #2308 which may in fact be related.
Actual behavior
Made our whole site cease functioning for an hour+ while pg_dump was running.
Relevant system information:
Describe the bug
pg_dump was running and currently dumping a completely different table. This apparently holds a AccessSharedLock on all tables that it will dump in the future (see
--jobsin the docs), including_timescaledb_catalog.continuous_aggs_invalidation_threshold.Then, the continuous refresh policy started running, which tries to acquire a AccessExclusiveLock on
continuous_aggs_invalidation_threshold, so that started waiting on pg_dump finishing. Why does this need an AccessExclusiveLock?Then, other queries that tried to write to our main table (NOT an aggregate or anything) were blocked by the waiting continuous refresh policy, since interactions with the main hypertable require an AccessSharedLock on the invalidation threshold thing, and I guess postgresql doesn't reorder lock requests ever, even though two AccessSharedLocks could exist at the same time. This started piling up until all site operation was basically stopped with messages like:
"process 3620065 still waiting for AccessExclusiveLock on relation 2241176 of database 16401 after 1000.130 ms","Process holding the lock: 3594548. Wait queue: 3620065, 3618703, 3619894, 3619893, 3619807, 3619867, 3619759, 3620066, 3620067, 3620068, 3620069."
This took our whole site down yesterday since inserts in a timscaledb table failed, which is pretty surprising since I assumed anything about the continuous aggregate would not affect anything else (nothing was interacting with the aggregate in any way).
To Reproduce
I didn't test this, but unless I missed something this should reproduce it:
tand one continuous aggregateaon that tableLOCK _timescaledb_catalog.continuous_aggs_invalidation_threshold IN ACCESS SHARE MODE(this emulates the long-running pg_dump process)a, this should block on (1) by requiring an AccessExclusiveLockt, this should block on (2) by requiring an AccessShareLockExpected behavior
The contiuous aggregate update should either not require an AccessExclusiveLock or fail as soon as possible if this situation happens (e.g. set lock_timeout on aquiring that lock)
In general, I'd think an unused continuous aggregate shouldn't affect operations on the main table in any way.
I'd also expect some documentation about those "hidden" tables and how exactly they interact with each other. Can't find anything about this locking behaviour in the docs, only this issue #2308 which may in fact be related.
Actual behavior
Made our whole site cease functioning for an hour+ while pg_dump was running.