postgres: Make the initial type map query less expensive#40876
postgres: Make the initial type map query less expensive#40876rafaelfranca merged 1 commit intorails:masterfrom
Conversation
When the pg_type table is large, the initial type map query is very expensive. This is due to two reasons: 1. The complete query does not allow use of any indexes. 2. The complete query returns much more rows than we need Point 1 is fixed by splitting up each OR into its own query. For reasons unknown to me, the query plan is slower with OR instead of separate queries. Point 2 is a problem when databases have lots of tables. This is because for each table, postgres creates both a type for each record, and a type for an array of records. The latter types are matched by the condition `t.typelem != 0`. There is no easy way to filter out these types, but since we already make a new query due to point 1, we can change our query to bring us exactly the data we need: all the types that reference (via `typelem`) a type we already know about. This way we greatly reduce the number of rows returned. Additionally, we remove the condition `t.typinput = 'array_in(cstring,oid,integer)'::regprocedure`, since it brings no value. Array types have a nonzero `typelem`. This change brings a type map initialization performance improvement. A production system with aprox 1.6MM rows in `pg_type`, improves from 2.5-4.1 seconds to 1.4-1.6 seconds.
|
I'm sorry to see that. I have pending to submit patch to (optionally) disable type map initialization. It turns out it's not strictly necessary, and (as seen in the linked discussion), can be quite costly in databases using def query_conditions_for_known_type_types
known_type_types = %w('r' 'e' 'd')
<<~SQL % known_type_types.join(", ")
WHERE
1 = 0
SQL
end
def query_conditions_for_array_types
known_type_oids = @store.keys.reject { |k| k.is_a?(String) }
<<~SQL % [known_type_oids.join(", ")]
WHERE
1 = 0
SQL
end |
|
Instead of the query ending up as: if it could be this then it might run quite a bit faster: |
|
FIY: As suggested by an user on stack overflow, upgrading to PostgreSQL 14 did the trick: After the upgrade the execution time of this query decreased from 9656.222 ms to 60.618 ms |
|
This change also makes our app query extremely slower, we're also on AWS RDS Aurora We haven't had the chance to upgrade Postgres to 14 yet. |
|
@jorgearimitsu - would you like to try the version that (for me at least) works faster under Postgres < 14? The general idea is in the post above from October 27 :) |
@fsateler I have an idea on how to optimize this as a single query, which we're interested in to reduce the number of query roundtrips at boot. Would you be willing to help verify performance of an updated query if I provided that? BTW I'm the author of the "use hashtable for IN () queries" performance improvement in Postgres, so it was interesting to see that show up here. |
|
Hi @jcoleman, I would be glad to help. |
|
@fsateler Here's a branch with changes; would love your feedback prior to my opening a PR. It looks like you never ended up adding an option to disable loading array types and types by type, but maybe I'm missing something. Either way, my patch should make array loading more performant (by turning that into an index scan) and retain the ability to disable loading types by type. If you disable that by overriding |
|
Hi @jcoleman ! Sorry for the delay, busy weeks. I scheduled some time for wednesday to check it out. |
|
Hi! I have news to report. On a database with 4_986_377 elements in An the resulting type map has 140 elements. After applying the patch, rails generates: But the resulting type map has only 116 elements 🤔 Overriding And a type_map size of 116 as well 🤔 . This is on Note: I couldn't get to run with a rails fork, so I monkeypatched according to the linked branch. In the interest of verifiability, here is my monkey patch: Monkey PatchActiveSupport::Reloader.to_prepare do
class ActiveRecord::ConnectionAdapters::PostgreSQL::OID::TypeMapInitializer
def query_conditions_for_known_type_names
known_type_names = @store.keys.map { |n| "'#{n}'" }
<<~SQL % known_type_names.join(", ")
t.typname IN (%s)
SQL
end
def query_conditions_for_known_type_types
known_type_types = %w('r' 'e' 'd')
<<~SQL % known_type_types.join(", ")
1 = 0
SQL
end
def query_conditions_for_array_types
known_type_oids = @store.keys.reject { |k| k.is_a?(String) }
<<~SQL % [known_type_oids.join(", ")]
t.typtype IN (%s)
SQL
end
end
class ActiveRecord::ConnectionAdapters::PostgreSQLAdapter
def load_additional_types(oids = nil)
initializer = OID::TypeMapInitializer.new(type_map)
load_types_queries(initializer, oids) do |query|
execute_and_clear(query, "SCHEMA", []) do |records|
initializer.run(records)
end
end
end
def pg_types_scan_sql(where_conditions_sql, select_typarray: false)
<<~SQL % where_conditions_sql
SELECT t.oid, t.typname, t.typelem, t.typdelim, t.typinput, t.typtype, t.typbasetype, t.typarray
FROM pg_type as t
WHERE %s
SQL
end
def load_types_queries(initializer, oids)
outer_query = <<~SQL
SELECT types.*, r.rngsubtype
FROM (
%s
) types
LEFT JOIN pg_range as r ON oid = rngtypid
SQL
if oids
yield outer_query % pg_types_scan_sql("t.oid IN (%s)" % oids.join(", "))
else
# We want to keep this as a single query to prevent unnecessary
# round trips to the database when a connection starts up. We also
# want to keep the query performing optimally when there are lots
# of entries in pg_types.
#
# This was previously split into multiple queries because Postgres
# doesn't optimize OR clauses well, so if we just OR the conditions
# together we lose the index scan on `pg_types.typname`. However if we have to
# scan on `pg_types.typtype` we're going to have to use a seq scan
# regardless, and we might as well do a single scan to find all of
# the entries. Some people want to
# avoid the seq scan entirely, so we retain the extraction of
# `TypeMapInitializer#query_conditions_for_known_type_types` so that
# that portion can be disabled by overriding it to return `"1=0"`.
#
# We're able to fold in the array types query in a performant way
# in the single query by using a CTE -- instead of doing a seq scan
# on `pg_types.typelem` we work the opposite direction and can look
# up entries using the `oid` index based on `typarray` values from
# the rows we're otherwise returning.
pg_types_conditions_sql = [
initializer.query_conditions_for_known_type_types,
initializer.query_conditions_for_known_type_names,
].join " OR "
inner_query = <<~SQL
WITH non_array_types AS (
#{pg_types_scan_sql(pg_types_conditions_sql, select_typarray: true)}
)
(
SELECT oid, typname, typelem, typdelim, typinput, typtype, typbasetype
FROM non_array_types
UNION
SELECT oid, typname, typelem, typdelim, typinput, typtype, typbasetype
FROM pg_type
WHERE oid = ANY ((SELECT array_agg(nat.typarray) FROM non_array_types nat)::oid[])
)
SQL
yield outer_query % inner_query
Rails.logger.info "type_map_size: #{type_map.instance_variable_get(:@mapping).size}"
end
end
end
end |
|
@fsateler Thanks for testing! Would you be able to run the following query on that database to find what rows are being left out? On one PG database of ours that yields: but our You could also find the differences between the type maps and query on those OIDs. The difference here I think is going to boil down to the fact that |
Does this validate your hipothesis? |
|
Those are the only two I see also — and ActiveRecord shouldn’t need to see those ones. Hmm. Are you able to introspect the schema cache itself and see what oid values are missing (and then query pg_type to see what types those represent)? |
Summary
When the pg_type table is large, the initial type map query is very expensive.
This is due to two reasons:
Point 1 is fixed by splitting up each OR into its own query. For reasons
unknown to me, the query plan is slower with OR instead of separate
queries. My suspicion is that reducing the number of checks to do on a full table scan is faster, because other checks are done with an index.
Point 2 is a problem when databases have lots of tables. This is because
for each table, postgres creates both a type for each record, and a type
for an array of records. The latter types are matched by the condition
t.typelem != 0. There is no easy way to filter out these types, butsince we already make a new query due to point 1, we can change our query
to bring us exactly the data we need: all the types that reference (via
typelem) a type we already know about. This way we greatly reduce thenumber of rows returned.
This change brings a type map initialization performance improvement.
A production system with aprox 1.6MM rows in
pg_type, improves from2.5-4.1 seconds to 1.4-1.6 seconds.
Other Information
Additionally, we remove the condition
t.typinput = 'array_in(cstring,oid,integer)'::regprocedure, since itappears to bring no value. Array types have a nonzero
typelem. I'm marking this PR as Draft because I'm not 100% sure about this.