Skip to content

model/query/submissions: avoid RETURNING * where possible#1544

Closed
alxndrsn wants to merge 8 commits intogetodk:masterfrom
alxndrsn:retunring-no-stars
Closed

model/query/submissions: avoid RETURNING * where possible#1544
alxndrsn wants to merge 8 commits intogetodk:masterfrom
alxndrsn:retunring-no-stars

Conversation

@alxndrsn
Copy link
Contributor

@alxndrsn alxndrsn commented Jul 11, 2025

INSERT...RETURNING * is a very convenient SQL structure, but it's often used to return values which are already available to the calling code. When this happens, there is unnecessary overhead to database communications.

There is risk to introducing this change, as:

  1. new columns may be added to an INSERT/UPDATE statement without noticing that they might also be excluded from the generated list of RETURNING columns
  2. a BEFORE INSERT trigger might change supplied values when returned
  3. supplied value might be coerced into a different type when returned

Related: getodk/central#1170

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to the API documentation? If so, please update docs/api.yaml as part of this PR.

Before submitting this PR, please make sure you have:

  • run make test and confirmed all checks still pass OR confirm CircleCI build passes
  • verified that any code from external sources are properly credited in comments or that everything is internally sourced

alxndrsn and others added 7 commits July 11, 2025 07:28
`INSERT...RETURNING *` is a very convenient SQL structure, but it's often
used to return values which are already available to the calling code.
When this happens, there is unnecessary overhead to database communications.

There is risk to introducing this change, as:

1. new columns may be added to an INSERT statement without noticing that
   they might also be exluded from the generated list of RETURNING columns
2. a BEFORE INSERT trigger might change supplied values when returned
3. supplied value might be coerced into a different type when returned
@alxndrsn alxndrsn changed the title model/query: avoid RETURNING * where possible model/query/submissions: avoid RETURNING * where possible Jul 29, 2025
const _defInsert = (id, partial, formDefId, actorId, root, deviceId, userAgent) => sql`insert into submission_defs ("submissionId", xml, "formDefId", "instanceId", "instanceName", "submitterId", "localKey", "encDataAttachmentName", "signature", "createdAt", root, current, "deviceId", "userAgent")
values (${id}, ${sql.binary(partial.xml)}, ${formDefId}, ${partial.instanceId}, ${partial.def.instanceName}, ${actorId}, ${partial.def.localKey}, ${partial.def.encDataAttachmentName}, ${partial.def.signature}, clock_timestamp(), ${root}, true, ${deviceId}, ${userAgent})
returning *`;
returning ${_defInsertReturnFields}`;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we just return submissionId and clock_timestamp - only values that are generated by database.

Generally database only generates surrogate keys using pg_get_serial_sequence and timestamp through the application

Copy link
Contributor Author

@alxndrsn alxndrsn Jul 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we just return submissionId and clock_timestamp

That's what's happening, just in a less direct way.

Do you think the code would be simpler if we declared fields we do want returned rather than fields we don't? It would certainly make this code shorter. Maybe a neater approach would be to mark fields as autogenerated when declaring Frames? E.g.

https://github.com/getodk/central-backend/blob/master/lib/model/frames/submission.js#L89-L100

Could become:

table('submission_defs', 'def'),
  'id', autogenerated,                  'submissionId',
  'formDefId',                          'submitterId',  readable,
  'localKey',                           'encDataAttachmentName',
  'signature',                          'createdAt',    readable, autogenerated,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea of autogenerated attribute in the Frame.

@brontolosone
Copy link
Contributor

When this happens, there is unnecessary overhead to database communications.

That's true, but I was hoping to find it measurably true as well, so I set out to measure a performance difference on the submission path.
But for small submissions (1.7 KB) I can't detect a difference, and neither can I for larger submissions (215 KB).
In that light, is it still worth the risk to change this pattern? Or do we need better perf testing data than what comes rolling out of my ad-hoc perf testing strategy? Do we need to see a tangible improvement or would we also merge this purely for efficiency-aesthetics?

@alxndrsn
Copy link
Contributor Author

Do we need to see a tangible improvement or would we also merge this purely for efficiency-aesthetics?

I'd rather there's a demonstrable benefit, as it complicates the code. Maybe we should discuss scale.

@sadiqkhoja
Copy link
Contributor

It should be demonstrable quite easily if application and database are running on different machines as it is for all our cloud servers.

@yanokwa
Copy link
Member

yanokwa commented Aug 12, 2025

@brontolosone Can you test on dev to see if there was a demonstrable benefit?

@brontolosone brontolosone self-requested a review September 2, 2025 05:13
@brontolosone
Copy link
Contributor

On the benchmarks I've shown you a couple of weeks ago in the pure-sql submission insertion path discussion, this branch showed reduced peak memory usage (roughly halved) inside PostgreSQL compared to mainline, but no throughput speedups. So there's a demonstrable benefit albeit in a benchmark which wasn't particularly designed to measure the impact of this exact thing, and with an N of only 2. We could wait for more & better benchmarks but... the benchmark results make logical sense (smaller resultsets to be kept in buffers in PostgreSQL -> lower memory usage) and it's generally a good idea not to do useless work and it's not a very invasive change, so... I'd be happy to approve it? But it's drafted currently.

@yanokwa
Copy link
Member

yanokwa commented Sep 2, 2025

@alxndrsn Let's get this ready for review and merge. Less memory usage is a good win.

@alxndrsn
Copy link
Contributor Author

alxndrsn commented Sep 3, 2025

@alxndrsn Let's get this ready for review and merge. Less memory usage is a good win.

👍 I think ideally there would be a less-manual notation required for this - see #1544 (comment)

@alxndrsn
Copy link
Contributor Author

alxndrsn commented Sep 5, 2025

Closed for #1606; re-adding reviewers there.

@alxndrsn alxndrsn closed this Sep 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants