Exactly-Once Semantics for Database Transactions
Problem
Golem provides fault-tolerant execution of WASM components by recording all host function interactions in a durable operation log (oplog). In the case of infrastructure failure (e.g., process crash, hardware fault), this allows component execution to resume precisely from the point of interruption.
However, when a component executes a database transaction, and a failure occurs after the database has made the transaction durable, but before Golem records the outcome in the oplog, the transaction may be re-executed during replay. This breaks exactly-once semantics and may result in duplicate effects (e.g., double inserts, duplicate charges) if the database operation is not idempotent.
This issue outlines the mechanism for achieving exactly-once semantics for database transactions under two execution models:
- Databases with native transaction introspection
- Databases without such support
Class 1: Databases with Native Transaction Status Introspection (e.g., Postgres)
Some databases (like modern Postgres versions) allow retrieving a unique identifier for the current transaction and querying its commit status after a crash.
Approach:
- At the start of a transaction, Golem retrieves a transaction identifier from the database and logs it to the oplog before executing any component logic.
- On commit or rollback, Golem logs an end_transaction marker in the oplog indicating the outcome.
- During replay, if a begin_transaction is encountered, Golem scans forward in the oplog:
  - If a corresponding end_transaction is present: no recovery action is needed.
  - If missing: Golem queries the database using the stored transaction ID.
    - If the database confirms the transaction was committed: do not replay.
    - If the transaction was aborted or never completed: replay.
    - If the database cannot determine the status: fail the instance.
This strategy avoids schema changes and imposes minimal performance overhead. It should be used wherever supported.
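The replay decision above can be sketched as a small pure function. This is only an illustration under stated assumptions: the oplog is modelled as a list of (kind, txn_id) tuples, and query_status is a hypothetical callback, not a Golem API. On Postgres 10 and later such a callback could wrap txid_status(), which reports committed, aborted, in progress, or NULL when the status is no longer retained.

```python
from enum import Enum


class Action(Enum):
    SKIP = "skip"      # outcome already known, or confirmed committed
    REPLAY = "replay"  # transaction had no durable effect; run it again
    FAIL = "fail"      # status cannot be determined; fail the instance


def recover_class1(oplog, index, query_status):
    """Decide what to do for the begin_transaction entry at oplog[index].

    oplog is a list of (kind, txn_id) tuples. query_status is a hypothetical
    callback returning "committed", "aborted", "in progress", or None.
    """
    kind, txn_id = oplog[index]
    if kind != "begin_transaction":
        raise ValueError("index must point at a begin_transaction entry")

    # Scan forward for the matching end marker before touching the database.
    for later_kind, later_id in oplog[index + 1:]:
        if later_kind == "end_transaction" and later_id == txn_id:
            return Action.SKIP

    try:
        status = query_status(txn_id)
    except Exception:
        # Access failure: the outcome is unknowable from here.
        return Action.FAIL

    if status == "committed":
        return Action.SKIP
    # Assumption: "in progress" is treated as "never completed", which is only
    # sound once the crashed session's connection is known to be closed.
    if status in ("aborted", "in progress"):
        return Action.REPLAY
    return Action.FAIL
```

Keeping the decision free of I/O beyond the callback makes each branch of the list above easy to cover in unit tests.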
Class 2: Databases Without Transaction Introspection (e.g., MySQL)
Databases that do not expose transaction status must rely on a fallback mechanism involving durable, internal metadata.
Approach:
- When a transaction is about to begin, Golem checks for the existence of a special internal table (golem_transactions). If the table does not exist, Golem attempts to create it.
  - If creation fails due to insufficient permissions, the instance is failed immediately with a descriptive error.
- Golem generates a unique transaction identifier (UUID) and:
  - Writes it to the internal table with a status of incomplete
  - Logs a begin_transaction entry to the oplog using the UUID
- The component transaction is executed.
- Just before issuing commit or rollback, Golem updates the status of the internal table entry (e.g., to committed or aborted) as part of the same transaction.
- After the commit or rollback completes, Golem writes an end_transaction marker to the oplog.
- Once this marker is durably recorded in the oplog, Golem deletes the corresponding row in the golem_transactions table.
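The ordering of these steps can be sketched with Python's built-in sqlite3 module standing in for a database without introspection. Everything here is an assumption made for illustration: the two-column table layout, the in-memory list used as the oplog, and the run_transaction helper are hypothetical and not part of Golem.

```python
import sqlite3
import uuid


def ensure_table(conn):
    # Raises sqlite3.OperationalError if the table cannot be created; a real
    # worker would surface that as an immediate instance failure.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS golem_transactions ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()


def run_transaction(conn, oplog, body):
    """Run body(conn) as one tracked transaction and return the UUID used."""
    ensure_table(conn)
    txn_id = str(uuid.uuid4())

    # Durable 'incomplete' row first, then the begin marker.
    conn.execute(
        "INSERT INTO golem_transactions (id, status) VALUES (?, 'incomplete')",
        (txn_id,),
    )
    conn.commit()
    oplog.append(("begin_transaction", txn_id))

    # Component work and the status update share one database transaction.
    try:
        body(conn)
        conn.execute(
            "UPDATE golem_transactions SET status = 'committed' WHERE id = ?",
            (txn_id,),
        )
        conn.commit()
        outcome = "committed"
    except Exception:
        # Sketch simplification: the error is swallowed, and the row stays
        # 'incomplete' because a status written inside the rolled-back
        # transaction would not persist anyway.
        conn.rollback()
        outcome = "aborted"

    # End marker first; the metadata row is deleted only afterwards.
    oplog.append(("end_transaction", txn_id, outcome))
    conn.execute("DELETE FROM golem_transactions WHERE id = ?", (txn_id,))
    conn.commit()
    return txn_id
```

A call such as run_transaction(conn, oplog, lambda c: c.execute("INSERT INTO payments VALUES (10)")) leaves one begin and one end entry in the oplog and no row in golem_transactions.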
Replay Logic:
- During replay, for each begin_transaction entry, Golem scans ahead in the oplog to check for a matching end_transaction:
  - If the end marker is present: transaction is complete; no further action needed.
  - If the end marker is missing: consult the internal table by UUID.
    - If status is committed: do not replay.
    - If status is aborted, incomplete, or absent: replay using a fresh UUID.
- Deletion of the row from the internal table is safe only after the end_transaction marker has been durably persisted in the oplog.
This fallback mechanism ensures exactly-once semantics with minimal overhead and without requiring modifications to developer-managed schemas.
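A minimal sketch of the fallback replay lookup, again using sqlite3 as a stand-in. The function name, the tuple-shaped oplog entries, and the two-column table are assumptions made for this example; database errors are left to propagate so the caller can fail the instance.

```python
import sqlite3
import uuid


def replay_decision(conn, oplog_tail, txn_id):
    """Return ("skip", None) or ("replay", fresh_uuid) for a begin marker.

    oplog_tail holds the oplog entries that follow the begin_transaction
    entry; each entry is a tuple whose first two items are (kind, txn_id).
    """
    # The oplog is consulted first; the table only settles the open cases.
    if any(e[0] == "end_transaction" and e[1] == txn_id for e in oplog_tail):
        return ("skip", None)

    row = conn.execute(
        "SELECT status FROM golem_transactions WHERE id = ?", (txn_id,)
    ).fetchone()

    if row is not None and row[0] == "committed":
        return ("skip", None)

    # 'aborted', 'incomplete', or no row at all: replay under a new identifier
    # so the retry cannot be confused with the failed attempt.
    return ("replay", str(uuid.uuid4()))
```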
Recovery Behavior Summary
| Condition | Recovery Action |
| --- | --- |
| begin_transaction seen | Scan forward in oplog |
| → end_transaction exists | Skip replay |
| → No end_transaction | Check DB for status |
| ↳ DB says committed | Skip replay |
| ↳ DB says aborted or unknown | Replay with new txn ID |
| ↳ DB error / access failure | Fail instance |
Cleanup Policy
Once Golem logs the end_transaction marker in the oplog, the corresponding entry in the internal golem_transactions table is no longer required and should be deleted immediately. This prevents unbounded growth of the table and ensures fast recovery lookups. A timestamp field may optionally be included for future audit, debugging, or compaction strategies.
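One way the cleanup rule might look in practice is sketched below. It assumes a hypothetical created_at column for the optional timestamp and a sweep that removes any row whose end_transaction marker is already in the oplog, which would also catch rows left behind by a crash between writing the marker and deleting the row. The helper names and the list-based oplog are illustrative only.

```python
import sqlite3
import time


def create_table(conn):
    # created_at is the optional timestamp column; its type is an assumption.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS golem_transactions ("
        "id TEXT PRIMARY KEY, status TEXT NOT NULL, created_at REAL NOT NULL)"
    )


def record_begin(conn, txn_id):
    conn.execute(
        "INSERT INTO golem_transactions VALUES (?, 'incomplete', ?)",
        (txn_id, time.time()),
    )
    conn.commit()


def sweep_completed(conn, oplog):
    """Delete rows whose end_transaction marker is already in the oplog.

    Returns the number of rows left in the table.
    """
    finished = [(entry[1],) for entry in oplog if entry[0] == "end_transaction"]
    conn.executemany("DELETE FROM golem_transactions WHERE id = ?", finished)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM golem_transactions").fetchone()[0]
```

Rows without an end marker are never touched, so the sweep cannot remove metadata that recovery still needs.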
Definition of Done / Tests
The implementation is complete when the following test scenarios are covered:
Crash Recovery Paths
- Crash before transaction begins (no DB insert, no oplog entry): transaction is not replayed.
- Crash after metadata row inserted but before oplog write: transaction is not replayed.
- Crash after oplog write but before DB transaction begins: transaction is not replayed.
- Crash after DB commit but before oplog completion marker: end_transaction missing → Golem checks DB for status.
  - If committed: skip.
  - If aborted or unknown: replay.
- Crash after rollback but before oplog marker: same as above.
- Crash after end_transaction marker written: transaction is known complete; do not replay.
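A crash-injection harness for some of these paths could look roughly like the following. It is a sketch under assumptions: sqlite3 plays the database, a Python list plays the oplog, and the Crash exception, the charges table, and the crash_at labels are invented for the example. The property checked is that exactly one charge exists after recovery, whichever point the crash hit.

```python
import sqlite3
import uuid


class Crash(Exception):
    """Simulates the worker dying at a chosen point."""


def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE charges (amount INTEGER)")
    conn.execute(
        "CREATE TABLE golem_transactions (id TEXT PRIMARY KEY, status TEXT)"
    )
    return conn


def charge(conn, oplog, crash_at=None):
    txn_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO golem_transactions VALUES (?, 'incomplete')", (txn_id,)
    )
    conn.commit()
    oplog.append(("begin_transaction", txn_id))

    conn.execute("INSERT INTO charges VALUES (100)")
    conn.execute(
        "UPDATE golem_transactions SET status = 'committed' WHERE id = ?",
        (txn_id,),
    )
    if crash_at == "before_commit":
        conn.rollback()  # a dead session loses its uncommitted work
        raise Crash(crash_at)
    conn.commit()
    if crash_at == "after_commit":
        raise Crash(crash_at)

    oplog.append(("end_transaction", txn_id))
    conn.execute("DELETE FROM golem_transactions WHERE id = ?", (txn_id,))
    conn.commit()


def recover(conn, oplog):
    kind, txn_id = oplog[-1]
    if kind == "end_transaction":
        return  # outcome already recorded; nothing to do
    row = conn.execute(
        "SELECT status FROM golem_transactions WHERE id = ?", (txn_id,)
    ).fetchone()
    if row is not None and row[0] == "committed":
        # Committed but unrecorded: write the marker, then clean up.
        oplog.append(("end_transaction", txn_id))
        conn.execute("DELETE FROM golem_transactions WHERE id = ?", (txn_id,))
        conn.commit()
    else:
        charge(conn, oplog)  # replay; charge() draws a fresh UUID


def charges_after(crash_at):
    conn, oplog = setup(), []
    try:
        charge(conn, oplog, crash_at)
    except Crash:
        recover(conn, oplog)
    return conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

Calling charges_after with None, "before_commit", and "after_commit" exercises the no-crash path, the replay path, and the skip path respectively.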
Oplog Replay Semantics
- begin_transaction must always trigger forward oplog scan before consulting DB or fallback table.
- end_transaction must always be logged immediately after commit/rollback completes.
Permissions
- Instance fails immediately and descriptively if:
  - Golem cannot create or access the internal metadata table
  - Golem lacks access to transaction introspection APIs
Fallback Table Behavior
- UUIDs are globally unique per transaction.
- Metadata rows are deleted only after oplog completion is confirmed.
- Table remains bounded in size under sustained load.
Compatibility
- Strategy automatically prefers introspection-based path when supported.
- Fallback path is used on MySQL, SQLite, or any similar database.
Performance
- No significant overhead from metadata writes or internal table lookups.
Notes
- Golem instances are non-interactive and isolated. Any required permissions or schema setup that cannot be satisfied must fail the instance deterministically, with clear error messages for developers or operational staff.
- This mechanism must not require schema changes outside Golem’s internal tables or coordination with external systems.
This implementation guarantees exactly-once transactional execution across all supported databases and execution environments, and is necessary for correctness in the presence of failures.