Skip to content

security: hash OTPs and encrypt TOTP secrets at rest with idempotent migration (#7, #8)#590

Merged
lakhansamani merged 4 commits intomainfrom
security/otp-totp-at-rest
Apr 7, 2026
Merged

security: hash OTPs and encrypt TOTP secrets at rest with idempotent migration (#7, #8)#590
lakhansamani merged 4 commits intomainfrom
security/otp-totp-at-rest

Conversation

@lakhansamani
Copy link
Copy Markdown
Contributor

Summary

#7 — OTPs stored in plaintext
schemas/otp.go.Otp held the raw 6-digit code. An offline DB dump revealed every active password-reset / signup-verification / MFA code, allowing account takeover for any user with an outstanding OTP request.

#8 — TOTP secrets stored in plaintext
schemas/authenticators.go.Secret held the base32 TOTP shared secret. TOTP secrets are long-lived (enrolled once, used forever) so a DB dump enables permanent MFA bypass.

Approach

OTP TOTP
Lifetime seconds–minutes forever
Recover raw value? No (only verify) Yes (re-derive code each check)
Primitive HMAC-SHA256 hash AES-256-GCM encrypt

OTPs only need verification, not recovery, so a one-way hash is enough. TOTP secrets must be decryptable on every Validate, so AES-GCM with the existing HKDF-derived key.

Changes

OTP (#7)

  • New crypto.HashOTP(plain, key) (HMAC-SHA256 hex) and crypto.VerifyOTPHash(plain, stored, key) (constant-time).
  • All write sites hash before UpsertOTP. The plaintext stays in a local variable for the email/SMS body. Affected:
    • internal/graphql/login.go (generateOTP closure: store hash, restore plaintext on the returned struct so callers can read otpData.Otp unchanged)
    • internal/graphql/signup.go
    • internal/graphql/forgot_password.go
    • internal/graphql/resend_otp.go
  • Verify sites accept BOTH hash and legacy plaintext for the duration of a rolling upgrade — old in-flight rows expire within minutes anyway:
    • internal/graphql/verify_otp.go
    • internal/graphql/reset_password.go
  • Migration: NONE required. All in-flight plaintext rows naturally expire during the upgrade window.

TOTP (#8)

  • Versioned ciphertext format enc:v1:<base64(nonce||ciphertext)>. The prefix lets the read path transparently handle both new ciphertext rows and legacy plaintext, and lets future migrations recognise already-encrypted rows so they never re-encrypt them.
  • New crypto helpers: EncryptTOTPSecret, DecryptTOTPSecret, IsEncryptedTOTPSecret.
  • TOTP authenticator:
    • Generate() encrypts before storing
    • Validate() decrypts (transparent legacy passthrough)
    • Lazy migration: if the stored value is still legacy plaintext and the user just produced a valid TOTP, re-encrypt and persist before the next verification. Avoids having to add a ListAuthenticators API to all 13 storage providers.
  • EncryptionKey wired into totp.Dependencies from cfg.JWTSecret.

Deprecation note

The lazy-migration code is intentionally DEPRECATED for removal two minor versions after this lands. By then all rolling upgrades will have completed. Operators upgrading from a pre-encryption release must deploy this version once and let TOTP users sign in once before upgrading further.

Test plan

  • go build ./...
  • go test ./internal/crypto/...
  • TEST_DBS=sqlite go test -run "TestSignup|TestLogin|TestVerifyOtp|TestResendOtp|TestForgotPassword|TestResetPassword|TestVerifyEmail" ./internal/integration_tests/
  • Manual: enroll TOTP, verify DB column starts with enc:v1:
  • Manual: legacy plaintext TOTP row continues to validate, then is re-encrypted on next check

…migration (#7, #8)

#7 — OTPs stored in plaintext
schemas/otp.go.Otp held the raw 6-digit code. An offline DB dump revealed
every active password-reset / signup-verification / MFA code, allowing
account takeover for any user with an outstanding OTP request.

OTPs are short-lived (minutes) so a reversible primitive is unnecessary —
the verifier only needs to recompute the digest from the candidate. Add:
- crypto.HashOTP(plain, key)        — HMAC-SHA256 hex
- crypto.VerifyOTPHash(plain, stored, key) — constant-time compare

All write sites now hash before UpsertOTP. The plaintext stays in a local
variable for the email/SMS body. Affected handlers:
- internal/graphql/login.go (generateOTP closure: store hash, restore
  plaintext on the returned struct so callers can read otpData.Otp)
- internal/graphql/signup.go (mobile SMS verification)
- internal/graphql/forgot_password.go (mobile reset OTP)
- internal/graphql/resend_otp.go (same closure pattern as login)

Verify sites accept BOTH hash and legacy plaintext for the duration of a
rolling upgrade — old in-flight rows expire within minutes anyway:
- internal/graphql/verify_otp.go
- internal/graphql/reset_password.go

Migration: NONE required. All in-flight plaintext rows naturally expire
during the upgrade window.

#8 — TOTP secrets stored in plaintext
schemas/authenticators.go.Secret held the base32 TOTP shared secret. TOTP
secrets are long-lived (enrolled once, used forever) so a DB dump enables
permanent MFA bypass.

Add a versioned ciphertext format (so migration is idempotent and a future
algorithm rotation is straightforward):
- crypto.EncryptTOTPSecret(plain, key) → "enc:v1:" + AES-256-GCM ciphertext
- crypto.DecryptTOTPSecret(stored, key) — transparent legacy passthrough
- crypto.IsEncryptedTOTPSecret(stored) — prefix check used by lazy migration

The TOTP authenticator now:
- Encrypts the secret in Generate() before storing
- Decrypts in Validate() before computing the expected code
- Lazy-migrates legacy plaintext rows on the first successful Validate by
  re-encrypting and updating the row in place. The lazy approach avoids
  having to add a ListAuthenticators API to all 13 storage providers.
  DEPRECATED: the lazy migration code is intended to be removed two minor
  versions after this lands; by then all rolling upgrades will have
  completed.

EncryptionKey is wired into totp.Dependencies from cfg.JWTSecret so the
TOTP package does not need direct access to the broader Config struct.

Properties:
- Idempotent: encrypted rows are recognised by the enc:v1: prefix and
  never re-encrypted
- Safe under partial migration / mid-rollout failure: read path handles
  both forms; write path always produces the new form
- Mixed cluster during a rolling deploy: old nodes still read enc:v1:
  values successfully because they fail open in DecryptTOTPSecret only
  if the secret happens to be plaintext-shaped — actual mixed clusters
  will need both versions to ship together, but the read path is
  forwards-compatible
Adds dedicated unit and integration tests for the at-rest hardening so
each piece of the new behaviour is exercised explicitly:

internal/crypto/otp_test.go (new)
- HashOTP round-trip with the original plaintext
- VerifyOTPHash rejects: wrong plaintext, wrong server key, mismatched
  digest length
- VerifyOTPHash MUST reject the stored digest itself when used as input
  (otherwise the digest is a replayable credential)
- HashOTP is deterministic for a fixed (plaintext, key) pair

internal/crypto/totp_test.go (new)
- EncryptTOTPSecret produces enc:v1: ciphertext that decrypts to the
  original plaintext
- AES-GCM nonce randomness — same input yields different ciphertext
- DecryptTOTPSecret transparently passes through legacy plaintext rows
- DecryptTOTPSecret detects tampered ciphertext (GCM authentication)
- DecryptTOTPSecret fails on the wrong key
- EncryptTOTPSecret on empty input is a no-op
- IsEncryptedTOTPSecret prefix detection

internal/integration_tests/totp_at_rest_test.go (new)
- Generate() writes the secret as enc:v1:<ciphertext>, never as the raw
  base32 string. Decrypted value matches the plaintext returned to the
  caller for QR rendering.
- Validate() accepts a code computed from the encrypted secret.
- Validate() lazy-migrates a legacy plaintext row: a row written without
  the enc:v1: prefix still validates, AND the row is rewritten into the
  enc:v1: form on the first successful Validate.
- Validate() rejects an obviously wrong code on an encrypted row.
- Validate() is idempotent on already-encrypted rows: the underlying
  plaintext does not change across multiple validations.

internal/integration_tests/{verify_otp,resend_otp,reset_password}_test.go
- Reworked to seed a known plaintext / digest pair via UpsertOTP and
  verify with the plaintext, instead of reading the stored digest back
  out of the database and replaying it. The previous flow only passed
  because of the legacy plaintext fallback (see below).
- Added new "should fail when stored hash is replayed as input"
  subtest to lock the security guarantee in.

internal/integration_tests/test_helper.go
- Expose Config and AuthenticatorProvider on testSetup so tests can
  access cfg.JWTSecret and the totp authenticator. Config was already
  in the struct but never assigned by initTestSetup; this fixes that
  oversight.

internal/graphql/verify_otp.go and reset_password.go
- Drop the legacy plaintext fallback in the OTP comparison. The
  fallback was originally added to keep in-flight OTPs verifying during
  a rolling upgrade, but it ALSO accepted the stored digest itself as
  valid input — turning the digest into a replayable credential for
  anyone with DB read access. OTPs expire in minutes so the rolling-
  upgrade window is naturally bounded; users on either side simply
  request a fresh code.
@lakhansamani lakhansamani force-pushed the security/otp-totp-at-rest branch from c18a8de to 18841a6 Compare April 7, 2026 05:27
The lazy migration that rewrites legacy plaintext TOTP rows into the
enc:v1: ciphertext format was always-on. This is unsafe for a rolling
deploy from a pre-encryption release: replicas still on the old version
do not understand the enc:v1: prefix and call totp.Validate(code, "enc:v1:...")
which treats the prefix as part of a base32 secret and fails to decode,
locking out users who happen to be routed to those replicas after a new
replica has migrated their row.

Changes:

1) New CLI flag --enable-totp-migration (Config.EnableTOTPMigration),
   default OFF. Plumbed through internal/authenticators.New() into
   totp.Dependencies.EnableLazyMigration. The READ path is unaffected:
   DecryptTOTPSecret transparently handles both legacy plaintext and
   enc:v1: rows regardless of the flag, so a fresh deploy never breaks
   in-flight users.

2) Recommended deploy sequence is documented inline on
   totp.Dependencies.EnableLazyMigration:
     a. Roll out this release to all replicas with the flag OFF.
     b. Confirm every replica is on the new version.
     c. Restart with --enable-totp-migration=true.
     d. After enough time for active users to validate at least once,
        the flag may be left on indefinitely (no-op for already-encrypted
        rows).

3) totp.Validate() is now best-effort on the post-validate row write.
   A failure to update the authenticator row no longer fails the login —
   the user supplied a valid TOTP code, refusing it because of a transient
   DB hiccup or a migration encrypt error would be a worse outcome than
   a delayed VerifiedAt or a delayed migration. Failures are logged with
   structured fields so operators can spot them.

4) Decrypt failure on read is logged loudly with a hint that the most
   likely cause is a --jwt-secret rotation since enrollment.

5) Concurrency: two replicas observing the same legacy row may both
   decrypt, re-encrypt, and write before either commits. The two writes
   carry semantically identical plaintext under different AES-GCM nonces;
   last writer wins, the row ends up in enc:v1: form, and the
   IsEncryptedTOTPSecret short-circuit prevents any further re-encryption.
   This is documented in the function comment.

Tests:

internal/integration_tests/totp_at_rest_test.go gains two subtests:
  - "Validate without migration leaves legacy plaintext rows unchanged"
    spins up a second test setup with EnableTOTPMigration=false and
    asserts that a legacy plaintext row still validates BUT is not
    rewritten.
  - "Validate succeeds even if the migration write fails (best-effort)"
    documents the contract that a successful Validate returns true
    regardless of the row update outcome.

The "Validate lazy-migrates a legacy plaintext row" subtest now runs
against a setup with EnableTOTPMigration=true, so both flag states are
covered end-to-end.
…n Validate

Replaces the --enable-totp-migration flag with an in-Validate
branch that handles both forms in one place. The migration is now
always-on and the storage helper is no longer permitted to silently
pass legacy values through.

Why
---
The flag-gated lazy migration was over-engineered for the typical
deploy model. It existed to protect rolling deploys across multiple
replicas, but the protection was conditional: operators had to remember
to flip the flag at the right time, and the implicit "DecryptTOTPSecret
returns plaintext for legacy rows" behaviour was easy to miss when
reading the code.

The new shape is closer to the obvious algorithm: decrypt if you can,
fall back to the raw value if you can't, migrate if the fallback
matched.

Changes
-------

internal/crypto/totp.go
- DecryptTOTPSecret is now strict: it returns the new sentinel
  ErrTOTPSecretNotEncrypted on a non-prefixed row instead of silently
  echoing the input. The "decrypt" function now does exactly one thing,
  which is much easier to reason about.
- Test updated to assert the sentinel error.

internal/authenticators/totp/totp.go::Validate
- Branches explicitly on the decrypt result. Three outcomes:
    decErr == nil                              → enc:v1: row, use plaintext
    errors.Is(decErr, ErrTOTPSecretNotEncrypted) → legacy row, use raw value, migrate
    other decErr                               → fail closed, log loudly
- Migration is always attempted on a successful legacy validation; the
  write is best-effort (failures logged, login still succeeds).
- The function comment now documents the rolling-deploy tradeoff
  explicitly so operators with multi-replica clusters can plan around
  it (atomic deploy preferred; brief maintenance window otherwise).

internal/authenticators/totp/provider.go
internal/authenticators/providers.go
internal/config/config.go
cmd/root.go
- Drop EnableLazyMigration / Config.EnableTOTPMigration / the
  --enable-totp-migration CLI flag. There is no operator-facing knob
  any more.

internal/integration_tests/totp_at_rest_test.go
- Drop the "without migration" subtest (the flag it tested no longer
  exists).
- The "lazy migration" subtest is unchanged structurally but now
  exercises the always-on path.

Concurrency, write semantics, and idempotency are unchanged from the
previous commit:
- Two replicas migrating the same row write semantically identical
  plaintexts under different AES-GCM nonces; last writer wins; the
  IsEncryptedTOTPSecret short-circuit prevents re-encryption.
- A failed migration write does NOT fail the login.
- A decrypt failure on a prefixed row (e.g. wrong --jwt-secret) is
  fail-closed.

Known tradeoff
--------------
For a rolling deploy from a pre-encryption release: as soon as one
replica handles a TOTP login, that user's row is in enc:v1: form. Any
old replica that subsequently reads the row will treat the prefix as
part of a base32 secret and fail. Either deploy atomically, or accept
a short window during which migrated users may be locked out of old
replicas (TOTP codes regenerate every 30s so the user-visible impact
is bounded).
@lakhansamani lakhansamani merged commit 9f0e0f8 into main Apr 7, 2026
@lakhansamani lakhansamani deleted the security/otp-totp-at-rest branch April 7, 2026 05:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant