Skip to content

fix(ha-mqtt): restore HA alarm panel arm/disarm end-to-end + PIN perf#46

Merged
ljmerza merged 9 commits into
mainfrom
fix/ha-mqtt-alarm-action-template-variable
May 2, 2026
Merged

fix(ha-mqtt): restore HA alarm panel arm/disarm end-to-end + PIN perf#46
ljmerza merged 9 commits into
mainfrom
fix/ha-mqtt-alarm-action-template-variable

Conversation

@ljmerza

@ljmerza ljmerza commented May 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Four changes that together restore arm/disarm functionality from the Home Assistant alarm panel and cut its end-to-end latency from ~2s to subsecond:

  1. Use {{ action }} not {{ value }} in the alarm command_template. Without this, every click landed on the broker as {"action": "None", "code": "..."} and was rejected as "Unknown action."
  2. Append a per-process random suffix to the broker client_id. Defends against two MqttConnectionManager instances inside the same container (e.g. a diagnostic script that ran django.setup()) fighting the live daphne client for the same MQTT identity.
  3. Treat an empty code field in the MQTT command payload as no-code. Without this, arming with code_arm_required=False was rejected with "Code required." because HA always sends code: "" in the rendered template, and our parser surfaced that as "" rather than None.
  4. Lower-cost Argon2 hasher for UserCode PINs (~150-300ms → ~30ms per verify). Alarm PINs are 4-8 digits; the strong user-password cost gives no real attacker bar against a tiny keyspace but stacks ~hundreds of ms on every arm/disarm round trip.

Bug 1 — {{ action }} not {{ value }}

HA's alarm_control_panel.mqtt exposes only action and code to command_templatevalue is lock.mqtt's variable, not alarm's (HA docs). With {{ value }}, Jinja saw value as undefined and rendered it as the literal string "None", so HA published payloads like:

{"action": "None", "code": "1234"}

Latchpoint's parser uppercased "None""NONE", found no match in _ACTION_TO_STATE, and replied "Unknown action." to the error topic without invoking the state machine.

Why nothing caught it: the previous test round-tripped the broken template through json.loads and asserted equality against itself — it locked the bug in place. Replaced with a test that asserts the documented HA variable name, plus a render-simulation test using Django's template engine that proves the action survives substitution as a real string.

Bug 2 — Per-process client_id suffix

When two paho clients share the configured client_id (default "latchpoint-alarm"), the MQTT broker enforces single-client-per-id and boots the older one with session taken over. The loser auto-reconnects within ~1s, ping-ponging forever. Each successful connect fires _after_connect, republishing retained discovery + availability + state.

The collision shape encountered in practice during this debug session: a diagnostic script run via docker exec into the live container called django.setup(), which triggered transports_mqtt.apps.ready() and spawned a second MqttConnectionManager in the same container. Both clients connected as latchpoint-alarm. Same shape would hit any future scenario where a second consumer (parallel deploy, dev side-car, accidental import path) imports transports_mqtt.

Fix: append a per-process random suffix (secrets.token_hex(3), 6 hex chars) to the configured client_id when building the paho client. Stable within one process so reconnects don't churn the broker's view of the live client; divergent across processes so they cannot collide. test_connection() opts into a fresh suffix per call so its throwaway client never knocks the live one off, even from inside the same process.

The 6-hex suffix is sized so the default composed client_id latchpoint-alarm-XXXXXX lands at exactly 23 bytes — the MQTT 3.1 spec limit. paho-mqtt 2.0+ defaults to 3.1.1 (which raises this to 65535), but staying inside the 3.1 limit guards against any stricter broker. 6 hex still gives 16M distinct values, far above what's needed for 1-2 process collision avoidance. The user-facing setting was renamed to Client ID prefix with description noting the 16-char prefix budget; rc=2 connect-error messages were updated to match.

Bug 3 — Empty code field treated as missing

HA's MQTT alarm panel renders the command template with code = code or "", so when a user arms (or disarms) without typing a PIN the broker payload is {"action": "ARM_HOME", "code": ""} — the field is present but empty.

_handle_command_payload returned "" for that case, while the arm branch of handle_mqtt_alarm_command checked code_required = get_setting_bool(profile, "code_arm_required") or raw_code is not None. Empty string is not None, so the handler forced code validation even when code_arm_required was False, and the broker payload came back as error: "Code required.".

Fix: normalize empty/whitespace code to None once at the parser boundary in _handle_command_payload. The disarm branch already used truthiness (if not raw_code:) and was unaffected — collapsing the two representations at the parser removes the boundary inconsistency for any future caller too.

Verified on the live HA instance via the MCP server: alarm_control_panel.latchpoint_alarm already had code_arm_required: false in its state attributes, confirming the discovery republish pipeline was healthy and the failure was purely backend-side.

Perf — Lower-cost Argon2 for PINs

UserCode PINs are 4-8 digits (10K-100M keyspace). Stretching them at the default user-password Argon2 cost (~150-300ms/verify) doesn't materially raise an offline-brute-force attacker's bar — the entire keyspace is recoverable in minutes-to-hours even at strong cost — while the slow verify shows up as a multi-hundred-ms tax on every arm/disarm round trip from the HA MQTT alarm panel.

Add UserCodeArgon2Hasher (algorithm argon2-pin, time_cost=1, memory_cost=8 MiB, parallelism=2; ~30ms/verify) and route create_user_code / update_user_code through it. User passwords keep the strong default. Existing UserCode rows hashed under the original argon2 algorithm continue to verify on the strong hasher with no migration required — check_password resolves by hash prefix.

Combined with Bug 1, HA-driven arm/disarm clicks now land subsecond on the live home-lab broker.

Verification

  • python manage.py test transports_mqtt integrations_home_assistant.tests.test_mqtt_alarm_entity accounts.tests.test_user_codes_hasher -v 2 → 34 tests pass
  • uvx ruff check backend/transports_mqtt/ backend/integrations_home_assistant/ backend/accounts/ backend/alarm/settings_registry.py → clean
  • uvx ruff format --check ... → clean
  • ✅ Empirical: live broker capture before this change showed {"action": "None", "code": "1234"} on every click
  • ✅ Empirical: HA entity attributes already report code_arm_required: false after the discovery republish — confirmed via MCP ha_get_state alarm_control_panel.latchpoint_alarm

Test plan

After merge and a fresh image is built/redeployed:

  • Sniff latchpoint_alarm/alarm/# on the broker
  • Click arm-away in HA → expect {"action": "ARM_AWAY", "code": "..."} followed by latchpoint_alarm/alarm/state :: armed_away within ~1s
  • Repeat for arm-home, arm-night, and disarm
  • Confirm broker mqtt log shows a single stable latchpoint-alarm-XXXXXX client (6 hex chars, no session taken over cycling)
  • With code_arm_required: false in alarm settings, click arm-away in HA without typing a PIN → expect successful state transition (not error: "Code required.")
  • With code_arm_required: true, arm with a valid PIN → still works
  • Disarm with a valid PIN → still works and feels noticeably faster (PIN verify dropped from hundreds of ms to ~30ms)
  • In the MQTT settings UI, confirm the field is now labelled "Client ID prefix" with the suffix-explanation description
  • If HA's UI still doesn't update after a successful state publish, remove the Latchpoint device under HA's MQTT integration; Latchpoint's _after_connect will republish discovery and HA will recreate the entity with the corrected template.

Files changed

  • backend/integrations_home_assistant/mqtt_alarm_entity.py — template variable fix + empty-code normalization in _handle_command_payload
  • backend/integrations_home_assistant/tests/test_mqtt_alarm_entity.py — replace mirror-test, add render simulation, add HandleCommandPayloadTests for empty/whitespace/missing/real codes
  • backend/transports_mqtt/manager.py_client_id_suffix in __init__ (3 bytes / 6 hex chars to fit MQTT 3.1's 23-byte limit); _build_client becomes an instance method that appends the suffix and uses the paho 2.0+ kwarg call without a fallback; test_connection overrides with a fresh suffix; rc=2 error messages reference "client_id prefix"
  • backend/transports_mqtt/tests/test_manager.py — six tests covering distinct/stable/prefix-preserving/default-prefix/override/MQTT-3.1-length behaviors plus a wiring guard that test_connection forwards a non-live suffix
  • backend/alarm/settings_registry.pyclient_id schema title renamed to "Client ID prefix" with description explaining the auto-suffix and the 16-char prefix budget
  • backend/accounts/hashers.py — new UserCodeArgon2Hasher (argon2-pin, low cost)
  • backend/accounts/tests/test_user_codes_hasher.py — coverage for the PIN hasher
  • backend/accounts/use_cases/user_codes.py — route create/update through the PIN hasher
  • backend/config/settings.py — register the new hasher

HA's alarm_control_panel.mqtt exposes only `action` and `code` to
command_template — `value` is lock.mqtt's variable, not alarm's. With
`{{ value }}`, Jinja rendered the action as the literal string "None"
(undefined → str(None)) and HA published payloads like
`{"action": "None", "code": "1234"}`. Latchpoint's parser then returned
"Unknown action." without invoking the state machine, so arm/disarm
clicks in HA had no effect.

The previous test asserted the broken template against itself, which is
why nothing caught it — replaced with one that locks the correct
variable name and added a render-simulation test that asserts the
action survives as a real string.

Verified live: broker capture before this change showed
`{"action": "None", "code": "1234"}` on every click; HA docs confirm
the variable scope.
https://www.home-assistant.io/integrations/alarm_control_panel.mqtt/#command_template
Copilot AI review requested due to automatic review settings May 1, 2026 23:18

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Home Assistant MQTT alarm_control_panel discovery payload so arm/disarm commands publish a valid action value by using HA’s documented {{ action }} variable (instead of {{ value }}), preventing commands from being ignored by the server-side parser.

Changes:

  • Update HA MQTT discovery command_template to use {{ action }} and expand the inline documentation to match HA’s variable scope.
  • Replace the previous “self-equality” regression test with one that asserts the correct HA placeholders.
  • Add a render-simulation test to ensure the command template produces valid JSON with a concrete action/code.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
backend/integrations_home_assistant/mqtt_alarm_entity.py Switches discovery payload command_template to {{ action }} and documents HA’s template variable scope.
backend/integrations_home_assistant/tests/test_mqtt_alarm_entity.py Strengthens regression coverage around the command template placeholders and rendering behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/integrations_home_assistant/tests/test_mqtt_alarm_entity.py Outdated
Two paho clients sharing the configured `client_id` (default
"latchpoint-alarm") fight for the same broker-side identity. The broker
boots the older one with "session taken over" and the loser reconnects
within ~1s, ping-ponging forever. Each on_connect fires `_after_connect`,
which republishes discovery + availability + state — measured at HA as
constant retained-message spam.

The collision shape we hit in practice was a diagnostic script run via
`docker exec` into the live container: `django.setup()` triggered
`transports_mqtt.apps.ready()` and spawned a *second* MqttConnectionManager
in the same container, both connecting as `latchpoint-alarm`. The same
shape happens for any second instance — parallel deploys, dev side-cars,
accidental imports.

Defensively append a per-process random suffix to the configured
`client_id` when building paho clients. The suffix is generated once in
`MqttConnectionManager.__init__` so it stays stable across reconnects
and settings PATCHes within one process; across processes it diverges.
`test_connection()` opts into a fresh suffix per call so its throwaway
client never knocks the live one off, even from inside the same process.

The user-facing `client_id` setting is unchanged — it now functions as
a prefix and existing configurations keep working.
@ljmerza ljmerza changed the title fix(ha-mqtt): use {{ action }} not {{ value }} in alarm command_template fix(ha-mqtt): use {{ action }} in alarm command_template + unique broker client_id per process May 1, 2026
…ffix

Existing tests prove the `client_id_suffix=` override mechanism on
`_build_client` works when called directly, but no test asserted that
`test_connection()` actually passes a non-None override at runtime. If
that override regressed, a `test_connection` invoked from inside the
live container would reuse the manager's stable suffix and the broker
would evict the running daphne client mid-validation — the exact shape
of the bug PR #46 was meant to prevent.

Adds a SimpleTestCase that monkey-patches `_build_client` and
`_import_paho` on a manager instance, calls `test_connection()` against
a stub settings dict, and asserts the captured `client_id_suffix` is
non-None and differs from the manager's live `_client_id_suffix`.
Short-circuits via `RuntimeError` before threading or `connect_async`
are touched, so no broker / no I/O.
Copilot AI review requested due to automatic review settings May 1, 2026 23:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

backend/transports_mqtt/manager.py:399

  • In the except TypeError fallback, the code builds mqtt.Client() without any client_id, which drops the configured prefix + suffix entirely. If this fallback is ever hit (e.g., a paho version/signature mismatch), it defeats the collision-avoidance and the “stable within a process” guarantee. Consider retrying client construction with a positional client_id (or otherwise ensuring the computed client_id is still applied) rather than omitting it.
        try:
            client = mqtt.Client(client_id=client_id)
        except TypeError:
            client = mqtt.Client()


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/transports_mqtt/tests/test_manager.py
- Drop redundant assertNotEqual in HA mqtt alarm template test (implied by
  the prior assertEqual that already pins action to ARM_AWAY).
- Make _build_client's TypeError fallback preserve the computed
  <prefix>-<suffix> client_id by retrying positionally instead of falling
  through to a no-arg mqtt.Client(); a no-arg fallback silently dropped
  the per-process suffix and undid the collision fix it was paired with.
- Add a wiring guard test that fakes a kwarg-rejecting paho Client and
  asserts the fallback path still ships the suffix to the broker.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/transports_mqtt/manager.py
Comment thread backend/transports_mqtt/manager.py Outdated
Comment thread backend/transports_mqtt/tests/test_manager.py
ljmerza added 2 commits May 1, 2026 20:44
Alarm PINs are 4-8 digits with a 10K-100M keyspace. Stretching them at
the default user-password Argon2 cost (~150-300ms/verify) doesn't
materially raise an offline-brute-force attacker's bar - the keyspace
is so small that the entire space is recovered in minutes-to-hours
even at strong cost - while the slow verify shows up as a multi-
hundred-ms tax on every arm/disarm round trip from the HA MQTT alarm
panel.

Add UserCodeArgon2Hasher (algorithm "argon2-pin", time_cost=1,
memory_cost=8 MiB, parallelism=2; ~30ms/verify) and route
create_user_code / update_user_code through it. User passwords keep
the strong default. Existing UserCode rows hashed under the original
"argon2" algorithm continue to verify on the strong hasher with no
migration required - check_password resolves by hash prefix.

Combined with the PR #46 MQTT-template fix, this gets HA-driven arm/
disarm clicks down from ~2s to subsecond round-trip on the live
home-lab broker.
HA's MQTT alarm panel renders the command template with `code = code or ""`,
so when the user arms without typing a PIN the broker payload is
`{"action":"ARM_HOME","code":""}` — the field is present but empty.

`_handle_command_payload` returned `""` for that case while the arm branch
of `handle_mqtt_alarm_command` checked `raw_code is not None`. Empty string
is not None, so code validation was forced even when `code_arm_required`
was false, and the command failed with "Code required.".

Normalize an empty/whitespace `code` to None at the parser boundary so
both arm and disarm branches see the same "no code" contract. The disarm
branch already used truthiness (`if not raw_code:`) and is unaffected.
Copilot AI review requested due to automatic review settings May 2, 2026 01:06
@ljmerza ljmerza changed the title fix(ha-mqtt): use {{ action }} in alarm command_template + unique broker client_id per process fix(ha-mqtt): restore HA alarm panel arm/disarm end-to-end + PIN perf May 2, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

backend/config/settings.py:117

  • This PR introduces a new password hasher (UserCodeArgon2Hasher) and changes UserCode hashing behavior, but the PR title/description are scoped to HA MQTT fixes. Please either update the PR description to cover this security/perf change (and its rollout considerations), or split it into a separate PR so the MQTT fix can be reviewed/merged independently.
PASSWORD_HASHERS = [
    "django.contrib.auth.hashers.Argon2PasswordHasher",
    "accounts.hashers.UserCodeArgon2Hasher",
    "django.contrib.auth.hashers.PBKDF2PasswordHasher",
    "django.contrib.auth.hashers.PBKDF2SHA1PasswordHasher",
    "django.contrib.auth.hashers.BCryptSHA256PasswordHasher",
]

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/accounts/use_cases/user_codes.py
Comment thread backend/accounts/hashers.py
ljmerza added 2 commits May 1, 2026 21:20
- Drop the `try mqtt.Client(client_id=...) except TypeError: mqtt.Client(client_id)` fallback in `_build_client`. paho-mqtt is pinned >=2.0; the kwarg call always succeeds. If the fallback ever did fire on paho 2.x, the positional arg would be fed to `callback_api_version` (expects an enum) and crash with a worse error than the kwarg call itself. Removed the matching test that locked in the broken behavior.
- Shrink the per-process client_id suffix from 8 hex chars (4 bytes) to 6 hex chars (3 bytes). Default composed `latchpoint-alarm-XXXXXX` now lands at exactly 23 bytes — the MQTT 3.1 spec maximum. paho 2.0+ defaults to 3.1.1 which raises the cap, but staying inside 3.1 protects against stricter brokers. 6 hex still gives 16M distinct values, far above what's needed for 1-2 process collision avoidance.
- Add `test_default_composed_client_id_fits_mqtt_3_1_length_limit` regression guard so a future suffix/prefix bump can't silently push us over.
- Update user-facing strings: settings registry title `"Client ID"` → `"Client ID prefix"` with description explaining the suffix and the 16-char prefix budget; rc=2 connect-error messages now reference "client_id prefix" so users know which knob they're turning.
@ljmerza ljmerza requested review from Copilot and removed request for Copilot May 2, 2026 01:32
@ljmerza ljmerza requested review from Copilot and removed request for Copilot May 2, 2026 01:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread backend/transports_mqtt/manager.py Outdated
Comment thread backend/transports_mqtt/manager.py Outdated
Comment thread backend/alarm/settings_registry.py Outdated
Comment thread backend/transports_mqtt/tests/test_manager.py Outdated
Comment thread backend/transports_mqtt/manager.py Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +408 to +410
base_client_id = str(settings.get("client_id") or "").strip() or "latchpoint-alarm"
suffix = client_id_suffix or self._client_id_suffix
client_id = f"{base_client_id}-{suffix}"
@ljmerza ljmerza merged commit 8e2f4b5 into main May 2, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants