fix: stop unexpected NodeNum regeneration from PKI key loss#10808
Conversation
⚡ Try this PR in the Web FlasherWarning This is an automated, unreviewed CI test build. Back up your device configuration Supported boards built by this PR (25)
Build artifacts expire on 2026-07-28. Updated for |
A node's NodeNum is derived from its PKI public key (my_node_num == crc32(public_key)), so it changes only when the keypair is regenerated -- which happens only when a keygen path runs while config.security.private_key.size != 32. Two paths could trigger that without the user intending a new identity: 1. Boot: loadFromDisk() collapsed every non-success loadProto() result for the config file into installDefaultConfig() (preserveKey=false), which memset()s config and wipes the private key. loadProto() distinguishes DECODE_FAILED (file present but undecodable/undecryptable) from OTHER_FAILURE (absent), so a transient/corrupt read silently minted a new identity. A new configDecodeFailed flag now gates the boot keygen and the boot config-save directly (not via region, so it survives the userprefs/region overrides that run later in loadFromDisk): identity is frozen, the unreadable file is left intact for a later clean boot to recover, and the node boots radio-silent (region UNSET, TX off). A genuinely-absent config still gets defaults + keygen. 2. AdminModule security SET: config.security = c.payload_variant.security wholesale-clobbered the keypair, then regenerated whenever the incoming private_key.size != 32. A client editing an unrelated security field without round-tripping the private key would re-mint identity. The existing keypair is now preserved when the incoming config omits it; first provisioning and key import are unchanged. Intentional reset goes through factory_reset. Native PlatformIO unit suite (Docker): 499/499 test cases pass.
864b587 to
da186b6
Compare
There was a problem hiding this comment.
Pull request overview
Prevents unintended node identity (NodeNum) changes caused by accidental PKI key loss, by distinguishing transient/corrupt config decode failures from true “no config” cases at boot and by preserving an existing identity keypair on partial/legacy security config updates.
Changes:
- AdminModule: preserve the existing PKI keypair when a security SET omits a valid 32-byte private key, avoiding silent identity regeneration.
- NodeDB: introduce
configDecodeFailedto (a) skip boot-time keygen and (b) suppress boot-time config persistence when the config file is present but fails to decode/decrypt. - NodeDB: keep radios silent on degraded boots by forcing region
UNSETandtx_enabled=false, and avoid applying userprefs region overrides in that state.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/modules/AdminModule.cpp | Preserves identity keypair on partial security SETs; only generates/derives keys when genuinely missing. |
| src/mesh/NodeDB.h | Adds configDecodeFailed state to track degraded config-load boots. |
| src/mesh/NodeDB.cpp | Treats DECODE_FAILED as a degraded boot: skip keygen and skip config overwrite; keep radio silent until recovery. |
Comments suppressed due to low confidence (1)
src/mesh/NodeDB.cpp:2347
- The comment here says the config file is "absent" on any non-successful load, but
LoadFileResult::OTHER_FAILUREalso covers other open/read failures (not just first boot). This can mislead future debugging and may hide that we're still treating some I/O failures like a missing config.
// Always-apply LoRa overrides: applied after loading saved config so they
| if (incoming.private_key.size != 32 && config.security.private_key.size == 32) { | ||
| LOG_WARN("Security set omitted private key; preserving existing identity keypair"); | ||
| incoming.private_key = config.security.private_key; | ||
| incoming.public_key = config.security.public_key; | ||
| } | ||
| config.security = incoming; |
There was a problem hiding this comment.
Added two native unit tests in test/test_admin_radio: test_handleSetConfig_security_preservesKeypairWhenPrivateOmitted (a security SET that omits the private key keeps the existing keypair) and test_handleSetConfig_security_acceptsSuppliedKeypair (a full supplied keypair is applied). AdminModuleTestShim defers saves so the lightweight harness doesn't saveToDisk an uninitialized node database. Full native suite: 501/501.
Firmware Size Report22 targets | vs
Show 17 more target(s)
Updated for 4a86a06 |
Addresses PR review feedback: - Add native unit tests (test_admin_radio) asserting handleSetConfig preserves the existing identity keypair when a security SET omits the private key, and applies a full supplied keypair (key import). Guards against reintroduced NodeNum/keypair regeneration. AdminModuleTestShim (now a friend) defers saves so the lightweight harness doesn't saveToDisk an uninitialized node database. - Clarify the non-DECODE_FAILED config-load comment: OTHER_FAILURE / NO_FILESYSTEM cover an absent or unopenable file, not just first boot.
Problem
Reports of nodes occasionally generating a new NodeNum, beyond the expected one-time NodeNum assignment during initial PKI key generation.
Root cause
A node's NodeNum is a pure function of its PKI public key:
createNewIdentity()already no-ops when the derived number is unchanged, and the public key is recomputed deterministically from the private key on every boot. So the NodeNum changes only when the keypair is regenerated, which happens only when a keygen path runs whileconfig.security.private_key.size != 32. An unexpected NodeNum change therefore means the private key went missing at the moment a keygen path ran. Two paths could do that without the user intending a new identity:1. Boot: a corrupt/transient config read wiped the keypair
loadFromDisk()collapsed every non-successloadProto()result for the config file intoinstallDefaultConfig()(preserveKey=false), whichmemset()sconfigand destroys the private key → fresh keypair → new NodeNum.But
loadProto()returns distinct codes:DECODE_FAILED— file present but couldn't be decoded/decrypted this boot (flash bit-rot, torn write, transient decrypt failure)OTHER_FAILURE— file genuinely absent (true first boot)Treating a transient/corrupt read identically to "no config at all" silently minted a brand-new identity. (
loadProto()also zeroes the destination before the failed decode, so the key is already gone from RAM — nothing could preserve it after the fact.)2. AdminModule security SET clobbered the keypair
config.security = c.payload_variant.security;wholesale-replaced the keypair, then regenerated whenever the incomingprivate_key.size != 32. A client that edits an unrelated security field (managed mode, admin keys, serial console, …) without round-tripping the private key would silently re-mint identity.Fix
Boot (
NodeDB.cpp): a newconfigDecodeFailedflag, set onDECODE_FAILED, gates the boot keygen and the boot config-save directly (not via region — robust against the userprefs/region overrides that run later inloadFromDisk()). On a decode failure the node:A genuinely-absent config (
OTHER_FAILURE) still gets defaults + keygen as before.AdminModule: when a security SET omits the private key but we already hold a valid one, the existing keypair is preserved and only the other security fields are applied. First provisioning and key import are unchanged; intentional reset goes through
factory_reset.Review
These changes went through an adversarial multi-lens review; findings addressed:
USERPREFS_CONFIG_LORA_REGIONre-set region after the degraded branch, defeating a region-based keygen guard → switched to gating keygen directly via the flag (no dependence on region surviving).saveToDiskcould persist the degraded UNSET defaults over a transiently-unreadable file → boot config-save is now suppressed on a degraded boot.#if.Known limitation (follow-up)
If a config is permanently corrupt (not transient), recovering the node by setting the region still routes through
ensurePkiKeys()and mints a new keypair → new NodeNum. This is inherent to genuine private-key loss and is now gated behind an explicit user action rather than happening silently on boot. The durable fix is to persist the PKI private key in its own small, atomically-written file independent of the LocalConfig blob; tracked separately.Testing
Local verification: native PlatformIO suite (Docker, CI parity) — 499/499 test cases pass.