Skip to content

fix: stop unexpected NodeNum regeneration from PKI key loss#10808

Merged
thebentern merged 2 commits into
developfrom
fix/nodenum-identity-preservation
Jun 28, 2026
Merged

fix: stop unexpected NodeNum regeneration from PKI key loss#10808
thebentern merged 2 commits into
developfrom
fix/nodenum-identity-preservation

Conversation

@thebentern

@thebentern thebentern commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Problem

Reports of nodes occasionally generating a new NodeNum, beyond the expected one-time NodeNum assignment during initial PKI key generation.

Root cause

A node's NodeNum is a pure function of its PKI public key:

my_node_num == crc32(config.security.public_key)

createNewIdentity() already no-ops when the derived number is unchanged, and the public key is recomputed deterministically from the private key on every boot. So the NodeNum changes only when the keypair is regenerated, which happens only when a keygen path runs while config.security.private_key.size != 32. An unexpected NodeNum change therefore means the private key went missing at the moment a keygen path ran. Two paths could do that without the user intending a new identity:

1. Boot: a corrupt/transient config read wiped the keypair

loadFromDisk() collapsed every non-success loadProto() result for the config file into installDefaultConfig() (preserveKey=false), which memset()s config and destroys the private key → fresh keypair → new NodeNum.

But loadProto() returns distinct codes:

  • DECODE_FAILED — file present but couldn't be decoded/decrypted this boot (flash bit-rot, torn write, transient decrypt failure)
  • OTHER_FAILURE — file genuinely absent (true first boot)

Treating a transient/corrupt read identically to "no config at all" silently minted a brand-new identity. (loadProto() also zeroes the destination before the failed decode, so the key is already gone from RAM — nothing could preserve it after the fact.)

2. AdminModule security SET clobbered the keypair

config.security = c.payload_variant.security; wholesale-replaced the keypair, then regenerated whenever the incoming private_key.size != 32. A client that edits an unrelated security field (managed mode, admin keys, serial console, …) without round-tripping the private key would silently re-mint identity.

Fix

Boot (NodeDB.cpp): a new configDecodeFailed flag, set on DECODE_FAILED, gates the boot keygen and the boot config-save directly (not via region — robust against the userprefs/region overrides that run later in loadFromDisk()). On a decode failure the node:

  • freezes its identity (no keygen → NodeNum unchanged, recovered from the separately-stored devicestate);
  • leaves the unreadable config file untouched, so a transient failure self-heals on the next clean boot instead of being overwritten with defaults;
  • boots radio-silent (region UNSET, TX off) until a good config is read back.

A genuinely-absent config (OTHER_FAILURE) still gets defaults + keygen as before.

AdminModule: when a security SET omits the private key but we already hold a valid one, the existing keypair is preserved and only the other security fields are applied. First provisioning and key import are unchanged; intentional reset goes through factory_reset.

Review

These changes went through an adversarial multi-lens review; findings addressed:

  • [HIGH] USERPREFS_CONFIG_LORA_REGION re-set region after the degraded branch, defeating a region-based keygen guard → switched to gating keygen directly via the flag (no dependence on region surviving).
  • [MEDIUM] the end-of-init saveToDisk could persist the degraded UNSET defaults over a transiently-unreadable file → boot config-save is now suppressed on a degraded boot.
  • [LOW] PKI-excluded builds still clobbered the stored key → key-preserve moved out of the PKI #if.

Known limitation (follow-up)

If a config is permanently corrupt (not transient), recovering the node by setting the region still routes through ensurePkiKeys() and mints a new keypair → new NodeNum. This is inherent to genuine private-key loss and is now gated behind an explicit user action rather than happening silently on boot. The durable fix is to persist the PKI private key in its own small, atomically-written file independent of the LocalConfig blob; tracked separately.

Testing

  • Native PlatformIO unit suite (Docker, CI parity): 499/499 test cases pass.

Local verification: native PlatformIO suite (Docker, CI parity) — 499/499 test cases pass.

@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

⚡ Try this PR in the Web Flasher

Flash this PR in the Web Flasher

firmware commit boards expires

Warning

This is an automated, unreviewed CI test build. Back up your device configuration
before flashing, and only flash devices you are able to recover.

Supported boards built by this PR (25)
Device Board Platform
Crowpanel Adv 3.5 TFT elecrow-adv-35-tft esp32-s3
Heltec HT62 heltec-ht62-esp32c3-sx1262 esp32-c3
Heltec Mesh Node 096 heltec-mesh-node-t096 nrf52840
Heltec Mesh Node T1 heltec-mesh-node-t1 nrf52840
Heltec Mesh Node T114 heltec-mesh-node-t114 nrf52840
Heltec V3 heltec-v3 esp32-s3
Heltec V4 heltec-v4 esp32-s3
Raspberry Pi Pico pico rp2040
Raspberry Pi Pico W picow rp2040
RAK WisMesh Tag rak_wismeshtag nrf52840
RAK WisBlock 11200 rak11200 esp32
RAK WisBlock 11310 rak11310 rp2040
RAK3312 rak3312 esp32-s3
RAK WisBlock 4631 rak4631 nrf52840
Seeed Wio Tracker L1 seeed_wio_tracker_L1 nrf52840
Seeed Xiao NRF52840 Kit seeed_xiao_nrf52840_kit nrf52840
Seeed Xiao ESP32-S3 seeed-xiao-s3 esp32-s3
Station G2 station-g2 esp32-s3
Station G3 station-g3 esp32-s3
LILYGO T-Deck t-deck-tft esp32-s3
LILYGO T-Echo t-echo nrf52840
LILYGO T-Echo Plus t-echo-plus nrf52840
LILYGO T-Impulse Plus t-impulse-plus nrf52840
LilyGo T3-C6 tlora-c6 esp32-c6
Seeed SenseCAP T1000-E tracker-t1000-e nrf52840

Build artifacts expire on 2026-07-28. Updated for da186b6.

@github-actions github-actions Bot added the bugfix Pull request that fixes bugs label Jun 28, 2026
@thebentern thebentern added bugfix Pull request that fixes bugs and removed bugfix Pull request that fixes bugs labels Jun 28, 2026
@thebentern thebentern requested review from NomDeTom and Copilot June 28, 2026 17:37
A node's NodeNum is derived from its PKI public key
(my_node_num == crc32(public_key)), so it changes only when the keypair is
regenerated -- which happens only when a keygen path runs while
config.security.private_key.size != 32. Two paths could trigger that without
the user intending a new identity:

1. Boot: loadFromDisk() collapsed every non-success loadProto() result for the
   config file into installDefaultConfig() (preserveKey=false), which memset()s
   config and wipes the private key. loadProto() distinguishes DECODE_FAILED
   (file present but undecodable/undecryptable) from OTHER_FAILURE (absent), so
   a transient/corrupt read silently minted a new identity. A new
   configDecodeFailed flag now gates the boot keygen and the boot config-save
   directly (not via region, so it survives the userprefs/region overrides that
   run later in loadFromDisk): identity is frozen, the unreadable file is left
   intact for a later clean boot to recover, and the node boots radio-silent
   (region UNSET, TX off). A genuinely-absent config still gets defaults + keygen.

2. AdminModule security SET: config.security = c.payload_variant.security
   wholesale-clobbered the keypair, then regenerated whenever the incoming
   private_key.size != 32. A client editing an unrelated security field without
   round-tripping the private key would re-mint identity. The existing keypair
   is now preserved when the incoming config omits it; first provisioning and
   key import are unchanged. Intentional reset goes through factory_reset.

Native PlatformIO unit suite (Docker): 499/499 test cases pass.
@thebentern thebentern force-pushed the fix/nodenum-identity-preservation branch from 864b587 to da186b6 Compare June 28, 2026 17:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Prevents unintended node identity (NodeNum) changes caused by accidental PKI key loss, by distinguishing transient/corrupt config decode failures from true “no config” cases at boot and by preserving an existing identity keypair on partial/legacy security config updates.

Changes:

  • AdminModule: preserve the existing PKI keypair when a security SET omits a valid 32-byte private key, avoiding silent identity regeneration.
  • NodeDB: introduce configDecodeFailed to (a) skip boot-time keygen and (b) suppress boot-time config persistence when the config file is present but fails to decode/decrypt.
  • NodeDB: keep radios silent on degraded boots by forcing region UNSET and tx_enabled=false, and avoid applying userprefs region overrides in that state.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/modules/AdminModule.cpp Preserves identity keypair on partial security SETs; only generates/derives keys when genuinely missing.
src/mesh/NodeDB.h Adds configDecodeFailed state to track degraded config-load boots.
src/mesh/NodeDB.cpp Treats DECODE_FAILED as a degraded boot: skip keygen and skip config overwrite; keep radio silent until recovery.
Comments suppressed due to low confidence (1)

src/mesh/NodeDB.cpp:2347

  • The comment here says the config file is "absent" on any non-successful load, but LoadFileResult::OTHER_FAILURE also covers other open/read failures (not just first boot). This can mislead future debugging and may hide that we're still treating some I/O failures like a missing config.
    // Always-apply LoRa overrides: applied after loading saved config so they

Comment on lines +1045 to +1050
if (incoming.private_key.size != 32 && config.security.private_key.size == 32) {
LOG_WARN("Security set omitted private key; preserving existing identity keypair");
incoming.private_key = config.security.private_key;
incoming.public_key = config.security.public_key;
}
config.security = incoming;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added two native unit tests in test/test_admin_radio: test_handleSetConfig_security_preservesKeypairWhenPrivateOmitted (a security SET that omits the private key keeps the existing keypair) and test_handleSetConfig_security_acceptsSuppliedKeypair (a full supplied keypair is applied). AdminModuleTestShim defers saves so the lightweight harness doesn't saveToDisk an uninitialized node database. Full native suite: 501/501.

Comment thread src/mesh/NodeDB.cpp Outdated
@github-actions

Copy link
Copy Markdown
Contributor

Firmware Size Report

22 targets | vs develop: 22 increased, net +181,420 (+177.2 KB)

Target Size vs develop
heltec-vision-master-e213-inkhud 2,227,696 📈 +11,568 (+11.3 KB)
heltec-v4 2,277,616 📈 +10,000 (+9.8 KB)
t-eth-elite 2,491,776 📈 +9,888 (+9.7 KB)
elecrow-adv-35-tft 3,416,896 📈 +9,808 (+9.6 KB)
heltec-ht62-esp32c3-sx1262 2,135,360 📈 +9,216 (+9.0 KB)
Show 17 more target(s)
Target Size vs develop
heltec-v3 2,264,384 📈 +9,168 (+9.0 KB)
tlora-c6 2,368,640 📈 +9,088 (+8.9 KB)
rak11200 1,860,944 📈 +8,960 (+8.8 KB)
seeed-xiao-s3 2,276,448 📈 +8,704 (+8.5 KB)
rak3312 2,272,560 📈 +8,656 (+8.5 KB)
station-g2 2,266,688 📈 +8,496 (+8.3 KB)
station-g3 2,266,704 📈 +8,496 (+8.3 KB)
picow 1,245,080 📈 +8,080 (+7.9 KB)
t-deck-tft 3,810,992 📈 +7,824 (+7.6 KB)
pico2w 1,220,772 📈 +7,704 (+7.5 KB)
pico 782,912 📈 +7,648 (+7.5 KB)
seeed_xiao_rp2040 781,112 📈 +7,648 (+7.5 KB)
rak11310 805,648 📈 +7,480 (+7.3 KB)
pico2 770,032 📈 +7,240 (+7.1 KB)
seeed_xiao_rp2350 768,176 📈 +7,240 (+7.1 KB)
rak3172 186,408 📈 +4,300 (+4.2 KB)
wio-e5 238,668 📈 +4,208 (+4.1 KB)

Updated for 4a86a06

Addresses PR review feedback:
- Add native unit tests (test_admin_radio) asserting handleSetConfig preserves the
  existing identity keypair when a security SET omits the private key, and applies a
  full supplied keypair (key import). Guards against reintroduced NodeNum/keypair
  regeneration. AdminModuleTestShim (now a friend) defers saves so the lightweight
  harness doesn't saveToDisk an uninitialized node database.
- Clarify the non-DECODE_FAILED config-load comment: OTHER_FAILURE / NO_FILESYSTEM
  cover an absent or unopenable file, not just first boot.
@thebentern thebentern merged commit 0488a46 into develop Jun 28, 2026
77 of 79 checks passed
@thebentern thebentern deleted the fix/nodenum-identity-preservation branch June 28, 2026 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Pull request that fixes bugs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants