refactor: migrate Tenstorrent reader to upstream luwen 0.8.5 crates#266
Conversation
Replace the four republished fork crates (all-smi-luwen-core/-if/-ref and all-smi-ttkmd-if) with the upstream luwen crates published on crates.io. The fork originally existed because all-smi is published to crates.io (which forbids git/path deps) and the reader needed APIs that only lived on luwen main; upstream now publishes a current crate set (0.8.5) that tracks main, so the fork is no longer necessary.
Changes:
- Cargo.toml: swap the all-smi-luwen-* deps in the Linux target block for luwen-api, luwen-pci, and luwen-def 0.8.5 (luwen-kmd, the ttkmd-if successor, is pulled in transitively).
- src/device/readers/tenstorrent.rs: remap imports to luwen_api::{ChipDetectOptions, chip::*} and luwen_def::Arch; switch detection to luwen_pci::detect_chips_silent, which returns Result<Vec<UninitChip>, _> and preserves the existing init loop; annotate the Arch match with #[allow(deprecated)] because Arch::Grayskull is now deprecated upstream and CI runs clippy with -D warnings; fix the stale init-error comment since InitError::PlatformError can occur even with an Infallible callback.
- scripts/vendor_all_luwen.sh: removed. The luwen-core/-if/-ref/ttkmd-if directories no longer exist on luwen main, so the vendoring script can no longer re-vendor against current upstream.
The Telemetry, DeviceInfo, and ChipDetectOptions surfaces are unchanged, so reader output is preserved. Verified with cargo check --lib and cargo clippy --lib --tests -- -D warnings (both clean).
luwen 0.8.x has sunset Grayskull support. detect_chips_silent scans every node under /dev/tenstorrent and opens each one; converting a Grayskull PCI id (0xfaca) to an Arch reaches unimplemented!() deep in luwen-kmd (PciDevice::open -> Arch::try_from), which panics instead of returning an error. That panic unwound through the reader's Ok/Err match and poisoned the INITIALIZED_CHIPS lock, so a present Grayskull card crashed Tenstorrent collection rather than degrading gracefully the way the previous all-smi-luwen fork did. Wrap the detection call in catch_unwind (matching the existing reader convention in amd.rs and intel_gpu_engine.rs) so an unsupported device degrades to a status message. The release profile still sets panic = "abort", so a Grayskull host aborts there; upstream removing Grayskull means a complete in-process fix is out of scope for this like-for-like migration.
Implementation Review SummaryIntent
Migration correctness (original commit 57879bc)
Findings Addressed
Remaining Items
Verification
|
…pre-check luwen 0.8.x panics (`unimplemented!`) when opening a Grayskull device, and detection opens every /dev/tenstorrent node, so a single Grayskull card would crash Tenstorrent collection for the whole host. Under the release `panic = "abort"` profile the existing catch_unwind cannot recover, so the binary aborts. Detect Grayskull (PCI vendor 0x1e52, device 0xfaca) from sysfs before calling luwen and skip detection with a status message. The check is conservative: it reports true only on a positive vendor+device match, so Wormhole (0x401e) and Blackhole (0xb140) hosts are never affected, and any sysfs read failure falls through to normal detection. catch_unwind stays as defense in depth for other unexpected panics.
luwen 0.8.x has sunset Grayskull, so the reader no longer detects it (it is skipped via a sysfs pre-check). Update the supported-hardware list to reflect Wormhole and Blackhole only. The historical Devlog is left as-is.
Follow-up: remaining HIGH resolved in-PRThe review's remaining HIGH (a Grayskull host aborts under the release
Verification (default features, aarch64-linux): Security/perf: the new code is a read-only scan of fixed sysfs paths with no external input (no traversal or injection surface), runs once and is cached, and degrades gracefully on any read error. |
The man page still listed "Grayskull and Wormhole" after the upstream luwen 0.8.x migration dropped Grayskull support. Align with the README, which already says "Wormhole and Blackhole" in the Linux features section and the supported-hardware table.
PR FinalizationSummaryLint/Format: Tests: No new tests added. The Documentation: One concrete gap found and fixed. The man page ( Commit pushed: All checks passing. Ready for merge. |
Summary
Replace the four republished fork crates (
all-smi-luwen-core/-if/-refandall-smi-ttkmd-if) with the upstreamluwencrates 0.8.5 (published on crates.io, tracking luwen main) in the Tenstorrent NPU reader, and remove the now-obsolete vendoring script. The fork existed only because all-smi is published to crates.io (which forbids git/path deps) and the reader needed APIs that lived solely on luwen main; upstream now ships a current crate set that tracks main, so the fork is no longer necessary.What changed
Cargo.toml: swap theall-smi-luwen-*deps in thecfg(target_os = "linux")block forluwen-api,luwen-pci, andluwen-def= "0.8.5".luwen-kmd(thettkmd-ifsuccessor) is pulled in transitively, so no direct dep is declared, and the stale# Tenstorrent dependencies from GitHubcomment is updated.src/device/readers/tenstorrent.rs: remap imports toluwen_api::{ChipDetectOptions, chip::{Chip, ChipImpl, Telemetry}}andluwen_def::Arch; switch detection toluwen_pci::detect_chips_silent(the PCIe-layer entry returningResult<Vec<UninitChip>, _>, not the differently-signedluwen_api::detect_chips_silent), which preserves the existingUninitChip::initloop; annotate theArchmatch with#[allow(deprecated)]becauseArch::Grayskullis now deprecated upstream and CI runs clippy with-D warnings; correct the stale init-error comment sinceInitError::PlatformErrorcan occur even with anInfalliblecallback.scripts/vendor_all_luwen.sh: removed. Theluwen-core/-if/-ref/ttkmd-ifdirectories no longer exist on luwen main, so the script can no longer re-vendor against current upstream.Cargo.lock: regenerated to lockluwen-api/luwen-def/luwen-pci/luwen-kmd0.8.5 and drop the fork crates.The
Telemetry,DeviceInfo, andChipDetectOptionssurfaces the reader uses are unchanged in name and signature, so device name, board type, telemetry detail fields, PCIe details, and the power/temperature/clock/utilization metrics are preserved. This is a like-for-like migration; surfacing the richer 0.8.x telemetry fields stays out of scope per the issue.Test plan
cargo check --libresolves and fetches luwen 0.8.5, updatesCargo.lock, and finishes clean.cargo clippy --lib --tests -- -D warningsis clean; the deprecatedArch::Grayskullis handled by the#[allow(deprecated)]annotation.cargo fmt --checkis clean.Closes #265