
refactor: migrate Tenstorrent reader to upstream luwen 0.8.5 crates #266

Merged
inureyes merged 5 commits into main from refactor/issue-265-upstream-luwen-crates
Jun 26, 2026

@inureyes

Member

Summary

Replace the four republished fork crates (all-smi-luwen-core/-if/-ref and all-smi-ttkmd-if) with the upstream luwen crates 0.8.5 (published on crates.io, tracking luwen main) in the Tenstorrent NPU reader, and remove the now-obsolete vendoring script. The fork existed only because all-smi is published to crates.io (which forbids git/path deps) and the reader needed APIs that lived solely on luwen main; upstream now ships a current crate set that tracks main, so the fork is no longer necessary.

What changed

  • Cargo.toml: swap the all-smi-luwen-* deps in the cfg(target_os = "linux") block for luwen-api, luwen-pci, and luwen-def = "0.8.5". luwen-kmd (the ttkmd-if successor) is pulled in transitively, so no direct dep is declared, and the stale # Tenstorrent dependencies from GitHub comment is updated.
  • src/device/readers/tenstorrent.rs: remap imports to luwen_api::{ChipDetectOptions, chip::{Chip, ChipImpl, Telemetry}} and luwen_def::Arch; switch detection to luwen_pci::detect_chips_silent (the PCIe-layer entry returning Result<Vec<UninitChip>, _>, not the differently-signed luwen_api::detect_chips_silent), which preserves the existing UninitChip::init loop; annotate the Arch match with #[allow(deprecated)] because Arch::Grayskull is now deprecated upstream and CI runs clippy with -D warnings; correct the stale init-error comment since InitError::PlatformError can occur even with an Infallible callback.
  • scripts/vendor_all_luwen.sh: removed. The luwen-core/-if/-ref/ttkmd-if directories no longer exist on luwen main, so the script can no longer re-vendor against current upstream.
  • Cargo.lock: regenerated to lock luwen-api/luwen-def/luwen-pci/luwen-kmd 0.8.5 and drop the fork crates.
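
The init loop and the corrected init-error comment can be illustrated with a minimal sketch. Everything below is a stand-in written for illustration: `UninitChip`, `Chip`, `InitError`, and `detect_stub` are hypothetical simplifications, not the real luwen definitions or signatures. The point it tries to show is only that a callback typed as `Infallible` removes the callback-error case, while a platform-side failure can still surface and has to be handled.

```rust
use std::convert::Infallible;

// Stand-in types: hypothetical simplifications, not the real luwen definitions.
#[derive(Debug)]
pub enum InitError<E> {
    PlatformError(String),
    CallbackError(E),
}

pub struct Chip {
    pub name: String,
}

pub struct UninitChip {
    pub name: String,
    pub healthy: bool,
}

impl UninitChip {
    /// Initializes the chip, reporting progress through `callback`.
    pub fn init<E>(
        self,
        callback: &mut impl FnMut(&str) -> Result<(), E>,
    ) -> Result<Chip, InitError<E>> {
        callback(&self.name).map_err(InitError::CallbackError)?;
        if self.healthy {
            Ok(Chip { name: self.name })
        } else {
            Err(InitError::PlatformError(format!(
                "{} failed to initialize",
                self.name
            )))
        }
    }
}

/// Stand-in for a detection entry point that returns uninitialized chips.
pub fn detect_stub() -> Result<Vec<UninitChip>, String> {
    Ok(vec![
        UninitChip { name: "chip0".into(), healthy: true },
        UninitChip { name: "chip1".into(), healthy: false },
    ])
}

/// Initializes every detected chip and collects the failures separately.
pub fn init_all() -> (Vec<Chip>, Vec<String>) {
    let mut chips = Vec::new();
    let mut errors = Vec::new();
    let uninit = match detect_stub() {
        Ok(list) => list,
        Err(e) => return (chips, vec![e]),
    };
    for chip in uninit {
        // The callback cannot fail, yet init can still return PlatformError.
        let mut callback = |_status: &str| Ok::<(), Infallible>(());
        match chip.init(&mut callback) {
            Ok(chip) => chips.push(chip),
            Err(InitError::PlatformError(msg)) => errors.push(msg),
            // Infallible has no values, so this arm can never run.
            Err(InitError::CallbackError(never)) => match never {},
        }
    }
    (chips, errors)
}
```

In this sketch `init_all()` keeps the healthy chip and records one error for the unhealthy one, which mirrors the "skip and continue" shape of a per-chip init loop.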

The Telemetry, DeviceInfo, and ChipDetectOptions surfaces the reader uses are unchanged in name and signature, so device name, board type, telemetry detail fields, PCIe details, and the power/temperature/clock/utilization metrics are preserved. This is a like-for-like migration; surfacing the richer 0.8.x telemetry fields stays out of scope per the issue.

Test plan

  • cargo check --lib resolves and fetches luwen 0.8.5, updates Cargo.lock, and finishes clean.
  • cargo clippy --lib --tests -- -D warnings is clean; the deprecated Arch::Grayskull is handled by the #[allow(deprecated)] annotation.
  • cargo fmt --check is clean.
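
As a rough illustration of the clippy point above, the sketch below defines a stand-in `Arch` enum with a deprecated variant and a match that still names it. The enum and `arch_label` are hypothetical and only mimic the shape of the situation; the real `luwen_def::Arch` may have different variants.

```rust
// Stand-in enum: hypothetical, mimicking an upstream type with a deprecated variant.
pub enum Arch {
    #[deprecated(note = "Grayskull support has been sunset")]
    Grayskull,
    Wormhole,
    Blackhole,
}

/// Maps an architecture to a display label.
///
/// Without the attribute, naming `Arch::Grayskull` emits a `deprecated`
/// warning, which a build that denies warnings would turn into an error.
#[allow(deprecated)]
pub fn arch_label(arch: Arch) -> &'static str {
    match arch {
        Arch::Grayskull => "Grayskull",
        Arch::Wormhole => "Wormhole",
        Arch::Blackhole => "Blackhole",
    }
}
```

Scoping the attribute to the one function keeps the match exhaustive while leaving the deprecation lint active everywhere else.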

Closes #265

Replace the four republished fork crates (all-smi-luwen-core/-if/-ref and all-smi-ttkmd-if) with the upstream luwen crates published on crates.io. The fork originally existed because all-smi is published to crates.io (which forbids git/path deps) and the reader needed APIs that only lived on luwen main; upstream now publishes a current crate set (0.8.5) that tracks main, so the fork is no longer necessary.

Changes:

- Cargo.toml: swap the all-smi-luwen-* deps in the Linux target block for luwen-api, luwen-pci, and luwen-def 0.8.5 (luwen-kmd, the ttkmd-if successor, is pulled in transitively).
- src/device/readers/tenstorrent.rs: remap imports to luwen_api::{ChipDetectOptions, chip::*} and luwen_def::Arch; switch detection to luwen_pci::detect_chips_silent, which returns Result<Vec<UninitChip>, _> and preserves the existing init loop; annotate the Arch match with #[allow(deprecated)] because Arch::Grayskull is now deprecated upstream and CI runs clippy with -D warnings; fix the stale init-error comment since InitError::PlatformError can occur even with an Infallible callback.
- scripts/vendor_all_luwen.sh: removed. The luwen-core/-if/-ref/ttkmd-if directories no longer exist on luwen main, so the vendoring script can no longer re-vendor against current upstream.

The Telemetry, DeviceInfo, and ChipDetectOptions surfaces are unchanged, so reader output is preserved. Verified with cargo check --lib and cargo clippy --lib --tests -- -D warnings (both clean).
@inureyes inureyes added status:review Under review type:refactor Code refactoring priority:medium Medium priority issue device:npu NPU (Neural Processing Unit) related labels Jun 26, 2026
luwen 0.8.x has sunset Grayskull support. detect_chips_silent scans every node under /dev/tenstorrent and opens each one; converting a Grayskull PCI id (0xfaca) to an Arch reaches unimplemented!() deep in luwen-kmd (PciDevice::open -> Arch::try_from), which panics instead of returning an error. That panic unwound through the reader's Ok/Err match and poisoned the INITIALIZED_CHIPS lock, so a present Grayskull card crashed Tenstorrent collection rather than degrading gracefully the way the previous all-smi-luwen fork did.
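
The poisoning mechanism described above can be reproduced in isolation. This is a minimal sketch under stated assumptions: `detect_stub` and `fill_cache_unguarded` are hypothetical names, the cache is a plain `Mutex<Vec<String>>` rather than the reader's actual `INITIALIZED_CHIPS` type, and the outer `catch_unwind` only stands in for whatever boundary eventually stops the unwind.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Mutex;

/// Stand-in for a detection call that panics on unsupported hardware.
fn detect_stub() -> Vec<String> {
    unimplemented!("unsupported device")
}

/// Runs detection while holding the cache lock, so a panic poisons it.
pub fn fill_cache_unguarded(cache: &Mutex<Vec<String>>) {
    let mut guard = cache.lock().unwrap();
    *guard = detect_stub(); // panics while the guard is still alive
}

/// Returns true if a panic inside the locked section poisoned the mutex.
pub fn demo_poisoning() -> bool {
    let cache = Mutex::new(Vec::new());
    // The outer catch only stops the unwind so the result can be inspected.
    let _ = catch_unwind(AssertUnwindSafe(|| fill_cache_unguarded(&cache)));
    cache.is_poisoned()
}
```

Because the guard is dropped during unwinding, every later `lock()` on that mutex returns an error, which is why a single panic can take down collection for the rest of the process.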

Wrap the detection call in catch_unwind (matching the existing reader convention in amd.rs and intel_gpu_engine.rs) so an unsupported device degrades to a status message. The release profile still sets panic = "abort", so a Grayskull host aborts there; upstream removing Grayskull means a complete in-process fix is out of scope for this like-for-like migration.
@inureyes

Member Author

Implementation Review Summary

Intent

Like-for-like migration of the Tenstorrent NPU reader from the republished all-smi-luwen-* / all-smi-ttkmd-if forks to upstream luwen 0.8.5 (luwen-api / luwen-pci / luwen-def, with luwen-kmd transitive), plus removal of the obsolete vendoring script. Runtime behavior is meant to be identical.

Migration correctness (original commit 57879bc)

  • Imports remapped exactly per the issue's verified table: luwen_def::Arch; luwen_api::{ChipDetectOptions, chip::{Chip, ChipImpl, Telemetry}}; detection via luwen_pci::detect_chips_silent (the one-arg PCIe entry returning Result<Vec<UninitChip>, _>, not the differently-signed luwen_api::detect_chips_silent).
  • ChipImpl is in scope for the chip.get_telemetry() / chip.get_device_info() trait methods. DeviceInfo is accessed via inference on the get_device_info() return, exactly as the fork code did, so no import is needed.
  • Arch::Grayskull compile-time deprecation handled with #[allow(deprecated)]; cargo clippy --lib --tests -- -D warnings and cargo fmt --check are clean on aarch64-linux where the reader is actually compiled.
  • Cargo.lock locks luwen-api / luwen-def / luwen-kmd / luwen-pci at 0.8.5 and drops the fork crates; luwen-kmd is present transitively as intended.
  • No all-smi-luwen-* / all-smi-ttkmd references remain anywhere except the intentional historical comment in Cargo.toml.
  • scripts/vendor_all_luwen.sh removed; no dangling references to it.
  • Reader still wired through reader_factory.rs (gated by has_tenstorrent()); the TenstorrentReader and get_tenstorrent_status_message public surface is unchanged.
  • PR body contains Closes #265.

Findings Addressed

  • Latent runtime panic on Grayskull hardware (HIGH) fixed in d1c0e78.

    luwen_pci::detect_chips_silent calls start_detect, which runs PciDevice::scan() over every node under /dev/tenstorrent and opens each one. PciDevice::open resolves the arch via Arch::try_from(&GetDeviceInfoOut), and for a Grayskull device id (0xfaca) that conversion is unimplemented!("grayskull support has been sunset") in luwen-kmd 0.8.5 (an unknown id returns Err gracefully; only Grayskull panics). The reader's match detect_chips_silent(options) { Ok / Err } only handles Result, so the panic unwound through the reader and poisoned the INITIALIZED_CHIPS mutex. A physically present Grayskull card (e75 / e150 / e300) therefore crashed Tenstorrent collection, whereas the previous all-smi-luwen fork rendered it as Tenstorrent Grayskull .... This is a regression against the issue's "output unchanged on real hardware" acceptance criterion that the compile-time #[allow(deprecated)] did not cover.

    The fix wraps the detection call in std::panic::catch_unwind(AssertUnwindSafe(...)), matching the existing reader convention in amd.rs and intel_gpu_engine.rs. An unsupported device now degrades to a status message instead of unwinding and poisoning the cache lock. Verified clippy -D warnings clean, fmt --check clean, change confined to src/device/readers/tenstorrent.rs, and behavior unchanged for Wormhole / Blackhole / no-card hosts.
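
A minimal sketch of the guard pattern follows. It assumes a hypothetical `ReaderState` with a chip cache and a status slot, and a `detect_stub` that panics the way an unsupported device would; none of these names come from the actual reader. It only applies to builds that unwind on panic.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::Mutex;

/// Stand-in for the detection call; panics the way an unsupported device would.
fn detect_stub(unsupported: bool) -> Result<Vec<String>, String> {
    if unsupported {
        unimplemented!("grayskull support has been sunset");
    }
    Ok(vec!["Wormhole".to_string()])
}

/// Hypothetical reader state: a chip cache plus a user-visible status message.
pub struct ReaderState {
    chips: Mutex<Option<Vec<String>>>,
    status: Mutex<Option<String>>,
}

impl ReaderState {
    pub fn new() -> Self {
        Self { chips: Mutex::new(None), status: Mutex::new(None) }
    }

    /// Detects chips once; a panic or error becomes a status message.
    pub fn chip_count(&self, unsupported: bool) -> usize {
        let mut chips = self.chips.lock().unwrap();
        if chips.is_none() {
            // The panic is caught here, before it can unwind past the guard.
            let outcome = catch_unwind(AssertUnwindSafe(|| detect_stub(unsupported)));
            let detected = match outcome {
                Ok(Ok(list)) => list,
                Ok(Err(e)) => {
                    *self.status.lock().unwrap() = Some(format!("detection failed: {e}"));
                    Vec::new()
                }
                Err(_) => {
                    *self.status.lock().unwrap() =
                        Some("detection panicked (unsupported device?)".to_string());
                    Vec::new()
                }
            };
            *chips = Some(detected);
        }
        chips.as_ref().map_or(0, Vec::len)
    }

    /// Returns the last recorded status message, if any.
    pub fn status(&self) -> Option<String> {
        self.status.lock().unwrap().clone()
    }

    /// Reports whether the chip cache lock has been poisoned.
    pub fn cache_poisoned(&self) -> bool {
        self.chips.is_poisoned()
    }
}
```

In this sketch a panicking detection leaves the cache lock usable and records a status string, while a successful detection records nothing.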

Remaining Items

  • Release-build abort on Grayskull hosts (HIGH, partially mitigated): [profile.release] sets panic = "abort", which bypasses catch_unwind, so a present Grayskull card still aborts in the shipping release binary (the guard fully protects dev / test / unwind builds and prevents the mutex poisoning). luwen 0.8.x removed Grayskull entirely (unimplemented!() throughout luwen-kmd / luwen-pci / luwen-api), so a complete in-process fix is not possible without either pre-filtering /dev/tenstorrent device ids via sysfs before invoking luwen, or relaxing panic = "abort" for this path. Both are outside the scope of this like-for-like migration. Recommend tracking as a follow-up: either a note on #265 ("Migrate Tenstorrent support to upstream luwen crates (0.8.x) and retire the all-smi-luwen-* republished forks") or a new issue. Mitigating context: Grayskull is EOL / sunset legacy hardware and increasingly rare; Wormhole / Blackhole and no-hardware hosts are unaffected.
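
The pre-filtering option can be sketched as a pure classification step that returns a value instead of panicking. The device ids are the ones quoted in this PR (0xfaca, 0x401e, 0xb140); `DeviceClass`, `classify_device_id`, and `safe_to_probe` are hypothetical names used only for illustration.

```rust
/// Hypothetical classification of a Tenstorrent PCI device id.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceClass {
    Supported(&'static str),
    Sunset(&'static str),
    Unknown(u16),
}

/// Classifies a device id without ever panicking.
pub fn classify_device_id(device_id: u16) -> DeviceClass {
    match device_id {
        0xfaca => DeviceClass::Sunset("Grayskull"),
        0x401e => DeviceClass::Supported("Wormhole"),
        0xb140 => DeviceClass::Supported("Blackhole"),
        other => DeviceClass::Unknown(other),
    }
}

/// Returns false if any listed device is one the library would panic on.
pub fn safe_to_probe(device_ids: &[u16]) -> bool {
    !device_ids
        .iter()
        .any(|id| matches!(classify_device_id(*id), DeviceClass::Sunset(_)))
}
```

A check of this shape works under panic = "abort" because it never enters the code path that panics; unknown ids are deliberately left to the library, which the review notes handles them with an ordinary error.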

Verification

  • All stated requirements implemented
  • No placeholder/mock code remaining
  • Integrated into project code flow
  • Project conventions followed
  • Existing modules reused where applicable
  • No unintended structural changes
  • Tests compile clean (cargo clippy --lib --tests -- -D warnings and cargo fmt --check green; full cargo test not run, per the narrow-scope build instruction)

inureyes added 2 commits June 27, 2026 05:45
…pre-check

luwen 0.8.x panics (`unimplemented!`) when opening a Grayskull device, and
detection opens every /dev/tenstorrent node, so a single Grayskull card would
crash Tenstorrent collection for the whole host. Under the release
`panic = "abort"` profile the existing catch_unwind cannot recover, so the
binary aborts.

Detect Grayskull (PCI vendor 0x1e52, device 0xfaca) from sysfs before calling
luwen and skip detection with a status message. The check is conservative: it
reports true only on a positive vendor+device match, so Wormhole (0x401e) and
Blackhole (0xb140) hosts are never affected, and any sysfs read failure falls
through to normal detection. catch_unwind stays as defense in depth for other
unexpected panics.

luwen 0.8.x has sunset Grayskull, so the reader no longer detects it (it is
skipped via a sysfs pre-check). Update the supported-hardware list to reflect
Wormhole and Blackhole only. The historical Devlog is left as-is.
@inureyes

Member Author

Follow-up: remaining HIGH resolved in-PR

The review's remaining HIGH (a Grayskull host aborts under the release panic = "abort" profile, where catch_unwind cannot recover) is now fixed in this PR rather than deferred.

33749e9 sysfs Grayskull pre-check. Before invoking luwen detection (which opens every /dev/tenstorrent node and panics via unimplemented!() on the Grayskull PCI device id 0xfaca), the reader scans sysfs for a Tenstorrent device (PCI vendor 0x1e52) with the Grayskull device id (0xfaca). If one is present it sets a status message and skips luwen entirely, so no panic is reached in any build profile, including release panic = "abort". The check is conservative: it returns true only on a positive vendor+device match, so Wormhole (0x401e) and Blackhole (0xb140) hosts are never affected, and any sysfs read failure falls through to normal detection. catch_unwind is kept as defense in depth for other unexpected panics. The device-id to arch mapping was verified against luwen-kmd 0.8.5 (src/lib.rs:29-32).
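
A conservative pre-check of this kind might look like the sketch below. It is not the code from 33749e9: the function name `grayskull_present_under`, the helper `read_hex_id`, and the directory layout (one subdirectory per PCI device containing `vendor` and `device` files holding values such as `0x1e52`) are assumptions made for illustration. Taking the root directory as a parameter is also an assumption, chosen so the logic can be exercised against a temporary directory.

```rust
use std::fs;
use std::path::Path;

const TENSTORRENT_VENDOR: u16 = 0x1e52;
const GRAYSKULL_DEVICE: u16 = 0xfaca;

/// Parses a sysfs id file such as "0x1e52\n" into a u16.
fn read_hex_id(path: &Path) -> Option<u16> {
    let text = fs::read_to_string(path).ok()?;
    let hex = text.trim().trim_start_matches("0x");
    u16::from_str_radix(hex, 16).ok()
}

/// Returns true only on a positive vendor+device match under `pci_root`.
pub fn grayskull_present_under(pci_root: &Path) -> bool {
    let Ok(entries) = fs::read_dir(pci_root) else {
        // Unreadable directory: report "not present" so normal detection runs.
        return false;
    };
    // Entries that cannot be read are skipped rather than treated as matches.
    entries.flatten().any(|entry| {
        let dir = entry.path();
        read_hex_id(&dir.join("vendor")) == Some(TENSTORRENT_VENDOR)
            && read_hex_id(&dir.join("device")) == Some(GRAYSKULL_DEVICE)
    })
}
```

On a Linux host the root would presumably be something like `/sys/bus/pci/devices`, though the PR text does not name the exact paths. Every failure mode (missing directory, unreadable file, unparsable id) resolves to `false`, which matches the stated goal that Wormhole and Blackhole hosts are never affected.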

8060647 docs. README supported-hardware list updated to Wormhole/Blackhole only (Grayskull dropped, since luwen 0.8.x sunset it). The historical Devlog is left as-is.

Verification (default features, aarch64-linux): cargo check --lib, cargo clippy --all-targets -- -D warnings, and cargo fmt --check all clean. The optional furiosa feature fails to build in my local environment due to a missing libclang stdarg.h; that is environmental and unrelated to this PR.

Security/perf: the new code is a read-only scan of fixed sysfs paths with no external input (no traversal or injection surface), runs once and is cached, and degrades gracefully on any read error.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Jun 26, 2026
The man page still listed "Grayskull and Wormhole" after the upstream
luwen 0.8.x migration dropped Grayskull support. Align with the README,
which already says "Wormhole and Blackhole" in the Linux features section
and the supported-hardware table.
@inureyes

Member Author

PR Finalization

Summary

Lint/Format: cargo fmt --check, cargo clippy --lib --tests -- -D warnings, and cargo check --lib all pass clean (default features only).

Tests: No new tests added. The grayskull_present() guard reads fixed sysfs paths and has no existing unit test infrastructure; the reader requires real hardware. A fabricated test would add no meaningful coverage.

Documentation: One concrete gap found and fixed. The man page (docs/man/all-smi.1) still described Tenstorrent support as "Grayskull and Wormhole architectures" after this migration dropped Grayskull. Updated to "Wormhole and Blackhole architectures" to match the README, which already reflects the correct hardware list. No other user-facing docs reference Grayskull. docs/Devlog-Tenstorrent.md is a historical log and was left unchanged.

Commit pushed: 7a585f4 docs(man): update Tenstorrent hardware list to Wormhole/Blackhole

All checks passing. Ready for merge.

@inureyes inureyes merged commit 0ea1386 into main Jun 26, 2026
4 checks passed
@inureyes inureyes deleted the refactor/issue-265-upstream-luwen-crates branch June 26, 2026 21:49
