Skip to content

feat: cryptoki via UNIX socket#3421

Merged
Bravo555 merged 6 commits intothin-edge:mainfrom
Bravo555:feat/hsm/unix-socket
Mar 24, 2025
Merged

feat: cryptoki via UNIX socket#3421
Bravo555 merged 6 commits intothin-edge:mainfrom
Bravo555:feat/hsm/unix-socket

Conversation

@Bravo555
Copy link
Copy Markdown
Member

@Bravo555 Bravo555 commented Feb 25, 2025

Proposed changes

This PR adds the tedge-p11-server component, that unlike other services is built separately and runs directly on the host to allow other thin-edge services access to PKCS#11 cryptographic tokens. The server listens on a UNIX domain socket and exposes two kinds of procedures that is passes for execution to the PKCS#11 cryptographic token:

  1. choosing the signature scheme from offered candidates
  2. signing the message

The crate also exports a client that performs rustls' choosing scheme and signing operations by connecting via the UNIX socket and requesting them from the server, which interacts with the cryptoki device directly.

Configuration changes

Replaced device.cryptoki.enable with device.cryptoki.mode which can be one of off/module/socket

  • off - cryptoki is disabled and private key is read from a file
  • module - thin-edge uses a cryptoki dynamic module directly, which is preferable in environments where that's supported, i.e. dynamic module can be loaded and thin-edge process has access to the cryptographic token (e.g. running directly on the host and using GNU libc)
  • socket Connecting to tedge-p11-server via a UNIX socket. This is preferable when not using GNU libc or when no access to cryptographic token due to running in an unprivileged container

Setting to module implies required device.cryptoki.module_path. Similarly, setting to socket implies required the new device/cryptoki.socket_path setting, which contains the path to the tedge-p11-server UNIX socket.

device.cryptoki.serial setting was removed, because it's currently unused. Once related functionality is implemented, it will be reintroduced.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

#3363

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

TODO

  • implement scheme selection
  • write and compare to non-grpc version
  • use a binary serialization format like CBOR or postcard
  • extract the parts that deal with raw byte streams into a Connection struct like in tokio framing guide
  • figure out how to do versioning
  • allow socket being created by systemd and activate the service upon connection

Follow-up

  • allow passing in token identifier (e.g. pkcs11:model=PKCS%%2315%%20emulated;manufacturer=piv_II;)
  • allow tedge-p11-server to read configuration from file
  • create a ticket to handle MQTT client error handling
    At the moment, errors are wrapped multiple times which results in duplicated info, for instance when the key pairs are mismatched (e.g. certificate does not match the private key), then the following error message shows the same error message repeated multiple times:
    error: Connection error while creating device in Cumulocity: Mqtt state: Mqtt serialization/deserialization error: IO: received fatal alert: HandshakeFailure: Mqtt serialization/deserialization error: IO: received fatal alert: HandshakeFailure: IO: received fatal alert: HandshakeFailure: received fatal alert: HandshakeFailure
    
  • figure out what parts to test and test them
  • Add documentation

Further comments

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 25, 2025

@Bravo555 Bravo555 self-assigned this Feb 26, 2025
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 307d71e to 0bbbb1c Compare February 27, 2025 10:08
@Bravo555 Bravo555 mentioned this pull request Feb 27, 2025
11 tasks
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 0bbbb1c to 0cdce00 Compare March 3, 2025 09:54
@Bravo555 Bravo555 temporarily deployed to Test Pull Request March 3, 2025 09:54 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 0cdce00 to 95b71f4 Compare March 3, 2025 11:46
@Bravo555 Bravo555 had a problem deploying to Test Pull Request March 3, 2025 11:46 — with GitHub Actions Failure
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 95b71f4 to 064a7fe Compare March 3, 2025 11:47
@Bravo555 Bravo555 temporarily deployed to Test Pull Request March 3, 2025 11:47 — with GitHub Actions Inactive
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 3, 2025

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
599 0 3 599 100 1h42m42.560857999s

@Bravo555 Bravo555 temporarily deployed to Test Pull Request March 3, 2025 15:43 — with GitHub Actions Inactive
@Bravo555 Bravo555 temporarily deployed to Test Pull Request March 3, 2025 16:50 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 1fa67be to dcb1738 Compare March 5, 2025 16:54
@Bravo555 Bravo555 had a problem deploying to Test Pull Request March 5, 2025 16:54 — with GitHub Actions Failure
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from dcb1738 to 3665bb1 Compare March 5, 2025 17:03
@Bravo555 Bravo555 had a problem deploying to Test Pull Request March 5, 2025 17:03 — with GitHub Actions Failure
@Bravo555 Bravo555 force-pushed the feat/hsm/unix-socket branch from 3665bb1 to e8fef0c Compare March 5, 2025 17:03
@Bravo555 Bravo555 temporarily deployed to Test Pull Request March 5, 2025 17:03 — with GitHub Actions Inactive
@Bravo555 Bravo555 marked this pull request as ready for review March 5, 2025 17:04
@reubenmiller
Copy link
Copy Markdown
Contributor

@Bravo555 As a follow-up it would be good to update to cryptoki 0.9.0

Copy link
Copy Markdown
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few points to discuss:

  • Does the tedge-p11-server support reading from a tedge.toml config file (it doesn't look like it)? Previously this was the case, and it could load the pkcs11 module path from it, and the pin, now the user has to provide it via arguments which will make this harder to package as a feature. So at the moment we have configuration in tedge.toml but nothing is reading it.
  • Who is responsible for providing the pin? The client or the server?
  • The cryptoki feature is not enabled by default yet, we can enable this now since the gnu build dependency for tedge has been removed.

@reubenmiller
Copy link
Copy Markdown
Contributor

During testing I also found an unhandled panic (the exact root cause is also unclear):

Unhandled panic during tedge reconnect

$ tedge reconnect c8y                                           
Disconnecting from Cumulocity
Removing bridge config file... ✓
Bridge doesn't exist. Device is already disconnected from Cumulocity.
Reconnecting with config:
        device id: rmi_exampleA0001
        cloud profile: <none>
        cloud host: thin-edge-io.eu-latest.cumulocity.com:8883
        auth method: certificate
        certificate file: /opt/homebrew/etc/tedge/device-certs/tedge-certificate.pem
        bridge: built-in
        service manager: homebrew
        mosquitto version: 2.0.21
thread 'main' panicked at crates/extensions/tedge-p11-server/src/signer.rs:48:54:
called `Result::unwrap()` on an `Err` value: Hit the end of buffer, expected more data
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

On the tedge-p11-server side, the following ERROR log entries are visible.

./target/release/tedge-p11-server --module-path /opt/homebrew/lib/libykcs11.dylib
2025-03-19T11:47:42.011947Z  INFO tedge_p11_server: crates/extensions/tedge-p11-server/src/main.rs:67: Using cryptoki configuration cryptoki_config=CryptokiConfigDirect { module_path: "/opt/homebrew/lib/libykcs11.dylib", pin: "123456", serial: None }
2025-03-19T11:47:42.012673Z  INFO tedge_p11_server: crates/extensions/tedge-p11-server/src/main.rs:84: Server listening listener=Some("./tedge-p11-server.sock")
2025-03-19T12:13:06.188422Z ERROR tedge_p11_server::server: crates/extensions/tedge-p11-server/src/server.rs:32: Incoming request failed: Socket is not connected (os error 57)
2025-03-19T12:18:52.293851Z ERROR tedge_p11_server::server: crates/extensions/tedge-p11-server/src/server.rs:32: Incoming request failed: Socket is not connected (os error 57)

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is in a good shape.

rustls-native-certs = { workspace = true }
rustls-pemfile = { workspace = true }
sha-1 = { workspace = true }
tedge-p11-server = { workspace = true }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direct dependency on cryptoki can now be removed as being unused. Nice as this is the main goal of this PR!

I even wonder if we still need the cryptoki feature.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the feature flag can removed unless we find other unexpected blockers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the cryptoki feature can definitely be removed now, however we just spoke and there might still be value in keeping the direct loading of the cryptoki library in cases where tedge is compiled with gnu libc (this is the case in Yocto)...but the logic was more "we already have the logic there, so why not just leave it"...so if we run into the slightest of problems due to that decision, then we can remove it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cryptoki feature has been removed but not the dependency (which is actually unused).

@@ -41,34 +41,29 @@ pub fn create_tls_config_cryptoki(
client_certificate: impl AsRef<Path>,
cryptoki_config: CryptokiConfig,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this actually be something else than CryptokiConfig::SocketService ?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is to still allow using a cryptoki module directly (i.e. without spinning up and connecting to a signing server) if running on a supported target (GNU libc) to simplify deployment. tedge_config will be changed to make it more clear:

  • device.cryptoki.mode (will rename because enable implies a boolean)
    • off - don't use cryptoki
    • module - use cryptoki module directly, requires device.cryptoki.module_path to be set
    • socket/service/tedge-p11 - use a signing service, requires device.cryptoki.socket_path to be set

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you mean socket/service/tedge-p11 => socket or tedge-p11-server

Comment on lines +54 to +67
pub struct Frame {
/// A version tag for possible future versions, currently ignored.
// although i'm not sure if it's actually necessary now because we can always add it later if need be; it's because
// we know that 1) we're having one connection/one call, so we read until EOF and know we have a valid message (so
// don't have use a scheme like TLV) and 2) we can always add new fields
pub version: Version,
pub payload: Payload,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As currently implemented the version tag is indeed not so useful (the payload being independent of the version).

I would rather use somethings along these lines, so the version check is done by serde and let server and clients handle older versions.

Suggested change
pub struct Frame {
/// A version tag for possible future versions, currently ignored.
// although i'm not sure if it's actually necessary now because we can always add it later if need be; it's because
// we know that 1) we're having one connection/one call, so we read until EOF and know we have a valid message (so
// don't have use a scheme like TLV) and 2) we can always add new fields
pub version: Version,
pub payload: Payload,
}
pub enum Frame {
Version1 { pub payload: Payload },
Version2 { pub payload: Payload, pub extra: PayloadExtra },
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in 2f30c39.

Enum is indeed more suitable, because indeed it makes more sense for the type system to enforce that only (VersionA, PayloadA), (VersionB, PayloadB) are valid.

However I didn't include the version 2 variant because I don't know what the PayloadExtra should be. A map, perhaps, but of what?
Honestly, all this versioning makes my head spin where I kind of have a mental block because I have no idea if the things are doing will be backwards compatible or not 🙁. Maybe we should have a call about that, but I'm not sure if publishing the v1 with the guarantee that new clients will be able to work with this v1 server.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However I didn't include the version 2 variant because I don't know what the PayloadExtra should be. A map, perhaps, but of what?

Yeah, it was just an example. As of now only Version1 makes sense.

I'm not sure if publishing the v1 with the guarantee that new clients will be able to work with this v1 server.

Encoding version in the message frames is indeed not enough. For that to work:

  1. A server that running version N+1 must still be able to accept and respond to requests version N and below.
  2. A server should be able to return an error to the client, notably to tell a request cannot be serialized.
  3. A client running version N+1 should be able to downgrade its request to the version actually used by the server (determined by the version of the error frame returned from the first error).

With the exception of the 2nd point (A frame aimed to return an error to the client), nothing specific has to be implemented in this PR.

@didier-wenzek
Copy link
Copy Markdown
Contributor

Does the tedge-p11-server support reading from a tedge.toml config file (it doesn't look like it)? Previously this was the case, and it could load the pkcs11 module path from it, and the pin, now the user has to provide it via arguments which will make this harder to package as a feature. So at the moment we have configuration in tedge.toml but nothing is reading it.

Also, the cryptoki settings in tedge.toml group two aspects (direct connection vs unix socket) with the risk to make things confusing (tedge trying to directly connect to cryptoki or the server looping to itself). Having both is the same file is okay but the config method in mqtt_config.rs makes easy for a client to be confused by a config file where both aspects have been set.

Comment on lines +78 to +88
pub enum Frame1 {
ChooseSchemeRequest(ChooseSchemeRequest),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding an error case, is required to able to handle version in the future. See #3421 (comment)

Suggested change
pub enum Frame1 {
ChooseSchemeRequest(ChooseSchemeRequest),
pub enum Frame1 {
ProtocolError,
ChooseSchemeRequest(ChooseSchemeRequest),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1a3990a. I've also taken a liberty to add an error message string.

I've also sorted out a bit the idea of versioning in my head and have a bit better understanding.
Basically, because the version tag is the first thing we send over after the connection is established, we can read only the first byte and decide everything from there. Currently with version 1, after reading the version byte, we expect:

  1. Client to send over a Frame1 value that is a request.
  2. Client to close its sending half.
  3. At this point server reads the request until EOF, closes reading half, sends a response which is a single Frame1 that is not a request (can be an error) and closes sending half.
  4. Client should read the Frame1 response until EOF and close reading half.
  5. At this point the connection is terminated on both ends.

But in the future, using a hypothetical version 2, we can in theory move away from that model and send multiple requests and responses over a single connection.

  • old clients will keep working with new server, maintaining forwards compatibility (but this will probably be less common case)
  • new clients, to work with older server, will need some version negotiation mechanism to know what's the server version to choose appropriate message to send, which achieves backwards compatibility. This can be, as you suggested, an error returned by the server when client uses a new request that server can't support, at which point client can open a new connection and use the correct version, or a server can send over its own version first, which the client can read and decide what type of request to send. Due to using a UNIX socket, bandwidth and latency are not a concern, so the solution should be chosen based on simplicity.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 1a3990a. I've also taken a liberty to add an error message string.

Perfect!

I've also sorted out a bit the idea of versioning in my head and have a bit better understanding. Basically, because the version tag is the first thing we send over after the connection is established, we can read only the first byte and decide everything from there. Currently with version 1, after reading the version byte, we expect:

I would simply let serde figure out which version is the received frame. Responding with the same version or returning a protocol error if the frame cannot be deserialized - forcing the peer to downgrade its protocol.

But in the future, using a hypothetical version 2, we can in theory move away from that model and send multiple requests and responses over a single connection.

Possibly.

For rustls we have `default-features = false` because we don't want to
use aws-lc-rs, but it also disabled other default features: `std` and
`tls12`. These features have been reenabled.

Feature determination scheme appears to be wonky as when I started
working on p11 server, rustls screamed at me about
`ClientConfig::builder` not existing. Indeed it requires `rustls/std`
feature, but it worked fine before.

Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay to merge this PR once approved by @reubenmiller . The code is neat and clear. I have not checked the integration with systemd sockets, though, my focus being on the socket protocol.

rustls-native-certs = { workspace = true }
rustls-pemfile = { workspace = true }
sha-1 = { workspace = true }
tedge-p11-server = { workspace = true }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cryptoki feature has been removed but not the dependency (which is actually unused).

Comment on lines +197 to +214
match self.mode {
Cryptoki::Off => Ok(None),
Cryptoki::Module => Ok(Some(CryptokiConfig::Direct(CryptokiConfigDirect {
module_path: self
.module_path
.or_config_not_set()
.context("required because `device.cryptoki.mode` is set to `module`")?
.clone(),
pin: self.pin.clone(),
serial: self.serial.or_none().cloned(),
}))),
Cryptoki::Socket => Ok(Some(CryptokiConfig::SocketService {
socket_path: self
.socket_path
.or_config_not_set()
.context("required because `device.cryptoki.mode` is set to `socket`")?
.clone(),
})),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely reduce the risk of misconfiguration highlighted here #3421 (comment)

Copy link
Copy Markdown
Contributor

@reubenmiller reubenmiller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

I was able to test both the socket and module modes, and was able to successfully connect to Cumulocity with the built-in bridge and using a Yubikey (when starting the tedge-p11-server on the command line). The systemd integration has not been tested, however I'm ok with merging it as is.

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

Despite modifying certificate crate, this commit doesn't introduce any
functional changes there (only import paths are updated).

Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
Instead of always binding the socket, check if any sockets are passed by
systemd using sd_listen_fds. If there are, use the first one. If not,
fallback to creating a socket manually.

The service is started by systemd's socket, but otherwise operates
normally, i.e. runs the accept() loop handling multiple requests. This
corresponds to `Accept=no` systemd socket configuration. Support for
`Accept=yes` will be added in the next commit.

A new dependency sd-listen-fds was added, that only contains the
required function, but it can be replaced with libsystemd in the future.

Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
This commit adds the `device.cryptoki.socket_path` setting and, if
present, connects to the signing service socket under this path instead
of loading cryptoki dynamic module directly.

Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
If can be `off`, `socket` or `module` to disable cryptoki or select a
mode in which it's used: using cryptoki module directly or
tedge-p11-server signing service.

Signed-off-by: Marcel Guzik <marcel.guzik@cumulocity.com>
@Bravo555 Bravo555 mentioned this pull request Apr 1, 2025
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme:security Theme: Security related topics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants