[multicast] M2P forwarding, OPTE port subscription, and sled-agent propagation#10070

Open
zeeshanlakhani wants to merge 20 commits into multicast-e2e from zl/multicast-m2p-forwarding
@zeeshanlakhani

Collaborator

Complete the multicast data path by adding per-sled M2P (multicast-to-physical) mapping, forwarding entry management, and OPTE port subscription for multicast group members.

Sled-agent:

  • Add multicast_subscribe / multicast_unsubscribe endpoints (API v29) that configure M2P, forwarding, and OPTE port subscription for a VMM
  • OPTE port_manager gains set/clear operations for M2P and forwarding
  • Port subscription cleanup on PortTicket release

Nexus:

  • New sled.rs (MulticastSledClient) encapsulating all sled-agent multicast interactions: subscribe/unsubscribe, M2P/forwarding propagation and teardown
  • Groups RPW propagates M2P and forwarding entries to all member sleds after DPD configuration, with convergent retry on failure
  • Members RPW uses MemberReconcileCtx to thread shared reconciliation state. This handles subscribe on join, unsubscribe on leave, and re-subscribe on migration
  • Dataplane client updated for bifurcated replication groups

Tests:

  • Integration tests for M2P/forwarding/subscribe lifecycle
  • Instance migration multicast re-convergence
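
As a rough illustration of the Members RPW behavior described above (subscribe on join, unsubscribe on leave, re-subscribe on migration), the per-member decision can be sketched as a comparison between where a member is currently subscribed and where its VMM should be. The `SledId` alias, the `Action` enum, and `reconcile_member` are hypothetical names for this sketch only; they are not taken from the PR, and the real reconciler threads more state through `MemberReconcileCtx`.

```rust
/// Hypothetical sled identifier; the real code uses typed UUIDs.
type SledId = u32;

/// Hypothetical set of actions a member reconciliation pass could emit.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Subscribe { sled: SledId },
    Unsubscribe { sled: SledId },
    Resubscribe { from: SledId, to: SledId },
    Nothing,
}

/// Decide what to do for one member, given the sled it is currently
/// subscribed on (`subscribed_on`) and the sled its VMM should be on
/// (`desired_on`). `None` means "not subscribed" / "should not be".
fn reconcile_member(
    subscribed_on: Option<SledId>,
    desired_on: Option<SledId>,
) -> Action {
    match (subscribed_on, desired_on) {
        // Join: nothing subscribed yet, but the member should be.
        (None, Some(sled)) => Action::Subscribe { sled },
        // Leave: still subscribed, but the member should not be.
        (Some(sled), None) => Action::Unsubscribe { sled },
        // Migration: the VMM moved to a different sled.
        (Some(from), Some(to)) if from != to => {
            Action::Resubscribe { from, to }
        }
        // Already converged, or nothing to do.
        _ => Action::Nothing,
    }
}
```

Because the function only compares observed and desired state, running it again after a failed pass yields the same action, which is the property a convergent retry relies on.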

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch 4 times, most recently from 3efe7fa to 98e9742 Compare March 17, 2026 05:42
@zeeshanlakhani

zeeshanlakhani commented Mar 17, 2026

Collaborator Author

Note: Both helios/deploy and check-opte-ver/check-opte-ver will fail in CI for now. After R19 ships and OPTE PR #924 merges, we can switch from branch = "zl/filter-mcast-srcs" to rev = "<merged-sha>" and bump tools/opte_version + deploy.sh target to 0.40.

@zeeshanlakhani zeeshanlakhani marked this pull request as ready for review March 17, 2026 09:02
@zeeshanlakhani zeeshanlakhani self-assigned this Mar 23, 2026
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 98e9742 to b510a9f Compare March 24, 2026 02:40
Comment thread common/src/api/internal/shared/mod.rs Outdated
@zeeshanlakhani

zeeshanlakhani commented Mar 26, 2026

Collaborator Author

@jgallagher I started down the better path for network types in relation to #10139, where we could expand on the initial version in a separate PR. For the moment, I'm going to move the types back to omicron_common until #10158 finalizes.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch 2 times, most recently from 102d7fb to 043af28 Compare March 26, 2026 11:51

@FelixMcFelix FelixMcFelix left a comment

Contributor

I'm going to be posting comments in chunks given the apparent size of the PR, so please bear with me there if I miss something that's covered by a later file. Apart from that I'm seeing various changes from main interspersed here, so I don't know if something is up with the branch targeting.

Comment thread illumos-utils/src/opte/illumos.rs Outdated
Comment thread illumos-utils/src/opte/non_illumos.rs Outdated
Comment thread illumos-utils/src/opte/port_manager.rs Outdated
Comment thread illumos-utils/src/opte/non_illumos.rs Outdated
Comment thread dev-tools/releng/src/main.rs Outdated
Comment thread illumos-utils/src/opte/port_manager.rs Outdated
Comment thread illumos-utils/src/opte/port_manager.rs Outdated
Comment thread illumos-utils/src/opte/port_manager.rs
Comment thread illumos-utils/src/opte/port_manager.rs
Comment thread illumos-utils/src/opte/port_manager.rs
…main builds

Bump maghemite and OPTE to versions with the latest multicast support.

OPTE now has the option to be installed via p5p package override from buildomat
rather than directly downloading xde/opteadm binaries. The override
mechanism (tools/opte_version_override) is sourced and packaged for use with
install_opte.sh, deploy.sh, releng, and CI to install the unpublished OPTE
build until it lands in the helios pkg repo.

Note: CI check added to reject OPTE_COMMIT override on PRs targeting main.
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 043af28 to 9cdf741 Compare March 28, 2026 09:36
…opagation

This completes the multicast data path by adding per-sled M2P (multicast-to-
physical) mapping, forwarding entry management, and OPTE port subscription
for multicast group members.

## Sled-agent + API update(s)

  - Add multicast endpoints at API v33 (MCAST_M2P_FORWARDING) for M2P,
    forwarding, and per-VMM subscribe/unsubscribe
  - Version v7 join/leave endpoints to v7..v33 with shim conversion
  - Move multicast types from omicron-common to sled-agent-types-versions
    v33 module (mcast_m2p_forwarding) with re-exports through sled-agent-types
  - OPTE port_manager gains set/clear operations for M2P and forwarding
  - Port subscription cleanup on PortTicket release
  - Consolidate per-port mutable state (eip_gateways, mcast) into PortState
  - Seed eip_gateways from global map on port creation to prevent stale
    gateway state on newly created ports
  - Lock ordering documented for ports, routes, eip_gateways

## Nexus

  - New `sled.rs` (MulticastSledClient) encapsulating all sled-agent
    multicast interactions: subscribe/unsubscribe, M2P/forwarding
    propagation and teardown
  - Groups RPW propagates M2P and forwarding entries to all member sleds
    after DPD configuration, with convergent retry on failure
  - Members RPW uses MemberReconcileCtx to thread shared reconciliation
    state. Handles subscribe on join, unsubscribe on leave, and
    re-subscribe on migration
  - `subscribe_vmm` gracefully handles missing propolis (mirrors unsubscribe)
  - `lookup_propolis_id` returns Ok(None) for missing instance
  - `lookup_and_update_member_sled_id` surfaces DB errors instead of
    swallowing them
  - Order-independent forwarding comparison to avoid spurious dataplane churn;
    always create forwarding entries for active groups even with empty next-hops
  - Dataplane client updated for bifurcated replication groups
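
A minimal sketch of the order-independent forwarding comparison mentioned in the list above, assuming next hops can be represented as `Ipv6Addr` values. The function name `forwarding_matches` is hypothetical and the actual types in the PR may differ; the point is only that comparing sets rather than ordered lists avoids rewriting an entry whose next hops were merely returned in a different order.

```rust
use std::collections::BTreeSet;
use std::net::Ipv6Addr;

/// Returns true when `current` and `desired` contain the same next hops,
/// ignoring order. Duplicates are also ignored, since both sides are
/// collapsed into sets before comparing.
fn forwarding_matches(current: &[Ipv6Addr], desired: &[Ipv6Addr]) -> bool {
    let current: BTreeSet<&Ipv6Addr> = current.iter().collect();
    let desired: BTreeSet<&Ipv6Addr> = desired.iter().collect();
    current == desired
}
```

Two empty lists compare equal here, so an active group with no next hops would still be considered converged once its (empty) entry exists.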

## illumos-utils

  - Remove CIDR allow rules for multicast (handled by OPTE gateway layer)
  - Reject Reserved replication mode in `list_mcast_fwd` with
    InvalidMcastForwardingState error
  - Consolidate error variants into InvalidMcastUnderlay

## Tests

  - Integration tests for M2P/forwarding/subscribe lifecycle
  - Instance migration multicast re-convergence

@FelixMcFelix FelixMcFelix left a comment

Contributor

Thanks Zeeshan; I haven't looked at integration_tests/multicast/networking_integration.rs, but otherwise some thoughts going through the work.

Comment thread .github/buildomat/jobs/deploy.sh Outdated
Comment thread .github/workflows/check-opte-ver.yml
Comment thread illumos-utils/src/opte/port_manager.rs Outdated
Comment thread illumos-utils/src/opte/port_manager.rs Outdated
Comment thread nexus/src/app/background/tasks/multicast/groups.rs
Comment thread tools/install_opte.sh Outdated
Comment thread sled-agent/src/bootstrap/early_networking.rs
Comment thread nexus/src/app/multicast/sled.rs Outdated
Comment thread nexus/src/app/multicast/sled.rs
Comment thread sled-agent/src/instance.rs Outdated
Changes:

- Remove global eip_gateways map from PortManagerInner, as the VPC route manager RPW activates after instance start
- Refactor member reconciler methods to take &MemberReconcileCtx
- Change forwarding next hop from member sleds to a single switch zone IP
- Add resolver to MulticastSledClient for switch zone address lookup
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from ee88e45 to 46eb139 Compare April 1, 2026 07:39
@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from 46eb139 to 6b07a46 Compare April 1, 2026 11:06
@zeeshanlakhani zeeshanlakhani force-pushed the multicast-e2e branch 5 times, most recently from 50333e8 to dfda50f Compare May 13, 2026 01:54
…nd per-group source IP caps

New nexus external API version 2026_05_16_00 (MULTICAST_SOURCE_LIMITS) splits
the multicast group join endpoint to introduce two policy bounds:

- MAX_SOURCE_IPS_PER_MEMBER (32): caps a single member's source filter list
  and rejects duplicate entries explicitly rather than silently deduplicating.
- MAX_SOURCE_IPS_PER_GROUP (256): caps the union of source IPs across all
  active members of a group. Enforced atomically inside the member-attach
  CTE plus a preflight check at the Nexus app layer for a descriptive 400.

Both caps apply whenever a member declares a non-empty source list,
covering SSM groups and ASM members using INCLUDE-mode source filtering.

This gives the Oxide policy concrete bounds, following the precedent of Linux
igmp_max_msf and FreeBSD maxsocksrc, and gives us a threshold to work from on
the switch side.
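
The two caps described above can be sketched as an in-memory preflight check. This is only an illustration under stated assumptions: the constant names and values come from the commit message, but `SourceLimitError`, `check_source_limits`, and the order of the checks are hypothetical, and the real enforcement also happens atomically inside the member-attach CTE, which this sketch does not model.

```rust
use std::collections::BTreeSet;
use std::net::IpAddr;

const MAX_SOURCE_IPS_PER_MEMBER: usize = 32;
const MAX_SOURCE_IPS_PER_GROUP: usize = 256;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum SourceLimitError {
    TooManyForMember { count: usize },
    Duplicate(IpAddr),
    TooManyForGroup { union_size: usize },
}

/// Check a joining member's source list against both caps.
/// `existing_group_sources` is the union of source IPs across the
/// group's currently active members.
fn check_source_limits(
    existing_group_sources: &BTreeSet<IpAddr>,
    new_member_sources: &[IpAddr],
) -> Result<(), SourceLimitError> {
    // The caps only apply to members that declare a source list.
    if new_member_sources.is_empty() {
        return Ok(());
    }
    if new_member_sources.len() > MAX_SOURCE_IPS_PER_MEMBER {
        return Err(SourceLimitError::TooManyForMember {
            count: new_member_sources.len(),
        });
    }
    // Reject duplicates explicitly instead of silently deduplicating.
    let mut seen = BTreeSet::new();
    for ip in new_member_sources {
        if !seen.insert(*ip) {
            return Err(SourceLimitError::Duplicate(*ip));
        }
    }
    // The group cap applies to the union, so sources already used by
    // other members do not count twice.
    let union_size = existing_group_sources.union(&seen).count();
    if union_size > MAX_SOURCE_IPS_PER_GROUP {
        return Err(SourceLimitError::TooManyForGroup { union_size });
    }
    Ok(())
}
```

A check like this can produce a descriptive error early, but on its own it is racy against concurrent joins, which is presumably why the commit pairs it with enforcement inside the database transaction.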

@FelixMcFelix FelixMcFelix left a comment

Contributor

Some nits, but otherwise this looks to be setting up OPTE correctly from what I can understand. Thanks for iterating here Zeeshan.

Comment thread illumos-utils/src/opte/illumos.rs
Comment thread illumos-utils/src/opte/non_illumos.rs
Comment thread nexus/db-queries/src/db/datastore/multicast/ops/member_attach.rs Outdated
Comment thread Cargo.toml
Comment on lines +1139 to +1146
# TODO: This patch is a symptom of the broader maghemite -> omicron
# self-reference loop (maghemite depends on oximeter-producer, which pulls
# nexus-client -> nexus-types from omicron@main and not a rev). @jgallagher
# flagged this in review; @bnaecker suggested extracting oximeter-producer
# out of omicron to break it. When that lands, this patch (and any
# duplicated path+git entries for illumos-utils, nexus-types, etc.) can be
# removed.
oxlog = { path = "dev-tools/oxlog" }

Contributor

Putting a comment here so this isn't lost.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/multicast-m2p-forwarding branch from bcf5a6c to 02064d2 Compare May 23, 2026 15:34
zeeshanlakhani added a commit that referenced this pull request May 26, 2026
Final, pre-review pass on this work. It stacks atop #10070 and inherits
the multicast-to-physical (M2P) underlay forwarding and VMM-keyed instance
subscription endpoints.

This also builds on and integrates #10381.

Above these foundations, this work includes the final pass on mgd-ddmd
integration:

* Reconciler correctness:
  * `set_mcast_m2p` rolls back the xde M2P entry on per-NIC join
    failure, so the reconciler converges on a retry instead of
    leaving stale state pointing at the wrong underlay address.
  * `propolis_id` is threaded end-to-end through the sled-agent
    multicast endpoints to deal with live migration ambiguity.
  * MRIB advertisement is gated on a flag rather than running unconditionally
    after the DPD match arm, so that a DPD failure no longer leaves
    a route advertised via DDM with no programmed forwarding state.

* OPTE hardening (illumos-utils):
  * M2P entries upserted into a `BTreeMap<IpAddr, MulticastUnderlay>`
    rather than a Vec on the non-illumos mock, eliminating duplicate-key corner
    cases the production map already avoided.
  * `MulticastFilterMap` encapsulates the per-NIC filter socket and
    refcount state previously open-coded inside `PortManagerInner`,
    concentrating the "join socket per underlay group per NIC"
    invariant into a single type.
  * underlay_nics typed as &[AddrObject] rather than &[String].
  * Per-NIC IPV6_JOIN_GROUP calls converted from libc::setsockopt to
    nix::sys::socket::setsockopt for the typed bind.

* Sled-agent (real and sim):
  * Sim v7 multicast endpoints fall through to the trait defaults
    instead of overriding with just `unimplemented!()`, matching how
    other versioned endpoints behave in the sim.
  * Sim VMM existence check on join/leave restored.

* Configuration:
  * `MulticastGroupReconcilerConfig` gains a group_concurrency_limit
    and member_concurrency_limit bounding the per-pass fan-out of the RPW's
    buffer_unordered streams.

* Test infra:
  * `populate_ddm_peers` no longer caches the peer map. The previous
    cache was keyed by sled-id set, but the synthesized port names
    embedded each sled's `sp_slot` from inventory, so cache reuse
    within the same sled set could produce stale port mappings.

* Documentation cleanup across the RPW, sled-agent multicast paths, and the
  new(er) sled-agent types module.
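
To illustrate two of the items above together (the `BTreeMap` upsert and the rollback of the M2P entry when a per-NIC join fails), here is a minimal mock-style sketch. `MockState`, the `join` callback, the string NIC names, and the choice to restore the previous mapping on failure are all assumptions made for this example; the real `set_mcast_m2p` talks to xde and uses `AddrObject` for NICs.

```rust
use std::collections::BTreeMap;
use std::net::{IpAddr, Ipv6Addr};

/// Hypothetical stand-in for the underlay address a group maps to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MulticastUnderlay(Ipv6Addr);

/// Hypothetical mock state keyed by overlay group address.
#[derive(Default)]
struct MockState {
    m2p: BTreeMap<IpAddr, MulticastUnderlay>,
}

impl MockState {
    /// Upsert the M2P entry, then join the underlay group on each NIC.
    /// If any join fails, undo the M2P change so a retry starts clean.
    fn set_mcast_m2p<F>(
        &mut self,
        group: IpAddr,
        underlay: MulticastUnderlay,
        nics: &[&str],
        mut join: F,
    ) -> Result<(), String>
    where
        F: FnMut(&str, MulticastUnderlay) -> Result<(), String>,
    {
        // `insert` replaces any existing entry for this key, so the map
        // can never hold two mappings for the same group.
        let previous = self.m2p.insert(group, underlay);
        for &nic in nics {
            if let Err(e) = join(nic, underlay) {
                // Roll back: restore the old mapping, or remove the
                // entry if there was none. Joins that already succeeded
                // on earlier NICs are not undone in this simplified
                // sketch.
                match previous {
                    Some(old) => {
                        self.m2p.insert(group, old);
                    }
                    None => {
                        self.m2p.remove(&group);
                    }
                }
                return Err(format!("join failed on {nic}: {e}"));
            }
        }
        Ok(())
    }
}
```

With a `Vec` of pairs, the same upsert would need an explicit search to avoid pushing a second entry for a group, which is the duplicate-key corner case the map-based version sidesteps.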
@FelixMcFelix FelixMcFelix left a comment

Contributor

Re-approved, pending the one open TODO comment.

@zeeshanlakhani zeeshanlakhani force-pushed the multicast-e2e branch 2 times, most recently from 868c08a to 08d7141 Compare May 28, 2026 12:43

3 participants