Proposal: Kmesh-daemon upgrades traffic without disruption #1441

Merged
kmesh-bot merged 1 commit into kmesh-net:main from 072020127:KmeshUpdateDev
Oct 15, 2025

@072020127
Contributor

What type of PR is this?
/kind documentation

What this PR does / why we need it:
Describes the implementation of Kmesh-daemon's startup logic that keeps traffic uninterrupted during upgrades.

Which issue(s) this PR fixes:
Fixes #1409

Does this PR introduce a user-facing change?:

NONE

@kmesh-bot
Collaborator

@072020127: The label(s) kind/documentation cannot be applied, because the repository doesn't have them.

@kmesh-bot
Collaborator

Welcome @072020127! It looks like this is your first PR to kmesh-net/kmesh 🎉


@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @072020127, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new documentation proposal that outlines a detailed strategy for enabling traffic-preserving upgrades for the Kmesh-daemon. The proposal addresses the challenge of maintaining seamless traffic continuity and preserving eBPF map state during version upgrades, aiming to reduce operational risk and improve reliability in production environments.

Highlights

  • Proposal for Traffic-Preserving Upgrades: This PR adds a comprehensive design document detailing how Kmesh-daemon can perform upgrades without disrupting existing traffic, addressing current limitations where eBPF map state changes can lead to connection drops.
  • Map Compatibility Detection: The proposal outlines a mechanism to detect changes in eBPF map definitions during upgrades. This involves runtime inspection of MapSpec (including MapType, KeySize, ValueSize, MaxEntries, Key, and Value), storing spec snapshots at startup, and performing recursive layout diffing (diffBTFStructFieldsRec) to identify any metadata or BTF layout differences.
  • Robust Data Migration Strategy: A detailed strategy for migrating eBPF map data is presented. This includes creating new maps with a _tmp suffix, employing a dual-write mechanism to simultaneously update old and new maps during the upgrade, converting and copying entries using convertStructValue (handling key/type changes vs. value layout changes), and atomically swapping map pins once migration is complete.
  • Hot Program Replacement: The design specifies using link.Update for atomic replacement of BPF programs after map migration, ensuring zero packet loss during the transition to the new program version.
  • Comprehensive Testing Plan: The proposal includes a plan for both unit tests (validating core logic like diffBTFStructFieldsRec, convertStructValue, and dual-write synchronization) and E2E tests (verifying data continuity, no packet loss, and zero connection resets during Kmesh upgrades with live traffic).
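
The compatibility rules summarized above can be sketched as a small classifier. This is a minimal illustration, not the proposal's code: `mapMeta` is a hypothetical stand-in for the `MapSpec` fields the proposal names, and treating a `MaxEntries` change as a migration is an assumption. It compares sizes only, whereas the proposal additionally diffs BTF layouts with `diffBTFStructFieldsRec`.

```go
package main

import "fmt"

// mapMeta is a hypothetical stand-in for the MapSpec fields named in the
// proposal (MapType, KeySize, ValueSize, MaxEntries).
type mapMeta struct {
	Type       string
	KeySize    uint32
	ValueSize  uint32
	MaxEntries uint32
}

// action describes what the upgrade has to do with one map.
type action int

const (
	reuse    action = iota // definitions match: keep the existing map
	migrate                // value (or capacity) changed, keys still compatible
	recreate               // type or key changed: start a fresh map
)

func (a action) String() string {
	return [...]string{"reuse", "migrate", "recreate"}[a]
}

// classify compares the snapshot stored by the old version with the new spec.
func classify(old, cur mapMeta) action {
	switch {
	case old.Type != cur.Type || old.KeySize != cur.KeySize:
		return recreate
	case old.ValueSize != cur.ValueSize || old.MaxEntries != cur.MaxEntries:
		return migrate // assumption: a capacity change is handled by migration
	default:
		return reuse
	}
}

func main() {
	old := mapMeta{Type: "Hash", KeySize: 4, ValueSize: 16, MaxEntries: 1024}
	cur := old
	cur.ValueSize = 24
	fmt.Println(classify(old, cur)) // prints "migrate"
}
```

Keeping the decision in one pure function makes it straightforward to unit-test each case before any map is touched.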


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a design proposal for implementing traffic-preserving upgrades in Kmesh-daemon. The proposal outlines motivation, goals, design details, and a testing plan. The review identifies a missing image link that needs to be fixed.

3. **Data Migration**: Entries are iterated from the old map and copied using `convertStructValue`, which transfers only matching fields and sets defaults for missing or incompatible ones. The logic handles two strategies:
- if key or type has changed, the old map is discarded and a new one is started fresh.
- if value layout has changed but keys remain compatible, entries are migrated.
![Dual-Write]

high

The image link for 'Dual-Write' is missing the URL. Please add the correct path to the image.

Suggested change
![Dual-Write]
![Dual-Write](<path-to-dual-write-diagram>)
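
The field-wise copy described in the Data Migration step quoted above might look roughly like the following. This is a sketch under assumptions: the `field` descriptor and `convertValue` are hypothetical and flatten each struct into name, offset, and size, while the proposal's `convertStructValue` works from BTF type information. Fields missing from the old layout, or whose size changed, keep the zero default.

```go
package main

import "fmt"

// field is a hypothetical flattened description of one struct member.
type field struct {
	Name   string
	Offset int // byte offset inside the value
	Size   int // byte size of the member
}

// convertValue builds a value in the new layout. Members that exist in both
// layouts with the same size are copied; all other bytes stay zero.
func convertValue(old []byte, oldLayout, newLayout []field, newSize int) []byte {
	out := make([]byte, newSize)
	byName := make(map[string]field, len(oldLayout))
	for _, f := range oldLayout {
		byName[f.Name] = f
	}
	for _, nf := range newLayout {
		of, ok := byName[nf.Name]
		if !ok || of.Size != nf.Size {
			continue // missing or incompatible member: keep the zero default
		}
		if of.Offset+of.Size > len(old) || nf.Offset+nf.Size > len(out) {
			continue // descriptor does not fit the buffer: skip defensively
		}
		copy(out[nf.Offset:nf.Offset+nf.Size], old[of.Offset:of.Offset+of.Size])
	}
	return out
}

func main() {
	oldLayout := []field{{"ip", 0, 4}, {"port", 4, 2}}
	newLayout := []field{{"port", 0, 2}, {"flags", 2, 2}, {"ip", 4, 4}}
	old := []byte{10, 0, 0, 1, 0x1f, 0x90}
	fmt.Println(convertValue(old, oldLayout, newLayout, 8)) // prints [31 144 0 0 10 0 0 1]
}
```

In this example `flags` is new in the target layout, so its two bytes remain zero while `ip` and `port` move to their new offsets.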

@codecov

codecov bot commented Jul 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 35.93%. Comparing base (8f619b9) to head (d232f10).
Report is 5 commits behind head on main.

see 32 files with indirect coverage changes




#### Hot Program Replacement

1. **Atomic Swap**: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `link.Update` to atomically swap the loaded program with a new one. This approach ensures there is no packet loss during the transition.
Contributor


Could you provide a more detailed explanation of link update?

@LiZhenCheng9527
Contributor

Your PR includes commits from others. You need to rebase it.
Additionally, your proposal only mentions what you will do but does not explain how you will do it. Could you provide more details?

@072020127
Contributor Author

[Screenshot 2025-07-20 184028] Markdown lint reports an issue in docs/ebpf_unit_test_zh.md. I haven't changed that file, but it does indeed have problems.

@hzxuzhonghu
Member

/assign @nlgwcy an expert on ebpf

@hzxuzhonghu hzxuzhonghu requested a review from Copilot September 5, 2025 03:08

Copilot AI left a comment


Pull Request Overview

This PR introduces a proposal document that outlines the implementation of traffic-preserving upgrades for the Kmesh-daemon. The proposal addresses the current limitation where Kmesh supports traffic-preserving restarts but not upgrades, which can lead to connection drops and state loss during version changes.

Key changes:

  • Detailed design for map compatibility detection using runtime inspection
  • Data migration logic with atomic pin swapping for seamless state transfer
  • Hot program replacement strategy using atomic BPF program swaps


#### Map Compatibility Detection

1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries` , `Key` and `Value`.

Copilot AI Sep 5, 2025


Remove the extra space before the comma after 'MaxEntries'.

Suggested change
1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries` , `Key` and `Value`.
1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries`, `Key` and `Value`.

And loads each map's key/value types, sizes, and attributes into a nested structure:

```go
map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce

Copilot AI Sep 5, 2025


Fix the typos in the comment: 'pgkname' should be 'pkgname' and 'Mapspce' should be 'MapSpec'.

Suggested change
map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce
map[string]map[string]*ebpf.MapSpec // map[pkgname][mapname]MapSpec


#### Hot Program Replacement

**Atomic Swap***: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps:

Copilot AI Sep 5, 2025


Fix the malformed markdown bold formatting. The heading should be 'Atomic Swap:' instead of 'Atomic Swap*:'.

Suggested change
**Atomic Swap***: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps:
**Atomic Swap**: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps:

}
```

And loads each map's key/value types, sizes, and attributes into a nested structure:
Member


How would this be done?

Contributor Author


I'll update my proposal this weekend with more detail on this section

And loads each map's key/value types, sizes, and attributes into a nested structure:

```go
map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce
```
Member


what is pgkname?

Contributor Author


pkgName is the logical grouping name for a set of compile-time map specs; it is the top-level key of the map returned by LoadCompileTimeSpecs. Each pkgName corresponds to the map definitions in one CollectionSpec (obtained via LoadKmesh*()) generated by bpf2go, such as KmeshCgroupSock or KmeshSockops.

Signed-off-by: 072020127 <mhy200253@gmail.com>
@072020127 072020127 reopened this Sep 7, 2025
@kmesh-bot kmesh-bot added size/L and removed size/XS labels Sep 7, 2025
@hzxuzhonghu
Member

Defer to @nlgwcy to have a final look

@LiZhenCheng9527
Contributor

/lgtm
/approve

@kmesh-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LiZhenCheng9527


@kmesh-bot kmesh-bot merged commit 9404fc3 into kmesh-net:main Oct 15, 2025
6 checks passed

Development

Successfully merging this pull request may close these issues.

[OSPP 2025] Kmesh-daemon upgrades traffic without disruption

5 participants