Proposal: Kmesh-daemon upgrades traffic without disruption#1441
Proposal: Kmesh-daemon upgrades traffic without disruption#1441kmesh-bot merged 1 commit intokmesh-net:mainfrom
Conversation
|
@072020127: The label(s) DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Welcome @072020127! It looks like this is your first PR to kmesh-net/kmesh 🎉 |
There was a problem hiding this comment.
Summary of Changes
Hello @072020127, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new documentation proposal that outlines a detailed strategy for enabling traffic-preserving upgrades for the Kmesh-daemon. The proposal addresses the challenge of maintaining seamless traffic continuity and preserving eBPF map state during version upgrades, aiming to reduce operational risk and improve reliability in production environments.
Highlights
- Proposal for Traffic-Preserving Upgrades: This PR adds a comprehensive design document detailing how Kmesh-daemon can perform upgrades without disrupting existing traffic, addressing current limitations where eBPF map state changes can lead to connection drops.
- Map Compatibility Detection: The proposal outlines a mechanism to detect changes in eBPF map definitions during upgrades. This involves runtime inspection of
MapSpec(includingMapType,KeySize,ValueSize,MaxEntries,Key, andValue), storing spec snapshots at startup, and performing recursive layout diffing (diffBTFStructFieldsRec) to identify any metadata or BTF layout differences. - Robust Data Migration Strategy: A detailed strategy for migrating eBPF map data is presented. This includes creating new maps with a
_tmpsuffix, employing a dual-write mechanism to simultaneously update old and new maps during the upgrade, converting and copying entries usingconvertStructValue(handling key/type changes vs. value layout changes), and atomically swapping map pins once migration is complete. - Hot Program Replacement: The design specifies using
link.Updatefor atomic replacement of BPF programs after map migration, ensuring zero packet loss during the transition to the new program version. - Comprehensive Testing Plan: The proposal includes a plan for both unit tests (validating core logic like
diffBTFStructFieldsRec,convertStructValue, and dual-write synchronization) and E2E tests (verifying data continuity, no packet loss, and zero connection resets during Kmesh upgrades with live traffic).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal for implementing traffic-preserving upgrades in Kmesh-daemon. The proposal outlines motivation, goals, design details, and a testing plan. The review identifies a missing image link that needs to be fixed.
| 3. **Data Migration**: Entries are iterated from the old map and copied using `convertStructValue`, which transfers only matching fields and sets defaults for missing or incompatible ones. The logic handles two strategies: | ||
| - if key or type has changed, the old map is discarded and a new one is started fresh. | ||
| - if value layout has changed but keys remain compatible, entries are migrated. | ||
| ![Dual-Write] |
8808088 to
d232f10
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
see 32 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
|
|
||
| #### Hot Program Replacement | ||
|
|
||
| 1. **Atomic Swap**: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `link.Update` to atomically swap the loaded program with a new one. This approach ensures there is no packet loss during the transition. |
There was a problem hiding this comment.
Could you provide a more detailed explanation of link update?
|
Your PR includes commits from others. You need to rebase it. |
207bc17 to
32fa210
Compare
3bd0167 to
1d3565b
Compare
daea512 to
1d3565b
Compare
6251e18 to
c789e88
Compare
d93f813 to
75731ae
Compare
75731ae to
87750cc
Compare
9c9ed69 to
77b8e70
Compare
77b8e70 to
2a3ad6c
Compare
|
/assign @nlgwcy an expert on ebpf |
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a proposal document that outlines the implementation of traffic-preserving upgrades for the Kmesh-daemon. The proposal addresses the current limitation where Kmesh supports traffic-preserving restarts but not upgrades, which can lead to connection drops and state loss during version changes.
Key changes:
- Detailed design for map compatibility detection using runtime inspection
- Data migration logic with atomic pin swapping for seamless state transfer
- Hot program replacement strategy using atomic BPF program swaps
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
|
||
| #### Map Compatibility Detection | ||
|
|
||
| 1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries` , `Key` and `Value`. |
There was a problem hiding this comment.
Remove the extra space before the comma after 'MaxEntries'.
| 1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries` , `Key` and `Value`. | |
| 1.**Runtime Inspection**: The comparison logic begins by loading each map’s runtime `MapSpec` which includes `MapType`, `KeySize`, `ValueSize`, `MaxEntries`, `Key` and `Value`. |
| And loads each map's key/value types, sizes, and attributes into a nested structure: | ||
|
|
||
| ```go | ||
| map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce |
There was a problem hiding this comment.
Fix the typos in the comment: 'pgkname' should be 'pkgname' and 'Mapspce' should be 'MapSpec'.
| map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce | |
| map[string]map[string]*ebpf.MapSpec // map[pkgname][mapname]MapSpec |
|
|
||
| #### Hot Program Replacement | ||
|
|
||
| **Atomic Swap***: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps: |
There was a problem hiding this comment.
Fix the malformed markdown bold formatting. The heading should be 'Atomic Swap:' instead of 'Atomic Swap*:'.
| **Atomic Swap***: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps: | |
| **Atomic Swap**: Once all maps are migrated, new BPF programs are attached. The upgrade process uses `utils.BpfProgUpdate()` to atomically swap the loaded program with a new one. BpfProgUpdate(progPinPath, cgopt) actually does two steps: |
| } | ||
| ``` | ||
|
|
||
| And loads each map's key/value types, sizes, and attributes into a nested structure: |
There was a problem hiding this comment.
I'll update my proposal this weekend with more detail on this section
| And loads each map's key/value types, sizes, and attributes into a nested structure: | ||
|
|
||
| ```go | ||
| map[string]map[string]*ebpf.MapSpec // map[pgkname][mapname]Mapspce |
There was a problem hiding this comment.
The pkgName represents the logical grouping name of the compile-time map spec, which is the top-level key of the map returned by LoadCompileTimeSpecs. Each pkgName corresponds to a set of map definitions derived from a CollectionSpec (via LoadKmesh*()) generated from the bpf2go, such as KmeshCgroupSock, KmeshSockops etc.
f004f28 to
ce665c0
Compare
ce665c0 to
7193230
Compare
Signed-off-by: 072020127 <mhy200253@gmail.com>
|
Defer to @nlgwcy to have a final look |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LiZhenCheng9527 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |

What type of PR is this?
/kind documentation
What this PR does / why we need it:
Introduced the implementation of the traffic uninterrupted startup logic of Kmesh-daemon
Which issue(s) this PR fixes:
Fixes #1409
Does this PR introduce a user-facing change?: