Feat: support no traffic disruption during kmesh upgrade#1503
Feat: support no traffic disruption during kmesh upgrade#1503kmesh-bot merged 24 commits intokmesh-net:mainfrom
Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @072020127, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant feature enabling seamless, traffic-undisrupted upgrades for the kmesh-daemon. It achieves this by persisting BPF map specifications, comparing them during daemon restarts, and intelligently migrating map data to new versions. This ensures that ongoing network traffic is not interrupted when the daemon is updated.
Highlights
- Persistent BPF MapSpecs: Implemented logic for storing and reading persistent BPF MapSpecs, which are crucial for maintaining state across kmesh-daemon upgrades.
- MapSpec Comparison Logic: Added retrieval of the current MapSpec and implemented robust comparison logic for BPF Map structures to detect changes between versions.
- Seamless Daemon Upgrades: Introduced an 'Update' mode for kmesh-daemon, allowing BPF programs and maps to be updated without traffic disruption by migrating existing map data.
- Comprehensive Testing: Included new unit tests for MapSpec loading and comparison, as well as an end-to-end test to validate the traffic-undisrupted upgrade process.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
BPF maps persist, Upgrades flow, no traffic lost, Code ensures smooth ride.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces a significant feature for non-disruptive upgrades of the kmesh-daemon. The core logic involves persisting BPF map specifications (MapSpecs), comparing them across versions, and migrating maps when their structure changes. The implementation adds new files for handling BPF updates and loading MapSpecs, along with extensive unit and e2e tests.
My review focuses on the new update logic. I've found a few critical issues related to error handling that could lead to panics, and a bug where changes in struct bitfield sizes are not detected, potentially causing data corruption. I've also identified areas with unhandled errors and code duplication that should be addressed to improve robustness and maintainability. The suggestions provided aim to fix these issues and refactor the code for better clarity.
pkg/bpf/restart/bpf_update.go
Outdated
| if map_old.Offset != uint32(map_new.Offset) { | ||
| log.Info(map_old.BitfieldSize, uint32(map_new.BitfieldSize)) | ||
| diff.OffsetChanged = true | ||
| break | ||
| } |
There was a problem hiding this comment.
The BitfieldSize of a struct member is not being compared. This is a critical bug, as a change in bitfield size alters the struct layout and should trigger a map migration to avoid data corruption. The existing log.Info call should be replaced with a proper comparison.
if map_old.Offset != uint32(map_new.Offset) || map_old.BitfieldSize != uint32(map_new.BitfieldSize) {
diff.OffsetChanged = true
break
}
pkg/bpf/bpf.go
Outdated
| mapspec, _ := restart.LoadCompileTimeSpecs(config) | ||
| restart.SnapshotSpecsByPkg(mapspec) |
There was a problem hiding this comment.
The error returned by restart.LoadCompileTimeSpecs is ignored. This could lead to restart.SnapshotSpecsByPkg being called with a nil mapspec, which might cause a panic. The error should be checked and handled to prevent this.
| mapspec, _ := restart.LoadCompileTimeSpecs(config) | |
| restart.SnapshotSpecsByPkg(mapspec) | |
| mapspec, err := restart.LoadCompileTimeSpecs(config) | |
| if err != nil { | |
| log.Errorf("failed to load compile time specs: %v", err) | |
| return nil | |
| } | |
| restart.SnapshotSpecsByPkg(mapspec) |
pkg/bpf/restart/bpf_update.go
Outdated
| specsByPkg, err := LoadCompileTimeSpecs(config) | ||
| if specsByPkg == nil { | ||
| log.Errorf("load oldSpecsByPkg failed") | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The error returned by LoadCompileTimeSpecs is not handled. If an error occurs, specsByPkg could be nil and the function might proceed, potentially causing a panic later. It's crucial to check and handle this error. Also, the log message on line 68 refers to oldSpecsByPkg which seems to be a typo.
| specsByPkg, err := LoadCompileTimeSpecs(config) | |
| if specsByPkg == nil { | |
| log.Errorf("load oldSpecsByPkg failed") | |
| return nil | |
| } | |
| specsByPkg, err := LoadCompileTimeSpecs(config) | |
| if err != nil { | |
| log.Errorf("load compile-time map specs failed: %v", err) | |
| return nil | |
| } |
| func migrateMap( | ||
| oldMapSpec *PersistedMapSpec, | ||
| newMapSpec *ebpf.MapSpec, | ||
| pkgName, mapName, pinPath string, | ||
| ) (*ebpf.Map, error) { | ||
| if oldMapSpec == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| if oldMapSpec.Type != newMapSpec.Type.String() || | ||
| oldMapSpec.KeySize != newMapSpec.KeySize || | ||
| oldMapSpec.ValueSize != newMapSpec.ValueSize || | ||
| oldMapSpec.MaxEntries != newMapSpec.MaxEntries { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
|
|
||
| oldKeyStruct := oldMapSpec.KeyInfo | ||
| newKeyStruct, okNewKey := newMapSpec.Key.(*btf.Struct) | ||
| if okNewKey { | ||
| diff := diffStructInfoAgainstBTF(oldKeyStruct, newKeyStruct, make(map[string]bool)) | ||
| if diff.Added || diff.Removed || diff.TypeChanged || | ||
| diff.OffsetChanged || diff.NestedChanged { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| if newMapSpec.Key == nil { | ||
| if len(oldKeyStruct.Members) != 0 || oldKeyStruct.Name != "" { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| newKeyTypeName := newMapSpec.Key.TypeName() | ||
| if len(oldKeyStruct.Members) != 0 || newKeyTypeName != oldKeyStruct.Name { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| oldValueStruct := oldMapSpec.ValueInfo | ||
| newValueStruct, okNewValue := newMapSpec.Value.(*btf.Struct) | ||
| if okNewValue { | ||
| diff := diffStructInfoAgainstBTF(oldValueStruct, newValueStruct, make(map[string]bool)) | ||
| if diff.Added || diff.Removed || diff.TypeChanged || | ||
| diff.OffsetChanged || diff.NestedChanged { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| log.Info(diff.Added ,diff.Removed,diff.TypeChanged , | ||
| diff.OffsetChanged,diff.NestedChanged) | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| if newMapSpec.Value == nil { | ||
| if len(oldValueStruct.Members) != 0 || oldValueStruct.Name != "" { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } else { | ||
| newValTypeName := newMapSpec.Value.TypeName() | ||
| if len(oldValueStruct.Members) != 0 || newValTypeName != oldValueStruct.Name { | ||
| oldMap, err := ebpf.LoadPinnedMap(pinPath, &ebpf.LoadPinOptions{}) | ||
| if err == nil { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, oldMap) | ||
| } else { | ||
| return createEmptyMap(newMapSpec, pinPath, mapName, nil) | ||
| } | ||
| } | ||
| } | ||
| } | ||
| log.Info("pass all check") | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
The migrateMap function has high complexity and significant code duplication. The logic to decide whether to recreate a map is repeated in multiple branches. This makes the code hard to read and maintain. Consider refactoring this function to extract the map recreation logic into a helper function and simplify the conditional checks.
hzxuzhonghu
left a comment
There was a problem hiding this comment.
I am concerned this is a very big change, could introduce regression. Can you please add a feature flag, we can turn on until we have enough tests and documents
|
|
||
| // LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages. | ||
| // Returns a nested map: packageName to mapName to *ebpf.MapSpec | ||
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { |
There was a problem hiding this comment.
what do you mean by CompileTime
There was a problem hiding this comment.
My intention is to describe that these specs are fixed when the binary is compiled (via bpf2go), as opposed to the specs loaded from the persisted on-disk snapshot. Maybe LoadEmbeddedSpecs will be clearer to reflect its role?
| } else { | ||
| specs["KmeshCgroupSockWorkload"] = coll.Maps | ||
| } | ||
| if coll, err := dualengine.LoadKmeshCgroupSockWorkloadCompat(); err != nil { |
There was a problem hiding this comment.
we should only call either LoadKmeshCgroupSockWorkload or LoadKmeshCgroupSockWorkloadCompat
please check
if helper.KernelVersionLowerThan5_13() {
spec, err = bpf2go.LoadKmeshSockopsCompat()
} else {
spec, err = bpf2go.LoadKmeshSockops()
}
|
|
||
| // LoadCompileTimeSpecs loads all compile-time MapSpecs from bpf2go-generated packages. | ||
| // Returns a nested map: packageName to mapName to *ebpf.MapSpec | ||
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { |
There was a problem hiding this comment.
I am wondering how we can not miss one map, even after we add a new map later.
There was a problem hiding this comment.
I wrote a go file for gen-bpf_stpecs.go in the/back directory, which can automatically generate LoadCompileTimeSpecs by analyzing the bpf2Go.go file. If there are new maps or progs in the future, it can be easily kept synchronized.
| func LoadCompileTimeSpecs(config *options.BpfConfig) (map[string]map[string]*ebpf.MapSpec, error) { | ||
| specs := make(map[string]map[string]*ebpf.MapSpec) | ||
|
|
||
| if config.KernelNativeEnabled() { |
There was a problem hiding this comment.
With the tag //go:build enhanced, it is for kernel native mode, so we donot need the below else branch
pkg/bpf/bpf.go
Outdated
| return nil | ||
| } | ||
|
|
||
| mapspec, err := restart.LoadCompileTimeSpecs(config) |
There was a problem hiding this comment.
| mapspec, err := restart.LoadCompileTimeSpecs(config) | |
| mapSpecs, err := restart.LoadCompileTimeSpecs(config) |
| // It will migrate any BPF maps whose on‑disk pin already exists but whose | ||
| // compiled MapSpec has changed | ||
| func UpdateMapHandler(versionMap *ebpf.Map, kmBpfPath string, config *options.BpfConfig) *ebpf.Map { | ||
| persistedSpecs, err := LoadPersistedSnapshot() |
There was a problem hiding this comment.
persistedSpecs can be nil, which cause panic later, please check all
pkg/bpf/restart/bpf_update.go
Outdated
| log.Warnf("failed to remove old map pinpath: %v (continuing)", err) | ||
| } | ||
| case hasNew && !hasOld: | ||
| if _, err := migrateMap(nil, newSpec, pkgName, mapName, pinPath); err != nil { |
There was a problem hiding this comment.
Would prefer directly use call the map create function
| return versionMap | ||
| } | ||
|
|
||
| func updateVersionInfo(versionMap *ebpf.Map) { |
There was a problem hiding this comment.
we can share with that one storeVersionInfo in pkg/bpf/bpf.go
There was a problem hiding this comment.
I need to import package restart in pkg/bpf/bpf.go. If i call storeVersionInfo in pkg/bpf/bpf.go, this will result in recursive import.
test/e2e/upgarde_test.go
Outdated
| } | ||
|
|
||
| // patchKmesh applies a strategic merge patch to the Kmesh DaemonSet and waits for rollout completion. | ||
| func patchKmesh_upgarde(t framework.TestContext, patchData string) { |
| func upgradeKmesh(t framework.TestContext) { | ||
| newImage := os.Getenv("KMESH_UPGRADE_IMAGE") | ||
| if newImage == "" { | ||
| newImage = "localhost:5000/kmesh" |
There was a problem hiding this comment.
isnot this same image with old daemonset?
There was a problem hiding this comment.
I attached a new image which is build in run_test.sh set_daemonupgarde_testcase_image() to KMESH_UPGRADE_IMAGE. But E2e test istio 1.23~1.25 report no space left in device. I think the reason is there are two images existing ((localhost:5000/kmesh) and localhost:5000/kmesh/test-upgrade-map-change)
a21a29c to
aee4427
Compare
ba67a19 to
5ffbc87
Compare
|
We update configuration of e2e to fix this errors. Please rebase and then resubmit the code. |
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
|
lgtm please fix the ci |
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
|
The CI is frequently getting stuck on the eBPF unit test section. However, the code in this PR does not modify any eBPF-related modules, and the eBPF unit tests are able to pass locally. I have re-tested 4 times. In those 4 tests, the eBPF unit tests ran for 1 hour twice, and reported a "no space left" error the other two times. Therefore, I think it may be a problem with the CI environment |
|
This issue was caused by using the Ubuntu 22.04 image version in CI. However, it has been resolved by upgrading the Ubuntu image version. So maybe you update the Ubuntu image in CI can solve this issue |
Signed-off-by: 072020127 <mhy200253@gmail.com>
Signed-off-by: 072020127 <mhy200253@gmail.com>
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: LiZhenCheng9527 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |

What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1409
Special notes for your reviewer:
Does this PR introduce a user-facing change?: